From mboxrd@z Thu Jan 1 00:00:00 1970
From: Michal Hocko
Subject: Re: memory-cgroup bug
Date: Thu, 22 Nov 2012 16:24:41 +0100
Message-ID: <20121122152441.GA9609@dhcp22.suse.cz>
References: <20121121200207.01068046@pobox.sk>
Mime-Version: 1.0
Return-path:
Content-Disposition: attachment
In-Reply-To: <20121121200207.01068046@pobox.sk>
Sender: owner-linux-mm@kvack.org
List-ID:
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: azurIt
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist

On Wed 21-11-12 20:02:07, azurIt wrote:
> Hi,
>
> i'm using memory cgroup for limiting our users and having a really
> strange problem when a cgroup gets out of its memory limit. It's very
> strange because it happens only sometimes (about once per week on
> random user), out of memory is usually handled ok.

What is your memcg configuration? Do you use deeper hierarchies, is
use_hierarchy enabled? Is the memcg oom (aka memory.oom_control)
enabled? Do you use soft limit for those groups? Is memcg swap
accounting enabled and memsw limits in place?
Is the machine under global memory pressure as well?
Could you post sysrq+t or sysrq+w?

> This happens when problem occurs:
> - no new processes can be started for this cgroup
> - current processes are frozen and taking 100% of CPU
> - when i try to 'strace' any of current processes, the whole strace
>   freezes until process is killed (strace cannot be terminated by
>   CTRL-c)
> - problem can be resolved by raising memory limit for cgroup or
>   killing of few processes inside cgroup so some memory is freed
>
> I also grabbed the content of /proc//stack of frozen process:
> [] mem_cgroup_handle_oom+0x241/0x3b0
> [] T.1146+0x5ab/0x5c0

Hmm what is this?
> [] mem_cgroup_charge_common+0x56/0xa0
> [] mem_cgroup_newpage_charge+0x45/0x50
> [] do_wp_page+0x14e/0x800
> [] handle_pte_fault+0x264/0x940
> [] handle_mm_fault+0x138/0x260
> [] do_page_fault+0x13d/0x460
> [] page_fault+0x1f/0x30
> [] 0xffffffffffffffff
>

How many tasks are hung in mem_cgroup_handle_oom? If there were many
of them then it'd smell like an issue fixed by 79dfdaccd1d5 (memcg: make
oom_lock 0 and 1 based rather than counter) and its follow up fix
23751be00940 (memcg: fix hierarchical oom locking) but you are saying
that you can reproduce with 3.2 and those went in for 3.1. 2.6.32 would
make more sense.

> I'm currently using kernel 3.2.34 but i'm having this problem since 2.6.32.

I guess this is a clean vanilla (stable) kernel, right? Are you able to
reproduce with the latest Linus tree?

--
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From: "azurIt"
Subject: Re: memory-cgroup bug
Date: Thu, 22 Nov 2012 19:05:26 +0100
Message-ID: <20121122190526.390C7A28@pobox.sk>
In-Reply-To: <20121122152441.GA9609@dhcp22.suse.cz>
To: Michal Hocko
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist

>> i'm using memory cgroup for limiting our users and having a really
>> strange problem when a cgroup gets out of its memory limit. It's very
>> strange because it happens only sometimes (about once per week on
>> random user), out of memory is usually handled ok.
>
>What is your memcg configuration? Do you use deeper hierarchies, is
>use_hierarchy enabled?
>Is the memcg oom (aka memory.oom_control)
>enabled? Do you use soft limit for those groups? Is memcg swap
>accounting enabled and memsw limits in place?
>Is the machine under global memory pressure as well?
>Could you post sysrq+t or sysrq+w?

My cgroups hierarchy:
/cgroups//uid/

where '' is system user id and 'uid' is just word 'uid'.

Memory limits are set in /cgroups// and hierarchy is enabled. Processes are inside /cgroups//uid/ . I'm using hard limits for memory and swap BUT system has no swap at all (it has 'only' 16 GB of real RAM). memory.oom_control is set to 'oom_kill_disable 0'. Server has enough free memory when problem occurs.

>> This happens when problem occurs:
>> - no new processes can be started for this cgroup
>> - current processes are frozen and taking 100% of CPU
>> - when i try to 'strace' any of current processes, the whole strace
>>   freezes until process is killed (strace cannot be terminated by
>>   CTRL-c)
>> - problem can be resolved by raising memory limit for cgroup or
>>   killing of few processes inside cgroup so some memory is freed
>>
>> I also grabbed the content of /proc//stack of frozen process:
>> [] mem_cgroup_handle_oom+0x241/0x3b0
>> [] T.1146+0x5ab/0x5c0
>
>Hmm what is this?

I really don't know, i will get stack of all frozen processes next time so we can compare it.

>> [] mem_cgroup_charge_common+0x56/0xa0
>> [] mem_cgroup_newpage_charge+0x45/0x50
>> [] do_wp_page+0x14e/0x800
>> [] handle_pte_fault+0x264/0x940
>> [] handle_mm_fault+0x138/0x260
>> [] do_page_fault+0x13d/0x460
>> [] page_fault+0x1f/0x30
>> [] 0xffffffffffffffff
>>
>
>How many tasks are hung in mem_cgroup_handle_oom? If there were many
>of them then it'd smell like an issue fixed by 79dfdaccd1d5 (memcg:
>make oom_lock 0 and 1 based rather than counter) and its follow up fix
>23751be00940 (memcg: fix hierarchical oom locking) but you are saying
>that you can reproduce with 3.2 and those went in for 3.1. 2.6.32 would
>make more sense.
Usually maximum of several 10s of processes but i will check it next time. I was having much worse problems in 2.6.32 - when freezing happened, the whole server was affected (i wasn't able to do anything and needed to wait until my scripts took care of it and killed apache, so i don't have any detailed info). In 3.2 only target cgroup is affected.

>> I'm currently using kernel 3.2.34 but i'm having this problem since 2.6.32.
>
>I guess this is a clean vanilla (stable) kernel, right? Are you able to
>reproduce with the latest Linus tree?

Well, no. I'm using, for example, newest stable grsecurity patch. I'm also using few of Andrea Righi's cgroup subsystems but i don't believe these are causing problems:
- cgroup-uid which is moving processes into cgroups based on UID
- cgroup-task which can limit number of tasks in cgroup (i already tried to disable this one, it didn't help)
http://www.develer.com/~arighi/linux/patches/

Unfortunately i cannot just install new and untested kernel version cos i'm not able to reproduce this problem anytime (it's happening randomly in production environment).

Could it be that OOM cannot start and kill processes because there's no free memory in cgroup?

Thank you!

azur

From: Michal Hocko
Subject: Re: memory-cgroup bug
Date: Thu, 22 Nov 2012 22:42:52 +0100
Message-ID: <20121122214249.GA20319@dhcp22.suse.cz>
In-Reply-To: <20121122190526.390C7A28@pobox.sk>
To: azurIt
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist

On Thu 22-11-12 19:05:26, azurIt wrote:
[...]
> My cgroups hierarchy:
> /cgroups//uid/
>
> where '' is system user id and 'uid' is just word 'uid'.
>
> Memory limits are set in /cgroups// and hierarchy is
> enabled. Processes are inside /cgroups//uid/ . I'm using
> hard limits for memory and swap BUT system has no swap at all
> (it has 'only' 16 GB of real RAM). memory.oom_control is set to
> 'oom_kill_disable 0'. Server has enough free memory when problem
> occurs.

OK, so the global reclaim shouldn't be active. This is definitely good
to know.
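The configuration questions in this exchange (hierarchy, OOM control, hard limits, failure counter) correspond to control files in a cgroup v1 memory directory. The following is only a minimal sketch of how those answers could be collected in one go. It assumes the cgroup v1 file names memory.use_hierarchy, memory.oom_control, memory.limit_in_bytes, memory.memsw.limit_in_bytes and memory.failcnt; the helper names read_memcg_config and parse_oom_control are hypothetical and do not come from the thread.

```python
import os

# cgroup v1 memory controller files relevant to the questions in this thread.
MEMCG_FILES = (
    "memory.use_hierarchy",
    "memory.oom_control",
    "memory.limit_in_bytes",
    "memory.memsw.limit_in_bytes",
    "memory.failcnt",
)


def read_memcg_config(cgroup_dir):
    """Return {file name: stripped content, or None if the file is unreadable}."""
    config = {}
    for name in MEMCG_FILES:
        path = os.path.join(cgroup_dir, name)
        try:
            with open(path) as handle:
                config[name] = handle.read().strip()
        except OSError:
            # For example, the memsw files are absent without swap accounting.
            config[name] = None
    return config


def parse_oom_control(text):
    """Turn the 'name value' lines of memory.oom_control into a dict of ints."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            result[parts[0]] = int(parts[1])
    return result
```

Calling read_memcg_config on the user's cgroup directory and passing the memory.oom_control entry to parse_oom_control would give oom_kill_disable and under_oom as integers, which is convenient when sampling the group repeatedly.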
> >> This happens when problem occurs:
> >> - no new processes can be started for this cgroup
> >> - current processes are frozen and taking 100% of CPU
> >> - when i try to 'strace' any of current processes, the whole strace
> >>   freezes until process is killed (strace cannot be terminated by
> >>   CTRL-c)
> >> - problem can be resolved by raising memory limit for cgroup or
> >>   killing of few processes inside cgroup so some memory is freed
> >>
> >> I also grabbed the content of /proc//stack of frozen process:
> >> [] mem_cgroup_handle_oom+0x241/0x3b0
> >> [] T.1146+0x5ab/0x5c0
> >
> >Hmm what is this?
>
> I really don't know, i will get stack of all frozen processes next
> time so we can compare it.
>
> >> [] mem_cgroup_charge_common+0x56/0xa0
> >> [] mem_cgroup_newpage_charge+0x45/0x50
> >> [] do_wp_page+0x14e/0x800
> >> [] handle_pte_fault+0x264/0x940
> >> [] handle_mm_fault+0x138/0x260
> >> [] do_page_fault+0x13d/0x460
> >> [] page_fault+0x1f/0x30
> >> [] 0xffffffffffffffff

Btw. is this stack stable or is the task bouncing in some loop?

And finally could you post the disassembly of your version of
mem_cgroup_handle_oom, please?

> >How many tasks are hung in mem_cgroup_handle_oom? If there were many
> >of them then it'd smell like an issue fixed by 79dfdaccd1d5 (memcg:
> >make oom_lock 0 and 1 based rather than counter) and its follow up fix
> >23751be00940 (memcg: fix hierarchical oom locking) but you are saying
> >that you can reproduce with 3.2 and those went in for 3.1. 2.6.32 would
> >make more sense.
>
>
> Usually maximum of several 10s of processes but i will check it next
> time. I was having much worse problems in 2.6.32 - when freezing
> happened, the whole server was affected (i wasn't able to do anything
> and needed to wait until my scripts took care of it and killed apache,
> so i don't have any detailed info).

Hmm, maybe the issue fixed by 1d65f86d (mm: preallocate page before
lock_page() at filemap COW) which was merged in 3.1.
> In 3.2 only target cgroup is affected.
>
> >> I'm currently using kernel 3.2.34 but i'm having this problem since 2.6.32.
> >
> >I guess this is a clean vanilla (stable) kernel, right? Are you able to
> >reproduce with the latest Linus tree?
>
>
> Well, no. I'm using, for example, newest stable grsecurity patch.

That shouldn't be related.

> I'm also using few of Andrea Righi's cgroup subsystems but i don't
> believe these are causing problems:
> - cgroup-uid which is moving processes into cgroups based on UID
> - cgroup-task which can limit number of tasks in cgroup (i already
>   tried to disable this one, it didn't help)
> http://www.develer.com/~arighi/linux/patches/

I am not familiar with those patches but I will double check.

> Unfortunately i cannot just install new and untested kernel version
> cos i'm not able to reproduce this problem anytime (it's happening
> randomly in production environment).

This will make it a bit harder to debug but let's see, maybe the new
traces would help...

> Could it be that OOM cannot start and kill processes because there's
> no free memory in cgroup?

That shouldn't happen.
--
Michal Hocko
SUSE Labs

From: "azurIt"
Subject: Re: memory-cgroup bug
Date: Thu, 22 Nov 2012 23:34:34 +0100
Message-ID: <20121122233434.3D5E35E6@pobox.sk>
In-Reply-To: <20121122214249.GA20319@dhcp22.suse.cz>
To: Michal Hocko
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist

>Btw. is this stack stable or is the task bouncing in some loop?

Not sure, will check it next time.

>And finally could you post the disassembly of your version of
>mem_cgroup_handle_oom, please?

How can i do this?

>What does your kernel log says while this is happening. Are there any
>memcg OOM messages showing up?

I will get the logs next time.

Thank you!

azur

From: Michal Hocko
Subject: Re: memory-cgroup bug
Date: Fri, 23 Nov 2012 08:40:23 +0100
Message-ID: <20121123074023.GA24698@dhcp22.suse.cz>
In-Reply-To: <20121122233434.3D5E35E6-Rm0zKEqwvD4@public.gmane.org>
To: azurIt
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist

On Thu 22-11-12 23:34:34, azurIt wrote:
[...]
> >And finally could you post the disassembly of your version of
> >mem_cgroup_handle_oom, please?
>
> How can i do this?

Either use gdb YOUR_VMLINUX and disassemble mem_cgroup_handle_oom or
use objdump -d YOUR_VMLINUX and copy out only mem_cgroup_handle_oom
function.
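The "copy out only mem_cgroup_handle_oom" step can be done mechanically on the text that objdump -d prints. What follows is a rough sketch rather than a canonical tool: it assumes the usual objdump layout in which each function starts with a header line of the form "ADDRESS <symbol>:", and the helper name extract_function is made up for illustration.

```python
import re

# Header line that objdump -d prints at the start of each function, e.g.
# "ffffffff8110a0b0 <mem_cgroup_handle_oom>:" (the address here is invented).
_HEADER = re.compile(r"^[0-9a-fA-F]+ <([^>]+)>:\s*$")


def extract_function(disassembly, symbol):
    """Return the non-blank lines of objdump -d output that belong to symbol."""
    lines = []
    inside = False
    for line in disassembly.splitlines():
        match = _HEADER.match(line)
        if match:
            if inside:
                break  # the next function begins, so the wanted one is complete
            inside = match.group(1) == symbol
        if inside and line.strip():
            lines.append(line)
    return lines
```

The input would be the captured output of objdump -d run on the uncompressed vmlinux ELF image. If the image has no symbol table, or the compiler emitted no separate copy of the function, there is no matching header line and the result is an empty list.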
--
Michal Hocko
SUSE Labs

From: "azurIt"
Subject: Re: memory-cgroup bug
Date: Fri, 23 Nov 2012 10:21:37 +0100
Message-ID: <20121123102137.10D6D653@pobox.sk>
In-Reply-To: <20121123074023.GA24698-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
To: Michal Hocko
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist

>Either use gdb YOUR_VMLINUX and disassemble mem_cgroup_handle_oom or
>use objdump -d YOUR_VMLINUX and copy out only mem_cgroup_handle_oom
>function.

If 'YOUR_VMLINUX' is supposed to be my kernel image:

# gdb vmlinuz-3.2.34-grsec-1
GNU gdb (GDB) 7.0.1-debian
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
...
"/root/bug/vmlinuz-3.2.34-grsec-1": not in executable format: File format not recognized

# objdump -d vmlinuz-3.2.34-grsec-1
objdump: vmlinuz-3.2.34-grsec-1: File format not recognized

# file vmlinuz-3.2.34-grsec-1
vmlinuz-3.2.34-grsec-1: Linux kernel x86 boot executable bzImage, version 3.2.34-grsec (root@server01) #1, RO-rootFS, swap_dev 0x3, Normal VGA

I'm probably doing something wrong :)

It, luckily, happened again so i have more info.
- there weren't any logs in kernel from OOM for that cgroup
- there were 16 processes in cgroup
- processes in cgroup were taking together 100% of CPU (it was allowed to use only one core, so 100% of that core)
- memory.failcnt was growing fast
- oom_control:
  oom_kill_disable 0
  under_oom 0 (this was looping from 0 to 1)
- limit_in_bytes was set to 157286400
- content of stat (as you can see, the whole memory limit was used):
  cache 0
  rss 0
  mapped_file 0
  pgpgin 0
  pgpgout 0
  swap 0
  pgfault 0
  pgmajfault 0
  inactive_anon 0
  active_anon 0
  inactive_file 0
  active_file 0
  unevictable 0
  hierarchical_memory_limit 157286400
  hierarchical_memsw_limit 157286400
  total_cache 0
  total_rss 157286400
  total_mapped_file 0
  total_pgpgin 10326454
  total_pgpgout 10288054
  total_swap 0
  total_pgfault 12939677
  total_pgmajfault 4283
  total_inactive_anon 0
  total_active_anon 157286400
  total_inactive_file 0
  total_active_file 0
  total_unevictable 0

i also grabbed oom_adj, oom_score_adj and stack of all processes, here it is:
http://www.watchdog.sk/lkml/memcg-bug.tar

Notice that stack is different for few processes. Stacks for all processes were NOT changing and were still the same.

Btw, don't know if it matters but i have several cgroup subsystems mounted and i'm also using them (i was not activating freezer in this case, don't know if it can be active automatically by kernel or what, didn't check if cgroup was frozen but i suppose it wasn't):
none /cgroups cgroup defaults,cpuacct,cpuset,memory,freezer,task,blkio 0 0

Thank you.
azur

From: Michal Hocko
Subject: Re: memory-cgroup bug
Date: Fri, 23 Nov 2012 10:28:29 +0100
Message-ID: <20121123092829.GE24698@dhcp22.suse.cz>
In-Reply-To: <20121123102137.10D6D653@pobox.sk>
To: azurIt
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist

On Fri 23-11-12 10:21:37, azurIt wrote:
> >Either use gdb YOUR_VMLINUX and disassemble mem_cgroup_handle_oom or
> >use objdump -d YOUR_VMLINUX and copy out only mem_cgroup_handle_oom
> >function.
> If 'YOUR_VMLINUX' is supposed to be my kernel image:
>
> # gdb vmlinuz-3.2.34-grsec-1
> GNU gdb (GDB) 7.0.1-debian
> Copyright (C) 2009 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law. Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-linux-gnu".
> For bug reporting instructions, please see:
> ...
> "/root/bug/vmlinuz-3.2.34-grsec-1": not in executable format: File format not recognized
>
>
> # objdump -d vmlinuz-3.2.34-grsec-1

You need vmlinux not vmlinuz...
--
Michal Hocko
SUSE Labs

From: Glauber Costa
Subject: Re: memory-cgroup bug
Date: Fri, 23 Nov 2012 13:34:59 +0400
Message-ID: <50AF4343.6070002@parallels.com>
In-Reply-To: <20121123102137.10D6D653@pobox.sk>
To: azurIt
Cc: Michal Hocko, linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist

On 11/23/2012 01:21 PM, azurIt wrote:
>> Either use gdb YOUR_VMLINUX and disassemble mem_cgroup_handle_oom or
>> use objdump -d YOUR_VMLINUX and copy out only mem_cgroup_handle_oom
>> function.
> If 'YOUR_VMLINUX' is supposed to be my kernel image:
>
> # gdb vmlinuz-3.2.34-grsec-1

this is vmlinuz, not vmlinux. This is the compressed image.
>
> # file vmlinuz-3.2.34-grsec-1
> vmlinuz-3.2.34-grsec-1: Linux kernel x86 boot executable bzImage, version 3.2.34-grsec (root@server01) #1, RO-rootFS, swap_dev 0x3, Normal VGA
>
> I'm probably doing something wrong :)

You need this:

[glauber@straightjacket linux-glommer]$ file vmlinux
vmlinux: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, BuildID[sha1]=0xba936ee6b6096f9bc4c663f2a2ee0c2d2481c408, not stripped

instead of bzImage.

From: "azurIt"
Subject: Re: memory-cgroup bug
Date: Fri, 23 Nov 2012 10:44:23 +0100
Message-ID: <20121123104423.338C7725@pobox.sk>
In-Reply-To: <20121123092829.GE24698-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
To: Michal Hocko
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist

>On Fri 23-11-12 10:21:37, azurIt wrote:
>> >Either use gdb YOUR_VMLINUX and disassemble mem_cgroup_handle_oom or
>> >use objdump -d YOUR_VMLINUX and copy out only mem_cgroup_handle_oom
>> >function.
>> If 'YOUR_VMLINUX' is supposed to be my kernel image:
>>
>> # gdb vmlinuz-3.2.34-grsec-1
>> GNU gdb (GDB) 7.0.1-debian
>> Copyright (C) 2009 Free Software Foundation, Inc.
>> License GPLv3+: GNU GPL version 3 or later
>> This is free software: you are free to change and redistribute it.
>> There is NO WARRANTY, to the extent permitted by law. Type "show copying"
>> and "show warranty" for details.
>> This GDB was configured as "x86_64-linux-gnu".
>> For bug reporting instructions, please see:
>> ...
>> "/root/bug/vmlinuz-3.2.34-grsec-1": not in executable format: File format not recognized
>>
>>
>> # objdump -d vmlinuz-3.2.34-grsec-1
>
>You need vmlinux not vmlinuz...

ok, got it but still no luck:

# gdb vmlinux
GNU gdb (GDB) 7.0.1-debian
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
...
Reading symbols from /root/bug/dddddddd/vmlinux...(no debugging symbols found)...done.
(gdb) disassemble mem_cgroup_handle_oom
No symbol table is loaded. Use the "file" command.

# objdump -d vmlinux | grep mem_cgroup_handle_oom

i can recompile the kernel if anything needs to be added into it.
azur

From: Michal Hocko
Subject: Re: memory-cgroup bug
Date: Fri, 23 Nov 2012 11:04:38 +0100
Message-ID: <20121123100438.GF24698@dhcp22.suse.cz>
In-Reply-To: <20121123102137.10D6D653@pobox.sk>
To: azurIt
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist

On Fri 23-11-12 10:21:37, azurIt wrote:
[...]
> It, luckily, happened again so i have more info.
>
> - there weren't any logs in kernel from OOM for that cgroup
> - there were 16 processes in cgroup
> - processes in cgroup were taking together 100% of CPU (it
>   was allowed to use only one core, so 100% of that core)
> - memory.failcnt was growing fast
> - oom_control:
>   oom_kill_disable 0
>   under_oom 0 (this was looping from 0 to 1)

So there was an OOM going on but no messages in the log? Really strange.
Kame already asked about oom_score_adj of the processes in the group but
it didn't look like all the processes would have oom disabled, right?

> - limit_in_bytes was set to 157286400
> - content of stat (as you can see, the whole memory limit was used):
>   cache 0
>   rss 0

This looks like a top-level group for your user.

>   mapped_file 0
>   pgpgin 0
>   pgpgout 0
>   swap 0
>   pgfault 0
>   pgmajfault 0
>   inactive_anon 0
>   active_anon 0
>   inactive_file 0
>   active_file 0
>   unevictable 0
>   hierarchical_memory_limit 157286400
>   hierarchical_memsw_limit 157286400
>   total_cache 0
>   total_rss 157286400

OK, so all the memory is anonymous and you have no swap so the oom is
the only thing to do.

>   total_mapped_file 0
>   total_pgpgin 10326454
>   total_pgpgout 10288054
>   total_swap 0
>   total_pgfault 12939677
>   total_pgmajfault 4283
>   total_inactive_anon 0
>   total_active_anon 157286400
>   total_inactive_file 0
>   total_active_file 0
>   total_unevictable 0
>
>
> i also grabbed oom_adj, oom_score_adj and stack of all processes, here
> it is:
> http://www.watchdog.sk/lkml/memcg-bug.tar

Hmm, all processes waiting for oom are stuck at the very same place:
$ grep mem_cgroup_handle_oom -r [0-9]*
30858/stack:[] mem_cgroup_handle_oom+0x241/0x3b0
30859/stack:[] mem_cgroup_handle_oom+0x241/0x3b0
30860/stack:[] mem_cgroup_handle_oom+0x241/0x3b0
30892/stack:[] mem_cgroup_handle_oom+0x241/0x3b0
30898/stack:[] mem_cgroup_handle_oom+0x241/0x3b0
31588/stack:[] mem_cgroup_handle_oom+0x241/0x3b0
32044/stack:[] mem_cgroup_handle_oom+0x241/0x3b0
32358/stack:[] mem_cgroup_handle_oom+0x241/0x3b0
6031/stack:[] mem_cgroup_handle_oom+0x241/0x3b0
6534/stack:[] mem_cgroup_handle_oom+0x241/0x3b0
7020/stack:[] mem_cgroup_handle_oom+0x241/0x3b0

We are taking memcg_oom_lock spinlock twice in that function + we can
schedule. As none of the tasks is scheduled this would suggest that you
are blocked at the first lock. But who got the lock then?
This is really strange.
Btw. is sysrq+t resp.
sysrq+w showing the same traces as /proc//stat?

> Notice that stack is different for few processes.

Yes, others are in VFS resp. ext3. ext3_write_begin looks a bit dangerous
but it grabs the page before it really starts a transaction.

> Stacks for all processes were NOT changing and were still the same.

Could you take a few snapshots over time?

> Btw, don't know if it matters but i have several cgroup subsystems
> mounted and i'm also using them (i was not activating freezer in this
> case, don't know if it can be active automatically by kernel or what,

No

> didn't check if cgroup was frozen but i suppose it wasn't):
> none /cgroups cgroup defaults,cpuacct,cpuset,memory,freezer,task,blkio 0 0

Do you see the same issue if only the memory controller is mounted (resp.
cpuset, which you seem to use as well from your description)?

I know you said booting into a vanilla kernel would be problematic but
could you at least rule out the cgroup patches that you have mentioned?
If you need to move a task to a group based on a uid you can use
cgrules daemon (libcgroup1 package) for that as well.
--
Michal Hocko
SUSE Labs

From: Michal Hocko
Subject: Re: memory-cgroup bug
Date: Fri, 23 Nov 2012 11:10:34 +0100
Message-ID: <20121123101034.GG24698@dhcp22.suse.cz>
In-Reply-To: <20121123104423.338C7725-Rm0zKEqwvD4@public.gmane.org>
To: azurIt
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist

On Fri 23-11-12 10:44:23, azurIt wrote:
[...]
> # gdb vmlinux
> GNU gdb (GDB) 7.0.1-debian
> Copyright (C) 2009 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law. Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-linux-gnu".
> For bug reporting instructions, please see: > ... > Reading symbols from /root/bug/dddddddd/vmlinux...(no debugging symbols found)...done. > (gdb) disassemble mem_cgroup_handle_oom > No symbol table is loaded. Use the "file" command. > > > > # objdump -d vmlinux | grep mem_cgroup_handle_oom > Hmm, strange so the function is on the stack but it has been inlined? Doesn't make much sense to me. > i can recompile the kernel if anything needs to be added into it. If you could instrument mem_cgroup_handle_oom with some printks (before we take the memcg_oom_lock, before we schedule and into mem_cgroup_out_of_memory) -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: "azurIt" Subject: =?utf-8?q?Re=3A_memory=2Dcgroup_bug?= Date: Fri, 23 Nov 2012 15:59:04 +0100 Message-ID: <20121123155904.490039C5@pobox.sk> References: <20121121200207.01068046@pobox.sk>, <20121122152441.GA9609@dhcp22.suse.cz>, <20121122190526.390C7A28@pobox.sk>, <20121122214249.GA20319@dhcp22.suse.cz>, <20121122233434.3D5E35E6@pobox.sk>, <20121123074023.GA24698@dhcp22.suse.cz>, <20121123102137.10D6D653@pobox.sk> <20121123100438.GF24698@dhcp22.suse.cz> Mime-Version: 1.0 Content-Transfer-Encoding: 8bit Return-path: In-Reply-To: <20121123100438.GF24698-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" To: =?utf-8?q?Michal_Hocko?= Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, =?utf-8?q?cgroups_mailinglist?= >If you could instrument mem_cgroup_handle_oom with some printks (before >we take the memcg_oom_lock, before we schedule and into >mem_cgroup_out_of_memory) If you send me patch i can do it. I'm, unfortunately, not able to code it. >> It, luckily, happend again so i have more info. 
>> >> - there wasn't any logs in kernel from OOM for that cgroup >> - there were 16 processes in cgroup >> - processes in cgroup were taking togather 100% of CPU (it >> was allowed to use only one core, so 100% of that core) >> - memory.failcnt was groving fast >> - oom_control: >> oom_kill_disable 0 >> under_oom 0 (this was looping from 0 to 1) > >So there was an OOM going on but no messages in the log? Really strange. >Kame already asked about oom_score_adj of the processes in the group but >it didn't look like all the processes would have oom disabled, right? There were no messages telling that some processes were killed because of OOM. >> - limit_in_bytes was set to 157286400 >> - content of stat (as you can see, the whole memory limit was used): >> cache 0 >> rss 0 > >This looks like a top-level group for your user. Yes, it was from /cgroup// >> mapped_file 0 >> pgpgin 0 >> pgpgout 0 >> swap 0 >> pgfault 0 >> pgmajfault 0 >> inactive_anon 0 >> active_anon 0 >> inactive_file 0 >> active_file 0 >> unevictable 0 >> hierarchical_memory_limit 157286400 >> hierarchical_memsw_limit 157286400 >> total_cache 0 >> total_rss 157286400 > >OK, so all the memory is anonymous and you have no swap so the oom is >the only thing to do. What will happen if the same situation occurs globally? No swap, every bit of memory used. Will kernel be able to start OOM killer? Maybe the same thing is happening in cgroup - there's simply no space to run OOM killer. And maybe this is why it's happening rarely - usually there are still at least few KBs of memory left to start OOM killer. 
>Hmm, all processes waiting for oom are stuck at the very same place: >$ grep mem_cgroup_handle_oom -r [0-9]* >30858/stack:[] mem_cgroup_handle_oom+0x241/0x3b0 >30859/stack:[] mem_cgroup_handle_oom+0x241/0x3b0 >30860/stack:[] mem_cgroup_handle_oom+0x241/0x3b0 >30892/stack:[] mem_cgroup_handle_oom+0x241/0x3b0 >30898/stack:[] mem_cgroup_handle_oom+0x241/0x3b0 >31588/stack:[] mem_cgroup_handle_oom+0x241/0x3b0 >32044/stack:[] mem_cgroup_handle_oom+0x241/0x3b0 >32358/stack:[] mem_cgroup_handle_oom+0x241/0x3b0 >6031/stack:[] mem_cgroup_handle_oom+0x241/0x3b0 >6534/stack:[] mem_cgroup_handle_oom+0x241/0x3b0 >7020/stack:[] mem_cgroup_handle_oom+0x241/0x3b0 > >We are taking memcg_oom_lock spinlock twice in that function + we can >schedule. As none of the tasks is scheduled this would suggest that you >are blocked at the first lock. But who got the lock then? >This is really strange. >Btw. is sysrq+t resp. sysrq+w showing the same traces as >/proc//stat? Unfortunately i'm connecting remotely to the servers (SSH). >> Notice that stack is different for few processes. > >Yes others are in VFS resp ext3. ext3_write_begin looks a bit dangerous >but it grabs the page before it really starts a transaction. Maybe these processes were throttled by cgroup-blkio at the same time and are still keeping the lock? So the problem occurs when there are low on memory and cgroup is doing IO out of it's limits. Only guessing and telling my thoughts. >> Stack for all processes were NOT chaging and was still the same. > >Could you take few snapshots over time? Will do next time but i can't keep services freezed for a long time or customers will be angry. >> didn't checked if cgroup was freezed but i suppose it wasn't): >> none /cgroups cgroup defaults,cpuacct,cpuset,memory,freezer,task,blkio 0 0 > >Do you see the same issue if only memory controller was mounted (resp. >cpuset which you seem to use as well from your description). 
Uh, we are using all mounted subsystems :( I will be able to umount only freezer and maybe blkio for some time. Will it help? >I know you said booting into a vanilla kernel would be problematic but >could you at least rule out te cgroup patches that you have mentioned? >If you need to move a task to a group based by an uid you can use >cgrules daemon (libcgroup1 package) for that as well. We are using cgroup-uid cos it's MUCH MUCH MUCH more efective and better. For example, i don't believe that cgroup-task will work with that daemon. What will happen if cgrules won't be able to add process into cgroup because of task limit? Process will probably continue and will run outside of any cgroup which is wrong. With cgroup-task + cgroup-uid, such processes cannot be even started (and this is what we need). From mboxrd@z Thu Jan 1 00:00:00 1970 From: "azurIt" Subject: =?utf-8?q?Re=3A_memory=2Dcgroup_bug?= Date: Sun, 25 Nov 2012 01:10:47 +0100 Message-ID: <20121125011047.7477BB5E@pobox.sk> References: <20121121200207.01068046@pobox.sk>, <20121122152441.GA9609@dhcp22.suse.cz>, <20121122190526.390C7A28@pobox.sk>, <20121122214249.GA20319@dhcp22.suse.cz>, <20121122233434.3D5E35E6@pobox.sk>, <20121123074023.GA24698@dhcp22.suse.cz>, <20121123102137.10D6D653@pobox.sk> <20121123100438.GF24698@dhcp22.suse.cz> Mime-Version: 1.0 Content-Transfer-Encoding: 8bit Return-path: In-Reply-To: <20121123100438.GF24698-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" To: =?utf-8?q?Michal_Hocko?= Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, =?utf-8?q?cgroups_mailinglist?= >Could you take few snapshots over time? 
Here it is, now from different server, snapshot was taken every second for 10 minutes (hope it's enough): www.watchdog.sk/lkml/memcg-bug-2.tar.gz From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: memory-cgroup bug Date: Sun, 25 Nov 2012 11:17:07 +0100 Message-ID: <20121125101707.GA10623@dhcp22.suse.cz> References: <20121121200207.01068046@pobox.sk> <20121122152441.GA9609@dhcp22.suse.cz> <20121122190526.390C7A28@pobox.sk> <20121122214249.GA20319@dhcp22.suse.cz> <20121122233434.3D5E35E6@pobox.sk> <20121123074023.GA24698@dhcp22.suse.cz> <20121123102137.10D6D653@pobox.sk> <20121123100438.GF24698@dhcp22.suse.cz> <20121123155904.490039C5@pobox.sk> Mime-Version: 1.0 Return-path: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=sender:date:from:to:cc:subject:message-id:references:mime-version :content-type:content-disposition:in-reply-to:user-agent; bh=eNvewVOM2rwznXYR+2RZv1I7E/C+4EuMZBsxp/Ow36s=; b=E75F8VXFkL7ICTWH353Xm7geNb49eJYsVwmIgy62NZei8YSI6h6vL8wYUzMOC33/EJ oNFZ+nN/yOkpUBC9qYJfXs7k+jMHcZmmeBah2iT4OT4VgjD7+15ZoAVSHnl2COAf3eVT TjrwSXf5Mws44H8jbBzNRBMDFPTI9G+PwCSKUg+Hk/oA8WFKQkEFV2Dpz8JXGPK5INJo Yd8hjH1idkhRAybwI7BixYS8mVndV3cpaDFSxDJP2EakYbXuYd88WESarNRCYSYETUKN Z20Q2xs8oCJ9c3Tx0x+lm+RuBCMZUS3qBSqRVFDmAGgNefqHNmrus7CiglnEYfxfHV4m pf2g== Content-Disposition: inline In-Reply-To: <20121123155904.490039C5-Rm0zKEqwvD4@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: azurIt Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist On Fri 23-11-12 15:59:04, azurIt wrote: > >If you could instrument mem_cgroup_handle_oom with some printks (before > >we take the memcg_oom_lock, before we schedule and into > >mem_cgroup_out_of_memory) > > > If you send me patch i can do it. I'm, unfortunately, not able to code it. 
Inlined at the end of the email. Please note I have only compile tested
it. It might produce a lot of output.

> >> It, luckily, happend again so i have more info.
> >>
> >>  - there wasn't any logs in kernel from OOM for that cgroup
> >>  - there were 16 processes in cgroup
> >>  - processes in cgroup were taking togather 100% of CPU (it
> >>    was allowed to use only one core, so 100% of that core)
> >>  - memory.failcnt was groving fast
> >>  - oom_control:
> >> oom_kill_disable 0
> >> under_oom 0 (this was looping from 0 to 1)
> >
> >So there was an OOM going on but no messages in the log? Really strange.
> >Kame already asked about oom_score_adj of the processes in the group but
> >it didn't look like all the processes would have oom disabled, right?
>
>
> There were no messages telling that some processes were killed because of OOM.

dmesg | grep "Out of memory"
doesn't tell anything, right?

> >>  - limit_in_bytes was set to 157286400
> >>  - content of stat (as you can see, the whole memory limit was used):
> >> cache 0
> >> rss 0
> >
> >This looks like a top-level group for your user.
>
>
> Yes, it was from /cgroup//
>
>
> >> mapped_file 0
> >> pgpgin 0
> >> pgpgout 0
> >> swap 0
> >> pgfault 0
> >> pgmajfault 0
> >> inactive_anon 0
> >> active_anon 0
> >> inactive_file 0
> >> active_file 0
> >> unevictable 0
> >> hierarchical_memory_limit 157286400
> >> hierarchical_memsw_limit 157286400
> >> total_cache 0
> >> total_rss 157286400
> >
> >OK, so all the memory is anonymous and you have no swap so the oom is
> >the only thing to do.
>
>
> What will happen if the same situation occurs globally? No swap, every
> bit of memory used. Will kernel be able to start OOM killer?

OOM killer is not a task. It doesn't allocate any memory. It just walks
the process list and picks up a task with the highest score. If the
global oom is not able to find any such task (e.g. because all of them
have oom disabled) then the system panics.
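As a rough userspace illustration of the selection described above (walk
the tasks, keep the one with the highest score), the sketch below reads
/proc/<pid>/oom_score. It only approximates what the kernel does
internally. The helper name pick_oom_candidate and its arguments are made
up for this example, and treating a score of 0 as "not eligible" is an
assumption.

```shell
#!/bin/sh
# Hypothetical helper (not part of any tool mentioned in this thread).
# Prints "<pid> <score>" for the task with the highest oom_score found
# below PROC_ROOT (default: /proc). If TASKS_FILE is given, only pids
# listed in it (one per line, like a cgroup "tasks" file) are considered.
# Prints nothing and returns 1 when no task has a score above 0.
pick_oom_candidate() {
    proc_root=${1:-/proc}
    tasks_file=${2:-}
    best_pid=
    best_score=0
    for dir in "$proc_root"/[0-9]*; do
        # Skip entries without a readable score (also covers an unmatched glob).
        [ -r "$dir/oom_score" ] || continue
        pid=${dir##*/}
        # Restrict the walk to the group's tasks when a list was given.
        if [ -n "$tasks_file" ] && ! grep -qx "$pid" "$tasks_file"; then
            continue
        fi
        # The task may exit while we are walking the list.
        score=$(cat "$dir/oom_score" 2>/dev/null) || continue
        [ -n "$score" ] || continue
        # Assumption: score 0 means the task is not a candidate.
        if [ "$score" -gt "$best_score" ]; then
            best_score=$score
            best_pid=$pid
        fi
    done
    [ -n "$best_pid" ] || return 1
    echo "$best_pid $best_score"
}
```

For a memcg, the group's tasks file can be passed as the second argument,
e.g. pick_oom_candidate /proc /cgroups/GROUP/tasks (GROUP is a
placeholder). An empty result corresponds to the "nothing to kill" case.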
> Maybe the same thing is happening in cgroup cgroup oom differs only in that aspect that the system doesn't panic if there is no suitable task to kill. [...] > >> Notice that stack is different for few processes. > > > >Yes others are in VFS resp ext3. ext3_write_begin looks a bit dangerous > >but it grabs the page before it really starts a transaction. > > > Maybe these processes were throttled by cgroup-blkio at the same time > and are still keeping the lock? If you are thinking about memcg_oom_lock then this is not possible because the lock is held only for short times. There is no other lock that memcg oom holds. > So the problem occurs when there are low on memory and cgroup is doing > IO out of it's limits. Only guessing and telling my thoughts. The lockup (if this is what happens) still might be related to the IO controller if the killed task cannot finish due to pending IO, though. [...] > >> didn't checked if cgroup was freezed but i suppose it wasn't): > >> none /cgroups cgroup defaults,cpuacct,cpuset,memory,freezer,task,blkio 0 0 > > > >Do you see the same issue if only memory controller was mounted (resp. > >cpuset which you seem to use as well from your description). > > > Uh, we are using all mounted subsystems :( I will be able to umount > only freezer and maybe blkio for some time. Will it help? Not sure about that without further data. > >I know you said booting into a vanilla kernel would be problematic but > >could you at least rule out te cgroup patches that you have mentioned? > >If you need to move a task to a group based by an uid you can use > >cgrules daemon (libcgroup1 package) for that as well. > > > We are using cgroup-uid cos it's MUCH MUCH MUCH more efective and > better. For example, i don't believe that cgroup-task will work with > that daemon. What will happen if cgrules won't be able to add process > into cgroup because of task limit? Process will probably continue and > will run outside of any cgroup which is wrong. 
With cgroup-task +
> cgroup-uid, such processes cannot be even started (and this is what we
> need).

I am not familiar with cgroup-task controller so I cannot comment on
that.
---
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c8425b1..7f26ec8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1863,6 +1863,7 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
 {
 	struct oom_wait_info owait;
 	bool locked, need_to_kill;
+	int ret = false;
 
 	owait.mem = memcg;
 	owait.wait.flags = 0;
@@ -1873,6 +1874,7 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
 	mem_cgroup_mark_under_oom(memcg);
 
 	/* At first, try to OOM lock hierarchy under memcg.*/
+	printk("XXX: %d waiting for memcg_oom_lock\n", current->pid);
 	spin_lock(&memcg_oom_lock);
 	locked = mem_cgroup_oom_lock(memcg);
 	/*
@@ -1887,12 +1889,14 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
 		mem_cgroup_oom_notify(memcg);
 	spin_unlock(&memcg_oom_lock);
 
+	printk("XXX: %d need_to_kill:%d locked:%d\n", current->pid, need_to_kill, locked);
 	if (need_to_kill) {
 		finish_wait(&memcg_oom_waitq, &owait.wait);
 		mem_cgroup_out_of_memory(memcg, mask);
 	} else {
 		schedule();
 		finish_wait(&memcg_oom_waitq, &owait.wait);
+		printk("XXX: %d woken up\n", current->pid);
 	}
 	spin_lock(&memcg_oom_lock);
 	if (locked)
@@ -1903,10 +1907,13 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
 	mem_cgroup_unmark_under_oom(memcg);
 
 	if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
-		return false;
+		goto out;
 	/* Give chance to dying process */
 	schedule_timeout_uninterruptible(1);
-	return true;
+	ret = true;
+out:
+	printk("XXX: %d done with %d\n", current->pid, ret);
+	return ret;
 }
 
 /*
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 069b64e..a7db813 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -568,6 +568,7 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
 	 */
 	if (fatal_signal_pending(current)) {
 		set_thread_flag(TIF_MEMDIE);
+		printk("XXX: %d skipping task with fatal signal pending\n", current->pid);
 		return;
 	}
 
@@ -576,8 +577,10 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
 	read_lock(&tasklist_lock);
 retry:
 	p = select_bad_process(&points, limit, mem, NULL);
-	if (!p || PTR_ERR(p) == -1UL)
+	if (!p || PTR_ERR(p) == -1UL) {
+		printk("XXX: %d nothing to kill\n", current->pid);
 		goto out;
+	}
 
 	if (oom_kill_process(p, gfp_mask, 0, points, limit, mem, NULL,
 				"Memory cgroup out of memory"))
-- 
Michal Hocko
SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Michal Hocko
Subject: Re: memory-cgroup bug
Date: Sun, 25 Nov 2012 13:05:24 +0100
Message-ID: <20121125120524.GB10623@dhcp22.suse.cz>
References: <20121121200207.01068046@pobox.sk>
 <20121122152441.GA9609@dhcp22.suse.cz>
 <20121122190526.390C7A28@pobox.sk>
 <20121122214249.GA20319@dhcp22.suse.cz>
 <20121122233434.3D5E35E6@pobox.sk>
 <20121123074023.GA24698@dhcp22.suse.cz>
 <20121123102137.10D6D653@pobox.sk>
 <20121123100438.GF24698@dhcp22.suse.cz>
 <20121125011047.7477BB5E@pobox.sk>
Mime-Version: 1.0
Return-path:
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=sender:date:from:to:cc:subject:message-id:references:mime-version
 :content-type:content-disposition:in-reply-to:user-agent;
 bh=JRO4EuFLAQhtK8TXCvKVK3sMn0Ads9nE0CIBsPpZJ5M=;
 b=vQn43ISmP0JGGL/6hMj2Zeuv7tCN/g+2+vW3mRQyvFZSteHgFIvCTV3Iwu6bgwd1n0
 icKctXZCCouhcxEBEkwkZHnYQh2FjjpaIca66cJX2QTad3421QgMhmiN/eYXPklSrBl7
 hcaVMGcTMVfh+kurohcQ0hr2rVIZ0D1qHr8ZkwUiWryg1q2JC5Ybdy0jUByONhZK869n
 gCy/pj4XONbc6iWJzilXdr7pCJDapxES0vZ/KmTdK7Pw9kYK7uPb2LvbbsQB4NxHAq7I
 cYE2fsc01/E/9FJnOiAAiSHOocEusQHStTIvsRhK//1QbP+4kpUmsghq7FZ7zn8RQr8L
 V9pg==
Content-Disposition: inline
In-Reply-To: <20121125011047.7477BB5E@pobox.sk>
Sender: owner-linux-mm@kvack.org
List-ID:
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: azurIt
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist, KAMEZAWA Hiroyuki

[Adding Kamezawa into CC]

On Sun
25-11-12 01:10:47, azurIt wrote:
> >Could you take few snapshots over time?
>
>
> Here it is, now from different server, snapshot was taken every second
> for 10 minutes (hope it's enough):
> www.watchdog.sk/lkml/memcg-bug-2.tar.gz

Hmm, interesting:
$ grep . */memory.failcnt | cut -d: -f2 | awk 'BEGIN{min=666666}{if (prev>0) {diff=$1-prev; if (diff>max) max=diff; if (diff<[...]
min:16281 max:224048 avg:18818.943119

So there is a lot of attempts to allocate which fail, every second!
Will get to that later.

The number of tasks in the group is stable (20):
$ for i in *; do ls -d1 $i/[0-9]* | wc -l; done | sort | uniq -c
    546 20

And no task has been killed or spawned:
$ for i in *; do ls -d1 $i/[0-9]* | cut -d/ -f2; done | sort | uniq
24495
24762
24774
24796
24798
24805
24813
24827
24831
24841
24842
24863
24892
24924
24931
25130
25131
25192
25193
25243

$ for stack in [0-9]*/[0-9]*
do
	head -n1 $stack/stack
done | sort | uniq -c
   9841 [] mem_cgroup_handle_oom+0x241/0x3b0
    546 [] do_truncate+0x58/0xa0
    533 [] 0xffffffffffffffff

Tells us that the stacks are pretty much stable.
$ grep do_truncate -r [0-9]* | cut -d/ -f2 | sort | uniq -c
    546 24495

So 24495 is stuck in do_truncate
[] do_truncate+0x58/0xa0
[] do_last+0x250/0xa30
[] path_openat+0xd7/0x440
[] do_filp_open+0x49/0xa0
[] do_sys_open+0x106/0x240
[] sys_open+0x20/0x30
[] system_call_fastpath+0x18/0x1d
[] 0xffffffffffffffff

I suspect it is waiting for i_mutex. Who is holding that lock?
Other tasks are blocked in mem_cgroup_handle_oom, either coming from
the page fault path, so i_mutex can be excluded, or from vfs_write
(24796), and that one is interesting:
[] mem_cgroup_handle_oom+0x241/0x3b0
[] T.1146+0x5ab/0x5c0
[] mem_cgroup_cache_charge+0xbe/0xe0
[] add_to_page_cache_locked+0x4c/0x140
[] add_to_page_cache_lru+0x22/0x50
[] grab_cache_page_write_begin+0x8b/0xe0
[] ext3_write_begin+0x88/0x270
[] generic_file_buffered_write+0x116/0x290
[] __generic_file_aio_write+0x27c/0x480
[] generic_file_aio_write+0x76/0xf0           # takes &inode->i_mutex
[] do_sync_write+0xea/0x130
[] vfs_write+0xf3/0x1f0
[] sys_write+0x51/0x90
[] system_call_fastpath+0x18/0x1d
[] 0xffffffffffffffff

This smells like a deadlock, but a rather strange one. The rapidly
increasing failcnt suggests that somebody is still trying to allocate,
but who, when all of them are hung in mem_cgroup_handle_oom? This can be
explained though.
The memcg OOM killer lets only one process (the one which is able to
lock the hierarchy by mem_cgroup_oom_lock) call mem_cgroup_out_of_memory
and kill a process, while others are waiting on the wait queue.
Once the killer is done it calls memcg_wakeup_oom, which wakes up the
other tasks waiting on the queue. Those retry the charge in the hope
that some memory has been freed in the meantime, which hasn't happened,
so they get into OOM again (and again and again).
This all usually works out, except that in this particular case I would
bet my hat that the OOM-selected task is pid 24495, which is blocked on
the mutex held by one of the tasks stuck in the oom killer path, so it
cannot finish and thus cannot free any memory.

It seems that the current Linus' tree is affected as well.

I will have to think about a solution but it sounds really tricky. It is
not just ext3 that is affected.

I guess we need to tell mem_cgroup_cache_charge that it should never
reach OOM from add_to_page_cache_locked. This sounds quite intrusive to
me. On the other hand it is really weird that an excessive writer might
trigger a memcg OOM killer.
-- 
Michal Hocko
SUSE Labs
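The awk one-liner used for the failcnt figures earlier in this message is
cut off in the archive. The sketch below is not the original script, only
a guess at the same computation: it reads one counter sample per line and
reports min/max/avg of the differences between consecutive samples. The
function name failcnt_deltas is made up for this example.

```shell
#!/bin/sh
# Hypothetical reconstruction, not the one-liner from the message above.
# Reads one counter sample per line on stdin and prints the minimum,
# maximum and average difference between consecutive samples.
failcnt_deltas() {
    awk '
        NR > 1 {
            diff = $1 - prev
            # n == 0 means this is the first delta, so it seeds min and max.
            if (n == 0 || diff < min) min = diff
            if (n == 0 || diff > max) max = diff
            sum += diff
            n++
        }
        { prev = $1 }
        END {
            # Fewer than two samples: there is nothing to compare.
            if (n == 0) { print "no deltas"; exit 1 }
            printf "min:%d max:%d avg:%f\n", min, max, sum / n
        }
    '
}
```

With per-second snapshot directories like the ones in the tarball, it
could be fed the same way as in the message:
grep . */memory.failcnt | cut -d: -f2 | failcnt_deltas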
From mboxrd@z Thu Jan 1 00:00:00 1970
From: "azurIt"
Subject: Re: memory-cgroup bug
Date: Sun, 25 Nov 2012 13:36:02 +0100
Message-ID: <20121125133602.CF488229@pobox.sk>
References: <20121121200207.01068046@pobox.sk>,
 <20121122152441.GA9609@dhcp22.suse.cz>,
 <20121122190526.390C7A28@pobox.sk>,
 <20121122214249.GA20319@dhcp22.suse.cz>,
 <20121122233434.3D5E35E6@pobox.sk>,
 <20121123074023.GA24698@dhcp22.suse.cz>,
 <20121123102137.10D6D653@pobox.sk>,
 <20121123100438.GF24698@dhcp22.suse.cz>,
 <20121125011047.7477BB5E@pobox.sk>
 <20121125120524.GB10623@dhcp22.suse.cz>
Mime-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Return-path:
In-Reply-To: <20121125120524.GB10623@dhcp22.suse.cz>
Sender: owner-linux-mm@kvack.org
List-ID:
Content-Type: text/plain; charset="us-ascii"
To: Michal Hocko
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist, KAMEZAWA Hiroyuki

>So there is a lot of attempts to allocate which fail, every second!


Yes, as i said, the cgroup was taking 100% of (allocated) CPU core(s).
Not sure if all processes were using CPU but _few_ of them (not only
one) for sure.
From mboxrd@z Thu Jan 1 00:00:00 1970
From: "azurIt"
Subject: Re: memory-cgroup bug
Date: Sun, 25 Nov 2012 13:39:53 +0100
Message-ID: <20121125133953.AD1B2F0A@pobox.sk>
References: <20121121200207.01068046@pobox.sk>,
 <20121122152441.GA9609@dhcp22.suse.cz>,
 <20121122190526.390C7A28@pobox.sk>,
 <20121122214249.GA20319@dhcp22.suse.cz>,
 <20121122233434.3D5E35E6@pobox.sk>,
 <20121123074023.GA24698@dhcp22.suse.cz>,
 <20121123102137.10D6D653@pobox.sk>,
 <20121123100438.GF24698@dhcp22.suse.cz>,
 <20121123155904.490039C5@pobox.sk>
 <20121125101707.GA10623@dhcp22.suse.cz>
Mime-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Return-path:
In-Reply-To: <20121125101707.GA10623@dhcp22.suse.cz>
Sender: owner-linux-mm@kvack.org
List-ID:
Content-Type: text/plain; charset="us-ascii"
To: Michal Hocko
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist

>Inlined at the end of the email. Please note I have compile tested
>it. It might produce a lot of output.


Thank you very much, i will install it ASAP (probably this night).


>dmesg | grep "Out of memory"
>doesn't tell anything, right?


Only messages for other cgroups but not for the freezed one (before nor
after the freeze).


azur
From mboxrd@z Thu Jan 1 00:00:00 1970
From: Michal Hocko
Subject: Re: memory-cgroup bug
Date: Sun, 25 Nov 2012 14:02:08 +0100
Message-ID: <20121125130208.GC10623@dhcp22.suse.cz>
References: <20121122152441.GA9609@dhcp22.suse.cz>
 <20121122190526.390C7A28@pobox.sk>
 <20121122214249.GA20319@dhcp22.suse.cz>
 <20121122233434.3D5E35E6@pobox.sk>
 <20121123074023.GA24698@dhcp22.suse.cz>
 <20121123102137.10D6D653@pobox.sk>
 <20121123100438.GF24698@dhcp22.suse.cz>
 <20121123155904.490039C5@pobox.sk>
 <20121125101707.GA10623@dhcp22.suse.cz>
 <20121125133953.AD1B2F0A@pobox.sk>
Mime-Version: 1.0
Return-path:
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=sender:date:from:to:cc:subject:message-id:references:mime-version
 :content-type:content-disposition:in-reply-to:user-agent;
 bh=cSwHJGVQ1jmUTDwd9nWsAYWvIs220aGECBIV+I9+bOM=;
 b=WL9o986OUJSk/SdN7ro6zd+QEf8rQTqQBxpu118o0aTJiqPNED+6py9wncHgbypSgF
 EgjO1uqRe1VHc/1y5mclSmHGRSJibgIqwzJWqnVXZdqg44rD7y9k7PgNR6MTOtoH6wLD
 koG+E8T2EzKLdSqGbwDecLH3PJWyeT6mj7famL3j3f7zE/AxZtsTgddUCidBXebIhJFK
 kT4Fd30TbmcFJPaWwn+fOQdojI15HhnAmiH30uVkWNtE2qWBmDFDYCCRxdEdLYnJId4H
 sVCQ4u98ld7jyZP8XVALMahVu0vjbnNrqVLDitwbc4aWdmY4Ij9eyTLHTrlq4pAk9b92
 2Nkw==
Content-Disposition: inline
In-Reply-To: <20121125133953.AD1B2F0A-Rm0zKEqwvD4@public.gmane.org>
Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
List-ID:
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: azurIt
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist

On Sun 25-11-12 13:39:53, azurIt wrote:
> >Inlined at the end of the email. Please note I have compile tested
> >it. It might produce a lot of output.
>
>
> Thank you very much, i will install it ASAP (probably this night).

Please don't. If my analysis is correct which I am almost 100% sure it
is then it would cause excessive logging.
I am sorry I cannot come up with something else in the mean time. -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: "azurIt" Subject: =?utf-8?q?Re=3A_memory=2Dcgroup_bug?= Date: Sun, 25 Nov 2012 14:27:09 +0100 Message-ID: <20121125142709.19F4E8C2@pobox.sk> References: <20121122152441.GA9609@dhcp22.suse.cz>, <20121122190526.390C7A28@pobox.sk>, <20121122214249.GA20319@dhcp22.suse.cz>, <20121122233434.3D5E35E6@pobox.sk>, <20121123074023.GA24698@dhcp22.suse.cz>, <20121123102137.10D6D653@pobox.sk>, <20121123100438.GF24698@dhcp22.suse.cz>, <20121123155904.490039C5@pobox.sk>, <20121125101707.GA10623@dhcp22.suse.cz>, <20121125133953.AD1B2F0A@pobox.sk> <20121125130208.GC10623@dhcp22.suse.cz> Mime-Version: 1.0 Content-Transfer-Encoding: 8bit Return-path: In-Reply-To: <20121125130208.GC10623-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" To: =?utf-8?q?Michal_Hocko?= Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, =?utf-8?q?cgroups_mailinglist?= >> Thank you very much, i will install it ASAP (probably this night). > >Please don't. If my analysis is correct which I am almost 100% sure it >is then it would cause excessive logging. I am sorry I cannot come up >with something else in the mean time. Ok then. I will, meanwhile, try to contact Andrea Righi (author of cgroup-task etc.) and ask him to send here his opinion about relation between freezes and his patches. Maybe it's some kind of a bug in memcg which don't appear in current vanilla code and is triggered by conditions created by, for example, cgroup-task. I noticed that there is always the exact number of freezed processes as the limit set for number of tasks by cgroup-task (i already tried to raise this limit AFTER the cgroup was freezed, didn't change anything). 
I'm sure it's not the problem with cgroup-task alone, it's 100% related also to memcg (but maybe there must be the combination of both of them). Thank you so far for your time! azur From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: memory-cgroup bug Date: Sun, 25 Nov 2012 14:44:40 +0100 Message-ID: <20121125134440.GD10623@dhcp22.suse.cz> References: <20121122214249.GA20319@dhcp22.suse.cz> <20121122233434.3D5E35E6@pobox.sk> <20121123074023.GA24698@dhcp22.suse.cz> <20121123102137.10D6D653@pobox.sk> <20121123100438.GF24698@dhcp22.suse.cz> <20121123155904.490039C5@pobox.sk> <20121125101707.GA10623@dhcp22.suse.cz> <20121125133953.AD1B2F0A@pobox.sk> <20121125130208.GC10623@dhcp22.suse.cz> <20121125142709.19F4E8C2@pobox.sk> Mime-Version: 1.0 Return-path: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=sender:date:from:to:cc:subject:message-id:references:mime-version :content-type:content-disposition:in-reply-to:user-agent; bh=mfRXSlbaRf5QSWtz2E6IfCZGhrpXqwtuzrJhon/y8X4=; b=suJpFAosvzWEDCJ7192V1YNXQzxkq96xN886ypAVQ5LeT2IyhevtNoRiJAh+nsc296 vGK6TTpe3gBQ59WLyu1GRDQPky4TSwf/rZU56++kGvJ55TN3/Z6InTrieqwcvSJBzvpd m7eBIIavg1vg4zgJ81Xz7ETvpA8jYd1FM4Cx4vbT1y6VNHk6W1JKkUFWEEHS4avvSUVQ EGJdRowdyVvstjjjtuYdPzGi+4wOhletgrBCtImuVUL6ixMItG/3aysv6HlY8rSc84En vj/pRy0C+R6frNa2nR+0OUO7QwR+4jUnILEtKJxiihRR2HKGe9GJ69sCUR5A7/5HG9NQ CCEQ== Content-Disposition: inline In-Reply-To: <20121125142709.19F4E8C2-Rm0zKEqwvD4@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: azurIt Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist On Sun 25-11-12 14:27:09, azurIt wrote: > >> Thank you very much, i will install it ASAP (probably this night). > > > >Please don't. 
If my analysis is correct which I am almost 100% sure it > >is then it would cause excessive logging. I am sorry I cannot come up > >with something else in the mean time. > > > Ok then. I will, meanwhile, try to contact Andrea Righi (author of > cgroup-task etc.) and ask him to send here his opinion about relation > between freezes and his patches. As I described in other email. This seems to be a deadlock in memcg oom so I do not think that other patches influence this. -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: memory-cgroup bug Date: Sun, 25 Nov 2012 14:55:42 +0100 Message-ID: <20121125135542.GE10623@dhcp22.suse.cz> References: <20121121200207.01068046@pobox.sk> <20121122152441.GA9609@dhcp22.suse.cz> <20121122190526.390C7A28@pobox.sk> <20121122214249.GA20319@dhcp22.suse.cz> <20121122233434.3D5E35E6@pobox.sk> <20121123074023.GA24698@dhcp22.suse.cz> <20121123102137.10D6D653@pobox.sk> <20121123100438.GF24698@dhcp22.suse.cz> <20121125011047.7477BB5E@pobox.sk> <20121125120524.GB10623@dhcp22.suse.cz> Mime-Version: 1.0 Return-path: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=sender:date:from:to:cc:subject:message-id:references:mime-version :content-type:content-disposition:in-reply-to:user-agent; bh=3fz3q1LodYzVee5BYASxUz57TxhDlkeofYwPhBu/xWk=; b=sg0wOvIKsBqDvpqI2JkEu3YkcG+1z0lliKYYpvhHvG4QpQOWjuBfYklB/xoz7O0lHl lO24wLf8S9qkparr2mczIhijbdGbTMB+JEonBOfONh8XxZk1Q9/nGHncAkr8y0PDxKDX dknjSWzu07DxrpOgwmzP11OjTsiqd/74BxLEYQnhsQjP4Jtr/skVzSJL+VsslMHSlCk4 B5zVUJWbHSovKaEP6h91KUxgscR3cbVF+GmybQw562nRZiVP/zKUFXQIsuA0Lyxab6qy IGde5fe8H12jGDkboOmkUGbdHTLiFJ9X8V92rCBB2LkOoOhhbOQ3CpxcUxTzMm5TcDVj tnlQ== Content-Disposition: inline In-Reply-To: <20121125120524.GB10623@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA 
Hiroyuki

On Sun 25-11-12 13:05:24, Michal Hocko wrote:
> [Adding Kamezawa into CC]
>
> On Sun 25-11-12 01:10:47, azurIt wrote:
> > >Could you take few snapshots over time?
> >
> >
> > Here it is, now from different server, snapshot was taken every second
> > for 10 minutes (hope it's enough):
> > www.watchdog.sk/lkml/memcg-bug-2.tar.gz
>
> Hmm, interesting:
> $ grep . */memory.failcnt | cut -d: -f2 | awk 'BEGIN{min=666666}{if (prev>0) {diff=$1-prev; if (diff>max) max=diff; if (diff<[...]
> min:16281 max:224048 avg:18818.943119
>
> So there is a lot of attempts to allocate which fail, every second!
> Will get to that later.
>
> The number of tasks in the group is stable (20):
> $ for i in *; do ls -d1 $i/[0-9]* | wc -l; done | sort | uniq -c
>     546 20
>
> And no task has been killed or spawned:
> $ for i in *; do ls -d1 $i/[0-9]* | cut -d/ -f2; done | sort | uniq
> 24495
> 24762
> 24774
> 24796
> 24798
> 24805
> 24813
> 24827
> 24831
> 24841
> 24842
> 24863
> 24892
> 24924
> 24931
> 25130
> 25131
> 25192
> 25193
> 25243
>
> $ for stack in [0-9]*/[0-9]*
> do
> 	head -n1 $stack/stack
> done | sort | uniq -c
>    9841 [] mem_cgroup_handle_oom+0x241/0x3b0
>     546 [] do_truncate+0x58/0xa0
>     533 [] 0xffffffffffffffff
>
> Tells us that the stacks are pretty much stable.
> $ grep do_truncate -r [0-9]* | cut -d/ -f2 | sort | uniq -c
>     546 24495
>
> So 24495 is stuck in do_truncate
> [] do_truncate+0x58/0xa0
> [] do_last+0x250/0xa30
> [] path_openat+0xd7/0x440
> [] do_filp_open+0x49/0xa0
> [] do_sys_open+0x106/0x240
> [] sys_open+0x20/0x30
> [] system_call_fastpath+0x18/0x1d
> [] 0xffffffffffffffff
>
> I suspect it is waiting for i_mutex. Who is holding that lock?
> Other tasks are blocked on the mem_cgroup_handle_oom either coming from > the page fault path so i_mutex can be excluded or vfs_write (24796) and > that one is interesting: > [] mem_cgroup_handle_oom+0x241/0x3b0 > [] T.1146+0x5ab/0x5c0 > [] mem_cgroup_cache_charge+0xbe/0xe0 > [] add_to_page_cache_locked+0x4c/0x140 > [] add_to_page_cache_lru+0x22/0x50 > [] grab_cache_page_write_begin+0x8b/0xe0 > [] ext3_write_begin+0x88/0x270 > [] generic_file_buffered_write+0x116/0x290 > [] __generic_file_aio_write+0x27c/0x480 > [] generic_file_aio_write+0x76/0xf0 # takes &inode->i_mutex > [] do_sync_write+0xea/0x130 > [] vfs_write+0xf3/0x1f0 > [] sys_write+0x51/0x90 > [] system_call_fastpath+0x18/0x1d > [] 0xffffffffffffffff > > This smells like a deadlock. But a kind of strange one. The rapidly > increasing failcnt suggests that somebody still tries to allocate but > who, when all of them are hung in the mem_cgroup_handle_oom? This can be > explained though. > Memcg OOM killer lets only one process (which is able to lock the > hierarchy by mem_cgroup_oom_lock) call mem_cgroup_out_of_memory and kill > a process, while others are waiting on the wait queue. Once the killer > is done it calls memcg_wakeup_oom which wakes up other tasks waiting on > the queue. Those retry the charge, in a hope there is some memory freed > in the meantime which hasn't happened so they get into OOM again (and > again and again). > This all usually works out except in this particular case I would bet > my hat that the OOM selected task is pid 24495 which is blocked on the > mutex which is held by one of the oom killer tasks so it cannot finish - > and thus free memory. > > It seems that the current Linus' tree is affected as well. > > I will have to think about a solution but it sounds really tricky. It is > not just ext3 that is affected. > > I guess we need to tell mem_cgroup_cache_charge that it should never > reach OOM from add_to_page_cache_locked. This sounds quite intrusive to > me.
On the other hand it is really weird that an excessive writer might > trigger a memcg OOM killer. This is hackish but it should help you in this case. Kamezawa, what do you think about that? Should we generalize this and prepare something like mem_cgroup_cache_charge_locked which would add __GFP_NORETRY automatically and use the function whenever we are in a locked context? To be honest I do not like this very much but nothing more sensible (without touching non-memcg paths) comes to my mind. --- diff --git a/mm/filemap.c b/mm/filemap.c index 83efee7..da50c83 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -448,7 +448,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, VM_BUG_ON(PageSwapBacked(page)); error = mem_cgroup_cache_charge(page, current->mm, - gfp_mask & GFP_RECLAIM_MASK); + (gfp_mask | __GFP_NORETRY) & GFP_RECLAIM_MASK); if (error) goto out; -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
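The failcnt figures quoted earlier in this message (minimum, maximum and average increase between consecutive one-second snapshots) can be reproduced with a few lines of C. This is only a sketch: the function and struct names are hypothetical, and it assumes the memory.failcnt values have already been read into an array in snapshot order. It is not the one-liner used in the mail.

```c
#include <stddef.h>

/* Hypothetical result type for this sketch. */
struct delta_stats {
	long min;      /* smallest increase between two snapshots */
	long max;      /* largest increase between two snapshots */
	double avg;    /* mean increase over all snapshot pairs */
	size_t count;  /* number of snapshot pairs considered */
};

/*
 * Compute min/max/avg of the differences between consecutive samples.
 * Returns 0 on success, -1 if there are fewer than two samples.
 */
static int failcnt_delta_stats(const long *samples, size_t n,
			       struct delta_stats *out)
{
	if (n < 2)
		return -1;

	long min = samples[1] - samples[0];
	long max = min;
	double sum = 0.0;

	for (size_t i = 1; i < n; i++) {
		long diff = samples[i] - samples[i - 1];

		if (diff < min)
			min = diff;
		if (diff > max)
			max = diff;
		sum += (double)diff;
	}

	out->min = min;
	out->max = max;
	out->avg = sum / (double)(n - 1);
	out->count = n - 1;
	return 0;
}
```

A steadily large minimum, as in the numbers above, means charges keep failing in every interval even though no task in the group appears to make progress.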
From mboxrd@z Thu Jan 1 00:00:00 1970 From: "azurIt" Subject: Re: memory-cgroup bug Date: Mon, 26 Nov 2012 01:38:55 +0100 Message-ID: <20121126013855.AF118F5E@pobox.sk> References: <20121121200207.01068046@pobox.sk>, <20121122152441.GA9609@dhcp22.suse.cz>, <20121122190526.390C7A28@pobox.sk>, <20121122214249.GA20319@dhcp22.suse.cz>, <20121122233434.3D5E35E6@pobox.sk>, <20121123074023.GA24698@dhcp22.suse.cz>, <20121123102137.10D6D653@pobox.sk>, <20121123100438.GF24698@dhcp22.suse.cz>, <20121125011047.7477BB5E@pobox.sk>, <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> Mime-Version: 1.0 Content-Transfer-Encoding: quoted-printable Return-path: In-Reply-To: <20121125135542.GE10623@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" To: Michal Hocko Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki >This is hackish but it should help you in this case. Kamezawa, what do >you think about that? Should we generalize this and prepare something >like mem_cgroup_cache_charge_locked which would add __GFP_NORETRY >automatically and use the function whenever we are in a locked context? >To be honest I do not like this very much but nothing more sensible >(without touching non-memcg paths) comes to my mind. I installed kernel with this patch, will report back if problem occurs again OR in few weeks if everything will be ok. Thank you! Btw, will this patch be backported to 3.2? azur -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: memory-cgroup bug Date: Mon, 26 Nov 2012 08:57:07 +0100 Message-ID: <20121126075656.GA17860@dhcp22.suse.cz> References: <20121122190526.390C7A28@pobox.sk> <20121122214249.GA20319@dhcp22.suse.cz> <20121122233434.3D5E35E6@pobox.sk> <20121123074023.GA24698@dhcp22.suse.cz> <20121123102137.10D6D653@pobox.sk> <20121123100438.GF24698@dhcp22.suse.cz> <20121125011047.7477BB5E@pobox.sk> <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20121126013855.AF118F5E-Rm0zKEqwvD4@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: azurIt Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist , KAMEZAWA Hiroyuki On Mon 26-11-12 01:38:55, azurIt wrote: > >This is hackish but it should help you in this case. Kamezawa, what do > >you think about that? Should we generalize this and prepare something > >like mem_cgroup_cache_charge_locked which would add __GFP_NORETRY > >automatically and use the function whenever we are in a locked context? > >To be honest I do not like this very much but nothing more sensible > >(without touching non-memcg paths) comes to my mind. > > > I installed kernel with this patch, will report back if problem occurs > again OR in few weeks if everything will be ok. Thank you! Thanks! > Btw, will this patch be backported to 3.2? Once we agree on a proper solution it will be backported to the stable trees. 
-- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked Date: Mon, 26 Nov 2012 14:18:37 +0100 Message-ID: <20121126131837.GC17860@dhcp22.suse.cz> References: <20121122190526.390C7A28@pobox.sk> <20121122214249.GA20319@dhcp22.suse.cz> <20121122233434.3D5E35E6@pobox.sk> <20121123074023.GA24698@dhcp22.suse.cz> <20121123102137.10D6D653@pobox.sk> <20121123100438.GF24698@dhcp22.suse.cz> <20121125011047.7477BB5E@pobox.sk> <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20121126013855.AF118F5E@pobox.sk> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner [CCing also Johannes - the thread started here: https://lkml.org/lkml/2012/11/21/497] On Mon 26-11-12 01:38:55, azurIt wrote: > >This is hackish but it should help you in this case. Kamezawa, what do > >you think about that? Should we generalize this and prepare something > >like mem_cgroup_cache_charge_locked which would add __GFP_NORETRY > >automatically and use the function whenever we are in a locked context? > >To be honest I do not like this very much but nothing more sensible > >(without touching non-memcg paths) comes to my mind. > > > I installed kernel with this patch, will report back if problem occurs > again OR in few weeks if everything will be ok. Thank you! Now that I am looking at the patch closer it will not work because it depends on another patch which is not merged yet and even that one wouldn't help on its own because __GFP_NORETRY doesn't break the charge loop. Sorry I have missed that... The patch below should help though.
(it is based on top of the current -mm tree but I will send a backport to 3.2 in the reply as well) --- From 7796f942d62081ad45726efd90b5292b80e7c690 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Mon, 26 Nov 2012 11:47:57 +0100 Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked memcg oom killer might deadlock if the process which falls down to mem_cgroup_handle_oom holds a lock which prevents another task from terminating because it is blocked on the very same lock. This can happen when a write system call needs to allocate a page but the allocation hits the memcg hard limit and there is nothing to reclaim (e.g. there is no swap or swap limit is hit as well and all cache pages have been reclaimed already) and the process selected by memcg OOM killer is blocked on i_mutex on the same inode (e.g. while truncating it). Process A [] do_truncate+0x58/0xa0 # takes i_mutex [] do_last+0x250/0xa30 [] path_openat+0xd7/0x440 [] do_filp_open+0x49/0xa0 [] do_sys_open+0x106/0x240 [] sys_open+0x20/0x30 [] system_call_fastpath+0x18/0x1d [] 0xffffffffffffffff Process B [] mem_cgroup_handle_oom+0x241/0x3b0 [] T.1146+0x5ab/0x5c0 [] mem_cgroup_cache_charge+0xbe/0xe0 [] add_to_page_cache_locked+0x4c/0x140 [] add_to_page_cache_lru+0x22/0x50 [] grab_cache_page_write_begin+0x8b/0xe0 [] ext3_write_begin+0x88/0x270 [] generic_file_buffered_write+0x116/0x290 [] __generic_file_aio_write+0x27c/0x480 [] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex [] do_sync_write+0xea/0x130 [] vfs_write+0xf3/0x1f0 [] sys_write+0x51/0x90 [] system_call_fastpath+0x18/0x1d [] 0xffffffffffffffff This is not a hard deadlock though because the administrator can still intervene and increase the limit on the group which helps the writer to finish the allocation and release the lock. This patch heals the problem by forbidding OOM from page cache charges (namely add_to_page_cache_locked).
mem_cgroup_cache_charge_no_oom helper function is defined which adds GFP_MEMCG_NO_OOM to the gfp mask which then tells mem_cgroup_charge_common that OOM is not allowed for the charge. No OOM from this path, apart from fixing the bug, also makes some sense as we really do not want to cause an OOM because of page cache usage. As a possibly visible result add_to_page_cache_lru might fail more often with ENOMEM but this is to be expected if the limit is set and it is preferable to the OOM killer IMO. __GFP_NORETRY is abused for this memcg specific flag because it has been used to prevent OOM already (since not-merged-yet "memcg: reclaim when more than one page needed"). The only difference is that the flag doesn't prevent reclaim anymore which kind of makes sense because the global memory allocator triggers reclaim as well. The retry without any reclaim on __GFP_NORETRY doesn't make much sense anyway because this is effectively a busy loop with allowed OOM in this path. Reported-by: azurIt Signed-off-by: Michal Hocko --- include/linux/gfp.h | 3 +++ include/linux/memcontrol.h | 12 ++++++++++++ mm/filemap.c | 8 +++++++- mm/memcontrol.c | 5 +---- 4 files changed, 23 insertions(+), 5 deletions(-) diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 10e667f..aac9b21 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -152,6 +152,9 @@ struct vm_area_struct; /* 4GB DMA on some platforms */ #define GFP_DMA32 __GFP_DMA32 +/* memcg oom killer is not allowed */ +#define GFP_MEMCG_NO_OOM __GFP_NORETRY + /* Convert GFP flags to their corresponding migrate type */ static inline int allocflags_to_migratetype(gfp_t gfp_flags) { diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 095d2b4..1ad4bc6 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -65,6 +65,12 @@ extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg); extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, gfp_t
gfp_mask); +static inline int mem_cgroup_cache_charge_no_oom(struct page *page, + struct mm_struct *mm, gfp_t gfp_mask) +{ + return mem_cgroup_cache_charge(page, mm, gfp_mask | GFP_MEMCG_NO_OOM); +} + struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *); struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *); @@ -215,6 +221,12 @@ static inline int mem_cgroup_cache_charge(struct page *page, return 0; } +static inline int mem_cgroup_cache_charge_no_oom(struct page *page, + struct mm_struct *mm, gfp_t gfp_mask) +{ + return 0; +} + static inline int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page, gfp_t gfp_mask, struct mem_cgroup **memcgp) { diff --git a/mm/filemap.c b/mm/filemap.c index 83efee7..ef14351 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -447,7 +447,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, VM_BUG_ON(!PageLocked(page)); VM_BUG_ON(PageSwapBacked(page)); - error = mem_cgroup_cache_charge(page, current->mm, + /* + * Cannot trigger OOM even if gfp_mask would allow that normally + * because we might be called from a locked context and that + * could lead to deadlocks if the killed process is waiting for + * the same lock. 
+ */ + error = mem_cgroup_cache_charge_no_oom(page, current->mm, gfp_mask & GFP_RECLAIM_MASK); if (error) goto out; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 02ee2f7..b4754ba 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2430,9 +2430,6 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, if (!(gfp_mask & __GFP_WAIT)) return CHARGE_WOULDBLOCK; - if (gfp_mask & __GFP_NORETRY) - return CHARGE_NOMEM; - ret = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags); if (mem_cgroup_margin(mem_over_limit) >= nr_pages) return CHARGE_RETRY; @@ -3713,7 +3710,7 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm, { struct mem_cgroup *memcg = NULL; unsigned int nr_pages = 1; - bool oom = true; + bool oom = !(gfp_mask & GFP_MEMCG_NO_OOM); int ret; if (PageTransHuge(page)) { -- 1.7.10.4 -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
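The flag handling that this patch relies on can be checked in isolation. The following is a minimal userspace sketch, not kernel code: the SKETCH_* bit values and the function names are made up for illustration, and only the bitwise logic matters. The point it illustrates is that the "is OOM allowed" test in mem_cgroup_charge_common has to mask with a bitwise AND; an OR with a non-zero flag can never be zero, so that form would report "OOM not allowed" for every caller.

```c
#include <stdbool.h>

/* Hypothetical bit values, chosen for illustration only. */
#define SKETCH_GFP_WAIT      0x10u
#define SKETCH_GFP_NORETRY   0x1000u
#define SKETCH_MEMCG_NO_OOM  SKETCH_GFP_NORETRY

/* OOM is allowed unless the caller set the no-OOM flag. */
static bool oom_allowed(unsigned int gfp_mask)
{
	return !(gfp_mask & SKETCH_MEMCG_NO_OOM);
}

/*
 * The OR variant, shown for contrast: the flag itself is non-zero,
 * so the result is false for every possible mask.
 */
static bool oom_allowed_with_or(unsigned int gfp_mask)
{
	return !(gfp_mask | SKETCH_MEMCG_NO_OOM);
}

/* Mirrors the idea of the _no_oom helper: force the flag on. */
static unsigned int cache_charge_mask(unsigned int gfp_mask)
{
	return gfp_mask | SKETCH_MEMCG_NO_OOM;
}
```

With the AND form, ordinary charges keep the memcg OOM killer enabled and only charges routed through the no-OOM helper disable it.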
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Date: Mon, 26 Nov 2012 14:21:49 +0100 Message-ID: <20121126132149.GD17860@dhcp22.suse.cz> References: <20121122214249.GA20319@dhcp22.suse.cz> <20121122233434.3D5E35E6@pobox.sk> <20121123074023.GA24698@dhcp22.suse.cz> <20121123102137.10D6D653@pobox.sk> <20121123100438.GF24698@dhcp22.suse.cz> <20121125011047.7477BB5E@pobox.sk> <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20121126131837.GC17860-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: azurIt Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner Here we go with the patch for 3.2.34. Could you test with this one, please? --- >From 0d2d915c16f93918051b7ab8039d30b5a922049c Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Mon, 26 Nov 2012 11:47:57 +0100 Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked memcg oom killer might deadlock if the process which falls down to mem_cgroup_handle_oom holds a lock which prevents other task to terminate because it is blocked on the very same lock. This can happen when a write system call needs to allocate a page but the allocation hits the memcg hard limit and there is nothing to reclaim (e.g. there is no swap or swap limit is hit as well and all cache pages have been reclaimed already) and the process selected by memcg OOM killer is blocked on i_mutex on the same inode (e.g. truncate it). 
Process A [] do_truncate+0x58/0xa0 # takes i_mutex [] do_last+0x250/0xa30 [] path_openat+0xd7/0x440 [] do_filp_open+0x49/0xa0 [] do_sys_open+0x106/0x240 [] sys_open+0x20/0x30 [] system_call_fastpath+0x18/0x1d [] 0xffffffffffffffff Process B [] mem_cgroup_handle_oom+0x241/0x3b0 [] T.1146+0x5ab/0x5c0 [] mem_cgroup_cache_charge+0xbe/0xe0 [] add_to_page_cache_locked+0x4c/0x140 [] add_to_page_cache_lru+0x22/0x50 [] grab_cache_page_write_begin+0x8b/0xe0 [] ext3_write_begin+0x88/0x270 [] generic_file_buffered_write+0x116/0x290 [] __generic_file_aio_write+0x27c/0x480 [] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex [] do_sync_write+0xea/0x130 [] vfs_write+0xf3/0x1f0 [] sys_write+0x51/0x90 [] system_call_fastpath+0x18/0x1d [] 0xffffffffffffffff This is not a hard deadlock though because the administrator can still intervene and increase the limit on the group which helps the writer to finish the allocation and release the lock. This patch heals the problem by forbidding OOM from page cache charges (namely add_to_page_cache_locked). mem_cgroup_cache_charge_no_oom helper function is defined which adds GFP_MEMCG_NO_OOM to the gfp mask which then tells mem_cgroup_charge_common that OOM is not allowed for the charge. No OOM from this path, apart from fixing the bug, also makes some sense as we really do not want to cause an OOM because of page cache usage. As a possibly visible result add_to_page_cache_lru might fail more often with ENOMEM but this is to be expected if the limit is set and it is preferable to the OOM killer IMO. __GFP_NORETRY is abused for this memcg specific flag because no user accounted allocation uses this flag except for THP which has memcg oom disabled already.
Reported-by: azurIt Signed-off-by: Michal Hocko --- include/linux/gfp.h | 3 +++ include/linux/memcontrol.h | 13 +++++++++++++ mm/filemap.c | 8 +++++++- mm/memcontrol.c | 2 +- 4 files changed, 24 insertions(+), 2 deletions(-) diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 3a76faf..806fb54 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -146,6 +146,9 @@ struct vm_area_struct; /* 4GB DMA on some platforms */ #define GFP_DMA32 __GFP_DMA32 +/* memcg oom killer is not allowed */ +#define GFP_MEMCG_NO_OOM __GFP_NORETRY + /* Convert GFP flags to their corresponding migrate type */ static inline int allocflags_to_migratetype(gfp_t gfp_flags) { diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 81572af..bf0e575 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -63,6 +63,13 @@ extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *ptr); extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask); + +static inline int mem_cgroup_cache_charge_no_oom(struct page *page, + struct mm_struct *mm, gfp_t gfp_mask) +{ + return mem_cgroup_cache_charge(page, mm, gfp_mask | GFP_MEMCG_NO_OOM); +} + extern void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru); extern void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru); extern void mem_cgroup_rotate_reclaimable_page(struct page *page); @@ -178,6 +185,12 @@ static inline int mem_cgroup_cache_charge(struct page *page, return 0; } +static inline int mem_cgroup_cache_charge_no_oom(struct page *page, + struct mm_struct *mm, gfp_t gfp_mask) +{ + return 0; +} + static inline int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page, gfp_t gfp_mask, struct mem_cgroup **ptr) { diff --git a/mm/filemap.c b/mm/filemap.c index 556858c..ef182a9 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -449,7 +449,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, 
VM_BUG_ON(!PageLocked(page)); VM_BUG_ON(PageSwapBacked(page)); - error = mem_cgroup_cache_charge(page, current->mm, + /* + * Cannot trigger OOM even if gfp_mask would allow that normally + * because we might be called from a locked context and that + * could lead to deadlocks if the killed process is waiting for + * the same lock. + */ + error = mem_cgroup_cache_charge_no_oom(page, current->mm, gfp_mask & GFP_RECLAIM_MASK); if (error) goto out; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c8425b1..1dbbe7f 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2703,7 +2703,7 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm, struct mem_cgroup *memcg = NULL; unsigned int nr_pages = 1; struct page_cgroup *pc; - bool oom = true; + bool oom = !(gfp_mask & GFP_MEMCG_NO_OOM); int ret; if (PageTransHuge(page)) { -- 1.7.10.4 -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: Johannes Weiner Subject: Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked Date: Mon, 26 Nov 2012 12:46:22 -0500 Message-ID: <20121126174622.GE2799@cmpxchg.org> References: <20121122214249.GA20319@dhcp22.suse.cz> <20121122233434.3D5E35E6@pobox.sk> <20121123074023.GA24698@dhcp22.suse.cz> <20121123102137.10D6D653@pobox.sk> <20121123100438.GF24698@dhcp22.suse.cz> <20121125011047.7477BB5E@pobox.sk> <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20121126131837.GC17860@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Michal Hocko Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki On Mon, Nov 26, 2012 at 02:18:37PM +0100, Michal Hocko wrote: > [CCing also Johannes - the thread started here: >
https://lkml.org/lkml/2012/11/21/497] > > On Mon 26-11-12 01:38:55, azurIt wrote: > > >This is hackish but it should help you in this case. Kamezawa, what do > > >you think about that? Should we generalize this and prepare something > > >like mem_cgroup_cache_charge_locked which would add __GFP_NORETRY > > >automatically and use the function whenever we are in a locked context? > > >To be honest I do not like this very much but nothing more sensible > > >(without touching non-memcg paths) comes to my mind. > > > > > > I installed kernel with this patch, will report back if problem occurs > > again OR in few weeks if everything will be ok. Thank you! > > Now that I am looking at the patch closer it will not work because it > depends on other patch which is not merged yet and even that one would > help on its own because __GFP_NORETRY doesn't break the charge loop. > Sorry I have missed that... > > The patch bellow should help though. (it is based on top of the current > -mm tree but I will send a backport to 3.2 in the reply as well) > --- > >From 7796f942d62081ad45726efd90b5292b80e7c690 Mon Sep 17 00:00:00 2001 > From: Michal Hocko > Date: Mon, 26 Nov 2012 11:47:57 +0100 > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked > > memcg oom killer might deadlock if the process which falls down to > mem_cgroup_handle_oom holds a lock which prevents other task to > terminate because it is blocked on the very same lock. > This can happen when a write system call needs to allocate a page but > the allocation hits the memcg hard limit and there is nothing to reclaim > (e.g. there is no swap or swap limit is hit as well and all cache pages > have been reclaimed already) and the process selected by memcg OOM > killer is blocked on i_mutex on the same inode (e.g. truncate it). 
> > Process A > [] do_truncate+0x58/0xa0 # takes i_mutex > [] do_last+0x250/0xa30 > [] path_openat+0xd7/0x440 > [] do_filp_open+0x49/0xa0 > [] do_sys_open+0x106/0x240 > [] sys_open+0x20/0x30 > [] system_call_fastpath+0x18/0x1d > [] 0xffffffffffffffff > > Process B > [] mem_cgroup_handle_oom+0x241/0x3b0 > [] T.1146+0x5ab/0x5c0 > [] mem_cgroup_cache_charge+0xbe/0xe0 > [] add_to_page_cache_locked+0x4c/0x140 > [] add_to_page_cache_lru+0x22/0x50 > [] grab_cache_page_write_begin+0x8b/0xe0 > [] ext3_write_begin+0x88/0x270 > [] generic_file_buffered_write+0x116/0x290 > [] __generic_file_aio_write+0x27c/0x480 > [] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex > [] do_sync_write+0xea/0x130 > [] vfs_write+0xf3/0x1f0 > [] sys_write+0x51/0x90 > [] system_call_fastpath+0x18/0x1d > [] 0xffffffffffffffff So process B manages to lock the hierarchy, calls mem_cgroup_out_of_memory() and retries the charge infinitely, waiting for task A to die. All while it holds the i_mutex, preventing task A from dying, right? I think global oom already handles this in a much better way: invoke the OOM killer, sleep for a second, then return to userspace to relinquish all kernel resources and locks. The only reason why we can't simply change from an endless retry loop is because we don't want to return VM_FAULT_OOM and invoke the global OOM killer. But maybe we can return a new VM_FAULT_OOM_HANDLED for memcg OOM and just restart the pagefault. Return -ENOMEM to the buffered IO syscall respectively. This way, the memcg OOM killer is invoked as it should but nobody gets stuck anywhere livelocking with the exiting task. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
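The livelock discussed in this message can be modelled in a few lines of userspace C. This is a toy, single-threaded simulation with hypothetical names, not kernel code. It assumes a writer that holds a lock (standing in for i_mutex) while its charge fails, and an OOM victim that can only exit, and so free memory, once that lock is released. Retrying the charge under the lock never makes progress; failing with -ENOMEM and dropping the lock lets the victim exit, after which a retry by the caller (an assumption of this sketch) succeeds.

```c
#include <stdbool.h>
#include <errno.h>

struct sketch_state {
	bool lock_held_by_writer; /* stands in for i_mutex */
	bool victim_exited;       /* OOM victim has released its memory */
	int free_pages;           /* pages still available under the limit */
};

/* The victim needs the lock before it can exit and free memory. */
static void victim_step(struct sketch_state *s)
{
	if (!s->victim_exited && !s->lock_held_by_writer) {
		s->victim_exited = true;
		s->free_pages += 1;
	}
}

/* One charge attempt: succeeds only if a page is available. */
static int try_charge(struct sketch_state *s)
{
	if (s->free_pages > 0) {
		s->free_pages--;
		return 0;
	}
	return -ENOMEM;
}

/* Writer keeps retrying the charge while holding the lock. */
static int write_retry_under_lock(struct sketch_state *s, int max_rounds)
{
	int ret = -ENOMEM;

	s->lock_held_by_writer = true;
	for (int i = 0; i < max_rounds && ret; i++) {
		ret = try_charge(s);
		victim_step(s); /* victim is still blocked on the lock */
	}
	s->lock_held_by_writer = false;
	return ret;
}

/* Writer fails the charge, drops the lock, and the caller retries once. */
static int write_fail_and_retry(struct sketch_state *s)
{
	int ret;

	s->lock_held_by_writer = true;
	ret = try_charge(s);
	s->lock_held_by_writer = false;
	if (ret == 0)
		return 0;

	victim_step(s); /* lock is free, so the victim can exit now */

	s->lock_held_by_writer = true;
	ret = try_charge(s);
	s->lock_held_by_writer = false;
	return ret;
}
```

The bounded loop in write_retry_under_lock stands in for the endless retry in the real charge path; however large max_rounds is, the victim never runs while the lock is held.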
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked Date: Mon, 26 Nov 2012 19:04:44 +0100 Message-ID: <20121126180444.GA12602@dhcp22.suse.cz> References: <20121122233434.3D5E35E6@pobox.sk> <20121123074023.GA24698@dhcp22.suse.cz> <20121123102137.10D6D653@pobox.sk> <20121123100438.GF24698@dhcp22.suse.cz> <20121125011047.7477BB5E@pobox.sk> <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <20121126174622.GE2799@cmpxchg.org> Mime-Version: 1.0 Return-path: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=sender:date:from:to:cc:subject:message-id:references:mime-version :content-type:content-disposition:in-reply-to:user-agent; bh=/hCb3H32V40/MNdI2Q+s3HIPTyXcS+8EzACL1dBKilo=; b=EqL+7vzNmU0KlGJL8FJNK1SW7yIfWmV5crmG2tyjkYHVGqrw0mPhpdUOD/oheZz9qg 7gdQYspkdHmChQ1VeYskkD0vtxjp0yOHN3eOLjHbmLUiWm05L8/HRDDVMx/eoiDpVL+5 v41pXE1zgFsZ7REzSAZs8jn5nrxlvwb1CLh9QlB5RlagZ8wzotPeUm86LwK5sXGDtW9z +5eU3Ib1+uv4d4MHpfh5IlAE+bwuKfP5FDAYcjbeWmUsrtAVtgzVhdpbx8n6PIGe9eht LHgqX9/CjEa1WOpS/2t+WtnuN/kLq5Z8vBYpn1xYaC4lFlRJtbGXFscYNHbyfWzjEnzm iIoA== Content-Disposition: inline In-Reply-To: <20121126174622.GE2799@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Johannes Weiner Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki On Mon 26-11-12 12:46:22, Johannes Weiner wrote: > On Mon, Nov 26, 2012 at 02:18:37PM +0100, Michal Hocko wrote: > > [CCing also Johannes - the thread started here: > > https://lkml.org/lkml/2012/11/21/497] > > > > On Mon 26-11-12 01:38:55, azurIt wrote: > > > >This is hackish but it should help you in this case. Kamezawa, what do > > > >you think about that? 
Should we generalize this and prepare something > > > >like mem_cgroup_cache_charge_locked which would add __GFP_NORETRY > > > >automatically and use the function whenever we are in a locked context? > > > >To be honest I do not like this very much but nothing more sensible > > > >(without touching non-memcg paths) comes to my mind. > > > > > > > > > I installed kernel with this patch, will report back if problem occurs > > > again OR in few weeks if everything will be ok. Thank you! > > > > Now that I am looking at the patch closer it will not work because it > > depends on other patch which is not merged yet and even that one would > > help on its own because __GFP_NORETRY doesn't break the charge loop. > > Sorry I have missed that... > > > > The patch bellow should help though. (it is based on top of the current > > -mm tree but I will send a backport to 3.2 in the reply as well) > > --- > > >From 7796f942d62081ad45726efd90b5292b80e7c690 Mon Sep 17 00:00:00 2001 > > From: Michal Hocko > > Date: Mon, 26 Nov 2012 11:47:57 +0100 > > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked > > > > memcg oom killer might deadlock if the process which falls down to > > mem_cgroup_handle_oom holds a lock which prevents other task to > > terminate because it is blocked on the very same lock. > > This can happen when a write system call needs to allocate a page but > > the allocation hits the memcg hard limit and there is nothing to reclaim > > (e.g. there is no swap or swap limit is hit as well and all cache pages > > have been reclaimed already) and the process selected by memcg OOM > > killer is blocked on i_mutex on the same inode (e.g. truncate it). 
> > > > Process A > > [] do_truncate+0x58/0xa0 # takes i_mutex > > [] do_last+0x250/0xa30 > > [] path_openat+0xd7/0x440 > > [] do_filp_open+0x49/0xa0 > > [] do_sys_open+0x106/0x240 > > [] sys_open+0x20/0x30 > > [] system_call_fastpath+0x18/0x1d > > [] 0xffffffffffffffff > > > > Process B > > [] mem_cgroup_handle_oom+0x241/0x3b0 > > [] T.1146+0x5ab/0x5c0 > > [] mem_cgroup_cache_charge+0xbe/0xe0 > > [] add_to_page_cache_locked+0x4c/0x140 > > [] add_to_page_cache_lru+0x22/0x50 > > [] grab_cache_page_write_begin+0x8b/0xe0 > > [] ext3_write_begin+0x88/0x270 > > [] generic_file_buffered_write+0x116/0x290 > > [] __generic_file_aio_write+0x27c/0x480 > > [] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex > > [] do_sync_write+0xea/0x130 > > [] vfs_write+0xf3/0x1f0 > > [] sys_write+0x51/0x90 > > [] system_call_fastpath+0x18/0x1d > > [] 0xffffffffffffffff > > So process B manages to lock the hierarchy, calls > mem_cgroup_out_of_memory() and retries the charge infinitely, waiting > for task A to die. All while it holds the i_mutex, preventing task A > from dying, right? Right. > I think global oom already handles this in a much better way: invoke > the OOM killer, sleep for a second, then return to userspace to > relinquish all kernel resources and locks. The only reason why we > can't simply change from an endless retry loop is because we don't > want to return VM_FAULT_OOM and invoke the global OOM killer. Exactly. > But maybe we can return a new VM_FAULT_OOM_HANDLED for memcg OOM and > just restart the pagefault. Return -ENOMEM to the buffered IO syscall > respectively. This way, the memcg OOM killer is invoked as it should > but nobody gets stuck anywhere livelocking with the exiting task. Hmm, we would still have a problem with oom disabled (aka user space OOM killer), right? All processes but those in mem_cgroup_handle_oom are risky to be killed. Other POV might be, why we should trigger an OOM killer from those paths in the first place. 
Write or read (or even readahead) are all calls that should rather fail than cause an OOM killer in my opinion. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Johannes Weiner Subject: Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked Date: Mon, 26 Nov 2012 13:24:21 -0500 Message-ID: <20121126182421.GB2301@cmpxchg.org> References: <20121123074023.GA24698@dhcp22.suse.cz> <20121123102137.10D6D653@pobox.sk> <20121123100438.GF24698@dhcp22.suse.cz> <20121125011047.7477BB5E@pobox.sk> <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <20121126174622.GE2799@cmpxchg.org> <20121126180444.GA12602@dhcp22.suse.cz> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20121126180444.GA12602-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Michal Hocko Cc: azurIt , linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist , KAMEZAWA Hiroyuki On Mon, Nov 26, 2012 at 07:04:44PM +0100, Michal Hocko wrote: > On Mon 26-11-12 12:46:22, Johannes Weiner wrote: > > On Mon, Nov 26, 2012 at 02:18:37PM +0100, Michal Hocko wrote: > > > [CCing also Johannes - the thread started here: > > > https://lkml.org/lkml/2012/11/21/497] > > > > > > On Mon 26-11-12 01:38:55, azurIt wrote: > > > > >This is hackish but it should help you in this case. Kamezawa, what do > > > > >you think about that? 
Should we generalize this and prepare something > > > > >like mem_cgroup_cache_charge_locked which would add __GFP_NORETRY > > > > >automatically and use the function whenever we are in a locked context? > > > > >To be honest I do not like this very much but nothing more sensible > > > > >(without touching non-memcg paths) comes to my mind. > > > > > > > > > > > > I installed kernel with this patch, will report back if problem occurs > > > > again OR in few weeks if everything will be ok. Thank you! > > > > > > Now that I am looking at the patch closer it will not work because it > > > depends on another patch which is not merged yet and even that one wouldn't > > > help on its own because __GFP_NORETRY doesn't break the charge loop. > > > Sorry I have missed that... > > > > > > The patch below should help though. (it is based on top of the current > > > -mm tree but I will send a backport to 3.2 in the reply as well) > > > --- > > > From 7796f942d62081ad45726efd90b5292b80e7c690 Mon Sep 17 00:00:00 2001 > > > From: Michal Hocko > > > Date: Mon, 26 Nov 2012 11:47:57 +0100 > > > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked > > > > > > memcg oom killer might deadlock if the process which falls down to > > > mem_cgroup_handle_oom holds a lock which prevents another task from > > > terminating because it is blocked on the very same lock. > > > This can happen when a write system call needs to allocate a page but > > > the allocation hits the memcg hard limit and there is nothing to reclaim > > > (e.g. there is no swap or swap limit is hit as well and all cache pages > > > have been reclaimed already) and the process selected by memcg OOM > > > killer is blocked on i_mutex on the same inode (e.g. truncate it).
> > > > > > Process A > > > [] do_truncate+0x58/0xa0 # takes i_mutex > > > [] do_last+0x250/0xa30 > > > [] path_openat+0xd7/0x440 > > > [] do_filp_open+0x49/0xa0 > > > [] do_sys_open+0x106/0x240 > > > [] sys_open+0x20/0x30 > > > [] system_call_fastpath+0x18/0x1d > > > [] 0xffffffffffffffff > > > > > > Process B > > > [] mem_cgroup_handle_oom+0x241/0x3b0 > > > [] T.1146+0x5ab/0x5c0 > > > [] mem_cgroup_cache_charge+0xbe/0xe0 > > > [] add_to_page_cache_locked+0x4c/0x140 > > > [] add_to_page_cache_lru+0x22/0x50 > > > [] grab_cache_page_write_begin+0x8b/0xe0 > > > [] ext3_write_begin+0x88/0x270 > > > [] generic_file_buffered_write+0x116/0x290 > > > [] __generic_file_aio_write+0x27c/0x480 > > > [] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex > > > [] do_sync_write+0xea/0x130 > > > [] vfs_write+0xf3/0x1f0 > > > [] sys_write+0x51/0x90 > > > [] system_call_fastpath+0x18/0x1d > > > [] 0xffffffffffffffff > > > > So process B manages to lock the hierarchy, calls > > mem_cgroup_out_of_memory() and retries the charge infinitely, waiting > > for task A to die. All while it holds the i_mutex, preventing task A > > from dying, right? > > Right. > > > I think global oom already handles this in a much better way: invoke > > the OOM killer, sleep for a second, then return to userspace to > > relinquish all kernel resources and locks. The only reason why we > > can't simply change from an endless retry loop is because we don't > > want to return VM_FAULT_OOM and invoke the global OOM killer. > > Exactly. > > > But maybe we can return a new VM_FAULT_OOM_HANDLED for memcg OOM and > > just restart the pagefault. Return -ENOMEM to the buffered IO syscall > > respectively. This way, the memcg OOM killer is invoked as it should > > but nobody gets stuck anywhere livelocking with the exiting task. > > Hmm, we would still have a problem with oom disabled (aka user space OOM > killer), right? All processes but those in mem_cgroup_handle_oom are > risky to be killed. 
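The livelock described above, and the bounded retry proposed as a way out of it, can be modelled in a few lines of userspace C. This is only a toy sketch under loud assumptions: struct toy_memcg, toy_charge() and toy_write() are hypothetical names and have nothing to do with the real charge path; the "victim" gives its pages back only once the charging task has dropped the lock; and the endless loop is cut off after a fixed budget of iterations so that the model can report it as -EDEADLK instead of spinning forever.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy model only, not kernel code. Every name here is hypothetical. */
struct toy_memcg {
	long usage;         /* pages currently charged to the group */
	long limit;         /* hard limit in pages */
	long victim_pages;  /* pages uncharged once the OOM victim has exited */
	bool victim_killed; /* the OOM killer has signalled the victim */
	bool i_mutex_held;  /* the charging task holds the lock the victim waits on */
};

/* The victim can exit (and uncharge) only once the lock it sleeps on is free. */
static void toy_victim_try_exit(struct toy_memcg *cg)
{
	if (cg->victim_killed && !cg->i_mutex_held && cg->victim_pages > 0) {
		cg->usage -= cg->victim_pages;
		cg->victim_pages = 0;
	}
}

/*
 * Charge one page. max_retries < 0 models the endless retry loop, capped at
 * `budget` iterations so that the model terminates.
 * Returns 0 on success, -ENOMEM if the bounded loop gave up, or -EDEADLK if
 * the budget ran out without any progress.
 */
int toy_charge(struct toy_memcg *cg, int max_retries, int budget)
{
	assert(budget > 0);

	for (int i = 0; i < budget; i++) {
		if (cg->usage < cg->limit) {
			cg->usage++;
			return 0;
		}
		cg->victim_killed = true; /* "invoke the OOM killer" */
		toy_victim_try_exit(cg);  /* victim runs, but may be stuck on the lock */
		if (max_retries >= 0 && i >= max_retries)
			return -ENOMEM;
	}
	return -EDEADLK;
}

/* A write(2)-like path: take the lock, charge, drop the lock on the way out. */
int toy_write(struct toy_memcg *cg, int max_retries)
{
	int ret;

	cg->i_mutex_held = true;
	ret = toy_charge(cg, max_retries, 1000);
	cg->i_mutex_held = false;
	toy_victim_try_exit(cg); /* only now can a blocked victim make progress */
	return ret;
}
```

With max_retries < 0 the writer burns its whole budget, because the victim never gets the lock. With a small bound the writer sees -ENOMEM, drops the lock on its way back, the victim exits, and a second attempt succeeds.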
Could we still let everybody get stuck in there when the OOM killer is disabled and let userspace take care of it? > Other POV might be, why we should trigger an OOM killer from those paths > in the first place. Write or read (or even readahead) are all calls that > should rather fail than cause an OOM killer in my opinion. Readahead is arguable, but we kill globally for read() and write() and I think we should do the same for memcg. The OOM killer is there to resolve a problem that comes from overcommitting the machine but the overuse does not have to be from the application that pushes the machine over the edge, that's why we don't just kill the allocating task but actually go look for the best candidate. If you have one memory hog that overuses the resources, attempted memory consumption in a different program should invoke the OOM killer. It does not matter if this is a page fault (would still happen with your patch) or a bufferd read/write (would no longer happen). From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked Date: Mon, 26 Nov 2012 20:03:29 +0100 Message-ID: <20121126190329.GB12602@dhcp22.suse.cz> References: <20121123102137.10D6D653@pobox.sk> <20121123100438.GF24698@dhcp22.suse.cz> <20121125011047.7477BB5E@pobox.sk> <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <20121126174622.GE2799@cmpxchg.org> <20121126180444.GA12602@dhcp22.suse.cz> <20121126182421.GB2301@cmpxchg.org> Mime-Version: 1.0 Return-path: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=sender:date:from:to:cc:subject:message-id:references:mime-version :content-type:content-disposition:in-reply-to:user-agent; bh=/A6/rEMqtOOSoMbgZWoa0YltZclrJhJ5s4qndyqp9nA=; b=ENi6n81NrxaAkV5dJL48FHzAKd62QR3x2Laa/d/mkewwm/POHwW5bCzqswkxRfDjlD 
vaoqXYctSjCbNvgub/hYaYxtiYzwIyXMzuyvYK3ReS7Yg5vp+pnqLsQTmFpX6tXEbWQF +sgs9C5lO2xDC4g7hnxezQhnZ2wH6aUhoi0FqiR8OJA7I8vKD117VSeVvUBRiMl9fdUc ZH9DOUYMYDWvmmj+XIVXNn37gJGAs1BSIDF8mAyQeVpqhAI5cXS8CxBDP2gvTR2KTzyU 33C+4T/Q+Ji7Kj0LfuiDry7EUdQwfighNQfEYZKMHMExTzZZJB//3ZuSwEfRp7NXRjuR 0PeQ== Content-Disposition: inline In-Reply-To: <20121126182421.GB2301-druUgvl0LCNAfugRpC6u6w@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Johannes Weiner Cc: azurIt , linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist , KAMEZAWA Hiroyuki On Mon 26-11-12 13:24:21, Johannes Weiner wrote: > On Mon, Nov 26, 2012 at 07:04:44PM +0100, Michal Hocko wrote: > > On Mon 26-11-12 12:46:22, Johannes Weiner wrote: [...] > > > I think global oom already handles this in a much better way: invoke > > > the OOM killer, sleep for a second, then return to userspace to > > > relinquish all kernel resources and locks. The only reason why we > > > can't simply change from an endless retry loop is because we don't > > > want to return VM_FAULT_OOM and invoke the global OOM killer. > > > > Exactly. > > > > > But maybe we can return a new VM_FAULT_OOM_HANDLED for memcg OOM and > > > just restart the pagefault. Return -ENOMEM to the buffered IO syscall > > > respectively. This way, the memcg OOM killer is invoked as it should > > > but nobody gets stuck anywhere livelocking with the exiting task. > > > > Hmm, we would still have a problem with oom disabled (aka user space OOM > > killer), right? All processes but those in mem_cgroup_handle_oom are > > risky to be killed. > > Could we still let everybody get stuck in there when the OOM killer is > disabled and let userspace take care of it? 
I am not sure what exactly you mean by "userspace take care of it" but if those processes are stuck and holding the lock then it is usually hard to find that out. Well, if somebody is familiar with the internals then it is doable, but this makes the interface really unusable for regular usage. > > Other POV might be, why we should trigger an OOM killer from those paths > > in the first place. Write or read (or even readahead) are all calls that > > should rather fail than cause an OOM killer in my opinion. > > Readahead is arguable, but we kill globally for read() and write() and > I think we should do the same for memcg. Fair point, but the global case is a little bit easier than memcg in this case because nobody can hook into the OOM killer and provide a userspace implementation for it, which is one of the cooler features of memcg... I am all open to any suggestions but we should somehow fix this (and backport it to stable trees as this has been there for quite some time. The current report shows that the problem is not that hard to trigger). > The OOM killer is there to resolve a problem that comes from > overcommitting the machine but the overuse does not have to be from > the application that pushes the machine over the edge, that's why we > don't just kill the allocating task but actually go look for the best > candidate. If you have one memory hog that overuses the resources, > attempted memory consumption in a different program should invoke the > OOM killer. > It does not matter if this is a page fault (would still happen with > your patch) or a bufferd read/write (would no longer happen). True, and it is sad that mmap then behaves slightly differently than read/write, which I should have mentioned in the changelog. As I said I am open to other suggestions.
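For reference, the userspace OOM handling mentioned above is built on the cgroup v1 notification interface: an eventfd is registered for memory.oom_control by writing "<event_fd> <control_fd>" to cgroup.event_control in the same directory. A minimal sketch of that registration follows, assuming the memory controller is mounted and cgroup_dir points at the group's directory. The struct and function names are hypothetical, and nothing here says how a handler should react once it is woken up.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical holder for the descriptors the caller keeps open. */
struct oom_notifier {
	int event_fd;   /* read() 8 bytes from this to wait for an OOM event */
	int control_fd; /* open descriptor of memory.oom_control */
};

/* Build the "<event_fd> <control_fd>" line written to cgroup.event_control. */
int oom_format_registration(char *buf, size_t len, int event_fd, int control_fd)
{
	assert(buf != NULL && len > 0);

	int n = snprintf(buf, len, "%d %d", event_fd, control_fd);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Returns 0 and fills *n on success, -1 on failure (nothing is left open). */
int oom_notifier_open(const char *cgroup_dir, struct oom_notifier *n)
{
	char path[4096], line[64];
	int ecfd = -1, len;

	n->event_fd = n->control_fd = -1;

	len = snprintf(path, sizeof(path), "%s/memory.oom_control", cgroup_dir);
	if (len < 0 || (size_t)len >= sizeof(path))
		return -1;
	n->control_fd = open(path, O_RDONLY);
	if (n->control_fd < 0)
		goto fail;

	n->event_fd = eventfd(0, 0);
	if (n->event_fd < 0)
		goto fail;

	len = snprintf(path, sizeof(path), "%s/cgroup.event_control", cgroup_dir);
	if (len < 0 || (size_t)len >= sizeof(path))
		goto fail;
	ecfd = open(path, O_WRONLY);
	if (ecfd < 0)
		goto fail;

	if (oom_format_registration(line, sizeof(line),
				    n->event_fd, n->control_fd) < 0 ||
	    write(ecfd, line, strlen(line)) < 0) {
		close(ecfd);
		goto fail;
	}
	close(ecfd);
	return 0;

fail:
	if (n->event_fd >= 0)
		close(n->event_fd);
	if (n->control_fd >= 0)
		close(n->control_fd);
	n->event_fd = n->control_fd = -1;
	return -1;
}
```

A handler would then block in read() on event_fd until the group hits its limit. Writing 1 to memory.oom_control is what disables the in-kernel killer for the group, which is the "oom disabled" case discussed in this thread.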
Thanks -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: Johannes Weiner Subject: Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked Date: Mon, 26 Nov 2012 14:29:41 -0500 Message-ID: <20121126192941.GC2301@cmpxchg.org> References: <20121123100438.GF24698@dhcp22.suse.cz> <20121125011047.7477BB5E@pobox.sk> <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <20121126174622.GE2799@cmpxchg.org> <20121126180444.GA12602@dhcp22.suse.cz> <20121126182421.GB2301@cmpxchg.org> <20121126190329.GB12602@dhcp22.suse.cz> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20121126190329.GB12602-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Michal Hocko Cc: azurIt , linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist , KAMEZAWA Hiroyuki On Mon, Nov 26, 2012 at 08:03:29PM +0100, Michal Hocko wrote: > On Mon 26-11-12 13:24:21, Johannes Weiner wrote: > > On Mon, Nov 26, 2012 at 07:04:44PM +0100, Michal Hocko wrote: > > > On Mon 26-11-12 12:46:22, Johannes Weiner wrote: > [...] > > > > I think global oom already handles this in a much better way: invoke > > > > the OOM killer, sleep for a second, then return to userspace to > > > > relinquish all kernel resources and locks. The only reason why we > > > > can't simply change from an endless retry loop is because we don't > > > > want to return VM_FAULT_OOM and invoke the global OOM killer. > > > > > > Exactly. > > > > > > > But maybe we can return a new VM_FAULT_OOM_HANDLED for memcg OOM and > > > > just restart the pagefault. Return -ENOMEM to the buffered IO syscall > > > > respectively. 
This way, the memcg OOM killer is invoked as it should > > > > but nobody gets stuck anywhere livelocking with the exiting task. > > > > > > Hmm, we would still have a problem with oom disabled (aka user space OOM > > > killer), right? All processes but those in mem_cgroup_handle_oom are > > > risky to be killed. > > > > Could we still let everybody get stuck in there when the OOM killer is > > disabled and let userspace take care of it? > > I am not sure what exactly you mean by "userspace take care of it" but > if those processes are stuck and holding the lock then it is usually > hard to find that out. Well if somebody is familiar with internal then > it is doable but this makes the interface really unusable for regular > usage. If oom_kill_disable is set, then all processes get stuck all the way down in the charge stack. Whatever resource they pin, you may deadlock on if you try to touch it while handling the problem from userspace. I don't see how this is a new problem...? Or do you mean something else? > > > Other POV might be, why we should trigger an OOM killer from those paths > > > in the first place. Write or read (or even readahead) are all calls that > > > should rather fail than cause an OOM killer in my opinion. > > > > Readahead is arguable, but we kill globally for read() and write() and > > I think we should do the same for memcg. > > Fair point but the global case is little bit easier than memcg in this > case because nobody can hook on OOM killer and provide a userspace > implementation for it which is one of the cooler feature of memcg... > I am all open to any suggestions but we should somehow fix this (and > backport it to stable trees as this is there for quite some time. The > current report shows that the problem is not that hard to trigger). As per above, the userspace OOM handling is risky as hell anyway. What happens when an anonymous fault waits in memcg userspace OOM while holding the mmap_sem, and a writer lines up behind it? 
Your userspace OOM handler had better not look at any of the /proc files of the stuck task that require the mmap_sem. At the same token, it probably shouldn't touch the same files a memcg task is stuck trying to read/write. From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked Date: Mon, 26 Nov 2012 21:08:48 +0100 Message-ID: <20121126200848.GC12602@dhcp22.suse.cz> References: <20121125011047.7477BB5E@pobox.sk> <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <20121126174622.GE2799@cmpxchg.org> <20121126180444.GA12602@dhcp22.suse.cz> <20121126182421.GB2301@cmpxchg.org> <20121126190329.GB12602@dhcp22.suse.cz> <20121126192941.GC2301@cmpxchg.org> Mime-Version: 1.0 Return-path: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=sender:date:from:to:cc:subject:message-id:references:mime-version :content-type:content-disposition:in-reply-to:user-agent; bh=SR/K7xFbd623Kr5XarbLBL4cdJZmG4PHcVZCqWVPhxA=; b=zVEthVSL13n+HVPKON+016d4lyoPGOdtpD9DnhpNOxFHTaQCfUlkjVyEL4sCmM6nq+ QGB7MEaJfnGeIOLP1d2VkWDIsZ92Pc7aDHDc/w3UnQq6tHFFETKGrkCL1aKXrCMoa0Gd 6+H8pavYdardy3eL9ZBkRDkYZT8+htJKZqwh7JZi8qnyVmikor+BmXccbgSJ46i3yKRa oq35BdrGrukVstpERQJBp5dmMc/hF+IlWE+VoVUjhzOCwqhrF48n3Z8DzkO8cnOYqzK7 YVL9jkiVFhpCLSayTZ1UFyPaCqvJpPD6wOXxgJoJw3O4ZxiUz7/wD53+0EdN7xLNTSHe Briw== Content-Disposition: inline In-Reply-To: <20121126192941.GC2301-druUgvl0LCNAfugRpC6u6w@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Johannes Weiner Cc: azurIt , linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist , KAMEZAWA Hiroyuki On Mon 26-11-12 14:29:41, Johannes Weiner wrote: > On Mon, Nov 26, 2012 at 08:03:29PM 
+0100, Michal Hocko wrote: > > On Mon 26-11-12 13:24:21, Johannes Weiner wrote: > > > On Mon, Nov 26, 2012 at 07:04:44PM +0100, Michal Hocko wrote: > > > > On Mon 26-11-12 12:46:22, Johannes Weiner wrote: > > [...] > > > > > I think global oom already handles this in a much better way: invoke > > > > > the OOM killer, sleep for a second, then return to userspace to > > > > > relinquish all kernel resources and locks. The only reason why we > > > > > can't simply change from an endless retry loop is because we don't > > > > > want to return VM_FAULT_OOM and invoke the global OOM killer. > > > > > > > > Exactly. > > > > > > > > > But maybe we can return a new VM_FAULT_OOM_HANDLED for memcg OOM and > > > > > just restart the pagefault. Return -ENOMEM to the buffered IO syscall > > > > > respectively. This way, the memcg OOM killer is invoked as it should > > > > > but nobody gets stuck anywhere livelocking with the exiting task. > > > > > > > > Hmm, we would still have a problem with oom disabled (aka user space OOM > > > > killer), right? All processes but those in mem_cgroup_handle_oom are > > > > risky to be killed. > > > > > > Could we still let everybody get stuck in there when the OOM killer is > > > disabled and let userspace take care of it? > > > > I am not sure what exactly you mean by "userspace take care of it" but > > if those processes are stuck and holding the lock then it is usually > > hard to find that out. Well if somebody is familiar with internal then > > it is doable but this makes the interface really unusable for regular > > usage. > > If oom_kill_disable is set, then all processes get stuck all the way > down in the charge stack. Whatever resource they pin, you may > deadlock on if you try to touch it while handling the problem from > userspace. OK, I guess I am getting what you are trying to say. 
So what you are suggesting is to just let mem_cgroup_out_of_memory send the signal and move on without retry (or with few charge retries without further OOM killing) and fail the charge with your new FAULT_OOM_HANDLED (resp. something like FAULT_RETRY) error code resp. ENOMEM depending on the caller. OOM disabled case would be "you are on your own" because this has been dangerous anyway. Correct? I do agree that the current endless retry loop is far from being ideal and can see some updates but I am quite nervous about any potential regressions in this area (e.g. too aggressive OOM etc...). I have to think about it some more. Anyway if you have some more specific ideas I would be happy to review patches. [...] Thanks -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: Johannes Weiner Subject: Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked Date: Mon, 26 Nov 2012 15:19:18 -0500 Message-ID: <20121126201918.GD2301@cmpxchg.org> References: <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <20121126174622.GE2799@cmpxchg.org> <20121126180444.GA12602@dhcp22.suse.cz> <20121126182421.GB2301@cmpxchg.org> <20121126190329.GB12602@dhcp22.suse.cz> <20121126192941.GC2301@cmpxchg.org> <20121126200848.GC12602@dhcp22.suse.cz> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20121126200848.GC12602@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Michal Hocko Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki On Mon, Nov 26, 2012 at 09:08:48PM +0100, Michal Hocko wrote: > On Mon 26-11-12 14:29:41, Johannes Weiner wrote: > > On Mon, Nov 26, 2012 at 08:03:29PM +0100, Michal Hocko wrote: > > > On Mon 26-11-12 13:24:21, Johannes Weiner wrote: > > > > On Mon, Nov 26, 2012 at 
07:04:44PM +0100, Michal Hocko wrote: > > > > > On Mon 26-11-12 12:46:22, Johannes Weiner wrote: > > > [...] > > > > > > I think global oom already handles this in a much better way: invoke > > > > > > the OOM killer, sleep for a second, then return to userspace to > > > > > > relinquish all kernel resources and locks. The only reason why we > > > > > > can't simply change from an endless retry loop is because we don't > > > > > > want to return VM_FAULT_OOM and invoke the global OOM killer. > > > > > > > > > > Exactly. > > > > > > > > > > > But maybe we can return a new VM_FAULT_OOM_HANDLED for memcg OOM and > > > > > > just restart the pagefault. Return -ENOMEM to the buffered IO syscall > > > > > > respectively. This way, the memcg OOM killer is invoked as it should > > > > > > but nobody gets stuck anywhere livelocking with the exiting task. > > > > > > > > > > Hmm, we would still have a problem with oom disabled (aka user space OOM > > > > > killer), right? All processes but those in mem_cgroup_handle_oom are > > > > > risky to be killed. > > > > > > > > Could we still let everybody get stuck in there when the OOM killer is > > > > disabled and let userspace take care of it? > > > > > > I am not sure what exactly you mean by "userspace take care of it" but > > > if those processes are stuck and holding the lock then it is usually > > > hard to find that out. Well if somebody is familiar with internal then > > > it is doable but this makes the interface really unusable for regular > > > usage. > > > > If oom_kill_disable is set, then all processes get stuck all the way > > down in the charge stack. Whatever resource they pin, you may > > deadlock on if you try to touch it while handling the problem from > > userspace. > > OK, I guess I am getting what you are trying to say. 
So what you are > suggesting is to just let mem_cgroup_out_of_memory send the signal and > move on without retry (or with few charge retries without further OOM > killing) and fail the charge with your new FAULT_OOM_HANDLED (resp. > something like FAULT_RETRY) error code resp. ENOMEM depending on the > caller. OOM disabled case would be "you are on your own" because this > has been dangerous anyway. Correct? Yes. > I do agree that the current endless retry loop is far from being ideal > and can see some updates but I am quite nervous about any potential > regressions in this area (e.g. too aggressive OOM etc...). I have to > think about it some more. Agreed on all points. Maybe we can keep a couple of the oom retry iterations or something like that, which is still much more than what global does and I don't think the global OOM killer is overly eager. Testing will show more. > Anyway if you have some more specific ideas I would be happy to review > patches. Okay, I just wanted to check back with you before going down this path. What are we going to do short term, though? Do you want to push the disable-oom-for-pagecache for now or should we put the VM_FAULT_OOM_HANDLED fix in the next version and do stable backports? This issue has been around for a while so frankly I don't think it's urgent enough to rush things. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: "azurIt" Subject: Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked Date: Mon, 26 Nov 2012 21:46:38 +0100 Message-ID: <20121126214638.64723F01@pobox.sk> References: <20121125120524.GB10623@dhcp22.suse.cz>, <20121125135542.GE10623@dhcp22.suse.cz>, <20121126013855.AF118F5E@pobox.sk>, <20121126131837.GC17860@dhcp22.suse.cz>, <20121126174622.GE2799@cmpxchg.org>, <20121126180444.GA12602@dhcp22.suse.cz>, <20121126182421.GB2301@cmpxchg.org>, <20121126190329.GB12602@dhcp22.suse.cz>, <20121126192941.GC2301@cmpxchg.org>, <20121126200848.GC12602@dhcp22.suse.cz> <20121126201918.GD2301@cmpxchg.org> Mime-Version: 1.0 Content-Transfer-Encoding: 8bit Return-path: In-Reply-To: <20121126201918.GD2301-druUgvl0LCNAfugRpC6u6w@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" To: Johannes Weiner , Michal Hocko Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist , KAMEZAWA Hiroyuki >This issue has been around for a while so frankly I don't think it's >urgent enough to rush things. Well, it's quite urgent at least for us :( I hadn't reported this so far because I wasn't sure it's a kernel thing. I will be really happy and thankful if a fix for this can go to 3.2 in the near future. Thank you very much!
azur From mboxrd@z Thu Jan 1 00:00:00 1970 From: Johannes Weiner Subject: Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked Date: Mon, 26 Nov 2012 15:53:49 -0500 Message-ID: <20121126205349.GE2301@cmpxchg.org> References: <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <20121126174622.GE2799@cmpxchg.org> <20121126180444.GA12602@dhcp22.suse.cz> <20121126182421.GB2301@cmpxchg.org> <20121126190329.GB12602@dhcp22.suse.cz> <20121126192941.GC2301@cmpxchg.org> <20121126200848.GC12602@dhcp22.suse.cz> <20121126201918.GD2301@cmpxchg.org> <20121126214638.64723F01@pobox.sk> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20121126214638.64723F01@pobox.sk> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: azurIt Cc: Michal Hocko , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki On Mon, Nov 26, 2012 at 09:46:38PM +0100, azurIt wrote: > >This issue has been around for a while so frankly I don't think it's > >urgent enough to rush things. > > > Well, it's quite urgent at least for us :( i wasn't reported this so > far cos i wasn't sure it's a kernel thing. I will be really happy > and thankfull if fix for this can go to 3.2 in some near > future.. Thank you very much! I understand and of course it's important that we get it fixed as soon as possible. All I meant was that this problem has not exactly been introduced in 3.7 and the fix is non-trivial so we should not be rushing a change like this into 3.7 just days before its release. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: "azurIt" Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Date: Mon, 26 Nov 2012 22:28:26 +0100 Message-ID: <20121126222826.3843D563@pobox.sk> References: <20121122214249.GA20319@dhcp22.suse.cz>, <20121122233434.3D5E35E6@pobox.sk>, <20121123074023.GA24698@dhcp22.suse.cz>, <20121123102137.10D6D653@pobox.sk>, <20121123100438.GF24698@dhcp22.suse.cz>, <20121125011047.7477BB5E@pobox.sk>, <20121125120524.GB10623@dhcp22.suse.cz>, <20121125135542.GE10623@dhcp22.suse.cz>, <20121126013855.AF118F5E@pobox.sk>, <20121126131837.GC17860@dhcp22.suse.cz> <20121126132149.GD17860@dhcp22.suse.cz> Mime-Version: 1.0 Content-Transfer-Encoding: quoted-printable Return-path: In-Reply-To: <20121126132149.GD17860@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" To: Michal Hocko Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner >Here we go with the patch for 3.2.34. Could you test with this one, >please? Michal, regarding your conversation with Johannes Weiner, should I try this patch or not? azur -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked Date: Mon, 26 Nov 2012 23:06:40 +0100 Message-ID: <20121126220640.GE12602@dhcp22.suse.cz> References: <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <20121126174622.GE2799@cmpxchg.org> <20121126180444.GA12602@dhcp22.suse.cz> <20121126182421.GB2301@cmpxchg.org> <20121126190329.GB12602@dhcp22.suse.cz> <20121126192941.GC2301@cmpxchg.org> <20121126200848.GC12602@dhcp22.suse.cz> <20121126201918.GD2301@cmpxchg.org> Mime-Version: 1.0 Return-path: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=sender:date:from:to:cc:subject:message-id:references:mime-version :content-type:content-disposition:in-reply-to:user-agent; bh=PKevxCl841nMi/pxgwaHCs0jGC13mxxxijhWgnVsgt8=; b=AerleamJ5sDhqvJkSPAlPPaAWqP6JQLVb6KMS5AJ/QlmugrBSiHGvpf/rSlP/ek2Yw 4kCxg+Zn464uEBIo17mWXaXbe0baVl/wSQAH+gVyue6BvC0AlNLm9348RkG+M5XPT+8b ZH8pzIM1OPbC6pCiAwN/cDbAEGNCwuH4JlnmzDpkumGbGsLJCZ8+nXoiURxtfT/pf5b6 RZdgY4vL9WIvdaNv50KXeO2i146PeGQE+Ec9RENsuMz/WJ0lG7hxFMGk2CqsxRjc/T+t npepi3UDQ4/vww9AG0dzoGGCQDKfg/TDuZZxVGWz/IA2ocouYA9vD32/qxzc7d8df7uA sMSw== Content-Disposition: inline In-Reply-To: <20121126201918.GD2301-druUgvl0LCNAfugRpC6u6w@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Johannes Weiner Cc: azurIt , linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist , KAMEZAWA Hiroyuki On Mon 26-11-12 15:19:18, Johannes Weiner wrote: > On Mon, Nov 26, 2012 at 09:08:48PM +0100, Michal Hocko wrote: [...] > > OK, I guess I am getting what you are trying to say. 
So what you are > > suggesting is to just let mem_cgroup_out_of_memory send the signal and > > move on without retry (or with few charge retries without further OOM > > killing) and fail the charge with your new FAULT_OOM_HANDLED (resp. > > something like FAULT_RETRY) error code resp. ENOMEM depending on the > > caller. OOM disabled case would be "you are on your own" because this > > has been dangerous anyway. Correct? > > Yes. > > > I do agree that the current endless retry loop is far from being ideal > > and can see some updates but I am quite nervous about any potential > > regressions in this area (e.g. too aggressive OOM etc...). I have to > > think about it some more. > > Agreed on all points. Maybe we can keep a couple of the oom retry > iterations or something like that, which is still much more than what > global does and I don't think the global OOM killer is overly eager. Yes, we can offer less blood and more comfort. > > Testing will show more. > > > Anyway if you have some more specific ideas I would be happy to review > > patches. > > Okay, I just wanted to check back with you before going down this > path. What are we going to do short term, though? Do you want to > push the disable-oom-for-pagecache for now or should we put the > VM_FAULT_OOM_HANDLED fix in the next version and do stable backports? > > This issue has been around for a while so frankly I don't think it's > urgent enough to rush things. Yes, but now we have a real use case where this hurts AFAIU. Unless we come up with a fix/reasonable workaround I would rather go with something simpler for starters and more sophisticated later. I have to double check other places where we do charging but the last time I checked we don't hold page locks on already visible pages (we do precharge in __do_fault, e.g.), mmap_sem for reading in the page fault path is also safe (with oom enabled) and I guess that tmpfs is ok as well. Then we have a page cache and that one should be covered by my patch.
So we should be covered. But I like your idea long term. Thanks! -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: Kamezawa Hiroyuki Subject: Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked Date: Tue, 27 Nov 2012 09:05:30 +0900 Message-ID: <50B403CA.501@jp.fujitsu.com> References: <20121122190526.390C7A28@pobox.sk> <20121122214249.GA20319@dhcp22.suse.cz> <20121122233434.3D5E35E6@pobox.sk> <20121123074023.GA24698@dhcp22.suse.cz> <20121123102137.10D6D653@pobox.sk> <20121123100438.GF24698@dhcp22.suse.cz> <20121125011047.7477BB5E@pobox.sk> <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <20121126131837.GC17860-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii"; format="flowed" To: Michal Hocko Cc: azurIt , linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist , Johannes Weiner (2012/11/26 22:18), Michal Hocko wrote: > [CCing also Johannes - the thread started here: > https://lkml.org/lkml/2012/11/21/497] > > On Mon 26-11-12 01:38:55, azurIt wrote: >>> This is hackish but it should help you in this case. Kamezawa, what do >>> you think about that? Should we generalize this and prepare something >>> like mem_cgroup_cache_charge_locked which would add __GFP_NORETRY >>> automatically and use the function whenever we are in a locked context? >>> To be honest I do not like this very much but nothing more sensible >>> (without touching non-memcg paths) comes to my mind. >> >> >> I installed kernel with this patch, will report back if problem occurs >> again OR in few weeks if everything will be ok. Thank you! 
> > Now that I am looking at the patch closer it will not work because it > depends on another patch which is not merged yet and even that one would not > help on its own because __GFP_NORETRY doesn't break the charge loop. > Sorry I have missed that... > > The patch below should help though. (it is based on top of the current > -mm tree but I will send a backport to 3.2 in the reply as well) > --- > From 7796f942d62081ad45726efd90b5292b80e7c690 Mon Sep 17 00:00:00 2001 > From: Michal Hocko > Date: Mon, 26 Nov 2012 11:47:57 +0100 > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked > > memcg oom killer might deadlock if the process which falls down to > mem_cgroup_handle_oom holds a lock which prevents another task from > terminating because it is blocked on the very same lock. > This can happen when a write system call needs to allocate a page but > the allocation hits the memcg hard limit and there is nothing to reclaim > (e.g. there is no swap or swap limit is hit as well and all cache pages > have been reclaimed already) and the process selected by memcg OOM > killer is blocked on i_mutex on the same inode (e.g. truncate it).
> > Process A > [] do_truncate+0x58/0xa0 # takes i_mutex > [] do_last+0x250/0xa30 > [] path_openat+0xd7/0x440 > [] do_filp_open+0x49/0xa0 > [] do_sys_open+0x106/0x240 > [] sys_open+0x20/0x30 > [] system_call_fastpath+0x18/0x1d > [] 0xffffffffffffffff > > Process B > [] mem_cgroup_handle_oom+0x241/0x3b0 > [] T.1146+0x5ab/0x5c0 > [] mem_cgroup_cache_charge+0xbe/0xe0 > [] add_to_page_cache_locked+0x4c/0x140 > [] add_to_page_cache_lru+0x22/0x50 > [] grab_cache_page_write_begin+0x8b/0xe0 > [] ext3_write_begin+0x88/0x270 > [] generic_file_buffered_write+0x116/0x290 > [] __generic_file_aio_write+0x27c/0x480 > [] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex > [] do_sync_write+0xea/0x130 > [] vfs_write+0xf3/0x1f0 > [] sys_write+0x51/0x90 > [] system_call_fastpath+0x18/0x1d > [] 0xffffffffffffffff > > This is not a hard deadlock though because the administrator can still > intervene and increase the limit on the group which helps the writer to > finish the allocation and release the lock. > > This patch heals the problem by forbidding OOM from page cache charges > (namely add_to_page_cache_locked). mem_cgroup_cache_charge_no_oom helper > function is defined which adds GFP_MEMCG_NO_OOM to the gfp mask which > then tells mem_cgroup_charge_common that OOM is not allowed for the > charge. Forbidding OOM from this path, apart from fixing the bug, also makes some > sense as we really do not want to cause an OOM because of a page cache > usage. > As a possibly visible result add_to_page_cache_lru might fail more often > with ENOMEM but this is to be expected if the limit is set and it is > preferable to the OOM killer IMO. > > __GFP_NORETRY is abused for this memcg specific flag because it has been > used to prevent from OOM already (since not-merged-yet "memcg: reclaim > when more than one page needed"). The only difference is that the flag > doesn't prevent from reclaim anymore which kind of makes sense because > the global memory allocator triggers reclaim as well.
The retry without > any reclaim on __GFP_NORETRY doesn't make much sense anyway because this > is effectively a busy loop with allowed OOM in this path. > > Reported-by: azurIt > Signed-off-by: Michal Hocko As a short term fix, I think this patch will work enough and seems simple enough. Acked-by: KAMEZAWA Hiroyuki Reading discussion between you and Johannes, to release locks, I understand the memcg need to return "RETRY" for a long term fix. Thinking a little, it will be simple to return "RETRY" to all processes waited on oom kill queue of a memcg and it can be done by a small fixes to memory.c. Thank you. -Kame > --- > include/linux/gfp.h | 3 +++ > include/linux/memcontrol.h | 12 ++++++++++++ > mm/filemap.c | 8 +++++++- > mm/memcontrol.c | 5 +---- > 4 files changed, 23 insertions(+), 5 deletions(-) > > diff --git a/include/linux/gfp.h b/include/linux/gfp.h > index 10e667f..aac9b21 100644 > --- a/include/linux/gfp.h > +++ b/include/linux/gfp.h > @@ -152,6 +152,9 @@ struct vm_area_struct; > /* 4GB DMA on some platforms */ > #define GFP_DMA32 __GFP_DMA32 > > +/* memcg oom killer is not allowed */ > +#define GFP_MEMCG_NO_OOM __GFP_NORETRY > + > /* Convert GFP flags to their corresponding migrate type */ > static inline int allocflags_to_migratetype(gfp_t gfp_flags) > { > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 095d2b4..1ad4bc6 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -65,6 +65,12 @@ extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg); > extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, > gfp_t gfp_mask); > > +static inline int mem_cgroup_cache_charge_no_oom(struct page *page, > + struct mm_struct *mm, gfp_t gfp_mask) > +{ > + return mem_cgroup_cache_charge(page, mm, gfp_mask | GFP_MEMCG_NO_OOM); > +} > + > struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *); > struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone 
*); > > @@ -215,6 +221,12 @@ static inline int mem_cgroup_cache_charge(struct page *page, > return 0; > } > > +static inline int mem_cgroup_cache_charge_no_oom(struct page *page, > + struct mm_struct *mm, gfp_t gfp_mask) > +{ > + return 0; > +} > + > static inline int mem_cgroup_try_charge_swapin(struct mm_struct *mm, > struct page *page, gfp_t gfp_mask, struct mem_cgroup **memcgp) > { > diff --git a/mm/filemap.c b/mm/filemap.c > index 83efee7..ef14351 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -447,7 +447,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, > VM_BUG_ON(!PageLocked(page)); > VM_BUG_ON(PageSwapBacked(page)); > > - error = mem_cgroup_cache_charge(page, current->mm, > + /* > + * Cannot trigger OOM even if gfp_mask would allow that normally > + * because we might be called from a locked context and that > + * could lead to deadlocks if the killed process is waiting for > + * the same lock. > + */ > + error = mem_cgroup_cache_charge_no_oom(page, current->mm, > gfp_mask & GFP_RECLAIM_MASK); > if (error) > goto out; > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 02ee2f7..b4754ba 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -2430,9 +2430,6 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, > if (!(gfp_mask & __GFP_WAIT)) > return CHARGE_WOULDBLOCK; > > - if (gfp_mask & __GFP_NORETRY) > - return CHARGE_NOMEM; > - > ret = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags); > if (mem_cgroup_margin(mem_over_limit) >= nr_pages) > return CHARGE_RETRY; > @@ -3713,7 +3710,7 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm, > { > struct mem_cgroup *memcg = NULL; > unsigned int nr_pages = 1; > - bool oom = true; > + bool oom = !(gfp_mask & GFP_MEMCG_NO_OOM); > int ret; > > if (PageTransHuge(page)) { > From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked Date:
Tue, 27 Nov 2012 10:54:52 +0100 Message-ID: <20121127095452.GD20537@dhcp22.suse.cz> References: <20121122233434.3D5E35E6@pobox.sk> <20121123074023.GA24698@dhcp22.suse.cz> <20121123102137.10D6D653@pobox.sk> <20121123100438.GF24698@dhcp22.suse.cz> <20121125011047.7477BB5E@pobox.sk> <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <50B403CA.501@jp.fujitsu.com> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <50B403CA.501-+CUm20s59erQFUHtdCDX3A@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Kamezawa Hiroyuki Cc: azurIt , linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist , Johannes Weiner On Tue 27-11-12 09:05:30, KAMEZAWA Hiroyuki wrote: [...] > As a short term fix, I think this patch will work enough and seems simple enough. > Acked-by: KAMEZAWA Hiroyuki Thanks! If Johannes is also ok with this for now I will resubmit the patch to Andrew after I hear back from the reporter. > Reading discussion between you and Johannes, to release locks, I understand > the memcg need to return "RETRY" for a long term fix. Thinking a little, > it will be simple to return "RETRY" to all processes waited on oom kill queue > of a memcg and it can be done by a small fixes to memory.c. I wouldn't call it simple but it is doable. 
-- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: Johannes Weiner Subject: Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked Date: Tue, 27 Nov 2012 14:48:13 -0500 Message-ID: <20121127194813.GP24381@cmpxchg.org> References: <20121122233434.3D5E35E6@pobox.sk> <20121123074023.GA24698@dhcp22.suse.cz> <20121123102137.10D6D653@pobox.sk> <20121123100438.GF24698@dhcp22.suse.cz> <20121125011047.7477BB5E@pobox.sk> <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <50B403CA.501@jp.fujitsu.com> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <50B403CA.501-+CUm20s59erQFUHtdCDX3A@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Kamezawa Hiroyuki Cc: Michal Hocko , azurIt , linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist On Tue, Nov 27, 2012 at 09:05:30AM +0900, Kamezawa Hiroyuki wrote: > (2012/11/26 22:18), Michal Hocko wrote: > >[CCing also Johannes - the thread started here: > >https://lkml.org/lkml/2012/11/21/497] > > > >On Mon 26-11-12 01:38:55, azurIt wrote: > >>>This is hackish but it should help you in this case. Kamezawa, what do > >>>you think about that? Should we generalize this and prepare something > >>>like mem_cgroup_cache_charge_locked which would add __GFP_NORETRY > >>>automatically and use the function whenever we are in a locked context? > >>>To be honest I do not like this very much but nothing more sensible > >>>(without touching non-memcg paths) comes to my mind. > >> > >> > >>I installed kernel with this patch, will report back if problem occurs > >>again OR in few weeks if everything will be ok. Thank you! 
> > > >Now that I am looking at the patch closer it will not work because it > >depends on another patch which is not merged yet and even that one would not > >help on its own because __GFP_NORETRY doesn't break the charge loop. > >Sorry I have missed that... > > > >The patch below should help though. (it is based on top of the current > >-mm tree but I will send a backport to 3.2 in the reply as well) > >--- > > From 7796f942d62081ad45726efd90b5292b80e7c690 Mon Sep 17 00:00:00 2001 > >From: Michal Hocko > >Date: Mon, 26 Nov 2012 11:47:57 +0100 > >Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked > > > >memcg oom killer might deadlock if the process which falls down to > >mem_cgroup_handle_oom holds a lock which prevents another task from > >terminating because it is blocked on the very same lock. > >This can happen when a write system call needs to allocate a page but > >the allocation hits the memcg hard limit and there is nothing to reclaim > >(e.g. there is no swap or swap limit is hit as well and all cache pages > >have been reclaimed already) and the process selected by memcg OOM > >killer is blocked on i_mutex on the same inode (e.g. truncate it).
> > > >Process A > >[] do_truncate+0x58/0xa0 # takes i_mutex > >[] do_last+0x250/0xa30 > >[] path_openat+0xd7/0x440 > >[] do_filp_open+0x49/0xa0 > >[] do_sys_open+0x106/0x240 > >[] sys_open+0x20/0x30 > >[] system_call_fastpath+0x18/0x1d > >[] 0xffffffffffffffff > > > >Process B > >[] mem_cgroup_handle_oom+0x241/0x3b0 > >[] T.1146+0x5ab/0x5c0 > >[] mem_cgroup_cache_charge+0xbe/0xe0 > >[] add_to_page_cache_locked+0x4c/0x140 > >[] add_to_page_cache_lru+0x22/0x50 > >[] grab_cache_page_write_begin+0x8b/0xe0 > >[] ext3_write_begin+0x88/0x270 > >[] generic_file_buffered_write+0x116/0x290 > >[] __generic_file_aio_write+0x27c/0x480 > >[] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex > >[] do_sync_write+0xea/0x130 > >[] vfs_write+0xf3/0x1f0 > >[] sys_write+0x51/0x90 > >[] system_call_fastpath+0x18/0x1d > >[] 0xffffffffffffffff > > > >This is not a hard deadlock though because the administrator can still > >intervene and increase the limit on the group which helps the writer to > >finish the allocation and release the lock. > > > >This patch heals the problem by forbidding OOM from page cache charges > >(namely add_to_page_cache_locked). mem_cgroup_cache_charge_no_oom helper > >function is defined which adds GFP_MEMCG_NO_OOM to the gfp mask which > >then tells mem_cgroup_charge_common that OOM is not allowed for the > >charge. Forbidding OOM from this path, apart from fixing the bug, also makes some > >sense as we really do not want to cause an OOM because of a page cache > >usage. > >As a possibly visible result add_to_page_cache_lru might fail more often > >with ENOMEM but this is to be expected if the limit is set and it is > >preferable to the OOM killer IMO. > > > >__GFP_NORETRY is abused for this memcg specific flag because it has been > >used to prevent from OOM already (since not-merged-yet "memcg: reclaim > >when more than one page needed").
The only difference is that the flag > >doesn't prevent from reclaim anymore which kind of makes sense because > >the global memory allocator triggers reclaim as well. The retry without > >any reclaim on __GFP_NORETRY doesn't make much sense anyway because this > >is effectively a busy loop with allowed OOM in this path. > > > >Reported-by: azurIt > >Signed-off-by: Michal Hocko > > As a short term fix, I think this patch will work enough and seems simple enough. > Acked-by: KAMEZAWA Hiroyuki Yes, let's do this for now. > >diff --git a/include/linux/gfp.h b/include/linux/gfp.h > >index 10e667f..aac9b21 100644 > >--- a/include/linux/gfp.h > >+++ b/include/linux/gfp.h > >@@ -152,6 +152,9 @@ struct vm_area_struct; > > /* 4GB DMA on some platforms */ > > #define GFP_DMA32 __GFP_DMA32 > > > >+/* memcg oom killer is not allowed */ > >+#define GFP_MEMCG_NO_OOM __GFP_NORETRY Could we leave this within memcg, please? An extra flag to mem_cgroup_cache_charge() or the like. GFP flags are about controlling the page allocator, this seems abusive. We have an oom flag down in try_charge, maybe just propagate this up the stack? > >diff --git a/mm/filemap.c b/mm/filemap.c > >index 83efee7..ef14351 100644 > >--- a/mm/filemap.c > >+++ b/mm/filemap.c > >@@ -447,7 +447,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, > > VM_BUG_ON(!PageLocked(page)); > > VM_BUG_ON(PageSwapBacked(page)); > > > >- error = mem_cgroup_cache_charge(page, current->mm, > >+ /* > >+ * Cannot trigger OOM even if gfp_mask would allow that normally > >+ * because we might be called from a locked context and that > >+ * could lead to deadlocks if the killed process is waiting for > >+ * the same lock. > >+ */ > >+ error = mem_cgroup_cache_charge_no_oom(page, current->mm, > > gfp_mask & GFP_RECLAIM_MASK); > > if (error) > > goto out; Shmem does not use this function but also charges under the i_mutex in the write path and fallocate at least. 
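A rough user-space illustration of the review suggestion above (keep the OOM decision inside memcg and pass it down the charge path as an explicit argument rather than as a GFP bit). This is only a toy model under stated assumptions, not kernel code: struct toy_memcg and every toy_* function are hypothetical names made up for the sketch, usage is counted in pages, and the "OOM killer" simply releases one page of usage.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a memcg: usage and hard limit, in pages. */
struct toy_memcg {
	size_t usage;
	size_t limit;
	unsigned int oom_kills;	/* how many times the toy OOM killer ran */
};

/* Toy OOM killer: pretend a victim exited and released 'freed' pages. */
static void toy_oom_kill(struct toy_memcg *cg, size_t freed)
{
	assert(cg != NULL);
	cg->usage = freed > cg->usage ? 0 : cg->usage - freed;
	cg->oom_kills++;
}

/*
 * Charge one page. The caller states explicitly whether the OOM killer
 * may run; nothing is encoded in an allocator flag mask.
 * Returns 0 on success or -ENOMEM.
 */
static int toy_charge_common(struct toy_memcg *cg, bool oom)
{
	if (cg->usage >= cg->limit) {
		if (!oom)
			return -ENOMEM;	/* caller may hold a lock: fail, do not kill */
		toy_oom_kill(cg, 1);
		if (cg->usage >= cg->limit)
			return -ENOMEM;	/* killing did not help (e.g. limit is 0) */
	}
	cg->usage++;
	return 0;
}

/* Page cache charges may run with i_mutex held, so they never request OOM. */
static int toy_cache_charge(struct toy_memcg *cg)
{
	return toy_charge_common(cg, false);
}

/* Anonymous page faults keep the previous behaviour and may trigger OOM. */
static int toy_newpage_charge(struct toy_memcg *cg)
{
	return toy_charge_common(cg, true);
}
```

With a group already at its limit, toy_cache_charge() fails with -ENOMEM and leaves the kill counter untouched, while toy_newpage_charge() succeeds after one toy kill. The point of the explicit bool is that each call site documents its own locking constraints without borrowing a page allocator flag.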
From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked Date: Tue, 27 Nov 2012 21:54:36 +0100 Message-ID: <20121127205431.GA2433@dhcp22.suse.cz> References: <20121123074023.GA24698@dhcp22.suse.cz> <20121123102137.10D6D653@pobox.sk> <20121123100438.GF24698@dhcp22.suse.cz> <20121125011047.7477BB5E@pobox.sk> <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <50B403CA.501@jp.fujitsu.com> <20121127194813.GP24381@cmpxchg.org> Mime-Version: 1.0 Return-path: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=sender:date:from:to:cc:subject:message-id:references:mime-version :content-type:content-disposition:in-reply-to:user-agent; bh=ltxRjhfo0fZuyKsZOc00BEymNAe6W8KyPLkzPLaZXFo=; b=Mf3WNNIyWR8HXLjsc4bn+mwaLihxMEpm/ZnaCNxUkfnHUQY5FpUceo4gXwkQH9KpcU ZcAiw8zu8lo+QFUICxTlnInQbiIO0DLab7qUnLm/PHq1RtmjlwgKekCOCgs3TnpTEZT/ gujUjHyH6e0IQlGNkADDM+XL1q0vKdbKzaMbwOUVhHsl/Htv730VAEcja+P9BAOUOzks NgTpgDuwt8+jxmNqrG2xPCY5MuAai4GTWIuW/0uuS0EtFLwbZG/JGu3W4X/CHDlRMHKH 99ezT40YsXmSpP+RM7zvzvTrBhpFeDpZggSG2XA558jEgAshvRGXlOY7QpWaaFGlo3p0 VKcw== Content-Disposition: inline In-Reply-To: <20121127194813.GP24381-druUgvl0LCNAfugRpC6u6w@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Johannes Weiner , KAMEZAWA Hiroyuki Cc: azurIt , linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist On Tue 27-11-12 14:48:13, Johannes Weiner wrote: [...] 
> > >diff --git a/include/linux/gfp.h b/include/linux/gfp.h > > >index 10e667f..aac9b21 100644 > > >--- a/include/linux/gfp.h > > >+++ b/include/linux/gfp.h > > >@@ -152,6 +152,9 @@ struct vm_area_struct; > > > /* 4GB DMA on some platforms */ > > > #define GFP_DMA32 __GFP_DMA32 > > > > > >+/* memcg oom killer is not allowed */ > > >+#define GFP_MEMCG_NO_OOM __GFP_NORETRY > > Could we leave this within memcg, please? An extra flag to > mem_cgroup_cache_charge() or the like. GFP flags are about > controlling the page allocator, this seems abusive. We have an oom > flag down in try_charge, maybe just propagate this up the stack? OK, what about the patch below? I have dropped Kame's Acked-by because the patch has been reworked. The patch is the same in principle. > > >diff --git a/mm/filemap.c b/mm/filemap.c > > >index 83efee7..ef14351 100644 > > >--- a/mm/filemap.c > > >+++ b/mm/filemap.c > > >@@ -447,7 +447,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, > > > VM_BUG_ON(!PageLocked(page)); > > > VM_BUG_ON(PageSwapBacked(page)); > > > > > >- error = mem_cgroup_cache_charge(page, current->mm, > > >+ /* > > >+ * Cannot trigger OOM even if gfp_mask would allow that normally > > >+ * because we might be called from a locked context and that > > >+ * could lead to deadlocks if the killed process is waiting for > > >+ * the same lock. > > >+ */ > > >+ error = mem_cgroup_cache_charge_no_oom(page, current->mm, > > > gfp_mask & GFP_RECLAIM_MASK); > > > if (error) > > > goto out; > > Shmem does not use this function but also charges under the i_mutex in > the write path and fallocate at least.
Right you are --- >From 60cc8a184490d277eb24fca551b114f1e2234ce0 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Mon, 26 Nov 2012 11:47:57 +0100 Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked memcg oom killer might deadlock if the process which falls down to mem_cgroup_handle_oom holds a lock which prevents another task from terminating because it is blocked on the very same lock. This can happen when a write system call needs to allocate a page but the allocation hits the memcg hard limit and there is nothing to reclaim (e.g. there is no swap or swap limit is hit as well and all cache pages have been reclaimed already) and the process selected by memcg OOM killer is blocked on i_mutex on the same inode (e.g. truncate it). Process A [] do_truncate+0x58/0xa0 # takes i_mutex [] do_last+0x250/0xa30 [] path_openat+0xd7/0x440 [] do_filp_open+0x49/0xa0 [] do_sys_open+0x106/0x240 [] sys_open+0x20/0x30 [] system_call_fastpath+0x18/0x1d [] 0xffffffffffffffff Process B [] mem_cgroup_handle_oom+0x241/0x3b0 [] T.1146+0x5ab/0x5c0 [] mem_cgroup_cache_charge+0xbe/0xe0 [] add_to_page_cache_locked+0x4c/0x140 [] add_to_page_cache_lru+0x22/0x50 [] grab_cache_page_write_begin+0x8b/0xe0 [] ext3_write_begin+0x88/0x270 [] generic_file_buffered_write+0x116/0x290 [] __generic_file_aio_write+0x27c/0x480 [] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex [] do_sync_write+0xea/0x130 [] vfs_write+0xf3/0x1f0 [] sys_write+0x51/0x90 [] system_call_fastpath+0x18/0x1d [] 0xffffffffffffffff This is not a hard deadlock though because the administrator can still intervene and increase the limit on the group which helps the writer to finish the allocation and release the lock. This patch heals the problem by forbidding OOM from page cache charges (namely add_to_page_cache_locked). mem_cgroup_cache_charge grows an oom argument which is pushed down the call chain.
As a possibly visible result add_to_page_cache_lru might fail more often with ENOMEM but this is to be expected if the limit is set and it is preferable to the OOM killer IMO. Changes since v1 - do not abuse gfp_flags and rather use oom parameter directly as per Johannes - handle also shmem write faults resp. fallocate properly as per Johannes Reported-by: azurIt Signed-off-by: Michal Hocko --- include/linux/memcontrol.h | 5 +++-- mm/filemap.c | 9 +++++++-- mm/memcontrol.c | 9 ++++----- mm/shmem.c | 14 +++++++++++--- 4 files changed, 25 insertions(+), 12 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 095d2b4..8f48d5e 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -63,7 +63,7 @@ extern void mem_cgroup_commit_charge_swapin(struct page *page, extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg); extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, - gfp_t gfp_mask); + gfp_t gfp_mask, bool oom); struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *); struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *); @@ -210,7 +210,8 @@ static inline int mem_cgroup_newpage_charge(struct page *page, } static inline int mem_cgroup_cache_charge(struct page *page, - struct mm_struct *mm, gfp_t gfp_mask) + struct mm_struct *mm, gfp_t gfp_mask, + bool oom) { return 0; } diff --git a/mm/filemap.c b/mm/filemap.c index 83efee7..ef8fbd5 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -447,8 +447,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, VM_BUG_ON(!PageLocked(page)); VM_BUG_ON(PageSwapBacked(page)); - error = mem_cgroup_cache_charge(page, current->mm, - gfp_mask & GFP_RECLAIM_MASK); + /* + * Cannot trigger OOM even if gfp_mask would allow that normally + * because we might be called from a locked context and that + * could lead to deadlocks if the killed process is waiting for + * the same lock.
+ */ + error = mem_cgroup_cache_charge(page, current->mm, gfp_mask, false); if (error) goto out; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 02ee2f7..26690d6 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3709,11 +3709,10 @@ out: * < 0 if the cgroup is over its limit */ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm, - gfp_t gfp_mask, enum charge_type ctype) + gfp_t gfp_mask, enum charge_type ctype, bool oom) { struct mem_cgroup *memcg = NULL; unsigned int nr_pages = 1; - bool oom = true; int ret; if (PageTransHuge(page)) { @@ -3742,7 +3741,7 @@ int mem_cgroup_newpage_charge(struct page *page, VM_BUG_ON(page->mapping && !PageAnon(page)); VM_BUG_ON(!mm); return mem_cgroup_charge_common(page, mm, gfp_mask, - MEM_CGROUP_CHARGE_TYPE_ANON); + MEM_CGROUP_CHARGE_TYPE_ANON, true); } /* @@ -3851,7 +3850,7 @@ void mem_cgroup_commit_charge_swapin(struct page *page, } int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, - gfp_t gfp_mask) + gfp_t gfp_mask, bool oom) { struct mem_cgroup *memcg = NULL; enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE; @@ -3863,7 +3862,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, return 0; if (!PageSwapCache(page)) - ret = mem_cgroup_charge_common(page, mm, gfp_mask, type); + ret = mem_cgroup_charge_common(page, mm, gfp_mask, type, oom); else { /* page is swapcache/shmem */ ret = __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, &memcg); diff --git a/mm/shmem.c b/mm/shmem.c index 55054a7..cef63b5 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -760,7 +760,7 @@ int shmem_unuse(swp_entry_t swap, struct page *page) * the shmem_swaplist_mutex which might hold up shmem_writepage(). * Charged back to the user (not to caller) when swap account is used. 
*/ - error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL); + error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL, true); if (error) goto out; /* No radix_tree_preload: swap entry keeps a place for page in tree */ @@ -1152,8 +1152,16 @@ repeat: goto failed; } + /* + * Cannot trigger OOM even if gfp_mask would allow that + * normally because we might be called from a locked + * context (i_mutex held) if this is a write lock or + * fallocate and that could lead to deadlocks if the + * killed process is waiting for the same lock. + */ error = mem_cgroup_cache_charge(page, current->mm, - gfp & GFP_RECLAIM_MASK); + gfp & GFP_RECLAIM_MASK, + sgp < SGP_WRITE); if (!error) { error = shmem_add_to_page_cache(page, mapping, index, gfp, swp_to_radix_entry(swap)); @@ -1209,7 +1217,7 @@ repeat: SetPageSwapBacked(page); __set_page_locked(page); error = mem_cgroup_cache_charge(page, current->mm, - gfp & GFP_RECLAIM_MASK); + gfp & GFP_RECLAIM_MASK, true); if (error) goto decused; error = radix_tree_preload(gfp & GFP_RECLAIM_MASK); -- 1.7.10.4 -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked Date: Tue, 27 Nov 2012 21:59:44 +0100 Message-ID: <20121127205944.GB2433@dhcp22.suse.cz> References: <20121123102137.10D6D653@pobox.sk> <20121123100438.GF24698@dhcp22.suse.cz> <20121125011047.7477BB5E@pobox.sk> <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <50B403CA.501@jp.fujitsu.com> <20121127194813.GP24381@cmpxchg.org> <20121127205431.GA2433@dhcp22.suse.cz> Mime-Version: 1.0 Return-path: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=sender:date:from:to:cc:subject:message-id:references:mime-version :content-type:content-disposition:in-reply-to:user-agent; bh=9R6XDazxUg7AcaKqGBRq/F6UnGCBywQQ+UAX6aWn+50=; 
b=llzAfgeoEeCiDmr88FnRYvwH9mURbGYNhpm7di1SRfHFyQimFfIf4UkocLMt2tnyWR +AdgOlp3K4pCWTOSX/2KRWfGQRenhlsbd4gx0w27XO4hb7USaiJdLGAQDeXdKp3onmFj DKrw2Lqpy3HEsSR8/8rV6Lb6egWctPwVpshXPpv9y6fEZClSFlIvJt5CtnMTlvOzkcjy 4YH6b1UI/QKRoI3PS1PEubflUqLi4yDH9oqgXlSP+EQu9vU5YrSP/SSsmLfUudR11eLR KRvvKsa2MSl0QgESiUlm9LZWBscQMYcWD1az2v3xXfWDWXXIdorRjQ/nSNZ4ZEy6w29M eomQ== Content-Disposition: inline In-Reply-To: <20121127205431.GA2433-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Johannes Weiner , KAMEZAWA Hiroyuki Cc: azurIt , linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist Sorry, I forgot about one shmem charge: --- >From 7ae29927d24471c1b1a6ceb021219c592c1ef518 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Tue, 27 Nov 2012 21:53:13 +0100 Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked memcg oom killer might deadlock if the process which falls down to mem_cgroup_handle_oom holds a lock which prevents another task from terminating because it is blocked on the very same lock. This can happen when a write system call needs to allocate a page but the allocation hits the memcg hard limit and there is nothing to reclaim (e.g. there is no swap or swap limit is hit as well and all cache pages have been reclaimed already) and the process selected by memcg OOM killer is blocked on i_mutex on the same inode (e.g. truncate it).
Process A [] do_truncate+0x58/0xa0 # takes i_mutex [] do_last+0x250/0xa30 [] path_openat+0xd7/0x440 [] do_filp_open+0x49/0xa0 [] do_sys_open+0x106/0x240 [] sys_open+0x20/0x30 [] system_call_fastpath+0x18/0x1d [] 0xffffffffffffffff Process B [] mem_cgroup_handle_oom+0x241/0x3b0 [] T.1146+0x5ab/0x5c0 [] mem_cgroup_cache_charge+0xbe/0xe0 [] add_to_page_cache_locked+0x4c/0x140 [] add_to_page_cache_lru+0x22/0x50 [] grab_cache_page_write_begin+0x8b/0xe0 [] ext3_write_begin+0x88/0x270 [] generic_file_buffered_write+0x116/0x290 [] __generic_file_aio_write+0x27c/0x480 [] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex [] do_sync_write+0xea/0x130 [] vfs_write+0xf3/0x1f0 [] sys_write+0x51/0x90 [] system_call_fastpath+0x18/0x1d [] 0xffffffffffffffff This is not a hard deadlock though because the administrator can still intervene and increase the limit on the group which helps the writer to finish the allocation and release the lock. This patch heals the problem by forbidding OOM from page cache charges (namely add_to_page_cache_locked). mem_cgroup_cache_charge grows an oom argument which is pushed down the call chain. As a possibly visible result add_to_page_cache_lru might fail more often with ENOMEM but this is to be expected if the limit is set and it is preferable to the OOM killer IMO. Changes since v1 - do not abuse gfp_flags and rather use oom parameter directly as per Johannes - handle also shmem write faults resp.
fallocate properly as per Johannes Reported-by: azurIt Signed-off-by: Michal Hocko --- include/linux/memcontrol.h | 5 +++-- mm/filemap.c | 9 +++++++-- mm/memcontrol.c | 9 ++++----- mm/shmem.c | 15 ++++++++++++--- 4 files changed, 26 insertions(+), 12 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 095d2b4..8f48d5e 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -63,7 +63,7 @@ extern void mem_cgroup_commit_charge_swapin(struct page *page, extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg); extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, - gfp_t gfp_mask); + gfp_t gfp_mask, bool oom); struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *); struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *); @@ -210,7 +210,8 @@ static inline int mem_cgroup_newpage_charge(struct page *page, } static inline int mem_cgroup_cache_charge(struct page *page, - struct mm_struct *mm, gfp_t gfp_mask) + struct mm_struct *mm, gfp_t gfp_mask, + bool oom) { return 0; } diff --git a/mm/filemap.c b/mm/filemap.c index 83efee7..ef8fbd5 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -447,8 +447,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, VM_BUG_ON(!PageLocked(page)); VM_BUG_ON(PageSwapBacked(page)); - error = mem_cgroup_cache_charge(page, current->mm, - gfp_mask & GFP_RECLAIM_MASK); + /* + * Cannot trigger OOM even if gfp_mask would allow that normally + * because we might be called from a locked context and that + * could lead to deadlocks if the killed process is waiting for + * the same lock. 
+ */ + error = mem_cgroup_cache_charge(page, current->mm, gfp_mask, false); if (error) goto out; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 02ee2f7..26690d6 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3709,11 +3709,10 @@ out: * < 0 if the cgroup is over its limit */ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm, - gfp_t gfp_mask, enum charge_type ctype) + gfp_t gfp_mask, enum charge_type ctype, bool oom) { struct mem_cgroup *memcg = NULL; unsigned int nr_pages = 1; - bool oom = true; int ret; if (PageTransHuge(page)) { @@ -3742,7 +3741,7 @@ int mem_cgroup_newpage_charge(struct page *page, VM_BUG_ON(page->mapping && !PageAnon(page)); VM_BUG_ON(!mm); return mem_cgroup_charge_common(page, mm, gfp_mask, - MEM_CGROUP_CHARGE_TYPE_ANON); + MEM_CGROUP_CHARGE_TYPE_ANON, true); } /* @@ -3851,7 +3850,7 @@ void mem_cgroup_commit_charge_swapin(struct page *page, } int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, - gfp_t gfp_mask) + gfp_t gfp_mask, bool oom) { struct mem_cgroup *memcg = NULL; enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE; @@ -3863,7 +3862,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, return 0; if (!PageSwapCache(page)) - ret = mem_cgroup_charge_common(page, mm, gfp_mask, type); + ret = mem_cgroup_charge_common(page, mm, gfp_mask, type, oom); else { /* page is swapcache/shmem */ ret = __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, &memcg); diff --git a/mm/shmem.c b/mm/shmem.c index 55054a7..ba59cfa 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -760,7 +760,7 @@ int shmem_unuse(swp_entry_t swap, struct page *page) * the shmem_swaplist_mutex which might hold up shmem_writepage(). * Charged back to the user (not to caller) when swap account is used. 
*/ - error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL); + error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL, true); if (error) goto out; /* No radix_tree_preload: swap entry keeps a place for page in tree */ @@ -1152,8 +1152,16 @@ repeat: goto failed; } + /* + * Cannot trigger OOM even if gfp_mask would allow that + * normally because we might be called from a locked + * context (i_mutex held) if this is a write lock or + * fallocate and that could lead to deadlocks if the + * killed process is waiting for the same lock. + */ error = mem_cgroup_cache_charge(page, current->mm, - gfp & GFP_RECLAIM_MASK); + gfp & GFP_RECLAIM_MASK, + sgp < SGP_WRITE); if (!error) { error = shmem_add_to_page_cache(page, mapping, index, gfp, swp_to_radix_entry(swap)); @@ -1209,7 +1217,8 @@ repeat: SetPageSwapBacked(page); __set_page_locked(page); error = mem_cgroup_cache_charge(page, current->mm, - gfp & GFP_RECLAIM_MASK); + gfp & GFP_RECLAIM_MASK, + sgp < SGP_WRITE); if (error) goto decused; error = radix_tree_preload(gfp & GFP_RECLAIM_MASK); -- 1.7.10.4 -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: Johannes Weiner Subject: Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked Date: Wed, 28 Nov 2012 10:26:31 -0500 Message-ID: <20121128152631.GT24381@cmpxchg.org> References: <20121123100438.GF24698@dhcp22.suse.cz> <20121125011047.7477BB5E@pobox.sk> <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <50B403CA.501@jp.fujitsu.com> <20121127194813.GP24381@cmpxchg.org> <20121127205431.GA2433@dhcp22.suse.cz> <20121127205944.GB2433@dhcp22.suse.cz> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20121127205944.GB2433-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 
7bit To: Michal Hocko Cc: KAMEZAWA Hiroyuki , azurIt , linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist On Tue, Nov 27, 2012 at 09:59:44PM +0100, Michal Hocko wrote: > @@ -3863,7 +3862,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, > return 0; > > if (!PageSwapCache(page)) > - ret = mem_cgroup_charge_common(page, mm, gfp_mask, type); > + ret = mem_cgroup_charge_common(page, mm, gfp_mask, type, oom); > else { /* page is swapcache/shmem */ > ret = __mem_cgroup_try_charge_swapin(mm, page, > gfp_mask, &memcg); I think you need to pass it down the swapcache path too, as that is what happens when the shmem page written to is in swap and has been read into swapcache by the time of charging. > @@ -1152,8 +1152,16 @@ repeat: > goto failed; > } > > + /* > + * Cannot trigger OOM even if gfp_mask would allow that > + * normally because we might be called from a locked > + * context (i_mutex held) if this is a write lock or > + * fallocate and that could lead to deadlocks if the > + * killed process is waiting for the same lock. > + */ Indentation broken? > error = mem_cgroup_cache_charge(page, current->mm, > - gfp & GFP_RECLAIM_MASK); > + gfp & GFP_RECLAIM_MASK, > + sgp < SGP_WRITE); The code tests for read-only paths a bunch of times using sgp != SGP_WRITE && sgp != SGP_FALLOC Would probably be more consistent and more robust to use this here as well? > @@ -1209,7 +1217,8 @@ repeat: > SetPageSwapBacked(page); > __set_page_locked(page); > error = mem_cgroup_cache_charge(page, current->mm, > - gfp & GFP_RECLAIM_MASK); > + gfp & GFP_RECLAIM_MASK, > + sgp < SGP_WRITE); Same. 
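The point about `sgp < SGP_WRITE` versus the explicit test can be illustrated with a minimal userspace sketch. It assumes that `enum sgp_type` in mm/shmem.c is ordered SGP_READ, SGP_CACHE, SGP_DIRTY, SGP_WRITE, SGP_FALLOC at the time of this thread; the helpers `may_oom_ordered`, `may_oom_explicit` and `check_predicates_agree` are hypothetical names used only for illustration and do not exist in the kernel. Under that assumption both forms give the same answer, but only the explicit form stays correct if a new read-only mode is ever added after SGP_WRITE.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: mirrors the ordering of enum sgp_type in mm/shmem.c. */
enum sgp_type {
	SGP_READ,
	SGP_CACHE,
	SGP_DIRTY,
	SGP_WRITE,
	SGP_FALLOC,
};

/* Hypothetical helper: the ordering-based test from the patch under review. */
static bool may_oom_ordered(enum sgp_type sgp)
{
	return sgp < SGP_WRITE;
}

/* Hypothetical helper: the explicit test already used elsewhere in shmem.c. */
static bool may_oom_explicit(enum sgp_type sgp)
{
	return sgp != SGP_WRITE && sgp != SGP_FALLOC;
}

/* Asserts that both forms agree for every value of the assumed enum. */
static void check_predicates_agree(void)
{
	for (int sgp = SGP_READ; sgp <= SGP_FALLOC; sgp++)
		assert(may_oom_ordered((enum sgp_type)sgp) ==
		       may_oom_explicit((enum sgp_type)sgp));
}
```

Calling `check_predicates_agree()` passes with the ordering assumed above; the equivalence depends entirely on that ordering, which is the robustness concern raised in the review.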
Otherwise, the patch looks good to me, thanks for persisting :) From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked Date: Wed, 28 Nov 2012 17:04:47 +0100 Message-ID: <20121128160447.GH12309@dhcp22.suse.cz> References: <20121125011047.7477BB5E@pobox.sk> <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <50B403CA.501@jp.fujitsu.com> <20121127194813.GP24381@cmpxchg.org> <20121127205431.GA2433@dhcp22.suse.cz> <20121127205944.GB2433@dhcp22.suse.cz> <20121128152631.GT24381@cmpxchg.org> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20121128152631.GT24381@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Johannes Weiner Cc: KAMEZAWA Hiroyuki , azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist On Wed 28-11-12 10:26:31, Johannes Weiner wrote: > On Tue, Nov 27, 2012 at 09:59:44PM +0100, Michal Hocko wrote: > > @@ -3863,7 +3862,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, > > return 0; > > > > if (!PageSwapCache(page)) > > - ret = mem_cgroup_charge_common(page, mm, gfp_mask, type); > > + ret = mem_cgroup_charge_common(page, mm, gfp_mask, type, oom); > > else { /* page is swapcache/shmem */ > > ret = __mem_cgroup_try_charge_swapin(mm, page, > > gfp_mask, &memcg); > > I think you need to pass it down the swapcache path too, as that is > what happens when the shmem page written to is in swap and has been > read into swapcache by the time of charging. You are right, of course. I shouldn't send patches late in the evening after staring at a crashdump for a good part of the day. /me ashamed.
> > @@ -1152,8 +1152,16 @@ repeat: > > goto failed; > > } > > > > + /* > > + * Cannot trigger OOM even if gfp_mask would allow that > > + * normally because we might be called from a locked > > + * context (i_mutex held) if this is a write lock or > > + * fallocate and that could lead to deadlocks if the > > + * killed process is waiting for the same lock. > > + */ > > Indentation broken? c&p > > error = mem_cgroup_cache_charge(page, current->mm, > > - gfp & GFP_RECLAIM_MASK); > > + gfp & GFP_RECLAIM_MASK, > > + sgp < SGP_WRITE); > > The code tests for read-only paths a bunch of times using > > sgp != SGP_WRITE && sgp != SGP_FALLOC > > Would probably be more consistent and more robust to use this here as > well? Yes, my laziness. I was considering that but it was really long so I've chosen the simpler way. But you are right that consistency is probably better here. > > @@ -1209,7 +1217,8 @@ repeat: > > SetPageSwapBacked(page); > > __set_page_locked(page); > > error = mem_cgroup_cache_charge(page, current->mm, > > - gfp & GFP_RECLAIM_MASK); > > + gfp & GFP_RECLAIM_MASK, > > + sgp < SGP_WRITE); > > Same. > > Otherwise, the patch looks good to me, thanks for persisting :) Thanks for the thorough review. Here we go with the fixed version. --- >From 5000bf32c9c02fcd31d18e615300d8e7e7ef94a5 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Wed, 28 Nov 2012 16:49:46 +0100 Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked memcg oom killer might deadlock if the process which falls down to mem_cgroup_handle_oom holds a lock which prevents another task from terminating because it is blocked on the very same lock. This can happen when a write system call needs to allocate a page but the allocation hits the memcg hard limit and there is nothing to reclaim (e.g. there is no swap or swap limit is hit as well and all cache pages have been reclaimed already) and the process selected by memcg OOM killer is blocked on i_mutex on the same inode (e.g. truncate it).
Process A [] do_truncate+0x58/0xa0 # takes i_mutex [] do_last+0x250/0xa30 [] path_openat+0xd7/0x440 [] do_filp_open+0x49/0xa0 [] do_sys_open+0x106/0x240 [] sys_open+0x20/0x30 [] system_call_fastpath+0x18/0x1d [] 0xffffffffffffffff Process B [] mem_cgroup_handle_oom+0x241/0x3b0 [] T.1146+0x5ab/0x5c0 [] mem_cgroup_cache_charge+0xbe/0xe0 [] add_to_page_cache_locked+0x4c/0x140 [] add_to_page_cache_lru+0x22/0x50 [] grab_cache_page_write_begin+0x8b/0xe0 [] ext3_write_begin+0x88/0x270 [] generic_file_buffered_write+0x116/0x290 [] __generic_file_aio_write+0x27c/0x480 [] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex [] do_sync_write+0xea/0x130 [] vfs_write+0xf3/0x1f0 [] sys_write+0x51/0x90 [] system_call_fastpath+0x18/0x1d [] 0xffffffffffffffff This is not a hard deadlock though because the administrator can still intervene and increase the limit on the group which helps the writer to finish the allocation and release the lock. This patch heals the problem by forbidding OOM from page cache charges (namely add_to_page_cache_locked). mem_cgroup_cache_charge grows an oom argument which is pushed down the call chain. As a possibly visible result add_to_page_cache_lru might fail more often with ENOMEM but this is to be expected if the limit is set and it is preferable to the OOM killer IMO. Changes since v1 - do not abuse gfp_flags and rather use oom parameter directly as per Johannes - handle also shmem write faults resp.
fallocate properly as per Johannes Reported-by: azurIt Signed-off-by: Michal Hocko --- include/linux/memcontrol.h | 11 +++++++---- mm/filemap.c | 9 +++++++-- mm/memcontrol.c | 25 +++++++++++++------------ mm/memory.c | 2 +- mm/shmem.c | 17 ++++++++++++++--- mm/swapfile.c | 2 +- 6 files changed, 43 insertions(+), 23 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 095d2b4..5abe441 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -57,13 +57,14 @@ extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask); /* for swap handling */ extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm, - struct page *page, gfp_t mask, struct mem_cgroup **memcgp); + struct page *page, gfp_t mask, struct mem_cgroup **memcgp, + bool oom); extern void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *memcg); extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg); extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, - gfp_t gfp_mask); + gfp_t gfp_mask, bool oom); struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *); struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *); @@ -210,13 +211,15 @@ static inline int mem_cgroup_newpage_charge(struct page *page, } static inline int mem_cgroup_cache_charge(struct page *page, - struct mm_struct *mm, gfp_t gfp_mask) + struct mm_struct *mm, gfp_t gfp_mask, + bool oom) { return 0; } static inline int mem_cgroup_try_charge_swapin(struct mm_struct *mm, - struct page *page, gfp_t gfp_mask, struct mem_cgroup **memcgp) + struct page *page, gfp_t gfp_mask, struct mem_cgroup **memcgp, + bool oom) { return 0; } diff --git a/mm/filemap.c b/mm/filemap.c index 83efee7..ef8fbd5 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -447,8 +447,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, VM_BUG_ON(!PageLocked(page)); 
VM_BUG_ON(PageSwapBacked(page)); - error = mem_cgroup_cache_charge(page, current->mm, - gfp_mask & GFP_RECLAIM_MASK); + /* + * Cannot trigger OOM even if gfp_mask would allow that normally + * because we might be called from a locked context and that + * could lead to deadlocks if the killed process is waiting for + * the same lock. + */ + error = mem_cgroup_cache_charge(page, current->mm, gfp_mask, false); if (error) goto out; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 02ee2f7..02a6d70 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3709,11 +3709,10 @@ out: * < 0 if the cgroup is over its limit */ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm, - gfp_t gfp_mask, enum charge_type ctype) + gfp_t gfp_mask, enum charge_type ctype, bool oom) { struct mem_cgroup *memcg = NULL; unsigned int nr_pages = 1; - bool oom = true; int ret; if (PageTransHuge(page)) { @@ -3742,7 +3741,7 @@ int mem_cgroup_newpage_charge(struct page *page, VM_BUG_ON(page->mapping && !PageAnon(page)); VM_BUG_ON(!mm); return mem_cgroup_charge_common(page, mm, gfp_mask, - MEM_CGROUP_CHARGE_TYPE_ANON); + MEM_CGROUP_CHARGE_TYPE_ANON, true); } /* @@ -3754,7 +3753,8 @@ int mem_cgroup_newpage_charge(struct page *page, static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page, gfp_t mask, - struct mem_cgroup **memcgp) + struct mem_cgroup **memcgp, + bool oom) { struct mem_cgroup *memcg; struct page_cgroup *pc; @@ -3776,20 +3776,21 @@ static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm, if (!memcg) goto charge_cur_mm; *memcgp = memcg; - ret = __mem_cgroup_try_charge(NULL, mask, 1, memcgp, true); + ret = __mem_cgroup_try_charge(NULL, mask, 1, memcgp, oom); css_put(&memcg->css); if (ret == -EINTR) ret = 0; return ret; charge_cur_mm: - ret = __mem_cgroup_try_charge(mm, mask, 1, memcgp, true); + ret = __mem_cgroup_try_charge(mm, mask, 1, memcgp, oom); if (ret == -EINTR) ret = 0; return ret; } int mem_cgroup_try_charge_swapin(struct 
mm_struct *mm, struct page *page, - gfp_t gfp_mask, struct mem_cgroup **memcgp) + gfp_t gfp_mask, struct mem_cgroup **memcgp, + bool oom) { *memcgp = NULL; if (mem_cgroup_disabled()) @@ -3803,12 +3804,12 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page, if (!PageSwapCache(page)) { int ret; - ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, memcgp, true); + ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, memcgp, oom); if (ret == -EINTR) ret = 0; return ret; } - return __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, memcgp); + return __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, memcgp, oom); } void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg) @@ -3851,7 +3852,7 @@ void mem_cgroup_commit_charge_swapin(struct page *page, } int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, - gfp_t gfp_mask) + gfp_t gfp_mask, bool oom) { struct mem_cgroup *memcg = NULL; enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE; @@ -3863,10 +3864,10 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, return 0; if (!PageSwapCache(page)) - ret = mem_cgroup_charge_common(page, mm, gfp_mask, type); + ret = mem_cgroup_charge_common(page, mm, gfp_mask, type, oom); else { /* page is swapcache/shmem */ ret = __mem_cgroup_try_charge_swapin(mm, page, - gfp_mask, &memcg); + gfp_mask, &memcg, oom); if (!ret) __mem_cgroup_commit_charge_swapin(page, memcg, type); } diff --git a/mm/memory.c b/mm/memory.c index 6891d3b..afad903 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2991,7 +2991,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma, } } - if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr)) { + if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr, true)) { ret = VM_FAULT_OOM; goto out_page; } diff --git a/mm/shmem.c b/mm/shmem.c index 55054a7..3b27db4 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -760,7 +760,7 @@ int shmem_unuse(swp_entry_t swap, struct page *page) * the 
shmem_swaplist_mutex which might hold up shmem_writepage(). * Charged back to the user (not to caller) when swap account is used. */ - error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL); + error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL, true); if (error) goto out; /* No radix_tree_preload: swap entry keeps a place for page in tree */ @@ -1152,8 +1152,17 @@ repeat: goto failed; } + /* + * Cannot trigger OOM even if gfp_mask would allow that + * normally because we might be called from a locked + * context (i_mutex held) if this is a write lock or + * fallocate and that could lead to deadlocks if the + * killed process is waiting for the same lock. + */ error = mem_cgroup_cache_charge(page, current->mm, - gfp & GFP_RECLAIM_MASK); + gfp & GFP_RECLAIM_MASK, + sgp != SGP_WRITE && + sgp != SGP_FALLOC); if (!error) { error = shmem_add_to_page_cache(page, mapping, index, gfp, swp_to_radix_entry(swap)); @@ -1209,7 +1218,9 @@ repeat: SetPageSwapBacked(page); __set_page_locked(page); error = mem_cgroup_cache_charge(page, current->mm, - gfp & GFP_RECLAIM_MASK); + gfp & GFP_RECLAIM_MASK, + sgp != SGP_WRITE && + sgp != SGP_FALLOC); if (error) goto decused; error = radix_tree_preload(gfp & GFP_RECLAIM_MASK); diff --git a/mm/swapfile.c b/mm/swapfile.c index 2f8e429..8ec511e 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -828,7 +828,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd, int ret = 1; if (mem_cgroup_try_charge_swapin(vma->vm_mm, page, - GFP_KERNEL, &memcg)) { + GFP_KERNEL, &memcg, true)) { ret = -ENOMEM; goto out_nolock; } -- 1.7.10.4 -- Michal Hocko SUSE Labs
From mboxrd@z Thu Jan 1 00:00:00 1970 From: Johannes Weiner Subject: Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked Date: Wed, 28 Nov 2012 11:37:36 -0500 Message-ID: <20121128163736.GV24381@cmpxchg.org> References: <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <50B403CA.501@jp.fujitsu.com> <20121127194813.GP24381@cmpxchg.org> <20121127205431.GA2433@dhcp22.suse.cz> <20121127205944.GB2433@dhcp22.suse.cz> <20121128152631.GT24381@cmpxchg.org> <20121128160447.GH12309@dhcp22.suse.cz> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20121128160447.GH12309-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Michal Hocko Cc: KAMEZAWA Hiroyuki , azurIt , linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist On Wed, Nov 28, 2012 at 05:04:47PM +0100, Michal Hocko wrote: > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 095d2b4..5abe441 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -57,13 +57,14 @@ extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm, > gfp_t gfp_mask); > /* for swap handling */ > extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm, > - struct page *page, gfp_t mask, struct mem_cgroup **memcgp); > + struct page *page, gfp_t mask, struct mem_cgroup **memcgp, > + bool oom); Ok, now I feel almost bad for asking, but why the public interface, too? You only ever pass "true" in there and this is unlikely to change anytime soon, no?
> @@ -3754,7 +3753,8 @@ int mem_cgroup_newpage_charge(struct page *page, > static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm, > struct page *page, > gfp_t mask, > - struct mem_cgroup **memcgp) > + struct mem_cgroup **memcgp, > + bool oom) > { > struct mem_cgroup *memcg; > struct page_cgroup *pc; > @@ -3776,20 +3776,21 @@ static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm, > if (!memcg) > goto charge_cur_mm; > *memcgp = memcg; > - ret = __mem_cgroup_try_charge(NULL, mask, 1, memcgp, true); > + ret = __mem_cgroup_try_charge(NULL, mask, 1, memcgp, oom); > css_put(&memcg->css); > if (ret == -EINTR) > ret = 0; > return ret; > charge_cur_mm: > - ret = __mem_cgroup_try_charge(mm, mask, 1, memcgp, true); > + ret = __mem_cgroup_try_charge(mm, mask, 1, memcgp, oom); > if (ret == -EINTR) > ret = 0; > return ret; > } Only this one is needed... > @@ -3851,7 +3852,7 @@ void mem_cgroup_commit_charge_swapin(struct page *page, > } > > int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, > - gfp_t gfp_mask) > + gfp_t gfp_mask, bool oom) > { > struct mem_cgroup *memcg = NULL; > enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE; > @@ -3863,10 +3864,10 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, > return 0; > > if (!PageSwapCache(page)) > - ret = mem_cgroup_charge_common(page, mm, gfp_mask, type); > + ret = mem_cgroup_charge_common(page, mm, gfp_mask, type, oom); > else { /* page is swapcache/shmem */ > ret = __mem_cgroup_try_charge_swapin(mm, page, > - gfp_mask, &memcg); > + gfp_mask, &memcg, oom); > if (!ret) > __mem_cgroup_commit_charge_swapin(page, memcg, type); > } ...for this site. 
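The shape suggested here can be sketched in userspace as follows: only the file-local helper grows the `oom` parameter, the exported swap-in entry point keeps its signature and always passes true, and the page cache charge forwards its caller's choice on both paths. Everything below is a stand-in written for illustration; the `sketch_` names are hypothetical, and the simulated charge simply succeeds after "invoking OOM", which is a simplification of what the kernel does.

```c
#include <errno.h>
#include <stdbool.h>

/* Stand-in state: the cgroup is at its hard limit with nothing to reclaim. */
static bool sketch_over_limit = true;
/* Counts how often the simulated memcg OOM handler was entered. */
static int sketch_oom_invocations;

/* Stand-in for the low-level charge: may enter OOM only when allowed. */
static int sketch_try_charge(bool oom)
{
	if (!sketch_over_limit)
		return 0;
	if (!oom)
		return -ENOMEM;		/* fail the charge instead of killing */
	sketch_oom_invocations++;	/* simplified: pretend OOM freed memory */
	return 0;
}

/* File-local helper: the only swap-in function that needs the flag. */
static int sketch_try_charge_swapin_helper(bool oom)
{
	return sketch_try_charge(oom);
}

/* Exported entry point: signature unchanged, OOM always permitted. */
int sketch_try_charge_swapin(void)
{
	return sketch_try_charge_swapin_helper(true);
}

/* Page cache charge: forwards the caller's decision on both paths. */
int sketch_cache_charge(bool page_in_swapcache, bool oom)
{
	if (!page_in_swapcache)
		return sketch_try_charge(oom);
	return sketch_try_charge_swapin_helper(oom);
}
```

With the group over its limit, `sketch_cache_charge(true, false)` returns -ENOMEM without touching the OOM counter, while `sketch_try_charge_swapin()` still goes through the OOM path as before.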
> diff --git a/mm/memory.c b/mm/memory.c > index 6891d3b..afad903 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -2991,7 +2991,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma, > } > } > > - if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr)) { > + if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr, true)) { > ret = VM_FAULT_OOM; > goto out_page; > } Can not happen for shmem, the fault handler uses vma->vm_ops->fault. > diff --git a/mm/swapfile.c b/mm/swapfile.c > index 2f8e429..8ec511e 100644 > --- a/mm/swapfile.c > +++ b/mm/swapfile.c > @@ -828,7 +828,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd, > int ret = 1; > > if (mem_cgroup_try_charge_swapin(vma->vm_mm, page, > - GFP_KERNEL, &memcg)) { > + GFP_KERNEL, &memcg, true)) { > ret = -ENOMEM; > goto out_nolock; > } Can not happen for shmem, uses shmem_unuse() instead. From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked Date: Wed, 28 Nov 2012 17:46:40 +0100 Message-ID: <20121128164640.GB22201@dhcp22.suse.cz> References: <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <50B403CA.501@jp.fujitsu.com> <20121127194813.GP24381@cmpxchg.org> <20121127205431.GA2433@dhcp22.suse.cz> <20121127205944.GB2433@dhcp22.suse.cz> <20121128152631.GT24381@cmpxchg.org> <20121128160447.GH12309@dhcp22.suse.cz> <20121128163736.GV24381@cmpxchg.org> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20121128163736.GV24381@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Johannes Weiner Cc: KAMEZAWA Hiroyuki , azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist On Wed 28-11-12 11:37:36, Johannes Weiner wrote: > On Wed, Nov 28, 2012 at 05:04:47PM +0100, Michal Hocko wrote: > > diff --git 
a/include/linux/memcontrol.h b/include/linux/memcontrol.h > > index 095d2b4..5abe441 100644 > > --- a/include/linux/memcontrol.h > > +++ b/include/linux/memcontrol.h > > @@ -57,13 +57,14 @@ extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm, > > gfp_t gfp_mask); > > /* for swap handling */ > > extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm, > > - struct page *page, gfp_t mask, struct mem_cgroup **memcgp); > > + struct page *page, gfp_t mask, struct mem_cgroup **memcgp, > > + bool oom); > > Ok, now I feel almost bad for asking, but why the public interface, > too? Would it work out if I tell it was to double check that your review quality is not decreased after that many revisions? :P Incremental update and the full patch in the reply --- diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 5abe441..8f48d5e 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -57,8 +57,7 @@ extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask); /* for swap handling */ extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm, - struct page *page, gfp_t mask, struct mem_cgroup **memcgp, - bool oom); + struct page *page, gfp_t mask, struct mem_cgroup **memcgp); extern void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *memcg); extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg); @@ -218,8 +217,7 @@ static inline int mem_cgroup_cache_charge(struct page *page, } static inline int mem_cgroup_try_charge_swapin(struct mm_struct *mm, - struct page *page, gfp_t gfp_mask, struct mem_cgroup **memcgp, - bool oom) + struct page *page, gfp_t gfp_mask, struct mem_cgroup **memcgp) { return 0; } diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 02a6d70..3c9b1c5 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3789,8 +3789,7 @@ charge_cur_mm: } int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page, - gfp_t 
gfp_mask, struct mem_cgroup **memcgp, - bool oom) + gfp_t gfp_mask, struct mem_cgroup **memcgp) { *memcgp = NULL; if (mem_cgroup_disabled()) @@ -3804,12 +3803,12 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page, if (!PageSwapCache(page)) { int ret; - ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, memcgp, oom); + ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, memcgp, true); if (ret == -EINTR) ret = 0; return ret; } - return __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, memcgp, oom); + return __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, memcgp, true); } void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg) diff --git a/mm/memory.c b/mm/memory.c index afad903..6891d3b 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2991,7 +2991,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma, } } - if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr, true)) { + if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr)) { ret = VM_FAULT_OOM; goto out_page; } diff --git a/mm/swapfile.c b/mm/swapfile.c index 8ec511e..2f8e429 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -828,7 +828,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd, int ret = 1; if (mem_cgroup_try_charge_swapin(vma->vm_mm, page, - GFP_KERNEL, &memcg, true)) { + GFP_KERNEL, &memcg)) { ret = -ENOMEM; goto out_nolock; } -- Michal Hocko SUSE Labs
From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked Date: Wed, 28 Nov 2012 17:48:24 +0100 Message-ID: <20121128164824.GC22201@dhcp22.suse.cz> References: <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <50B403CA.501@jp.fujitsu.com> <20121127194813.GP24381@cmpxchg.org> <20121127205431.GA2433@dhcp22.suse.cz> <20121127205944.GB2433@dhcp22.suse.cz> <20121128152631.GT24381@cmpxchg.org> <20121128160447.GH12309@dhcp22.suse.cz> <20121128163736.GV24381@cmpxchg.org> <20121128164640.GB22201@dhcp22.suse.cz> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20121128164640.GB22201@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Johannes Weiner Cc: KAMEZAWA Hiroyuki , azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist On Wed 28-11-12 17:46:40, Michal Hocko wrote: > On Wed 28-11-12 11:37:36, Johannes Weiner wrote: > > On Wed, Nov 28, 2012 at 05:04:47PM +0100, Michal Hocko wrote: > > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > > > index 095d2b4..5abe441 100644 > > > --- a/include/linux/memcontrol.h > > > +++ b/include/linux/memcontrol.h > > > @@ -57,13 +57,14 @@ extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm, > > > gfp_t gfp_mask); > > > /* for swap handling */ > > > extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm, > > > - struct page *page, gfp_t mask, struct mem_cgroup **memcgp); > > > + struct page *page, gfp_t mask, struct mem_cgroup **memcgp, > > > + bool oom); > > > > Ok, now I feel almost bad for asking, but why the public interface, > > too? > > Would it work out if I tell it was to double check that your review > quality is not decreased after that many revisions?
:P > > Incremental update and the full patch in the reply --- >From e21bb704947e9a477ec1df9121575c606dbfcb52 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Wed, 28 Nov 2012 17:46:32 +0100 Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked memcg oom killer might deadlock if the process which falls down to mem_cgroup_handle_oom holds a lock which prevents another task from terminating because it is blocked on the very same lock. This can happen when a write system call needs to allocate a page but the allocation hits the memcg hard limit and there is nothing to reclaim (e.g. there is no swap or swap limit is hit as well and all cache pages have been reclaimed already) and the process selected by memcg OOM killer is blocked on i_mutex on the same inode (e.g. truncate it). Process A [] do_truncate+0x58/0xa0 # takes i_mutex [] do_last+0x250/0xa30 [] path_openat+0xd7/0x440 [] do_filp_open+0x49/0xa0 [] do_sys_open+0x106/0x240 [] sys_open+0x20/0x30 [] system_call_fastpath+0x18/0x1d [] 0xffffffffffffffff Process B [] mem_cgroup_handle_oom+0x241/0x3b0 [] T.1146+0x5ab/0x5c0 [] mem_cgroup_cache_charge+0xbe/0xe0 [] add_to_page_cache_locked+0x4c/0x140 [] add_to_page_cache_lru+0x22/0x50 [] grab_cache_page_write_begin+0x8b/0xe0 [] ext3_write_begin+0x88/0x270 [] generic_file_buffered_write+0x116/0x290 [] __generic_file_aio_write+0x27c/0x480 [] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex [] do_sync_write+0xea/0x130 [] vfs_write+0xf3/0x1f0 [] sys_write+0x51/0x90 [] system_call_fastpath+0x18/0x1d [] 0xffffffffffffffff This is not a hard deadlock though because the administrator can still intervene and increase the limit on the group which helps the writer to finish the allocation and release the lock. This patch heals the problem by forbidding OOM from page cache charges (namely add_to_page_cache_locked). mem_cgroup_cache_charge grows an oom argument which is pushed down the call chain.
As a possibly visible result add_to_page_cache_lru might fail more often with ENOMEM but this is to be expected if the limit is set and it is preferable to the OOM killer IMO. Changes since v1 - do not abuse gfp_flags and rather use oom parameter directly as per Johannes - handle also shmem write faults resp. fallocate properly as per Johannes Reported-by: azurIt Signed-off-by: Michal Hocko --- include/linux/memcontrol.h | 5 +++-- mm/filemap.c | 9 +++++++-- mm/memcontrol.c | 20 ++++++++++---------- mm/shmem.c | 17 ++++++++++++++--- 4 files changed, 34 insertions(+), 17 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 095d2b4..8f48d5e 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -63,7 +63,7 @@ extern void mem_cgroup_commit_charge_swapin(struct page *page, extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg); extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, - gfp_t gfp_mask); + gfp_t gfp_mask, bool oom); struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *); struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *); @@ -210,7 +210,8 @@ static inline int mem_cgroup_newpage_charge(struct page *page, } static inline int mem_cgroup_cache_charge(struct page *page, - struct mm_struct *mm, gfp_t gfp_mask) + struct mm_struct *mm, gfp_t gfp_mask, + bool oom) { return 0; } diff --git a/mm/filemap.c b/mm/filemap.c index 83efee7..ef8fbd5 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -447,8 +447,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, VM_BUG_ON(!PageLocked(page)); VM_BUG_ON(PageSwapBacked(page)); - error = mem_cgroup_cache_charge(page, current->mm, - gfp_mask & GFP_RECLAIM_MASK); + /* + * Cannot trigger OOM even if gfp_mask would allow that normally + * because we might be called from a locked context and that + * could lead to deadlocks if the killed process is waiting for + * the same lock.
+ */ + error = mem_cgroup_cache_charge(page, current->mm, gfp_mask, false); if (error) goto out; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 02ee2f7..3c9b1c5 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3709,11 +3709,10 @@ out: * < 0 if the cgroup is over its limit */ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm, - gfp_t gfp_mask, enum charge_type ctype) + gfp_t gfp_mask, enum charge_type ctype, bool oom) { struct mem_cgroup *memcg = NULL; unsigned int nr_pages = 1; - bool oom = true; int ret; if (PageTransHuge(page)) { @@ -3742,7 +3741,7 @@ int mem_cgroup_newpage_charge(struct page *page, VM_BUG_ON(page->mapping && !PageAnon(page)); VM_BUG_ON(!mm); return mem_cgroup_charge_common(page, mm, gfp_mask, - MEM_CGROUP_CHARGE_TYPE_ANON); + MEM_CGROUP_CHARGE_TYPE_ANON, true); } /* @@ -3754,7 +3753,8 @@ int mem_cgroup_newpage_charge(struct page *page, static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page, gfp_t mask, - struct mem_cgroup **memcgp) + struct mem_cgroup **memcgp, + bool oom) { struct mem_cgroup *memcg; struct page_cgroup *pc; @@ -3776,13 +3776,13 @@ static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm, if (!memcg) goto charge_cur_mm; *memcgp = memcg; - ret = __mem_cgroup_try_charge(NULL, mask, 1, memcgp, true); + ret = __mem_cgroup_try_charge(NULL, mask, 1, memcgp, oom); css_put(&memcg->css); if (ret == -EINTR) ret = 0; return ret; charge_cur_mm: - ret = __mem_cgroup_try_charge(mm, mask, 1, memcgp, true); + ret = __mem_cgroup_try_charge(mm, mask, 1, memcgp, oom); if (ret == -EINTR) ret = 0; return ret; @@ -3808,7 +3808,7 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page, ret = 0; return ret; } - return __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, memcgp); + return __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, memcgp, true); } void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg) @@ -3851,7 +3851,7 @@ void 
mem_cgroup_commit_charge_swapin(struct page *page, } int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, - gfp_t gfp_mask) + gfp_t gfp_mask, bool oom) { struct mem_cgroup *memcg = NULL; enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE; @@ -3863,10 +3863,10 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, return 0; if (!PageSwapCache(page)) - ret = mem_cgroup_charge_common(page, mm, gfp_mask, type); + ret = mem_cgroup_charge_common(page, mm, gfp_mask, type, oom); else { /* page is swapcache/shmem */ ret = __mem_cgroup_try_charge_swapin(mm, page, - gfp_mask, &memcg); + gfp_mask, &memcg, oom); if (!ret) __mem_cgroup_commit_charge_swapin(page, memcg, type); } diff --git a/mm/shmem.c b/mm/shmem.c index 55054a7..3b27db4 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -760,7 +760,7 @@ int shmem_unuse(swp_entry_t swap, struct page *page) * the shmem_swaplist_mutex which might hold up shmem_writepage(). * Charged back to the user (not to caller) when swap account is used. */ - error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL); + error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL, true); if (error) goto out; /* No radix_tree_preload: swap entry keeps a place for page in tree */ @@ -1152,8 +1152,17 @@ repeat: goto failed; } + /* + * Cannot trigger OOM even if gfp_mask would allow that + * normally because we might be called from a locked + * context (i_mutex held) if this is a write lock or + * fallocate and that could lead to deadlocks if the + * killed process is waiting for the same lock. 
+ */ error = mem_cgroup_cache_charge(page, current->mm, - gfp & GFP_RECLAIM_MASK); + gfp & GFP_RECLAIM_MASK, + sgp != SGP_WRITE && + sgp != SGP_FALLOC); if (!error) { error = shmem_add_to_page_cache(page, mapping, index, gfp, swp_to_radix_entry(swap)); @@ -1209,7 +1218,9 @@ repeat: SetPageSwapBacked(page); __set_page_locked(page); error = mem_cgroup_cache_charge(page, current->mm, - gfp & GFP_RECLAIM_MASK); + gfp & GFP_RECLAIM_MASK, + sgp != SGP_WRITE && + sgp != SGP_FALLOC); if (error) goto decused; error = radix_tree_preload(gfp & GFP_RECLAIM_MASK); -- 1.7.10.4 -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Johannes Weiner Subject: Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked Date: Wed, 28 Nov 2012 13:44:33 -0500 Message-ID: <20121128184433.GH2301@cmpxchg.org> References: <20121126131837.GC17860@dhcp22.suse.cz> <50B403CA.501@jp.fujitsu.com> <20121127194813.GP24381@cmpxchg.org> <20121127205431.GA2433@dhcp22.suse.cz> <20121127205944.GB2433@dhcp22.suse.cz> <20121128152631.GT24381@cmpxchg.org> <20121128160447.GH12309@dhcp22.suse.cz> <20121128163736.GV24381@cmpxchg.org> <20121128164640.GB22201@dhcp22.suse.cz> <20121128164824.GC22201@dhcp22.suse.cz> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20121128164824.GC22201-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Michal Hocko Cc: KAMEZAWA Hiroyuki , azurIt , linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist On Wed, Nov 28, 2012 at 05:48:24PM +0100, Michal Hocko wrote: > On Wed 28-11-12 17:46:40, Michal Hocko wrote: > > On Wed 28-11-12 
11:37:36, Johannes Weiner wrote: > > > On Wed, Nov 28, 2012 at 05:04:47PM +0100, Michal Hocko wrote: > > > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > > > > index 095d2b4..5abe441 100644 > > > > --- a/include/linux/memcontrol.h > > > > +++ b/include/linux/memcontrol.h > > > > @@ -57,13 +57,14 @@ extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm, > > > > gfp_t gfp_mask); > > > > /* for swap handling */ > > > > extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm, > > > > - struct page *page, gfp_t mask, struct mem_cgroup **memcgp); > > > > + struct page *page, gfp_t mask, struct mem_cgroup **memcgp, > > > > + bool oom); > > > > > > Ok, now I feel almost bad for asking, but why the public interface, > > > too? > > > > Would it work out if I tell it was to double check that your review > > quality is not decreased after that many revisions? :P Deal. > >From e21bb704947e9a477ec1df9121575c606dbfcb52 Mon Sep 17 00:00:00 2001 > From: Michal Hocko > Date: Wed, 28 Nov 2012 17:46:32 +0100 > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked > > memcg oom killer might deadlock if the process which falls down to > mem_cgroup_handle_oom holds a lock which prevents other task to > terminate because it is blocked on the very same lock. > This can happen when a write system call needs to allocate a page but > the allocation hits the memcg hard limit and there is nothing to reclaim > (e.g. there is no swap or swap limit is hit as well and all cache pages > have been reclaimed already) and the process selected by memcg OOM > killer is blocked on i_mutex on the same inode (e.g. truncate it). 
> > Process A > [] do_truncate+0x58/0xa0 # takes i_mutex > [] do_last+0x250/0xa30 > [] path_openat+0xd7/0x440 > [] do_filp_open+0x49/0xa0 > [] do_sys_open+0x106/0x240 > [] sys_open+0x20/0x30 > [] system_call_fastpath+0x18/0x1d > [] 0xffffffffffffffff > > Process B > [] mem_cgroup_handle_oom+0x241/0x3b0 > [] T.1146+0x5ab/0x5c0 > [] mem_cgroup_cache_charge+0xbe/0xe0 > [] add_to_page_cache_locked+0x4c/0x140 > [] add_to_page_cache_lru+0x22/0x50 > [] grab_cache_page_write_begin+0x8b/0xe0 > [] ext3_write_begin+0x88/0x270 > [] generic_file_buffered_write+0x116/0x290 > [] __generic_file_aio_write+0x27c/0x480 > [] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex > [] do_sync_write+0xea/0x130 > [] vfs_write+0xf3/0x1f0 > [] sys_write+0x51/0x90 > [] system_call_fastpath+0x18/0x1d > [] 0xffffffffffffffff > > This is not a hard deadlock though because administrator can still > intervene and increase the limit on the group which helps the writer to > finish the allocation and release the lock. > > This patch heals the problem by forbidding OOM from page cache charges > (namely add_ro_page_cache_locked). mem_cgroup_cache_charge grows oom > argument which is pushed down the call chain. > > As a possibly visible result add_to_page_cache_lru might fail more often > with ENOMEM but this is to be expected if the limit is set and it is > preferable than OOM killer IMO. > > Changes since v1 > - do not abuse gfp_flags and rather use oom parameter directly as per > Johannes > - handle also shmem write fauls resp. fallocate properly as per Johannes > > Reported-by: azurIt > Signed-off-by: Michal Hocko Acked-by: Johannes Weiner Thanks, Michal! 
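Editorial sketch, not part of the thread: the changelog quoted above describes forbidding the memcg OOM killer for page cache charges by pushing an oom flag down the call chain, with the shmem call sites allowing OOM only when sgp is neither SGP_WRITE nor SGP_FALLOC. The userspace C model below only illustrates that decision. Every name in it (toy_memcg, toy_cache_charge, toy_shmem_oom_allowed, the TOY_SGP_* values) is hypothetical and is not kernel API, and the reclaim and kill steps are simplified assumptions rather than what mm/memcontrol.c actually does.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a memory cgroup; NOT the kernel's struct mem_cgroup. */
struct toy_memcg {
	size_t usage_pages;     /* pages currently charged */
	size_t limit_pages;     /* hard limit in pages */
	size_t reclaimable;     /* pages that reclaim could still free */
	unsigned int oom_kills; /* how often the toy OOM killer ran */
};

/* Hypothetical subset of the shmem "sgp" modes named in the patch. */
enum toy_sgp { TOY_SGP_READ, TOY_SGP_CACHE, TOY_SGP_WRITE, TOY_SGP_FALLOC };

/*
 * Mirrors the condition the patch passes at the shmem call sites:
 * write and fallocate may run with i_mutex held, so OOM is forbidden there.
 */
static bool toy_shmem_oom_allowed(enum toy_sgp sgp)
{
	return sgp != TOY_SGP_WRITE && sgp != TOY_SGP_FALLOC;
}

/*
 * Toy model of a page cache charge.
 * Returns 0 on success, or -ENOMEM when the group is at its limit,
 * nothing is reclaimable and the caller forbade the OOM killer.
 */
static int toy_cache_charge(struct toy_memcg *cg, bool oom)
{
	if (cg->usage_pages < cg->limit_pages) {
		cg->usage_pages++;
		return 0;
	}
	if (cg->reclaimable > 0) {
		/* One page is reclaimed and its room reused for this charge. */
		cg->reclaimable--;
		return 0;
	}
	if (!oom)
		return -ENOMEM; /* caller unwinds and drops its locks */

	/*
	 * oom == true: assume the victim exits and frees one page, which
	 * this charge takes over. In the real kernel the charger waits
	 * here, which is where the reported hang arises if the victim is
	 * blocked on a lock the charger holds.
	 */
	cg->oom_kills++;
	return 0;
}
```

With a group at its limit and nothing left to reclaim, toy_cache_charge(&cg, false) returns -ENOMEM without counting a kill, which corresponds to the "add_to_page_cache_lru might fail more often with ENOMEM" trade-off stated in the changelog.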
From mboxrd@z Thu Jan 1 00:00:00 1970 From: Hugh Dickins Subject: Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked Date: Wed, 28 Nov 2012 12:20:44 -0800 (PST) Message-ID: References: <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <50B403CA.501@jp.fujitsu.com> <20121127194813.GP24381@cmpxchg.org> <20121127205431.GA2433@dhcp22.suse.cz> <20121127205944.GB2433@dhcp22.suse.cz> <20121128152631.GT24381@cmpxchg.org> <20121128160447.GH12309@dhcp22.suse.cz> <20121128163736.GV24381@cmpxchg.org> <20121128164640.GB22201@dhcp22.suse.cz> <20121128164824.GC22201@dhcp22.suse.cz> Mime-Version: 1.0 Return-path: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=date:from:x-x-sender:to:cc:subject:in-reply-to:message-id :references:user-agent:mime-version:content-type; bh=IfDMQ0JzoQKnNC9NU9501BSTjXxFZCPwOUE2nsBVPXk=; b=SPcJK3zFLpSITVeOA7iGH4FlNJ1yJ0dOU4/ndQ0hyJHEKSOp6pj8PiGHEmMYddyV7+ 0+SPbX15Uo/ApShJJ+fYuYta+dy/lxNxVfjFlsD6cU7I0P2SkVMWENK/eN1l0wjgDwMx 06bjl2TfbdTXBwLzoiX3gWAk6xcBlGOHxkygtJmt25/MfimItH72cYEYd9ADYiFRv37N ArrwT7id3WRZJOD5FYd9yKs1HHuEEkTARcp5DGX/JXGZCbLAEK3cVrALixg6bXGx+bOJ SVlGvCGoA1wNRt3NHycK0gGH6pV2ibw7mbckDg4t8cw/7RirruCh0yJ5scSPCJaMprFP iqEA== In-Reply-To: <20121128164824.GC22201-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: TEXT/PLAIN; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Michal Hocko Cc: Johannes Weiner , KAMEZAWA Hiroyuki , azurIt , linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist On Wed, 28 Nov 2012, Michal Hocko wrote: > From e21bb704947e9a477ec1df9121575c606dbfcb52 Mon Sep 17 00:00:00 2001 > From: Michal Hocko > Date: Wed, 28 Nov 2012 17:46:32 +0100 > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked > > memcg oom killer might deadlock if the process which falls down to > 
mem_cgroup_handle_oom holds a lock which prevents other task to > terminate because it is blocked on the very same lock. > This can happen when a write system call needs to allocate a page but > the allocation hits the memcg hard limit and there is nothing to reclaim > (e.g. there is no swap or swap limit is hit as well and all cache pages > have been reclaimed already) and the process selected by memcg OOM > killer is blocked on i_mutex on the same inode (e.g. truncate it). > > Process A > [] do_truncate+0x58/0xa0 # takes i_mutex > [] do_last+0x250/0xa30 > [] path_openat+0xd7/0x440 > [] do_filp_open+0x49/0xa0 > [] do_sys_open+0x106/0x240 > [] sys_open+0x20/0x30 > [] system_call_fastpath+0x18/0x1d > [] 0xffffffffffffffff > > Process B > [] mem_cgroup_handle_oom+0x241/0x3b0 > [] T.1146+0x5ab/0x5c0 > [] mem_cgroup_cache_charge+0xbe/0xe0 > [] add_to_page_cache_locked+0x4c/0x140 > [] add_to_page_cache_lru+0x22/0x50 > [] grab_cache_page_write_begin+0x8b/0xe0 > [] ext3_write_begin+0x88/0x270 > [] generic_file_buffered_write+0x116/0x290 > [] __generic_file_aio_write+0x27c/0x480 > [] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex > [] do_sync_write+0xea/0x130 > [] vfs_write+0xf3/0x1f0 > [] sys_write+0x51/0x90 > [] system_call_fastpath+0x18/0x1d > [] 0xffffffffffffffff > > This is not a hard deadlock though because administrator can still > intervene and increase the limit on the group which helps the writer to > finish the allocation and release the lock. > > This patch heals the problem by forbidding OOM from page cache charges > (namely add_ro_page_cache_locked). mem_cgroup_cache_charge grows oom > argument which is pushed down the call chain. > > As a possibly visible result add_to_page_cache_lru might fail more often > with ENOMEM but this is to be expected if the limit is set and it is > preferable than OOM killer IMO. > > Changes since v1 > - do not abuse gfp_flags and rather use oom parameter directly as per > Johannes > - handle also shmem write fauls resp. 
fallocate properly as per Johannes > > Reported-by: azurIt > Signed-off-by: Michal Hocko Sorry, Michal, you've laboured hard on this: but I dislike it so much that I'm here overcoming my dread of entering an OOM-killer discussion, and the resultant deluge of unwelcome CCs for eternity afterwards. I had been relying on Johannes to repeat his "This issue has been around for a while so frankly I don't think it's urgent enough to rush things", but it looks like I have to be the one to repeat it. Your analysis of azurIt's traces may well be correct, and this patch may indeed ameliorate the situation, and it's fine as something for azurIt to try and report on and keep in his tree; but I hope that it does not go upstream and to stable. Why do I dislike it so much? I suppose because it's both too general and too limited at the same time. Too general in that it changes the behaviour on OOM for a large set of memcg charges, all those that go through add_to_page_cache_locked(), when only a subset of those have the i_mutex issue. If you're going to be that general, why not go further? Leave the mem_cgroup_cache_charge() interface as is, make it not-OOM internally, no need for SGP_WRITE,SGP_FALLOC distinctions in mm/shmem.c. No other filesystem gets the benefit of those distinctions: isn't it better to keep it simple? (And I can see a partial truncation case where shmem uses SGP_READ under i_mutex; and the change to shmem_unuse behaviour is a non-issue, since swapoff invites itself to be killed anyway.) Too limited in that i_mutex is just the held resource which azurIt's traces have led you to, but it's a general problem that the OOM-killed task might be waiting for a resource that the OOM-killing task holds. I suspect that if we try hard enough (I admit I have not), we can find an example of such a potential deadlock for almost every memcg charge site. mmap_sem? 
not as easy to invent a case with that as I thought, since it needs a down_write, and the typical page allocations happen with down_read, and I can't think of a process which does down_write on another's mm. But i_mutex is always good, once you remember the case of write to file from userspace page which got paged out, so the fault path has to read it back in, while i_mutex is still held at the outer level. An unusual case? Well, normally yes, but we're considering out-of-memory conditions, which may converge upon cases like this. Wouldn't it be nice if I could be constructive? But I'm sceptical even of Johannes's faith in what the global OOM killer would do: how does __alloc_pages_slowpath() get out of its "goto restart" loop, excepting the trivial case when the killer is the killed? I wonder why this issue has hit azurIt and no other reporter? No swap plays a part in it, but that's not so unusual. Yours glOOMily, Hugh > --- > include/linux/memcontrol.h | 5 +++-- > mm/filemap.c | 9 +++++++-- > mm/memcontrol.c | 20 ++++++++++---------- > mm/shmem.c | 17 ++++++++++++++--- > 4 files changed, 34 insertions(+), 17 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 095d2b4..8f48d5e 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -63,7 +63,7 @@ extern void mem_cgroup_commit_charge_swapin(struct page *page, > extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg); > > extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, > - gfp_t gfp_mask); > + gfp_t gfp_mask, bool oom); > > struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *); > struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *); > @@ -210,7 +210,8 @@ static inline int mem_cgroup_newpage_charge(struct page *page, > } > > static inline int mem_cgroup_cache_charge(struct page *page, > - struct mm_struct *mm, gfp_t gfp_mask) > + struct mm_struct *mm, gfp_t gfp_mask, > + 
bool oom) > { > return 0; > } > diff --git a/mm/filemap.c b/mm/filemap.c > index 83efee7..ef8fbd5 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -447,8 +447,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, > VM_BUG_ON(!PageLocked(page)); > VM_BUG_ON(PageSwapBacked(page)); > > - error = mem_cgroup_cache_charge(page, current->mm, > - gfp_mask & GFP_RECLAIM_MASK); > + /* > + * Cannot trigger OOM even if gfp_mask would allow that normally > + * because we might be called from a locked context and that > + * could lead to deadlocks if the killed process is waiting for > + * the same lock. > + */ > + error = mem_cgroup_cache_charge(page, current->mm, gfp_mask, false); > if (error) > goto out; > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 02ee2f7..3c9b1c5 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -3709,11 +3709,10 @@ out: > * < 0 if the cgroup is over its limit > */ > static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm, > - gfp_t gfp_mask, enum charge_type ctype) > + gfp_t gfp_mask, enum charge_type ctype, bool oom) > { > struct mem_cgroup *memcg = NULL; > unsigned int nr_pages = 1; > - bool oom = true; > int ret; > > if (PageTransHuge(page)) { > @@ -3742,7 +3741,7 @@ int mem_cgroup_newpage_charge(struct page *page, > VM_BUG_ON(page->mapping && !PageAnon(page)); > VM_BUG_ON(!mm); > return mem_cgroup_charge_common(page, mm, gfp_mask, > - MEM_CGROUP_CHARGE_TYPE_ANON); > + MEM_CGROUP_CHARGE_TYPE_ANON, true); > } > > /* > @@ -3754,7 +3753,8 @@ int mem_cgroup_newpage_charge(struct page *page, > static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm, > struct page *page, > gfp_t mask, > - struct mem_cgroup **memcgp) > + struct mem_cgroup **memcgp, > + bool oom) > { > struct mem_cgroup *memcg; > struct page_cgroup *pc; > @@ -3776,13 +3776,13 @@ static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm, > if (!memcg) > goto charge_cur_mm; > *memcgp = memcg; > - ret = 
__mem_cgroup_try_charge(NULL, mask, 1, memcgp, true); > + ret = __mem_cgroup_try_charge(NULL, mask, 1, memcgp, oom); > css_put(&memcg->css); > if (ret == -EINTR) > ret = 0; > return ret; > charge_cur_mm: > - ret = __mem_cgroup_try_charge(mm, mask, 1, memcgp, true); > + ret = __mem_cgroup_try_charge(mm, mask, 1, memcgp, oom); > if (ret == -EINTR) > ret = 0; > return ret; > @@ -3808,7 +3808,7 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page, > ret = 0; > return ret; > } > - return __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, memcgp); > + return __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, memcgp, true); > } > > void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg) > @@ -3851,7 +3851,7 @@ void mem_cgroup_commit_charge_swapin(struct page *page, > } > > int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, > - gfp_t gfp_mask) > + gfp_t gfp_mask, bool oom) > { > struct mem_cgroup *memcg = NULL; > enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE; > @@ -3863,10 +3863,10 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, > return 0; > > if (!PageSwapCache(page)) > - ret = mem_cgroup_charge_common(page, mm, gfp_mask, type); > + ret = mem_cgroup_charge_common(page, mm, gfp_mask, type, oom); > else { /* page is swapcache/shmem */ > ret = __mem_cgroup_try_charge_swapin(mm, page, > - gfp_mask, &memcg); > + gfp_mask, &memcg, oom); > if (!ret) > __mem_cgroup_commit_charge_swapin(page, memcg, type); > } > diff --git a/mm/shmem.c b/mm/shmem.c > index 55054a7..3b27db4 100644 > --- a/mm/shmem.c > +++ b/mm/shmem.c > @@ -760,7 +760,7 @@ int shmem_unuse(swp_entry_t swap, struct page *page) > * the shmem_swaplist_mutex which might hold up shmem_writepage(). > * Charged back to the user (not to caller) when swap account is used. 
> */ > - error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL); > + error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL, true); > if (error) > goto out; > /* No radix_tree_preload: swap entry keeps a place for page in tree */ > @@ -1152,8 +1152,17 @@ repeat: > goto failed; > } > > + /* > + * Cannot trigger OOM even if gfp_mask would allow that > + * normally because we might be called from a locked > + * context (i_mutex held) if this is a write lock or > + * fallocate and that could lead to deadlocks if the > + * killed process is waiting for the same lock. > + */ > error = mem_cgroup_cache_charge(page, current->mm, > - gfp & GFP_RECLAIM_MASK); > + gfp & GFP_RECLAIM_MASK, > + sgp != SGP_WRITE && > + sgp != SGP_FALLOC); > if (!error) { > error = shmem_add_to_page_cache(page, mapping, index, > gfp, swp_to_radix_entry(swap)); > @@ -1209,7 +1218,9 @@ repeat: > SetPageSwapBacked(page); > __set_page_locked(page); > error = mem_cgroup_cache_charge(page, current->mm, > - gfp & GFP_RECLAIM_MASK); > + gfp & GFP_RECLAIM_MASK, > + sgp != SGP_WRITE && > + sgp != SGP_FALLOC); > if (error) > goto decused; > error = radix_tree_preload(gfp & GFP_RECLAIM_MASK); > -- > 1.7.10.4 > > -- > Michal Hocko > SUSE Labs > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo-Bw31MaZKKs0EbZ0PF+XxCw@public.gmane.org For more info on Linux MM, > see: http://www.linux-mm.org/ . 
> Don't email: email-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org > From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked Date: Thu, 29 Nov 2012 15:05:49 +0100 Message-ID: <20121129140549.GC27887@dhcp22.suse.cz> References: <50B403CA.501@jp.fujitsu.com> <20121127194813.GP24381@cmpxchg.org> <20121127205431.GA2433@dhcp22.suse.cz> <20121127205944.GB2433@dhcp22.suse.cz> <20121128152631.GT24381@cmpxchg.org> <20121128160447.GH12309@dhcp22.suse.cz> <20121128163736.GV24381@cmpxchg.org> <20121128164640.GB22201@dhcp22.suse.cz> <20121128164824.GC22201@dhcp22.suse.cz> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Hugh Dickins Cc: Johannes Weiner , KAMEZAWA Hiroyuki , azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist On Wed 28-11-12 12:20:44, Hugh Dickins wrote: [...] > Sorry, Michal, you've laboured hard on this: but I dislike it so much > that I'm here overcoming my dread of entering an OOM-killer discussion, > and the resultant deluge of unwelcome CCs for eternity afterwards. > > I had been relying on Johannes to repeat his "This issue has been > around for a while so frankly I don't think it's urgent enough to > rush things", but it looks like I have to be the one to repeat it. Well, the idea was to use this only as a temporary fix and come up with a better solution without any hurry. > Your analysis of azurIt's traces may well be correct, and this patch > may indeed ameliorate the situation, and it's fine as something for > azurIt to try and report on and keep in his tree; but I hope that > it does not go upstream and to stable. > > Why do I dislike it so much? I suppose because it's both too general > and too limited at the same time.
> > Too general in that it changes the behaviour on OOM for a large set > of memcg charges, all those that go through add_to_page_cache_locked(), > when only a subset of those have the i_mutex issue. This is a fair point but the real fix which we were discussing with Johannes would be even more risky for stable. > If you're going to be that general, why not go further? Leave the > mem_cgroup_cache_charge() interface as is, make it not-OOM internally, > no need for SGP_WRITE,SGP_FALLOC distinctions in mm/shmem.c. No other > filesystem gets the benefit of those distinctions: isn't it better to > keep it simple? (And I can see a partial truncation case where shmem > uses SGP_READ under i_mutex; and the change to shmem_unuse behaviour > is a non-issue, since swapoff invites itself to be killed anyway.) > > Too limited in that i_mutex is just the held resource which azurIt's > traces have led you to, but it's a general problem that the OOM-killed > task might be waiting for a resource that the OOM-killing task holds. > > I suspect that if we try hard enough (I admit I have not), we can find > an example of such a potential deadlock for almost every memcg charge > site. mmap_sem? not as easy to invent a case with that as I thought, > since it needs a down_write, and the typical page allocations happen > with down_read, and I can't think of a process which does down_write > on another's mm. > > But i_mutex is always good, once you remember the case of write to > file from userspace page which got paged out, so the fault path has > to read it back in, while i_mutex is still held at the outer level. > An unusual case? Well, normally yes, but we're considering > out-of-memory conditions, which may converge upon cases like this. > > Wouldn't it be nice if I could be constructive? 
But I'm sceptical > even of Johannes's faith in what the global OOM killer would do: > how does __alloc_pages_slowpath() get out of its "goto restart" > loop, excepting the trivial case when the killer is the killed? I am not sure I am following you here, but Johannes's idea was to break out of the charge after a signal has been sent and the charge still fails, and then either retry the fault or fail the allocation. I think this should work, but I am afraid that it needs some tuning (e.g. the number of retries) to prevent too aggressive OOM killing or too many failures. Do we have any other possibilities to solve this issue? Or do you think we should ignore the problem just because nobody complained for such a long time? Dunno, I think we should fix this with something less risky for now and come up with a real fix after it sees sufficient testing. > I wonder why this issue has hit azurIt and no other reporter? > No swap plays a part in it, but that's not so unusual. > > Yours glOOMily, > Hugh [...] -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: "azurIt" Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_from_add=5Fto=5Fpage=5Fcache=5Flocked?= Date: Fri, 30 Nov 2012 02:45:12 +0100 Message-ID: <20121130024512.EBFBD851@pobox.sk> References: <20121122214249.GA20319@dhcp22.suse.cz>, <20121122233434.3D5E35E6@pobox.sk>, <20121123074023.GA24698@dhcp22.suse.cz>, <20121123102137.10D6D653@pobox.sk>, <20121123100438.GF24698@dhcp22.suse.cz>, <20121125011047.7477BB5E@pobox.sk>, <20121125120524.GB10623@dhcp22.suse.cz>, <20121125135542.GE10623@dhcp22.suse.cz>, <20121126013855.AF118F5E@pobox.sk>, <20121126131837.GC17860@dhcp22.suse.cz> <20121126132149.GD17860@dhcp22.suse.cz> Mime-Version: 1.0 Content-Transfer-Encoding: 8bit Return-path: In-Reply-To: <20121126132149.GD17860-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" To: =?utf-8?q?Michal_Hocko?= Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= >Here we go with the patch for 3.2.34. Could you test with this one, >please? I installed kernel with this patch, will report back if problem occurs again OR in few weeks if everything will be ok. Thank you! 
azurIt From mboxrd@z Thu Jan 1 00:00:00 1970 From: "azurIt" Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_from_add=5Fto=5Fpage=5Fcache=5Flocked?= Date: Fri, 30 Nov 2012 03:29:18 +0100 Message-ID: <20121130032918.59B3F780@pobox.sk> References: <20121122214249.GA20319@dhcp22.suse.cz>, <20121122233434.3D5E35E6@pobox.sk>, <20121123074023.GA24698@dhcp22.suse.cz>, <20121123102137.10D6D653@pobox.sk>, <20121123100438.GF24698@dhcp22.suse.cz>, <20121125011047.7477BB5E@pobox.sk>, <20121125120524.GB10623@dhcp22.suse.cz>, <20121125135542.GE10623@dhcp22.suse.cz>, <20121126013855.AF118F5E@pobox.sk>, <20121126131837.GC17860@dhcp22.suse.cz> <20121126132149.GD17860@dhcp22.suse.cz> Mime-Version: 1.0 Content-Transfer-Encoding: quoted-printable Return-path: In-Reply-To: <20121126132149.GD17860@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" To: =?utf-8?q?Michal_Hocko?= Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= >Here we go with the patch for 3.2.34. Could you test with this one, >please? Michal, unfortunately i had to boot to another kernel because the one with this patch keeps killing my MySQL server :( it was, probably, doing it on OOM in any cgroup - looks like OOM was not choosing processes only from cgroup which is out of memory. Here is the log from syslog: http://www.watchdog.sk/lkml/oom_mysqld Maybe i should mention that MySQL server has it's own cgroup (called 'mysql') but with no limits to any resources. azurIt -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Date: Fri, 30 Nov 2012 13:45:06 +0100 Message-ID: <20121130124506.GH29317@dhcp22.suse.cz> References: <20121123074023.GA24698@dhcp22.suse.cz> <20121123102137.10D6D653@pobox.sk> <20121123100438.GF24698@dhcp22.suse.cz> <20121125011047.7477BB5E@pobox.sk> <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <20121126132149.GD17860@dhcp22.suse.cz> <20121130032918.59B3F780@pobox.sk> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20121130032918.59B3F780-Rm0zKEqwvD4@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: azurIt Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Fri 30-11-12 03:29:18, azurIt wrote: > >Here we go with the patch for 3.2.34. Could you test with this one, > >please? > > > Michal, unfortunately i had to boot to another kernel because the one > with this patch keeps killing my MySQL server :( it was, probably, > doing it on OOM in any cgroup - looks like OOM was not choosing > processes only from cgroup which is out of memory. Here is the log > from syslog: http://www.watchdog.sk/lkml/oom_mysqld You are seeing also global OOM: Nov 30 02:53:56 server01 kernel: [ 818.233159] Pid: 9247, comm: apache2 Not tainted 3.2.34-grsec #1 Nov 30 02:53:56 server01 kernel: [ 818.233289] Call Trace: Nov 30 02:53:56 server01 kernel: [ 818.233470] [] dump_header+0x7e/0x1e0 Nov 30 02:53:56 server01 kernel: [ 818.233600] [] ? 
find_lock_task_mm+0x2f/0x70
Nov 30 02:53:56 server01 kernel: [ 818.233721] [] oom_kill_process+0x85/0x2a0
Nov 30 02:53:56 server01 kernel: [ 818.233842] [] out_of_memory+0xe5/0x200
Nov 30 02:53:56 server01 kernel: [ 818.233963] [] ? pte_alloc_one+0x3f/0x50
Nov 30 02:53:56 server01 kernel: [ 818.234082] [] pagefault_out_of_memory+0xbd/0x110
Nov 30 02:53:56 server01 kernel: [ 818.234204] [] mm_fault_error+0xb6/0x1a0
Nov 30 02:53:56 server01 kernel: [ 818.235886] [] do_page_fault+0x3ee/0x460
Nov 30 02:53:56 server01 kernel: [ 818.236006] [] ? vma_merge+0x1f7/0x2c0
Nov 30 02:53:56 server01 kernel: [ 818.236124] [] ? do_brk+0x267/0x400
Nov 30 02:53:56 server01 kernel: [ 818.236244] [] ? gr_learn_resource+0x42/0x1e0
Nov 30 02:53:56 server01 kernel: [ 818.236367] [] page_fault+0x1f/0x30
[...]
Nov 30 02:53:56 server01 kernel: [ 818.356297] Out of memory: Kill process 2188 (mysqld) score 60 or sacrifice child
Nov 30 02:53:56 server01 kernel: [ 818.356493] Killed process 2188 (mysqld) total-vm:3330016kB, anon-rss:864176kB, file-rss:8072kB

Then you also have memcg oom killer:
Nov 30 02:53:56 server01 kernel: [ 818.375717] Task in /1037/uid killed as a result of limit of /1037
Nov 30 02:53:56 server01 kernel: [ 818.375886] memory: usage 102400kB, limit 102400kB, failcnt 736
Nov 30 02:53:56 server01 kernel: [ 818.376008] memory+swap: usage 102400kB, limit 102400kB, failcnt 0

The messages are intermixed and I guess rate limiting kicked in as well, because I cannot associate all the oom messages with a specific OOM event.

Anyway your system is under both global and local memory pressure. You didn't see apache going down previously because it was probably the one which was stuck and could be killed.
Anyway you need to set up your system more carefully.

> Maybe i should mention that MySQL server has it's own cgroup (called
> 'mysql') but with no limits to any resources.

Where is that group in the hierarchy?
> > azurIt > -- > To unsubscribe from this list: send the line "unsubscribe cgroups" in > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: "azurIt" Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_from_add=5Fto=5Fpage=5Fcache=5Flocked?= Date: Fri, 30 Nov 2012 13:53:30 +0100 Message-ID: <20121130135330.6D012B71@pobox.sk> References: <20121123074023.GA24698@dhcp22.suse.cz>, <20121123102137.10D6D653@pobox.sk>, <20121123100438.GF24698@dhcp22.suse.cz>, <20121125011047.7477BB5E@pobox.sk>, <20121125120524.GB10623@dhcp22.suse.cz>, <20121125135542.GE10623@dhcp22.suse.cz>, <20121126013855.AF118F5E@pobox.sk>, <20121126131837.GC17860@dhcp22.suse.cz>, <20121126132149.GD17860@dhcp22.suse.cz>, <20121130032918.59B3F780@pobox.sk> <20121130124506.GH29317@dhcp22.suse.cz> Mime-Version: 1.0 Content-Transfer-Encoding: 8bit Return-path: In-Reply-To: <20121130124506.GH29317-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" To: =?utf-8?q?Michal_Hocko?= Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= >Anyway your system is under both global and local memory pressure. You >didn't see apache going down previously because it was probably the one >which was stuck and could be killed. >Anyway you need to setup your system more carefully. No, it wasn't, i'm 1000% sure (i was on SSH). Here is the memory usage graph from that system on that time: http://www.watchdog.sk/lkml/memory.png The blank part is rebooting into new kernel. MySQL server was killed several times, then i rebooted into previous kernel and problem was gone (not a single MySQL kill). 
You can see two MySQL kills there on 03:54 and 03:04:30. > >> Maybe i should mention that MySQL server has it's own cgroup (called >> 'mysql') but with no limits to any resources. > >Where is that group in the hierarchy? In root. From mboxrd@z Thu Jan 1 00:00:00 1970 From: "azurIt" Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_from_add=5Fto=5Fpage=5Fcache=5Flocked?= Date: Fri, 30 Nov 2012 14:44:27 +0100 Message-ID: <20121130144427.51A09169@pobox.sk> References: <20121123074023.GA24698@dhcp22.suse.cz>, <20121123102137.10D6D653@pobox.sk>, <20121123100438.GF24698@dhcp22.suse.cz>, <20121125011047.7477BB5E@pobox.sk>, <20121125120524.GB10623@dhcp22.suse.cz>, <20121125135542.GE10623@dhcp22.suse.cz>, <20121126013855.AF118F5E@pobox.sk>, <20121126131837.GC17860@dhcp22.suse.cz>, <20121126132149.GD17860@dhcp22.suse.cz>, <20121130032918.59B3F780@pobox.sk> <20121130124506.GH29317@dhcp22.suse.cz> Mime-Version: 1.0 Content-Transfer-Encoding: quoted-printable Return-path: In-Reply-To: <20121130124506.GH29317@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" To: =?utf-8?q?Michal_Hocko?= Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= >Anyway your system is under both global and local memory pressure. You >didn't see apache going down previously because it was probably the one >which was stuck and could be killed. >Anyway you need to setup your system more carefully. There is, also, an evidence that system has enough of memory! :) Just take column 'rss' from process list in OOM message and sum it - you will get 2489911. It's probably in KB so it's about 2.4 GB. System has 14 GB of RAM so this also match data on my graph - 2.4 is about 17% of 14. azur -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Date: Fri, 30 Nov 2012 15:44:31 +0100 Message-ID: <20121130144431.GI29317@dhcp22.suse.cz> References: <20121123100438.GF24698@dhcp22.suse.cz> <20121125011047.7477BB5E@pobox.sk> <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <20121126132149.GD17860@dhcp22.suse.cz> <20121130032918.59B3F780@pobox.sk> <20121130124506.GH29317@dhcp22.suse.cz> <20121130144427.51A09169@pobox.sk> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20121130144427.51A09169-Rm0zKEqwvD4@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: azurIt Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Fri 30-11-12 14:44:27, azurIt wrote: > >Anyway your system is under both global and local memory pressure. You > >didn't see apache going down previously because it was probably the one > >which was stuck and could be killed. > >Anyway you need to setup your system more carefully. > > > There is, also, an evidence that system has enough of memory! :) Just > take column 'rss' from process list in OOM message and sum it - you > will get 2489911. It's probably in KB so it's about 2.4 GB. System has > 14 GB of RAM so this also match data on my graph - 2.4 is about 17% of > 14. 
Hmm, that corresponds to the ZONE_DMA32 size pretty nicely but that zone is hardly touched: Nov 30 02:53:56 server01 kernel: [ 818.241291] DMA32 free:2523636kB min:2672kB low:3340kB high:4008kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2542248kB mlocked:0kB dirty:0kB writeback:0kB mapped:4kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no DMA32 zone is usually fills up first 4G unless your HW remaps the rest of the memory above 4G or you have a numa machine and the rest of the memory is at other node. Could you post your memory map printed during the boot? (e820: BIOS-provided physical RAM map: and following lines) There is also ZONE_NORMAL which is also not used much Nov 30 02:53:56 server01 kernel: [ 818.242163] Normal free:6924716kB min:12512kB low:15640kB high:18768kB active_anon:1463128kB inactive_anon:2072kB active_file:1803964kB inactive_file:1072628kB unevictable:3924kB isolated(anon):0kB isolated(file):0kB present:11893760kB mlocked:3924kB dirty:1000kB writeback:776kB mapped:35656kB shmem:3828kB slab_reclaimable:202560kB slab_unreclaimable:50696kB kernel_stack:2944kB pagetables:158616kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no You have mentioned that you are comounting with cpuset. If this happens to be a NUMA machine have you made the access to all nodes available? Also what does /proc/sys/vm/zone_reclaim_mode says? 
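As a rough cross-check of the two zone lines quoted above, the per-zone usage can be derived from the 'free' and 'present' counters. This is only a minimal sketch in Python: the helper name zone_usage is made up for illustration, the report lines are shortened to the fields used, and "present minus free" is an approximation of what is in use (it ignores reserved pages).

```python
import re

# The two zone lines from the OOM report above, shortened to the fields used here.
REPORT = """\
DMA32 free:2523636kB min:2672kB low:3340kB high:4008kB present:2542248kB
Normal free:6924716kB min:12512kB low:15640kB high:18768kB present:11893760kB
"""

# Zone name at line start, then the 'free' counter, then 'present' later on the same line.
ZONE_RE = re.compile(
    r"^(?P<zone>\w+) free:(?P<free>\d+)kB .*?present:(?P<present>\d+)kB",
    re.MULTILINE,
)


def zone_usage(report):
    """Return {zone: (used_kb, used_fraction)}; hypothetical helper, not a kernel tool."""
    usage = {}
    for match in ZONE_RE.finditer(report):
        free_kb = int(match.group("free"))
        present_kb = int(match.group("present"))
        used_kb = present_kb - free_kb  # approximation, see the note above
        usage[match.group("zone")] = (used_kb, used_kb / present_kb)
    return usage


for zone, (used_kb, fraction) in zone_usage(REPORT).items():
    print(f"{zone}: {used_kb} kB in use ({fraction:.1%} of present)")
```

With these numbers DMA32 comes out well under one percent in use and Normal below half, which is consistent with neither zone being short of free pages at the time of the report.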
-- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Date: Fri, 30 Nov 2012 16:03:47 +0100 Message-ID: <20121130150347.GJ29317@dhcp22.suse.cz> References: <20121125011047.7477BB5E@pobox.sk> <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <20121126132149.GD17860@dhcp22.suse.cz> <20121130032918.59B3F780@pobox.sk> <20121130124506.GH29317@dhcp22.suse.cz> <20121130144427.51A09169@pobox.sk> <20121130144431.GI29317@dhcp22.suse.cz> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20121130144431.GI29317-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: azurIt Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Fri 30-11-12 15:44:31, Michal Hocko wrote: > On Fri 30-11-12 14:44:27, azurIt wrote: > > >Anyway your system is under both global and local memory pressure. You > > >didn't see apache going down previously because it was probably the one > > >which was stuck and could be killed. > > >Anyway you need to setup your system more carefully. > > > > > > There is, also, an evidence that system has enough of memory! :) Just > > take column 'rss' from process list in OOM message and sum it - you > > will get 2489911. It's probably in KB so it's about 2.4 GB. System has > > 14 GB of RAM so this also match data on my graph - 2.4 is about 17% of > > 14. 
> > Hmm, that corresponds to the ZONE_DMA32 size pretty nicely but that zone > is hardly touched: > Nov 30 02:53:56 server01 kernel: [ 818.241291] DMA32 free:2523636kB min:2672kB low:3340kB high:4008kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2542248kB mlocked:0kB dirty:0kB writeback:0kB mapped:4kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no > > DMA32 zone is usually fills up first 4G unless your HW remaps the rest > of the memory above 4G or you have a numa machine and the rest of the > memory is at other node. Could you post your memory map printed during > the boot? (e820: BIOS-provided physical RAM map: and following lines) > > There is also ZONE_NORMAL which is also not used much > Nov 30 02:53:56 server01 kernel: [ 818.242163] Normal free:6924716kB min:12512kB low:15640kB high:18768kB active_anon:1463128kB inactive_anon:2072kB active_file:1803964kB inactive_file:1072628kB unevictable:3924kB isolated(anon):0kB isolated(file):0kB present:11893760kB mlocked:3924kB dirty:1000kB writeback:776kB mapped:35656kB shmem:3828kB slab_reclaimable:202560kB slab_unreclaimable:50696kB kernel_stack:2944kB pagetables:158616kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no > > You have mentioned that you are comounting with cpuset. If this happens > to be a NUMA machine have you made the access to all nodes available? 
And now that I am looking at the oom message more closely I can see Nov 30 02:53:56 server01 kernel: [ 818.232812] apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0 Nov 30 02:53:56 server01 kernel: [ 818.233029] apache2 cpuset=uid mems_allowed=0 Nov 30 02:53:56 server01 kernel: [ 818.233159] Pid: 9247, comm: apache2 Not tainted 3.2.34-grsec #1 Nov 30 02:53:56 server01 kernel: [ 818.233289] Call Trace: Nov 30 02:53:56 server01 kernel: [ 818.233470] [] dump_header+0x7e/0x1e0 Nov 30 02:53:56 server01 kernel: [ 818.233600] [] ? find_lock_task_mm+0x2f/0x70 Nov 30 02:53:56 server01 kernel: [ 818.233721] [] oom_kill_process+0x85/0x2a0 Nov 30 02:53:56 server01 kernel: [ 818.233842] [] out_of_memory+0xe5/0x200 Nov 30 02:53:56 server01 kernel: [ 818.233963] [] ? pte_alloc_one+0x3f/0x50 Nov 30 02:53:56 server01 kernel: [ 818.234082] [] pagefault_out_of_memory+0xbd/0x110 Nov 30 02:53:56 server01 kernel: [ 818.234204] [] mm_fault_error+0xb6/0x1a0 Nov 30 02:53:56 server01 kernel: [ 818.235886] [] do_page_fault+0x3ee/0x460 Nov 30 02:53:56 server01 kernel: [ 818.236006] [] ? vma_merge+0x1f7/0x2c0 Nov 30 02:53:56 server01 kernel: [ 818.236124] [] ? do_brk+0x267/0x400 Nov 30 02:53:56 server01 kernel: [ 818.236244] [] ? gr_learn_resource+0x42/0x1e0 Nov 30 02:53:56 server01 kernel: [ 818.236367] [] page_fault+0x1f/0x30 Which is interesting from 2 perspectives. Only the first node (Node-0) is allowed which would suggest that the cpuset controller is not configured to all nodes. It is still surprising Node 0 wouldn't have any memory (I would expect ZONE_DMA32 would be sitting there). Anyway, the more interesting thing is gfp_mask is GFP_NOWAIT allocation from the page fault? Huh this shouldn't happen - ever. > Also what does /proc/sys/vm/zone_reclaim_mode says? 
> -- > Michal Hocko > SUSE Labs > -- > To unsubscribe from this list: send the line "unsubscribe cgroups" in > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: "azurIt" Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_from_add=5Fto=5Fpage=5Fcache=5Flocked?= Date: Fri, 30 Nov 2012 16:08:11 +0100 Message-ID: <20121130160811.6BB25BDD@pobox.sk> References: <20121123100438.GF24698@dhcp22.suse.cz>, <20121125011047.7477BB5E@pobox.sk>, <20121125120524.GB10623@dhcp22.suse.cz>, <20121125135542.GE10623@dhcp22.suse.cz>, <20121126013855.AF118F5E@pobox.sk>, <20121126131837.GC17860@dhcp22.suse.cz>, <20121126132149.GD17860@dhcp22.suse.cz>, <20121130032918.59B3F780@pobox.sk>, <20121130124506.GH29317@dhcp22.suse.cz>, <20121130144427.51A09169@pobox.sk> <20121130144431.GI29317@dhcp22.suse.cz> Mime-Version: 1.0 Content-Transfer-Encoding: quoted-printable Return-path: In-Reply-To: <20121130144431.GI29317@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" To: =?utf-8?q?Michal_Hocko?= Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= >DMA32 zone is usually fills up first 4G unless your HW remaps the rest >of the memory above 4G or you have a numa machine and the rest of the >memory is at other node. Could you post your memory map printed during >the boot? (e820: BIOS-provided physical RAM map: and following lines) Here is the full boot log: www.watchdog.sk/lkml/kern.log >You have mentioned that you are comounting with cpuset. If this happens >to be a NUMA machine have you made the access to all nodes available? >Also what does /proc/sys/vm/zone_reclaim_mode says? 
Don't really know what NUMA means and which nodes are you talking about, sorry :( # cat /proc/sys/vm/zone_reclaim_mode cat: /proc/sys/vm/zone_reclaim_mode: No such file or directory azur -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Date: Fri, 30 Nov 2012 16:37:15 +0100 Message-ID: <20121130153715.GK29317@dhcp22.suse.cz> References: <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <20121126132149.GD17860@dhcp22.suse.cz> <20121130032918.59B3F780@pobox.sk> <20121130124506.GH29317@dhcp22.suse.cz> <20121130144427.51A09169@pobox.sk> <20121130144431.GI29317@dhcp22.suse.cz> <20121130150347.GJ29317@dhcp22.suse.cz> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20121130150347.GJ29317@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Fri 30-11-12 16:03:47, Michal Hocko wrote: [...] > Anyway, the more interesting thing is gfp_mask is GFP_NOWAIT allocation > from the page fault? Huh this shouldn't happen - ever. OK, it starts making sense now. The message came from pagefault_out_of_memory which doesn't have gfp nor the required node information any longer. This suggests that VM_FAULT_OOM has been returned by the fault handler. So this hasn't been triggered by the page fault allocator.
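The same reading of the report can be done mechanically when there are many OOM messages to go through. The following is a minimal sketch in Python, not an existing tool: the function name oom_origin and the returned keys are made up for illustration, and the sample text is cut down from the report quoted earlier in this thread. It only checks the two things used in the reasoning above, the gfp_mask in the header and whether pagefault_out_of_memory shows up in the trace.

```python
import re

# Header and one trace line, cut down from the OOM report quoted earlier in the thread.
SAMPLE = """\
apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0
apache2 cpuset=uid mems_allowed=0
 pagefault_out_of_memory+0xbd/0x110
 mm_fault_error+0xb6/0x1a0
"""

HEADER_RE = re.compile(
    r"(?P<comm>\S+) invoked oom-killer: "
    r"gfp_mask=(?P<gfp>0x[0-9a-fA-F]+), order=(?P<order>\d+)"
)


def oom_origin(report):
    """Summarise where an OOM report came from; hypothetical helper for log triage."""
    header = HEADER_RE.search(report)
    if header is None:
        return None  # not an oom-killer report
    return {
        "comm": header.group("comm"),
        "gfp_mask": int(header.group("gfp"), 16),
        "order": int(header.group("order")),
        # True when the trace went through the page fault OOM path,
        # i.e. the fault handler returned VM_FAULT_OOM.
        "from_page_fault_path": "pagefault_out_of_memory" in report,
    }


print(oom_origin(SAMPLE))
```

A report with gfp_mask 0 and from_page_fault_path set is the case described above: the mask says nothing about the failed allocation, so the interesting question is which fault path returned VM_FAULT_OOM.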
I am wondering whether this could be caused by the patch but the effect of that one should be limited to the write (unlike the later version for -mm tree which hooks into the shmem as well). Will have to think about it some more. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Date: Fri, 30 Nov 2012 16:39:42 +0100 Message-ID: <20121130153942.GL29317@dhcp22.suse.cz> References: <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <20121126132149.GD17860@dhcp22.suse.cz> <20121130032918.59B3F780@pobox.sk> <20121130124506.GH29317@dhcp22.suse.cz> <20121130144427.51A09169@pobox.sk> <20121130144431.GI29317@dhcp22.suse.cz> <20121130160811.6BB25BDD@pobox.sk> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20121130160811.6BB25BDD@pobox.sk> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Fri 30-11-12 16:08:11, azurIt wrote: > >DMA32 zone is usually fills up first 4G unless your HW remaps the rest > >of the memory above 4G or you have a numa machine and the rest of the > >memory is at other node. Could you post your memory map printed during > >the boot? (e820: BIOS-provided physical RAM map: and following lines) > > > Here is the full boot log: > www.watchdog.sk/lkml/kern.log The log is not complete. Could you paste the complete dmesg output? Or even better, do you have logs from the previous run? > >You have mentioned that you are comounting with cpuset.
If this happens > >to be a NUMA machine have you made the access to all nodes available? > >Also what does /proc/sys/vm/zone_reclaim_mode says? > > > Don't really know what NUMA means and which nodes are you talking > about, sorry :( http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access > # cat /proc/sys/vm/zone_reclaim_mode > cat: /proc/sys/vm/zone_reclaim_mode: No such file or directory OK, so the NUMA is not enabled. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: "azurIt" Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_from_add=5Fto=5Fpage=5Fcache=5Flocked?= Date: Fri, 30 Nov 2012 16:59:37 +0100 Message-ID: <20121130165937.F9564EBE@pobox.sk> References: <20121125120524.GB10623@dhcp22.suse.cz>, <20121125135542.GE10623@dhcp22.suse.cz>, <20121126013855.AF118F5E@pobox.sk>, <20121126131837.GC17860@dhcp22.suse.cz>, <20121126132149.GD17860@dhcp22.suse.cz>, <20121130032918.59B3F780@pobox.sk>, <20121130124506.GH29317@dhcp22.suse.cz>, <20121130144427.51A09169@pobox.sk>, <20121130144431.GI29317@dhcp22.suse.cz>, <20121130160811.6BB25BDD@pobox.sk> <20121130153942.GL29317@dhcp22.suse.cz> Mime-Version: 1.0 Content-Transfer-Encoding: 8bit Return-path: In-Reply-To: <20121130153942.GL29317-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" To: =?utf-8?q?Michal_Hocko?= Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= >> Here is the full boot log: >> www.watchdog.sk/lkml/kern.log > >The log is not complete. Could you paste the comple dmesg output? 
Or >even better, do you have logs from the previous run? What is missing there? All kernel messages are logging into /var/log/kern.log (it's the same as dmesg), dmesg itself was already rewrited by other messages. I think it's all what that kernel printed. From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Date: Fri, 30 Nov 2012 17:19:23 +0100 Message-ID: <20121130161923.GN29317@dhcp22.suse.cz> References: <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <20121126132149.GD17860@dhcp22.suse.cz> <20121130032918.59B3F780@pobox.sk> <20121130124506.GH29317@dhcp22.suse.cz> <20121130144427.51A09169@pobox.sk> <20121130144431.GI29317@dhcp22.suse.cz> <20121130160811.6BB25BDD@pobox.sk> <20121130153942.GL29317@dhcp22.suse.cz> <20121130165937.F9564EBE@pobox.sk> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20121130165937.F9564EBE@pobox.sk> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Fri 30-11-12 16:59:37, azurIt wrote: > >> Here is the full boot log: > >> www.watchdog.sk/lkml/kern.log > > > >The log is not complete. Could you paste the comple dmesg output? Or > >even better, do you have logs from the previous run? > > > What is missing there? All kernel messages are logging into > /var/log/kern.log (it's the same as dmesg), dmesg itself was already > rewrited by other messages. I think it's all what that kernel printed. Early boot messages are missing - so exactly the BIOS memory map I was asking for. As the NUMA has been excluded it is probably not that relevant anymore. The important question is why you see VM_FAULT_OOM and whether memcg charging failure can trigger that. 
I don not see how this could happen right now because __GFP_NORETRY is not used for user pages (except for THP which disable memcg OOM already), file backed page faults (aka __do_fault) use mem_cgroup_newpage_charge which doesn't disable OOM. This is a real head scratcher. Could you also post your complete containers configuration, maybe there is something strange in there (basically grep . -r YOUR_CGROUP_MNT except for tasks files which are of no use right now). -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: "azurIt" Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_from_add=5Fto=5Fpage=5Fcache=5Flocked?= Date: Fri, 30 Nov 2012 17:26:51 +0100 Message-ID: <20121130172651.B6917602@pobox.sk> References: <20121126013855.AF118F5E@pobox.sk>, <20121126131837.GC17860@dhcp22.suse.cz>, <20121126132149.GD17860@dhcp22.suse.cz>, <20121130032918.59B3F780@pobox.sk>, <20121130124506.GH29317@dhcp22.suse.cz>, <20121130144427.51A09169@pobox.sk>, <20121130144431.GI29317@dhcp22.suse.cz>, <20121130160811.6BB25BDD@pobox.sk>, <20121130153942.GL29317@dhcp22.suse.cz>, <20121130165937.F9564EBE@pobox.sk> <20121130161923.GN29317@dhcp22.suse.cz> Mime-Version: 1.0 Content-Transfer-Encoding: quoted-printable Return-path: In-Reply-To: <20121130161923.GN29317@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" To: =?utf-8?q?Michal_Hocko?= Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= >Could you also post your complete containers configuration, maybe there >is something strange in there (basically grep . -r YOUR_CGROUP_MNT >except for tasks files which are of no use right now). 
Here it is: http://www.watchdog.sk/lkml/cgroups.gz -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Date: Fri, 30 Nov 2012 17:53:47 +0100 Message-ID: <20121130165347.GO29317@dhcp22.suse.cz> References: <20121126132149.GD17860@dhcp22.suse.cz> <20121130032918.59B3F780@pobox.sk> <20121130124506.GH29317@dhcp22.suse.cz> <20121130144427.51A09169@pobox.sk> <20121130144431.GI29317@dhcp22.suse.cz> <20121130160811.6BB25BDD@pobox.sk> <20121130153942.GL29317@dhcp22.suse.cz> <20121130165937.F9564EBE@pobox.sk> <20121130161923.GN29317@dhcp22.suse.cz> <20121130172651.B6917602@pobox.sk> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20121130172651.B6917602-Rm0zKEqwvD4@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: azurIt Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Fri 30-11-12 17:26:51, azurIt wrote: > >Could you also post your complete containers configuration, maybe there > >is something strange in there (basically grep . -r YOUR_CGROUP_MNT > >except for tasks files which are of no use right now). > > > Here it is: > http://www.watchdog.sk/lkml/cgroups.gz The only strange thing I noticed is that some groups have 0 limit. Is this intentional? 
grep memory.limit_in_bytes cgroups | grep -v uid | sed 's@.*/@@' | sort | uniq -c 3 memory.limit_in_bytes:0 254 memory.limit_in_bytes:104857600 107 memory.limit_in_bytes:157286400 68 memory.limit_in_bytes:209715200 10 memory.limit_in_bytes:262144000 28 memory.limit_in_bytes:314572800 1 memory.limit_in_bytes:346030080 1 memory.limit_in_bytes:524288000 2 memory.limit_in_bytes:9223372036854775807 -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: "azurIt" Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_from_add=5Fto=5Fpage=5Fcache=5Flocked?= Date: Fri, 30 Nov 2012 21:43:05 +0100 Message-ID: <20121130214305.6741FF64@pobox.sk> References: <20121126132149.GD17860@dhcp22.suse.cz>, <20121130032918.59B3F780@pobox.sk>, <20121130124506.GH29317@dhcp22.suse.cz>, <20121130144427.51A09169@pobox.sk>, <20121130144431.GI29317@dhcp22.suse.cz>, <20121130160811.6BB25BDD@pobox.sk>, <20121130153942.GL29317@dhcp22.suse.cz>, <20121130165937.F9564EBE@pobox.sk>, <20121130161923.GN29317@dhcp22.suse.cz>, <20121130172651.B6917602@pobox.sk> <20121130165347.GO29317@dhcp22.suse.cz> Mime-Version: 1.0 Content-Transfer-Encoding: quoted-printable Return-path: In-Reply-To: <20121130165347.GO29317@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" To: =?utf-8?q?Michal_Hocko?= Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= >The only strange thing I noticed is that some groups have 0 limit. Is >this intentional? >grep memory.limit_in_bytes cgroups | grep -v uid | sed 's@.*/@@' | sort | uniq -c > 3 memory.limit_in_bytes:0 These are users who are not allowed to run anything. azur -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Date: Mon, 3 Dec 2012 16:16:01 +0100 Message-ID: <20121203151601.GA17093@dhcp22.suse.cz> References: <20121126131837.GC17860@dhcp22.suse.cz> <20121126132149.GD17860@dhcp22.suse.cz> <20121130032918.59B3F780@pobox.sk> <20121130124506.GH29317@dhcp22.suse.cz> <20121130144427.51A09169@pobox.sk> <20121130144431.GI29317@dhcp22.suse.cz> <20121130160811.6BB25BDD@pobox.sk> <20121130153942.GL29317@dhcp22.suse.cz> <20121130165937.F9564EBE@pobox.sk> <20121130161923.GN29317@dhcp22.suse.cz> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20121130161923.GN29317@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Fri 30-11-12 17:19:23, Michal Hocko wrote: [...] > The important question is why you see VM_FAULT_OOM and whether memcg > charging failure can trigger that. I don not see how this could happen > right now because __GFP_NORETRY is not used for user pages (except for > THP which disable memcg OOM already), file backed page faults (aka > __do_fault) use mem_cgroup_newpage_charge which doesn't disable OOM. > This is a real head scratcher. The following should print the traces when we hand over ENOMEM to the caller. It should catch all charge paths (migration is not covered but that one is not important here). If we don't see any traces from here and there is still global OOM striking then there must be something else to trigger this. Could you test this with the patch which aims at fixing your deadlock, please? I realise that this is a production environment but I do not see anything relevant in the code. 
--- diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c8425b1..9e5b56b 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2397,6 +2397,7 @@ done: return 0; nomem: *ptr = NULL; + __WARN(); return -ENOMEM; bypass: *ptr = NULL; -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: "azurIt" Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_from_add=5Fto=5Fpage=5Fcache=5Flocked?= Date: Wed, 05 Dec 2012 02:36:44 +0100 Message-ID: <20121205023644.18C3006B@pobox.sk> References: <20121126131837.GC17860@dhcp22.suse.cz>, <20121126132149.GD17860@dhcp22.suse.cz>, <20121130032918.59B3F780@pobox.sk>, <20121130124506.GH29317@dhcp22.suse.cz>, <20121130144427.51A09169@pobox.sk>, <20121130144431.GI29317@dhcp22.suse.cz>, <20121130160811.6BB25BDD@pobox.sk>, <20121130153942.GL29317@dhcp22.suse.cz>, <20121130165937.F9564EBE@pobox.sk>, <20121130161923.GN29317@dhcp22.suse.cz> <20121203151601.GA17093@dhcp22.suse.cz> Mime-Version: 1.0 Content-Transfer-Encoding: 8bit Return-path: In-Reply-To: <20121203151601.GA17093-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" To: =?utf-8?q?Michal_Hocko?= Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= >The following should print the traces when we hand over ENOMEM to the >caller. It should catch all charge paths (migration is not covered but >that one is not important here). If we don't see any traces from here >and there is still global OOM striking then there must be something else >to trigger this. 
>Could you test this with the patch which aims at fixing your deadlock, >please? I realise that this is a production environment but I do not see >anything relevant in the code. Michal, i think/hope this is what you wanted: http://www.watchdog.sk/lkml/oom_mysqld2 From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Date: Wed, 5 Dec 2012 15:17:22 +0100 Message-ID: <20121205141722.GA9714@dhcp22.suse.cz> References: <20121130032918.59B3F780@pobox.sk> <20121130124506.GH29317@dhcp22.suse.cz> <20121130144427.51A09169@pobox.sk> <20121130144431.GI29317@dhcp22.suse.cz> <20121130160811.6BB25BDD@pobox.sk> <20121130153942.GL29317@dhcp22.suse.cz> <20121130165937.F9564EBE@pobox.sk> <20121130161923.GN29317@dhcp22.suse.cz> <20121203151601.GA17093@dhcp22.suse.cz> <20121205023644.18C3006B@pobox.sk> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20121205023644.18C3006B-Rm0zKEqwvD4@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: azurIt Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Wed 05-12-12 02:36:44, azurIt wrote: > >The following should print the traces when we hand over ENOMEM to the > >caller. It should catch all charge paths (migration is not covered but > >that one is not important here). If we don't see any traces from here > >and there is still global OOM striking then there must be something else > >to trigger this. > >Could you test this with the patch which aims at fixing your deadlock, > >please? I realise that this is a production environment but I do not see > >anything relevant in the code. 
> > > Michal, > > i think/hope this is what you wanted: > http://www.watchdog.sk/lkml/oom_mysqld2 Dec 5 02:20:48 server01 kernel: [ 380.995947] WARNING: at mm/memcontrol.c:2400 T.1146+0x2c1/0x5d0() Dec 5 02:20:48 server01 kernel: [ 380.995950] Hardware name: S5000VSA Dec 5 02:20:48 server01 kernel: [ 380.995952] Pid: 5351, comm: apache2 Not tainted 3.2.34-grsec #1 Dec 5 02:20:48 server01 kernel: [ 380.995954] Call Trace: Dec 5 02:20:48 server01 kernel: [ 380.995960] [] warn_slowpath_common+0x7a/0xb0 Dec 5 02:20:48 server01 kernel: [ 380.995963] [] warn_slowpath_null+0x1a/0x20 Dec 5 02:20:48 server01 kernel: [ 380.995965] [] T.1146+0x2c1/0x5d0 Dec 5 02:20:48 server01 kernel: [ 380.995967] [] mem_cgroup_charge_common+0x53/0x90 Dec 5 02:20:48 server01 kernel: [ 380.995970] [] mem_cgroup_newpage_charge+0x45/0x50 Dec 5 02:20:48 server01 kernel: [ 380.995974] [] handle_pte_fault+0x609/0x940 Dec 5 02:20:48 server01 kernel: [ 380.995978] [] ? pte_alloc_one+0x3f/0x50 Dec 5 02:20:48 server01 kernel: [ 380.995981] [] handle_mm_fault+0x138/0x260 Dec 5 02:20:48 server01 kernel: [ 380.995983] [] do_page_fault+0x13d/0x460 Dec 5 02:20:48 server01 kernel: [ 380.995986] [] ? do_mmap_pgoff+0x3dc/0x430 Dec 5 02:20:48 server01 kernel: [ 380.995988] [] ? remove_vma+0x5d/0x80 Dec 5 02:20:48 server01 kernel: [ 380.995992] [] page_fault+0x1f/0x30 Dec 5 02:20:48 server01 kernel: [ 380.995994] ---[ end trace 25bbb3e634c25b7f ]--- Dec 5 02:20:48 server01 kernel: [ 380.996373] apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0 Dec 5 02:20:48 server01 kernel: [ 380.996377] apache2 cpuset=uid mems_allowed=0 Dec 5 02:20:48 server01 kernel: [ 380.996379] Pid: 5351, comm: apache2 Tainted: G W 3.2.34-grsec #1 Dec 5 02:20:48 server01 kernel: [ 380.996380] Call Trace: Dec 5 02:20:48 server01 kernel: [ 380.996384] [] dump_header+0x7e/0x1e0 Dec 5 02:20:48 server01 kernel: [ 380.996387] [] ? 
find_lock_task_mm+0x2f/0x70 Dec 5 02:20:48 server01 kernel: [ 380.996389] [] oom_kill_process+0x85/0x2a0 Dec 5 02:20:48 server01 kernel: [ 380.996392] [] out_of_memory+0xe5/0x200 Dec 5 02:20:48 server01 kernel: [ 380.996394] [] ? pte_alloc_one+0x3f/0x50 Dec 5 02:20:48 server01 kernel: [ 380.996397] [] pagefault_out_of_memory+0xbd/0x110 Dec 5 02:20:48 server01 kernel: [ 380.996399] [] mm_fault_error+0xb6/0x1a0 Dec 5 02:20:48 server01 kernel: [ 380.996401] [] do_page_fault+0x3ee/0x460 Dec 5 02:20:48 server01 kernel: [ 380.996403] [] ? do_mmap_pgoff+0x3dc/0x430 Dec 5 02:20:48 server01 kernel: [ 380.996405] [] ? remove_vma+0x5d/0x80 Dec 5 02:20:48 server01 kernel: [ 380.996408] [] page_fault+0x1f/0x30 OK, so the ENOMEM seems to be leaking from mem_cgroup_newpage_charge. This can only happen if this was an atomic allocation request (!__GFP_WAIT) or if oom is not allowed which is the case only for transparent huge page allocation. The first case can be excluded (in the clean 3.2 stable kernel) because all callers of mem_cgroup_newpage_charge use GFP_KERNEL. The later one should be OK because the page fault should fallback to a regular page if THP allocation/charge fails. [/me goes to double check] Hmm do_huge_pmd_wp_page seems to charge a huge page and fails with VM_FAULT_OOM without any fallback. We should do_huge_pmd_wp_page_fallback instead. This has been fixed in 3.5-rc1 by 1f1d06c3 (thp, memcg: split hugepage for memcg oom on cow) but it hasn't been backported to 3.2. The patch applies to 3.2 without any further modifications. I didn't have time to test it but if it helps you we should push this to the stable tree. --- >From 765f5e0121c4410faa19c088e9ada75976bde178 Mon Sep 17 00:00:00 2001 From: David Rientjes Date: Tue, 29 May 2012 15:06:23 -0700 Subject: [PATCH] thp, memcg: split hugepage for memcg oom on cow On COW, a new hugepage is allocated and charged to the memcg. 
If the system is oom or the charge to the memcg fails, however, the fault handler will return VM_FAULT_OOM which results in an oom kill. Instead, it's possible to fallback to splitting the hugepage so that the COW results only in an order-0 page being allocated and charged to the memcg which has a higher liklihood to succeed. This is expensive because the hugepage must be split in the page fault handler, but it is much better than unnecessarily oom killing a process. Signed-off-by: David Rientjes Cc: Andrea Arcangeli Cc: Johannes Weiner Acked-by: KAMEZAWA Hiroyuki Cc: Michal Hocko Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds (cherry picked from commit 1f1d06c34f7675026326cd9f39ff91e4555cf355) --- mm/huge_memory.c | 3 +++ mm/memory.c | 18 +++++++++++++++--- 2 files changed, 18 insertions(+), 3 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 8f005e9..470cbb4 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -921,6 +921,8 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, count_vm_event(THP_FAULT_FALLBACK); ret = do_huge_pmd_wp_page_fallback(mm, vma, address, pmd, orig_pmd, page, haddr); + if (ret & VM_FAULT_OOM) + split_huge_page(page); put_page(page); goto out; } @@ -928,6 +930,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) { put_page(new_page); + split_huge_page(page); put_page(page); ret |= VM_FAULT_OOM; goto out; diff --git a/mm/memory.c b/mm/memory.c index 70f5daf..15e686a 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3469,6 +3469,7 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, if (unlikely(is_vm_hugetlb_page(vma))) return hugetlb_fault(mm, vma, address, flags); +retry: pgd = pgd_offset(mm, address); pud = pud_alloc(mm, pgd, address); if (!pud) @@ -3482,13 +3483,24 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, pmd, flags); } else { pmd_t orig_pmd 
= *pmd; + int ret; + barrier(); if (pmd_trans_huge(orig_pmd)) { if (flags & FAULT_FLAG_WRITE && !pmd_write(orig_pmd) && - !pmd_trans_splitting(orig_pmd)) - return do_huge_pmd_wp_page(mm, vma, address, - pmd, orig_pmd); + !pmd_trans_splitting(orig_pmd)) { + ret = do_huge_pmd_wp_page(mm, vma, address, pmd, + orig_pmd); + /* + * If COW results in an oom, the huge pmd will + * have been split, so retry the fault on the + * pte for a smaller charge. + */ + if (unlikely(ret & VM_FAULT_OOM)) + goto retry; + return ret; + } return 0; } } -- 1.7.10.4 -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: "azurIt" Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_from_add=5Fto=5Fpage=5Fcache=5Flocked?= Date: Thu, 06 Dec 2012 01:29:24 +0100 Message-ID: <20121206012924.FE077FD7@pobox.sk> References: <20121130032918.59B3F780@pobox.sk>, <20121130124506.GH29317@dhcp22.suse.cz>, <20121130144427.51A09169@pobox.sk>, <20121130144431.GI29317@dhcp22.suse.cz>, <20121130160811.6BB25BDD@pobox.sk>, <20121130153942.GL29317@dhcp22.suse.cz>, <20121130165937.F9564EBE@pobox.sk>, <20121130161923.GN29317@dhcp22.suse.cz>, <20121203151601.GA17093@dhcp22.suse.cz>, <20121205023644.18C3006B@pobox.sk> <20121205141722.GA9714@dhcp22.suse.cz> Mime-Version: 1.0 Content-Transfer-Encoding: 8bit Return-path: In-Reply-To: <20121205141722.GA9714-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" To: =?utf-8?q?Michal_Hocko?= Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= >OK, so the ENOMEM seems to be leaking from mem_cgroup_newpage_charge. >This can only happen if this was an atomic allocation request >(!__GFP_WAIT) or if oom is not allowed which is the case only for >transparent huge page allocation. 
>The first case can be excluded (in the clean 3.2 stable kernel) because >all callers of mem_cgroup_newpage_charge use GFP_KERNEL. The later one >should be OK because the page fault should fallback to a regular page if >THP allocation/charge fails. >[/me goes to double check] >Hmm do_huge_pmd_wp_page seems to charge a huge page and fails with >VM_FAULT_OOM without any fallback. We should do_huge_pmd_wp_page_fallback >instead. This has been fixed in 3.5-rc1 by 1f1d06c3 (thp, memcg: split >hugepage for memcg oom on cow) but it hasn't been backported to 3.2. The >patch applies to 3.2 without any further modifications. I didn't have >time to test it but if it helps you we should push this to the stable >tree. This, unfortunately, didn't fix the problem :( http://www.watchdog.sk/lkml/oom_mysqld3 From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Date: Thu, 6 Dec 2012 10:54:23 +0100 Message-ID: <20121206095423.GB10931@dhcp22.suse.cz> References: <20121130144427.51A09169@pobox.sk> <20121130144431.GI29317@dhcp22.suse.cz> <20121130160811.6BB25BDD@pobox.sk> <20121130153942.GL29317@dhcp22.suse.cz> <20121130165937.F9564EBE@pobox.sk> <20121130161923.GN29317@dhcp22.suse.cz> <20121203151601.GA17093@dhcp22.suse.cz> <20121205023644.18C3006B@pobox.sk> <20121205141722.GA9714@dhcp22.suse.cz> <20121206012924.FE077FD7@pobox.sk> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20121206012924.FE077FD7-Rm0zKEqwvD4@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: azurIt Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Thu 06-12-12 01:29:24, azurIt wrote: > >OK, so the ENOMEM seems to be leaking from mem_cgroup_newpage_charge. 
> >This can only happen if this was an atomic allocation request > >(!__GFP_WAIT) or if oom is not allowed which is the case only for > >transparent huge page allocation. > >The first case can be excluded (in the clean 3.2 stable kernel) because > >all callers of mem_cgroup_newpage_charge use GFP_KERNEL. The later one > >should be OK because the page fault should fallback to a regular page if > >THP allocation/charge fails. > >[/me goes to double check] > >Hmm do_huge_pmd_wp_page seems to charge a huge page and fails with > >VM_FAULT_OOM without any fallback. We should do_huge_pmd_wp_page_fallback > >instead. This has been fixed in 3.5-rc1 by 1f1d06c3 (thp, memcg: split > >hugepage for memcg oom on cow) but it hasn't been backported to 3.2. The > >patch applies to 3.2 without any further modifications. I didn't have > >time to test it but if it helps you we should push this to the stable > >tree. > > > This, unfortunately, didn't fix the problem :( > http://www.watchdog.sk/lkml/oom_mysqld3 Dohh. The very same stack mem_cgroup_newpage_charge called from the page fault. The heavy inlining is not particularly helping here... So there must be some other THP charge leaking out. [/me is diving into the code again] * do_huge_pmd_anonymous_page falls back to handle_pte_fault * do_huge_pmd_wp_page_fallback falls back to simple pages so it doesn't charge the huge page * do_huge_pmd_wp_page splits the huge page and retries with fallback to handle_pte_fault * collapse_huge_page is not called in the page fault path * do_wp_page, do_anonymous_page and __do_fault operate on a single page so the memcg charging cannot return ENOMEM There are no other callers AFAICS so I am getting clueless. Maybe more debugging will tell us something (the inlining has been reduced for thp paths which can reduce performance in thp page fault heavy workloads but this will give us better traces - I hope). Anyway do you see the same problem if transparent huge pages are disabled? 
echo never > /sys/kernel/mm/transparent_hugepage/enabled) --- >From 93a30140b50d8474a047b91c698f4880149635db Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Thu, 6 Dec 2012 10:40:17 +0100 Subject: [PATCH] more debugging --- mm/huge_memory.c | 6 +++--- mm/memcontrol.c | 2 +- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 470cbb4..01a11f1 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -671,7 +671,7 @@ static inline struct page *alloc_hugepage(int defrag) } #endif -int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma, +noinline int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, pmd_t *pmd, unsigned int flags) { @@ -790,7 +790,7 @@ pgtable_t get_pmd_huge_pte(struct mm_struct *mm) return pgtable; } -static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm, +static noinline int do_huge_pmd_wp_page_fallback(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, pmd_t *pmd, pmd_t orig_pmd, @@ -883,7 +883,7 @@ out_free_pages: goto out; } -int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, +noinline int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, pmd_t *pmd, pmd_t orig_pmd) { int ret = 0; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 9e5b56b..1986c65 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2397,7 +2397,7 @@ done: return 0; nomem: *ptr = NULL; - __WARN(); + __WARN_printf("gfp_mask:%u nr_pages:%u oom:%d ret:%d\n", gfp_mask, nr_pages, oom, ret); return -ENOMEM; bypass: *ptr = NULL; -- 1.7.10.4 -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: "azurIt" Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_from_add=5Fto=5Fpage=5Fcache=5Flocked?= Date: Thu, 06 Dec 2012 11:12:49 +0100 Message-ID: <20121206111249.58F013EA@pobox.sk> References: <20121130144427.51A09169@pobox.sk>, 
<20121130144431.GI29317@dhcp22.suse.cz>, <20121130160811.6BB25BDD@pobox.sk>, <20121130153942.GL29317@dhcp22.suse.cz>, <20121130165937.F9564EBE@pobox.sk>, <20121130161923.GN29317@dhcp22.suse.cz>, <20121203151601.GA17093@dhcp22.suse.cz>, <20121205023644.18C3006B@pobox.sk>, <20121205141722.GA9714@dhcp22.suse.cz>, <20121206012924.FE077FD7@pobox.sk> <20121206095423.GB10931@dhcp22.suse.cz> Mime-Version: 1.0 Content-Transfer-Encoding: 8bit Return-path: In-Reply-To: <20121206095423.GB10931@dhcp22.suse.cz> Sender: linux-kernel-owner@vger.kernel.org List-ID: Content-Type: text/plain; charset="us-ascii" To: =?utf-8?q?Michal_Hocko?= Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= >Dohh. The very same stack mem_cgroup_newpage_charge called from the page >fault. The heavy inlining is not particularly helping here... So there >must be some other THP charge leaking out. >[/me is diving into the code again] > >* do_huge_pmd_anonymous_page falls back to handle_pte_fault >* do_huge_pmd_wp_page_fallback falls back to simple pages so it doesn't > charge the huge page >* do_huge_pmd_wp_page splits the huge page and retries with fallback to > handle_pte_fault >* collapse_huge_page is not called in the page fault path >* do_wp_page, do_anonymous_page and __do_fault operate on a single page > so the memcg charging cannot return ENOMEM > >There are no other callers AFAICS so I am getting clueless. Maybe more >debugging will tell us something (the inlining has been reduced for thp >paths which can reduce performance in thp page fault heavy workloads but >this will give us better traces - I hope). Should i apply all patches togather? (fix for this bug, more log messages, backported fix from 3.5 and this new one) >Anyway do you see the same problem if transparent huge pages are >disabled? 
>echo never > /sys/kernel/mm/transparent_hugepage/enabled) # cat /sys/kernel/mm/transparent_hugepage/enabled cat: /sys/kernel/mm/transparent_hugepage/enabled: No such file or directory From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Date: Thu, 6 Dec 2012 18:06:58 +0100 Message-ID: <20121206170658.GD10931@dhcp22.suse.cz> References: <20121130160811.6BB25BDD@pobox.sk> <20121130153942.GL29317@dhcp22.suse.cz> <20121130165937.F9564EBE@pobox.sk> <20121130161923.GN29317@dhcp22.suse.cz> <20121203151601.GA17093@dhcp22.suse.cz> <20121205023644.18C3006B@pobox.sk> <20121205141722.GA9714@dhcp22.suse.cz> <20121206012924.FE077FD7@pobox.sk> <20121206095423.GB10931@dhcp22.suse.cz> <20121206111249.58F013EA@pobox.sk> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20121206111249.58F013EA@pobox.sk> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Thu 06-12-12 11:12:49, azurIt wrote: > >Dohh. The very same stack mem_cgroup_newpage_charge called from the page > >fault. The heavy inlining is not particularly helping here... So there > >must be some other THP charge leaking out. > >[/me is diving into the code again] > > > >* do_huge_pmd_anonymous_page falls back to handle_pte_fault > >* do_huge_pmd_wp_page_fallback falls back to simple pages so it doesn't > > charge the huge page > >* do_huge_pmd_wp_page splits the huge page and retries with fallback to > > handle_pte_fault > >* collapse_huge_page is not called in the page fault path > >* do_wp_page, do_anonymous_page and __do_fault operate on a single page > > so the memcg charging cannot return ENOMEM > > > >There are no other callers AFAICS so I am getting clueless. 
Maybe more > >debugging will tell us something (the inlining has been reduced for thp > >paths which can reduce performance in thp page fault heavy workloads but > >this will give us better traces - I hope). > > > Should i apply all patches togather? (fix for this bug, more log > messages, backported fix from 3.5 and this new one) Yes please -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: "azurIt" Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_from_add=5Fto=5Fpage=5Fcache=5Flocked?= Date: Mon, 10 Dec 2012 02:20:38 +0100 Message-ID: <20121210022038.E6570D37@pobox.sk> References: <20121130144427.51A09169@pobox.sk>, <20121130144431.GI29317@dhcp22.suse.cz>, <20121130160811.6BB25BDD@pobox.sk>, <20121130153942.GL29317@dhcp22.suse.cz>, <20121130165937.F9564EBE@pobox.sk>, <20121130161923.GN29317@dhcp22.suse.cz>, <20121203151601.GA17093@dhcp22.suse.cz>, <20121205023644.18C3006B@pobox.sk>, <20121205141722.GA9714@dhcp22.suse.cz>, <20121206012924.FE077FD7@pobox.sk> <20121206095423.GB10931@dhcp22.suse.cz> Mime-Version: 1.0 Content-Transfer-Encoding: 8bit Return-path: In-Reply-To: <20121206095423.GB10931-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" To: =?utf-8?q?Michal_Hocko?= Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= >There are no other callers AFAICS so I am getting clueless. Maybe more >debugging will tell us something (the inlining has been reduced for thp >paths which can reduce performance in thp page fault heavy workloads but >this will give us better traces - I hope).
Michal, this was printing so many debug messages to console that the whole server hung and i had to hard reset it after several minutes :( Sorry but i cannot test such things in production. There's no problem with one soft reset which takes 4 minutes but this hard reset creates about 20 minutes of outage (mainly because of disk quota checking). Last logged message: Dec 10 02:03:29 server01 kernel: [ 220.366486] grsec: From 141.105.120.152: bruteforce prevention initiated for the next 30 minutes or until service restarted, stalling each fork 30 seconds. Please investigate the crash report for /usr/lib/apache2/mpm-itk/apache2[apache2:3586] uid/euid:1258/1258 gid/egid:100/100, parent /usr/lib/apache2/mpm-itk/apache2[apache2:2142] uid/euid:0/0 gid/egid:0/0 From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Date: Mon, 10 Dec 2012 10:43:38 +0100 Message-ID: <20121210094318.GA6777@dhcp22.suse.cz> References: <20121130160811.6BB25BDD@pobox.sk> <20121130153942.GL29317@dhcp22.suse.cz> <20121130165937.F9564EBE@pobox.sk> <20121130161923.GN29317@dhcp22.suse.cz> <20121203151601.GA17093@dhcp22.suse.cz> <20121205023644.18C3006B@pobox.sk> <20121205141722.GA9714@dhcp22.suse.cz> <20121206012924.FE077FD7@pobox.sk> <20121206095423.GB10931@dhcp22.suse.cz> <20121210022038.E6570D37@pobox.sk> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20121210022038.E6570D37-Rm0zKEqwvD4@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: azurIt Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Mon 10-12-12 02:20:38, azurIt wrote: [...]
> Michal, Hi, > this was printing so many debug messages to console that the whole > server hangs Hmm, this is _really_ surprising. The latest patch didn't add any new logging actually. It just enahanced messages which were already printed out previously + changed few functions to be not inlined so they show up in the traces. So the only explanation is that the workload has changed or the patches got misapplied. > and i had to hard reset it after several minutes :( Sorry > but i cannot test such a things in production. There's no problem with > one soft reset which takes 4 minutes but this hard reset creates about > 20 minutes outage (mainly cos of disk quotas checking). Understood. > Last logged message: > > Dec 10 02:03:29 server01 kernel: [ 220.366486] grsec: From 141.105.120.152: bruteforce prevention initiated for the next 30 minutes or until service restarted, stalling each fork 30 seconds. Please investigate the crash report for /usr/lib/apache2/mpm-itk/apache2[apache2:3586] uid/euid:1258/1258 gid/egid:100/100, parent /usr/lib/apache2/mpm-itk/apache2[apache2:2142] uid/euid:0/0 gid/egid:0/0 This explains why you have seen your machine hung. I am not familiar with grsec but stalling each fork 30s sounds really bad. Anyway this will not help me much. Do you happen to still have any of those logged traces from the last run? Apart from that. If my current understanding is correct then this is related to transparent huge pages (and leaking charge to the page fault handler). Do you see the same problem if you disable THP before you start your workload? 
(echo never > /sys/kernel/mm/transparent_hugepage/enabled) -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: "azurIt" Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_from_add=5Fto=5Fpage=5Fcache=5Flocked?= Date: Mon, 10 Dec 2012 11:18:17 +0100 Message-ID: <20121210111817.F697F53E@pobox.sk> References: <20121130160811.6BB25BDD@pobox.sk>, <20121130153942.GL29317@dhcp22.suse.cz>, <20121130165937.F9564EBE@pobox.sk>, <20121130161923.GN29317@dhcp22.suse.cz>, <20121203151601.GA17093@dhcp22.suse.cz>, <20121205023644.18C3006B@pobox.sk>, <20121205141722.GA9714@dhcp22.suse.cz>, <20121206012924.FE077FD7@pobox.sk>, <20121206095423.GB10931@dhcp22.suse.cz>, <20121210022038.E6570D37@pobox.sk> <20121210094318.GA6777@dhcp22.suse.cz> Mime-Version: 1.0 Content-Transfer-Encoding: 8bit Return-path: In-Reply-To: <20121210094318.GA6777-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" To: =?utf-8?q?Michal_Hocko?= Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= >Hmm, this is _really_ surprising. The latest patch didn't add any new >logging actually. It just enahanced messages which were already printed >out previously + changed few functions to be not inlined so they show up >in the traces. So the only explanation is that the workload has changed >or the patches got misapplied. This time i installed 3.2.35, maybe some changes between .34 and .35 did this? Should i try .34? >> Dec 10 02:03:29 server01 kernel: [ 220.366486] grsec: From 141.105.120.152: bruteforce prevention initiated for the next 30 minutes or until service restarted, stalling each fork 30 seconds. 
Please investigate the crash report for /usr/lib/apache2/mpm-itk/apache2[apache2:3586] uid/euid:1258/1258 gid/egid:100/100, parent /usr/lib/apache2/mpm-itk/apache2[apache2:2142] uid/euid:0/0 gid/egid:0/0 > >This explains why you have seen your machine hung. I am not familiar >with grsec but stalling each fork 30s sounds really bad. Btw, i never ever saw such a message from grsecurity yet. Will write to grsec mailing list about explanation. >Anyway this will not help me much. Do you happen to still have any of >those logged traces from the last run? Unfortunately not, it didn't log anything and tons of messages were printed only to console (i was logged via IP-KVM). It looked that printing is infinite, i rebooted it after few minutes. >Apart from that. If my current understanding is correct then this is >related to transparent huge pages (and leaking charge to the page fault >handler). Do you see the same problem if you disable THP before you >start your workload? (echo never > /sys/kernel/mm/transparent_hugepage/enabled) # cat /sys/kernel/mm/transparent_hugepage/enabled cat: /sys/kernel/mm/transparent_hugepage/enabled: No such file or directory # ls -la /sys/kernel/mm total 0 drwx------ 3 root root 0 Dec 10 11:11 . drwx------ 5 root root 0 Dec 10 02:06 .. 
drwx------ 2 root root 0 Dec 10 11:11 cleancache From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Date: Mon, 10 Dec 2012 16:52:05 +0100 Message-ID: <20121210155205.GB6777@dhcp22.suse.cz> References: <20121130165937.F9564EBE@pobox.sk> <20121130161923.GN29317@dhcp22.suse.cz> <20121203151601.GA17093@dhcp22.suse.cz> <20121205023644.18C3006B@pobox.sk> <20121205141722.GA9714@dhcp22.suse.cz> <20121206012924.FE077FD7@pobox.sk> <20121206095423.GB10931@dhcp22.suse.cz> <20121210022038.E6570D37@pobox.sk> <20121210094318.GA6777@dhcp22.suse.cz> <20121210111817.F697F53E@pobox.sk> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20121210111817.F697F53E-Rm0zKEqwvD4@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: azurIt Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Mon 10-12-12 11:18:17, azurIt wrote: > >Hmm, this is _really_ surprising. The latest patch didn't add any new > >logging actually. It just enahanced messages which were already printed > >out previously + changed few functions to be not inlined so they show up > >in the traces. So the only explanation is that the workload has changed > >or the patches got misapplied. > > > This time i installed 3.2.35, maybe some changes between .34 and .35 > did this? Should i try .34? I would try to limit changes to minimum. So the original kernel you were using + the first patch to prevent OOM from the write path + 2 debugging patches. > >> Dec 10 02:03:29 server01 kernel: [ 220.366486] grsec: From 141.105.120.152: bruteforce prevention initiated for the next 30 minutes or until service restarted, stalling each fork 30 seconds. 
Please investigate the crash report for /usr/lib/apache2/mpm-itk/apache2[apache2:3586] uid/euid:1258/1258 gid/egid:100/100, parent /usr/lib/apache2/mpm-itk/apache2[apache2:2142] uid/euid:0/0 gid/egid:0/0 > > > >This explains why you have seen your machine hung. I am not familiar > >with grsec but stalling each fork 30s sounds really bad. > > > Btw, i never ever saw such a message from grsecurity yet. Will write to grsec mailing list about explanation. > > > >Anyway this will not help me much. Do you happen to still have any of > >those logged traces from the last run? > > > Unfortunately not, it didn't log anything and tons of messages were > printed only to console (i was logged via IP-KVM). It looked that > printing is infinite, i rebooted it after few minutes. But was it at least related to the debugging from the patch, or was it rather a totally unrelated thing? > >Apart from that. If my current understanding is correct then this is > >related to transparent huge pages (and leaking charge to the page fault > >handler). Do you see the same problem if you disable THP before you > >start your workload? (echo never > /sys/kernel/mm/transparent_hugepage/enabled) > > # cat /sys/kernel/mm/transparent_hugepage/enabled > cat: /sys/kernel/mm/transparent_hugepage/enabled: No such file or directory Weee. Then it cannot be related to THP at all, which makes this an even bigger mystery. We really need to find out who is leaking that charge.
-- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: "azurIt" Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_from_add=5Fto=5Fpage=5Fcache=5Flocked?= Date: Mon, 10 Dec 2012 18:18:54 +0100 Message-ID: <20121210181854.5BE82C77@pobox.sk> References: <20121130165937.F9564EBE@pobox.sk>, <20121130161923.GN29317@dhcp22.suse.cz>, <20121203151601.GA17093@dhcp22.suse.cz>, <20121205023644.18C3006B@pobox.sk>, <20121205141722.GA9714@dhcp22.suse.cz>, <20121206012924.FE077FD7@pobox.sk>, <20121206095423.GB10931@dhcp22.suse.cz>, <20121210022038.E6570D37@pobox.sk>, <20121210094318.GA6777@dhcp22.suse.cz>, <20121210111817.F697F53E@pobox.sk> <20121210155205.GB6777@dhcp22.suse.cz> Mime-Version: 1.0 Content-Transfer-Encoding: quoted-printable Return-path: In-Reply-To: <20121210155205.GB6777@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" To: =?utf-8?q?Michal_Hocko?= Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= >I would try to limit changes to minimum. So the original kernel you were >using + the first patch to prevent OOM from the write path + 2 debugging >patches. ok. >But was it at least related to the debugging from the patch or it was >rather a totally unrelated thing? I wasn't reading it closely but i think it looked like the traces i was sending you before.
From mboxrd@z Thu Jan 1 00:00:00 1970 From: "azurIt" Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_from_add=5Fto=5Fpage=5Fcache=5Flocked?= Date: Mon, 17 Dec 2012 02:34:30 +0100 Message-ID: <20121217023430.5A390FD7@pobox.sk> References: <20121130165937.F9564EBE@pobox.sk>, <20121130161923.GN29317@dhcp22.suse.cz>, <20121203151601.GA17093@dhcp22.suse.cz>, <20121205023644.18C3006B@pobox.sk>, <20121205141722.GA9714@dhcp22.suse.cz>, <20121206012924.FE077FD7@pobox.sk>, <20121206095423.GB10931@dhcp22.suse.cz>, <20121210022038.E6570D37@pobox.sk>, <20121210094318.GA6777@dhcp22.suse.cz>, <20121210111817.F697F53E@pobox.sk> <20121210155205.GB6777@dhcp22.suse.cz> Mime-Version: 1.0 Content-Transfer-Encoding: quoted-printable Return-path: In-Reply-To: <20121210155205.GB6777@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" To: =?utf-8?q?Michal_Hocko?= Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= >I would try to limit changes to minimum. So the original kernel you were >using + the first patch to prevent OOM from the write path + 2 debugging >patches. It didn't take down the whole system this time (but i was prepared to record a video of the console ;) ), here it is: http://www.watchdog.sk/lkml/oom_mysqld4
From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Date: Mon, 17 Dec 2012 17:32:03 +0100 Message-ID: <20121217163203.GD25432@dhcp22.suse.cz> References: <20121203151601.GA17093@dhcp22.suse.cz> <20121205023644.18C3006B@pobox.sk> <20121205141722.GA9714@dhcp22.suse.cz> <20121206012924.FE077FD7@pobox.sk> <20121206095423.GB10931@dhcp22.suse.cz> <20121210022038.E6570D37@pobox.sk> <20121210094318.GA6777@dhcp22.suse.cz> <20121210111817.F697F53E@pobox.sk> <20121210155205.GB6777@dhcp22.suse.cz> <20121217023430.5A390FD7@pobox.sk> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20121217023430.5A390FD7@pobox.sk> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Mon 17-12-12 02:34:30, azurIt wrote: > >I would try to limit changes to minimum. So the original kernel you were > >using + the first patch to prevent OOM from the write path + 2 debugging > >patches. > > > It didn't take off the whole system this time (but i was > prepared to record a video of console ;) ), here it is: > http://www.watchdog.sk/lkml/oom_mysqld4 [...] [ 1248.059429] ------------[ cut here ]------------ [ 1248.059586] WARNING: at mm/memcontrol.c:2400 T.1146+0x2d9/0x610() [ 1248.059723] Hardware name: S5000VSA [ 1248.059855] gfp_mask:208 nr_pages:1 oom:0 ret:2 This is a GFP_KERNEL allocation, which is expected. It is also a single page, which is less expected, because we shouldn't return ENOMEM on those unless this was a GFP_ATOMIC allocation (which it wasn't) or the caller told us not to trigger OOM, which is the case only for THP pages (see mem_cgroup_charge_common). So the big question is how we have ended up with oom=false here... [Ohh, I am really an idiot.
I screwed the first patch] - bool oom = true; + bool oom = !(gfp_mask | GFP_MEMCG_NO_OOM); Which obviously doesn't work. It should read !(gfp_mask &GFP_MEMCG_NO_OOM). No idea how I could have missed that. I am really sorry about that. --- diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c04676d..1f35a74 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2704,7 +2704,7 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm, struct mem_cgroup *memcg = NULL; unsigned int nr_pages = 1; struct page_cgroup *pc; - bool oom = !(gfp_mask | GFP_MEMCG_NO_OOM); + bool oom = !(gfp_mask & GFP_MEMCG_NO_OOM); int ret; if (PageTransHuge(page)) { -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: "azurIt" Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_from_add=5Fto=5Fpage=5Fcache=5Flocked?= Date: Mon, 17 Dec 2012 19:23:01 +0100 Message-ID: <20121217192301.829A7020@pobox.sk> References: <20121203151601.GA17093@dhcp22.suse.cz>, <20121205023644.18C3006B@pobox.sk>, <20121205141722.GA9714@dhcp22.suse.cz>, <20121206012924.FE077FD7@pobox.sk>, <20121206095423.GB10931@dhcp22.suse.cz>, <20121210022038.E6570D37@pobox.sk>, <20121210094318.GA6777@dhcp22.suse.cz>, <20121210111817.F697F53E@pobox.sk>, <20121210155205.GB6777@dhcp22.suse.cz>, <20121217023430.5A390FD7@pobox.sk> <20121217163203.GD25432@dhcp22.suse.cz> Mime-Version: 1.0 Content-Transfer-Encoding: quoted-printable Return-path: In-Reply-To: <20121217163203.GD25432@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" To: =?utf-8?q?Michal_Hocko?= Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= >[Ohh, I am really an idiot. 
I screwed the first patch] >- bool oom = true; >+ bool oom = !(gfp_mask | GFP_MEMCG_NO_OOM); > >Which obviously doesn't work. It should read !(gfp_mask &GFP_MEMCG_NO_OOM). > No idea how I could have missed that. I am really sorry about that. :D no problem :) so, now it should really work as expected and completely fix my original problem? is it safe to apply it on 3.2.35? Thank you very much! azur -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Date: Mon, 17 Dec 2012 20:55:10 +0100 Message-ID: <20121217195510.GA16375@dhcp22.suse.cz> References: <20121205141722.GA9714@dhcp22.suse.cz> <20121206012924.FE077FD7@pobox.sk> <20121206095423.GB10931@dhcp22.suse.cz> <20121210022038.E6570D37@pobox.sk> <20121210094318.GA6777@dhcp22.suse.cz> <20121210111817.F697F53E@pobox.sk> <20121210155205.GB6777@dhcp22.suse.cz> <20121217023430.5A390FD7@pobox.sk> <20121217163203.GD25432@dhcp22.suse.cz> <20121217192301.829A7020@pobox.sk> Mime-Version: 1.0 Return-path: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=sender:date:from:to:cc:subject:message-id:references:mime-version :content-type:content-disposition:in-reply-to:user-agent; bh=dM1v33DKUG/40/PGm1k9sSd6COs87ISoDF6r2JBNq3o=; b=dbe9V5owOzox+Rdd0fL4glrPB9i7NqIn1zJl+nxyt7yBXT/fFTJjrcZcJMbvRuLC5/ 8lAfmE24AnFqD0/Rnw7M0vZjgmwj70xsQr/Uw1/herLZ064nu+RzGyBLRXWdTceFbJBw w/h+NhnOpLND5eJgz0+CNJzbMEAufOyABn+cfOmBNGDmEs/NOaYK5K7dct763Q7uvg7l BEW4vqBpqTo/IrRIviSDEOrpCVuhFb0reO46iR1+YFwsHg8TBoAW5m3vrs4hzigiVBLc QnflWIMcVFwRWTlxK9w/QuAvYJInNUTLFyOkfzrFXXB1ryjHogT8X3iDmSFhNSqOwQw7 NvBw== Content-Disposition: inline In-Reply-To: <20121217192301.829A7020-Rm0zKEqwvD4@public.gmane.org> Sender:
cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: azurIt Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Mon 17-12-12 19:23:01, azurIt wrote: > >[Ohh, I am really an idiot. I screwed the first patch] > >- bool oom = true; > >+ bool oom = !(gfp_mask | GFP_MEMCG_NO_OOM); > > > >Which obviously doesn't work. It should read !(gfp_mask &GFP_MEMCG_NO_OOM). > > No idea how I could have missed that. I am really sorry about that. > > > :D no problem :) so, now it should really work as expected and > completely fix my original problem? It should mitigate the problem. The real fix shouldn't be that specific (as per discussion in other thread). The chance this will get upstream is not big and that means that it will not get to the stable tree either. > is it safe to apply it on 3.2.35? I didn't check what are the differences but I do not think there is anything to conflict with it. > Thank you very much! 
HTH -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: "azurIt" Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_from_add=5Fto=5Fpage=5Fcache=5Flocked?= Date: Tue, 18 Dec 2012 15:22:23 +0100 Message-ID: <20121218152223.6912832C@pobox.sk> References: <20121205141722.GA9714@dhcp22.suse.cz>, <20121206012924.FE077FD7@pobox.sk>, <20121206095423.GB10931@dhcp22.suse.cz>, <20121210022038.E6570D37@pobox.sk>, <20121210094318.GA6777@dhcp22.suse.cz>, <20121210111817.F697F53E@pobox.sk>, <20121210155205.GB6777@dhcp22.suse.cz>, <20121217023430.5A390FD7@pobox.sk>, <20121217163203.GD25432@dhcp22.suse.cz>, <20121217192301.829A7020@pobox.sk> <20121217195510.GA16375@dhcp22.suse.cz> Mime-Version: 1.0 Content-Transfer-Encoding: 8bit Return-path: In-Reply-To: <20121217195510.GA16375-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" To: =?utf-8?q?Michal_Hocko?= Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= >It should mitigate the problem. The real fix shouldn't be that specific >(as per discussion in other thread). The chance this will get upstream >is not big and that means that it will not get to the stable tree >either. OOM is no longer killing processes outside target cgroups, so everything looks fine so far. Will report back when i will have more info. Thnks! 
azur From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Date: Tue, 18 Dec 2012 16:20:04 +0100 Message-ID: <20121218152004.GA25208@dhcp22.suse.cz> References: <20121206095423.GB10931@dhcp22.suse.cz> <20121210022038.E6570D37@pobox.sk> <20121210094318.GA6777@dhcp22.suse.cz> <20121210111817.F697F53E@pobox.sk> <20121210155205.GB6777@dhcp22.suse.cz> <20121217023430.5A390FD7@pobox.sk> <20121217163203.GD25432@dhcp22.suse.cz> <20121217192301.829A7020@pobox.sk> <20121217195510.GA16375@dhcp22.suse.cz> <20121218152223.6912832C@pobox.sk> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20121218152223.6912832C@pobox.sk> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Tue 18-12-12 15:22:23, azurIt wrote: > >It should mitigate the problem. The real fix shouldn't be that specific > >(as per discussion in other thread). The chance this will get upstream > >is not big and that means that it will not get to the stable tree > >either. > > > OOM is no longer killing processes outside target cgroups, so > everything looks fine so far. Will report back when i will have more > info. Thnks! OK, good to hear and fingers crossed. I will try to get back to the original problem and a better solution sometimes early next year when all the things settle a bit. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
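The one-character fix posted earlier in this thread (`|` replaced by `&` in the computation of `oom`) can be checked in isolation. Below is a minimal userspace sketch in C, not kernel code. The names `SKETCH_NO_OOM`, `SKETCH_GFP_KERNEL`, `oom_allowed_buggy` and `oom_allowed_fixed` are hypothetical, and the bit chosen for `SKETCH_NO_OOM` is only a placeholder because the thread does not show how GFP_MEMCG_NO_OOM is defined. The value 208 is the gfp_mask printed in the warning quoted above.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder bit standing in for GFP_MEMCG_NO_OOM. The real value is an
 * assumption here; the thread never shows its definition. */
#define SKETCH_NO_OOM 0x2000000u

/* gfp_mask:208 as printed in the warning quoted in the thread (0xd0). */
#define SKETCH_GFP_KERNEL 208u

static_assert(SKETCH_NO_OOM != 0, "the flag must have at least one bit set");
static_assert((SKETCH_GFP_KERNEL & SKETCH_NO_OOM) == 0,
              "the placeholder bit must not overlap the plain mask");

/* First version of the patch: OR-ing in a non-zero flag always yields a
 * non-zero value, so the negation is false for every mask. */
bool oom_allowed_buggy(unsigned int gfp_mask)
{
    return !(gfp_mask | SKETCH_NO_OOM);
}

/* Corrected version: AND isolates the flag bit, so the result is true
 * exactly when the caller did not set the flag. */
bool oom_allowed_fixed(unsigned int gfp_mask)
{
    return !(gfp_mask & SKETCH_NO_OOM);
}
```

Under these assumptions the buggy helper returns false even for a plain mask of 208, which matches the `oom:0` seen in the warning, while the fixed helper returns false only when the placeholder flag is set.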
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: "azurIt" Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_from_add=5Fto=5Fpage=5Fcache=5Flocked?= Date: Mon, 24 Dec 2012 14:25:26 +0100 Message-ID: <20121224142526.020165D3@pobox.sk> References: <20121206095423.GB10931@dhcp22.suse.cz>, <20121210022038.E6570D37@pobox.sk>, <20121210094318.GA6777@dhcp22.suse.cz>, <20121210111817.F697F53E@pobox.sk>, <20121210155205.GB6777@dhcp22.suse.cz>, <20121217023430.5A390FD7@pobox.sk>, <20121217163203.GD25432@dhcp22.suse.cz>, <20121217192301.829A7020@pobox.sk>, <20121217195510.GA16375@dhcp22.suse.cz>, <20121218152223.6912832C@pobox.sk> <20121218152004.GA25208@dhcp22.suse.cz> Mime-Version: 1.0 Content-Transfer-Encoding: 8bit Return-path: In-Reply-To: <20121218152004.GA25208-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" To: =?utf-8?q?Michal_Hocko?= Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= >OK, good to hear and fingers crossed. I will try to get back to the >original problem and a better solution sometimes early next year when >all the things settle a bit. Michal, problem, unfortunately, happened again :( twice. When it happened first time (two days ago) i don't want to believe it so i recompiled the kernel and boot it again to be sure i really used your patch. 
Today it happened again, here is report: http://watchdog.sk/lkml/memcg-bug-3.tar.gz Here is patch which i used (kernel 3.2.35, i didn't use any other from your patches): http://watchdog.sk/lkml/5-memcg-fix.patch azur From mboxrd@z Thu Jan 1 00:00:00 1970 From: "azurIt" Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_from_add=5Fto=5Fpage=5Fcache=5Flocked?= Date: Mon, 24 Dec 2012 14:38:50 +0100 Message-ID: <20121224143850.B611B3C3@pobox.sk> References: <20121206095423.GB10931@dhcp22.suse.cz>, <20121210022038.E6570D37@pobox.sk>, <20121210094318.GA6777@dhcp22.suse.cz>, <20121210111817.F697F53E@pobox.sk>, <20121210155205.GB6777@dhcp22.suse.cz>, <20121217023430.5A390FD7@pobox.sk>, <20121217163203.GD25432@dhcp22.suse.cz>, <20121217192301.829A7020@pobox.sk>, <20121217195510.GA16375@dhcp22.suse.cz>, <20121218152223.6912832C@pobox.sk> <20121218152004.GA25208@dhcp22.suse.cz> Mime-Version: 1.0 Content-Transfer-Encoding: 8bit Return-path: In-Reply-To: <20121218152004.GA25208-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" To: =?utf-8?q?Michal_Hocko?= Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= >OK, good to hear and fingers crossed. I will try to get back to the >original problem and a better solution sometimes early next year when >all the things settle a bit. Btw, i noticed one more thing when problem is happening (=when any cgroup is stuck), i forgot to mention it before, sorry :( . It's related to HDDs, something is slowing them down in a strange way. All services are working normally and i really cannot notice any slowness, the only thing which i noticed is affected is our backup software ( www.Bacula.org ).
When problem occurs at night, so it's happening when backup is running, backup is extremely slow and usually don't finish until i kill processes inside affected cgroup (=until i resolve the problem). Backup software is NOT doing big HDD bandwidth BUT it's doing quite huge number of disk operations (it needs to stat every file and directory). I believe that only speed of disk operations are affected and are very slow. Merry christmas! From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Date: Fri, 28 Dec 2012 17:22:09 +0100 Message-ID: <20121228162209.GA1455@dhcp22.suse.cz> References: <20121210094318.GA6777@dhcp22.suse.cz> <20121210111817.F697F53E@pobox.sk> <20121210155205.GB6777@dhcp22.suse.cz> <20121217023430.5A390FD7@pobox.sk> <20121217163203.GD25432@dhcp22.suse.cz> <20121217192301.829A7020@pobox.sk> <20121217195510.GA16375@dhcp22.suse.cz> <20121218152223.6912832C@pobox.sk> <20121218152004.GA25208@dhcp22.suse.cz> <20121224142526.020165D3@pobox.sk> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20121224142526.020165D3-Rm0zKEqwvD4@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: azurIt Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Mon 24-12-12 14:25:26, azurIt wrote: > >OK, good to hear and fingers crossed. I will try to get back to the > >original problem and a better solution sometimes early next year when > >all the things settle a bit. > > > Michal, problem, unfortunately, happened again :( twice. When it > happened first time (two days ago) i don't want to believe it so i > recompiled the kernel and boot it again to be sure i really used your > patch. 
Today it happened again, here is report: > http://watchdog.sk/lkml/memcg-bug-3.tar.gz Hmm, 1356352982/1507/stack says [] mem_cgroup_handle_oom+0x241/0x3b0 [] T.1147+0x5ab/0x5c0 [] mem_cgroup_cache_charge+0xbe/0xe0 [] add_to_page_cache_locked+0x4f/0x140 [] add_to_page_cache_lru+0x22/0x50 [] find_or_create_page+0x73/0xb0 [] __getblk+0xea/0x2c0 [] ext3_getblk+0xeb/0x240 [] ext3_bread+0x19/0x90 [] ext3_dx_find_entry+0x83/0x1e0 [] ext3_find_entry+0x2e4/0x480 [] ext3_lookup+0x4d/0x120 [] d_alloc_and_lookup+0x45/0x90 [] do_lookup+0x278/0x390 [] path_lookupat+0xae/0x7e0 [] do_path_lookup+0x35/0xe0 [] user_path_at_empty+0x59/0xb0 [] user_path_at+0x11/0x20 [] vfs_fstatat+0x47/0x80 [] vfs_lstat+0x1e/0x20 [] sys_newlstat+0x24/0x50 [] system_call_fastpath+0x18/0x1d [] 0xffffffffffffffff which suggests that the patch is incomplete and that I am blind :/ mem_cgroup_cache_charge calls __mem_cgroup_try_charge for the page cache and that one doesn't check GFP_MEMCG_NO_OOM. So you need the following follow-up patch on top of the one you already have (which should catch all the remaining cases). Sorry about that... 
--- diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 89997ac..559a54d 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2779,6 +2779,7 @@ __mem_cgroup_commit_charge_lrucare(struct page *page, struct mem_cgroup *memcg, int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask) { + bool oom = !(gfp_mask & GFP_MEMCG_NO_OOM); struct mem_cgroup *memcg = NULL; int ret; @@ -2791,7 +2792,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, mm = &init_mm; if (page_is_file_cache(page)) { - ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, &memcg, true); + ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, &memcg, oom); if (ret || !memcg) return ret; @@ -2827,6 +2828,7 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page, gfp_t mask, struct mem_cgroup **ptr) { + bool oom = !(mask & GFP_MEMCG_NO_OOM); struct mem_cgroup *memcg; int ret; @@ -2849,13 +2851,13 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm, if (!memcg) goto charge_cur_mm; *ptr = memcg; - ret = __mem_cgroup_try_charge(NULL, mask, 1, ptr, true); + ret = __mem_cgroup_try_charge(NULL, mask, 1, ptr, oom); css_put(&memcg->css); return ret; charge_cur_mm: if (unlikely(!mm)) mm = &init_mm; - return __mem_cgroup_try_charge(mm, mask, 1, ptr, true); + return __mem_cgroup_try_charge(mm, mask, 1, ptr, oom); } static void -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Date: Fri, 28 Dec 2012 17:35:21 +0100 Message-ID: <20121228163521.GB1455@dhcp22.suse.cz> References: <20121210094318.GA6777@dhcp22.suse.cz> <20121210111817.F697F53E@pobox.sk> <20121210155205.GB6777@dhcp22.suse.cz> <20121217023430.5A390FD7@pobox.sk> <20121217163203.GD25432@dhcp22.suse.cz> <20121217192301.829A7020@pobox.sk> <20121217195510.GA16375@dhcp22.suse.cz> <20121218152223.6912832C@pobox.sk> <20121218152004.GA25208@dhcp22.suse.cz> 
<20121224143850.B611B3C3@pobox.sk> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20121224143850.B611B3C3@pobox.sk> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Mon 24-12-12 14:38:50, azurIt wrote: > >OK, good to hear and fingers crossed. I will try to get back to the > >original problem and a better solution sometimes early next year when > >all the things settle a bit. > > > Btw, i noticed one more thing when problem is happening (=when any > cgroup is stuck), i forgot to mention it before, sorry :( . It's > related to HDDs, something is slowing them down in a strange way. All > services are working normally and i really cannot notice any slowness, > the only thing which i noticed is affected is our backup software ( > www.Bacula.org ). When problem occurs at night, so it's happening when > backup is running, backup is extremely slow and usually don't finish > until i kill processes inside affected cgroup (=until i resolve the > problem). Backup software is NOT doing big HDD bandwidth BUT it's > doing quite huge number of disk operations (it needs to stat every > file and directory). I believe that only speed of disk operations are > affected and are very slow. I would bet that this is caused by the blocked processes in the memcg oom handler, which hold i_mutex while the backup process wants to access the same inode with an operation which requires the lock. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
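The follow-up patch posted earlier in this thread replaces a hard-coded `true` at the charge call sites with a value derived from the gfp mask. The following is a rough userspace model of that idea in C, offered only as a sketch. Every name in it (`sketch_group`, `sketch_try_charge`, `sketch_cache_charge_before`, `sketch_cache_charge_after`, `SKETCH_NO_OOM`) is hypothetical, the flag bit is a placeholder, and the "OOM handler" is reduced to a counter plus a pretend release of all charged pages, which is far simpler than what the kernel does.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Placeholder bit standing in for GFP_MEMCG_NO_OOM (assumed value). */
#define SKETCH_NO_OOM 0x2000000u

/* Toy stand-in for a memory cgroup: charged pages, limit, OOM counter. */
struct sketch_group {
    unsigned long usage;
    unsigned long limit;
    unsigned int oom_invocations;
};

/* Simplified charge routine. When the limit would be exceeded it either
 * fails with -ENOMEM (oom == false) or "runs the OOM handler", modelled
 * here as freeing every charged page, and then retries the charge. */
int sketch_try_charge(struct sketch_group *g, unsigned long nr_pages, bool oom)
{
    if (g->usage + nr_pages <= g->limit) {
        g->usage += nr_pages;
        return 0;
    }
    if (!oom || nr_pages > g->limit)
        return -ENOMEM;
    g->oom_invocations++;
    g->usage = nr_pages; /* pretend the kill released everything else */
    return 0;
}

/* Call site before the follow-up patch: the mask is ignored. */
int sketch_cache_charge_before(struct sketch_group *g, unsigned int gfp_mask)
{
    (void)gfp_mask;
    return sketch_try_charge(g, 1, true);
}

/* Call site after the follow-up patch: the flag suppresses the OOM path. */
int sketch_cache_charge_after(struct sketch_group *g, unsigned int gfp_mask)
{
    bool oom = !(gfp_mask & SKETCH_NO_OOM);
    return sketch_try_charge(g, 1, oom);
}
```

In this model a full group charged with the placeholder flag set still goes through the OOM path in the "before" variant, while the "after" variant returns -ENOMEM and leaves the group untouched.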
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: "azurIt" Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_from_add=5Fto=5Fpage=5Fcache=5Flocked?= Date: Sun, 30 Dec 2012 02:09:47 +0100 Message-ID: <20121230020947.AA002F34@pobox.sk> References: <20121210094318.GA6777@dhcp22.suse.cz>, <20121210111817.F697F53E@pobox.sk>, <20121210155205.GB6777@dhcp22.suse.cz>, <20121217023430.5A390FD7@pobox.sk>, <20121217163203.GD25432@dhcp22.suse.cz>, <20121217192301.829A7020@pobox.sk>, <20121217195510.GA16375@dhcp22.suse.cz>, <20121218152223.6912832C@pobox.sk>, <20121218152004.GA25208@dhcp22.suse.cz>, <20121224142526.020165D3@pobox.sk> <20121228162209.GA1455@dhcp22.suse.cz> Mime-Version: 1.0 Content-Transfer-Encoding: 8bit Return-path: In-Reply-To: <20121228162209.GA1455-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" To: =?utf-8?q?Michal_Hocko?= Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= >which suggests that the patch is incomplete and that I am blind :/ >mem_cgroup_cache_charge calls __mem_cgroup_try_charge for the page cache >and that one doesn't check GFP_MEMCG_NO_OOM. So you need the following >follow-up patch on top of the one you already have (which should catch >all the remaining cases). >Sorry about that... 
This was, again, killing my MySQL server (search for "(mysqld)"): http://www.watchdog.sk/lkml/oom_mysqld5 From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Date: Sun, 30 Dec 2012 12:08:15 +0100 Message-ID: <20121230110815.GA12940@dhcp22.suse.cz> References: <20121210155205.GB6777@dhcp22.suse.cz> <20121217023430.5A390FD7@pobox.sk> <20121217163203.GD25432@dhcp22.suse.cz> <20121217192301.829A7020@pobox.sk> <20121217195510.GA16375@dhcp22.suse.cz> <20121218152223.6912832C@pobox.sk> <20121218152004.GA25208@dhcp22.suse.cz> <20121224142526.020165D3@pobox.sk> <20121228162209.GA1455@dhcp22.suse.cz> <20121230020947.AA002F34@pobox.sk> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20121230020947.AA002F34@pobox.sk> Sender: linux-kernel-owner@vger.kernel.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Sun 30-12-12 02:09:47, azurIt wrote: > >which suggests that the patch is incomplete and that I am blind :/ > >mem_cgroup_cache_charge calls __mem_cgroup_try_charge for the page cache > >and that one doesn't check GFP_MEMCG_NO_OOM. So you need the following > >follow-up patch on top of the one you already have (which should catch > >all the remaining cases). > >Sorry about that... 
> > > This was, again, killing my MySQL server (search for "(mysqld)"): > http://www.watchdog.sk/lkml/oom_mysqld5 grep "Kill process" oom_mysqld5 Dec 30 01:53:34 server01 kernel: [ 367.061801] Memory cgroup out of memory: Kill process 5512 (apache2) score 716 or sacrifice child Dec 30 01:53:35 server01 kernel: [ 367.338024] Memory cgroup out of memory: Kill process 5517 (apache2) score 718 or sacrifice child Dec 30 01:53:35 server01 kernel: [ 367.747888] Memory cgroup out of memory: Kill process 5513 (apache2) score 721 or sacrifice child Dec 30 01:53:36 server01 kernel: [ 368.159860] Memory cgroup out of memory: Kill process 5516 (apache2) score 726 or sacrifice child Dec 30 01:53:36 server01 kernel: [ 368.665606] Memory cgroup out of memory: Kill process 5520 (apache2) score 733 or sacrifice child Dec 30 01:53:36 server01 kernel: [ 368.765652] Out of memory: Kill process 1778 (mysqld) score 39 or sacrifice child Dec 30 01:53:36 server01 kernel: [ 369.101753] Memory cgroup out of memory: Kill process 5519 (apache2) score 754 or sacrifice child Dec 30 01:53:37 server01 kernel: [ 369.464262] Memory cgroup out of memory: Kill process 5583 (apache2) score 762 or sacrifice child Dec 30 01:53:37 server01 kernel: [ 369.465017] Out of memory: Kill process 5506 (apache2) score 18 or sacrifice child Dec 30 01:53:37 server01 kernel: [ 369.574932] Memory cgroup out of memory: Kill process 5523 (apache2) score 759 or sacrifice child So your mysqld has been killed by the global OOM not memcg. But why when you seem to be perfectly fine regarding memory? 
I guess the following backtrace is relevant: Dec 30 01:53:36 server01 kernel: [ 368.569720] DMA: 0*4kB 1*8kB 0*16kB 1*32kB 2*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15912kB Dec 30 01:53:36 server01 kernel: [ 368.570447] DMA32: 9*4kB 10*8kB 8*16kB 6*32kB 5*64kB 6*128kB 4*256kB 2*512kB 3*1024kB 3*2048kB 613*4096kB = 2523636kB Dec 30 01:53:36 server01 kernel: [ 368.571175] Normal: 5*4kB 2060*8kB 4122*16kB 2550*32kB 2667*64kB 722*128kB 197*256kB 68*512kB 15*1024kB 4*2048kB 1855*4096kB = 8134036kB Dec 30 01:53:36 server01 kernel: [ 368.571906] 308964 total pagecache pages Dec 30 01:53:36 server01 kernel: [ 368.572023] 0 pages in swap cache Dec 30 01:53:36 server01 kernel: [ 368.572140] Swap cache stats: add 0, delete 0, find 0/0 Dec 30 01:53:36 server01 kernel: [ 368.572260] Free swap = 0kB Dec 30 01:53:36 server01 kernel: [ 368.572375] Total swap = 0kB Dec 30 01:53:36 server01 kernel: [ 368.597836] apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0 Dec 30 01:53:36 server01 kernel: [ 368.598034] apache2 cpuset=uid mems_allowed=0 Dec 30 01:53:36 server01 kernel: [ 368.598152] Pid: 5385, comm: apache2 Not tainted 3.2.35-grsec #1 Dec 30 01:53:36 server01 kernel: [ 368.598273] Call Trace: Dec 30 01:53:36 server01 kernel: [ 368.598396] [] dump_header+0x7e/0x1e0 Dec 30 01:53:36 server01 kernel: [ 368.598516] [] ? find_lock_task_mm+0x2f/0x70 Dec 30 01:53:36 server01 kernel: [ 368.598638] [] oom_kill_process+0x85/0x2a0 Dec 30 01:53:36 server01 kernel: [ 368.598759] [] out_of_memory+0xe5/0x200 Dec 30 01:53:36 server01 kernel: [ 368.598880] [] pagefault_out_of_memory+0xbd/0x110 Dec 30 01:53:36 server01 kernel: [ 368.599006] [] mm_fault_error+0xb6/0x1a0 Dec 30 01:53:36 server01 kernel: [ 368.599127] [] do_page_fault+0x3ee/0x460 Dec 30 01:53:36 server01 kernel: [ 368.599250] [] ? mntput+0x1f/0x30 Dec 30 01:53:36 server01 kernel: [ 368.599371] [] ? 
fput+0x156/0x200 Dec 30 01:53:36 server01 kernel: [ 368.599496] [] page_fault+0x1f/0x30 This would suggest that an unexpected ENOMEM leaked during page fault path. I do not see which one could that be because you said THP (CONFIG_TRANSPARENT_HUGEPAGE) are disabled (and the other patch I have mentioned in the thread should fix that issue - btw. the patch is already scheduled for stable tree). __do_fault, do_anonymous_page and do_wp_page call mem_cgroup_newpage_charge with GFP_KERNEL which means that we do memcg OOM and never return ENOMEM. do_swap_page calls mem_cgroup_try_charge_swapin with GFP_KERNEL as well. I might have missed something but I will not get to look closer before 2nd January. -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: "azurIt" Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_from_add=5Fto=5Fpage=5Fcache=5Flocked?= Date: Fri, 25 Jan 2013 16:07:23 +0100 Message-ID: <20130125160723.FAE73567@pobox.sk> References: <20121210155205.GB6777@dhcp22.suse.cz>, <20121217023430.5A390FD7@pobox.sk>, <20121217163203.GD25432@dhcp22.suse.cz>, <20121217192301.829A7020@pobox.sk>, <20121217195510.GA16375@dhcp22.suse.cz>, <20121218152223.6912832C@pobox.sk>, <20121218152004.GA25208@dhcp22.suse.cz>, <20121224142526.020165D3@pobox.sk>, <20121228162209.GA1455@dhcp22.suse.cz>, <20121230020947.AA002F34@pobox.sk> <20121230110815.GA12940@dhcp22.suse.cz> Mime-Version: 1.0 Content-Transfer-Encoding: QUOTED-PRINTABLE Return-path: In-Reply-To: <20121230110815.GA12940-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="iso-8859-1" To: =?utf-8?q?Michal_Hocko?= Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= Any news? Thnx! 
azur ______________________________________________________________ > From: "Michal Hocko" > To: azurIt > Date: 30.12.2012 12:08 > Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked > > CC: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , "Johannes Weiner" >On Sun 30-12-12 02:09:47, azurIt wrote: >> >which suggests that the patch is incomplete and that I am blind :/ >> >mem_cgroup_cache_charge calls __mem_cgroup_try_charge for the page cache >> >and that one doesn't check GFP_MEMCG_NO_OOM. So you need the following >> >follow-up patch on top of the one you already have (which should catch >> >all the remaining cases). >> >Sorry about that... >> >> >> This was, again, killing my MySQL server (search for "(mysqld)"): >> http://www.watchdog.sk/lkml/oom_mysqld5 > >grep "Kill process" oom_mysqld5 >Dec 30 01:53:34 server01 kernel: [ 367.061801] Memory cgroup out of memory: Kill process 5512 (apache2) score 716 or sacrifice child >Dec 30 01:53:35 server01 kernel: [ 367.338024] Memory cgroup out of memory: Kill process 5517 (apache2) score 718 or sacrifice child >Dec 30 01:53:35 server01 kernel: [ 367.747888] Memory cgroup out of memory: Kill process 5513 (apache2) score 721 or sacrifice child >Dec 30 01:53:36 server01 kernel: [ 368.159860] Memory cgroup out of memory: Kill process 5516 (apache2) score 726 or sacrifice child >Dec 30 01:53:36 server01 kernel: [ 368.665606] Memory cgroup out of memory: Kill process 5520 (apache2) score 733 or sacrifice child >Dec 30 01:53:36 server01 kernel: [ 368.765652] Out of memory: Kill process 1778 (mysqld) score 39 or sacrifice child >Dec 30 01:53:36 server01 kernel: [ 369.101753] Memory cgroup out of memory: Kill process 5519 (apache2) score 754 or sacrifice child >Dec 30 01:53:37 server01 kernel: [ 369.464262] Memory cgroup out of memory: Kill process 5583
(apache2) score 762 or sacrifice child >Dec 30 01:53:37 server01 kernel: [ 369.465017] Out of memory: Kill process 5506 (apache2) score 18 or sacrifice child >Dec 30 01:53:37 server01 kernel: [ 369.574932] Memory cgroup out of memory: Kill process 5523 (apache2) score 759 or sacrifice child > >So your mysqld has been killed by the global OOM not memcg. But why when >you seem to be perfectly fine regarding memory? I guess the following >backtrace is relevant: >Dec 30 01:53:36 server01 kernel: [ 368.569720] DMA: 0*4kB 1*8kB 0*16kB 1*32kB 2*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15912kB >Dec 30 01:53:36 server01 kernel: [ 368.570447] DMA32: 9*4kB 10*8kB 8*16kB 6*32kB 5*64kB 6*128kB 4*256kB 2*512kB 3*1024kB 3*2048kB 613*4096kB = 2523636kB >Dec 30 01:53:36 server01 kernel: [ 368.571175] Normal: 5*4kB 2060*8kB 4122*16kB 2550*32kB 2667*64kB 722*128kB 197*256kB 68*512kB 15*1024kB 4*2048kB 1855*4096kB = 8134036kB >Dec 30 01:53:36 server01 kernel: [ 368.571906] 308964 total pagecache pages >Dec 30 01:53:36 server01 kernel: [ 368.572023] 0 pages in swap cache >Dec 30 01:53:36 server01 kernel: [ 368.572140] Swap cache stats: add 0, delete 0, find 0/0 >Dec 30 01:53:36 server01 kernel: [ 368.572260] Free swap = 0kB >Dec 30 01:53:36 server01 kernel: [ 368.572375] Total swap = 0kB >Dec 30 01:53:36 server01 kernel: [ 368.597836] apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0 >Dec 30 01:53:36 server01 kernel: [ 368.598034] apache2 cpuset=uid mems_allowed=0 >Dec 30 01:53:36 server01 kernel: [ 368.598152] Pid: 5385, comm: apache2 Not tainted 3.2.35-grsec #1 >Dec 30 01:53:36 server01 kernel: [ 368.598273] Call Trace: >Dec 30 01:53:36 server01 kernel: [ 368.598396] [] dump_header+0x7e/0x1e0 >Dec 30 01:53:36 server01 kernel: [ 368.598516] [] ?
find_lock_task_mm+0x2f/0x70 >Dec 30 01:53:36 server01 kernel: [ 368.598638] [] oom_kill_process+0x85/0x2a0 >Dec 30 01:53:36 server01 kernel: [ 368.598759] [] out_of_memory+0xe5/0x200 >Dec 30 01:53:36 server01 kernel: [ 368.598880] [] pagefault_out_of_memory+0xbd/0x110 >Dec 30 01:53:36 server01 kernel: [ 368.599006] [] mm_fault_error+0xb6/0x1a0 >Dec 30 01:53:36 server01 kernel: [ 368.599127] [] do_page_fault+0x3ee/0x460 >Dec 30 01:53:36 server01 kernel: [ 368.599250] [] ? mntput+0x1f/0x30 >Dec 30 01:53:36 server01 kernel: [ 368.599371] [] ? fput+0x156/0x200 >Dec 30 01:53:36 server01 kernel: [ 368.599496] [] page_fault+0x1f/0x30 > >This would suggest that an unexpected ENOMEM leaked during page fault >path. I do not see which one could that be because you said THP >(CONFIG_TRANSPARENT_HUGEPAGE) are disabled (and the other patch I have >mentioned in the thread should fix that issue - btw. the patch is >already scheduled for stable tree). > __do_fault, do_anonymous_page and do_wp_page call >mem_cgroup_newpage_charge with GFP_KERNEL which means that >we do memcg OOM and never return ENOMEM. do_swap_page calls >mem_cgroup_try_charge_swapin with GFP_KERNEL as well. > >I might have missed something but I will not get to look closer before >2nd January.
>-- 
>Michal Hocko
>SUSE Labs
>

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Michal Hocko
Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
Date: Fri, 25 Jan 2013 17:31:30 +0100
Message-ID: <20130125163130.GF4721@dhcp22.suse.cz>
References: <20121217163203.GD25432@dhcp22.suse.cz> <20121217192301.829A7020@pobox.sk> <20121217195510.GA16375@dhcp22.suse.cz> <20121218152223.6912832C@pobox.sk> <20121218152004.GA25208@dhcp22.suse.cz> <20121224142526.020165D3@pobox.sk> <20121228162209.GA1455@dhcp22.suse.cz> <20121230020947.AA002F34@pobox.sk> <20121230110815.GA12940@dhcp22.suse.cz> <20130125160723.FAE73567@pobox.sk>
Mime-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Return-path:
Content-Disposition: inline
In-Reply-To: <20130125160723.FAE73567@pobox.sk>
Sender: owner-linux-mm@kvack.org
List-ID:
Content-Type: text/plain; charset="iso-8859-1"
To: azurIt
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner

On Fri 25-01-13 16:07:23, azurIt wrote:
> Any news? Thnx!

Sorry, but I didn't get to this one yet.

> 
> azur
> 
> 
> 
> ______________________________________________________________
> > From: "Michal Hocko"
> > To: azurIt
> > Date: 30.12.2012 12:08
> > Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
> >
> > CC: linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , "Johannes Weiner"
> >On Sun 30-12-12 02:09:47, azurIt wrote:
> >> >which suggests that the patch is incomplete and that I am blind :/
> >> >mem_cgroup_cache_charge calls __mem_cgroup_try_charge for the page cache
> >> >and that one doesn't check GFP_MEMCG_NO_OOM. So you need the following
> >> >follow-up patch on top of the one you already have (which should catch
> >> >all the remaining cases).
> >> >Sorry about that...
> >> 
> >> 
> >> This was, again, killing my MySQL server (search for "(mysqld)"):
> >> http://www.watchdog.sk/lkml/oom_mysqld5
> >
> >grep "Kill process" oom_mysqld5 
> >Dec 30 01:53:34 server01 kernel: [ 367.061801] Memory cgroup out of memory: Kill process 5512 (apache2) score 716 or sacrifice child
> >Dec 30 01:53:35 server01 kernel: [ 367.338024] Memory cgroup out of memory: Kill process 5517 (apache2) score 718 or sacrifice child
> >Dec 30 01:53:35 server01 kernel: [ 367.747888] Memory cgroup out of memory: Kill process 5513 (apache2) score 721 or sacrifice child
> >Dec 30 01:53:36 server01 kernel: [ 368.159860] Memory cgroup out of memory: Kill process 5516 (apache2) score 726 or sacrifice child
> >Dec 30 01:53:36 server01 kernel: [ 368.665606] Memory cgroup out of memory: Kill process 5520 (apache2) score 733 or sacrifice child
> >Dec 30 01:53:36 server01 kernel: [ 368.765652] Out of memory: Kill process 1778 (mysqld) score 39 or sacrifice child
> >Dec 30 01:53:36 server01 kernel: [ 369.101753] Memory cgroup out of memory: Kill process 5519 (apache2) score 754 or sacrifice child
> >Dec 30 01:53:37 server01 kernel: [ 369.464262] Memory cgroup out of memory: Kill process 5583 (apache2) score 762 or sacrifice child
> >Dec 30 01:53:37 server01 kernel: [ 369.465017] Out of memory: Kill process 5506 (apache2) score 18 or sacrifice child
> >Dec 30 01:53:37 server01 kernel: [ 369.574932] Memory cgroup out of memory: Kill process 5523 (apache2) score 759 or sacrifice child
> >
> >So your mysqld has been killed by the global OOM not memcg. But why when
> >you seem to be perfectly fine regarding memory? I guess the following
> >backtrace is relevant:
> >Dec 30 01:53:36 server01 kernel: [ 368.569720] DMA: 0*4kB 1*8kB 0*16kB 1*32kB 2*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15912kB
> >Dec 30 01:53:36 server01 kernel: [ 368.570447] DMA32: 9*4kB 10*8kB 8*16kB 6*32kB 5*64kB 6*128kB 4*256kB 2*512kB 3*1024kB 3*2048kB 613*4096kB = 2523636kB
> >Dec 30 01:53:36 server01 kernel: [ 368.571175] Normal: 5*4kB 2060*8kB 4122*16kB 2550*32kB 2667*64kB 722*128kB 197*256kB 68*512kB 15*1024kB 4*2048kB 1855*4096kB = 8134036kB
> >Dec 30 01:53:36 server01 kernel: [ 368.571906] 308964 total pagecache pages
> >Dec 30 01:53:36 server01 kernel: [ 368.572023] 0 pages in swap cache
> >Dec 30 01:53:36 server01 kernel: [ 368.572140] Swap cache stats: add 0, delete 0, find 0/0
> >Dec 30 01:53:36 server01 kernel: [ 368.572260] Free swap = 0kB
> >Dec 30 01:53:36 server01 kernel: [ 368.572375] Total swap = 0kB
> >Dec 30 01:53:36 server01 kernel: [ 368.597836] apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0
> >Dec 30 01:53:36 server01 kernel: [ 368.598034] apache2 cpuset=uid mems_allowed=0
> >Dec 30 01:53:36 server01 kernel: [ 368.598152] Pid: 5385, comm: apache2 Not tainted 3.2.35-grsec #1
> >Dec 30 01:53:36 server01 kernel: [ 368.598273] Call Trace:
> >Dec 30 01:53:36 server01 kernel: [ 368.598396] [] dump_header+0x7e/0x1e0
> >Dec 30 01:53:36 server01 kernel: [ 368.598516] [] ? find_lock_task_mm+0x2f/0x70
> >Dec 30 01:53:36 server01 kernel: [ 368.598638] [] oom_kill_process+0x85/0x2a0
> >Dec 30 01:53:36 server01 kernel: [ 368.598759] [] out_of_memory+0xe5/0x200
> >Dec 30 01:53:36 server01 kernel: [ 368.598880] [] pagefault_out_of_memory+0xbd/0x110
> >Dec 30 01:53:36 server01 kernel: [ 368.599006] [] mm_fault_error+0xb6/0x1a0
> >Dec 30 01:53:36 server01 kernel: [ 368.599127] [] do_page_fault+0x3ee/0x460
> >Dec 30 01:53:36 server01 kernel: [ 368.599250] [] ? mntput+0x1f/0x30
> >Dec 30 01:53:36 server01 kernel: [ 368.599371] [] ? fput+0x156/0x200
> >Dec 30 01:53:36 server01 kernel: [ 368.599496] [] page_fault+0x1f/0x30
> >
> >This would suggest that an unexpected ENOMEM leaked during page fault
> >path. I do not see which one could that be because you said THP
> >(CONFIG_TRANSPARENT_HUGEPAGE) are disabled (and the other patch I have
> >mentioned in the thread should fix that issue - btw. the patch is
> >already scheduled for stable tree).
> > __do_fault, do_anonymous_page and do_wp_page call
> >mem_cgroup_newpage_charge with GFP_KERNEL which means that
> >we do memcg OOM and never return ENOMEM. do_swap_page calls
> >mem_cgroup_try_charge_swapin with GFP_KERNEL as well.
> >
> >I might have missed something but I will not get to look closer before
> >2nd January.
> >-- 
> >Michal Hocko
> >SUSE Labs
> >
>
> --
> To unsubscribe from this list: send the line "unsubscribe cgroups" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Michal Hocko
Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
Date: Tue, 5 Feb 2013 14:49:42 +0100
Message-ID: <20130205134937.GA22804@dhcp22.suse.cz>
References: <20121217192301.829A7020@pobox.sk> <20121217195510.GA16375@dhcp22.suse.cz> <20121218152223.6912832C@pobox.sk> <20121218152004.GA25208@dhcp22.suse.cz> <20121224142526.020165D3@pobox.sk> <20121228162209.GA1455@dhcp22.suse.cz> <20121230020947.AA002F34@pobox.sk> <20121230110815.GA12940@dhcp22.suse.cz> <20130125160723.FAE73567@pobox.sk> <20130125163130.GF4721@dhcp22.suse.cz>
Mime-Version: 1.0
Return-path:
Content-Disposition: inline
In-Reply-To: <20130125163130.GF4721-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
List-ID:
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: azurIt
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner

On Fri 25-01-13 17:31:30, Michal Hocko wrote:
> On Fri 25-01-13 16:07:23, azurIt wrote:
> > Any news? Thnx!
>
> Sorry, but I didn't get to this one yet.

Sorry, to get back to this that late but I was busy as hell since the
beginning of the year.

Has the issue repeated since then?

You said you didn't apply other than the above mentioned patch. Could
you apply also debugging part of the patches I have sent?
In case you don't have it handy then it should be this one:
---
>From 1623420d964e7e8bc88e2a6239563052df891bf7 Mon Sep 17 00:00:00 2001
From: Michal Hocko
Date: Mon, 3 Dec 2012 16:16:01 +0100
Subject: [PATCH] more debugging

---
 mm/huge_memory.c |    6 +++---
 mm/memcontrol.c  |    1 +
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 470cbb4..01a11f1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -671,7 +671,7 @@ static inline struct page *alloc_hugepage(int defrag)
 }
 #endif
 
-int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
+noinline int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			       unsigned long address, pmd_t *pmd,
 			       unsigned int flags)
 {
@@ -790,7 +790,7 @@ pgtable_t get_pmd_huge_pte(struct mm_struct *mm)
 	return pgtable;
 }
 
-static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
+static noinline int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 					struct vm_area_struct *vma,
 					unsigned long address,
 					pmd_t *pmd, pmd_t orig_pmd,
@@ -883,7 +883,7 @@ out_free_pages:
 	goto out;
 }
 
-int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
+noinline int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, pmd_t *pmd, pmd_t orig_pmd)
 {
 	int ret = 0;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c8425b1..1986c65 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2397,6 +2397,7 @@ done:
 	return 0;
 nomem:
 	*ptr = NULL;
+	__WARN_printf("gfp_mask:%u nr_pages:%u oom:%d ret:%d\n", gfp_mask, nr_pages, oom, ret);
 	return -ENOMEM;
 bypass:
 	*ptr = NULL;
-- 
1.7.10.4

-- 
Michal Hocko
SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970
From: "azurIt"
Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
Date: Tue, 05 Feb 2013 15:49:47 +0100
Message-ID: <20130205154947.CD6411E2@pobox.sk>
References: <20121217192301.829A7020@pobox.sk>,
 <20121217195510.GA16375@dhcp22.suse.cz>, <20121218152223.6912832C@pobox.sk>, <20121218152004.GA25208@dhcp22.suse.cz>, <20121224142526.020165D3@pobox.sk>, <20121228162209.GA1455@dhcp22.suse.cz>, <20121230020947.AA002F34@pobox.sk>, <20121230110815.GA12940@dhcp22.suse.cz>, <20130125160723.FAE73567@pobox.sk>, <20130125163130.GF4721@dhcp22.suse.cz> <20130205134937.GA22804@dhcp22.suse.cz>
Mime-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Return-path:
In-Reply-To: <20130205134937.GA22804@dhcp22.suse.cz>
Sender: owner-linux-mm@kvack.org
List-ID:
Content-Type: text/plain; charset="us-ascii"
To: Michal Hocko
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner

>Sorry, to get back to this that late but I was busy as hell since the
>beginning of the year.

Thank you for your time!

>Has the issue repeated since then?

Yes, it's happening all the time but meanwhile i wrote a script which is
monitoring the problem and killing frozen processes when it occurs. But
i don't like it much, it's not a solution for me :( i also noticed that
the problem is always affecting the whole server, but not as much as the
frozen cgroup. Depending on the number of frozen processes, sometimes it
has almost no impact on the rest of the server, sometimes the whole
server is lagging a lot.

I have another old problem which is maybe also related to this. I wasn't
connecting it with this before but now i'm not sure. Two of our servers,
which are affected by this cgroup problem, are also randomly freezing
completely (few times per month). These are the symptoms:
 - servers are answering to ping
 - it is possible to connect via SSH but connection is frozen after
   sending the password
 - it is possible to login via console but it is frozen after typing
   the login

These symptoms are very similar to HDD problems or HDD overload (but
there is no overload for sure). The only way to fix it is, probably,
hard rebooting the server (didn't find any other way). What do you
think? Can this be related? Maybe HDDs are locked in the similar way the
cgroups are - we already found out that cgroup freezing is related also
to HDD activity. Maybe there is a little chance that the whole HDD
subsystem ends in deadlock?

>You said you didn't apply other than the above mentioned patch. Could
>you apply also debugging part of the patches I have sent?
>In case you don't have it handy then it should be this one:

Just to be sure - am i supposed to apply these two patches?
http://watchdog.sk/lkml/patches/

azur

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Michal Hocko
Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
Date: Tue, 5 Feb 2013 17:09:34 +0100
Message-ID: <20130205160934.GB22804@dhcp22.suse.cz>
References: <20121218152223.6912832C@pobox.sk> <20121218152004.GA25208@dhcp22.suse.cz> <20121224142526.020165D3@pobox.sk> <20121228162209.GA1455@dhcp22.suse.cz> <20121230020947.AA002F34@pobox.sk> <20121230110815.GA12940@dhcp22.suse.cz> <20130125160723.FAE73567@pobox.sk> <20130125163130.GF4721@dhcp22.suse.cz> <20130205134937.GA22804@dhcp22.suse.cz> <20130205154947.CD6411E2@pobox.sk>
Mime-Version: 1.0
Return-path:
Content-Disposition: inline
In-Reply-To: <20130205154947.CD6411E2@pobox.sk>
Sender: owner-linux-mm@kvack.org
List-ID:
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: azurIt
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner

On Tue 05-02-13 15:49:47, azurIt wrote:
[...]
> Just to be sure - am i supposed to apply these two patches?
> http://watchdog.sk/lkml/patches/

5-memcg-fix-1.patch is not complete.
It doesn't contain the follow-up I
mentioned in a follow up email. Here is the full patch:
---
>From f2bf8437d5b9bb38a95a432bf39f32c584955171 Mon Sep 17 00:00:00 2001
From: Michal Hocko
Date: Mon, 26 Nov 2012 11:47:57 +0100
Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked

memcg oom killer might deadlock if the process which falls down to
mem_cgroup_handle_oom holds a lock which prevents other task to
terminate because it is blocked on the very same lock.
This can happen when a write system call needs to allocate a page but
the allocation hits the memcg hard limit and there is nothing to reclaim
(e.g. there is no swap or swap limit is hit as well and all cache pages
have been reclaimed already) and the process selected by memcg OOM
killer is blocked on i_mutex on the same inode (e.g. truncate it).

Process A
[] do_truncate+0x58/0xa0		# takes i_mutex
[] do_last+0x250/0xa30
[] path_openat+0xd7/0x440
[] do_filp_open+0x49/0xa0
[] do_sys_open+0x106/0x240
[] sys_open+0x20/0x30
[] system_call_fastpath+0x18/0x1d
[] 0xffffffffffffffff

Process B
[] mem_cgroup_handle_oom+0x241/0x3b0
[] T.1146+0x5ab/0x5c0
[] mem_cgroup_cache_charge+0xbe/0xe0
[] add_to_page_cache_locked+0x4c/0x140
[] add_to_page_cache_lru+0x22/0x50
[] grab_cache_page_write_begin+0x8b/0xe0
[] ext3_write_begin+0x88/0x270
[] generic_file_buffered_write+0x116/0x290
[] __generic_file_aio_write+0x27c/0x480
[] generic_file_aio_write+0x76/0xf0		# takes ->i_mutex
[] do_sync_write+0xea/0x130
[] vfs_write+0xf3/0x1f0
[] sys_write+0x51/0x90
[] system_call_fastpath+0x18/0x1d
[] 0xffffffffffffffff

This is not a hard deadlock though because administrator can still
intervene and increase the limit on the group which helps the writer to
finish the allocation and release the lock.

This patch heals the problem by forbidding OOM from page cache charges
(namely add_to_page_cache_locked). mem_cgroup_cache_charge_no_oom helper
function is defined which adds GFP_MEMCG_NO_OOM to the gfp mask which
then tells mem_cgroup_charge_common that OOM is not allowed for the
charge. No OOM from this path, except for fixing the bug, also make some
sense as we really do not want to cause an OOM because of a page cache
usage.
As a possibly visible result add_to_page_cache_lru might fail more often
with ENOMEM but this is to be expected if the limit is set and it is
preferable than OOM killer IMO.

__GFP_NORETRY is abused for this memcg specific flag because no user
accounted allocation use this flag except for THP which have memcg oom
disabled already.

Reported-by: azurIt
Signed-off-by: Michal Hocko
---
 include/linux/gfp.h        |    3 +++
 include/linux/memcontrol.h |   13 +++++++++++++
 mm/filemap.c               |    8 +++++++-
 mm/memcontrol.c            |   10 ++++++----
 4 files changed, 29 insertions(+), 5 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 3a76faf..806fb54 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -146,6 +146,9 @@ struct vm_area_struct;
 /* 4GB DMA on some platforms */
 #define GFP_DMA32	__GFP_DMA32
 
+/* memcg oom killer is not allowed */
+#define GFP_MEMCG_NO_OOM	__GFP_NORETRY
+
 /* Convert GFP flags to their corresponding migrate type */
 static inline int allocflags_to_migratetype(gfp_t gfp_flags)
 {
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 81572af..bf0e575 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -63,6 +63,13 @@ extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *ptr);
 
 extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 					gfp_t gfp_mask);
+
+static inline int mem_cgroup_cache_charge_no_oom(struct page *page,
+					struct mm_struct *mm, gfp_t gfp_mask)
+{
+	return mem_cgroup_cache_charge(page, mm, gfp_mask | GFP_MEMCG_NO_OOM);
+}
+
 extern void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru);
 extern void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru);
 extern void mem_cgroup_rotate_reclaimable_page(struct page *page);
@@ -178,6 +185,12 @@ static inline int mem_cgroup_cache_charge(struct page *page,
 	return 0;
 }
 
+static inline int mem_cgroup_cache_charge_no_oom(struct page *page,
+					struct mm_struct *mm, gfp_t gfp_mask)
+{
+	return 0;
+}
+
 static inline int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
 		struct page *page, gfp_t gfp_mask, struct mem_cgroup **ptr)
 {
diff --git a/mm/filemap.c b/mm/filemap.c
index 556858c..ef182a9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -449,7 +449,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 	VM_BUG_ON(!PageLocked(page));
 	VM_BUG_ON(PageSwapBacked(page));
 
-	error = mem_cgroup_cache_charge(page, current->mm,
+	/*
+	 * Cannot trigger OOM even if gfp_mask would allow that normally
+	 * because we might be called from a locked context and that
+	 * could lead to deadlocks if the killed process is waiting for
+	 * the same lock.
+	 */
+	error = mem_cgroup_cache_charge_no_oom(page, current->mm,
 					gfp_mask & GFP_RECLAIM_MASK);
 	if (error)
 		goto out;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1986c65..a68aa08 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2704,7 +2704,7 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
 	struct mem_cgroup *memcg = NULL;
 	unsigned int nr_pages = 1;
 	struct page_cgroup *pc;
-	bool oom = true;
+	bool oom = !(gfp_mask & GFP_MEMCG_NO_OOM);
 	int ret;
 
 	if (PageTransHuge(page)) {
@@ -2771,6 +2771,7 @@ __mem_cgroup_commit_charge_lrucare(struct page *page, struct mem_cgroup *memcg,
 int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 				gfp_t gfp_mask)
 {
+	bool oom = !(gfp_mask & GFP_MEMCG_NO_OOM);
 	struct mem_cgroup *memcg = NULL;
 	int ret;
 
@@ -2783,7 +2784,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 		mm = &init_mm;
 
 	if (page_is_file_cache(page)) {
-		ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, &memcg, true);
+		ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, &memcg, oom);
 		if (ret || !memcg)
 			return ret;
 
@@ -2819,6 +2820,7 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
 				 struct page *page,
 				 gfp_t mask, struct mem_cgroup **ptr)
 {
+	bool oom = !(mask & GFP_MEMCG_NO_OOM);
 	struct mem_cgroup *memcg;
 	int ret;
 
@@ -2841,13 +2843,13 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
 	if (!memcg)
 		goto charge_cur_mm;
 	*ptr = memcg;
-	ret = __mem_cgroup_try_charge(NULL, mask, 1, ptr, true);
+	ret = __mem_cgroup_try_charge(NULL, mask, 1, ptr, oom);
 	css_put(&memcg->css);
 	return ret;
 charge_cur_mm:
 	if (unlikely(!mm))
 		mm = &init_mm;
-	return __mem_cgroup_try_charge(mm, mask, 1, ptr, true);
+	return __mem_cgroup_try_charge(mm, mask, 1, ptr, oom);
 }
 
 static void
-- 
1.7.10.4

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Michal Hocko
Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
Date: Tue, 5 Feb 2013 17:31:06 +0100
Message-ID: <20130205163106.GC22804@dhcp22.suse.cz>
References: <20121218152223.6912832C@pobox.sk> <20121218152004.GA25208@dhcp22.suse.cz> <20121224142526.020165D3@pobox.sk> <20121228162209.GA1455@dhcp22.suse.cz> <20121230020947.AA002F34@pobox.sk> <20121230110815.GA12940@dhcp22.suse.cz> <20130125160723.FAE73567@pobox.sk> <20130125163130.GF4721@dhcp22.suse.cz> <20130205134937.GA22804@dhcp22.suse.cz> <20130205154947.CD6411E2@pobox.sk>
Mime-Version: 1.0
Return-path:
Content-Disposition: inline
In-Reply-To: <20130205154947.CD6411E2-Rm0zKEqwvD4@public.gmane.org>
Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
List-ID:
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: azurIt
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups
 mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner

On Tue 05-02-13 15:49:47, azurIt wrote:
[...]
> I have another old problem which is maybe also related to this. I
> wasn't connecting it with this before but now i'm not sure. Two of our
> servers, which are affected by this cgroup problem, are also randomly
> freezing completely (few times per month). These are the symptoms:
> - servers are answering to ping
> - it is possible to connect via SSH but connection is frozen after
> sending the password
> - it is possible to login via console but it is frozen after typing
> the login
> These symptoms are very similar to HDD problems or HDD overload (but
> there is no overload for sure). The only way to fix it is, probably,
> hard rebooting the server (didn't find any other way). What do you
> think? Can this be related?

This is hard to tell without further information.

> Maybe HDDs are locked in the similar way the cgroups are - we already
> found out that cgroup freezing is related also to HDD activity. Maybe
> there is a little chance that the whole HDD subsystem ends in
> deadlock?

"HDD subsystem" whatever that means cannot be blocked by memcg being
stuck. Certain access to some files might be an issue because those
could have locks held but I do not see other relations.

I would start by checking the HW, trying to focus on reducing elements
that could contribute - aka try to nail down to the minimum set which
reproduces the issue. I cannot help you much with that I am afraid.
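
To make the "locks held" point concrete, here is a minimal user-space
sketch (not kernel code) of the gate that the add_to_page_cache_locked
patch in this thread adds: page cache charges OR in GFP_MEMCG_NO_OOM, and
the charge path only allows the memcg OOM killer when that bit is clear.
The SKETCH_* names, the sketch_* helpers and the numeric flag values are
assumptions made up for illustration; the real definitions live in
include/linux/gfp.h and vary between kernel versions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical type and flag values, for illustration only. */
typedef unsigned int sketch_gfp_t;

#define SKETCH_GFP_FS           0x80u    /* assumed value, not the kernel's */
#define SKETCH_GFP_NORETRY      0x1000u  /* assumed value, not the kernel's */
/* The patch aliases its "no memcg OOM" flag to __GFP_NORETRY the same way. */
#define SKETCH_GFP_MEMCG_NO_OOM SKETCH_GFP_NORETRY

/* Mirrors "bool oom = !(gfp_mask & GFP_MEMCG_NO_OOM);" from the patch. */
static inline bool sketch_may_oom(sketch_gfp_t gfp_mask)
{
	return !(gfp_mask & SKETCH_GFP_MEMCG_NO_OOM);
}

/* Mirrors mem_cgroup_cache_charge_no_oom(): force the opt-out bit on. */
static inline sketch_gfp_t sketch_cache_charge_mask(sketch_gfp_t gfp_mask)
{
	return gfp_mask | SKETCH_GFP_MEMCG_NO_OOM;
}

/* Small self-check; call it from main() when trying the sketch out. */
static inline void sketch_self_check(void)
{
	/* A charge without the opt-out bit may still trigger memcg OOM. */
	assert(sketch_may_oom(SKETCH_GFP_FS));
	/* A page cache charge, possibly taken under i_mutex, must not. */
	assert(!sketch_may_oom(sketch_cache_charge_mask(SKETCH_GFP_FS)));
}
```

With the opt-out bit set, the writer is expected to get -ENOMEM back from
the charge rather than sleep in mem_cgroup_handle_oom while it still
holds i_mutex, which is the trade-off the patch description accepts.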
-- 
Michal Hocko
SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970
From: "azurIt"
Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
Date: Tue, 05 Feb 2013 17:46:37 +0100
Message-ID: <20130205174637.C7A8CE45@pobox.sk>
References: <20121218152223.6912832C@pobox.sk>, <20121218152004.GA25208@dhcp22.suse.cz>, <20121224142526.020165D3@pobox.sk>, <20121228162209.GA1455@dhcp22.suse.cz>, <20121230020947.AA002F34@pobox.sk>, <20121230110815.GA12940@dhcp22.suse.cz>, <20130125160723.FAE73567@pobox.sk>, <20130125163130.GF4721@dhcp22.suse.cz>, <20130205134937.GA22804@dhcp22.suse.cz>, <20130205154947.CD6411E2@pobox.sk> <20130205160934.GB22804@dhcp22.suse.cz>
Mime-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Return-path:
In-Reply-To: <20130205160934.GB22804@dhcp22.suse.cz>
Sender: owner-linux-mm@kvack.org
List-ID:
Content-Type: text/plain; charset="us-ascii"
To: Michal Hocko
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner

>5-memcg-fix-1.patch is not complete. It doesn't contain the follow-up I
>mentioned in a follow up email.

oh, it wasn't complete? i used it in my last test.. sorry, i'm a little
confused by all those patches. will try it tonight and report back.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Greg Thelen Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Date: Tue, 05 Feb 2013 08:48:23 -0800 Message-ID: References: <20121218152223.6912832C@pobox.sk> <20121218152004.GA25208@dhcp22.suse.cz> <20121224142526.020165D3@pobox.sk> <20121228162209.GA1455@dhcp22.suse.cz> <20121230020947.AA002F34@pobox.sk> <20121230110815.GA12940@dhcp22.suse.cz> <20130125160723.FAE73567@pobox.sk> <20130125163130.GF4721@dhcp22.suse.cz> <20130205134937.GA22804@dhcp22.suse.cz> <20130205154947.CD6411E2@pobox.sk> <20130205160934.GB22804@dhcp22.suse.cz> Mime-Version: 1.0 Return-path: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=x-received:from:to:cc:subject:references:date:in-reply-to :message-id:user-agent:mime-version:content-type; bh=uf+DfhKV73kxJZwrUZecD9aGh4LYTSXob3oO5zajGk4=; b=BWYi+U9cFwUYPksfSojMLUEG5xpRmxv/TruKBUNPY9+HQWL8nvi5ywUYpXgjHNcAKL xJYxzwjP0mFODuS7oDcxt04MaNukc004M1aWLjLHbfgMJXmSxiAv/LDyMDHrWl2LpHwM fV6mtDxp3Gos37ny1J6bvpuosS6o4804vPPFxY+P77ppxLaufzph8q4Clm2shYqSfcKN aNyKGNrjV4RbxfhLtPR+TNpaK1YXOsdOR7tIr0Ck7tOKrWNcLEgr9rVQUiYjptqWAilb 2h37EL+MBwEdwopgPq8ySN8LzrqJ3OBSYc6NSUSgVr4SS1Fgmi/pBt0F7FW3gST6LXNy vx0Q== In-Reply-To: <20130205160934.GB22804@dhcp22.suse.cz> (Michal Hocko's message of "Tue, 5 Feb 2013 17:09:34 +0100") Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Michal Hocko Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Tue, Feb 05 2013, Michal Hocko wrote: > On Tue 05-02-13 15:49:47, azurIt wrote: > [...] >> Just to be sure - am i supposed to apply this two patches? >> http://watchdog.sk/lkml/patches/ > > 5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I > mentioned in a follow up email. 
Here is the full patch: > --- > From f2bf8437d5b9bb38a95a432bf39f32c584955171 Mon Sep 17 00:00:00 2001 > From: Michal Hocko > Date: Mon, 26 Nov 2012 11:47:57 +0100 > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked > > memcg oom killer might deadlock if the process which falls down to > mem_cgroup_handle_oom holds a lock which prevents other task to > terminate because it is blocked on the very same lock. > This can happen when a write system call needs to allocate a page but > the allocation hits the memcg hard limit and there is nothing to reclaim > (e.g. there is no swap or swap limit is hit as well and all cache pages > have been reclaimed already) and the process selected by memcg OOM > killer is blocked on i_mutex on the same inode (e.g. truncate it). > > Process A > [] do_truncate+0x58/0xa0 # takes i_mutex > [] do_last+0x250/0xa30 > [] path_openat+0xd7/0x440 > [] do_filp_open+0x49/0xa0 > [] do_sys_open+0x106/0x240 > [] sys_open+0x20/0x30 > [] system_call_fastpath+0x18/0x1d > [] 0xffffffffffffffff > > Process B > [] mem_cgroup_handle_oom+0x241/0x3b0 > [] T.1146+0x5ab/0x5c0 > [] mem_cgroup_cache_charge+0xbe/0xe0 > [] add_to_page_cache_locked+0x4c/0x140 > [] add_to_page_cache_lru+0x22/0x50 > [] grab_cache_page_write_begin+0x8b/0xe0 > [] ext3_write_begin+0x88/0x270 > [] generic_file_buffered_write+0x116/0x290 > [] __generic_file_aio_write+0x27c/0x480 > [] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex > [] do_sync_write+0xea/0x130 > [] vfs_write+0xf3/0x1f0 > [] sys_write+0x51/0x90 > [] system_call_fastpath+0x18/0x1d > [] 0xffffffffffffffff It looks like grab_cache_page_write_begin() passes __GFP_FS into __page_cache_alloc() and mem_cgroup_cache_charge(). Which makes me think that this deadlock is also possible in the page allocator even before getting to add_to_page_cache_lru. no? Can callers holding fs resources (e.g. i_mutex) pass __GFP_FS into the page allocator? 
If __GFP_FS was avoided, then I think memcg user page charging would need a !__GFP_FS check to avoid invoking oom killer, but at least then we'd avoid both deadlocks and cover both page allocation and memcg page charging in similar fashion. Example from memcg_charge_kmem: may_oom = (gfp & __GFP_FS) && !(gfp & __GFP_NORETRY); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Date: Tue, 5 Feb 2013 18:46:51 +0100 Message-ID: <20130205174651.GA3959@dhcp22.suse.cz> References: <20121224142526.020165D3@pobox.sk> <20121228162209.GA1455@dhcp22.suse.cz> <20121230020947.AA002F34@pobox.sk> <20121230110815.GA12940@dhcp22.suse.cz> <20130125160723.FAE73567@pobox.sk> <20130125163130.GF4721@dhcp22.suse.cz> <20130205134937.GA22804@dhcp22.suse.cz> <20130205154947.CD6411E2@pobox.sk> <20130205160934.GB22804@dhcp22.suse.cz> Mime-Version: 1.0 Return-path: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=x-received:sender:date:from:to:cc:subject:message-id:references :mime-version:content-type:content-disposition:in-reply-to :user-agent; bh=sNz8eTPzOSmL14HKryJqfhf4UUdw/1oO2dC4xdbUw1w=; b=dy964Dm9y35tPI1URk/T+D06DtekqQsVjb/buA2tE2Nv5ASnLOX48QDIlMBQh+YFFr ZxfbkU/ogiXT8JQBrOsLFWfJjz08wopsM6+5QN4FU4Se9FatdK56TF4SuiibI81mcB0g KBQfRu9dknr0gLJvH/Ao+sUORNetpzEFvmbaNCqAE8ou+YrbKpCyqUatUx+ui8nHY4r1 WPf2txQjCWmCUgbgDz8zVZVq3LOgW0kKQxxjOyd5itIYXmrOGcgib6AjAe9cmRvJKooh aT+87g+uDFXk7oJYW7IWgRuIyJeTZPKZRqxbNFUP+K3ejaapwt0KUo8FAjH1Lyc8RpWP 05IQ== Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Greg Thelen Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups 
mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Tue 05-02-13 08:48:23, Greg Thelen wrote: > On Tue, Feb 05 2013, Michal Hocko wrote: > > > On Tue 05-02-13 15:49:47, azurIt wrote: > > [...] > >> Just to be sure - am i supposed to apply this two patches? > >> http://watchdog.sk/lkml/patches/ > > > > 5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I > > mentioned in a follow up email. Here is the full patch: > > --- > > From f2bf8437d5b9bb38a95a432bf39f32c584955171 Mon Sep 17 00:00:00 2001 > > From: Michal Hocko > > Date: Mon, 26 Nov 2012 11:47:57 +0100 > > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked > > > > memcg oom killer might deadlock if the process which falls down to > > mem_cgroup_handle_oom holds a lock which prevents other task to > > terminate because it is blocked on the very same lock. > > This can happen when a write system call needs to allocate a page but > > the allocation hits the memcg hard limit and there is nothing to reclaim > > (e.g. there is no swap or swap limit is hit as well and all cache pages > > have been reclaimed already) and the process selected by memcg OOM > > killer is blocked on i_mutex on the same inode (e.g. truncate it). 
> > > > Process A > > [] do_truncate+0x58/0xa0 # takes i_mutex > > [] do_last+0x250/0xa30 > > [] path_openat+0xd7/0x440 > > [] do_filp_open+0x49/0xa0 > > [] do_sys_open+0x106/0x240 > > [] sys_open+0x20/0x30 > > [] system_call_fastpath+0x18/0x1d > > [] 0xffffffffffffffff > > > > Process B > > [] mem_cgroup_handle_oom+0x241/0x3b0 > > [] T.1146+0x5ab/0x5c0 > > [] mem_cgroup_cache_charge+0xbe/0xe0 > > [] add_to_page_cache_locked+0x4c/0x140 > > [] add_to_page_cache_lru+0x22/0x50 > > [] grab_cache_page_write_begin+0x8b/0xe0 > > [] ext3_write_begin+0x88/0x270 > > [] generic_file_buffered_write+0x116/0x290 > > [] __generic_file_aio_write+0x27c/0x480 > > [] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex > > [] do_sync_write+0xea/0x130 > > [] vfs_write+0xf3/0x1f0 > > [] sys_write+0x51/0x90 > > [] system_call_fastpath+0x18/0x1d > > [] 0xffffffffffffffff > > It looks like grab_cache_page_write_begin() passes __GFP_FS into > __page_cache_alloc() and mem_cgroup_cache_charge(). Which makes me > think that this deadlock is also possible in the page allocator even > before getting to add_to_page_cache_lru. no? I am not that familiar with VFS but i_mutex is a high level lock AFAIR and it shouldn't be called from the pageout path so __page_cache_alloc should be safe. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Greg Thelen Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Date: Tue, 05 Feb 2013 10:09:57 -0800 Message-ID: References: <20121224142526.020165D3@pobox.sk> <20121228162209.GA1455@dhcp22.suse.cz> <20121230020947.AA002F34@pobox.sk> <20121230110815.GA12940@dhcp22.suse.cz> <20130125160723.FAE73567@pobox.sk> <20130125163130.GF4721@dhcp22.suse.cz> <20130205134937.GA22804@dhcp22.suse.cz> <20130205154947.CD6411E2@pobox.sk> <20130205160934.GB22804@dhcp22.suse.cz> <20130205174651.GA3959@dhcp22.suse.cz> Mime-Version: 1.0 Return-path: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=x-received:from:to:cc:subject:references:date:in-reply-to :message-id:user-agent:mime-version:content-type; bh=k5Qq/M1KyiqE8cOQ0b22rhAb1QWvph8NVYvTLTNFsKU=; b=Fs8MG/joKyRzVmoj9YZHqBKTfvrFRi6+qPf6XylTr7ioa8jnEKd3ZOQzeHMSMeF79b mJi+nkCtNCk61Uh5FKK50mwDMBwgCCNX0t2xDW1KMVSI6LnRUoDz7ff0wEiOTVtHUQHN Y5U0QjJAPzJY0zO4ywzDEJYzu/kla8Mz2yg0n/ikW7dEvzTwMUCeNboMKduxxHu+ya01 pU/2ToczxqFV4QrUnKlX9ieN5f6DFiJSkJQPvNLUsG1+mQFAbPyA4TwYCURRcImeiCew TJ8oGSSOASZPzOkRINEFKMSrkOq9yWbusUre6c85/yYpWa02W2iJRTJjEVlCKiOYN+5e UWBA== In-Reply-To: <20130205174651.GA3959@dhcp22.suse.cz> (Michal Hocko's message of "Tue, 5 Feb 2013 18:46:51 +0100") Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Michal Hocko Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Tue, Feb 05 2013, Michal Hocko wrote: > On Tue 05-02-13 08:48:23, Greg Thelen wrote: >> On Tue, Feb 05 2013, Michal Hocko wrote: >> >> > On Tue 05-02-13 15:49:47, azurIt wrote: >> > [...] >> >> Just to be sure - am i supposed to apply this two patches? >> >> http://watchdog.sk/lkml/patches/ >> > >> > 5-memcg-fix-1.patch is not complete. 
It doesn't contain the folloup I >> > mentioned in a follow up email. Here is the full patch: >> > --- >> > From f2bf8437d5b9bb38a95a432bf39f32c584955171 Mon Sep 17 00:00:00 2001 >> > From: Michal Hocko >> > Date: Mon, 26 Nov 2012 11:47:57 +0100 >> > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked >> > >> > memcg oom killer might deadlock if the process which falls down to >> > mem_cgroup_handle_oom holds a lock which prevents other task to >> > terminate because it is blocked on the very same lock. >> > This can happen when a write system call needs to allocate a page but >> > the allocation hits the memcg hard limit and there is nothing to reclaim >> > (e.g. there is no swap or swap limit is hit as well and all cache pages >> > have been reclaimed already) and the process selected by memcg OOM >> > killer is blocked on i_mutex on the same inode (e.g. truncate it). >> > >> > Process A >> > [] do_truncate+0x58/0xa0 # takes i_mutex >> > [] do_last+0x250/0xa30 >> > [] path_openat+0xd7/0x440 >> > [] do_filp_open+0x49/0xa0 >> > [] do_sys_open+0x106/0x240 >> > [] sys_open+0x20/0x30 >> > [] system_call_fastpath+0x18/0x1d >> > [] 0xffffffffffffffff >> > >> > Process B >> > [] mem_cgroup_handle_oom+0x241/0x3b0 >> > [] T.1146+0x5ab/0x5c0 >> > [] mem_cgroup_cache_charge+0xbe/0xe0 >> > [] add_to_page_cache_locked+0x4c/0x140 >> > [] add_to_page_cache_lru+0x22/0x50 >> > [] grab_cache_page_write_begin+0x8b/0xe0 >> > [] ext3_write_begin+0x88/0x270 >> > [] generic_file_buffered_write+0x116/0x290 >> > [] __generic_file_aio_write+0x27c/0x480 >> > [] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex >> > [] do_sync_write+0xea/0x130 >> > [] vfs_write+0xf3/0x1f0 >> > [] sys_write+0x51/0x90 >> > [] system_call_fastpath+0x18/0x1d >> > [] 0xffffffffffffffff >> >> It looks like grab_cache_page_write_begin() passes __GFP_FS into >> __page_cache_alloc() and mem_cgroup_cache_charge(). 
Which makes me
>> think that this deadlock is also possible in the page allocator even
>> before getting to add_to_page_cache_lru. no?
>
> I am not that familiar with VFS but i_mutex is a high level lock AFAIR
> and it shouldn't be called from the pageout path so __page_cache_alloc
> should be safe.

I wasn't clear, sorry. My concern is not that pageout() grabs i_mutex.
My concern is that __page_cache_alloc() will invoke the oom killer and
select a victim which wants i_mutex. This victim will deadlock because
the oom killer caller already holds i_mutex.

The wild accusation I am making is that anyone who invokes the oom
killer and waits on the victim to die is essentially grabbing all of
the locks that any of the oom killer victims may grab (e.g. i_mutex).

To avoid deadlock the oom killer can only be called while holding
no locks that the oom victim demands. I think some locks are grabbed
in a way that allows the lock request to fail if the task has a fatal
signal pending, so they are safe. But any lock acquisitions that
cannot fail (e.g. mutex_lock) will deadlock with the oom killing
process. So the oom killing process cannot hold any such locks which
the victim will attempt to grab. Hopefully I'm missing something.

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Date: Tue, 5 Feb 2013 19:59:53 +0100 Message-ID: <20130205185953.GB3959@dhcp22.suse.cz> References: <20121230020947.AA002F34@pobox.sk> <20121230110815.GA12940@dhcp22.suse.cz> <20130125160723.FAE73567@pobox.sk> <20130125163130.GF4721@dhcp22.suse.cz> <20130205134937.GA22804@dhcp22.suse.cz> <20130205154947.CD6411E2@pobox.sk> <20130205160934.GB22804@dhcp22.suse.cz> <20130205174651.GA3959@dhcp22.suse.cz> Mime-Version: 1.0 Return-path: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=x-received:sender:date:from:to:cc:subject:message-id:references :mime-version:content-type:content-disposition:in-reply-to :user-agent; bh=Wmwa1grKyKjpKP00M7ntB+H3hJld7bwso7Hx2WG6vys=; b=kF8dFixqVamYK+AMCBIumf0xGlNZOSXcCUCHIb/2dFwHBUahNs3R29rleYo4voZKjd 1qkMEdV291EKfc7rtWnrlNwq2mJy8yN5ATPZOXv3c078rKum6WaxjqJTO3khDAEdey52 ondvtCF4uMMPQNhKyseJW+OsTXIpD/fsMC6TbNAfKRSJTODg1VKnJQuvNW1+oAIvbg// 84Lqy/bwYgPe6p9Nv0XhFcFis5DSsTwmG8S2PwaJYx5qx7ku1mwSqKv2BjAwdROAKJNy 1laSeh5JVP5rT3ryrI4+GDzSqcUJnyucbSZevg+XGhdlu5S7mjA5DB/PTPfgerCMUkYY HVmQ== Content-Disposition: inline In-Reply-To: Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Greg Thelen Cc: azurIt , linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Tue 05-02-13 10:09:57, Greg Thelen wrote: > On Tue, Feb 05 2013, Michal Hocko wrote: > > > On Tue 05-02-13 08:48:23, Greg Thelen wrote: > >> On Tue, Feb 05 2013, Michal Hocko wrote: > >> > >> > On Tue 05-02-13 15:49:47, azurIt wrote: > >> > [...] > >> >> Just to be sure - am i supposed to apply this two patches? 
> >> >> http://watchdog.sk/lkml/patches/ > >> > > >> > 5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I > >> > mentioned in a follow up email. Here is the full patch: > >> > --- > >> > From f2bf8437d5b9bb38a95a432bf39f32c584955171 Mon Sep 17 00:00:00 2001 > >> > From: Michal Hocko > >> > Date: Mon, 26 Nov 2012 11:47:57 +0100 > >> > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked > >> > > >> > memcg oom killer might deadlock if the process which falls down to > >> > mem_cgroup_handle_oom holds a lock which prevents other task to > >> > terminate because it is blocked on the very same lock. > >> > This can happen when a write system call needs to allocate a page but > >> > the allocation hits the memcg hard limit and there is nothing to reclaim > >> > (e.g. there is no swap or swap limit is hit as well and all cache pages > >> > have been reclaimed already) and the process selected by memcg OOM > >> > killer is blocked on i_mutex on the same inode (e.g. truncate it). 
> >> > > >> > Process A > >> > [] do_truncate+0x58/0xa0 # takes i_mutex > >> > [] do_last+0x250/0xa30 > >> > [] path_openat+0xd7/0x440 > >> > [] do_filp_open+0x49/0xa0 > >> > [] do_sys_open+0x106/0x240 > >> > [] sys_open+0x20/0x30 > >> > [] system_call_fastpath+0x18/0x1d > >> > [] 0xffffffffffffffff > >> > > >> > Process B > >> > [] mem_cgroup_handle_oom+0x241/0x3b0 > >> > [] T.1146+0x5ab/0x5c0 > >> > [] mem_cgroup_cache_charge+0xbe/0xe0 > >> > [] add_to_page_cache_locked+0x4c/0x140 > >> > [] add_to_page_cache_lru+0x22/0x50 > >> > [] grab_cache_page_write_begin+0x8b/0xe0 > >> > [] ext3_write_begin+0x88/0x270 > >> > [] generic_file_buffered_write+0x116/0x290 > >> > [] __generic_file_aio_write+0x27c/0x480 > >> > [] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex > >> > [] do_sync_write+0xea/0x130 > >> > [] vfs_write+0xf3/0x1f0 > >> > [] sys_write+0x51/0x90 > >> > [] system_call_fastpath+0x18/0x1d > >> > [] 0xffffffffffffffff > >> > >> It looks like grab_cache_page_write_begin() passes __GFP_FS into > >> __page_cache_alloc() and mem_cgroup_cache_charge(). Which makes me > >> think that this deadlock is also possible in the page allocator even > >> before getting to add_to_page_cache_lru. no? > > > > I am not that familiar with VFS but i_mutex is a high level lock AFAIR > > and it shouldn't be called from the pageout path so __page_cache_alloc > > should be safe. > > I wasn't clear, sorry. My concern is not that pageout() grabs i_mutex. > My concern is that __page_cache_alloc() will invoke the oom killer and > select a victim which wants i_mutex. This victim will deadlock because > the oom killer caller already holds i_mutex. That would be true for the memcg oom because that one is blocking but the global oom just puts the allocator into sleep for a while and then the allocator should back off eventually (unless this is NOFAIL allocation). I would need to look closer whether this is really the case - I haven't seen that allocator code path for a while... 
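A minimal userspace sketch of the distinction being drawn here, under a deliberately simplified model. This is not kernel code, and every identifier in it (model_task, model_group, model_charge, CHARGE_OOM_WAIT) is made up for illustration. It only encodes the rule under discussion: a charge that goes over the limit may block waiting for an OOM victim only if the caller holds no lock the victim might need; otherwise it should back off with -ENOMEM.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model, not kernel code: all names here are illustrative. */

enum charge_result {
	CHARGE_OK = 0,		/* the charge fit under the limit */
	CHARGE_OOM_WAIT = 1,	/* caller may invoke OOM and wait for the victim */
};

struct model_task {
	unsigned int locks_held;	/* uninterruptible locks currently held */
};

struct model_group {
	unsigned long usage;	/* pages currently charged */
	unsigned long limit;	/* hard limit in pages */
};

/*
 * Waiting for an OOM victim is only safe when the waiter holds no lock
 * that the victim might need in order to exit.
 */
static bool may_wait_for_oom_victim(const struct model_task *t)
{
	return t->locks_held == 0;
}

/*
 * Try to charge nr_pages to the group. Returns CHARGE_OK on success,
 * CHARGE_OOM_WAIT when over the limit and blocking in OOM handling is
 * safe, and -ENOMEM when over the limit but the caller holds locks and
 * therefore has to back off instead of waiting.
 */
static int model_charge(struct model_group *g, const struct model_task *t,
			unsigned long nr_pages)
{
	assert(g->usage <= g->limit);
	if (nr_pages <= g->limit - g->usage) {
		g->usage += nr_pages;
		return CHARGE_OK;
	}
	return may_wait_for_oom_victim(t) ? CHARGE_OOM_WAIT : -ENOMEM;
}
```

In this model a writer that took a lock before charging gets -ENOMEM rather than sleeping on a victim that may be blocked on the same lock. Whether the real global OOM path always backs off like that is exactly the open question here.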
> The wild accusation I am making is that anyone who invokes the oom > killer and waits on the victim to die is essentially grabbing all of > the locks that any of the oom killer victims may grab (e.g. i_mutex). True. > To avoid deadlock the oom killer can only be called is while holding > no locks that the oom victim demands. I think some locks are grabbed > in a way that allows the lock request to fail if the task has a fatal > signal pending, so they are safe. But any locks acquisitions that > cannot fail (e.g. mutex_lock) will deadlock with the oom killing > process. So the oom killing process cannot hold any such locks which > the victim will attempt to grab. Hopefully I'm missing something. Agreed. -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: "azurIt" Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_from_add=5Fto=5Fpage=5Fcache=5Flocked?= Date: Wed, 06 Feb 2013 02:17:21 +0100 Message-ID: <20130206021721.1AE9E3C7@pobox.sk> References: <20121218152223.6912832C@pobox.sk>, <20121218152004.GA25208@dhcp22.suse.cz>, <20121224142526.020165D3@pobox.sk>, <20121228162209.GA1455@dhcp22.suse.cz>, <20121230020947.AA002F34@pobox.sk>, <20121230110815.GA12940@dhcp22.suse.cz>, <20130125160723.FAE73567@pobox.sk>, <20130125163130.GF4721@dhcp22.suse.cz>, <20130205134937.GA22804@dhcp22.suse.cz>, <20130205154947.CD6411E2@pobox.sk> <20130205160934.GB22804@dhcp22.suse.cz> Mime-Version: 1.0 Content-Transfer-Encoding: quoted-printable Return-path: In-Reply-To: <20130205160934.GB22804@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" To: =?utf-8?q?Michal_Hocko?= Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= >5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I >mentioned in a follow up email. 
Here is the full patch: Here is the log where OOM, again, killed MySQL server [search for "(mysql= d)"]: http://www.watchdog.sk/lkml/oom_mysqld6 azur -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Date: Wed, 6 Feb 2013 15:01:19 +0100 Message-ID: <20130206140119.GD10254@dhcp22.suse.cz> References: <20121224142526.020165D3@pobox.sk> <20121228162209.GA1455@dhcp22.suse.cz> <20121230020947.AA002F34@pobox.sk> <20121230110815.GA12940@dhcp22.suse.cz> <20130125160723.FAE73567@pobox.sk> <20130125163130.GF4721@dhcp22.suse.cz> <20130205134937.GA22804@dhcp22.suse.cz> <20130205154947.CD6411E2@pobox.sk> <20130205160934.GB22804@dhcp22.suse.cz> <20130206021721.1AE9E3C7@pobox.sk> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20130206021721.1AE9E3C7-Rm0zKEqwvD4@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: azurIt Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Wed 06-02-13 02:17:21, azurIt wrote: > >5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I > >mentioned in a follow up email. Here is the full patch: > > > Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]: > http://www.watchdog.sk/lkml/oom_mysqld6 [...] WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610() Hardware name: S5000VSA gfp_mask:4304 nr_pages:1 oom:0 ret:2 Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 Call Trace: [] warn_slowpath_common+0x7a/0xb0 [] warn_slowpath_fmt+0x46/0x50 [] ? 
mem_cgroup_margin+0x73/0xa0
[] T.1149+0x2d9/0x610
[] ? blk_finish_plug+0x18/0x50
[] mem_cgroup_cache_charge+0xc4/0xf0
[] add_to_page_cache_locked+0x4f/0x140
[] add_to_page_cache_lru+0x22/0x50
[] filemap_fault+0x252/0x4f0
[] __do_fault+0x78/0x5a0
[] handle_pte_fault+0x84/0x940
[] ? vma_prio_tree_insert+0x30/0x50
[] ? vma_link+0x88/0xe0
[] handle_mm_fault+0x138/0x260
[] do_page_fault+0x13d/0x460
[] ? do_mmap_pgoff+0x3dc/0x430
[] page_fault+0x1f/0x30
---[ end trace 8817670349022007 ]---
apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0
apache2 cpuset=uid mems_allowed=0
Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1
Call Trace:
[] dump_header+0x7e/0x1e0
[] ? find_lock_task_mm+0x2f/0x70
[] oom_kill_process+0x85/0x2a0
[] out_of_memory+0xe5/0x200
[] pagefault_out_of_memory+0xbd/0x110
[] mm_fault_error+0xb6/0x1a0
[] do_page_fault+0x3ee/0x460
[] ? do_mmap_pgoff+0x3dc/0x430
[] page_fault+0x1f/0x30

The first trace comes from the debugging WARN and it clearly points to
a file fault path. __do_fault pre-charges a page in case we need to
do CoW (copy-on-write) for the returned page. This one falls back to
memcg OOM and never returns ENOMEM as I have mentioned earlier.
However, the fs fault handler (filemap_fault here) can fall back to
page_cache_read if the readahead (do_sync_mmap_readahead) fails
to get the page into the page cache. And we can see this happening in
the first trace. page_cache_read then calls add_to_page_cache_lru
and eventually gets to add_to_page_cache_locked which calls
mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should
happen. This ENOMEM gets to the fault handler and kaboom.

So the fix is really much more complex than I thought. Although
add_to_page_cache_locked sounded like a good place it turned out not
to be one.

We need something more clever apparently. One way would be to stop misusing
__GFP_NORETRY for GFP_MEMCG_NO_OOM and give it a real flag.
We have 32 bits for those flags in gfp_t so there should be some room there. Or we could do this per task flag, same we do for NO_IO in the current -mm tree. The later one seems easier wrt. gfp_mask passing horror - e.g. __generic_file_aio_write doesn't pass flags and it can be called from unlocked contexts as well. I have to think about it some more. -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Date: Wed, 6 Feb 2013 15:22:19 +0100 Message-ID: <20130206142219.GF10254@dhcp22.suse.cz> References: <20121228162209.GA1455@dhcp22.suse.cz> <20121230020947.AA002F34@pobox.sk> <20121230110815.GA12940@dhcp22.suse.cz> <20130125160723.FAE73567@pobox.sk> <20130125163130.GF4721@dhcp22.suse.cz> <20130205134937.GA22804@dhcp22.suse.cz> <20130205154947.CD6411E2@pobox.sk> <20130205160934.GB22804@dhcp22.suse.cz> <20130206021721.1AE9E3C7@pobox.sk> <20130206140119.GD10254@dhcp22.suse.cz> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20130206140119.GD10254-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: azurIt Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Wed 06-02-13 15:01:19, Michal Hocko wrote: > On Wed 06-02-13 02:17:21, azurIt wrote: > > >5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I > > >mentioned in a follow up email. Here is the full patch: > > > > > > Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]: > > http://www.watchdog.sk/lkml/oom_mysqld6 > > [...] 
> WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610() > Hardware name: S5000VSA > gfp_mask:4304 nr_pages:1 oom:0 ret:2 > Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 > Call Trace: > [] warn_slowpath_common+0x7a/0xb0 > [] warn_slowpath_fmt+0x46/0x50 > [] ? mem_cgroup_margin+0x73/0xa0 > [] T.1149+0x2d9/0x610 > [] ? blk_finish_plug+0x18/0x50 > [] mem_cgroup_cache_charge+0xc4/0xf0 > [] add_to_page_cache_locked+0x4f/0x140 > [] add_to_page_cache_lru+0x22/0x50 > [] filemap_fault+0x252/0x4f0 > [] __do_fault+0x78/0x5a0 > [] handle_pte_fault+0x84/0x940 > [] ? vma_prio_tree_insert+0x30/0x50 > [] ? vma_link+0x88/0xe0 > [] handle_mm_fault+0x138/0x260 > [] do_page_fault+0x13d/0x460 > [] ? do_mmap_pgoff+0x3dc/0x430 > [] page_fault+0x1f/0x30 > ---[ end trace 8817670349022007 ]--- > apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0 > apache2 cpuset=uid mems_allowed=0 > Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 > Call Trace: > [] dump_header+0x7e/0x1e0 > [] ? find_lock_task_mm+0x2f/0x70 > [] oom_kill_process+0x85/0x2a0 > [] out_of_memory+0xe5/0x200 > [] pagefault_out_of_memory+0xbd/0x110 > [] mm_fault_error+0xb6/0x1a0 > [] do_page_fault+0x3ee/0x460 > [] ? do_mmap_pgoff+0x3dc/0x430 > [] page_fault+0x1f/0x30 > > The first trace comes from the debugging WARN and it clearly points to > a file fault path. __do_fault pre-charges a page in case we need to > do CoW (copy-on-write) for the returned page. This one falls back to > memcg OOM and never returns ENOMEM as I have mentioned earlier. > However, the fs fault handler (filemap_fault here) can fallback to > page_cache_read if the readahead (do_sync_mmap_readahead) fails > to get page to the page cache. And we can see this happening in > the first trace. page_cache_read then calls add_to_page_cache_lru > and eventually gets to add_to_page_cache_locked which calls > mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should > happen. 
This ENOMEM gets to the fault handler and kaboom. > > So the fix is really much more complex than I thought. Although > add_to_page_cache_locked sounded like a good place it turned out to be > not in fact. > > We need something more clever appaerently. One way would be not misusing > __GFP_NORETRY for GFP_MEMCG_NO_OOM and give it a real flag. We have 32 > bits for those flags in gfp_t so there should be some room there. > Or we could do this per task flag, same we do for NO_IO in the current > -mm tree. > The later one seems easier wrt. gfp_mask passing horror - e.g. > __generic_file_aio_write doesn't pass flags and it can be called from > unlocked contexts as well. Ouch, PF_ flags space seem to be drained already because task_struct::flags is just unsigned int so there is just one bit left. I am not sure this is the best use for it. This will be a real pain! -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set Date: Wed, 6 Feb 2013 17:00:51 +0100 Message-ID: <20130206160051.GG10254@dhcp22.suse.cz> References: <20121230020947.AA002F34@pobox.sk> <20121230110815.GA12940@dhcp22.suse.cz> <20130125160723.FAE73567@pobox.sk> <20130125163130.GF4721@dhcp22.suse.cz> <20130205134937.GA22804@dhcp22.suse.cz> <20130205154947.CD6411E2@pobox.sk> <20130205160934.GB22804@dhcp22.suse.cz> <20130206021721.1AE9E3C7@pobox.sk> <20130206140119.GD10254@dhcp22.suse.cz> <20130206142219.GF10254@dhcp22.suse.cz> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20130206142219.GF10254-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: azurIt Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Wed 06-02-13 
15:22:19, Michal Hocko wrote: > On Wed 06-02-13 15:01:19, Michal Hocko wrote: > > On Wed 06-02-13 02:17:21, azurIt wrote: > > > >5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I > > > >mentioned in a follow up email. Here is the full patch: > > > > > > > > > Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]: > > > http://www.watchdog.sk/lkml/oom_mysqld6 > > > > [...] > > WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610() > > Hardware name: S5000VSA > > gfp_mask:4304 nr_pages:1 oom:0 ret:2 > > Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 > > Call Trace: > > [] warn_slowpath_common+0x7a/0xb0 > > [] warn_slowpath_fmt+0x46/0x50 > > [] ? mem_cgroup_margin+0x73/0xa0 > > [] T.1149+0x2d9/0x610 > > [] ? blk_finish_plug+0x18/0x50 > > [] mem_cgroup_cache_charge+0xc4/0xf0 > > [] add_to_page_cache_locked+0x4f/0x140 > > [] add_to_page_cache_lru+0x22/0x50 > > [] filemap_fault+0x252/0x4f0 > > [] __do_fault+0x78/0x5a0 > > [] handle_pte_fault+0x84/0x940 > > [] ? vma_prio_tree_insert+0x30/0x50 > > [] ? vma_link+0x88/0xe0 > > [] handle_mm_fault+0x138/0x260 > > [] do_page_fault+0x13d/0x460 > > [] ? do_mmap_pgoff+0x3dc/0x430 > > [] page_fault+0x1f/0x30 > > ---[ end trace 8817670349022007 ]--- > > apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0 > > apache2 cpuset=uid mems_allowed=0 > > Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 > > Call Trace: > > [] dump_header+0x7e/0x1e0 > > [] ? find_lock_task_mm+0x2f/0x70 > > [] oom_kill_process+0x85/0x2a0 > > [] out_of_memory+0xe5/0x200 > > [] pagefault_out_of_memory+0xbd/0x110 > > [] mm_fault_error+0xb6/0x1a0 > > [] do_page_fault+0x3ee/0x460 > > [] ? do_mmap_pgoff+0x3dc/0x430 > > [] page_fault+0x1f/0x30 > > > > The first trace comes from the debugging WARN and it clearly points to > > a file fault path. __do_fault pre-charges a page in case we need to > > do CoW (copy-on-write) for the returned page. 
This one falls back to
> > memcg OOM and never returns ENOMEM as I have mentioned earlier.
> > However, the fs fault handler (filemap_fault here) can fallback to
> > page_cache_read if the readahead (do_sync_mmap_readahead) fails
> > to get page to the page cache. And we can see this happening in
> > the first trace. page_cache_read then calls add_to_page_cache_lru
> > and eventually gets to add_to_page_cache_locked which calls
> > mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should
> > happen. This ENOMEM gets to the fault handler and kaboom.
> >
> > So the fix is really much more complex than I thought. Although
> > add_to_page_cache_locked sounded like a good place it turned out to be
> > not in fact.
> >
> > We need something more clever appaerently. One way would be not misusing
> > __GFP_NORETRY for GFP_MEMCG_NO_OOM and give it a real flag. We have 32
> > bits for those flags in gfp_t so there should be some room there.
> > Or we could do this per task flag, same we do for NO_IO in the current
> > -mm tree.
> > The later one seems easier wrt. gfp_mask passing horror - e.g.
> > __generic_file_aio_write doesn't pass flags and it can be called from
> > unlocked contexts as well.
>
> Ouch, PF_ flags space seem to be drained already because
> task_struct::flags is just unsigned int so there is just one bit left. I
> am not sure this is the best use for it. This will be a real pain!

OK, so this is something that should help you without any risk of false
OOMs. I do not believe that something like that would be accepted
upstream because it is really heavy. We will need to come up with
something more clever for upstream.

I have also added a warning which will trigger when the charge fails. If
you see too many of those messages then there is something bad going on
and the lack of OOM causes userspace to loop without making any progress.

So there you go - your personal patch ;) You can drop all other patches.
Please note I have just compile tested it.
But it should be pretty trivial to check it is correct
---
>From 6f155187f77c971b45caf05dbc80ca9c20bc278c Mon Sep 17 00:00:00 2001
From: Michal Hocko
Date: Wed, 6 Feb 2013 16:45:07 +0100
Subject: [PATCH 1/2] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set

memcg oom killer might deadlock if the process which falls down to
mem_cgroup_handle_oom holds a lock which prevents other task to
terminate because it is blocked on the very same lock.
This can happen when a write system call needs to allocate a page but
the allocation hits the memcg hard limit and there is nothing to reclaim
(e.g. there is no swap or swap limit is hit as well and all cache pages
have been reclaimed already) and the process selected by memcg OOM
killer is blocked on i_mutex on the same inode (e.g. truncate it).

Process A
[] do_truncate+0x58/0xa0 # takes i_mutex
[] do_last+0x250/0xa30
[] path_openat+0xd7/0x440
[] do_filp_open+0x49/0xa0
[] do_sys_open+0x106/0x240
[] sys_open+0x20/0x30
[] system_call_fastpath+0x18/0x1d
[] 0xffffffffffffffff

Process B
[] mem_cgroup_handle_oom+0x241/0x3b0
[] T.1146+0x5ab/0x5c0
[] mem_cgroup_cache_charge+0xbe/0xe0
[] add_to_page_cache_locked+0x4c/0x140
[] add_to_page_cache_lru+0x22/0x50
[] grab_cache_page_write_begin+0x8b/0xe0
[] ext3_write_begin+0x88/0x270
[] generic_file_buffered_write+0x116/0x290
[] __generic_file_aio_write+0x27c/0x480
[] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex
[] do_sync_write+0xea/0x130
[] vfs_write+0xf3/0x1f0
[] sys_write+0x51/0x90
[] system_call_fastpath+0x18/0x1d
[] 0xffffffffffffffff

This is not a hard deadlock though because administrator can still
intervene and increase the limit on the group which helps the writer to
finish the allocation and release the lock.

This patch heals the problem by forbidding OOM from dangerous context.
Memcg charging code has no way to find out whether it is called from a
locked context so we have to help it via process flags. PF_OOM_ORIGIN flag
removed recently will be reused for PF_NO_MEMCG_OOM which signals that
the memcg OOM killer could lead to a deadlock.
Only locked callers of __generic_file_aio_write are currently marked. I
am pretty sure there are more places (I didn't check shmem and hugetlb
uses fancy instantiation mutex during page fault and filesystems might
use some locks during the write) but I've ignored those as this will
probably be just a user specific patch without any way to get upstream
in the current form.

Reported-by: azurIt
Signed-off-by: Michal Hocko
---
 drivers/staging/pohmelfs/inode.c |    2 ++
 include/linux/sched.h            |    1 +
 mm/filemap.c                     |    2 ++
 mm/memcontrol.c                  |   18 ++++++++++++++----
 4 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/drivers/staging/pohmelfs/inode.c b/drivers/staging/pohmelfs/inode.c
index 7a19555..523de82e 100644
--- a/drivers/staging/pohmelfs/inode.c
+++ b/drivers/staging/pohmelfs/inode.c
@@ -921,7 +921,9 @@ ssize_t pohmelfs_write(struct file *file, const char __user *buf,
 	if (ret)
 		goto err_out_unlock;
 
+	current->flags |= PF_NO_MEMCG_OOM;
 	ret = __generic_file_aio_write(&kiocb, &iov, 1, &kiocb.ki_pos);
+	current->flags &= ~PF_NO_MEMCG_OOM;
 	*ppos = kiocb.ki_pos;
 
 	mutex_unlock(&inode->i_mutex);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1e86bb4..f275c8f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1781,6 +1781,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
 #define PF_FROZEN	0x00010000	/* frozen for system suspend */
 #define PF_FSTRANS	0x00020000	/* inside a filesystem transaction */
 #define PF_KSWAPD	0x00040000	/* I am kswapd */
+#define PF_NO_MEMCG_OOM	0x00080000	/* Memcg OOM could lead to a deadlock */
 #define PF_LESS_THROTTLE 0x00100000	/* Throttle me less: I clean memory */
 #define PF_KTHREAD	0x00200000	/* I am a kernel thread */
 #define PF_RANDOMIZE	0x00400000	/* randomize virtual address space */
diff --git a/mm/filemap.c b/mm/filemap.c
index 556858c..58a316b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2617,7 +2617,9 @@ ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
 
 	mutex_lock(&inode->i_mutex);
 	blk_start_plug(&plug);
+	current->flags |= PF_NO_MEMCG_OOM;
 	ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
+	current->flags &= ~PF_NO_MEMCG_OOM;
 	mutex_unlock(&inode->i_mutex);
 
 	if (ret > 0 || ret == -EIOCBQUEUED) {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c8425b1..128b615 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2397,6 +2397,14 @@ done:
 	return 0;
 nomem:
 	*ptr = NULL;
+	if (printk_ratelimit())
+		printk(KERN_WARNING"%s: task:%s pid:%d got ENOMEM without OOM for memcg:%p."
+				" If this message shows up very often for the"
+				" same task then there is a risk that the"
+				" process is not able to make any progress"
+				" because of the current limit. Try to enlarge"
+				" the hard limit.\n", __FUNCTION__,
+				current->comm, current->pid, memcg);
 	return -ENOMEM;
 bypass:
 	*ptr = NULL;
@@ -2703,7 +2711,7 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
 	struct mem_cgroup *memcg = NULL;
 	unsigned int nr_pages = 1;
 	struct page_cgroup *pc;
-	bool oom = true;
+	bool oom = !(current->flags & PF_NO_MEMCG_OOM);
 	int ret;
 
 	if (PageTransHuge(page)) {
@@ -2770,6 +2778,7 @@ __mem_cgroup_commit_charge_lrucare(struct page *page, struct mem_cgroup *memcg,
 int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 				gfp_t gfp_mask)
 {
+	bool oom = !(current->flags & PF_NO_MEMCG_OOM);
 	struct mem_cgroup *memcg = NULL;
 	int ret;
 
@@ -2782,7 +2791,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 		mm = &init_mm;
 
 	if (page_is_file_cache(page)) {
-		ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, &memcg, true);
+		ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, &memcg, oom);
 		if (ret || !memcg)
 			return ret;
 
@@ -2818,6 +2827,7 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
 				 struct page *page,
 				 gfp_t mask, struct mem_cgroup **ptr)
 {
+	bool oom = !(current->flags & PF_NO_MEMCG_OOM);
 	struct mem_cgroup *memcg;
 	int ret;
 
@@ -2840,13 +2850,13 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
 	if (!memcg)
 		goto charge_cur_mm;
 	*ptr = memcg;
-	ret = __mem_cgroup_try_charge(NULL, mask, 1, ptr, true);
+	ret = __mem_cgroup_try_charge(NULL, mask, 1, ptr, oom);
 	css_put(&memcg->css);
 	return ret;
 charge_cur_mm:
 	if (unlikely(!mm))
 		mm = &init_mm;
-	return __mem_cgroup_try_charge(mm, mask, 1, ptr, true);
+	return __mem_cgroup_try_charge(mm, mask, 1, ptr, oom);
 }
 
 static void
-- 
1.7.10.4

-- 
Michal Hocko
SUSE Labs
From mboxrd@z Thu Jan 1 00:00:00 1970 From: Kamezawa Hiroyuki Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Date: Thu, 07 Feb 2013 20:01:45 +0900 Message-ID: <51138999.3090006@jp.fujitsu.com> References: <20121224142526.020165D3@pobox.sk> <20121228162209.GA1455@dhcp22.suse.cz> <20121230020947.AA002F34@pobox.sk> <20121230110815.GA12940@dhcp22.suse.cz> <20130125160723.FAE73567@pobox.sk> <20130125163130.GF4721@dhcp22.suse.cz> <20130205134937.GA22804@dhcp22.suse.cz> <20130205154947.CD6411E2@pobox.sk> <20130205160934.GB22804@dhcp22.suse.cz> <20130206021721.1AE9E3C7@pobox.sk> <20130206140119.GD10254@dhcp22.suse.cz> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <20130206140119.GD10254@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii"; format="flowed" To: Michal Hocko Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , Johannes Weiner (2013/02/06 23:01), Michal Hocko wrote: > On Wed 06-02-13 02:17:21, azurIt wrote: >>> 5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I >>> mentioned in a follow up email. Here is the full patch: >> >> >> Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]: >> http://www.watchdog.sk/lkml/oom_mysqld6 > > [...]
> WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610() > Hardware name: S5000VSA > gfp_mask:4304 nr_pages:1 oom:0 ret:2 > Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 > Call Trace: > [] warn_slowpath_common+0x7a/0xb0 > [] warn_slowpath_fmt+0x46/0x50 > [] ? mem_cgroup_margin+0x73/0xa0 > [] T.1149+0x2d9/0x610 > [] ? blk_finish_plug+0x18/0x50 > [] mem_cgroup_cache_charge+0xc4/0xf0 > [] add_to_page_cache_locked+0x4f/0x140 > [] add_to_page_cache_lru+0x22/0x50 > [] filemap_fault+0x252/0x4f0 > [] __do_fault+0x78/0x5a0 > [] handle_pte_fault+0x84/0x940 > [] ? vma_prio_tree_insert+0x30/0x50 > [] ? vma_link+0x88/0xe0 > [] handle_mm_fault+0x138/0x260 > [] do_page_fault+0x13d/0x460 > [] ? do_mmap_pgoff+0x3dc/0x430 > [] page_fault+0x1f/0x30 > ---[ end trace 8817670349022007 ]--- > apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0 > apache2 cpuset=uid mems_allowed=0 > Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 > Call Trace: > [] dump_header+0x7e/0x1e0 > [] ? find_lock_task_mm+0x2f/0x70 > [] oom_kill_process+0x85/0x2a0 > [] out_of_memory+0xe5/0x200 > [] pagefault_out_of_memory+0xbd/0x110 > [] mm_fault_error+0xb6/0x1a0 > [] do_page_fault+0x3ee/0x460 > [] ? do_mmap_pgoff+0x3dc/0x430 > [] page_fault+0x1f/0x30 > > The first trace comes from the debugging WARN and it clearly points to > a file fault path. __do_fault pre-charges a page in case we need to > do CoW (copy-on-write) for the returned page. This one falls back to > memcg OOM and never returns ENOMEM as I have mentioned earlier. > However, the fs fault handler (filemap_fault here) can fallback to > page_cache_read if the readahead (do_sync_mmap_readahead) fails > to get page to the page cache. And we can see this happening in > the first trace. page_cache_read then calls add_to_page_cache_lru > and eventually gets to add_to_page_cache_locked which calls > mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should > happen. 
This ENOMEM gets to the fault handler and kaboom. > Hmm. do we need to increase the "limit" virtually at memcg oom until the oom-killed process dies ? It may be doable by increasing stock->cache of each cpu....I think kernel can offer extra virtual charge up to oom-killed process's memory usage..... Thanks, -Kame From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Date: Thu, 7 Feb 2013 13:31:40 +0100 Message-ID: <20130207123140.GA15820@dhcp22.suse.cz> References: <20121230020947.AA002F34@pobox.sk> <20121230110815.GA12940@dhcp22.suse.cz> <20130125160723.FAE73567@pobox.sk> <20130125163130.GF4721@dhcp22.suse.cz> <20130205134937.GA22804@dhcp22.suse.cz> <20130205154947.CD6411E2@pobox.sk> <20130205160934.GB22804@dhcp22.suse.cz> <20130206021721.1AE9E3C7@pobox.sk> <20130206140119.GD10254@dhcp22.suse.cz> <51138999.3090006@jp.fujitsu.com> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <51138999.3090006-+CUm20s59erQFUHtdCDX3A@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Kamezawa Hiroyuki Cc: azurIt , linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist , Johannes Weiner On Thu 07-02-13 20:01:45, KAMEZAWA Hiroyuki wrote: > (2013/02/06 23:01), Michal Hocko wrote: > >On Wed 06-02-13 02:17:21, azurIt wrote: > >>>5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I > >>>mentioned in a follow up email. Here is the full patch: > >> > >> > >>Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]: > >>http://www.watchdog.sk/lkml/oom_mysqld6 > > > >[...]
> >WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610() > >Hardware name: S5000VSA > >gfp_mask:4304 nr_pages:1 oom:0 ret:2 > >Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 > >Call Trace: > > [] warn_slowpath_common+0x7a/0xb0 > > [] warn_slowpath_fmt+0x46/0x50 > > [] ? mem_cgroup_margin+0x73/0xa0 > > [] T.1149+0x2d9/0x610 > > [] ? blk_finish_plug+0x18/0x50 > > [] mem_cgroup_cache_charge+0xc4/0xf0 > > [] add_to_page_cache_locked+0x4f/0x140 > > [] add_to_page_cache_lru+0x22/0x50 > > [] filemap_fault+0x252/0x4f0 > > [] __do_fault+0x78/0x5a0 > > [] handle_pte_fault+0x84/0x940 > > [] ? vma_prio_tree_insert+0x30/0x50 > > [] ? vma_link+0x88/0xe0 > > [] handle_mm_fault+0x138/0x260 > > [] do_page_fault+0x13d/0x460 > > [] ? do_mmap_pgoff+0x3dc/0x430 > > [] page_fault+0x1f/0x30 > >---[ end trace 8817670349022007 ]--- > >apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0 > >apache2 cpuset=uid mems_allowed=0 > >Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 > >Call Trace: > > [] dump_header+0x7e/0x1e0 > > [] ? find_lock_task_mm+0x2f/0x70 > > [] oom_kill_process+0x85/0x2a0 > > [] out_of_memory+0xe5/0x200 > > [] pagefault_out_of_memory+0xbd/0x110 > > [] mm_fault_error+0xb6/0x1a0 > > [] do_page_fault+0x3ee/0x460 > > [] ? do_mmap_pgoff+0x3dc/0x430 > > [] page_fault+0x1f/0x30 > > > >The first trace comes from the debugging WARN and it clearly points to > >a file fault path. __do_fault pre-charges a page in case we need to > >do CoW (copy-on-write) for the returned page. This one falls back to > >memcg OOM and never returns ENOMEM as I have mentioned earlier. > >However, the fs fault handler (filemap_fault here) can fallback to > >page_cache_read if the readahead (do_sync_mmap_readahead) fails > >to get page to the page cache. And we can see this happening in > >the first trace. 
page_cache_read then calls add_to_page_cache_lru > >and eventually gets to add_to_page_cache_locked which calls > >mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should > >happen. This ENOMEM gets to the fault handler and kaboom. > > > > Hmm. do we need to increase the "limit" virtually at memcg oom until > the oom-killed process dies ? It may be doable by increasing stock->cache > of each cpu....I think kernel can offer extra virtual charge up to > oom-killed process's memory usage..... If we can guarantee that the overflow charges do not exceed the memory usage of the killed process then this would work. The question is, how do we find out how much we can overflow. immigrate_on_move will play some role as well as the amount of the shared memory. I am afraid this would get too complex. Nevertheless the idea is nice. -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: Kamezawa Hiroyuki Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Date: Fri, 08 Feb 2013 10:40:13 +0900 Message-ID: <5114577D.70608@jp.fujitsu.com> References: <20121224142526.020165D3@pobox.sk> <20121228162209.GA1455@dhcp22.suse.cz> <20121230020947.AA002F34@pobox.sk> <20121230110815.GA12940@dhcp22.suse.cz> <20130125160723.FAE73567@pobox.sk> <20130125163130.GF4721@dhcp22.suse.cz> <20130205134937.GA22804@dhcp22.suse.cz> <20130205154947.CD6411E2@pobox.sk> <20130205160934.GB22804@dhcp22.suse.cz> <20130206021721.1AE9E3C7@pobox.sk> <20130206140119.GD10254@dhcp22.suse.cz> <51138999.3090006@jp.fujitsu.com> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <51138999.3090006-+CUm20s59erQFUHtdCDX3A@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii"; format="flowed" To: Michal Hocko Cc: azurIt , linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist , Johannes Weiner 
(2013/02/07 20:01), Kamezawa Hiroyuki wrote: > (2013/02/06 23:01), Michal Hocko wrote: >> On Wed 06-02-13 02:17:21, azurIt wrote: >>>> 5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I >>>> mentioned in a follow up email. Here is the full patch: >>> >>> >>> Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]: >>> http://www.watchdog.sk/lkml/oom_mysqld6 >> >> [...] >> WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610() >> Hardware name: S5000VSA >> gfp_mask:4304 nr_pages:1 oom:0 ret:2 >> Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 >> Call Trace: >> [] warn_slowpath_common+0x7a/0xb0 >> [] warn_slowpath_fmt+0x46/0x50 >> [] ? mem_cgroup_margin+0x73/0xa0 >> [] T.1149+0x2d9/0x610 >> [] ? blk_finish_plug+0x18/0x50 >> [] mem_cgroup_cache_charge+0xc4/0xf0 >> [] add_to_page_cache_locked+0x4f/0x140 >> [] add_to_page_cache_lru+0x22/0x50 >> [] filemap_fault+0x252/0x4f0 >> [] __do_fault+0x78/0x5a0 >> [] handle_pte_fault+0x84/0x940 >> [] ? vma_prio_tree_insert+0x30/0x50 >> [] ? vma_link+0x88/0xe0 >> [] handle_mm_fault+0x138/0x260 >> [] do_page_fault+0x13d/0x460 >> [] ? do_mmap_pgoff+0x3dc/0x430 >> [] page_fault+0x1f/0x30 >> ---[ end trace 8817670349022007 ]--- >> apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0 >> apache2 cpuset=uid mems_allowed=0 >> Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 >> Call Trace: >> [] dump_header+0x7e/0x1e0 >> [] ? find_lock_task_mm+0x2f/0x70 >> [] oom_kill_process+0x85/0x2a0 >> [] out_of_memory+0xe5/0x200 >> [] pagefault_out_of_memory+0xbd/0x110 >> [] mm_fault_error+0xb6/0x1a0 >> [] do_page_fault+0x3ee/0x460 >> [] ? do_mmap_pgoff+0x3dc/0x430 >> [] page_fault+0x1f/0x30 >> >> The first trace comes from the debugging WARN and it clearly points to >> a file fault path. __do_fault pre-charges a page in case we need to >> do CoW (copy-on-write) for the returned page. This one falls back to >> memcg OOM and never returns ENOMEM as I have mentioned earlier. 
>> However, the fs fault handler (filemap_fault here) can fallback to >> page_cache_read if the readahead (do_sync_mmap_readahead) fails >> to get page to the page cache. And we can see this happening in >> the first trace. page_cache_read then calls add_to_page_cache_lru >> and eventually gets to add_to_page_cache_locked which calls >> mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should >> happen. This ENOMEM gets to the fault handler and kaboom. >> > > Hmm. do we need to increase the "limit" virtually at memcg oom until > the oom-killed process dies ? Here is my naive idea... == From 1a46318cf89e7df94bd4844f29105b61dacf335b Mon Sep 17 00:00:00 2001 From: KAMEZAWA Hiroyuki Date: Fri, 8 Feb 2013 10:43:52 +0900 Subject: [PATCH] [Don't Apply][PATCH] memcg relax resource at OOM situation. When an OOM happens, a task is killed and its resources will be freed. The problem is that the oom-killed task may itself be waiting for another resource whose holder needs memory: a thread waiting for free memory may hold a mutex while the oom-killed process waits for that same mutex. To avoid this, relaxing the charged memory by lending the group a virtual resource can help. The system can take the loan back at uncharge(). This is a sample naive implementation.
Signed-off-by: KAMEZAWA Hiroyuki --- mm/memcontrol.c | 79 ++++++++++++++++++++++++++++++++++++++++++++++++++----- 1 file changed, 73 insertions(+), 6 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 25ac5f4..4dea49a 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -301,6 +301,9 @@ struct mem_cgroup { /* set when res.limit == memsw.limit */ bool memsw_is_minimum; + /* extra resource at emergency situation */ + unsigned long loan; + spinlock_t loan_lock; /* protect arrays of thresholds */ struct mutex thresholds_lock; @@ -2034,6 +2037,61 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg, mem_cgroup_iter_break(root_memcg, victim); return total; } +/* + * When a memcg is in OOM situation, this lack of resource may cause deadlock + * because of complicated lock dependency(i_mutex...). To avoid that, we + * need extra resource or avoid charging. + * + * A memcg can request resource in an emergency state. We call it as loan. + * A memcg will return a loan when it does uncharge resource. We disallow + * double-loan and moving task to other groups until the loan is fully + * returned. + * + * Note: the problem here is that we cannot know what amount resouce should + * be necessary to exiting an emergency state..... 
+ */ +#define LOAN_MAX (2 * 1024 * 1024) + +static void mem_cgroup_make_loan(struct mem_cgroup *memcg) +{ + u64 usage; + unsigned long amount; + + amount = LOAN_MAX; + + usage = res_counter_read_u64(&memcg->res, RES_USAGE); + if (amount > usage /2 ) + amount = usage / 2; + spin_lock(&memcg->loan_lock); + if (memcg->loan) { + spin_unlock(&memcg->loan_lock); + return; + } + memcg->loan = amount; + res_counter_uncharge(&memcg->res, amount); + if (do_swap_account) + res_counter_uncharge(&memcg->memsw, amount); + spin_unlock(&memcg->loan_lock); +} + +/* return amount of free resource which can be uncharged */ +static unsigned long +mem_cgroup_may_return_loan(struct mem_cgroup *memcg, unsigned long val) +{ + unsigned long tmp; + /* we don't care small race here */ + if (unlikely(!memcg->loan)) + return val; + spin_lock(&memcg->loan_lock); + if (memcg->loan) { + tmp = min(memcg->loan, val); + memcg->loan -= tmp; + val -= tmp; + } + spin_unlock(&memcg->loan_lock); + return val; +} + /* * Check OOM-Killer is already running under our hierarchy. 
@@ -2182,6 +2240,7 @@ static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask, if (need_to_kill) { finish_wait(&memcg_oom_waitq, &owait.wait); mem_cgroup_out_of_memory(memcg, mask, order); + mem_cgroup_make_loan(memcg); } else { schedule(); finish_wait(&memcg_oom_waitq, &owait.wait); @@ -2748,6 +2807,8 @@ static void __mem_cgroup_cancel_charge(struct mem_cgroup *memcg, if (!mem_cgroup_is_root(memcg)) { unsigned long bytes = nr_pages * PAGE_SIZE; + bytes = mem_cgroup_may_return_loan(memcg, bytes); + res_counter_uncharge(&memcg->res, bytes); if (do_swap_account) res_counter_uncharge(&memcg->memsw, bytes); @@ -3989,6 +4050,7 @@ static void mem_cgroup_do_uncharge(struct mem_cgroup *memcg, { struct memcg_batch_info *batch = NULL; bool uncharge_memsw = true; + unsigned long val; /* If swapout, usage of swap doesn't decrease */ if (!do_swap_account || ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT) @@ -4029,9 +4091,11 @@ static void mem_cgroup_do_uncharge(struct mem_cgroup *memcg, batch->memsw_nr_pages++; return; direct_uncharge: - res_counter_uncharge(&memcg->res, nr_pages * PAGE_SIZE); + val = nr_pages * PAGE_SIZE; + val = mem_cgroup_may_return_loan(memcg, val); + res_counter_uncharge(&memcg->res, val); if (uncharge_memsw) - res_counter_uncharge(&memcg->memsw, nr_pages * PAGE_SIZE); + res_counter_uncharge(&memcg->memsw, val); if (unlikely(batch->memcg != memcg)) memcg_oom_recover(memcg); } @@ -4182,6 +4246,7 @@ void mem_cgroup_uncharge_start(void) void mem_cgroup_uncharge_end(void) { struct memcg_batch_info *batch = &current->memcg_batch; + unsigned long val; if (!batch->do_batch) return; @@ -4192,16 +4257,16 @@ void mem_cgroup_uncharge_end(void) if (!batch->memcg) return; + val = batch->nr_pages * PAGE_SIZE; + val = mem_cgroup_may_return_loan(batch->memcg, val); /* * This "batch->memcg" is valid without any css_get/put etc... * bacause we hide charges behind us.
*/ if (batch->nr_pages) - res_counter_uncharge(&batch->memcg->res, - batch->nr_pages * PAGE_SIZE); + res_counter_uncharge(&batch->memcg->res, val); if (batch->memsw_nr_pages) - res_counter_uncharge(&batch->memcg->memsw, - batch->memsw_nr_pages * PAGE_SIZE); + res_counter_uncharge(&batch->memcg->memsw, val); memcg_oom_recover(batch->memcg); /* forget this pointer (for sanity check) */ batch->memcg = NULL; @@ -6291,6 +6356,8 @@ mem_cgroup_css_alloc(struct cgroup *cont) memcg->move_charge_at_immigrate = 0; mutex_init(&memcg->thresholds_lock); spin_lock_init(&memcg->move_lock); + memcg->loan = 0; + spin_lock_init(&memcg->loan_lock); return &memcg->css; -- 1.7.10.2 From mboxrd@z Thu Jan 1 00:00:00 1970 From: Kamezawa Hiroyuki Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Date: Fri, 08 Feb 2013 13:16:27 +0900 Message-ID: <51147C1B.1000402@jp.fujitsu.com> References: <20121230020947.AA002F34@pobox.sk> <20121230110815.GA12940@dhcp22.suse.cz> <20130125160723.FAE73567@pobox.sk> <20130125163130.GF4721@dhcp22.suse.cz> <20130205134937.GA22804@dhcp22.suse.cz> <20130205154947.CD6411E2@pobox.sk> <20130205160934.GB22804@dhcp22.suse.cz> <20130206021721.1AE9E3C7@pobox.sk> <20130206140119.GD10254@dhcp22.suse.cz> <51138999.3090006@jp.fujitsu.com> <20130207123140.GA15820@dhcp22.suse.cz> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <20130207123140.GA15820@dhcp22.suse.cz> Sender: linux-kernel-owner@vger.kernel.org List-ID: Content-Type: text/plain; charset="us-ascii"; format="flowed" To: Michal Hocko Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , Johannes Weiner (2013/02/07 21:31), Michal Hocko wrote: > On Thu 07-02-13 20:01:45, KAMEZAWA Hiroyuki wrote: >> (2013/02/06 23:01), Michal Hocko wrote: >>> On Wed 06-02-13 02:17:21, azurIt wrote: >>>>> 5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I >>>>> mentioned in a follow up email. 
Here is the full patch: >>>> >>>> >>>> Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]: >>>> http://www.watchdog.sk/lkml/oom_mysqld6 >>> >>> [...] >>> WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610() >>> Hardware name: S5000VSA >>> gfp_mask:4304 nr_pages:1 oom:0 ret:2 >>> Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 >>> Call Trace: >>> [] warn_slowpath_common+0x7a/0xb0 >>> [] warn_slowpath_fmt+0x46/0x50 >>> [] ? mem_cgroup_margin+0x73/0xa0 >>> [] T.1149+0x2d9/0x610 >>> [] ? blk_finish_plug+0x18/0x50 >>> [] mem_cgroup_cache_charge+0xc4/0xf0 >>> [] add_to_page_cache_locked+0x4f/0x140 >>> [] add_to_page_cache_lru+0x22/0x50 >>> [] filemap_fault+0x252/0x4f0 >>> [] __do_fault+0x78/0x5a0 >>> [] handle_pte_fault+0x84/0x940 >>> [] ? vma_prio_tree_insert+0x30/0x50 >>> [] ? vma_link+0x88/0xe0 >>> [] handle_mm_fault+0x138/0x260 >>> [] do_page_fault+0x13d/0x460 >>> [] ? do_mmap_pgoff+0x3dc/0x430 >>> [] page_fault+0x1f/0x30 >>> ---[ end trace 8817670349022007 ]--- >>> apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0 >>> apache2 cpuset=uid mems_allowed=0 >>> Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 >>> Call Trace: >>> [] dump_header+0x7e/0x1e0 >>> [] ? find_lock_task_mm+0x2f/0x70 >>> [] oom_kill_process+0x85/0x2a0 >>> [] out_of_memory+0xe5/0x200 >>> [] pagefault_out_of_memory+0xbd/0x110 >>> [] mm_fault_error+0xb6/0x1a0 >>> [] do_page_fault+0x3ee/0x460 >>> [] ? do_mmap_pgoff+0x3dc/0x430 >>> [] page_fault+0x1f/0x30 >>> >>> The first trace comes from the debugging WARN and it clearly points to >>> a file fault path. __do_fault pre-charges a page in case we need to >>> do CoW (copy-on-write) for the returned page. This one falls back to >>> memcg OOM and never returns ENOMEM as I have mentioned earlier. >>> However, the fs fault handler (filemap_fault here) can fallback to >>> page_cache_read if the readahead (do_sync_mmap_readahead) fails >>> to get page to the page cache. 
And we can see this happening in >>> the first trace. page_cache_read then calls add_to_page_cache_lru >>> and eventually gets to add_to_page_cache_locked which calls >>> mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should >>> happen. This ENOMEM gets to the fault handler and kaboom. >>> >> >> Hmm. do we need to increase the "limit" virtually at memcg oom until >> the oom-killed process dies ? It may be doable by increasing stock->cache >> of each cpu....I think kernel can offer extra virtual charge up to >> oom-killed process's memory usage..... > > If we can guarantee that the overflow charges do not exceed the memory > usage of the killed process then this would work. The question is, how > do we find out how much we can overflow. immigrate_on_move will play > some role as well as the amount of the shared memory. I am afraid this > would get too complex. Nevertheless the idea is nice. > Yes, that's the problem. If we don't do it in the correct way, resource usage underflow can happen. I guess we can count it per task_struct when charging page-faulted anon pages. _Or_, as another option, we could charge 1MB per thread regardless of its memory usage and use it as security at OOM-killing. Implementation will be easy but explanation may be difficult..
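The underflow concern can be made concrete with a small userspace model of the loan bookkeeping. This is only a sketch under assumptions: struct group, make_loan(), charge() and uncharge() are hypothetical names and not kernel API, and the 2MB cap with the half-of-usage clamp is borrowed from the [Don't Apply] patch earlier in the thread. What it tries to show is that every uncharge has to repay the outstanding loan before it lowers the usage; otherwise the counter would drop below zero once the victim's pages are freed.

```c
#include <assert.h>

/* Hypothetical userspace model; none of these names are kernel API. */
struct group {
    unsigned long usage;  /* bytes currently accounted */
    unsigned long limit;  /* hard limit in bytes */
    unsigned long loan;   /* bytes lent at OOM and not yet repaid */
};

#define LOAN_MAX (2UL * 1024 * 1024)

/* Lend up to LOAN_MAX, clamped to half the usage; refuse a double loan. */
unsigned long make_loan(struct group *g)
{
    unsigned long amount = LOAN_MAX;

    if (g->loan)
        return 0;
    if (amount > g->usage / 2)
        amount = g->usage / 2;
    g->loan = amount;
    g->usage -= amount;  /* usage now under-reports by the loan */
    return amount;
}

/* A charge succeeds only while the limit is respected. */
int charge(struct group *g, unsigned long val)
{
    if (g->usage + val > g->limit)
        return -1;
    g->usage += val;
    return 0;
}

/* Repay the loan first; only the remainder lowers the usage. */
void uncharge(struct group *g, unsigned long val)
{
    unsigned long repay = val < g->loan ? val : g->loan;

    g->loan -= repay;
    val -= repay;
    assert(val <= g->usage);  /* fires on an accounting underflow */
    g->usage -= val;
}
```

Starting from a group that sits exactly at its limit, a charge fails, make_loan() opens some headroom, the blocked charge goes through, and the victim's later uncharge first cancels the loan so the usage ends at the one page that is still charged rather than wrapping around.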
Thanks, -Kame From mboxrd@z Thu Jan 1 00:00:00 1970 From: Greg Thelen Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Date: Thu, 07 Feb 2013 20:27:00 -0800 Message-ID: References: <20121230020947.AA002F34@pobox.sk> <20121230110815.GA12940@dhcp22.suse.cz> <20130125160723.FAE73567@pobox.sk> <20130125163130.GF4721@dhcp22.suse.cz> <20130205134937.GA22804@dhcp22.suse.cz> <20130205154947.CD6411E2@pobox.sk> <20130205160934.GB22804@dhcp22.suse.cz> <20130205174651.GA3959@dhcp22.suse.cz> <20130205185953.GB3959@dhcp22.suse.cz> Mime-Version: 1.0 Return-path: In-Reply-To: <20130205185953.GB3959@dhcp22.suse.cz> (Michal Hocko's message of "Tue, 5 Feb 2013 19:59:53 +0100") Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Michal Hocko Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Tue, Feb 05 2013, Michal Hocko wrote: > On Tue 05-02-13 10:09:57, Greg Thelen wrote: >> On Tue, Feb 05 2013, Michal Hocko wrote: >> >> > On Tue 05-02-13 08:48:23, Greg Thelen wrote: >> >> On Tue, Feb 05 2013, Michal Hocko wrote: >> >> >> >> > On Tue 05-02-13 15:49:47, azurIt wrote: >> >> > [...] >> >> >> Just to be sure - am i supposed to apply this two patches?
>> >> >> http://watchdog.sk/lkml/patches/ >> >> > >> >> > 5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I >> >> > mentioned in a follow up email. Here is the full patch: >> >> > --- >> >> > From f2bf8437d5b9bb38a95a432bf39f32c584955171 Mon Sep 17 00:00:00 2001 >> >> > From: Michal Hocko >> >> > Date: Mon, 26 Nov 2012 11:47:57 +0100 >> >> > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked >> >> > >> >> > memcg oom killer might deadlock if the process which falls down to >> >> > mem_cgroup_handle_oom holds a lock which prevents other task to >> >> > terminate because it is blocked on the very same lock. >> >> > This can happen when a write system call needs to allocate a page but >> >> > the allocation hits the memcg hard limit and there is nothing to reclaim >> >> > (e.g. there is no swap or swap limit is hit as well and all cache pages >> >> > have been reclaimed already) and the process selected by memcg OOM >> >> > killer is blocked on i_mutex on the same inode (e.g. truncate it). 
>> >> > >> >> > Process A >> >> > [] do_truncate+0x58/0xa0 # takes i_mutex >> >> > [] do_last+0x250/0xa30 >> >> > [] path_openat+0xd7/0x440 >> >> > [] do_filp_open+0x49/0xa0 >> >> > [] do_sys_open+0x106/0x240 >> >> > [] sys_open+0x20/0x30 >> >> > [] system_call_fastpath+0x18/0x1d >> >> > [] 0xffffffffffffffff >> >> > >> >> > Process B >> >> > [] mem_cgroup_handle_oom+0x241/0x3b0 >> >> > [] T.1146+0x5ab/0x5c0 >> >> > [] mem_cgroup_cache_charge+0xbe/0xe0 >> >> > [] add_to_page_cache_locked+0x4c/0x140 >> >> > [] add_to_page_cache_lru+0x22/0x50 >> >> > [] grab_cache_page_write_begin+0x8b/0xe0 >> >> > [] ext3_write_begin+0x88/0x270 >> >> > [] generic_file_buffered_write+0x116/0x290 >> >> > [] __generic_file_aio_write+0x27c/0x480 >> >> > [] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex >> >> > [] do_sync_write+0xea/0x130 >> >> > [] vfs_write+0xf3/0x1f0 >> >> > [] sys_write+0x51/0x90 >> >> > [] system_call_fastpath+0x18/0x1d >> >> > [] 0xffffffffffffffff >> >> >> >> It looks like grab_cache_page_write_begin() passes __GFP_FS into >> >> __page_cache_alloc() and mem_cgroup_cache_charge(). Which makes me >> >> think that this deadlock is also possible in the page allocator even >> >> before getting to add_to_page_cache_lru. no? >> > >> > I am not that familiar with VFS but i_mutex is a high level lock AFAIR >> > and it shouldn't be called from the pageout path so __page_cache_alloc >> > should be safe. >> >> I wasn't clear, sorry. My concern is not that pageout() grabs i_mutex. >> My concern is that __page_cache_alloc() will invoke the oom killer and >> select a victim which wants i_mutex. This victim will deadlock because >> the oom killer caller already holds i_mutex. > > That would be true for the memcg oom because that one is blocking but > the global oom just puts the allocator into sleep for a while and then > the allocator should back off eventually (unless this is NOFAIL > allocation). 
I would need to look closer whether this is really the case > - I haven't seen that allocator code path for a while... I think the page allocator can loop forever waiting for an oom victim to terminate even without NOFAIL, especially if the oom victim wants a resource exclusively held by the allocating thread (e.g. i_mutex). It looks like the same deadlock you describe is also possible (though more rare) without memcg. If the looping thread is an eligible oom victim (i.e. not oom disabled, not a kernel thread, etc.) then the page allocator can return NULL so long as NOFAIL is not used. So any allocator which is able to call the oom killer and is not oom disabled (kernel thread, etc.) is already exposed to the possibility of page allocator failure. So if the page allocator could detect the deadlock, then it could safely return NULL. Maybe after looping N times without forward progress the page allocator should consider failing unless NOFAIL is given. Switching back to the memcg oom situation, can we similarly return NULL if memcg oom kill has been tried a reasonable number of times? Simply failing the memcg charge with ENOMEM seems easier to support than exceeding the limit (Kame's loan patch).
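The bounded-retry idea above can be sketched as a userspace model. Everything here is an assumption made for illustration: struct oom_sim, try_charge(), charge_bounded() and the value of OOM_RETRY_MAX are hypothetical and do not exist in the kernel; the thread only speaks of "a reasonable number of times". The model treats one loop iteration as one OOM kill attempt and gives up with -ENOMEM when the victim never releases memory, unless the caller asked for NOFAIL semantics.

```c
#include <errno.h>
#include <stdbool.h>

#define OOM_RETRY_MAX 5  /* assumed bound, not taken from the thread */

/* Hypothetical model of a group whose charge fits after some kills. */
struct oom_sim {
    int kills_needed;  /* kills after which the charge fits; <0: never */
    int kills_done;    /* OOM kill attempts made so far */
};

/* True once enough victims have exited for the charge to fit. */
bool try_charge(const struct oom_sim *s)
{
    return s->kills_needed >= 0 && s->kills_done >= s->kills_needed;
}

/*
 * Retry the charge, attempting one OOM kill per failed round. Without
 * nofail the loop gives up after OOM_RETRY_MAX kills and reports
 * -ENOMEM. With nofail and a stuck victim it loops forever, which is
 * the deadlock being discussed.
 */
int charge_bounded(struct oom_sim *s, bool nofail)
{
    for (;;) {
        if (try_charge(s))
            return 0;
        if (!nofail && s->kills_done >= OOM_RETRY_MAX)
            return -ENOMEM;
        s->kills_done++;  /* stands in for one OOM kill attempt */
    }
}
```

A caller that holds a lock the victim needs would then see -ENOMEM, drop the lock on its error path, and let the victim exit, instead of waiting inside the charge path.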
From mboxrd@z Thu Jan 1 00:00:00 1970 From: "azurIt" Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set Date: Fri, 08 Feb 2013 06:03:04 +0100 Message-ID: <20130208060304.799F362F@pobox.sk> References: <20121230020947.AA002F34@pobox.sk>, <20121230110815.GA12940@dhcp22.suse.cz>, <20130125160723.FAE73567@pobox.sk>, <20130125163130.GF4721@dhcp22.suse.cz>, <20130205134937.GA22804@dhcp22.suse.cz>, <20130205154947.CD6411E2@pobox.sk>, <20130205160934.GB22804@dhcp22.suse.cz>, <20130206021721.1AE9E3C7@pobox.sk>, <20130206140119.GD10254@dhcp22.suse.cz>, <20130206142219.GF10254@dhcp22.suse.cz> <20130206160051.GG10254@dhcp22.suse.cz> Mime-Version: 1.0 Content-Transfer-Encoding: quoted-printable Return-path: In-Reply-To: <20130206160051.GG10254@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="iso-8859-1" To: Michal Hocko Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner Michal, thank you very much but it just didn't work and broke everything :( This happened: The problem started to occur really often immediately after booting the new kernel, every few minutes for one of my users. But everything else seemed to work fine so I gave it a try for a day (which was a mistake). I grabbed some data for you and went to sleep: http://watchdog.sk/lkml/memcg-bug-4.tar.gz A few hours later I was woken from my sweet sweet dreams by alert SMSes - Apache wasn't working and our system failed to restart it. When I looked at the situation, two apache processes (of the same user as above) were still running and it wasn't possible to kill them in any way.
I grabbed some data for you: http://watchdog.sk/lkml/memcg-bug-5.tar.gz Then I logged in to the console and this was waiting for me: http://watchdog.sk/lkml/error.jpg Finally I rebooted into a different kernel, wrote this e-mail and went to my lovely bed ;) ______________________________________________________________ > From: "Michal Hocko" > To: azurIt > Date: 06.02.2013 17:00 > Subject: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set > > CC: linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , "Johannes Weiner" >On Wed 06-02-13 15:22:19, Michal Hocko wrote: >> On Wed 06-02-13 15:01:19, Michal Hocko wrote: >> > On Wed 06-02-13 02:17:21, azurIt wrote: >> > > >5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I >> > > >mentioned in a follow up email. Here is the full patch: >> > > >> > > >> > > Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]: >> > > http://www.watchdog.sk/lkml/oom_mysqld6 >> > >> > [...] >> > WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610() >> > Hardware name: S5000VSA >> > gfp_mask:4304 nr_pages:1 oom:0 ret:2 >> > Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 >> > Call Trace: >> > [] warn_slowpath_common+0x7a/0xb0 >> > [] warn_slowpath_fmt+0x46/0x50 >> > [] ? mem_cgroup_margin+0x73/0xa0 >> > [] T.1149+0x2d9/0x610 >> > [] ? blk_finish_plug+0x18/0x50 >> > [] mem_cgroup_cache_charge+0xc4/0xf0 >> > [] add_to_page_cache_locked+0x4f/0x140 >> > [] add_to_page_cache_lru+0x22/0x50 >> > [] filemap_fault+0x252/0x4f0 >> > [] __do_fault+0x78/0x5a0 >> > [] handle_pte_fault+0x84/0x940 >> > [] ? vma_prio_tree_insert+0x30/0x50 >> > [] ? vma_link+0x88/0xe0 >> > [] handle_mm_fault+0x138/0x260 >> > [] do_page_fault+0x13d/0x460 >> > [] ?
do_mmap_pgoff+0x3dc/0x430 >> > [] page_fault+0x1f/0x30 >> > ---[ end trace 8817670349022007 ]--- >> > apache2 invoked oom-killer: gfp_mask=3D0x0, order=3D0, oom_adj=3D0, = oom_score_adj=3D0 >> > apache2 cpuset=3Duid mems_allowed=3D0 >> > Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 >> > Call Trace: >> > [] dump_header+0x7e/0x1e0 >> > [] ? find_lock_task_mm+0x2f/0x70 >> > [] oom_kill_process+0x85/0x2a0 >> > [] out_of_memory+0xe5/0x200 >> > [] pagefault_out_of_memory+0xbd/0x110 >> > [] mm_fault_error+0xb6/0x1a0 >> > [] do_page_fault+0x3ee/0x460 >> > [] ? do_mmap_pgoff+0x3dc/0x430 >> > [] page_fault+0x1f/0x30 >> >=20 >> > The first trace comes from the debugging WARN and it clearly points = to >> > a file fault path. __do_fault pre-charges a page in case we need to >> > do CoW (copy-on-write) for the returned page. This one falls back to >> > memcg OOM and never returns ENOMEM as I have mentioned earlier.=20 >> > However, the fs fault handler (filemap_fault here) can fallback to >> > page_cache_read if the readahead (do_sync_mmap_readahead) fails >> > to get page to the page cache. And we can see this happening in >> > the first trace. page_cache_read then calls add_to_page_cache_lru >> > and eventually gets to add_to_page_cache_locked which calls >> > mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should >> > happen. This ENOMEM gets to the fault handler and kaboom. >> >=20 >> > So the fix is really much more complex than I thought. Although >> > add_to_page_cache_locked sounded like a good place it turned out to = be >> > not in fact. >> >=20 >> > We need something more clever appaerently. One way would be not misu= sing >> > __GFP_NORETRY for GFP_MEMCG_NO_OOM and give it a real flag. We have = 32 >> > bits for those flags in gfp_t so there should be some room there.=20 >> > Or we could do this per task flag, same we do for NO_IO in the curre= nt >> > -mm tree. >> > The later one seems easier wrt. gfp_mask passing horror - e.g. 
>> > __generic_file_aio_write doesn't pass flags and it can be called from
>> > unlocked contexts as well.
>>
>> Ouch, PF_ flags space seem to be drained already because
>> task_struct::flags is just unsigned int so there is just one bit left. I
>> am not sure this is the best use for it. This will be a real pain!
>
>OK, so this something that should help you without any risk of false
>OOMs. I do not believe that something like that would be accepted
>upstream because it is really heavy. We will need to come up with
>something more clever for upstream.
>I have also added a warning which will trigger when the charge fails. If
>you see too many of those messages then there is something bad going on
>and the lack of OOM causes userspace to loop without getting any
>progress.
>
>So there you go - your personal patch ;) You can drop all other patches.
>Please note I have just compile tested it. But it should be pretty
>trivial to check it is correct
>---
>From 6f155187f77c971b45caf05dbc80ca9c20bc278c Mon Sep 17 00:00:00 2001
>From: Michal Hocko
>Date: Wed, 6 Feb 2013 16:45:07 +0100
>Subject: [PATCH 1/2] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set
>
>memcg oom killer might deadlock if the process which falls down to
>mem_cgroup_handle_oom holds a lock which prevents other task to
>terminate because it is blocked on the very same lock.
>This can happen when a write system call needs to allocate a page but
>the allocation hits the memcg hard limit and there is nothing to reclaim
>(e.g. there is no swap or swap limit is hit as well and all cache pages
>have been reclaimed already) and the process selected by memcg OOM
>killer is blocked on i_mutex on the same inode (e.g. truncate it).
>
>Process A
>[] do_truncate+0x58/0xa0 # takes i_mutex
>[] do_last+0x250/0xa30
>[] path_openat+0xd7/0x440
>[] do_filp_open+0x49/0xa0
>[] do_sys_open+0x106/0x240
>[] sys_open+0x20/0x30
>[] system_call_fastpath+0x18/0x1d
>[] 0xffffffffffffffff
>
>Process B
>[] mem_cgroup_handle_oom+0x241/0x3b0
>[] T.1146+0x5ab/0x5c0
>[] mem_cgroup_cache_charge+0xbe/0xe0
>[] add_to_page_cache_locked+0x4c/0x140
>[] add_to_page_cache_lru+0x22/0x50
>[] grab_cache_page_write_begin+0x8b/0xe0
>[] ext3_write_begin+0x88/0x270
>[] generic_file_buffered_write+0x116/0x290
>[] __generic_file_aio_write+0x27c/0x480
>[] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex
>[] do_sync_write+0xea/0x130
>[] vfs_write+0xf3/0x1f0
>[] sys_write+0x51/0x90
>[] system_call_fastpath+0x18/0x1d
>[] 0xffffffffffffffff
>
>This is not a hard deadlock though because administrator can still
>intervene and increase the limit on the group which helps the writer to
>finish the allocation and release the lock.
>
>This patch heals the problem by forbidding OOM from dangerous context.
>Memcg charging code has no way to find out whether it is called from a
>locked context we have to help it via process flags. PF_OOM_ORIGIN flag
>removed recently will be reused for PF_NO_MEMCG_OOM which signals that
>the memcg OOM killer could lead to a deadlock.
>Only locked callers of __generic_file_aio_write are currently marked. I
>am pretty sure there are more places (I didn't check shmem and hugetlb
>uses fancy instantion mutex during page fault and filesystems might
>use some locks during the write) but I've ignored those as this will
>probably be just a user specific patch without any way to get upstream
>in the current form.
>
>Reported-by: azurIt
>Signed-off-by: Michal Hocko
>---
> drivers/staging/pohmelfs/inode.c |    2 ++
> include/linux/sched.h            |    1 +
> mm/filemap.c                     |    2 ++
> mm/memcontrol.c                  |   18 ++++++++++++++----
> 4 files changed, 19 insertions(+), 4 deletions(-)
>
>diff --git a/drivers/staging/pohmelfs/inode.c b/drivers/staging/pohmelfs/inode.c
>index 7a19555..523de82e 100644
>--- a/drivers/staging/pohmelfs/inode.c
>+++ b/drivers/staging/pohmelfs/inode.c
>@@ -921,7 +921,9 @@ ssize_t pohmelfs_write(struct file *file, const char __user *buf,
> 	if (ret)
> 		goto err_out_unlock;
> 
>+	current->flags |= PF_NO_MEMCG_OOM;
> 	ret = __generic_file_aio_write(&kiocb, &iov, 1, &kiocb.ki_pos);
>+	current->flags &= ~PF_NO_MEMCG_OOM;
> 	*ppos = kiocb.ki_pos;
> 
> 	mutex_unlock(&inode->i_mutex);
>diff --git a/include/linux/sched.h b/include/linux/sched.h
>index 1e86bb4..f275c8f 100644
>--- a/include/linux/sched.h
>+++ b/include/linux/sched.h
>@@ -1781,6 +1781,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
> #define PF_FROZEN	0x00010000	/* frozen for system suspend */
> #define PF_FSTRANS	0x00020000	/* inside a filesystem transaction */
> #define PF_KSWAPD	0x00040000	/* I am kswapd */
>+#define PF_NO_MEMCG_OOM	0x00080000	/* Memcg OOM could lead to a deadlock */
> #define PF_LESS_THROTTLE 0x00100000	/* Throttle me less: I clean memory */
> #define PF_KTHREAD	0x00200000	/* I am a kernel thread */
> #define PF_RANDOMIZE	0x00400000	/* randomize virtual address space */
>diff --git a/mm/filemap.c b/mm/filemap.c
>index 556858c..58a316b 100644
>--- a/mm/filemap.c
>+++ b/mm/filemap.c
>@@ -2617,7 +2617,9 @@ ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
> 
> 	mutex_lock(&inode->i_mutex);
> 	blk_start_plug(&plug);
>+	current->flags |= PF_NO_MEMCG_OOM;
> 	ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
>+	current->flags &= ~PF_NO_MEMCG_OOM;
> 	mutex_unlock(&inode->i_mutex);
> 
> 	if (ret > 0 || ret == -EIOCBQUEUED) {
>diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>index c8425b1..128b615 100644
>--- a/mm/memcontrol.c
>+++ b/mm/memcontrol.c
>@@ -2397,6 +2397,14 @@ done:
> 	return 0;
> nomem:
> 	*ptr = NULL;
>+	if (printk_ratelimit())
>+		printk(KERN_WARNING"%s: task:%s pid:%d got ENOMEM without OOM for memcg:%p."
>+			" If this message shows up very often for the"
>+			" same task then there is a risk that the"
>+			" process is not able to make any progress"
>+			" because of the current limit. Try to enlarge"
>+			" the hard limit.\n", __FUNCTION__,
>+			current->comm, current->pid, memcg);
> 	return -ENOMEM;
> bypass:
> 	*ptr = NULL;
>@@ -2703,7 +2711,7 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
> 	struct mem_cgroup *memcg = NULL;
> 	unsigned int nr_pages = 1;
> 	struct page_cgroup *pc;
>-	bool oom = true;
>+	bool oom = !(current->flags & PF_NO_MEMCG_OOM);
> 	int ret;
> 
> 	if (PageTransHuge(page)) {
>@@ -2770,6 +2778,7 @@ __mem_cgroup_commit_charge_lrucare(struct page *page, struct mem_cgroup *memcg,
> int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> 				gfp_t gfp_mask)
> {
>+	bool oom = !(current->flags & PF_NO_MEMCG_OOM);
> 	struct mem_cgroup *memcg = NULL;
> 	int ret;
> 
>@@ -2782,7 +2791,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> 		mm = &init_mm;
> 
> 	if (page_is_file_cache(page)) {
>-		ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, &memcg, true);
>+		ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, &memcg, oom);
> 		if (ret || !memcg)
> 			return ret;
> 
>@@ -2818,6 +2827,7 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
> 				 struct page *page,
> 				 gfp_t mask, struct mem_cgroup **ptr)
> {
>+	bool oom = !(current->flags & PF_NO_MEMCG_OOM);
> 	struct mem_cgroup *memcg;
> 	int ret;
> 
>@@ -2840,13 +2850,13 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
> 	if (!memcg)
> 		goto charge_cur_mm;
> 	*ptr = memcg;
>-	ret = __mem_cgroup_try_charge(NULL, mask, 1, ptr, true);
>+	ret = __mem_cgroup_try_charge(NULL, mask, 1, ptr, oom);
> 	css_put(&memcg->css);
> 	return ret;
> charge_cur_mm:
> 	if (unlikely(!mm))
> 		mm = &init_mm;
>-	return __mem_cgroup_try_charge(mm, mask, 1, ptr, true);
>+	return __mem_cgroup_try_charge(mm, mask, 1, ptr, oom);
> }
> 
> static void
>-- 
>1.7.10.4
>
>-- 
>Michal Hocko
>SUSE Labs
>

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Michal Hocko
Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set
Date: Fri, 8 Feb 2013 10:44:20 +0100
Message-ID: <20130208094420.GA7557@dhcp22.suse.cz>
References: <20130125160723.FAE73567@pobox.sk> <20130125163130.GF4721@dhcp22.suse.cz> <20130205134937.GA22804@dhcp22.suse.cz> <20130205154947.CD6411E2@pobox.sk> <20130205160934.GB22804@dhcp22.suse.cz> <20130206021721.1AE9E3C7@pobox.sk> <20130206140119.GD10254@dhcp22.suse.cz> <20130206142219.GF10254@dhcp22.suse.cz> <20130206160051.GG10254@dhcp22.suse.cz> <20130208060304.799F362F@pobox.sk>
Mime-Version: 1.0
Return-path:
Content-Disposition: inline
In-Reply-To: <20130208060304.799F362F@pobox.sk>
Sender: owner-linux-mm@kvack.org
List-ID:
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: azurIt
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner

On Fri 08-02-13 06:03:04, azurIt wrote:
> Michal, thank you very much but it just didn't work and broke
> everything :(

I am sorry to hear that. The patch should help to solve the deadlock you have seen earlier. It in no way can solve side effects of failing writes and it also cannot help much if the oom is permanent.
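For readers who want to see the mechanism of the patch in isolation, here is a toy user-space model in Python. It is only a sketch under stated assumptions: the `Task` class, the `no_memcg_oom` helper and the `charge` function are hypothetical stand-ins and not kernel APIs; only the flag value 0x00080000 and the set/clear-around-the-locked-write pattern are taken from the patch.

```python
import errno
from contextlib import contextmanager

PF_NO_MEMCG_OOM = 0x00080000  # bit value as defined in the patch


class Task:
    """Hypothetical stand-in for task_struct; only the flags word is modelled."""

    def __init__(self):
        self.flags = 0


@contextmanager
def no_memcg_oom(task):
    """Set the flag for the duration of a 'locked' section, then clear it.

    Mirrors the patch, which sets the bit before calling
    __generic_file_aio_write and clears it unconditionally afterwards.
    """
    task.flags |= PF_NO_MEMCG_OOM
    try:
        yield
    finally:
        task.flags &= ~PF_NO_MEMCG_OOM


def charge(task, usage, limit, nr_pages=1):
    """Toy charge: 0 on success, -ENOMEM if over limit with the flag set,
    otherwise the string 'oom' to mark that the OOM path would be taken."""
    if usage + nr_pages <= limit:
        return 0
    if task.flags & PF_NO_MEMCG_OOM:
        return -errno.ENOMEM
    return "oom"


task = Task()
print(charge(task, usage=10, limit=10))      # over limit, flag clear -> OOM path
with no_memcg_oom(task):
    print(charge(task, usage=10, limit=10))  # over limit, flag set -> -ENOMEM
```

The point of the model is the trade-off described in the mail: inside the flagged section the caller gets an error back instead of waiting in the OOM handler while holding a lock, so the write fails rather than deadlocks.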
> This happened:
> Problem started to occur really often immediately after booting the
> new kernel, every few minutes for one of my users. But everything
> other seems to work fine so i gave it a try for a day (which was a
> mistake). I grabbed some data for you and go to sleep:
> http://watchdog.sk/lkml/memcg-bug-4.tar.gz

Do you have logs from that time period?

I have only glanced through the stacks and most of the threads are
waiting in the mem_cgroup_handle_oom (mostly from the page fault path
where we do not have other options than waiting) which suggests that
your memory limit is seriously underestimated. If you look at the number
of charging failures (memory.failcnt per-group file) then you will get
9332083 failures on _average_ per group. This is a lot!
Not all those failures end with OOM, of course. But it clearly signals
that the workload needs much more memory than the limit allows.

> Few hours later i was woke up from my sweet sweet dreams by alerts
> smses - Apache wasn't working and our system failed to restart
> it. When i observed the situation, two apache processes (of that user
> as above) were still running and it wasn't possible to kill them by
> any way. I grabbed some data for you:
> http://watchdog.sk/lkml/memcg-bug-5.tar.gz

There are only 5 groups in this one and all of them have no memory
charged (so no OOM going on). All tasks are somewhere in the ptrace
code.

grep cache -r .
./1360297489/memory.stat:cache 0
./1360297489/memory.stat:total_cache 65642496
./1360297491/memory.stat:cache 0
./1360297491/memory.stat:total_cache 65642496
./1360297492/memory.stat:cache 0
./1360297492/memory.stat:total_cache 65642496
./1360297490/memory.stat:cache 0
./1360297490/memory.stat:total_cache 65642496
./1360297488/memory.stat:cache 0
./1360297488/memory.stat:total_cache 65642496

which suggests that this is a parent group and the memory is charged in
a child group. I guess that all those are under OOM as the number seems
like they have limit at 62M.
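The "62M" figure follows directly from the total_cache value in the grep output. A small Python check of that arithmetic, as a sketch: the 4 KiB page size is an assumption (the usual x86 value), and `to_mib` is just a helper name chosen here.

```python
MIB = 1024 * 1024
PAGE_SIZE = 4096  # assumption: 4 KiB pages

def to_mib(num_bytes):
    """Convert a byte count, as reported in memory.stat, to MiB."""
    return num_bytes / MIB

total_cache = 65642496  # total_cache value from the grep output
print(f"{to_mib(total_cache):.1f} MiB")     # roughly 62.6 MiB
print(total_cache // PAGE_SIZE, "pages")    # whole number of 4 KiB pages
```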
> Then I logged to the console and this was waiting for me:
> http://watchdog.sk/lkml/error.jpg

This is just a warning and it should be harmless. There is just one WARN in ptrace_check_attach:
	WARN_ON_ONCE(task_is_stopped(child))

This has been introduced by http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=321fb561 and the commit description claims this shouldn't happen. I am not familiar with this code but it sounds like a bug in the tracing code which is not related to the discussed issue.

> Finally i rebooted into different kernel, wrote this e-mail and go to
> my lovely bed ;)

-- 
Michal Hocko
SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970
From: "azurIt"
Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set
Date: Fri, 08 Feb 2013 12:02:49 +0100
Message-ID: <20130208120249.FD733220@pobox.sk>
References: <20130125160723.FAE73567@pobox.sk>, <20130125163130.GF4721@dhcp22.suse.cz>, <20130205134937.GA22804@dhcp22.suse.cz>, <20130205154947.CD6411E2@pobox.sk>, <20130205160934.GB22804@dhcp22.suse.cz>, <20130206021721.1AE9E3C7@pobox.sk>, <20130206140119.GD10254@dhcp22.suse.cz>, <20130206142219.GF10254@dhcp22.suse.cz>, <20130206160051.GG10254@dhcp22.suse.cz>, <20130208060304.799F362F@pobox.sk> <20130208094420.GA7557@dhcp22.suse.cz>
Mime-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Return-path:
In-Reply-To: <20130208094420.GA7557@dhcp22.suse.cz>
Sender: owner-linux-mm@kvack.org
List-ID:
Content-Type: text/plain; charset="us-ascii"
To: Michal Hocko
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner

>
>Do you have logs from that time period?
>
>I have only glanced through the stacks and most of the threads are
>waiting in the mem_cgroup_handle_oom (mostly from the page fault path
>where we do not have other options than waiting) which suggests that
>your memory limit is seriously underestimated. If you look at the number
>of charging failures (memory.failcnt per-group file) then you will get
>9332083 failures in _average_ per group. This is a lot!
>Not all those failures end with OOM, of course. But it clearly signals
>that the workload need much more memory than the limit allows.

What type of logs? I have them all.

Memory usage graph:
http://www.watchdog.sk/lkml/memory2.png

The new kernel was booted at about 1:15. The data in memcg-bug-4.tar.gz were taken at about 2:35 and the data in memcg-bug-5.tar.gz at about 5:25. There was always lots of free memory. The higher memory consumption between 3:39 and 5:33 was caused by a data backup, which completed a few minutes before I restarted the server (this was just a coincidence).

>There are only 5 groups in this one and all of them have no memory
>charged (so no OOM going on). All tasks are somewhere in the ptrace
>code.

It's all from the same cgroup, but from different times.

>grep cache -r .
>./1360297489/memory.stat:cache 0
>./1360297489/memory.stat:total_cache 65642496
>./1360297491/memory.stat:cache 0
>./1360297491/memory.stat:total_cache 65642496
>./1360297492/memory.stat:cache 0
>./1360297492/memory.stat:total_cache 65642496
>./1360297490/memory.stat:cache 0
>./1360297490/memory.stat:total_cache 65642496
>./1360297488/memory.stat:cache 0
>./1360297488/memory.stat:total_cache 65642496
>
>which suggests that this is a parent group and the memory is charged in
>a child group. I guess that all those are under OOM as the number seems
>like they have limit at 62M.

The cgroup has a limit of 330M (346030080 bytes). As I said, these two processes were stuck and it was impossible to kill them.
They were, maybe, the processes which I was trying to 'strace' before - 'strace' froze, as always when the cgroup has this problem, and I killed it (I was just checking whether it was the original cgroup problem).

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Michal Hocko
Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set
Date: Fri, 8 Feb 2013 13:38:54 +0100
Message-ID: <20130208123854.GB7557@dhcp22.suse.cz>
References: <20130205134937.GA22804@dhcp22.suse.cz> <20130205154947.CD6411E2@pobox.sk> <20130205160934.GB22804@dhcp22.suse.cz> <20130206021721.1AE9E3C7@pobox.sk> <20130206140119.GD10254@dhcp22.suse.cz> <20130206142219.GF10254@dhcp22.suse.cz> <20130206160051.GG10254@dhcp22.suse.cz> <20130208060304.799F362F@pobox.sk> <20130208094420.GA7557@dhcp22.suse.cz> <20130208120249.FD733220@pobox.sk>
Mime-Version: 1.0
Return-path:
Content-Disposition: inline
In-Reply-To: <20130208120249.FD733220-Rm0zKEqwvD4@public.gmane.org>
Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
List-ID:
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: azurIt
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner

On Fri 08-02-13 12:02:49, azurIt wrote:
> >
> >Do you have logs from that time period?
> >
> >I have only glanced through the stacks and most of the threads are
> >waiting in the mem_cgroup_handle_oom (mostly from the page fault path
> >where we do not have other options than waiting) which suggests that
> >your memory limit is seriously underestimated. If you look at the number
> >of charging failures (memory.failcnt per-group file) then you will get
> >9332083 failures in _average_ per group. This is a lot!
> >Not all those failures end with OOM, of course. But it clearly signals > >that the workload need much more memory than the limit allows. > > > What type of logs? I have all. kernel log would be sufficient. > Memory usage graph: > http://www.watchdog.sk/lkml/memory2.png > > New kernel was booted about 1:15. Data in memcg-bug-4.tar.gz were taken about 2:35 and data in memcg-bug-5.tar.gz about 5:25. There was always lots of free memory. Higher memory consumption between 3:39 and 5:33 was caused by data backup and was completed few minutes before i restarted the server (this was just a coincidence). > > > > >There are only 5 groups in this one and all of them have no memory > >charged (so no OOM going on). All tasks are somewhere in the ptrace > >code. > > > It's all from the same cgroup but from different time. > > > > >grep cache -r . > >./1360297489/memory.stat:cache 0 > >./1360297489/memory.stat:total_cache 65642496 > >./1360297491/memory.stat:cache 0 > >./1360297491/memory.stat:total_cache 65642496 > >./1360297492/memory.stat:cache 0 > >./1360297492/memory.stat:total_cache 65642496 > >./1360297490/memory.stat:cache 0 > >./1360297490/memory.stat:total_cache 65642496 > >./1360297488/memory.stat:cache 0 > >./1360297488/memory.stat:total_cache 65642496 > > > >which suggests that this is a parent group and the memory is charged in > >a child group. I guess that all those are under OOM as the number seems > >like they have limit at 62M. > > > The cgroup has limit 330M (346030080 bytes). This limit is for top level groups, right? Those seem to children which have 62MB charged - is that a limit for those children? > As i said, these two processes Which are those two processes? > were stucked and was impossible to kill them. They were, > maybe, the processes which i was trying to 'strace' before - 'strace' > was freezed as always when the cgroup has this problem and i killed it > (i was just trying if it is the original cgroup problem). 
I have no idea what is the strace role here.

-- 
Michal Hocko
SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970
From: "azurIt"
Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set
Date: Fri, 08 Feb 2013 14:56:16 +0100
Message-ID: <20130208145616.FB78CE24@pobox.sk>
References: <20130205134937.GA22804@dhcp22.suse.cz>, <20130205154947.CD6411E2@pobox.sk>, <20130205160934.GB22804@dhcp22.suse.cz>, <20130206021721.1AE9E3C7@pobox.sk>, <20130206140119.GD10254@dhcp22.suse.cz>, <20130206142219.GF10254@dhcp22.suse.cz>, <20130206160051.GG10254@dhcp22.suse.cz>, <20130208060304.799F362F@pobox.sk>, <20130208094420.GA7557@dhcp22.suse.cz>, <20130208120249.FD733220@pobox.sk> <20130208123854.GB7557@dhcp22.suse.cz>
Mime-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Return-path:
In-Reply-To: <20130208123854.GB7557@dhcp22.suse.cz>
Sender: owner-linux-mm@kvack.org
List-ID:
Content-Type: text/plain; charset="us-ascii"
To: Michal Hocko
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner

>kernel log would be sufficient.

Full kernel log from the kernel with your newest patch:
http://watchdog.sk/lkml/kern2.log

>This limit is for top level groups, right? Those seem to children which
>have 62MB charged - is that a limit for those children?

It was the limit for the parent cgroup, and the processes were in one (the same) child cgroup. The child cgroup has no memory limit set (so the limit for the parent was also the limit for the child: 330 MB).

>Which are those two processes?

Data are inside memcg-bug-5.tar.gz in directories bug///

>I have no idea what is the strace role here.

I was stracing exactly two processes from that cgroup, and exactly two processes got stuck later and were impossible to kill. Both of them were waiting in 'ptrace_stop'. Maybe it's completely unrelated, I'm just guessing.
azur

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Michal Hocko
Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set
Date: Fri, 8 Feb 2013 15:47:20 +0100
Message-ID: <20130208144720.GC7557@dhcp22.suse.cz>
References: <20130205160934.GB22804@dhcp22.suse.cz> <20130206021721.1AE9E3C7@pobox.sk> <20130206140119.GD10254@dhcp22.suse.cz> <20130206142219.GF10254@dhcp22.suse.cz> <20130206160051.GG10254@dhcp22.suse.cz> <20130208060304.799F362F@pobox.sk> <20130208094420.GA7557@dhcp22.suse.cz> <20130208120249.FD733220@pobox.sk> <20130208123854.GB7557@dhcp22.suse.cz> <20130208145616.FB78CE24@pobox.sk>
Mime-Version: 1.0
Return-path:
Content-Disposition: inline
In-Reply-To: <20130208145616.FB78CE24-Rm0zKEqwvD4@public.gmane.org>
Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
List-ID:
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: azurIt
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner

On Fri 08-02-13 14:56:16, azurIt wrote:
> Data are inside memcg-bug-5.tar.gz in directories bug///

Ohh, I didn't realize those were timestamp directories. It makes more sense now.
-- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set Date: Fri, 8 Feb 2013 16:24:02 +0100 Message-ID: <20130208152402.GD7557@dhcp22.suse.cz> References: <20130205160934.GB22804@dhcp22.suse.cz> <20130206021721.1AE9E3C7@pobox.sk> <20130206140119.GD10254@dhcp22.suse.cz> <20130206142219.GF10254@dhcp22.suse.cz> <20130206160051.GG10254@dhcp22.suse.cz> <20130208060304.799F362F@pobox.sk> <20130208094420.GA7557@dhcp22.suse.cz> <20130208120249.FD733220@pobox.sk> <20130208123854.GB7557@dhcp22.suse.cz> <20130208145616.FB78CE24@pobox.sk> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20130208145616.FB78CE24-Rm0zKEqwvD4@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: azurIt Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Fri 08-02-13 14:56:16, azurIt wrote: > >kernel log would be sufficient. > > > Full kernel log from kernel with you newest patch: > http://watchdog.sk/lkml/kern2.log OK, so the log says that there is a little slaughter on your yard: $ grep "Memory cgroup out of memory:" kern2.log | wc -l 220 $ grep "Memory cgroup out of memory:" kern2.log | sed 's@.*Kill process \([0-9]*\) .*@\1@' | sort -u | wc -l 220 Which means that the oom killer didn't try to kill any task more than once which is good because it tells us that the killed task manages to die before we trigger oom again. So this is definitely not a deadlock. You are just hitting OOM very often. 
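The two grep pipelines above compare the total number of memcg OOM kills with the number of distinct killed PIDs; equal numbers mean no task had to be killed twice. A rough Python equivalent, as a sketch only: the sample lines are made up for illustration, and the message layout ("... Kill process PID (name) ...") is assumed from the sed expression rather than copied from kern2.log.

```python
import re

# Matches the PID in a memcg OOM kill message (layout assumed, see above).
KILL_RE = re.compile(r"Memory cgroup out of memory:.*Kill process (\d+) ")

def count_kills(lines):
    """Return (total kill messages, number of distinct killed PIDs)."""
    pids = []
    for line in lines:
        match = KILL_RE.search(line)
        if match:
            pids.append(match.group(1))
    return len(pids), len(set(pids))

# Hypothetical sample lines, not taken from the real log.
SAMPLE = [
    "kernel: Memory cgroup out of memory: Kill process 6733 (apache2) score 1000 or sacrifice child",
    "kernel: Memory cgroup out of memory: Kill process 12092 (apache2) score 1000 or sacrifice child",
    "kernel: some unrelated message",
    "kernel: Memory cgroup out of memory: Kill process 12093 (apache2) score 1000 or sacrifice child",
]

total, unique = count_kills(SAMPLE)
print(total, unique)
```

If `total` were larger than `unique`, some PID would have been selected more than once, which is what a stuck victim would look like.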
$ grep "killed as a result of limit" kern2.log | sed 's@.*\] @@' | sort | uniq -c | sort -k1 -n
      1 Task in /1091/uid killed as a result of limit of /1091
      1 Task in /1223/uid killed as a result of limit of /1223
      1 Task in /1229/uid killed as a result of limit of /1229
      1 Task in /1255/uid killed as a result of limit of /1255
      1 Task in /1424/uid killed as a result of limit of /1424
      1 Task in /1470/uid killed as a result of limit of /1470
      1 Task in /1567/uid killed as a result of limit of /1567
      2 Task in /1080/uid killed as a result of limit of /1080
      3 Task in /1381/uid killed as a result of limit of /1381
      4 Task in /1185/uid killed as a result of limit of /1185
      4 Task in /1289/uid killed as a result of limit of /1289
      4 Task in /1709/uid killed as a result of limit of /1709
      5 Task in /1279/uid killed as a result of limit of /1279
      6 Task in /1020/uid killed as a result of limit of /1020
      6 Task in /1527/uid killed as a result of limit of /1527
      9 Task in /1388/uid killed as a result of limit of /1388
     17 Task in /1281/uid killed as a result of limit of /1281
     22 Task in /1599/uid killed as a result of limit of /1599
     30 Task in /1155/uid killed as a result of limit of /1155
     31 Task in /1258/uid killed as a result of limit of /1258
     71 Task in /1293/uid killed as a result of limit of /1293

So the group 1293 suffers the most. I would check how much memory the workload in the group really needs because this level of OOM cannot possibly be healthy.

The log also says that the deadlock prevention implemented by the patch triggered and some writes really failed due to potential OOM:

$ grep "If this message shows up" kern2.log
Feb 8 01:17:10 server01 kernel: [ 431.033593] __mem_cgroup_try_charge: task:apache2 pid:6733 got ENOMEM without OOM for memcg:ffff8803807d5600. If this message shows up very often for the same task then there is a risk that the process is not able to make any progress because of the current limit. Try to enlarge the hard limit.
Feb 8 01:22:52 server01 kernel: [ 773.556782] __mem_cgroup_try_charge: task:apache2 pid:12092 got ENOMEM without OOM for memcg:ffff8803807d5600. If this message shows up very often for the same task then there is a risk that the process is not able to make any progress because of the current limit. Try to enlarge the hard limit.
Feb 8 01:22:52 server01 kernel: [ 773.567916] __mem_cgroup_try_charge: task:apache2 pid:12093 got ENOMEM without OOM for memcg:ffff8803807d5600. If this message shows up very often for the same task then there is a risk that the process is not able to make any progress because of the current limit. Try to enlarge the hard limit.
Feb 8 01:29:00 server01 kernel: [ 1141.355693] __mem_cgroup_try_charge: task:apache2 pid:17734 got ENOMEM without OOM for memcg:ffff88036e956e00. If this message shows up very often for the same task then there is a risk that the process is not able to make any progress because of the current limit. Try to enlarge the hard limit.
Feb 8 03:30:39 server01 kernel: [ 8440.346811] __mem_cgroup_try_charge: task:apache2 pid:8687 got ENOMEM without OOM for memcg:ffff8803654d6e00. If this message shows up very often for the same task then there is a risk that the process is not able to make any progress because of the current limit. Try to enlarge the hard limit.

This doesn't look very unhealthy. I had expected that writes would fail more often, but it seems that the biggest memory pressure comes from mmaps and page faults, which have no option other than OOM.

So my suggestion would be to reconsider the limits for the groups to provide a more realistic environment.
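To judge whether the "got ENOMEM without OOM" warnings cluster on one group or one task, they can be grouped by the memcg pointer and by PID. A Python sketch of that grouping: the log lines below are the five warnings from the excerpt, shortened to the part that matters (the advisory tail of each message is cut), and the regular expression is an assumption based on the printk format string in the patch.

```python
import re
from collections import Counter

# Field layout follows the printk format in the patch:
#   "%s: task:%s pid:%d got ENOMEM without OOM for memcg:%p."
WARN_RE = re.compile(
    r"task:(\S+) pid:(\d+) got ENOMEM without OOM for memcg:([0-9a-f]+)"
)

# The five warnings from the log excerpt, with the advisory tail removed.
LOG = [
    "[ 431.033593] __mem_cgroup_try_charge: task:apache2 pid:6733 got ENOMEM without OOM for memcg:ffff8803807d5600.",
    "[ 773.556782] __mem_cgroup_try_charge: task:apache2 pid:12092 got ENOMEM without OOM for memcg:ffff8803807d5600.",
    "[ 773.567916] __mem_cgroup_try_charge: task:apache2 pid:12093 got ENOMEM without OOM for memcg:ffff8803807d5600.",
    "[ 1141.355693] __mem_cgroup_try_charge: task:apache2 pid:17734 got ENOMEM without OOM for memcg:ffff88036e956e00.",
    "[ 8440.346811] __mem_cgroup_try_charge: task:apache2 pid:8687 got ENOMEM without OOM for memcg:ffff8803654d6e00.",
]

def warnings_per_memcg(lines):
    """Count ENOMEM-without-OOM warnings per memcg pointer."""
    counts = Counter()
    for line in lines:
        match = WARN_RE.search(line)
        if match:
            counts[match.group(3)] += 1
    return counts

def warnings_per_pid(lines):
    """Count ENOMEM-without-OOM warnings per PID."""
    counts = Counter()
    for line in lines:
        match = WARN_RE.search(line)
        if match:
            counts[int(match.group(2))] += 1
    return counts

for memcg, count in warnings_per_memcg(LOG).most_common():
    print(count, memcg)
```

In this excerpt no PID appears more than once, which fits the reading that no single task is looping on failed charges.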
-- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: "azurIt" Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_if_PF=5FNO=5FMEMCG=5FOOM_is_set?= Date: Fri, 08 Feb 2013 16:58:05 +0100 Message-ID: <20130208165805.8908B143@pobox.sk> References: <20130205160934.GB22804@dhcp22.suse.cz>, <20130206021721.1AE9E3C7@pobox.sk>, <20130206140119.GD10254@dhcp22.suse.cz>, <20130206142219.GF10254@dhcp22.suse.cz>, <20130206160051.GG10254@dhcp22.suse.cz>, <20130208060304.799F362F@pobox.sk>, <20130208094420.GA7557@dhcp22.suse.cz>, <20130208120249.FD733220@pobox.sk>, <20130208123854.GB7557@dhcp22.suse.cz>, <20130208145616.FB78CE24@pobox.sk> <20130208152402.GD7557@dhcp22.suse.cz> Mime-Version: 1.0 Content-Transfer-Encoding: 8bit Return-path: In-Reply-To: <20130208152402.GD7557-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" To: =?utf-8?q?Michal_Hocko?= Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= >Which means that the oom killer didn't try to kill any task more than >once which is good because it tells us that the killed task manages to >die before we trigger oom again. So this is definitely not a deadlock. >You are just hitting OOM very often. 
>$ grep "killed as a result of limit" kern2.log | sed 's@.*\] @@' | sort | uniq -c | sort -k1 -n > 1 Task in /1091/uid killed as a result of limit of /1091 > 1 Task in /1223/uid killed as a result of limit of /1223 > 1 Task in /1229/uid killed as a result of limit of /1229 > 1 Task in /1255/uid killed as a result of limit of /1255 > 1 Task in /1424/uid killed as a result of limit of /1424 > 1 Task in /1470/uid killed as a result of limit of /1470 > 1 Task in /1567/uid killed as a result of limit of /1567 > 2 Task in /1080/uid killed as a result of limit of /1080 > 3 Task in /1381/uid killed as a result of limit of /1381 > 4 Task in /1185/uid killed as a result of limit of /1185 > 4 Task in /1289/uid killed as a result of limit of /1289 > 4 Task in /1709/uid killed as a result of limit of /1709 > 5 Task in /1279/uid killed as a result of limit of /1279 > 6 Task in /1020/uid killed as a result of limit of /1020 > 6 Task in /1527/uid killed as a result of limit of /1527 > 9 Task in /1388/uid killed as a result of limit of /1388 > 17 Task in /1281/uid killed as a result of limit of /1281 > 22 Task in /1599/uid killed as a result of limit of /1599 > 30 Task in /1155/uid killed as a result of limit of /1155 > 31 Task in /1258/uid killed as a result of limit of /1258 > 71 Task in /1293/uid killed as a result of limit of /1293 > >So the group 1293 suffers the most. I would check how much memory the >worklod in the group really needs because this level of OOM cannot >possible be healthy. 

I took the kernel log from yesterday from the same time frame:

$ grep "killed as a result of limit" kern2.log | sed 's@.*\] @@' | sort | uniq -c | sort -k1 -n
      1 Task in /1252/uid killed as a result of limit of /1252
      1 Task in /1709/uid killed as a result of limit of /1709
      2 Task in /1185/uid killed as a result of limit of /1185
      2 Task in /1388/uid killed as a result of limit of /1388
      2 Task in /1567/uid killed as a result of limit of /1567
      2 Task in /1650/uid killed as a result of limit of /1650
      3 Task in /1527/uid killed as a result of limit of /1527
      5 Task in /1552/uid killed as a result of limit of /1552
   1634 Task in /1258/uid killed as a result of limit of /1258

As you can see, there were many more OOMs in '1258' and no problems
like the ones this night (well, there were never such problems before
:) ). As I said, cgroup 1258 was freezing every few minutes with your
latest patch, so there must be something wrong (it usually freezes
about once per day). And it was really frozen (I checked that); the
symptoms were:
- cannot strace any of the cgroup's processes
- no new processes were started, still the same processes were 'running'
- kernel was unable to resolve this on its own
- all processes together were taking 100% CPU
- the whole memory limit was used
(see memcg-bug-4.tar.gz for more info)

Unfortunately I forgot to check whether killing only a few of the
processes would resolve it (I always killed them all yesterday night).
I don't know whether it was in a deadlock or not, but the kernel was
definitely unable to resolve the problem.

And there is still the mystery of the two frozen processes which cannot
be killed.

By the way, I KNOW that so much OOM is not healthy, but the client
simply doesn't want to buy more memory. He knows about the problem of
the insufficient memory limit.

Thank you.
azur From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Date: Fri, 8 Feb 2013 17:01:19 +0100 Message-ID: <20130208160119.GE7557@dhcp22.suse.cz> References: <20121230110815.GA12940@dhcp22.suse.cz> <20130125160723.FAE73567@pobox.sk> <20130125163130.GF4721@dhcp22.suse.cz> <20130205134937.GA22804@dhcp22.suse.cz> <20130205154947.CD6411E2@pobox.sk> <20130205160934.GB22804@dhcp22.suse.cz> <20130206021721.1AE9E3C7@pobox.sk> <20130206140119.GD10254@dhcp22.suse.cz> <51138999.3090006@jp.fujitsu.com> <5114577D.70608@jp.fujitsu.com> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <5114577D.70608-+CUm20s59erQFUHtdCDX3A@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Kamezawa Hiroyuki Cc: azurIt , linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist , Johannes Weiner On Fri 08-02-13 10:40:13, KAMEZAWA Hiroyuki wrote: > (2013/02/07 20:01), Kamezawa Hiroyuki wrote: [...] > >Hmm. do we need to increase the "limit" virtually at memcg oom until > >the oom-killed process dies ? > > Here is my naive idea... and the next step would be http://en.wikipedia.org/wiki/Credit_default_swap :P But seriously now. The idea is not bad at all. This implementation would need some tweaks to work though (e.g. you would need to wake oom sleepers when you get a loan - because those are ones which can block the resource). We should also give the borrowed charges only to those who would oom to prevent from stealing. I think that it should be mem_cgroup_out_of_memory who establishes the loan and it can have a look at how much memory the killed task frees - e.g. 
some portion of get_mm_rss() or a more precise but much more expensive traversing via private vmas and check whether they charged memory from the target memcg hierarchy (this is a slow path anyway). But who knows maybe a fixed 2MB would work out as well. Thanks! > == > From 1a46318cf89e7df94bd4844f29105b61dacf335b Mon Sep 17 00:00:00 2001 > From: KAMEZAWA Hiroyuki > Date: Fri, 8 Feb 2013 10:43:52 +0900 > Subject: [PATCH] [Don't Apply][PATCH] memcg relax resource at OOM situation. > > When an OOM happens, a task is killed and resources will be freed. > > A problem here is that a task, which is oom-killed, may wait for > some other resource in which memory resource is required. Some thread > waits for free memory may holds some mutex and oom-killed process > wait for the mutex. > > To avoid this, relaxing charged memory by giving virtual resource > can be a help. The system can get back it at uncharge(). > This is a sample native implementation. > > Signed-off-by: KAMEZAWA Hiroyuki > --- > mm/memcontrol.c | 79 ++++++++++++++++++++++++++++++++++++++++++++++++++----- > 1 file changed, 73 insertions(+), 6 deletions(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 25ac5f4..4dea49a 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -301,6 +301,9 @@ struct mem_cgroup { > /* set when res.limit == memsw.limit */ > bool memsw_is_minimum; > + /* extra resource at emergency situation */ > + unsigned long loan; > + spinlock_t loan_lock; > /* protect arrays of thresholds */ > struct mutex thresholds_lock; > @@ -2034,6 +2037,61 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg, > mem_cgroup_iter_break(root_memcg, victim); > return total; > } > +/* > + * When a memcg is in OOM situation, this lack of resource may cause deadlock > + * because of complicated lock dependency(i_mutex...). To avoid that, we > + * need extra resource or avoid charging. > + * > + * A memcg can request resource in an emergency state. We call it as loan. 
> + * A memcg will return a loan when it does uncharge resource. We disallow > + * double-loan and moving task to other groups until the loan is fully > + * returned. > + * > + * Note: the problem here is that we cannot know what amount resouce should > + * be necessary to exiting an emergency state..... > + */ > +#define LOAN_MAX (2 * 1024 * 1024) > + > +static void mem_cgroup_make_loan(struct mem_cgroup *memcg) > +{ > + u64 usage; > + unsigned long amount; > + > + amount = LOAN_MAX; > + > + usage = res_counter_read_u64(&memcg->res, RES_USAGE); > + if (amount > usage /2 ) > + amount = usage / 2; > + spin_lock(&memcg->loan_lock); > + if (memcg->loan) { > + spin_unlock(&memcg->loan_lock); > + return; > + } > + memcg->loan = amount; > + res_counter_uncharge(&memcg->res, amount); > + if (do_swap_account) > + res_counter_uncharge(&memcg->memsw, amount); > + spin_unlock(&memcg->loan_lock); > +} > + > +/* return amount of free resource which can be uncharged */ > +static unsigned long > +mem_cgroup_may_return_loan(struct mem_cgroup *memcg, unsigned long val) > +{ > + unsigned long tmp; > + /* we don't care small race here */ > + if (unlikely(!memcg->loan)) > + return val; > + spin_lock(&memcg->loan_lock); > + if (memcg->loan) { > + tmp = min(memcg->loan, val); > + memcg->loan -= tmp; > + val -= tmp; > + } > + spin_unlock(&memcg->loan_lock); > + return val; > +} > + > /* > * Check OOM-Killer is already running under our hierarchy. 
> @@ -2182,6 +2240,7 @@ static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask, > if (need_to_kill) { > finish_wait(&memcg_oom_waitq, &owait.wait); > mem_cgroup_out_of_memory(memcg, mask, order); > + mem_cgroup_make_loan(memcg); > } else { > schedule(); > finish_wait(&memcg_oom_waitq, &owait.wait); > @@ -2748,6 +2807,8 @@ static void __mem_cgroup_cancel_charge(struct mem_cgroup *memcg, > if (!mem_cgroup_is_root(memcg)) { > unsigned long bytes = nr_pages * PAGE_SIZE; > + bytes = mem_cgroup_may_return_loan(memcg, bytes); > + > res_counter_uncharge(&memcg->res, bytes); > if (do_swap_account) > res_counter_uncharge(&memcg->memsw, bytes); > @@ -3989,6 +4050,7 @@ static void mem_cgroup_do_uncharge(struct mem_cgroup *memcg, > { > struct memcg_batch_info *batch = NULL; > bool uncharge_memsw = true; > + unsigned long val; > /* If swapout, usage of swap doesn't decrease */ > if (!do_swap_account || ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT) > @@ -4029,9 +4091,11 @@ static void mem_cgroup_do_uncharge(struct mem_cgroup *memcg, > batch->memsw_nr_pages++; > return; > direct_uncharge: > - res_counter_uncharge(&memcg->res, nr_pages * PAGE_SIZE); > + val = nr_pages * PAGE_SIZE; > + val = mem_cgroup_may_return_loan(memcg, val); > + res_counter_uncharge(&memcg->res, val); > if (uncharge_memsw) > - res_counter_uncharge(&memcg->memsw, nr_pages * PAGE_SIZE); > + res_counter_uncharge(&memcg->memsw, val); > if (unlikely(batch->memcg != memcg)) > memcg_oom_recover(memcg); > } > @@ -4182,6 +4246,7 @@ void mem_cgroup_uncharge_start(void) > void mem_cgroup_uncharge_end(void) > { > struct memcg_batch_info *batch = &current->memcg_batch; > + unsigned long val; > if (!batch->do_batch) > return; > @@ -4192,16 +4257,16 @@ void mem_cgroup_uncharge_end(void) > if (!batch->memcg) > return; > + val = batch->nr_pages * PAGE_SIZE; > + val = mem_cgroup_may_return_loan(batch->memcg, val); > /* > * This "batch->memcg" is valid without any css_get/put etc... > * bacause we hide charges behind us.
> */ > if (batch->nr_pages) > - res_counter_uncharge(&batch->memcg->res, > - batch->nr_pages * PAGE_SIZE); > + res_counter_uncharge(&batch->memcg->res, val); > if (batch->memsw_nr_pages) > - res_counter_uncharge(&batch->memcg->memsw, > - batch->memsw_nr_pages * PAGE_SIZE); > + res_counter_uncharge(&batch->memcg->memsw, val); > memcg_oom_recover(batch->memcg); > /* forget this pointer (for sanity check) */ > batch->memcg = NULL; > @@ -6291,6 +6356,8 @@ mem_cgroup_css_alloc(struct cgroup *cont) > memcg->move_charge_at_immigrate = 0; > mutex_init(&memcg->thresholds_lock); > spin_lock_init(&memcg->move_lock); > + memcg->loan = 0; > + spin_lock_init(&memcg->loan_lock); > return &memcg->css; > -- > 1.7.10.2 > > > > > > > > -- > To unsubscribe from this list: send the line "unsubscribe cgroups" in > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Date: Fri, 8 Feb 2013 17:29:18 +0100 Message-ID: <20130208162918.GF7557@dhcp22.suse.cz> References: <20130125160723.FAE73567@pobox.sk> <20130125163130.GF4721@dhcp22.suse.cz> <20130205134937.GA22804@dhcp22.suse.cz> <20130205154947.CD6411E2@pobox.sk> <20130205160934.GB22804@dhcp22.suse.cz> <20130205174651.GA3959@dhcp22.suse.cz> <20130205185953.GB3959@dhcp22.suse.cz> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Greg Thelen Cc: azurIt , linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Thu 07-02-13 20:27:00, Greg Thelen wrote: > On Tue, Feb 05 2013, Michal Hocko wrote: > > > On Tue 
05-02-13 10:09:57, Greg Thelen wrote: > >> On Tue, Feb 05 2013, Michal Hocko wrote: > >> > >> > On Tue 05-02-13 08:48:23, Greg Thelen wrote: > >> >> On Tue, Feb 05 2013, Michal Hocko wrote: > >> >> > >> >> > On Tue 05-02-13 15:49:47, azurIt wrote: > >> >> > [...] > >> >> >> Just to be sure - am i supposed to apply this two patches? > >> >> >> http://watchdog.sk/lkml/patches/ > >> >> > > >> >> > 5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I > >> >> > mentioned in a follow up email. Here is the full patch: > >> >> > --- > >> >> > From f2bf8437d5b9bb38a95a432bf39f32c584955171 Mon Sep 17 00:00:00 2001 > >> >> > From: Michal Hocko > >> >> > Date: Mon, 26 Nov 2012 11:47:57 +0100 > >> >> > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked > >> >> > > >> >> > memcg oom killer might deadlock if the process which falls down to > >> >> > mem_cgroup_handle_oom holds a lock which prevents other task to > >> >> > terminate because it is blocked on the very same lock. > >> >> > This can happen when a write system call needs to allocate a page but > >> >> > the allocation hits the memcg hard limit and there is nothing to reclaim > >> >> > (e.g. there is no swap or swap limit is hit as well and all cache pages > >> >> > have been reclaimed already) and the process selected by memcg OOM > >> >> > killer is blocked on i_mutex on the same inode (e.g. truncate it). 
> >> >> > > >> >> > Process A > >> >> > [] do_truncate+0x58/0xa0 # takes i_mutex > >> >> > [] do_last+0x250/0xa30 > >> >> > [] path_openat+0xd7/0x440 > >> >> > [] do_filp_open+0x49/0xa0 > >> >> > [] do_sys_open+0x106/0x240 > >> >> > [] sys_open+0x20/0x30 > >> >> > [] system_call_fastpath+0x18/0x1d > >> >> > [] 0xffffffffffffffff > >> >> > > >> >> > Process B > >> >> > [] mem_cgroup_handle_oom+0x241/0x3b0 > >> >> > [] T.1146+0x5ab/0x5c0 > >> >> > [] mem_cgroup_cache_charge+0xbe/0xe0 > >> >> > [] add_to_page_cache_locked+0x4c/0x140 > >> >> > [] add_to_page_cache_lru+0x22/0x50 > >> >> > [] grab_cache_page_write_begin+0x8b/0xe0 > >> >> > [] ext3_write_begin+0x88/0x270 > >> >> > [] generic_file_buffered_write+0x116/0x290 > >> >> > [] __generic_file_aio_write+0x27c/0x480 > >> >> > [] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex > >> >> > [] do_sync_write+0xea/0x130 > >> >> > [] vfs_write+0xf3/0x1f0 > >> >> > [] sys_write+0x51/0x90 > >> >> > [] system_call_fastpath+0x18/0x1d > >> >> > [] 0xffffffffffffffff > >> >> > >> >> It looks like grab_cache_page_write_begin() passes __GFP_FS into > >> >> __page_cache_alloc() and mem_cgroup_cache_charge(). Which makes me > >> >> think that this deadlock is also possible in the page allocator even > >> >> before getting to add_to_page_cache_lru. no? > >> > > >> > I am not that familiar with VFS but i_mutex is a high level lock AFAIR > >> > and it shouldn't be called from the pageout path so __page_cache_alloc > >> > should be safe. > >> > >> I wasn't clear, sorry. My concern is not that pageout() grabs i_mutex. > >> My concern is that __page_cache_alloc() will invoke the oom killer and > >> select a victim which wants i_mutex. This victim will deadlock because > >> the oom killer caller already holds i_mutex. 
> > > > That would be true for the memcg oom because that one is blocking but > > the global oom just puts the allocator into sleep for a while and then > > the allocator should back off eventually (unless this is NOFAIL > > allocation). I would need to look closer whether this is really the case > > - I haven't seen that allocator code path for a while... > > I think the page allocator can loop forever waiting for an oom victim to > terminate even without NOFAIL. Especially if the oom victim wants a > resource exclusively held by the allocating thread (e.g. i_mutex). It > looks like the same deadlock you describe is also possible (though more > rare) without memcg. OK, I have checked the allocator slow path and you are right even GFP_KERNEL will not fail. This can lead to similar deadlocks - e.g. OOM killed task blocked on down_write(mmap_sem) while the page fault handler holding mmap_sem for reading and allocating a new page without any progress. Luckily there are memory reserves where the allocator fall back eventually so the allocation should be able to get some memory and release the lock. There is still a theoretical chance this would block though. This sounds like a corner case though so I wouldn't care about it very much. > If the looping thread is an eligible oom victim (i.e. not oom disabled, > not an kernel thread, etc) then the page allocator can return NULL in so > long as NOFAIL is not used. So any allocator which is able to call the > oom killer and is not oom disabled (kernel thread, etc) is already > exposed to the possibility of page allocator failure. So if the page > allocator could detect the deadlock, then it could safely return NULL. > Maybe after looping N times without forward progress the page allocator > should consider failing unless NOFAIL is given. page allocator is quite tricky to touch and the chances of this deadlock are not that big. > if memcg oom kill has been tried a reasonable number of times. 
Simply > failing the memcg charge with ENOMEM seems easier to support than > exceeding limit (Kame's loan patch). We cannot do that in the page fault path because this would lead to a global oom killer. We would need to either retry the page fault or send KILL to the faulting process. But I do not like this much as this could lead to DoS attacks. -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Date: Fri, 8 Feb 2013 17:40:56 +0100 Message-ID: <20130208164056.GG7557@dhcp22.suse.cz> References: <20130125163130.GF4721@dhcp22.suse.cz> <20130205134937.GA22804@dhcp22.suse.cz> <20130205154947.CD6411E2@pobox.sk> <20130205160934.GB22804@dhcp22.suse.cz> <20130205174651.GA3959@dhcp22.suse.cz> <20130205185953.GB3959@dhcp22.suse.cz> <20130208162918.GF7557@dhcp22.suse.cz> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20130208162918.GF7557@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Greg Thelen Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Fri 08-02-13 17:29:18, Michal Hocko wrote: [...] > OK, I have checked the allocator slow path and you are right even > GFP_KERNEL will not fail. This can lead to similar deadlocks - e.g. > OOM killed task blocked on down_write(mmap_sem) while the page fault > handler holding mmap_sem for reading and allocating a new page without > any progress. And now that I think about it some more it sounds like it shouldn't be possible because allocator would fail because it would see TIF_MEMDIE (OOM killer kills all threads that share the same mm). But maybe there are other locks that are dangerous, but I think that the risk is pretty low. 
-- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set Date: Fri, 8 Feb 2013 18:10:12 +0100 Message-ID: <20130208171012.GH7557@dhcp22.suse.cz> References: <20130206140119.GD10254@dhcp22.suse.cz> <20130206142219.GF10254@dhcp22.suse.cz> <20130206160051.GG10254@dhcp22.suse.cz> <20130208060304.799F362F@pobox.sk> <20130208094420.GA7557@dhcp22.suse.cz> <20130208120249.FD733220@pobox.sk> <20130208123854.GB7557@dhcp22.suse.cz> <20130208145616.FB78CE24@pobox.sk> <20130208152402.GD7557@dhcp22.suse.cz> <20130208165805.8908B143@pobox.sk> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20130208165805.8908B143-Rm0zKEqwvD4@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: azurIt Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Fri 08-02-13 16:58:05, azurIt wrote: [...]
> I took the kernel log from yesterday from the same time frame: > > $ grep "killed as a result of limit" kern2.log | sed 's@.*\] @@' | sort | uniq -c | sort -k1 -n > 1 Task in /1252/uid killed as a result of limit of /1252 > 1 Task in /1709/uid killed as a result of limit of /1709 > 2 Task in /1185/uid killed as a result of limit of /1185 > 2 Task in /1388/uid killed as a result of limit of /1388 > 2 Task in /1567/uid killed as a result of limit of /1567 > 2 Task in /1650/uid killed as a result of limit of /1650 > 3 Task in /1527/uid killed as a result of limit of /1527 > 5 Task in /1552/uid killed as a result of limit of /1552 > 1634 Task in /1258/uid killed as a result of limit of /1258 > > As you can see, there were much more OOM in '1258' and no such > problems like this night (well, there were never such problems before > :) ). Well, all the patch does is that it prevents from the deadlock we have seen earlier. Previously the writer would block on the oom wait queue while it fails with ENOMEM now. Caller sees this as a short write which can be retried (it is a question whether userspace can cope with that properly). All other OOMs are preserved. I suspect that all the problems you are seeing now are just side effects of the OOM conditions. > As i said, cgroup 1258 were freezing every few minutes with your > latest patch so there must be something wrong (it usually freezes > about once per day). And it was really freezed (i checked that), the > sypthoms were: I assume you have checked that the killed processes eventually die, right? > - cannot strace any of cgroup processes > - no new processes were started, still the same processes were 'running' > - kernel was unable to resolve this by it's own > - all processes togather were taking 100% CPU > - the whole memory limit was used > (see memcg-bug-4.tar.gz for more info) Well, I do not see anything supsicious during that time period (timestamps translate between Fri Feb 8 02:34:05 and Fri Feb 8 02:36:48). 
The kernel log shows a lot of oom during that time. All killed processes die eventually. > Unfortunately i forget to check if killing only few of the processes > will resolve it (i always killed them all yesterday night). Don't > know if is was in deadlock or not but kernel was definitely unable > to resolve the problem. Nothing shows it would be a deadlock so far. It is well possible that the userspace went mad when seeing a lot of processes dying because it doesn't expect it. > And there is still a mystery of two freezed processes which cannot be > killed. > > By the way, i KNOW that so much OOM is not healthy but the client > simply don't want to buy more memory. He knows about the problem of > unsufficient memory limit. Well, then you would see a permanent flood of OOM killing, I am afraid. -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: "azurIt" Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_if_PF=5FNO=5FMEMCG=5FOOM_is_set?= Date: Fri, 08 Feb 2013 22:02:43 +0100 Message-ID: <20130208220243.EDEE0825@pobox.sk> References: <20130206140119.GD10254@dhcp22.suse.cz>, <20130206142219.GF10254@dhcp22.suse.cz>, <20130206160051.GG10254@dhcp22.suse.cz>, <20130208060304.799F362F@pobox.sk>, <20130208094420.GA7557@dhcp22.suse.cz>, <20130208120249.FD733220@pobox.sk>, <20130208123854.GB7557@dhcp22.suse.cz>, <20130208145616.FB78CE24@pobox.sk>, <20130208152402.GD7557@dhcp22.suse.cz>, <20130208165805.8908B143@pobox.sk> <20130208171012.GH7557@dhcp22.suse.cz> Mime-Version: 1.0 Content-Transfer-Encoding: quoted-printable Return-path: In-Reply-To: <20130208171012.GH7557@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" To: =?utf-8?q?Michal_Hocko?= Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= > >I assume you have checked that the killed processes eventually die, >right? 
>

When I killed them by hand, yes, they disappeared from the process list
(I saw it). I don't know if they really died when OOM killed them.

>Well, I do not see anything suspicious during that time period
>(timestamps translate between Fri Feb 8 02:34:05 and Fri Feb 8
>02:36:48). The kernel log shows a lot of oom during that time. All
>killed processes die eventually.

No, they didn't die by OOM when the cgroup was frozen. Just check the
PIDs from memcg-bug-4.tar.gz and try to find them in the kernel log. Why
are all PIDs waiting on 'mem_cgroup_handle_oom' and there is no OOM
message in the log? The data in memcg-bug-4.tar.gz cover only 2 minutes
but I let it run for about 15-20 minutes, and not a single process was
killed by OOM. I'm 100% sure that OOM was not killing them (maybe it was
trying to but it didn't happen).

>
>Nothing shows it would be a deadlock so far. It is well possible that
>the userspace went mad when seeing a lot of processes dying because it
>doesn't expect it.
>

Lots of processes are dying now as well, without your latest patch, and
no such things are happening. I'm sure there is something more to this;
maybe it revealed another bug?

azur

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set Date: Sun, 10 Feb 2013 16:03:13 +0100 Message-ID: <20130210150310.GA9504@dhcp22.suse.cz> References: <20130206160051.GG10254@dhcp22.suse.cz> <20130208060304.799F362F@pobox.sk> <20130208094420.GA7557@dhcp22.suse.cz> <20130208120249.FD733220@pobox.sk> <20130208123854.GB7557@dhcp22.suse.cz> <20130208145616.FB78CE24@pobox.sk> <20130208152402.GD7557@dhcp22.suse.cz> <20130208165805.8908B143@pobox.sk> <20130208171012.GH7557@dhcp22.suse.cz> <20130208220243.EDEE0825@pobox.sk> Mime-Version: 1.0 Return-path: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=x-received:sender:date:from:to:cc:subject:message-id:references :mime-version:content-type:content-disposition:in-reply-to :user-agent; bh=uyIv2AteJ0qg7b1ATaGIj867QxbkbIj4KKxUYM1gqbg=; b=KgsetLnlZGgHzPle3M+r3TJOaFrjdoIXD4XGjPb1FrH3h7+Qq+MGbEzLDhWY2sGdx5 07jitiJPHYpJHl3DFlcPh5MDI2/M0WrLczUQFLcQpeLUEI4Cahh/ncXMTsNIp8o3q8KT xR4yPmw3yaRYmvzznMPzBnssV4wr7roaCR7S8EFJeQ1enoQy6GTtyYoBGpv+ITZ+7of4 rvGTMm1+nEfcoqwwLQFoRsi7Aswt/2AayBr7nlPRIGVnAz6YbPM2SgUql4lbp37ac2ni fF8a9BiZsy27PudQuFtzVbE6Ib4TINYtOp6MWWw/dg5Cg4aEBxjbX4jsSV+kkhjJmAlw 2CFw== Content-Disposition: inline In-Reply-To: <20130208220243.EDEE0825-Rm0zKEqwvD4@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: azurIt Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Fri 08-02-13 22:02:43, azurIt wrote: > > > >I assume you have checked that the killed processes eventually die, > >right? > > > When i killed them by hand, yes, they dissappeard from process list (i > saw it). I don't know if they really died when OOM killed them. 
>
>
> >Well, I do not see anything suspicious during that time period
> >(timestamps translate between Fri Feb 8 02:34:05 and Fri Feb 8
> >02:36:48). The kernel log shows a lot of oom during that time. All
> >killed processes die eventually.
>
>
> No, they didn't die by OOM when the cgroup was frozen. Just check PIDs
> from memcg-bug-4.tar.gz and try to find them in kernel log.

OK, you seem to be right. My initial examination showed that each cgroup
under OOM was able to move forward - in other words it was able to send
SIGKILL to somebody and we didn't loop on a single task which cannot die
for some reason.

Now, looking closer, it seems we really have 2 tasks which didn't die
after being killed by the OOM killer:
$ for i in `grep "Memory cgroup out of memory:" kern2.log | sed 's@.*Kill process \([0-9]*\) .*@\1@'`; do find bug -name $i; done | sed 's@.*/@@' | sort | uniq -c
    141 18211
    141 8102

$ md5sum bug/*/18211/stack | cut -d" " -f1 | uniq -c
    141 3b8ce17e82a065a24ee046112033e1e8

So all the stacks are the same:
[] ptrace_stop+0x114/0x290
[] ptrace_do_notify+0x88/0xa0
[] ptrace_notify+0x53/0x70
[] syscall_trace_enter+0xf8/0x1c0
[] tracesys+0x71/0xd7
[] 0xffffffffffffffff

stuck in the ptrace code.
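
The md5sum check above boils down to grouping the per-timestamp snapshots of one pid by the content of its stack file. A minimal sketch of the same grouping in Python, assuming the snapshots are laid out as bug/<timestamp>/<pid>/stack (which is what the globs above suggest); group_stacks is a name made up for this illustration, not part of any tool used in this thread:

```python
import hashlib
from collections import Counter
from pathlib import Path


def group_stacks(root, pid):
    """Count how many <root>/<timestamp>/<pid>/stack snapshots share the same content."""
    counts = Counter()
    for stack_file in sorted(Path(root).glob(f"*/{pid}/stack")):
        # Hash the raw bytes so identical stacks collapse into one bucket.
        digest = hashlib.md5(stack_file.read_bytes()).hexdigest()
        counts[digest] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Hypothetical invocation: script.py bug 18211
    for digest, n in group_stacks(sys.argv[1], sys.argv[2]).most_common():
        print(f"{n:7d} {digest}")
```

A single digest whose count equals the number of snapshots means the task was sitting in the same place every time it was sampled.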
The other task is more interesting: $ md5sum bug/*/8102/stack | cut -d" " -f1 | sort | uniq -c 135 042e893c0e6657ed321ea9045e528f3e 6 dc7e71ce73be2a5c73404b565926e709 All snapshots with 042e893c0e6657ed321ea9045e528f3e are in: [] mem_cgroup_handle_oom+0x241/0x3b0 [] T.1149+0x5f3/0x600 [] mem_cgroup_charge_common+0x6c/0xb0 [] mem_cgroup_newpage_charge+0x45/0x50 [] handle_pte_fault+0x609/0x940 [] handle_mm_fault+0x138/0x260 [] do_page_fault+0x13d/0x460 [] page_fault+0x1f/0x30 [] 0xffffffffffffffff While the others do not show any stack: cat 1360287257/8102/stack [] 0xffffffffffffffff Which is quite interesting because we are talking about snapshots starting at 1360287245 (which maps to 02:34:05) but the kern2.log tells us that this process has been killed much earlier at: Feb 8 01:18:30 server01 kernel: [ 511.139921] Task in /1293/uid killed as a result of limit of /1293 [...] Feb 8 01:18:30 server01 kernel: [ 511.229755] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name Feb 8 01:18:30 server01 kernel: [ 511.230146] [ 8102] 1293 8102 170258 65869 7 0 0 apache2 Feb 8 01:18:30 server01 kernel: [ 511.230339] [ 8113] 1293 8113 163756 59442 5 0 0 apache2 Feb 8 01:18:30 server01 kernel: [ 511.230528] [ 8116] 1293 8116 170094 65675 2 0 0 apache2 Feb 8 01:18:30 server01 kernel: [ 511.230726] [ 8119] 1293 8119 170094 65675 6 0 0 apache2 Feb 8 01:18:30 server01 kernel: [ 511.230924] [ 8123] 1293 8123 169070 64612 7 0 0 apache2 Feb 8 01:18:30 server01 kernel: [ 511.231132] [ 8124] 1293 8124 170094 65675 5 0 0 apache2 Feb 8 01:18:30 server01 kernel: [ 511.231321] [ 8125] 1293 8125 170094 65673 1 0 0 apache2 Feb 8 01:18:30 server01 kernel: [ 511.231516] Memory cgroup out of memory: Kill process 8102 (apache2) score 1000 or sacrifice child This would suggest that the task is hung and cannot be killed but if we have a look at the following OOM in the same group 1293 it was _not_ present in the process list for that group: Feb 8 01:18:33 server01 kernel: [ 514.789550] 
Task in /1293/uid killed as a result of limit of /1293
[...]
Feb 8 01:18:33 server01 kernel: [ 514.893198] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
Feb 8 01:18:33 server01 kernel: [ 514.893594] [ 8113] 1293 8113 168212 64036 1 0 0 apache2
Feb 8 01:18:33 server01 kernel: [ 514.893786] [ 8116] 1293 8116 170258 65870 6 0 0 apache2
Feb 8 01:18:33 server01 kernel: [ 514.893976] [ 8119] 1293 8119 170258 65870 7 0 0 apache2
Feb 8 01:18:33 server01 kernel: [ 514.894166] [ 8123] 1293 8123 170158 65824 6 0 0 apache2
Feb 8 01:18:33 server01 kernel: [ 514.894356] [ 8124] 1293 8124 170258 65870 5 0 0 apache2
Feb 8 01:18:33 server01 kernel: [ 514.894547] [ 8125] 1293 8125 170158 65824 1 0 0 apache2
Feb 8 01:18:33 server01 kernel: [ 514.894749] [ 8149] 1293 8149 163989 59647 7 0 0 apache2
Feb 8 01:18:33 server01 kernel: [ 514.894944] Memory cgroup out of memory: Kill process 8113 (apache2) score 1000 or sacrifice child

This is all _before_ you started collecting stacks and it also says that
8102 is gone.

This all suggests that a) the stack unwinder which displays /proc//stack
is somehow confused and doesn't show the correct stack for this process
and b) the two processes cannot terminate due to some issue related to
ptrace (stracing) the dying process.
The above oom list doesn't include any processes which have already
released their memory, which would explain why you can still see it as a
member of the group (when looking into the cgroup/tasks file).
My guess would be that there is a bug in ptrace which doesn't free a
reference to the task so it cannot go away although it has dropped all
its resources already.

> Why are all PIDs waiting on 'mem_cgroup_handle_oom' and there is no
> OOM message in the log?

I am not sure what you mean here but there are
$ grep "Memory cgroup out of memory:" kern2.collected.log | wc -l
16

OOM killer events during the time you were gathering memcg-bug-4 data.

> Data in memcg-bug-4.tar.gz are only for 2
> minutes but i let it run for about 15-20 minutes, no single process
> killed by OOM.

I can see
$ grep "Memory cgroup out of memory:" kern2.after.log | wc -l
57

killed after 02:38:47 when you stopped gathering data for memcg-bug-4

> I'm 100% sure that OOM was not killing them (maybe it was trying to
> but it didn't happen).

OK, let's do a little exercise. The processes eligible for OOM are
listed before any task is killed. So if we collect both the pid lists
and the "Kill process" messages per pid, then no entry in the pid list
should be present after the specific pid is killed.

$ mkdir out
$ for i in `grep "Memory cgroup out of memory: Kill process" kern2.log | sed 's@.*Kill process \([0-9]*\) .*@\1@'`
do
	grep -e "Memory cgroup out of memory: Kill process $i" \
		-e "\[ *\<$i\]" kern2.log > out/$i
done
$ for i in out/*
do
	tail -n1 $i | grep "Memory cgroup out of memory:" >/dev/null|| echo "$i has already killed tasks"
done
out/6698 has already killed tasks
out/6703 has already killed tasks

OK, so there are two pids which were listed after they have been
killed. Let's have a look at them.
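
The exercise above can be restated as a single pass over the log. A rough sketch in Python, assuming the log format shown in the excerpts in this thread (task dump rows of the form "[ pid] uid tgid ..." and "Kill process <pid>" messages); listed_after_kill and both regular expressions are illustrative assumptions, not anything shipped with the kernel:

```python
import re

# "Memory cgroup out of memory: Kill process 8102 (apache2) score ..."
KILL_RE = re.compile(r"Memory cgroup out of memory: Kill process (\d+) ")
# A task dump row: timestamp bracket, then "[ 8102] 1293 8102 ..." (pid, uid, tgid).
DUMP_RE = re.compile(r"\]\s+\[\s*(\d+)\]\s+\d+\s+\d+\s")


def listed_after_kill(lines):
    """Return pids that appear in a task dump after their own kill message."""
    killed = set()
    suspects = set()
    for line in lines:
        match = KILL_RE.search(line)
        if match:
            killed.add(match.group(1))
            continue
        match = DUMP_RE.search(line)
        if match and match.group(1) in killed:
            suspects.add(match.group(1))
    return sorted(suspects, key=int)


if __name__ == "__main__":
    import sys

    # Hypothetical invocation: script.py kern2.log
    with open(sys.argv[1], errors="replace") as log:
        for pid in listed_after_kill(log):
            print(f"{pid} was listed after it had been killed")
```

Unlike the tail -n1 test, this flags a pid as soon as any dump row follows its kill message, even if the same pid number is killed again later. A hit only means the pid number showed up again, not necessarily that it is the same task.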
$ cat out/6698 Feb 8 01:17:04 server01 kernel: [ 425.497924] [ 6698] 1293 6698 170258 65846 1 0 0 apache2 Feb 8 01:17:05 server01 kernel: [ 426.079010] [ 6698] 1293 6698 170258 65846 1 0 0 apache2 Feb 8 01:17:10 server01 kernel: [ 431.144460] [ 6698] 1293 6698 169358 65220 1 0 0 apache2 Feb 8 01:17:10 server01 kernel: [ 431.146058] Memory cgroup out of memory: Kill process 6698 (apache2) score 1000 or sacrifice child Feb 8 03:27:57 server01 kernel: [ 8278.439896] [ 6698] 1020 6698 168518 64219 0 0 0 apache2 Feb 8 03:27:57 server01 kernel: [ 8278.879439] [ 6698] 1020 6698 168518 64218 6 0 0 apache2 Feb 8 03:27:59 server01 kernel: [ 8280.023944] [ 6698] 1020 6698 168816 64540 7 0 0 apache2 Feb 8 03:28:02 server01 kernel: [ 8283.242282] [ 6698] 1020 6698 171953 67751 6 0 0 apache2 $ cat out/6703 Feb 8 01:17:04 server01 kernel: [ 425.498118] [ 6703] 1293 6703 170258 65844 6 0 0 apache2 Feb 8 01:17:05 server01 kernel: [ 426.079206] [ 6703] 1293 6703 170258 65844 6 0 0 apache2 Feb 8 01:17:10 server01 kernel: [ 431.144653] [ 6703] 1293 6703 169358 65219 2 0 0 apache2 Feb 8 01:17:10 server01 kernel: [ 431.258924] [ 6703] 1293 6703 169358 65219 5 0 0 apache2 Feb 8 01:17:10 server01 kernel: [ 431.260282] Memory cgroup out of memory: Kill process 6703 (apache2) score 1000 or sacrifice child Feb 8 03:27:57 server01 kernel: [ 8278.440043] [ 6703] 1020 6703 166286 61978 7 0 0 apache2 Feb 8 03:27:57 server01 kernel: [ 8278.879587] [ 6703] 1020 6703 166286 61977 7 0 0 apache2 Feb 8 03:27:59 server01 kernel: [ 8280.024091] [ 6703] 1020 6703 166484 62233 7 0 0 apache2 Feb 8 03:28:02 server01 kernel: [ 8283.242429] [ 6703] 1020 6703 167402 63118 0 0 0 apache2 Lists have the following columns: [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name As we can see the uid changed for both pids after it has been killed (from 1293 to 1020) which suggests that the pid has been reused later for a different user (which is a clear sign that those pids died) - thus different group in your 
setup. So those two died as well, apparently. > >Nothing shows it would be a deadlock so far. It is well possible that > >the userspace went mad when seeing a lot of processes dying because it > >doesn't expect it. > > Lots of processes are dying also now, without your latest patch, and > no such things are happening. I'm sure there is something more it > this, maybe it revealed another bug? So far nothing shows that there would be anything broken wrt. memcg OOM killer. The ptrace issue sounds strange, all right, but that is another story and worth a separate investigation. I would be interested whether you still see anything wrong going on without that in game. You can get pretty nice overview of what is going on wrt. OOM from the log. -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: "azurIt" Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_if_PF=5FNO=5FMEMCG=5FOOM_is_set?= Date: Sun, 10 Feb 2013 17:46:19 +0100 Message-ID: <20130210174619.24F20488@pobox.sk> References: <20130206160051.GG10254@dhcp22.suse.cz>, <20130208060304.799F362F@pobox.sk>, <20130208094420.GA7557@dhcp22.suse.cz>, <20130208120249.FD733220@pobox.sk>, <20130208123854.GB7557@dhcp22.suse.cz>, <20130208145616.FB78CE24@pobox.sk>, <20130208152402.GD7557@dhcp22.suse.cz>, <20130208165805.8908B143@pobox.sk>, <20130208171012.GH7557@dhcp22.suse.cz>, <20130208220243.EDEE0825@pobox.sk> <20130210150310.GA9504@dhcp22.suse.cz> Mime-Version: 1.0 Content-Transfer-Encoding: 8bit Return-path: In-Reply-To: <20130210150310.GA9504-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" To: =?utf-8?q?Michal_Hocko?= Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= >stuck in the ptrace code. 
But this happened _after_ the cgroup was frozen and i tried to strace one of its processes (to see what's happening):

Feb 8 01:29:46 server01 kernel: [ 1187.540672] grsec: From 178.40.250.111: process /usr/lib/apache2/mpm-itk/apache2(apache2:18211) attached to via ptrace by /usr/bin/strace[strace:18258] uid/euid:0/0 gid/egid:0/0, parent /usr/bin/htop[htop:2901] uid/euid:0/0 gid/egid:0/0

>> Why are all PIDs waiting on 'mem_cgroup_handle_oom' and there is no
>> OOM message in the log?
>
>I am not sure what you mean here but there are
>$ grep "Memory cgroup out of memory:" kern2.collected.log | wc -l
>16
>
>OOM killer events during the time you were gathering memcg-bug-4 data.
>
>> Data in memcg-bug-4.tar.gz are only for 2
>> minutes but i let it run for about 15-20 minutes, no single process
>> killed by OOM.
>
>I can see
>$ grep "Memory cgroup out of memory:" kern2.after.log | wc -l
>57
>
>killed after 02:38:47 when you stopped gathering data for memcg-bug-4

I meant that not a single process was killed inside cgroup 1258 (data from this cgroup are in memcg-bug-4.tar.gz).

Just take the data from memcg-bug-4.tar.gz, which were taken from cgroup 1258. Almost all processes are in 'mem_cgroup_handle_oom' so the cgroup is under OOM. I assume that this is supposed to take only a few seconds while the kernel finds a process and kills it (and maybe does it again until enough memory is freed). I was gathering the data for about 2 and a half minutes and NOT A SINGLE process was killed (just compare the list of PIDs from the first and the last directory inside memcg-bug-4.tar.gz). Even more, no process was killed in cgroup 1258 after i stopped gathering the data either. You can also take the list of PIDs from memcg-bug-4.tar.gz and you will find only 18211 and 8102 (which are the two stuck processes).

So my question is: Why was no process killed inside cgroup 1258 while it was under OOM?
It was under OOM for at least 2 and a half of minutes while i was gathering the data (then i let it run for additional, cca, 10 minutes and then killed processes by hand but i cannot proof this). Why kernel didn't kill any process for so long and ends the OOM? Btw, processes in cgroup 1258 (memcg-bug-4.tar.gz) are looping in this two tasks (i pasted only first line of stack): mem_cgroup_handle_oom+0x241/0x3b0 0xffffffffffffffff Some of them are in 'poll_schedule_timeout' and then they start to loop as above. Is this correct behavior? For example, do (first line of stack from process 7710 from all timestamps): for i in */7710/stack; do head -n1 $i; done From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set Date: Mon, 11 Feb 2013 12:22:40 +0100 Message-ID: <20130211112240.GC19922@dhcp22.suse.cz> References: <20130208094420.GA7557@dhcp22.suse.cz> <20130208120249.FD733220@pobox.sk> <20130208123854.GB7557@dhcp22.suse.cz> <20130208145616.FB78CE24@pobox.sk> <20130208152402.GD7557@dhcp22.suse.cz> <20130208165805.8908B143@pobox.sk> <20130208171012.GH7557@dhcp22.suse.cz> <20130208220243.EDEE0825@pobox.sk> <20130210150310.GA9504@dhcp22.suse.cz> <20130210174619.24F20488@pobox.sk> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20130210174619.24F20488@pobox.sk> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Sun 10-02-13 17:46:19, azurIt wrote: > >stuck in the ptrace code. 
> > > But this happens _after_ the cgroup was freezed and i tried to strace > one of it's processes (to see what's happening): > > Feb 8 01:29:46 server01 kernel: [ 1187.540672] grsec: From 178.40.250.111: process /usr/lib/apache2/mpm-itk/apache2(apache2:18211) attached to via ptrace by /usr/bin/strace[strace:18258] uid/euid:0/0 gid/egid:0/0, parent /usr/bin/htop[htop:2901] uid/euid:0/0 gid/egid:0/0 Hmmm, Feb 8 01:39:16 server01 kernel: [ 1757.266678] Memory cgroup out of memory: Kill process 18211 (apache2) score 725 or sacrifice child) So the process has been killed 10 minutes ago and this was really the last OOM event for group /1258: $ grep "Task in /1258/uid killed" kern2.log | tail -n2 Feb 8 01:39:16 server01 kernel: [ 1757.045021] Task in /1258/uid killed as a result of limit of /1258 Feb 8 01:39:16 server01 kernel: [ 1757.167984] Task in /1258/uid killed as a result of limit of /1258 But this was still before you started collecting data for memcg-bug-4 (2:34) so we do not know what was the previous stack unfortunatelly. > >> Why are all PIDs waiting on 'mem_cgroup_handle_oom' and there is no > >> OOM message in the log? > > > >I am not sure what you mean here but there are > >$ grep "Memory cgroup out of memory:" kern2.collected.log | wc -l > >16 > > > >OOM killer events during the time you were gathering memcg-bug-4 data. > > > >> Data in memcg-bug-4.tar.gz are only for 2 > >> minutes but i let it run for about 15-20 minutes, no single process > >> killed by OOM. > > > >I can see > >$ grep "Memory cgroup out of memory:" kern2.after.log | wc -l > >57 > > > >killed after 02:38:47 when you stopped gathering data for memcg-bug-4 > > > I meant no single process was killed inside cgroup 1258 (data from > this cgroup are in memcg-bug-4.tar.gz). > > Just get data from memcg-bug-4.tar.gz which were taken from cgroup > 1258. Are you sure about that? 
When I extracted all pids from timestamp directories and greped them in the log I got this: for i in `cat bug/pids` ; do grep "\[ *\<$i\>\]" kern2.log ; done Feb 8 01:31:02 server01 kernel: [ 1263.429212] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:31:15 server01 kernel: [ 1276.655241] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:32:29 server01 kernel: [ 1350.797835] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:32:42 server01 kernel: [ 1363.662242] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:32:46 server01 kernel: [ 1367.181798] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:32:46 server01 kernel: [ 1367.381627] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:32:46 server01 kernel: [ 1367.490896] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:33:02 server01 kernel: [ 1383.709652] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:36:26 server01 kernel: [ 1587.458967] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:36:26 server01 kernel: [ 1587.558419] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:36:26 server01 kernel: [ 1587.652474] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:39:02 server01 kernel: [ 1743.107086] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:39:16 server01 kernel: [ 1757.015359] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:39:16 server01 kernel: [ 1757.133998] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:39:16 server01 kernel: [ 1757.262992] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:18:12 server01 kernel: [ 493.156641] [ 7888] 1293 7888 169326 64876 3 0 0 apache2 Feb 8 01:18:12 server01 kernel: [ 493.269129] [ 7888] 1293 7888 169390 64876 4 0 0 apache2 Feb 8 01:18:21 server01 kernel: [ 502.384221] [ 8011] 1293 8011 170094 65675 5 0 0 apache2 Feb 8 01:18:24 server01 kernel: [ 505.052600] [ 8011] 1293 8011 170260 65854 2 0 0 apache2 Feb 8 01:18:24 server01 kernel: [ 505.200454] [ 8011] 1293 8011 170260 65854 2 0 0 
apache2 Feb 8 01:18:33 server01 kernel: [ 514.538637] [ 8054] 1258 8054 164404 60618 1 0 0 apache2 Feb 8 01:18:30 server01 kernel: [ 511.230146] [ 8102] 1293 8102 170258 65869 7 0 0 apache2 So at least 7888, 8011 and 8102 were from a different group (1293). Others were never listed in the eligible processes list which is a bit unexpected. It is also unfortunate because I cannot match them to their groups from the log. $ for i in `cat bug/pids` ; do grep "\[ *\<$i\>\]" kern2.log >/dev/null || echo "$i not listed" ; done 7265 not listed 7474 not listed 7710 not listed 7969 not listed 7988 not listed 7997 not listed 8000 not listed 8014 not listed 8016 not listed 8019 not listed 8057 not listed 8058 not listed 8059 not listed 8063 not listed 8064 not listed 8066 not listed 8067 not listed 8069 not listed 8070 not listed 8071 not listed 8072 not listed 8075 not listed 8091 not listed 8092 not listed 8094 not listed 8098 not listed 8099 not listed 8100 not listed Are you sure all of them belong to 1258 group? > Almost all processes are in 'mem_cgroup_handle_oom' so cgroup > is under OOM. You are right, almost all of them are waiting in mem_cgroup_handle_oom which suggest that they should be listed in a per group eligible tasks list. One way how this might happen is when a process which manages to get oom_lock has a fatal signal pending. Then we wouldn't get to oom_kill_process and no OOM messages would get printed. This is correct because such a task would terminate soon anyway and all the waiters would wake up eventually. If not enough memory would be freed another task would get the oom_lock and this one would trigger OOM (unless it has fatal signal pending as well). Another option would be that no task could be selected - e.g. because select_bad_process sees TIF_MEMDIE marked task - the one already killed by OOM killer but that wasn't able to terminate for some reason. 18211 could be such a task. 
But we do not know what was going on with it before strace attached to it. Finally it is possible that the OOM header (everything up to Kill process) was suppressed because of rate limiting. But $ grep -B1 "Kill process" kern2.log Feb 8 01:15:02 server01 kernel: [ 304.000402] [ 4969] 1258 4969 163761 59554 6 0 0 apache2 Feb 8 01:15:02 server01 kernel: [ 304.000649] Memory cgroup out of memory: Kill process 4816 (apache2) score 1000 or sacrifice child -- Feb 8 01:15:51 server01 kernel: [ 352.924573] [ 5847] 1709 5847 163433 58952 6 0 0 apache2 Feb 8 01:15:51 server01 kernel: [ 352.924761] Memory cgroup out of memory: Kill process 5212 (apache2) score 1000 or sacrifice child [...] says that the message was preceded by a process list so we can exclude rate limiting. > I assume that this is suppose to take only few seconds > while kernel finds any process and kill it (and maybe do it again > until enough of memory is freed). I was gathering the data for > about 2 and a half minutes and NO SINGLE process was killed (just > compate list of PIDs from the first and the last directory inside > memcg-bug-4.tar.gz). Even more, no single process was killed in cgroup > 1258 also after i stopped gathering the data. You can also take the > list od PID from memcg-bug-4.tar.gz and you will find only 18211 and > 8102 (which are the two stucked processes). > > So my question is: Why no process was killed inside cgroup 1258 > while it was under OOM? I would bet that there is something weird going on with pid:18211. But I do not have enough information to find out what and why. > It was under OOM for at least 2 and a half of minutes while i was > gathering the data (then i let it run for additional, cca, 10 minutes > and then killed processes by hand but i cannot proof this). Why kernel > didn't kill any process for so long and ends the OOM? As already mentioned above, select_bad_process doesn't select any task if there is one which is on the way out. 
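As a rough illustration of that rule, here is a toy Python model of victim selection. It is a sketch, not the kernel code: it only mirrors the two early exits discussed in this thread (a task already marked TIF_MEMDIE, and a PF_EXITING task that is not being ptraced on exit) and ignores every other check the real select_bad_process performs. The Task class, the WAIT sentinel and select_victim are names invented for this example.

```python
from dataclasses import dataclass

# Stand-in for the kernel's ERR_PTR(-1UL) result: "do not kill anything now".
WAIT = object()


@dataclass
class Task:
    pid: int
    points: int                # badness score; higher means more likely victim
    memdie: bool = False       # models TIF_MEMDIE (already picked by the OOM killer)
    exiting: bool = False      # models PF_EXITING (on its way out)
    traced_exit: bool = False  # models PT_TRACE_EXIT on the group leader


def select_victim(tasks):
    """Return the highest-scoring task, WAIT, or None for an empty list."""
    chosen = None
    for task in tasks:
        if task.memdie:
            # Someone is already dying on behalf of this OOM; wait for it.
            return WAIT
        if task.exiting and not task.traced_exit:
            # An exiting task should free memory soon; do not kill another one.
            return WAIT
        if chosen is None or task.points > chosen.points:
            chosen = task
    return chosen
```

In this model a single task that stays marked as dying but never actually exits makes every call return WAIT, so no further victim is chosen and no new "Kill process" line would be logged.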
Maybe this is what is going on here. > Btw, processes in cgroup 1258 (memcg-bug-4.tar.gz) are looping in this > two tasks (i pasted only first line of stack): > mem_cgroup_handle_oom+0x241/0x3b0 > 0xffffffffffffffff 0xffffffffffffffff is just a bogus entry. No idea why this happens. > Some of them are in 'poll_schedule_timeout' and then they start to > loop as above. Is this correct behavior? > For example, do (first line of stack from process 7710 from all > timestamps): for i in */7710/stack; do head -n1 $i; done Yes, this is perfectly ok, because that task starts with: $ cat bug/1360287245/7710/stack [] poll_schedule_timeout+0x49/0x70 [] do_sys_poll+0x54b/0x680 [] sys_poll+0x7c/0xf0 [] system_call_fastpath+0x18/0x1d [] 0xffffffffffffffff and then later on it gets into OOM because of a page fault: $ cat bug/1360287250/7710/stack [] mem_cgroup_handle_oom+0x241/0x3b0 [] T.1149+0x5f3/0x600 [] mem_cgroup_charge_common+0x6c/0xb0 [] mem_cgroup_newpage_charge+0x45/0x50 [] do_wp_page+0x14e/0x800 [] handle_pte_fault+0x264/0x940 [] handle_mm_fault+0x138/0x260 [] do_page_fault+0x13d/0x460 [] page_fault+0x1f/0x30 [] 0xffffffffffffffff And it loops in it until the end which is possible as well if the group is under permanent OOM condition and the task is not selected to be killed. Unfortunately I am not able to reproduce this behavior even if I try to hammer OOM like mad so I am afraid I cannot help you much without further debugging patches. I do realize that experimenting in your environment is a problem but I do not many options left. Please do not use strace and rather collect /proc/pid/stack instead. 
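A minimal sketch of such a collector in Python, assuming the pids are read from a file with one pid per line (such as a cgroup's tasks file) and that the output should follow the timestamp/pid/stack layout seen in the memcg-bug archives. The function name snapshot_stacks and its parameters are hypothetical, and reading the stack files typically needs sufficient privileges.

```python
import os
import time


def snapshot_stacks(tasks_file, out_dir, proc_root="/proc"):
    """Copy <proc_root>/<pid>/stack for every pid in tasks_file.

    Files are written to <out_dir>/<unix timestamp>/<pid>/stack.
    Returns the timestamp string and the list of pids that were saved.
    """
    stamp = str(int(time.time()))
    with open(tasks_file) as f:
        pids = [line.strip() for line in f if line.strip()]

    saved = []
    for pid in pids:
        try:
            with open(os.path.join(proc_root, pid, "stack")) as f:
                data = f.read()
        except OSError:
            # The task exited in the meantime or its stack is not readable.
            continue
        dst_dir = os.path.join(out_dir, stamp, pid)
        os.makedirs(dst_dir, exist_ok=True)
        with open(os.path.join(dst_dir, "stack"), "w") as f:
            f.write(data)
        saved.append(pid)
    return stamp, saved
```

Calling it in a loop with a short sleep would produce a series of timestamped snapshots that can be compared across time without attaching to any process.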
It would be also helpful to get group/tasks file to have a full list of tasks in the group
---
>From 1139745d43cc8c56bc79c219291d1e5281799dd4 Mon Sep 17 00:00:00 2001
From: Michal Hocko
Date: Mon, 11 Feb 2013 12:18:36 +0100
Subject: [PATCH] oom: debug skipping killing

---
 mm/oom_kill.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 069b64e..3d759f0 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -329,6 +329,8 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
 		if (test_tsk_thread_flag(p, TIF_MEMDIE)) {
 			if (unlikely(frozen(p)))
 				thaw_process(p);
+			printk(KERN_WARNING"XXX: pid:%d (flags:%u) is TIF_MEMDIE. Waiting for it\n",
+					p->pid, p->flags);
 			return ERR_PTR(-1UL);
 		}
 		if (!p->mm)
@@ -353,8 +355,11 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
 				 * then wait for it to finish before killing
 				 * some other task unnecessarily.
 				 */
-				if (!(p->group_leader->ptrace & PT_TRACE_EXIT))
+				if (!(p->group_leader->ptrace & PT_TRACE_EXIT)) {
+					printk(KERN_WARNING"XXX: pid:%d (flags:%u) is PF_EXITING. Waiting for it\n",
+							p->pid, p->flags);
 					return ERR_PTR(-1UL);
+				}
 			}
 		}
 
@@ -494,6 +499,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	 * its children or threads, just set TIF_MEMDIE so it can die quickly
 	 */
 	if (p->flags & PF_EXITING) {
+		printk(KERN_WARNING"XXX: pid:%d (flags:%u). Not killing PF_EXITING\n", p->pid, p->flags);
 		set_tsk_thread_flag(p, TIF_MEMDIE);
 		return 0;
 	}
@@ -567,6 +573,8 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
 	 * its memory.
 	 */
 	if (fatal_signal_pending(current)) {
+		printk(KERN_WARNING"XXX: pid:%d (flags:%u) has fatal_signal_pending. Waiting for it\n",
+				p->pid, p->flags);
 		set_thread_flag(TIF_MEMDIE);
 		return;
 	}
--
1.7.10.4

--
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: "azurIt" Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_if_PF=5FNO=5FMEMCG=5FOOM_is_set?= Date: Fri, 22 Feb 2013 09:23:32 +0100 Message-ID: <20130222092332.4001E4B6@pobox.sk> References: <20130208094420.GA7557@dhcp22.suse.cz>, <20130208120249.FD733220@pobox.sk>, <20130208123854.GB7557@dhcp22.suse.cz>, <20130208145616.FB78CE24@pobox.sk>, <20130208152402.GD7557@dhcp22.suse.cz>, <20130208165805.8908B143@pobox.sk>, <20130208171012.GH7557@dhcp22.suse.cz>, <20130208220243.EDEE0825@pobox.sk>, <20130210150310.GA9504@dhcp22.suse.cz>, <20130210174619.24F20488@pobox.sk> <20130211112240.GC19922@dhcp22.suse.cz> Mime-Version: 1.0 Content-Transfer-Encoding: 8bit Return-path: In-Reply-To: <20130211112240.GC19922-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" To: =?utf-8?q?Michal_Hocko?= Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= >Unfortunately I am not able to reproduce this behavior even if I try >to hammer OOM like mad so I am afraid I cannot help you much without >further debugging patches. >I do realize that experimenting in your environment is a problem but I >do not many options left. Please do not use strace and rather collect >/proc/pid/stack instead. It would be also helpful to get group/tasks >file to have a full list of tasks in the group Hi Michal, sorry that i didn't response for a while. Today i installed kernel with your two patches and i'm running it now. I'm still having problems with OOM which is not able to handle low memory and is not killing processes. 
Here is some info:

- data from cgroup 1258 while it was under OOM and no processes were killed (so OOM doesn't stop and the cgroup was frozen)
http://watchdog.sk/lkml/memcg-bug-6.tar.gz

I noticed the problem at about 8:39 and waited until 8:57 (nothing happened). Then i killed process 19864, which seemed to help: other processes probably ended and the cgroup started to work. But the problem occurred again about 20 seconds later, so i killed all processes at 8:58. The problem has been occurring all the time since then. All processes (in that cgroup) are always in state 'D' when it occurs.

- kernel log from boot until now
http://watchdog.sk/lkml/kern3.gz

Btw, something probably also happened at about 3:09 but i wasn't able to gather any data because my 'load check script' killed all apache processes (load was more than 100).

azur

From mboxrd@z Thu Jan 1 00:00:00 1970
From: "azurIt"
Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_if_PF=5FNO=5FMEMCG=5FOOM_is_set?=
Date: Fri, 22 Feb 2013 13:00:55 +0100
Message-ID: <20130222130055.29151595@pobox.sk>
References: <20130208094420.GA7557@dhcp22.suse.cz>, <20130208120249.FD733220@pobox.sk>, <20130208123854.GB7557@dhcp22.suse.cz>, <20130208145616.FB78CE24@pobox.sk>, <20130208152402.GD7557@dhcp22.suse.cz>, <20130208165805.8908B143@pobox.sk>, <20130208171012.GH7557@dhcp22.suse.cz>, <20130208220243.EDEE0825@pobox.sk>, <20130210150310.GA9504@dhcp22.suse.cz>, <20130210174619.24F20488@pobox.sk> <20130211112240.GC19922@dhcp22.suse.cz>
Mime-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Return-path:
In-Reply-To: <20130211112240.GC19922@dhcp22.suse.cz>
Sender: owner-linux-mm@kvack.org
List-ID:
Content-Type: text/plain; charset="us-ascii"
To: =?utf-8?q?Michal_Hocko?=
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?=

>Unfortunately I am not able to reproduce this behavior even if I try
>to hammer OOM like mad so I am afraid
I cannot help you much without
>further debugging patches.
>I do realize that experimenting in your environment is a problem but I
>do not many options left. Please do not use strace and rather collect
>/proc/pid/stack instead. It would be also helpful to get group/tasks
>file to have a full list of tasks in the group

Sending new info!

I found out one interesting thing. When the problem occurs (it probably happens when OOM is started in the target cgroup but i'm not sure), the target cgroup somehow becomes broken. In other words, after the problem occurs once in the target cgroup, it keeps happening in this cgroup. I made this test:

1.) I created cgroup A with limits (also with a memory limit).
2.) Waited until OOM started (can take hours). Processes in the target cgroup became frozen so they had to be killed.
3.) After this, processes are always freezing in cgroup A; it usually takes 20-30 seconds after killing the previously frozen processes.
4.) I created cgroup B with the *same* limits as cgroup A and moved the user from A to B. Problem disappears.
5.) Go to (2)

And second thing, i got a kernel oops, look at the end of:
http://watchdog.sk/lkml/oops

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set Date: Fri, 22 Feb 2013 13:52:17 +0100 Message-ID: <20130222125217.GA32285@dhcp22.suse.cz> References: <20130208123854.GB7557@dhcp22.suse.cz> <20130208145616.FB78CE24@pobox.sk> <20130208152402.GD7557@dhcp22.suse.cz> <20130208165805.8908B143@pobox.sk> <20130208171012.GH7557@dhcp22.suse.cz> <20130208220243.EDEE0825@pobox.sk> <20130210150310.GA9504@dhcp22.suse.cz> <20130210174619.24F20488@pobox.sk> <20130211112240.GC19922@dhcp22.suse.cz> <20130222092332.4001E4B6@pobox.sk> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20130222092332.4001E4B6-Rm0zKEqwvD4@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: azurIt Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner Hi, On Fri 22-02-13 09:23:32, azurIt wrote: [...] > sorry that i didn't response for a while. Today i installed kernel > with your two patches and i'm running it now. I am not sure how much time I'll have for this today but just to make sure we are on the same page, could you point me to the two patches you have applied in the mean time? 
-- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: "azurIt" Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_if_PF=5FNO=5FMEMCG=5FOOM_is_set?= Date: Fri, 22 Feb 2013 13:54:42 +0100 Message-ID: <20130222135442.ADFFF498@pobox.sk> References: <20130208123854.GB7557@dhcp22.suse.cz>, <20130208145616.FB78CE24@pobox.sk>, <20130208152402.GD7557@dhcp22.suse.cz>, <20130208165805.8908B143@pobox.sk>, <20130208171012.GH7557@dhcp22.suse.cz>, <20130208220243.EDEE0825@pobox.sk>, <20130210150310.GA9504@dhcp22.suse.cz>, <20130210174619.24F20488@pobox.sk>, <20130211112240.GC19922@dhcp22.suse.cz>, <20130222092332.4001E4B6@pobox.sk> <20130222125217.GA32285@dhcp22.suse.cz> Mime-Version: 1.0 Content-Transfer-Encoding: quoted-printable Return-path: In-Reply-To: <20130222125217.GA32285@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" To: =?utf-8?q?Michal_Hocko?= Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= >I am not sure how much time I'll have for this today but just to make >sure we are on the same page, could you point me to the two patches you >have applied in the mean time? Here: http://watchdog.sk/lkml/patches2 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set Date: Fri, 22 Feb 2013 14:00:17 +0100 Message-ID: <20130222130017.GB32285@dhcp22.suse.cz> References: <20130222125217.GA32285@dhcp22.suse.cz> <20130222135442.ADFFF498@pobox.sk> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20130222135442.ADFFF498-Rm0zKEqwvD4@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: azurIt Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Fri 22-02-13 13:54:42, azurIt wrote: > >I am not sure how much time I'll have for this today but just to make > >sure we are on the same page, could you point me to the two patches you > >have applied in the mean time? > > > Here: > http://watchdog.sk/lkml/patches2 OK, looks correct. 
-- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set Date: Thu, 6 Jun 2013 18:04:46 +0200 Message-ID: <20130606160446.GE24115@dhcp22.suse.cz> References: <20130208123854.GB7557@dhcp22.suse.cz> <20130208145616.FB78CE24@pobox.sk> <20130208152402.GD7557@dhcp22.suse.cz> <20130208165805.8908B143@pobox.sk> <20130208171012.GH7557@dhcp22.suse.cz> <20130208220243.EDEE0825@pobox.sk> <20130210150310.GA9504@dhcp22.suse.cz> <20130210174619.24F20488@pobox.sk> <20130211112240.GC19922@dhcp22.suse.cz> <20130222092332.4001E4B6@pobox.sk> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20130222092332.4001E4B6-Rm0zKEqwvD4@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: azurIt Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner Hi, I am really sorry it took so long but I was constantly preempted by other stuff. I hope I have a good news for you, though. Johannes has found a nice way how to overcome deadlock issues from memcg OOM which might help you. Would you be willing to test with his patch (http://permalink.gmane.org/gmane.linux.kernel.mm/101437). Unlike my patch which handles just the i_mutex case his patch solved all possible locks. I can backport the patch for your kernel (are you still using 3.2 kernel or you have moved to a newer one?). On Fri 22-02-13 09:23:32, azurIt wrote: > >Unfortunately I am not able to reproduce this behavior even if I try > >to hammer OOM like mad so I am afraid I cannot help you much without > >further debugging patches. > >I do realize that experimenting in your environment is a problem but I > >do not many options left. 
Please do not use strace and rather collect > >/proc/pid/stack instead. It would be also helpful to get group/tasks > >file to have a full list of tasks in the group > > > > Hi Michal, > > > sorry that i didn't response for a while. Today i installed kernel with your two patches and i'm running it now. I'm still having problems with OOM which is not able to handle low memory and is not killing processes. Here is some info: > > - data from cgroup 1258 while it was under OOM and no processes were killed (so OOM don't stop and cgroup was freezed) > http://watchdog.sk/lkml/memcg-bug-6.tar.gz > > I noticed problem about on 8:39 and waited until 8:57 (nothing happend). Then i killed process 19864 which seems to help and other processes probably ends and cgroup started to work. But problem accoured again about 20 seconds later, so i killed all processes at 8:58. The problem is occuring all the time since then. All processes (in that cgroup) are always in state 'D' when it occurs. > > > - kernel log from boot until now > http://watchdog.sk/lkml/kern3.gz > > > Btw, something probably happened also at about 3:09 but i wasn't able to gather any data because my 'load check script' killed all apache processes (load was more than 100). 
>
>
>
> azur
> --
> To unsubscribe from this list: send the line "unsubscribe cgroups" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
Michal Hocko
SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970
From: "azurIt"
Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_if_PF=5FNO=5FMEMCG=5FOOM_is_set?=
Date: Thu, 06 Jun 2013 18:16:33 +0200
Message-ID: <20130606181633.BCC3E02E@pobox.sk>
References: <20130208123854.GB7557@dhcp22.suse.cz>, <20130208145616.FB78CE24@pobox.sk>, <20130208152402.GD7557@dhcp22.suse.cz>, <20130208165805.8908B143@pobox.sk>, <20130208171012.GH7557@dhcp22.suse.cz>, <20130208220243.EDEE0825@pobox.sk>, <20130210150310.GA9504@dhcp22.suse.cz>, <20130210174619.24F20488@pobox.sk>, <20130211112240.GC19922@dhcp22.suse.cz>, <20130222092332.4001E4B6@pobox.sk> <20130606160446.GE24115@dhcp22.suse.cz>
Mime-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Return-path:
In-Reply-To: <20130606160446.GE24115@dhcp22.suse.cz>
Sender: owner-linux-mm@kvack.org
List-ID:
Content-Type: text/plain; charset="iso-8859-1"
To: =?utf-8?q?Michal_Hocko?=
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?=

Hello Michal,

nice to read you! :) Yes, i'm still on 3.2. Could you be so kind and try to backport it? Thank you very much!

azur

______________________________________________________________
> From: "Michal Hocko"
> To: azurIt
> Date: 06.06.2013 18:04
> Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set
>
> CC: linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , "Johannes Weiner"
>Hi,
>
>I am really sorry it took so long but I was constantly preempted by
>other stuff. I hope I have a good news for you, though.
Johannes has >found a nice way how to overcome deadlock issues from memcg OOM which >might help you. Would you be willing to test with his patch >(http://permalink.gmane.org/gmane.linux.kernel.mm/101437). Unlike my >patch which handles just the i_mutex case his patch solved all possible >locks. > >I can backport the patch for your kernel (are you still using 3.2 kernel >or you have moved to a newer one?). > >On Fri 22-02-13 09:23:32, azurIt wrote: >> >Unfortunately I am not able to reproduce this behavior even if I try >> >to hammer OOM like mad so I am afraid I cannot help you much without >> >further debugging patches. >> >I do realize that experimenting in your environment is a problem but I >> >do not many options left. Please do not use strace and rather collect >> >/proc/pid/stack instead. It would be also helpful to get group/tasks >> >file to have a full list of tasks in the group >> >> >> >> Hi Michal, >> >> >> sorry that i didn't response for a while. Today i installed kernel with your two patches and i'm running it now. I'm still having problems with OOM which is not able to handle low memory and is not killing processes. Here is some info: >> >> - data from cgroup 1258 while it was under OOM and no processes were killed (so OOM don't stop and cgroup was freezed) >> http://watchdog.sk/lkml/memcg-bug-6.tar.gz >> >> I noticed problem about on 8:39 and waited until 8:57 (nothing happend). Then i killed process 19864 which seems to help and other processes probably ends and cgroup started to work. But problem accoured again about 20 seconds later, so i killed all processes at 8:58. The problem is occuring all the time since then. All processes (in that cgroup) are always in state 'D' when it occurs.
>> >> >> - kernel log from boot until now >> http://watchdog.sk/lkml/kern3.gz >> >> >> Btw, something probably happened also at about 3:09 but i wasn't able to gather any data because my 'load check script' killed all apache processes (load was more than 100). >> >> >> >> azur >> -- >> To unsubscribe from this list: send the line "unsubscribe cgroups" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html > >-- >Michal Hocko >SUSE Labs > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM Date: Fri, 7 Jun 2013 15:11:57 +0200 Message-ID: <20130607131157.GF8117@dhcp22.suse.cz> References: <20130208152402.GD7557@dhcp22.suse.cz> <20130208165805.8908B143@pobox.sk> <20130208171012.GH7557@dhcp22.suse.cz> <20130208220243.EDEE0825@pobox.sk> <20130210150310.GA9504@dhcp22.suse.cz> <20130210174619.24F20488@pobox.sk> <20130211112240.GC19922@dhcp22.suse.cz> <20130222092332.4001E4B6@pobox.sk> <20130606160446.GE24115@dhcp22.suse.cz> <20130606181633.BCC3E02E@pobox.sk> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20130606181633.BCC3E02E@pobox.sk> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Thu 06-06-13 18:16:33, azurIt wrote: > Hello Michal, > > nice to read you! :) Yes, i'm still on 3.2. Could you be so kind and > try to backport it? Thank you very much! Here we go. I hope I didn't screw anything (Johannes might double check) because there were quite some changes in the area since 3.2.
Nothing earth shattering though. Please note that I have only compile tested this. Also make sure you remove the previous patches you have from me. --- >From 9d2801c1f53147ca9134cc5f76ab28d505a37a54 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Fri, 7 Jun 2013 13:52:42 +0200 Subject: [PATCH] memcg: do not trap chargers with full callstack on OOM The memcg OOM handling is incredibly fragile and can deadlock. When a task fails to charge memory, it invokes the OOM killer and loops right there in the charge code until it succeeds. Comparably, any other task that enters the charge path at this point will go to a waitqueue right then and there and sleep until the OOM situation is resolved. The problem is that these tasks may hold filesystem locks and the mmap_sem; locks that the selected OOM victim may need to exit. For example, in one reported case, the task invoking the OOM killer was about to charge a page cache page during a write(), which holds the i_mutex. The OOM killer selected a task that was just entering truncate() and trying to acquire the i_mutex: OOM invoking task: [] mem_cgroup_handle_oom+0x241/0x3b0 [] T.1146+0x5ab/0x5c0 [] mem_cgroup_cache_charge+0xbe/0xe0 [] add_to_page_cache_locked+0x4c/0x140 [] add_to_page_cache_lru+0x22/0x50 [] grab_cache_page_write_begin+0x8b/0xe0 [] ext3_write_begin+0x88/0x270 [] generic_file_buffered_write+0x116/0x290 [] __generic_file_aio_write+0x27c/0x480 [] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex [] do_sync_write+0xea/0x130 [] vfs_write+0xf3/0x1f0 [] sys_write+0x51/0x90 [] system_call_fastpath+0x18/0x1d [] 0xffffffffffffffff OOM kill victim: [] do_truncate+0x58/0xa0 # takes i_mutex [] do_last+0x250/0xa30 [] path_openat+0xd7/0x440 [] do_filp_open+0x49/0xa0 [] do_sys_open+0x106/0x240 [] sys_open+0x20/0x30 [] system_call_fastpath+0x18/0x1d [] 0xffffffffffffffff The OOM handling task will retry the charge indefinitely while the OOM killed task is not releasing any resources. 
A similar scenario can happen when the kernel OOM killer for a memcg is disabled and a userspace task is in charge of resolving OOM situations. In this case, ALL tasks that enter the OOM path will be made to sleep on the OOM waitqueue and wait for userspace to free resources or increase the group's limit. But a userspace OOM handler is prone to deadlock itself on the locks held by the waiting tasks. For example one of the sleeping tasks may be stuck in a brk() call with the mmap_sem held for writing but the userspace handler, in order to pick an optimal victim, may need to read files from /proc/, which tries to acquire the same mmap_sem for reading and deadlocks. This patch changes the way tasks behave after detecting an OOM and makes sure nobody loops or sleeps on OOM with locks held: 1. When OOMing in a system call (buffered IO and friends), invoke the OOM killer but just return -ENOMEM, never sleep on a OOM waitqueue. Userspace should be able to handle this and it prevents anybody from looping or waiting with locks held. 2. When OOMing in a page fault, invoke the OOM killer and restart the fault instead of looping on the charge attempt. This way, the OOM victim can not get stuck on locks the looping task may hold. 3. When detecting an OOM in a page fault but somebody else is handling it (either the kernel OOM killer or a userspace handler), don't go to sleep in the charge context. Instead, remember the OOMing memcg in the task struct and then fully unwind the page fault stack with -ENOMEM. pagefault_out_of_memory() will then call back into the memcg code to check if the -ENOMEM came from the memcg, and then either put the task to sleep on the memcg's OOM waitqueue or just restart the fault. The OOM victim can no longer get stuck on any lock a sleeping task may hold. While reworking the OOM routine, also remove a needless OOM waitqueue wakeup when invoking the killer. 
Only uncharges and limit increases, things that actually change the memory situation, should do wakeups. Reported-by: Reported-by: azurIt Debugged-by: Michal Hocko Reported-by: David Rientjes Signed-off-by: Johannes Weiner Reviewed-by: Michal Hocko --- include/linux/memcontrol.h | 22 +++++++ include/linux/mm.h | 1 + include/linux/sched.h | 6 ++ mm/ksm.c | 2 +- mm/memcontrol.c | 149 ++++++++++++++++++++++++++++---------------- mm/memory.c | 40 ++++++++---- mm/oom_kill.c | 2 + 7 files changed, 156 insertions(+), 66 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index b87068a..56bfc39 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -120,6 +120,15 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page); extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p); +static inline void mem_cgroup_set_userfault(struct task_struct *p) +{ + p->memcg_oom.in_userfault = 1; +} +static inline void mem_cgroup_clear_userfault(struct task_struct *p) +{ + p->memcg_oom.in_userfault = 0; +} +bool mem_cgroup_oom_synchronize(void); #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP extern int do_swap_account; #endif @@ -333,6 +342,19 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p) { } +static inline void mem_cgroup_set_userfault(struct task_struct *p) +{ +} + +static inline void mem_cgroup_clear_userfault(struct task_struct *p) +{ +} + +static inline bool mem_cgroup_oom_synchronize(void) +{ + return false; +} + static inline void mem_cgroup_inc_page_stat(struct page *page, enum mem_cgroup_page_stat_item idx) { diff --git a/include/linux/mm.h b/include/linux/mm.h index 4baadd1..91380ef 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -156,6 +156,7 @@ extern pgprot_t protection_map[16]; #define FAULT_FLAG_ALLOW_RETRY 0x08 /* Retry fault if blocking */ #define FAULT_FLAG_RETRY_NOWAIT 0x10 /* Don't drop mmap_sem and wait when retrying */ #define FAULT_FLAG_KILLABLE 0x20 
/* The fault task is in SIGKILL killable region */ +#define FAULT_FLAG_KERNEL 0x80 /* kernel-triggered fault (get_user_pages etc.) */ /* * This interface is used by x86 PAT code to identify a pfn mapping that is diff --git a/include/linux/sched.h b/include/linux/sched.h index 1c4f3e9..d521a70 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1568,6 +1568,12 @@ struct task_struct { unsigned long nr_pages; /* uncharged usage */ unsigned long memsw_nr_pages; /* uncharged mem+swap usage */ } memcg_batch; + struct memcg_oom_info { + unsigned int in_userfault:1; + unsigned int in_memcg_oom:1; + int wakeups; + struct mem_cgroup *wait_on_memcg; + } memcg_oom; #endif #ifdef CONFIG_HAVE_HW_BREAKPOINT atomic_t ptrace_bp_refcnt; diff --git a/mm/ksm.c b/mm/ksm.c index 310544a..3295a3b 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -338,7 +338,7 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr) break; if (PageKsm(page)) ret = handle_mm_fault(vma->vm_mm, vma, addr, - FAULT_FLAG_WRITE); + FAULT_FLAG_KERNEL | FAULT_FLAG_WRITE); else ret = VM_FAULT_WRITE; put_page(page); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index b63f5f7..67189b4 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -249,6 +249,7 @@ struct mem_cgroup { bool oom_lock; atomic_t under_oom; + atomic_t oom_wakeups; atomic_t refcnt; @@ -1846,6 +1847,7 @@ static int memcg_oom_wake_function(wait_queue_t *wait, static void memcg_wakeup_oom(struct mem_cgroup *memcg) { + atomic_inc(&memcg->oom_wakeups); /* for filtering, pass "memcg" as argument. */ __wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg); } @@ -1857,55 +1859,109 @@ static void memcg_oom_recover(struct mem_cgroup *memcg) } /* - * try to call OOM killer. returns false if we should exit memory-reclaim loop. 
+ * try to call OOM killer */ -bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask) +static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask) { - struct oom_wait_info owait; - bool locked, need_to_kill; - - owait.mem = memcg; - owait.wait.flags = 0; - owait.wait.func = memcg_oom_wake_function; - owait.wait.private = current; - INIT_LIST_HEAD(&owait.wait.task_list); - need_to_kill = true; - mem_cgroup_mark_under_oom(memcg); + bool locked, need_to_kill = true; /* At first, try to OOM lock hierarchy under memcg.*/ spin_lock(&memcg_oom_lock); locked = mem_cgroup_oom_lock(memcg); - /* - * Even if signal_pending(), we can't quit charge() loop without - * accounting. So, UNINTERRUPTIBLE is appropriate. But SIGKILL - * under OOM is always welcomed, use TASK_KILLABLE here. - */ - prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE); if (!locked || memcg->oom_kill_disable) need_to_kill = false; if (locked) mem_cgroup_oom_notify(memcg); spin_unlock(&memcg_oom_lock); - if (need_to_kill) { - finish_wait(&memcg_oom_waitq, &owait.wait); - mem_cgroup_out_of_memory(memcg, mask); - } else { - schedule(); - finish_wait(&memcg_oom_waitq, &owait.wait); + /* + * A system call can just return -ENOMEM, but if this is a + * page fault and somebody else is handling the OOM already, + * we need to sleep on the OOM waitqueue for this memcg until + * the situation is resolved. Which can take some time + * because it might be handled by a userspace task. + * + * However, this is the charge context, which means that we + * may sit on a large call stack and hold various filesystem + * locks, the mmap_sem etc. and we don't want the OOM handler + * to deadlock on them while we sit here and wait. Store the + * current OOM context in the task_struct, then return + * -ENOMEM. At the end of the page fault handler, with the + * stack unwound, pagefault_out_of_memory() will check back + * with us by calling mem_cgroup_oom_synchronize(), possibly + * putting the task to sleep. 
+ */ + if (current->memcg_oom.in_userfault) { + current->memcg_oom.in_memcg_oom = 1; + /* + * Somebody else is handling the situation. Make sure + * no wakeups are missed between now and going to + * sleep at the end of the page fault. + */ + if (!need_to_kill) { + mem_cgroup_mark_under_oom(memcg); + current->memcg_oom.wakeups = + atomic_read(&memcg->oom_wakeups); + css_get(&memcg->css); + current->memcg_oom.wait_on_memcg = memcg; + } } - spin_lock(&memcg_oom_lock); - if (locked) + + if (need_to_kill) + mem_cgroup_out_of_memory(memcg, mask); + + if (locked) { + spin_lock(&memcg_oom_lock); mem_cgroup_oom_unlock(memcg); - memcg_wakeup_oom(memcg); - spin_unlock(&memcg_oom_lock); + /* + * Sleeping tasks might have been killed, make sure + * they get scheduled so they can exit. + */ + if (need_to_kill) + memcg_oom_recover(memcg); + spin_unlock(&memcg_oom_lock); + } +} - mem_cgroup_unmark_under_oom(memcg); +bool mem_cgroup_oom_synchronize(void) +{ + struct oom_wait_info owait; + struct mem_cgroup *memcg; - if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current)) + /* OOM is global, do not handle */ + if (!current->memcg_oom.in_memcg_oom) return false; - /* Give chance to dying process */ - schedule_timeout_uninterruptible(1); + + /* + * We invoked the OOM killer but there is a chance that a kill + * did not free up any charges. Everybody else might already + * be sleeping, so restart the fault and keep the rampage + * going until some charges are released. 
+ */ + memcg = current->memcg_oom.wait_on_memcg; + if (!memcg) + goto out; + + if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current)) + goto out_put; + + owait.mem = memcg; + owait.wait.flags = 0; + owait.wait.func = memcg_oom_wake_function; + owait.wait.private = current; + INIT_LIST_HEAD(&owait.wait.task_list); + + prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE); + /* Only sleep if we didn't miss any wakeups since OOM */ + if (atomic_read(&memcg->oom_wakeups) == current->memcg_oom.wakeups) + schedule(); + finish_wait(&memcg_oom_waitq, &owait.wait); +out_put: + mem_cgroup_unmark_under_oom(memcg); + css_put(&memcg->css); + current->memcg_oom.wait_on_memcg = NULL; +out: + current->memcg_oom.in_memcg_oom = 0; return true; } @@ -2195,11 +2251,10 @@ enum { CHARGE_RETRY, /* need to retry but retry is not bad */ CHARGE_NOMEM, /* we can't do more. return -ENOMEM */ CHARGE_WOULDBLOCK, /* GFP_WAIT wasn't set and no enough res. */ - CHARGE_OOM_DIE, /* the current is killed because of OOM */ }; static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, - unsigned int nr_pages, bool oom_check) + unsigned int nr_pages, bool invoke_oom) { unsigned long csize = nr_pages * PAGE_SIZE; struct mem_cgroup *mem_over_limit; @@ -2257,14 +2312,10 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, if (mem_cgroup_wait_acct_move(mem_over_limit)) return CHARGE_RETRY; - /* If we don't need to call oom-killer at el, return immediately */ - if (!oom_check) - return CHARGE_NOMEM; - /* check OOM */ - if (!mem_cgroup_handle_oom(mem_over_limit, gfp_mask)) - return CHARGE_OOM_DIE; + if (invoke_oom) + mem_cgroup_oom(mem_over_limit, gfp_mask); - return CHARGE_RETRY; + return CHARGE_NOMEM; } /* @@ -2349,7 +2400,7 @@ again: } do { - bool oom_check; + bool invoke_oom = oom && !nr_oom_retries; /* If killed, bypass charge */ if (fatal_signal_pending(current)) { @@ -2357,13 +2408,7 @@ again: goto bypass; } - oom_check = false; - if (oom && 
!nr_oom_retries) { - oom_check = true; - nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES; - } - - ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, oom_check); + ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, invoke_oom); switch (ret) { case CHARGE_OK: break; @@ -2376,16 +2421,12 @@ again: css_put(&memcg->css); goto nomem; case CHARGE_NOMEM: /* OOM routine works */ - if (!oom) { + if (!oom || invoke_oom) { css_put(&memcg->css); goto nomem; } - /* If oom, we never return -ENOMEM */ nr_oom_retries--; break; - case CHARGE_OOM_DIE: /* Killed by OOM Killer */ - css_put(&memcg->css); - goto bypass; } } while (ret != CHARGE_OK); diff --git a/mm/memory.c b/mm/memory.c index 829d437..bee177c 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1720,7 +1720,7 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, cond_resched(); while (!(page = follow_page(vma, start, foll_flags))) { int ret; - unsigned int fault_flags = 0; + unsigned int fault_flags = FAULT_FLAG_KERNEL; /* For mlock, just skip the stack guard page. */ if (foll_flags & FOLL_MLOCK) { @@ -1842,6 +1842,7 @@ int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm, if (!vma || address < vma->vm_start) return -EFAULT; + fault_flags |= FAULT_FLAG_KERNEL; ret = handle_mm_fault(mm, vma, address, fault_flags); if (ret & VM_FAULT_ERROR) { if (ret & VM_FAULT_OOM) @@ -3439,22 +3440,14 @@ unlock: /* * By the time we get here, we already hold the mm semaphore */ -int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, - unsigned long address, unsigned int flags) +static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, unsigned int flags) { pgd_t *pgd; pud_t *pud; pmd_t *pmd; pte_t *pte; - __set_current_state(TASK_RUNNING); - - count_vm_event(PGFAULT); - mem_cgroup_count_vm_event(mm, PGFAULT); - - /* do counter updates before entering really critical section. 
*/ - check_sync_rss_stat(current); - if (unlikely(is_vm_hugetlb_page(vma))) return hugetlb_fault(mm, vma, address, flags); @@ -3503,6 +3496,31 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, return handle_pte_fault(mm, vma, address, pte, pmd, flags); } +int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, unsigned int flags) +{ + int in_userfault = !(flags & FAULT_FLAG_KERNEL); + int ret; + + __set_current_state(TASK_RUNNING); + + count_vm_event(PGFAULT); + mem_cgroup_count_vm_event(mm, PGFAULT); + + /* do counter updates before entering really critical section. */ + check_sync_rss_stat(current); + + if (in_userfault) + mem_cgroup_set_userfault(current); + + ret = __handle_mm_fault(mm, vma, address, flags); + + if (in_userfault) + mem_cgroup_clear_userfault(current); + + return ret; +} + #ifndef __PAGETABLE_PUD_FOLDED /* * Allocate page upper directory. diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 069b64e..aa60863 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -785,6 +785,8 @@ out: */ void pagefault_out_of_memory(void) { + if (mem_cgroup_oom_synchronize()) + return; if (try_set_system_oom()) { out_of_memory(NULL, 0, 0, NULL); clear_system_oom(); -- 1.7.10.4 -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
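The wakeup accounting in the patch above (memcg->oom_wakeups, snapshotted into current->memcg_oom.wakeups and compared again before schedule()) can be modelled in plain userspace C. This is only a sketch of the idea, not kernel code: struct group, struct waiter and the three helper names are hypothetical stand-ins, and C11 atomics take the place of the kernel's atomic_t. In the patch itself the comparison happens after prepare_to_wait(), which is what keeps a wakeup from slipping in between the check and the sleep; that part is not modelled here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-memcg state; only the wakeup counter is modelled. */
struct group {
    atomic_int oom_wakeups;
};

/* Hypothetical stand-in for the per-task memcg_oom bookkeeping. */
struct waiter {
    int wakeups_seen;
};

/* Charge context: remember how many wakeups had been issued when the OOM was detected. */
static void waiter_record(struct waiter *w, struct group *g)
{
    assert(w != NULL && g != NULL);
    w->wakeups_seen = atomic_load(&g->oom_wakeups);
}

/* Uncharge or limit increase: count the wakeup so that late sleepers can notice it. */
static void group_wakeup(struct group *g)
{
    assert(g != NULL);
    atomic_fetch_add(&g->oom_wakeups, 1);
}

/* End of the fault: only sleep if no wakeup was issued since waiter_record(). */
static bool waiter_should_sleep(const struct waiter *w, struct group *g)
{
    assert(w != NULL && g != NULL);
    return atomic_load(&g->oom_wakeups) == w->wakeups_seen;
}
```

A task that records the counter, unwinds its stack and then finds the counter changed skips the sleep and simply restarts the fault, which is the behaviour the commit message describes for page faults.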
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: "azurIt" Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM Date: Mon, 17 Jun 2013 12:21:34 +0200 Message-ID: <20130617122134.2E072BA8@pobox.sk> References: <20130208152402.GD7557@dhcp22.suse.cz>, <20130208165805.8908B143@pobox.sk>, <20130208171012.GH7557@dhcp22.suse.cz>, <20130208220243.EDEE0825@pobox.sk>, <20130210150310.GA9504@dhcp22.suse.cz>, <20130210174619.24F20488@pobox.sk>, <20130211112240.GC19922@dhcp22.suse.cz>, <20130222092332.4001E4B6@pobox.sk>, <20130606160446.GE24115@dhcp22.suse.cz>, <20130606181633.BCC3E02E@pobox.sk> <20130607131157.GF8117@dhcp22.suse.cz> Mime-Version: 1.0 Content-Transfer-Encoding: quoted-printable Return-path: In-Reply-To: <20130607131157.GF8117@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" To: Michal Hocko Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner >Here we go. I hope I didn't screw anything (Johannes might double check) >because there were quite some changes in the area since 3.2. Nothing >earth shattering though. Please note that I have only compile tested >this. Also make sure you remove the previous patches you have from me. Hi Michal, unfortunately, it didn't work. Everything was working fine, but the original problem is still occurring. I'm unable to send you stacks or more info because the problem has been taking down the whole server for some time now (I don't know what exactly caused that to start happening, maybe newer versions of 3.2.x). But I'm sure of one thing: when the problem occurs, nothing is able to access the hard drives (every process which tries is frozen until the problem is resolved or the server is rebooted).
The problem is fixed by killing the processes from the cgroup which caused it, and everything immediately starts to work normally. I found this out by keeping a terminal open from another server to the one where the problem occurs most often and running several apps there (htop, iotop, etc.). When the problem occurred, all apps which weren't working with the HDD were OK. htop proved to be very useful here because it only reads the proc filesystem and is also able to send KILL signals, so I was able to resolve the problem with it without rebooting the server. I created a special daemon (about a month ago) which is able to detect and fix the problem, so I'm not having server outages now. The point was to NOT access anything stored on the HDDs: the daemon only reads info from the cgroup filesystem and sends KILL signals to processes. Maybe I should also be able to read the stack files before killing; I will try it. Btw, which vanilla kernel includes this patch? Thank you and everyone involved very much for your time and help. azur -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
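A watchdog of the kind described above has to decide from the cgroup filesystem alone whether a group is stuck. The daemon's source was not posted, so the following is only a sketch of one plausible building block: parsing the text of a cgroup v1 memory.oom_control file, which holds whitespace-separated "key value" pairs such as oom_kill_disable and under_oom. The struct and function names are hypothetical, and reading the file, listing the tasks file and sending SIGKILL are deliberately left out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the values found in memory.oom_control. */
struct oom_control {
    int oom_kill_disable;  /* 1 if the kernel OOM killer is disabled for the group */
    int under_oom;         /* 1 if the group is currently marked as under OOM */
};

/*
 * Parse the text of memory.oom_control ("key value" pairs separated by
 * whitespace). Unknown keys are ignored. Returns 0 on success and -1 if
 * either of the two expected keys is missing.
 */
static int parse_oom_control(const char *text, struct oom_control *out)
{
    char key[32];
    int value, consumed;
    int seen_disable = 0, seen_under = 0;

    assert(text != NULL && out != NULL);

    /* %n reports how many characters were consumed so the scan can advance. */
    while (sscanf(text, " %31s %d%n", key, &value, &consumed) == 2) {
        if (strcmp(key, "oom_kill_disable") == 0) {
            out->oom_kill_disable = value;
            seen_disable = 1;
        } else if (strcmp(key, "under_oom") == 0) {
            out->under_oom = value;
            seen_under = 1;
        }
        text += consumed;
    }
    return (seen_disable && seen_under) ? 0 : -1;
}
```

Because the parser works on a string that is already in memory, it touches nothing on disk, in line with the stated goal of not accessing the HDDs while the problem is occurring.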
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM Date: Wed, 19 Jun 2013 15:26:14 +0200 Message-ID: <20130619132614.GC16457@dhcp22.suse.cz> References: <20130208171012.GH7557@dhcp22.suse.cz> <20130208220243.EDEE0825@pobox.sk> <20130210150310.GA9504@dhcp22.suse.cz> <20130210174619.24F20488@pobox.sk> <20130211112240.GC19922@dhcp22.suse.cz> <20130222092332.4001E4B6@pobox.sk> <20130606160446.GE24115@dhcp22.suse.cz> <20130606181633.BCC3E02E@pobox.sk> <20130607131157.GF8117@dhcp22.suse.cz> <20130617122134.2E072BA8@pobox.sk> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20130617122134.2E072BA8@pobox.sk> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Mon 17-06-13 12:21:34, azurIt wrote: > >Here we go. I hope I didn't screw anything (Johannes might double check) > >because there were quite some changes in the area since 3.2. Nothing > >earth shattering though. Please note that I have only compile tested > >this. Also make sure you remove the previous patches you have from me. > > > Hi Michal, > > it, unfortunately, didn't work. Everything was working fine but > original problem is still occuring. This would be more than surprising because tasks blocked at memcg OOM don't hold any locks anymore. Maybe I have messed something up during backport but I cannot spot anything. > I'm unable to send you stacks or more info because problem is taking > down the whole server for some time now (don't know what exactly > caused it to start happening, maybe newer versions of 3.2.x). So you are not testing with the same kernel with just the old patch replaced by the new one? 
> But i'm sure of one thing - when problem occurs, nothing is able to > access hard drives (every process which tries it is freezed until > problem is resolved or server is rebooted). It would be really interesting to see what those tasks are blocked on. > Problem is fixed after killing processes from cgroup which > caused it and everything immediatelly starts to work normally. I > find this out by keeping terminal opened from another server to one > where my problem is occuring quite often and running several apps > there (htop, iotop, etc.). When problem occurs, all apps which wasn't > working with HDD was ok. The htop proved to be very usefull here > because it's only reading proc filesystem and is also able to send > KILL signals - i was able to resolve the problem with it > without rebooting the server. sysrq+t will give you the list of all tasks and their traces. > I created a special daemon (about month ago) which is able to detect > and fix the problem so i'm not having server outages now. The point > was to NOT access anything which is stored on HDDs, the daemon is > only reading info from cgroup filesystem and sending KILL signals to > processes. Maybe i should be able to also read stack files before > killing, i will try it. > > Btw, which vanilla kernel includes this patch? None yet. But I hope it will be merged to 3.11 and backported to the stable trees. > Thank you and everyone involved very much for time and help. > > azur -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: "azurIt" Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM Date: Sat, 22 Jun 2013 22:09:58 +0200 Message-ID: <20130622220958.D10567A4@pobox.sk> References: <20130208171012.GH7557@dhcp22.suse.cz>, <20130208220243.EDEE0825@pobox.sk>, <20130210150310.GA9504@dhcp22.suse.cz>, <20130210174619.24F20488@pobox.sk>, <20130211112240.GC19922@dhcp22.suse.cz>, <20130222092332.4001E4B6@pobox.sk>, <20130606160446.GE24115@dhcp22.suse.cz>, <20130606181633.BCC3E02E@pobox.sk>, <20130607131157.GF8117@dhcp22.suse.cz>, <20130617122134.2E072BA8@pobox.sk> <20130619132614.GC16457@dhcp22.suse.cz> Mime-Version: 1.0 Content-Transfer-Encoding: quoted-printable Return-path: In-Reply-To: <20130619132614.GC16457@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" To: Michal Hocko Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner Michal, >> I'm unable to send you stacks or more info because problem is taking >> down the whole server for some time now (don't know what exactly >> caused it to start happening, maybe newer versions of 3.2.x). > >So you are not testing with the same kernel with just the old patch >replaced by the new one? No, I'm not testing with the same kernel, but all of them are 3.2.x. I cannot even install an older 3.2.x because grsecurity is always available only for the newest kernel and there is no archive of older versions (at least I don't know about any). >> But i'm sure of one thing - when problem occurs, nothing is able to >> access hard drives (every process which tries it is freezed until >> problem is resolved or server is rebooted). > >I would be really interesting to see what those tasks are blocked on.
I'm trying to get it, stay tuned :) Today I noticed one bug; I'm not 100% sure it is related to 'your' patch, but I haven't seen it before. I noticed that I have lots of cgroups which cannot be removed: if I do 'rmdir ', it just hangs and never completes. Moreover, it's not possible to access the whole cgroup filesystem until I kill that rmdir (anything which tries just hangs). All unremovable cgroups have this in 'memory.oom_control': oom_kill_disable 0 under_oom 1 And, yes, the 'tasks' file is empty. azur -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: "azurIt" Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM Date: Mon, 24 Jun 2013 18:48:40 +0200 Message-ID: <20130624184840.781777E6@pobox.sk> References: <20130208171012.GH7557@dhcp22.suse.cz>, <20130208220243.EDEE0825@pobox.sk>, <20130210150310.GA9504@dhcp22.suse.cz>, <20130210174619.24F20488@pobox.sk>, <20130211112240.GC19922@dhcp22.suse.cz>, <20130222092332.4001E4B6@pobox.sk>, <20130606160446.GE24115@dhcp22.suse.cz>, <20130606181633.BCC3E02E@pobox.sk>, <20130607131157.GF8117@dhcp22.suse.cz>, <20130617122134.2E072BA8@pobox.sk> <20130619132614.GC16457@dhcp22.suse.cz> Mime-Version: 1.0 Content-Transfer-Encoding: quoted-printable Return-path: In-Reply-To: <20130619132614.GC16457@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" To: Michal Hocko Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner >I would be really interesting to see what those tasks are blocked on. OK, I got it! The problem occurred twice and behaved differently each time. I was running the kernel with the latest patch. 1.)
It had no impact on the whole server, only on one cgroup. Here are the stacks: http://watchdog.sk/lkml/memcg-bug-7.tar.gz 2.) It almost took down the server because of huge I/O on the HDDs. Unfortunately, I had a bug in my script which was supposed to gather the stacks (I wasn't able to do it by hand like in (1); the server was almost inoperable). But I was lucky and somehow killed the processes from the problematic cgroup (via htop), and the server was OK again EXCEPT for one important thing: processes from that cgroup were still running in D state and I wasn't able to kill them for good. They were holding the web server's network ports, so I had to reboot the server :( BUT, before that, I gathered the stacks: http://watchdog.sk/lkml/memcg-bug-8.tar.gz What do you think? azur -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Johannes Weiner Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM Date: Mon, 24 Jun 2013 16:13:45 -0400 Message-ID: <20130624201345.GA21822@cmpxchg.org> References: <20130210150310.GA9504@dhcp22.suse.cz> <20130210174619.24F20488@pobox.sk> <20130211112240.GC19922@dhcp22.suse.cz> <20130222092332.4001E4B6@pobox.sk> <20130606160446.GE24115@dhcp22.suse.cz> <20130606181633.BCC3E02E@pobox.sk> <20130607131157.GF8117@dhcp22.suse.cz> <20130617122134.2E072BA8@pobox.sk> <20130619132614.GC16457@dhcp22.suse.cz> <20130622220958.D10567A4@pobox.sk> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20130622220958.D10567A4-Rm0zKEqwvD4@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: azurIt Cc: Michal Hocko , linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist ,
KAMEZAWA Hiroyuki Hi guys, On Sat, Jun 22, 2013 at 10:09:58PM +0200, azurIt wrote: > >> But i'm sure of one thing - when problem occurs, nothing is able to > >> access hard drives (every process which tries it is freezed until > >> problem is resolved or server is rebooted). > > > >I would be really interesting to see what those tasks are blocked on. > > I'm trying to get it, stay tuned :) > > Today i noticed one bug, not 100% sure it is related to 'your' patch > but i didn't seen this before. I noticed that i have lots of cgroups > which cannot be removed - if i do 'rmdir ', it > just hangs and never complete. Even more, it's not possible to > access the whole cgroup filesystem until i kill that rmdir > (anything, which tries it, just hangs). All unremoveable cgroups has > this in 'memory.oom_control': oom_kill_disable 0 under_oom 1 Somebody acquires the OOM wait reference to the memcg and marks it under oom but then does not call into mem_cgroup_oom_synchronize() to clean up. That's why under_oom is set and the rmdir waits for outstanding references. > And, yes, 'tasks' file is empty. It's not a kernel thread that does it because all kernel-context handle_mm_fault() are annotated properly, which means the task must be userspace and, since tasks is empty, have exited before synchronizing. Can you try with the following patch on top? diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 5db0490..9a0b152 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -846,17 +846,6 @@ static noinline int mm_fault_error(struct pt_regs *regs, unsigned long error_code, unsigned long address, unsigned int fault) { - /* - * Pagefault was interrupted by SIGKILL. We have no reason to - * continue pagefault. 
- */ - if (fatal_signal_pending(current)) { - if (!(fault & VM_FAULT_RETRY)) - up_read(&current->mm->mmap_sem); - if (!(error_code & PF_USER)) - no_context(regs, error_code, address); - return 1; - } if (!(fault & VM_FAULT_ERROR)) return 0; From mboxrd@z Thu Jan 1 00:00:00 1970 From: "azurIt" Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=5D_memcg=3A_do_not_trap_chargers_with_full_callstack_on_OOM?= Date: Fri, 28 Jun 2013 12:06:13 +0200 Message-ID: <20130628120613.6D6CAD21@pobox.sk> References: <20130210150310.GA9504@dhcp22.suse.cz>, <20130210174619.24F20488@pobox.sk>, <20130211112240.GC19922@dhcp22.suse.cz>, <20130222092332.4001E4B6@pobox.sk>, <20130606160446.GE24115@dhcp22.suse.cz>, <20130606181633.BCC3E02E@pobox.sk>, <20130607131157.GF8117@dhcp22.suse.cz>, <20130617122134.2E072BA8@pobox.sk>, <20130619132614.GC16457@dhcp22.suse.cz>, <20130622220958.D10567A4@pobox.sk> <20130624201345.GA21822@cmpxchg.org> Mime-Version: 1.0 Content-Transfer-Encoding: 8bit Return-path: In-Reply-To: <20130624201345.GA21822-druUgvl0LCNAfugRpC6u6w@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" To: =?utf-8?q?Johannes_Weiner?= Cc: =?utf-8?q?Michal_Hocko?= , linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= >It's not a kernel thread that does it because all kernel-context >handle_mm_fault() are annotated properly, which means the task must be >userspace and, since tasks is empty, have exited before synchronizing. > >Can you try with the following patch on top? Michal and Johannes, i have some observations which i made: Original patch from Johannes was really fixing something but definitely not everything and was introducing new problems.
I'm running unpatched kernel from time i send my last message and problems with freezing cgroups are occuring very often (several times per day) - they were, on the other hand, quite rare with patch from Johannes. Johannes, i didn't try your last patch yet. I would like to wait until you or Michal look at my last message which contained detailed information about freezing of cgroups on kernel running your original patch (which was suppose to fix it for good). Even more, i would like to hear your opinion about that stucked processes which was holding web server port and which forced me to reboot production server at the middle of the day :( more information was in my last message. Thank you very much for your time. azur From mboxrd@z Thu Jan 1 00:00:00 1970 From: Johannes Weiner Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM Date: Fri, 5 Jul 2013 14:17:28 -0400 Message-ID: <20130705181728.GQ17812@cmpxchg.org> References: <20130211112240.GC19922@dhcp22.suse.cz> <20130222092332.4001E4B6@pobox.sk> <20130606160446.GE24115@dhcp22.suse.cz> <20130606181633.BCC3E02E@pobox.sk> <20130607131157.GF8117@dhcp22.suse.cz> <20130617122134.2E072BA8@pobox.sk> <20130619132614.GC16457@dhcp22.suse.cz> <20130622220958.D10567A4@pobox.sk> <20130624201345.GA21822@cmpxchg.org> <20130628120613.6D6CAD21@pobox.sk> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20130628120613.6D6CAD21-Rm0zKEqwvD4@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: azurIt Cc: Michal Hocko , linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist , KAMEZAWA Hiroyuki Hi azurIt, On Fri, Jun 28, 2013 at 12:06:13PM +0200, azurIt wrote: > >It's not a kernel thread that does it because all kernel-context > >handle_mm_fault() are annotated properly, which means the task must be 
> >userspace and, since tasks is empty, have exited before synchronizing. > > > >Can you try with the following patch on top? > > > Michal and Johannes, > > i have some observations which i made: Original patch from Johannes > was really fixing something but definitely not everything and was > introducing new problems. I'm running unpatched kernel from time i > send my last message and problems with freezing cgroups are occuring > very often (several times per day) - they were, on the other hand, > quite rare with patch from Johannes. That's good! > Johannes, i didn't try your last patch yet. I would like to wait > until you or Michal look at my last message which contained detailed > information about freezing of cgroups on kernel running your > original patch (which was suppose to fix it for good). Even more, i > would like to hear your opinion about that stucked processes which > was holding web server port and which forced me to reboot production > server at the middle of the day :( more information was in my last > message. Thank you very much for your time. I looked at your debug messages but could not find anything that would hint at a deadlock. All tasks are stuck in the refrigerator, so I assume you use the freezer cgroup and enabled it somehow? Sorry about your production server locking up, but from the stacks I don't see any connection to the OOM problems you were having... 
:/ From mboxrd@z Thu Jan 1 00:00:00 1970 From: "azurIt" Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=5D_memcg=3A_do_not_trap_chargers_with_full_callstack_on_OOM?= Date: Fri, 05 Jul 2013 21:02:46 +0200 Message-ID: <20130705210246.11D2135A@pobox.sk> References: <20130211112240.GC19922@dhcp22.suse.cz>, <20130222092332.4001E4B6@pobox.sk>, <20130606160446.GE24115@dhcp22.suse.cz>, <20130606181633.BCC3E02E@pobox.sk>, <20130607131157.GF8117@dhcp22.suse.cz>, <20130617122134.2E072BA8@pobox.sk>, <20130619132614.GC16457@dhcp22.suse.cz>, <20130622220958.D10567A4@pobox.sk>, <20130624201345.GA21822@cmpxchg.org>, <20130628120613.6D6CAD21@pobox.sk> <20130705181728.GQ17812@cmpxchg.org> Mime-Version: 1.0 Content-Transfer-Encoding: quoted-printable Return-path: In-Reply-To: <20130705181728.GQ17812@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" To: =?utf-8?q?Johannes_Weiner?= Cc: =?utf-8?q?Michal_Hocko?= , linux-kernel@vger.kernel.org, linux-mm@kvack.org, =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= >I looked at your debug messages but could not find anything that would >hint at a deadlock. All tasks are stuck in the refrigerator, so I >assume you use the freezer cgroup and enabled it somehow? Yes, i'm really using freezer cgroup BUT i was checking if it's not doing problems - unfortunately, several days passed from that day and now i don't fully remember if i was checking it for both cases (unremoveabled cgroups and these freezed processes holding web server port). I'm 100% sure i was checking it for unremoveable cgroups but not so sure for the other problem (i had to act quickly in that case). Are you sure (from stacks) that freezer cgroup was enabled there? Btw, what about that other stacks? I mean this file: http://watchdog.sk/lkml/memcg-bug-7.tar.gz It was taken while running the kernel with your patch and from cgroup which was under unresolveable OOM (just like my very original problem).
Thank you! azur From mboxrd@z Thu Jan 1 00:00:00 1970 From: Johannes Weiner Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM Date: Fri, 5 Jul 2013 15:18:54 -0400 Message-ID: <20130705191854.GR17812@cmpxchg.org> References: <20130606160446.GE24115@dhcp22.suse.cz> <20130606181633.BCC3E02E@pobox.sk> <20130607131157.GF8117@dhcp22.suse.cz> <20130617122134.2E072BA8@pobox.sk> <20130619132614.GC16457@dhcp22.suse.cz> <20130622220958.D10567A4@pobox.sk> <20130624201345.GA21822@cmpxchg.org> <20130628120613.6D6CAD21@pobox.sk> <20130705181728.GQ17812@cmpxchg.org> <20130705210246.11D2135A@pobox.sk> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20130705210246.11D2135A-Rm0zKEqwvD4@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: azurIt Cc: Michal Hocko , linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist , KAMEZAWA Hiroyuki On Fri, Jul 05, 2013 at 09:02:46PM +0200, azurIt wrote: > >I looked at your debug messages but could not find anything that would > >hint at a deadlock. All tasks are stuck in the refrigerator, so I > >assume you use the freezer cgroup and enabled it somehow? > > > Yes, i'm really using freezer cgroup BUT i was checking if it's not > doing problems - unfortunately, several days passed from that day > and now i don't fully remember if i was checking it for both cases > (unremoveabled cgroups and these freezed processes holding web > server port). I'm 100% sure i was checking it for unremoveable > cgroups but not so sure for the other problem (i had to act quickly > in that case).
Are you sure (from stacks) that freezer cgroup was > enabled there? Yeah, all the traces without exception look like this: 1372089762/23433/stack:[] refrigerator+0x95/0x160 1372089762/23433/stack:[] get_signal_to_deliver+0x1cb/0x540 1372089762/23433/stack:[] do_signal+0x6b/0x750 1372089762/23433/stack:[] do_notify_resume+0x55/0x80 1372089762/23433/stack:[] int_signal+0x12/0x17 1372089762/23433/stack:[] 0xffffffffffffffff so the freezer was already enabled when you took the backtraces. > Btw, what about that other stacks? I mean this file: > http://watchdog.sk/lkml/memcg-bug-7.tar.gz > > It was taken while running the kernel with your patch and from > cgroup which was under unresolveable OOM (just like my very original > problem). I looked at these traces too, but none of the tasks are stuck in rmdir or the OOM path. Some /are/ in the page fault path, but they are happily doing reclaim and don't appear to be stuck. So I'm having a hard time matching this data to what you otherwise observed. However, based on what you reported the most likely explanation for the continued hangs is the unfinished OOM handling for which I sent the followup patch for arch/x86/mm/fault.c. 
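For anyone going through stack archives like the ones referenced above, the dumps can be summarised by each task's top frame, which makes it easy to see whether tasks sit in refrigerator(), in the memcg OOM path or somewhere else. What follows is only a minimal sketch and was not posted in this thread. It assumes grep-style input lines of the form timestamp/pid/stack:[address] symbol+offset/size, as in the traces quoted above (the addresses are missing in this archive, so an empty [] is accepted). The names LINE_RE, top_frames and summarize are made up for illustration.

```python
import re
from collections import Counter

# Assumed input: grep-style lines as quoted in the thread, e.g.
#   1372089762/23433/stack:[] refrigerator+0x95/0x160
# The address inside [] may be empty or look like <ffffffff8110ba56>.
LINE_RE = re.compile(
    r"^(?P<ts>\d+)/(?P<pid>\d+)/stack:\[(?P<addr>[^\]]*)\]\s+(?P<sym>\S+)"
)


def top_frames(lines):
    """Map (timestamp, pid) to the first symbol listed for that task.

    /proc/<pid>/stack prints the innermost frame first, so the first
    symbol seen for a task is where it is currently sitting.
    """
    tops = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not a stack line, ignore it
        symbol = match.group("sym").split("+", 1)[0]
        if symbol.startswith("0x"):
            continue  # trailing 0xffffffffffffffff end marker
        key = (match.group("ts"), match.group("pid"))
        tops.setdefault(key, symbol)  # keep only the first frame
    return tops


def summarize(lines):
    """Count how many task dumps have each symbol as their top frame."""
    return Counter(top_frames(lines).values())
```

Feeding it the combined grep output (for instance summarize(open("stacks.txt")), where stacks.txt is a hypothetical file name) returns a Counter keyed by symbol. A result dominated by refrigerator would match the observation above that the freezer was active when the backtraces were taken.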
From mboxrd@z Thu Jan 1 00:00:00 1970 From: "azurIt" Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=5D_memcg=3A_do_not_trap_chargers_with_full_callstack_on_OOM?= Date: Mon, 08 Jul 2013 01:42:24 +0200 Message-ID: <20130708014224.50F06960@pobox.sk> References: <20130606160446.GE24115@dhcp22.suse.cz>, <20130606181633.BCC3E02E@pobox.sk>, <20130607131157.GF8117@dhcp22.suse.cz>, <20130617122134.2E072BA8@pobox.sk>, <20130619132614.GC16457@dhcp22.suse.cz>, <20130622220958.D10567A4@pobox.sk>, <20130624201345.GA21822@cmpxchg.org>, <20130628120613.6D6CAD21@pobox.sk>, <20130705181728.GQ17812@cmpxchg.org>, <20130705210246.11D2135A@pobox.sk> <20130705191854.GR17812@cmpxchg.org> Mime-Version: 1.0 Content-Transfer-Encoding: quoted-printable Return-path: In-Reply-To: <20130705191854.GR17812@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" To: =?utf-8?q?Johannes_Weiner?= Cc: =?utf-8?q?Michal_Hocko?= , linux-kernel@vger.kernel.org, linux-mm@kvack.org, =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= > CC: "Michal Hocko" , linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" >On Fri, Jul 05, 2013 at 09:02:46PM +0200, azurIt wrote: >> >I looked at your debug messages but could not find anything that would >> >hint at a deadlock. All tasks are stuck in the refrigerator, so I >> >assume you use the freezer cgroup and enabled it somehow? >> >> >> Yes, i'm really using freezer cgroup BUT i was checking if it's not >> doing problems - unfortunately, several days passed from that day >> and now i don't fully remember if i was checking it for both cases >> (unremoveabled cgroups and these freezed processes holding web >> server port). I'm 100% sure i was checking it for unremoveable >> cgroups but not so sure for the other problem (i had to act quickly >> in that case). Are you sure (from stacks) that freezer cgroup was >> enabled there?
> >Yeah, all the traces without exception look like this: > >1372089762/23433/stack:[] refrigerator+0x95/0x160 >1372089762/23433/stack:[] get_signal_to_deliver+0x1cb/0x540 >1372089762/23433/stack:[] do_signal+0x6b/0x750 >1372089762/23433/stack:[] do_notify_resume+0x55/0x80 >1372089762/23433/stack:[] int_signal+0x12/0x17 >1372089762/23433/stack:[] 0xffffffffffffffff > >so the freezer was already enabled when you took the backtraces. > >> Btw, what about that other stacks? I mean this file: >> http://watchdog.sk/lkml/memcg-bug-7.tar.gz >> >> It was taken while running the kernel with your patch and from >> cgroup which was under unresolveable OOM (just like my very original >> problem). > >I looked at these traces too, but none of the tasks are stuck in rmdir >or the OOM path. Some /are/ in the page fault path, but they are >happily doing reclaim and don't appear to be stuck. So I'm having a >hard time matching this data to what you otherwise observed. > >However, based on what you reported the most likely explanation for >the continued hangs is the unfinished OOM handling for which I sent >the followup patch for arch/x86/mm/fault.c. > Johannes, today I tested both of your patches but problem with unremovable cgroups, unfortunately, persists. azur
From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM Date: Tue, 9 Jul 2013 15:00:17 +0200 Message-ID: <20130709130017.GE20281@dhcp22.suse.cz> References: <20130210174619.24F20488@pobox.sk> <20130211112240.GC19922@dhcp22.suse.cz> <20130222092332.4001E4B6@pobox.sk> <20130606160446.GE24115@dhcp22.suse.cz> <20130606181633.BCC3E02E@pobox.sk> <20130607131157.GF8117@dhcp22.suse.cz> <20130617122134.2E072BA8@pobox.sk> <20130619132614.GC16457@dhcp22.suse.cz> <20130622220958.D10567A4@pobox.sk> <20130624201345.GA21822@cmpxchg.org> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20130624201345.GA21822-druUgvl0LCNAfugRpC6u6w@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Johannes Weiner Cc: azurIt , linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist , KAMEZAWA Hiroyuki On Mon 24-06-13 16:13:45, Johannes Weiner wrote: > Hi guys, > > On Sat, Jun 22, 2013 at 10:09:58PM +0200, azurIt wrote: > > >> But i'm sure of one thing - when problem occurs, nothing is able to > > >> access hard drives (every process which tries it is freezed until > > >> problem is resolved or server is rebooted). > > > > > >I would be really interesting to see what those tasks are blocked on. > > > > I'm trying to get it, stay tuned :) > > > > Today i noticed one bug, not 100% sure it is related to 'your' patch > > but i didn't seen this before. I noticed that i have lots of cgroups > > which cannot be removed - if i do 'rmdir ', it > > just hangs and never complete. Even more, it's not possible to > > access the whole cgroup filesystem until i kill that rmdir > > (anything, which tries it, just hangs).
All unremoveable cgroups has > > this in 'memory.oom_control': oom_kill_disable 0 under_oom 1 > > Somebody acquires the OOM wait reference to the memcg and marks it > under oom but then does not call into mem_cgroup_oom_synchronize() to > clean up. That's why under_oom is set and the rmdir waits for > outstanding references. > > > And, yes, 'tasks' file is empty. > > It's not a kernel thread that does it because all kernel-context > handle_mm_fault() are annotated properly, which means the task must be > userspace and, since tasks is empty, have exited before synchronizing. Yes, well spotted. I have missed that while reviewing your patch. The follow up fix looks correct. > Can you try with the following patch on top? > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c > index 5db0490..9a0b152 100644 > --- a/arch/x86/mm/fault.c > +++ b/arch/x86/mm/fault.c > @@ -846,17 +846,6 @@ static noinline int > mm_fault_error(struct pt_regs *regs, unsigned long error_code, > unsigned long address, unsigned int fault) > { > - /* > - * Pagefault was interrupted by SIGKILL. We have no reason to > - * continue pagefault. 
> - */ > - if (fatal_signal_pending(current)) { > - if (!(fault & VM_FAULT_RETRY)) > - up_read(&current->mm->mmap_sem); > - if (!(error_code & PF_USER)) > - no_context(regs, error_code, address); > - return 1; > - } > if (!(fault & VM_FAULT_ERROR)) > return 0; > -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM Date: Tue, 9 Jul 2013 15:08:08 +0200 Message-ID: <20130709130808.GF20281@dhcp22.suse.cz> References: <20130211112240.GC19922@dhcp22.suse.cz> <20130222092332.4001E4B6@pobox.sk> <20130606160446.GE24115@dhcp22.suse.cz> <20130606181633.BCC3E02E@pobox.sk> <20130607131157.GF8117@dhcp22.suse.cz> <20130617122134.2E072BA8@pobox.sk> <20130619132614.GC16457@dhcp22.suse.cz> <20130622220958.D10567A4@pobox.sk> <20130624201345.GA21822@cmpxchg.org> <20130709130017.GE20281@dhcp22.suse.cz> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20130709130017.GE20281-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Johannes Weiner Cc: azurIt , linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist , KAMEZAWA Hiroyuki On Tue 09-07-13 15:00:17, Michal Hocko wrote: > On Mon 24-06-13 16:13:45, Johannes Weiner wrote: > > Hi guys, > > > > On Sat, Jun 22, 2013 at 10:09:58PM +0200, azurIt wrote: > > > >> But i'm sure of one thing - when problem occurs, nothing is able to > > > >> access hard drives (every process which tries it is freezed until > > > >> problem is resolved or server is rebooted). > > > > > > > >I would be really interesting to see what those tasks are blocked on. > > > > > > I'm trying to get it, stay tuned :) > > > > > > Today i noticed one bug, not 100% sure it is related to 'your' patch > > > but i didn't seen this before.
I noticed that i have lots of cgroups > > > which cannot be removed - if i do 'rmdir ', it > > > just hangs and never complete. Even more, it's not possible to > > > access the whole cgroup filesystem until i kill that rmdir > > > (anything, which tries it, just hangs). All unremoveable cgroups has > > > this in 'memory.oom_control': oom_kill_disable 0 under_oom 1 > > > > Somebody acquires the OOM wait reference to the memcg and marks it > > under oom but then does not call into mem_cgroup_oom_synchronize() to > > clean up. That's why under_oom is set and the rmdir waits for > > outstanding references. > > > > > And, yes, 'tasks' file is empty. > > > > It's not a kernel thread that does it because all kernel-context > > handle_mm_fault() are annotated properly, which means the task must be > > userspace and, since tasks is empty, have exited before synchronizing. > > Yes, well spotted. I have missed that while reviewing your patch. > The follow up fix looks correct. Hmm, I guess you wanted to remove !(fault & VM_FAULT_ERROR) test as well otherwise the else BUG() path would be unreachable and we wouldn't know that something fishy is going on. > > Can you try with the following patch on top? > > > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c > > index 5db0490..9a0b152 100644 > > --- a/arch/x86/mm/fault.c > > +++ b/arch/x86/mm/fault.c > > @@ -846,17 +846,6 @@ static noinline int > > mm_fault_error(struct pt_regs *regs, unsigned long error_code, > > unsigned long address, unsigned int fault) > > { > > - /* > > - * Pagefault was interrupted by SIGKILL. We have no reason to > > - * continue pagefault. 
> > - */ > > - if (fatal_signal_pending(current)) { > > - if (!(fault & VM_FAULT_RETRY)) > > - up_read(&current->mm->mmap_sem); > > - if (!(error_code & PF_USER)) > > - no_context(regs, error_code, address); > > - return 1; > > - } > > if (!(fault & VM_FAULT_ERROR)) > > return 0; > > > > -- > Michal Hocko > SUSE Labs > -- > To unsubscribe from this list: send the line "unsubscribe cgroups" in > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM Date: Tue, 9 Jul 2013 15:10:00 +0200 Message-ID: <20130709131000.GG20281@dhcp22.suse.cz> References: <20130222092332.4001E4B6@pobox.sk> <20130606160446.GE24115@dhcp22.suse.cz> <20130606181633.BCC3E02E@pobox.sk> <20130607131157.GF8117@dhcp22.suse.cz> <20130617122134.2E072BA8@pobox.sk> <20130619132614.GC16457@dhcp22.suse.cz> <20130622220958.D10567A4@pobox.sk> <20130624201345.GA21822@cmpxchg.org> <20130709130017.GE20281@dhcp22.suse.cz> <20130709130808.GF20281@dhcp22.suse.cz> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20130709130808.GF20281-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Johannes Weiner Cc: azurIt , linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist , KAMEZAWA Hiroyuki On Tue 09-07-13 15:08:08, Michal Hocko wrote: > On Tue 09-07-13 15:00:17, Michal Hocko wrote: > > On Mon 24-06-13 16:13:45, Johannes Weiner wrote: > > > Hi guys, > > > > > > On Sat, Jun 22, 2013 at 10:09:58PM +0200, azurIt wrote: > > > > >> But i'm sure of one thing - when problem occurs, nothing is able to > > > > >> access hard drives (every process
which tries it is freezed until > > > > >> problem is resolved or server is rebooted). > > > > > > > > > >I would be really interesting to see what those tasks are blocked on. > > > > > > > > I'm trying to get it, stay tuned :) > > > > > > > > Today i noticed one bug, not 100% sure it is related to 'your' patch > > > > but i didn't seen this before. I noticed that i have lots of cgroups > > > > which cannot be removed - if i do 'rmdir ', it > > > > just hangs and never complete. Even more, it's not possible to > > > > access the whole cgroup filesystem until i kill that rmdir > > > > (anything, which tries it, just hangs). All unremoveable cgroups has > > > > this in 'memory.oom_control': oom_kill_disable 0 under_oom 1 > > > > > > Somebody acquires the OOM wait reference to the memcg and marks it > > > under oom but then does not call into mem_cgroup_oom_synchronize() to > > > clean up. That's why under_oom is set and the rmdir waits for > > > outstanding references. > > > > > > > And, yes, 'tasks' file is empty. > > > > > > It's not a kernel thread that does it because all kernel-context > > > handle_mm_fault() are annotated properly, which means the task must be > > > userspace and, since tasks is empty, have exited before synchronizing. > > > > Yes, well spotted. I have missed that while reviewing your patch. > > The follow up fix looks correct. > > Hmm, I guess you wanted to remove !(fault & VM_FAULT_ERROR) test as well > otherwise the else BUG() path would be unreachable and we wouldn't know > that something fishy is going on. No, scratch it! We need it for VM_FAULT_RETRY. Sorry about the noise. 
-- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM Date: Tue, 9 Jul 2013 15:10:29 +0200 Message-ID: <20130709131029.GH20281@dhcp22.suse.cz> References: <20130607131157.GF8117@dhcp22.suse.cz> <20130617122134.2E072BA8@pobox.sk> <20130619132614.GC16457@dhcp22.suse.cz> <20130622220958.D10567A4@pobox.sk> <20130624201345.GA21822@cmpxchg.org> <20130628120613.6D6CAD21@pobox.sk> <20130705181728.GQ17812@cmpxchg.org> <20130705210246.11D2135A@pobox.sk> <20130705191854.GR17812@cmpxchg.org> <20130708014224.50F06960@pobox.sk> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20130708014224.50F06960@pobox.sk> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: azurIt Cc: Johannes Weiner , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki On Mon 08-07-13 01:42:24, azurIt wrote: > > CC: "Michal Hocko" , linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" > >On Fri, Jul 05, 2013 at 09:02:46PM +0200, azurIt wrote: > >> >I looked at your debug messages but could not find anything that would > >> >hint at a deadlock. All tasks are stuck in the refrigerator, so I > >> >assume you use the freezer cgroup and enabled it somehow? > >> > >> > >> Yes, i'm really using freezer cgroup BUT i was checking if it's not > >> doing problems - unfortunately, several days passed from that day > >> and now i don't fully remember if i was checking it for both cases > >> (unremoveabled cgroups and these freezed processes holding web > >> server port). I'm 100% sure i was checking it for unremoveable > >> cgroups but not so sure for the other problem (i had to act quickly > >> in that case). Are you sure (from stacks) that freezer cgroup was > >> enabled there? 
> > > >Yeah, all the traces without exception look like this: > > > >1372089762/23433/stack:[] refrigerator+0x95/0x160 > >1372089762/23433/stack:[] get_signal_to_deliver+0x1cb/0x540 > >1372089762/23433/stack:[] do_signal+0x6b/0x750 > >1372089762/23433/stack:[] do_notify_resume+0x55/0x80 > >1372089762/23433/stack:[] int_signal+0x12/0x17 > >1372089762/23433/stack:[] 0xffffffffffffffff > > > >so the freezer was already enabled when you took the backtraces. > > > >> Btw, what about that other stacks? I mean this file: > >> http://watchdog.sk/lkml/memcg-bug-7.tar.gz > >> > >> It was taken while running the kernel with your patch and from > >> cgroup which was under unresolveable OOM (just like my very original > >> problem). > > > >I looked at these traces too, but none of the tasks are stuck in rmdir > >or the OOM path. Some /are/ in the page fault path, but they are > >happily doing reclaim and don't appear to be stuck. So I'm having a > >hard time matching this data to what you otherwise observed. Agreed. > >However, based on what you reported the most likely explanation for > >the continued hangs is the unfinished OOM handling for which I sent > >the followup patch for arch/x86/mm/fault.c. > > Johannes, > > today I tested both of your patches but problem with unremovable > cgroups, unfortunately, persists. Is the group empty again with marked under_oom? -- Michal Hocko SUSE Labs
From mboxrd@z Thu Jan 1 00:00:00 1970 From: "azurIt" Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=5D_memcg=3A_do_not_trap_chargers_with_full_callstack_on_OOM?= Date: Tue, 09 Jul 2013 15:19:21 +0200 Message-ID: <20130709151921.5160C199@pobox.sk> References: <20130607131157.GF8117@dhcp22.suse.cz>, <20130617122134.2E072BA8@pobox.sk>, <20130619132614.GC16457@dhcp22.suse.cz>, <20130622220958.D10567A4@pobox.sk>, <20130624201345.GA21822@cmpxchg.org>, <20130628120613.6D6CAD21@pobox.sk>, <20130705181728.GQ17812@cmpxchg.org>, <20130705210246.11D2135A@pobox.sk>, <20130705191854.GR17812@cmpxchg.org>, <20130708014224.50F06960@pobox.sk> <20130709131029.GH20281@dhcp22.suse.cz> Mime-Version: 1.0 Content-Transfer-Encoding: quoted-printable Return-path: In-Reply-To: <20130709131029.GH20281@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" To: =?utf-8?q?Michal_Hocko?= Cc: =?utf-8?q?Johannes_Weiner?= , linux-kernel@vger.kernel.org, linux-mm@kvack.org, =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= >On Mon 08-07-13 01:42:24, azurIt wrote: >> > CC: "Michal Hocko" , linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" >> >On Fri, Jul 05, 2013 at 09:02:46PM +0200, azurIt wrote: >> >> >I looked at your debug messages but could not find anything that would >> >> >hint at a deadlock. All tasks are stuck in the refrigerator, so I >> >> >assume you use the freezer cgroup and enabled it somehow? >> >> >> >> >> >> Yes, i'm really using freezer cgroup BUT i was checking if it's not >> >> doing problems - unfortunately, several days passed from that day >> >> and now i don't fully remember if i was checking it for both cases >> >> (unremoveabled cgroups and these freezed processes holding web >> >> server port).
I'm 100% sure i was checking it for unremoveable >> >> cgroups but not so sure for the other problem (i had to act quickly >> >> in that case). Are you sure (from stacks) that freezer cgroup was >> >> enabled there? >> > >> >Yeah, all the traces without exception look like this: >> > >> >1372089762/23433/stack:[] refrigerator+0x95/0x160 >> >1372089762/23433/stack:[] get_signal_to_deliver+0x1cb/0x540 >> >1372089762/23433/stack:[] do_signal+0x6b/0x750 >> >1372089762/23433/stack:[] do_notify_resume+0x55/0x80 >> >1372089762/23433/stack:[] int_signal+0x12/0x17 >> >1372089762/23433/stack:[] 0xffffffffffffffff >> > >> >so the freezer was already enabled when you took the backtraces. >> > >> >> Btw, what about that other stacks? I mean this file: >> >> http://watchdog.sk/lkml/memcg-bug-7.tar.gz >> >> >> >> It was taken while running the kernel with your patch and from >> >> cgroup which was under unresolveable OOM (just like my very original >> >> problem). >> > >> >I looked at these traces too, but none of the tasks are stuck in rmdir >> >or the OOM path. Some /are/ in the page fault path, but they are >> >happily doing reclaim and don't appear to be stuck. So I'm having a >> >hard time matching this data to what you otherwise observed. > >Agreed. > >> >However, based on what you reported the most likely explanation for >> >the continued hangs is the unfinished OOM handling for which I sent >> >the followup patch for arch/x86/mm/fault.c. >> >> Johannes, >> >> today I tested both of your patches but problem with unremovable >> cgroups, unfortunately, persists. > >Is the group empty again with marked under_oom? Now i realized that i forgot to remove UID from that cgroup before trying to remove it, so cgroup cannot be removed anyway (we are using third party cgroup called cgroup-uid from Andrea Righi, which is able to associate all user's processes with target cgroup).
Look here for cgroup-uid patch: https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch

ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was permanently '1'.

azur

From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM Date: Tue, 9 Jul 2013 15:54:50 +0200 Message-ID: <20130709135450.GI20281@dhcp22.suse.cz> References: <20130619132614.GC16457@dhcp22.suse.cz> <20130622220958.D10567A4@pobox.sk> <20130624201345.GA21822@cmpxchg.org> <20130628120613.6D6CAD21@pobox.sk> <20130705181728.GQ17812@cmpxchg.org> <20130705210246.11D2135A@pobox.sk> <20130705191854.GR17812@cmpxchg.org> <20130708014224.50F06960@pobox.sk> <20130709131029.GH20281@dhcp22.suse.cz> <20130709151921.5160C199@pobox.sk> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20130709151921.5160C199-Rm0zKEqwvD4@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: azurIt Cc: Johannes Weiner , linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist , KAMEZAWA Hiroyuki

On Tue 09-07-13 15:19:21, azurIt wrote: [...] > Now i realized that i forgot to remove UID from that cgroup before > trying to remove it, so cgroup cannot be removed anyway (we are using > third party cgroup called cgroup-uid from Andrea Righi, which is able > to associate all user's processes with target cgroup). Look here for > cgroup-uid patch: > https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch > > ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was > permanently '1'.

This is really strange.
Could you post the whole diff against stable tree you are using (except for grsecurity stuff and the above cgroup-uid patch)?

Btw. the below patch might help us to point to the exit path which leaves wait_on_memcg without mem_cgroup_oom_synchronize:
---
diff --git a/kernel/exit.c b/kernel/exit.c
index e6e01b9..ad472e0 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -895,6 +895,7 @@ NORET_TYPE void do_exit(long code)
 
 	profile_task_exit(tsk);
 
+	WARN_ON(current->memcg_oom.wait_on_memcg);
 	WARN_ON(blk_needs_flush_plug(tsk));
 
 	if (unlikely(in_interrupt()))
-- 
Michal Hocko
SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970 From: "azurIt" Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=5D_memcg=3A_do_not_trap_chargers_with_full_callstack_on_OOM?= Date: Wed, 10 Jul 2013 18:25:06 +0200 Message-ID: <20130710182506.F25DF461@pobox.sk> References: <20130619132614.GC16457@dhcp22.suse.cz>, <20130622220958.D10567A4@pobox.sk>, <20130624201345.GA21822@cmpxchg.org>, <20130628120613.6D6CAD21@pobox.sk>, <20130705181728.GQ17812@cmpxchg.org>, <20130705210246.11D2135A@pobox.sk>, <20130705191854.GR17812@cmpxchg.org>, <20130708014224.50F06960@pobox.sk>, <20130709131029.GH20281@dhcp22.suse.cz>, <20130709151921.5160C199@pobox.sk> <20130709135450.GI20281@dhcp22.suse.cz> Mime-Version: 1.0 Content-Transfer-Encoding: quoted-printable Return-path: In-Reply-To: <20130709135450.GI20281@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" To: =?utf-8?q?Michal_Hocko?= Cc: =?utf-8?q?Johannes_Weiner?= , linux-kernel@vger.kernel.org, linux-mm@kvack.org, =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , righi.andrea@gmail.com

>> Now i realized that i forgot to remove UID from that cgroup before >> trying to remove it, so cgroup cannot be removed anyway (we are using >> third party cgroup called cgroup-uid from Andrea Righi, which is able >> to associate all user's processes with target cgroup).
Look here for >> cgroup-uid patch: >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch >> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was >> permanently '1'. > >This is really strange. Could you post the whole diff against stable >tree you are using (except for grsecurity stuff and the above cgroup-uid >patch)?

Here are all patches which i applied to kernel 3.2.48 in my last test: http://watchdog.sk/lkml/patches3/

Patches marked as 7-* are from Johannes. I'm applying them in order, except the grsecurity patch, which goes first.

azur

>Btw. the below patch might help us to point to the exit path which >leaves wait_on_memcg without mem_cgroup_oom_synchronize:
>---
>diff --git a/kernel/exit.c b/kernel/exit.c
>index e6e01b9..ad472e0 100644
>--- a/kernel/exit.c
>+++ b/kernel/exit.c
>@@ -895,6 +895,7 @@ NORET_TYPE void do_exit(long code)
> 
> 	profile_task_exit(tsk);
> 
>+	WARN_ON(current->memcg_oom.wait_on_memcg);
> 	WARN_ON(blk_needs_flush_plug(tsk));
> 
> 	if (unlikely(in_interrupt()))
>-- 
>Michal Hocko
>SUSE Labs
>

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM Date: Thu, 11 Jul 2013 09:25:07 +0200 Message-ID: <20130711072507.GA21667@dhcp22.suse.cz> References: <20130624201345.GA21822@cmpxchg.org> <20130628120613.6D6CAD21@pobox.sk> <20130705181728.GQ17812@cmpxchg.org> <20130705210246.11D2135A@pobox.sk> <20130705191854.GR17812@cmpxchg.org> <20130708014224.50F06960@pobox.sk> <20130709131029.GH20281@dhcp22.suse.cz> <20130709151921.5160C199@pobox.sk> <20130709135450.GI20281@dhcp22.suse.cz> <20130710182506.F25DF461@pobox.sk> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20130710182506.F25DF461@pobox.sk> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: azurIt Cc: Johannes Weiner , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , righi.andrea@gmail.com On Wed 10-07-13 18:25:06, azurIt wrote: > >> Now i realized that i forgot to remove UID from that cgroup before > >> trying to remove it, so cgroup cannot be removed anyway (we are using > >> third party cgroup called cgroup-uid from Andrea Righi, which is able > >> to associate all user's processes with target cgroup). Look here for > >> cgroup-uid patch: > >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch > >> > >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was > >> permanently '1'. > > > >This is really strange. Could you post the whole diff against stable > >tree you are using (except for grsecurity stuff and the above cgroup-uid > >patch)? > > > Here are all patches which i applied to kernel 3.2.48 in my last test: > http://watchdog.sk/lkml/patches3/ The two patches from Johannes seem correct. 
>From a quick look even grsecurity patchset shouldn't interfere as it doesn't seem to put any code between handle_mm_fault and mm_fault_error and there also doesn't seem to be any new handle_mm_fault call sites.

But I cannot tell there aren't other code paths which would lead to a memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling.

-- Michal Hocko SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970 From: "azurIt" Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=5D_memcg=3A_do_not_trap_chargers_with_full_callstack_on_OOM?= Date: Sun, 14 Jul 2013 01:26:41 +0200 Message-ID: <20130714012641.C2DA4E05@pobox.sk> References: <20130624201345.GA21822@cmpxchg.org>, <20130628120613.6D6CAD21@pobox.sk>, <20130705181728.GQ17812@cmpxchg.org>, <20130705210246.11D2135A@pobox.sk>, <20130705191854.GR17812@cmpxchg.org>, <20130708014224.50F06960@pobox.sk>, <20130709131029.GH20281@dhcp22.suse.cz>, <20130709151921.5160C199@pobox.sk>, <20130709135450.GI20281@dhcp22.suse.cz>, <20130710182506.F25DF461@pobox.sk> <20130711072507.GA21667@dhcp22.suse.cz> Mime-Version: 1.0 Content-Transfer-Encoding: quoted-printable Return-path: In-Reply-To: <20130711072507.GA21667@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" To: =?utf-8?q?Michal_Hocko?= Cc: =?utf-8?q?Johannes_Weiner?= , linux-kernel@vger.kernel.org, linux-mm@kvack.org, =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , righi.andrea@gmail.com

> CC: "Johannes Weiner" , linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , righi.andrea@gmail.com >On Wed 10-07-13 18:25:06, azurIt wrote: >> >> Now i realized that i forgot to remove UID from that cgroup before >> >> trying to remove it, so cgroup cannot be removed anyway (we are using >> >>
third party cgroup called cgroup-uid from Andrea Righi, which is able >> >> to associate all user's processes with target cgroup). Look here for >> >> cgroup-uid patch: >> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch >> >> >> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was >> >> permanently '1'. >> > >> >This is really strange. Could you post the whole diff against stable >> >tree you are using (except for grsecurity stuff and the above cgroup-uid >> >patch)? >> >> >> Here are all patches which i applied to kernel 3.2.48 in my last test: >> http://watchdog.sk/lkml/patches3/ > >The two patches from Johannes seem correct. > >From a quick look even grsecurity patchset shouldn't interfere as it >doesn't seem to put any code between handle_mm_fault and mm_fault_error >and there also doesn't seem to be any new handle_mm_fault call sites. > >But I cannot tell there aren't other code paths which would lead to a >memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling.

Michal,

now i can definitely confirm that problem with unremovable cgroups persists. What info do you need from me? I applied also your little 'WARN_ON' patch.

azur

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: "azurIt" Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=5D_memcg=3A_do_not_trap_chargers_with_full_callstack_on_OOM?= Date: Sun, 14 Jul 2013 01:51:12 +0200 Message-ID: <20130714015112.FFCB7AF7@pobox.sk> References: <20130624201345.GA21822@cmpxchg.org>, <20130628120613.6D6CAD21@pobox.sk>, <20130705181728.GQ17812@cmpxchg.org>, <20130705210246.11D2135A@pobox.sk>, <20130705191854.GR17812@cmpxchg.org>, <20130708014224.50F06960@pobox.sk>, <20130709131029.GH20281@dhcp22.suse.cz>, <20130709151921.5160C199@pobox.sk>, <20130709135450.GI20281@dhcp22.suse.cz>, <20130710182506.F25DF461@pobox.sk>, <20130711072507.GA21667@dhcp22.suse.cz> <20130714012641.C2DA4E05@pobox.sk> Mime-Version: 1.0 Content-Transfer-Encoding: quoted-printable Return-path: In-Reply-To: <20130714012641.C2DA4E05@pobox.sk> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" To: =?utf-8?q?Michal_Hocko?= Cc: =?utf-8?q?Johannes_Weiner?= , linux-kernel@vger.kernel.org, linux-mm@kvack.org, =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , righi.andrea@gmail.com > CC: "Johannes Weiner" , linux-kernel@vger.kernel.or= g, linux-mm@kvack.org, "cgroups mailinglist" , "= KAMEZAWA Hiroyuki" , righi.andrea@gmail.c= om >> CC: "Johannes Weiner" , linux-kernel@vger.kernel.o= rg, linux-mm@kvack.org, "cgroups mailinglist" , = "KAMEZAWA Hiroyuki" , righi.andrea@gmail.= com >>On Wed 10-07-13 18:25:06, azurIt wrote: >>> >> Now i realized that i forgot to remove UID from that cgroup before >>> >> trying to remove it, so cgroup cannot be removed anyway (we are us= ing >>> >> third party cgroup called cgroup-uid from Andrea Righi, which is a= ble >>> >> to associate all user's processes with target cgroup). 
Look here for >>> >> cgroup-uid patch: >>> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch >>> >> >>> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was >>> >> permanently '1'. >>> > >>> >This is really strange. Could you post the whole diff against stable >>> >tree you are using (except for grsecurity stuff and the above cgroup-uid >>> >patch)? >>> >>> >>> Here are all patches which i applied to kernel 3.2.48 in my last test: >>> http://watchdog.sk/lkml/patches3/ >> >>The two patches from Johannes seem correct. >> >>>From a quick look even grsecurity patchset shouldn't interfere as it >>doesn't seem to put any code between handle_mm_fault and mm_fault_error >>and there also doesn't seem to be any new handle_mm_fault call sites. >> >>But I cannot tell there aren't other code paths which would lead to a >>memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling. > > >Michal, > >now i can definitely confirm that problem with unremovable cgroups persists. What info do you need from me? I applied also your little 'WARN_ON' patch. > >azur

Ok, i think you want this: http://watchdog.sk/lkml/kern4.log

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: "azurIt" Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=5D_memcg=3A_do_not_trap_chargers_with_full_callstack_on_OOM?= Date: Sun, 14 Jul 2013 19:07:23 +0200 Message-ID: <20130714190723.BF406E48@pobox.sk> References: <20130606160446.GE24115@dhcp22.suse.cz>, <20130606181633.BCC3E02E@pobox.sk>, <20130607131157.GF8117@dhcp22.suse.cz>, <20130617122134.2E072BA8@pobox.sk>, <20130619132614.GC16457@dhcp22.suse.cz>, <20130622220958.D10567A4@pobox.sk>, <20130624201345.GA21822@cmpxchg.org>, <20130628120613.6D6CAD21@pobox.sk>, <20130705181728.GQ17812@cmpxchg.org>, <20130705210246.11D2135A@pobox.sk> <20130705191854.GR17812@cmpxchg.org> Mime-Version: 1.0 Content-Transfer-Encoding: quoted-printable Return-path: In-Reply-To: <20130705191854.GR17812@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" To: =?utf-8?q?Johannes_Weiner?= Cc: =?utf-8?q?Michal_Hocko?= , linux-kernel@vger.kernel.org, linux-mm@kvack.org, =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= > CC: "Michal Hocko" , linux-kernel@vger.kernel.org, linu= x-mm@kvack.org, "cgroups mailinglist" , "KAMEZAW= A Hiroyuki" >On Fri, Jul 05, 2013 at 09:02:46PM +0200, azurIt wrote: >> >I looked at your debug messages but could not find anything that woul= d >> >hint at a deadlock. All tasks are stuck in the refrigerator, so I >> >assume you use the freezer cgroup and enabled it somehow? >>=20 >>=20 >> Yes, i'm really using freezer cgroup BUT i was checking if it's not >> doing problems - unfortunately, several days passed from that day >> and now i don't fully remember if i was checking it for both cases >> (unremoveabled cgroups and these freezed processes holding web >> server port). I'm 100% sure i was checking it for unremoveable >> cgroups but not so sure for the other problem (i had to act quickly >> in that case). Are you sure (from stacks) that freezer cgroup was >> enabled there? 
> >Yeah, all the traces without exception look like this: > >1372089762/23433/stack:[] refrigerator+0x95/0x160 >1372089762/23433/stack:[] get_signal_to_deliver+0x1cb/0x540 >1372089762/23433/stack:[] do_signal+0x6b/0x750 >1372089762/23433/stack:[] do_notify_resume+0x55/0x80 >1372089762/23433/stack:[] int_signal+0x12/0x17 >1372089762/23433/stack:[] 0xffffffffffffffff > >so the freezer was already enabled when you took the backtraces. > >> Btw, what about that other stacks? I mean this file: >> http://watchdog.sk/lkml/memcg-bug-7.tar.gz >> >> It was taken while running the kernel with your patch and from >> cgroup which was under unresolveable OOM (just like my very original >> problem). > >I looked at these traces too, but none of the tasks are stuck in rmdir >or the OOM path. Some /are/ in the page fault path, but they are >happily doing reclaim and don't appear to be stuck. So I'm having a >hard time matching this data to what you otherwise observed. > >However, based on what you reported the most likely explanation for >the continued hangs is the unfinished OOM handling for which I sent >the followup patch for arch/x86/mm/fault.c.

Johannes,

this problem happened again but was even worse, now i'm sure it wasn't my fault. This time I wasn't even able to access /proc/ of the hung apache process (which was, again, holding the web server port and forced me to reboot the server). Everything which tried to access /proc/ just hung. The server wasn't even able to reboot correctly; it hung and then did a hard reboot after a few minutes.

azur

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM Date: Mon, 15 Jul 2013 17:41:19 +0200 Message-ID: <20130715154119.GA32435@dhcp22.suse.cz> References: <20130705210246.11D2135A@pobox.sk> <20130705191854.GR17812@cmpxchg.org> <20130708014224.50F06960@pobox.sk> <20130709131029.GH20281@dhcp22.suse.cz> <20130709151921.5160C199@pobox.sk> <20130709135450.GI20281@dhcp22.suse.cz> <20130710182506.F25DF461@pobox.sk> <20130711072507.GA21667@dhcp22.suse.cz> <20130714012641.C2DA4E05@pobox.sk> <20130714015112.FFCB7AF7@pobox.sk> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20130714015112.FFCB7AF7-Rm0zKEqwvD4@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: azurIt Cc: Johannes Weiner , linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist , KAMEZAWA Hiroyuki , righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org On Sun 14-07-13 01:51:12, azurIt wrote: > > CC: "Johannes Weiner" , linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org > >> CC: "Johannes Weiner" , linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org > >>On Wed 10-07-13 18:25:06, azurIt wrote: > >>> >> Now i realized that i forgot to remove UID from that cgroup before > >>> >> trying to remove it, so cgroup cannot be removed anyway (we are using > >>> >> third party cgroup called cgroup-uid from Andrea Righi, which is able > >>> >> to associate all user's processes with target cgroup). 
Look here for > >>> >> cgroup-uid patch: > >>> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch > >>> >> > >>> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was > >>> >> permanently '1'. > >>> > > >>> >This is really strange. Could you post the whole diff against stable > >>> >tree you are using (except for grsecurity stuff and the above cgroup-uid > >>> >patch)? > >>> > >>> > >>> Here are all patches which i applied to kernel 3.2.48 in my last test: > >>> http://watchdog.sk/lkml/patches3/ > >> > >>The two patches from Johannes seem correct. > >> > >>From a quick look even grsecurity patchset shouldn't interfere as it > >>doesn't seem to put any code between handle_mm_fault and mm_fault_error > >>and there also doesn't seem to be any new handle_mm_fault call sites. > >> > >>But I cannot tell there aren't other code paths which would lead to a > >>memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling. > > > > > >Michal, > > > >now i can definitely confirm that problem with unremovable cgroups > >persists. What info do you need from me? I applied also your little > >'WARN_ON' patch. 
> > Ok, i think you want this: > http://watchdog.sk/lkml/kern4.log Jul 14 01:11:39 server01 kernel: [ 593.589087] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name Jul 14 01:11:39 server01 kernel: [ 593.589451] [12021] 1333 12021 172027 64723 4 0 0 apache2 Jul 14 01:11:39 server01 kernel: [ 593.589647] [12030] 1333 12030 172030 64748 2 0 0 apache2 Jul 14 01:11:39 server01 kernel: [ 593.589836] [12031] 1333 12031 172030 64749 3 0 0 apache2 Jul 14 01:11:39 server01 kernel: [ 593.590025] [12032] 1333 12032 170619 63428 3 0 0 apache2 Jul 14 01:11:39 server01 kernel: [ 593.590213] [12033] 1333 12033 167934 60524 2 0 0 apache2 Jul 14 01:11:39 server01 kernel: [ 593.590401] [12034] 1333 12034 170747 63496 4 0 0 apache2 Jul 14 01:11:39 server01 kernel: [ 593.590588] [12035] 1333 12035 169659 62451 1 0 0 apache2 Jul 14 01:11:39 server01 kernel: [ 593.590776] [12036] 1333 12036 167614 60384 3 0 0 apache2 Jul 14 01:11:39 server01 kernel: [ 593.590984] [12037] 1333 12037 166342 58964 3 0 0 apache2 Jul 14 01:11:39 server01 kernel: [ 593.591178] Memory cgroup out of memory: Kill process 12021 (apache2) score 847 or sacrifice child Jul 14 01:11:39 server01 kernel: [ 593.591370] Killed process 12021 (apache2) total-vm:688108kB, anon-rss:255472kB, file-rss:3420kB Jul 14 01:11:41 server01 kernel: [ 595.392920] ------------[ cut here ]------------ Jul 14 01:11:41 server01 kernel: [ 595.393096] WARNING: at kernel/exit.c:888 do_exit+0x7d0/0x870() Jul 14 01:11:41 server01 kernel: [ 595.393256] Hardware name: S5000VSA Jul 14 01:11:41 server01 kernel: [ 595.393415] Pid: 12037, comm: apache2 Not tainted 3.2.48-grsec #1 Jul 14 01:11:41 server01 kernel: [ 595.393577] Call Trace: Jul 14 01:11:41 server01 kernel: [ 595.393737] [] warn_slowpath_common+0x7a/0xb0 Jul 14 01:11:41 server01 kernel: [ 595.393903] [] warn_slowpath_null+0x1a/0x20 Jul 14 01:11:41 server01 kernel: [ 595.394068] [] do_exit+0x7d0/0x870 Jul 14 01:11:41 server01 kernel: [ 595.394231] [] ? 
thread_group_times+0x44/0xb0 Jul 14 01:11:41 server01 kernel: [ 595.394392] [] do_group_exit+0x51/0xc0 Jul 14 01:11:41 server01 kernel: [ 595.394551] [] sys_exit_group+0x17/0x20 Jul 14 01:11:41 server01 kernel: [ 595.394714] [] system_call_fastpath+0x18/0x1d Jul 14 01:11:41 server01 kernel: [ 595.394921] ---[ end trace 738570e688acf099 ]--- OK, so you had an OOM which has been handled by in-kernel oom handler (it killed 12021) and 12037 was in the same group. The warning tells us that it went through mem_cgroup_oom as well (otherwise it wouldn't have memcg_oom.wait_on_memcg set and the warning wouldn't trigger) and then it exited on the userspace request (by exit syscall). I do not see any way how, this could happen though. If mem_cgroup_oom is called then we always return CHARGE_NOMEM which turns into ENOMEM returned by __mem_cgroup_try_charge (invoke_oom must have been set to true). So if nobody screwed the return value on the way up to page fault handler then there is no way to escape. I will check the code. 
-- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM Date: Mon, 15 Jul 2013 18:00:06 +0200 Message-ID: <20130715160006.GB32435@dhcp22.suse.cz> References: <20130705191854.GR17812@cmpxchg.org> <20130708014224.50F06960@pobox.sk> <20130709131029.GH20281@dhcp22.suse.cz> <20130709151921.5160C199@pobox.sk> <20130709135450.GI20281@dhcp22.suse.cz> <20130710182506.F25DF461@pobox.sk> <20130711072507.GA21667@dhcp22.suse.cz> <20130714012641.C2DA4E05@pobox.sk> <20130714015112.FFCB7AF7@pobox.sk> <20130715154119.GA32435@dhcp22.suse.cz> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20130715154119.GA32435-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: azurIt Cc: Johannes Weiner , linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist , KAMEZAWA Hiroyuki , righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org On Mon 15-07-13 17:41:19, Michal Hocko wrote: > On Sun 14-07-13 01:51:12, azurIt wrote: > > > CC: "Johannes Weiner" , linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org > > >> CC: "Johannes Weiner" , linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org > > >>On Wed 10-07-13 18:25:06, azurIt wrote: > > >>> >> Now i realized that i forgot to remove UID from that cgroup before > > >>> >> trying to remove it, so cgroup cannot be removed anyway (we are using > > >>> >> third party cgroup called cgroup-uid from Andrea Righi, which is able 
> > >>> >> to associate all user's processes with target cgroup). Look here for > > >>> >> cgroup-uid patch: > > >>> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch > > >>> >> > > >>> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was > > >>> >> permanently '1'. > > >>> > > > >>> >This is really strange. Could you post the whole diff against stable > > >>> >tree you are using (except for grsecurity stuff and the above cgroup-uid > > >>> >patch)? > > >>> > > >>> > > >>> Here are all patches which i applied to kernel 3.2.48 in my last test: > > >>> http://watchdog.sk/lkml/patches3/ > > >> > > >>The two patches from Johannes seem correct. > > >> > > >>From a quick look even grsecurity patchset shouldn't interfere as it > > >>doesn't seem to put any code between handle_mm_fault and mm_fault_error > > >>and there also doesn't seem to be any new handle_mm_fault call sites. > > >> > > >>But I cannot tell there aren't other code paths which would lead to a > > >>memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling. > > > > > > > > >Michal, > > > > > >now i can definitely confirm that problem with unremovable cgroups > > >persists. What info do you need from me? I applied also your little > > >'WARN_ON' patch. 
> > > > Ok, i think you want this: > > http://watchdog.sk/lkml/kern4.log > > Jul 14 01:11:39 server01 kernel: [ 593.589087] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name > Jul 14 01:11:39 server01 kernel: [ 593.589451] [12021] 1333 12021 172027 64723 4 0 0 apache2 > Jul 14 01:11:39 server01 kernel: [ 593.589647] [12030] 1333 12030 172030 64748 2 0 0 apache2 > Jul 14 01:11:39 server01 kernel: [ 593.589836] [12031] 1333 12031 172030 64749 3 0 0 apache2 > Jul 14 01:11:39 server01 kernel: [ 593.590025] [12032] 1333 12032 170619 63428 3 0 0 apache2 > Jul 14 01:11:39 server01 kernel: [ 593.590213] [12033] 1333 12033 167934 60524 2 0 0 apache2 > Jul 14 01:11:39 server01 kernel: [ 593.590401] [12034] 1333 12034 170747 63496 4 0 0 apache2 > Jul 14 01:11:39 server01 kernel: [ 593.590588] [12035] 1333 12035 169659 62451 1 0 0 apache2 > Jul 14 01:11:39 server01 kernel: [ 593.590776] [12036] 1333 12036 167614 60384 3 0 0 apache2 > Jul 14 01:11:39 server01 kernel: [ 593.590984] [12037] 1333 12037 166342 58964 3 0 0 apache2 > Jul 14 01:11:39 server01 kernel: [ 593.591178] Memory cgroup out of memory: Kill process 12021 (apache2) score 847 or sacrifice child > Jul 14 01:11:39 server01 kernel: [ 593.591370] Killed process 12021 (apache2) total-vm:688108kB, anon-rss:255472kB, file-rss:3420kB > Jul 14 01:11:41 server01 kernel: [ 595.392920] ------------[ cut here ]------------ > Jul 14 01:11:41 server01 kernel: [ 595.393096] WARNING: at kernel/exit.c:888 do_exit+0x7d0/0x870() > Jul 14 01:11:41 server01 kernel: [ 595.393256] Hardware name: S5000VSA > Jul 14 01:11:41 server01 kernel: [ 595.393415] Pid: 12037, comm: apache2 Not tainted 3.2.48-grsec #1 > Jul 14 01:11:41 server01 kernel: [ 595.393577] Call Trace: > Jul 14 01:11:41 server01 kernel: [ 595.393737] [] warn_slowpath_common+0x7a/0xb0 > Jul 14 01:11:41 server01 kernel: [ 595.393903] [] warn_slowpath_null+0x1a/0x20 > Jul 14 01:11:41 server01 kernel: [ 595.394068] [] do_exit+0x7d0/0x870 > Jul 14 01:11:41 server01 
kernel: [ 595.394231] [] ? thread_group_times+0x44/0xb0 > Jul 14 01:11:41 server01 kernel: [ 595.394392] [] do_group_exit+0x51/0xc0 > Jul 14 01:11:41 server01 kernel: [ 595.394551] [] sys_exit_group+0x17/0x20 > Jul 14 01:11:41 server01 kernel: [ 595.394714] [] system_call_fastpath+0x18/0x1d > Jul 14 01:11:41 server01 kernel: [ 595.394921] ---[ end trace 738570e688acf099 ]--- > > OK, so you had an OOM which has been handled by in-kernel oom handler > (it killed 12021) and 12037 was in the same group. The warning tells us > that it went through mem_cgroup_oom as well (otherwise it wouldn't have > memcg_oom.wait_on_memcg set and the warning wouldn't trigger) and then > it exited on the userspace request (by exit syscall). > > I do not see any way how, this could happen though. If mem_cgroup_oom > is called then we always return CHARGE_NOMEM which turns into ENOMEM > returned by __mem_cgroup_try_charge (invoke_oom must have been set to > true). So if nobody screwed the return value on the way up to page > fault handler then there is no way to escape. > > I will check the code.

OK, I guess I found it:

__do_fault
  fault = filemap_fault
    do_async_mmap_readahead
      page_cache_async_readahead
        ondemand_readahead
          __do_page_cache_readahead
            read_pages
              readpages = ext3_readpages
                mpage_readpages		# Doesn't propagate ENOMEM
                  add_to_page_cache_lru
                    add_to_page_cache
                      add_to_page_cache_locked
                        mem_cgroup_cache_charge

So the read ahead most probably. Again! Duhhh. I will try to think about a fix for this. One obvious place is mpage_readpages but __do_page_cache_readahead ignores read_pages return value as well and page_cache_async_readahead, even worse, is just void and exported as such.

So this smells like a hard to fix bugger. One possible, and really ugly way would be calling mem_cgroup_oom_synchronize even if handle_mm_fault doesn't return VM_FAULT_ERROR, but that is a crude hack.
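
To make the failure mode above concrete, here is a minimal user-space sketch in C. It is not kernel code and not the actual fix. Every name in it (toy_task, toy_charge, toy_readahead, toy_oom_synchronize, toy_fault_leaks) is a hypothetical stand-in invented for illustration, and the only assumption it encodes is the one described in this thread: the charge path marks the task and returns -ENOMEM, and the fault handler synchronizes only when it actually sees that error.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-task OOM state (cf. memcg_oom.wait_on_memcg). */
struct toy_task {
	bool wait_on_memcg;
};

/* Toy charge: when over the limit, mark the task and report -ENOMEM. */
static int toy_charge(struct toy_task *t, bool over_limit)
{
	if (!over_limit)
		return 0;
	t->wait_on_memcg = true;
	return -ENOMEM;
}

/* Toy readahead: returns void, so the charge result is silently dropped. */
static void toy_readahead(struct toy_task *t, bool over_limit)
{
	(void)toy_charge(t, over_limit);
}

/* Toy synchronize: clears the mark (stands in for mem_cgroup_oom_synchronize). */
static void toy_oom_synchronize(struct toy_task *t)
{
	t->wait_on_memcg = false;
}

/*
 * Toy fault handler. via_readahead selects the path that loses the error;
 * always_sync models the "crude hack" of synchronizing unconditionally.
 * Returns true if the task leaves the fault with the mark still set.
 */
static bool toy_fault_leaks(bool via_readahead, bool always_sync)
{
	struct toy_task t = { .wait_on_memcg = false };
	int ret = 0;

	if (via_readahead)
		toy_readahead(&t, true);	/* the error never reaches us */
	else
		ret = toy_charge(&t, true);	/* the error propagates */

	if (ret == -ENOMEM || always_sync)
		toy_oom_synchronize(&t);

	return t.wait_on_memcg;
}
```

In this toy model, toy_fault_leaks(true, false) is the readahead case: the mark survives the fault, which is the state the WARN_ON in do_exit would catch. Passing always_sync = true clears it again, at the cost of synchronizing on every fault.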
-- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: Johannes Weiner Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM Date: Tue, 16 Jul 2013 11:35:44 -0400 Message-ID: <20130716153544.GX17812@cmpxchg.org> References: <20130708014224.50F06960@pobox.sk> <20130709131029.GH20281@dhcp22.suse.cz> <20130709151921.5160C199@pobox.sk> <20130709135450.GI20281@dhcp22.suse.cz> <20130710182506.F25DF461@pobox.sk> <20130711072507.GA21667@dhcp22.suse.cz> <20130714012641.C2DA4E05@pobox.sk> <20130714015112.FFCB7AF7@pobox.sk> <20130715154119.GA32435@dhcp22.suse.cz> <20130715160006.GB32435@dhcp22.suse.cz> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20130715160006.GB32435-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Michal Hocko Cc: azurIt , linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist , KAMEZAWA Hiroyuki , righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org On Mon, Jul 15, 2013 at 06:00:06PM +0200, Michal Hocko wrote: > On Mon 15-07-13 17:41:19, Michal Hocko wrote: > > On Sun 14-07-13 01:51:12, azurIt wrote: > > > > CC: "Johannes Weiner" , linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org > > > >> CC: "Johannes Weiner" , linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org > > > >>On Wed 10-07-13 18:25:06, azurIt wrote: > > > >>> >> Now i realized that i forgot to remove UID from that cgroup before > > > >>> >> trying to remove it, so cgroup cannot be removed anyway (we are using > > > 
>>> >> third party cgroup called cgroup-uid from Andrea Righi, which is able > > > >>> >> to associate all user's processes with target cgroup). Look here for > > > >>> >> cgroup-uid patch: > > > >>> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch > > > >>> >> > > > >>> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was > > > >>> >> permanently '1'. > > > >>> > > > > >>> >This is really strange. Could you post the whole diff against stable > > > >>> >tree you are using (except for grsecurity stuff and the above cgroup-uid > > > >>> >patch)? > > > >>> > > > >>> > > > >>> Here are all patches which i applied to kernel 3.2.48 in my last test: > > > >>> http://watchdog.sk/lkml/patches3/ > > > >> > > > >>The two patches from Johannes seem correct. > > > >> > > > >>From a quick look even grsecurity patchset shouldn't interfere as it > > > >>doesn't seem to put any code between handle_mm_fault and mm_fault_error > > > >>and there also doesn't seem to be any new handle_mm_fault call sites. > > > >> > > > >>But I cannot tell there aren't other code paths which would lead to a > > > >>memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling. > > > > > > > > > > > >Michal, > > > > > > > >now i can definitely confirm that problem with unremovable cgroups > > > >persists. What info do you need from me? I applied also your little > > > >'WARN_ON' patch. 
> > > > > > Ok, i think you want this: > > > http://watchdog.sk/lkml/kern4.log > > > > Jul 14 01:11:39 server01 kernel: [ 593.589087] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name > > Jul 14 01:11:39 server01 kernel: [ 593.589451] [12021] 1333 12021 172027 64723 4 0 0 apache2 > > Jul 14 01:11:39 server01 kernel: [ 593.589647] [12030] 1333 12030 172030 64748 2 0 0 apache2 > > Jul 14 01:11:39 server01 kernel: [ 593.589836] [12031] 1333 12031 172030 64749 3 0 0 apache2 > > Jul 14 01:11:39 server01 kernel: [ 593.590025] [12032] 1333 12032 170619 63428 3 0 0 apache2 > > Jul 14 01:11:39 server01 kernel: [ 593.590213] [12033] 1333 12033 167934 60524 2 0 0 apache2 > > Jul 14 01:11:39 server01 kernel: [ 593.590401] [12034] 1333 12034 170747 63496 4 0 0 apache2 > > Jul 14 01:11:39 server01 kernel: [ 593.590588] [12035] 1333 12035 169659 62451 1 0 0 apache2 > > Jul 14 01:11:39 server01 kernel: [ 593.590776] [12036] 1333 12036 167614 60384 3 0 0 apache2 > > Jul 14 01:11:39 server01 kernel: [ 593.590984] [12037] 1333 12037 166342 58964 3 0 0 apache2 > > Jul 14 01:11:39 server01 kernel: [ 593.591178] Memory cgroup out of memory: Kill process 12021 (apache2) score 847 or sacrifice child > > Jul 14 01:11:39 server01 kernel: [ 593.591370] Killed process 12021 (apache2) total-vm:688108kB, anon-rss:255472kB, file-rss:3420kB > > Jul 14 01:11:41 server01 kernel: [ 595.392920] ------------[ cut here ]------------ > > Jul 14 01:11:41 server01 kernel: [ 595.393096] WARNING: at kernel/exit.c:888 do_exit+0x7d0/0x870() > > Jul 14 01:11:41 server01 kernel: [ 595.393256] Hardware name: S5000VSA > > Jul 14 01:11:41 server01 kernel: [ 595.393415] Pid: 12037, comm: apache2 Not tainted 3.2.48-grsec #1 > > Jul 14 01:11:41 server01 kernel: [ 595.393577] Call Trace: > > Jul 14 01:11:41 server01 kernel: [ 595.393737] [] warn_slowpath_common+0x7a/0xb0 > > Jul 14 01:11:41 server01 kernel: [ 595.393903] [] warn_slowpath_null+0x1a/0x20 > > Jul 14 01:11:41 server01 kernel: [ 595.394068] [] 
do_exit+0x7d0/0x870 > > Jul 14 01:11:41 server01 kernel: [ 595.394231] [] ? thread_group_times+0x44/0xb0 > > Jul 14 01:11:41 server01 kernel: [ 595.394392] [] do_group_exit+0x51/0xc0 > > Jul 14 01:11:41 server01 kernel: [ 595.394551] [] sys_exit_group+0x17/0x20 > > Jul 14 01:11:41 server01 kernel: [ 595.394714] [] system_call_fastpath+0x18/0x1d > > Jul 14 01:11:41 server01 kernel: [ 595.394921] ---[ end trace 738570e688acf099 ]--- > > > > OK, so you had an OOM which has been handled by in-kernel oom handler > > (it killed 12021) and 12037 was in the same group. The warning tells us > > that it went through mem_cgroup_oom as well (otherwise it wouldn't have > > memcg_oom.wait_on_memcg set and the warning wouldn't trigger) and then > > it exited on the userspace request (by exit syscall). > > > > I do not see any way how, this could happen though. If mem_cgroup_oom > > is called then we always return CHARGE_NOMEM which turns into ENOMEM > > returned by __mem_cgroup_try_charge (invoke_oom must have been set to > > true). So if nobody screwed the return value on the way up to page > > fault handler then there is no way to escape. > > > > I will check the code. > > OK, I guess I found it: > __do_fault > fault = filemap_fault > do_async_mmap_readahead > page_cache_async_readahead > ondemand_readahead > __do_page_cache_readahead > read_pages > readpages = ext3_readpages > mpage_readpages # Doesn't propagate ENOMEM > add_to_page_cache_lru > add_to_page_cache > add_to_page_cache_locked > mem_cgroup_cache_charge > > So the read ahead most probably. Again! Duhhh. I will try to think > about a fix for this. One obvious place is mpage_readpages but > __do_page_cache_readahead ignores read_pages return value as well and > page_cache_async_readahead, even worse, is just void and exported as > such. > > So this smells like a hard to fix bugger. 
One possible, and really ugly > way would be calling mem_cgroup_oom_synchronize even if handle_mm_fault > doesn't return VM_FAULT_ERROR, but that is a crude hack. Ouch, good spot. I don't think we need to handle an OOM from the readahead code. If readahead does not produce the desired page, we retry synchronously in page_cache_read() and handle the OOM properly. We should not signal an OOM for optional pages anyway. So either we pass a flag from the readahead code down to add_to_page_cache and mem_cgroup_cache_charge that tells the charge code to ignore OOM conditions and not set up an OOM context. Or we DO call mem_cgroup_oom_synchronize() from read_cache_pages, with an argument that makes it only clean up the context and not wait. It would not be completely outlandish to place it there, since it's right next to where an error from add_to_page_cache() is not further propagated back through the fault stack. I'm travelling right now, I'll send a patch when I get back (Thursday). Unless you beat me to it :) From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM Date: Tue, 16 Jul 2013 18:09:05 +0200 Message-ID: <20130716160905.GA20018@dhcp22.suse.cz> References: <20130709131029.GH20281@dhcp22.suse.cz> <20130709151921.5160C199@pobox.sk> <20130709135450.GI20281@dhcp22.suse.cz> <20130710182506.F25DF461@pobox.sk> <20130711072507.GA21667@dhcp22.suse.cz> <20130714012641.C2DA4E05@pobox.sk> <20130714015112.FFCB7AF7@pobox.sk> <20130715154119.GA32435@dhcp22.suse.cz> <20130715160006.GB32435@dhcp22.suse.cz> <20130716153544.GX17812@cmpxchg.org> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20130716153544.GX17812-druUgvl0LCNAfugRpC6u6w@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Johannes Weiner Cc: azurIt , 
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist , KAMEZAWA Hiroyuki , righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org On Tue 16-07-13 11:35:44, Johannes Weiner wrote: > On Mon, Jul 15, 2013 at 06:00:06PM +0200, Michal Hocko wrote: > > On Mon 15-07-13 17:41:19, Michal Hocko wrote: > > > On Sun 14-07-13 01:51:12, azurIt wrote: > > > > > CC: "Johannes Weiner" , linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org > > > > >> CC: "Johannes Weiner" , linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org > > > > >>On Wed 10-07-13 18:25:06, azurIt wrote: > > > > >>> >> Now i realized that i forgot to remove UID from that cgroup before > > > > >>> >> trying to remove it, so cgroup cannot be removed anyway (we are using > > > > >>> >> third party cgroup called cgroup-uid from Andrea Righi, which is able > > > > >>> >> to associate all user's processes with target cgroup). Look here for > > > > >>> >> cgroup-uid patch: > > > > >>> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch > > > > >>> >> > > > > >>> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was > > > > >>> >> permanently '1'. > > > > >>> > > > > > >>> >This is really strange. Could you post the whole diff against stable > > > > >>> >tree you are using (except for grsecurity stuff and the above cgroup-uid > > > > >>> >patch)? > > > > >>> > > > > >>> > > > > >>> Here are all patches which i applied to kernel 3.2.48 in my last test: > > > > >>> http://watchdog.sk/lkml/patches3/ > > > > >> > > > > >>The two patches from Johannes seem correct. 
> > > > >> > > > > >>From a quick look even grsecurity patchset shouldn't interfere as it > > > > >>doesn't seem to put any code between handle_mm_fault and mm_fault_error > > > > >>and there also doesn't seem to be any new handle_mm_fault call sites. > > > > >> > > > > >>But I cannot tell there aren't other code paths which would lead to a > > > > >>memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling. > > > > > > > > > > > > > > >Michal, > > > > > > > > > >now i can definitely confirm that problem with unremovable cgroups > > > > >persists. What info do you need from me? I applied also your little > > > > >'WARN_ON' patch. > > > > > > > > Ok, i think you want this: > > > > http://watchdog.sk/lkml/kern4.log > > > > > > Jul 14 01:11:39 server01 kernel: [ 593.589087] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name > > > Jul 14 01:11:39 server01 kernel: [ 593.589451] [12021] 1333 12021 172027 64723 4 0 0 apache2 > > > Jul 14 01:11:39 server01 kernel: [ 593.589647] [12030] 1333 12030 172030 64748 2 0 0 apache2 > > > Jul 14 01:11:39 server01 kernel: [ 593.589836] [12031] 1333 12031 172030 64749 3 0 0 apache2 > > > Jul 14 01:11:39 server01 kernel: [ 593.590025] [12032] 1333 12032 170619 63428 3 0 0 apache2 > > > Jul 14 01:11:39 server01 kernel: [ 593.590213] [12033] 1333 12033 167934 60524 2 0 0 apache2 > > > Jul 14 01:11:39 server01 kernel: [ 593.590401] [12034] 1333 12034 170747 63496 4 0 0 apache2 > > > Jul 14 01:11:39 server01 kernel: [ 593.590588] [12035] 1333 12035 169659 62451 1 0 0 apache2 > > > Jul 14 01:11:39 server01 kernel: [ 593.590776] [12036] 1333 12036 167614 60384 3 0 0 apache2 > > > Jul 14 01:11:39 server01 kernel: [ 593.590984] [12037] 1333 12037 166342 58964 3 0 0 apache2 > > > Jul 14 01:11:39 server01 kernel: [ 593.591178] Memory cgroup out of memory: Kill process 12021 (apache2) score 847 or sacrifice child > > > Jul 14 01:11:39 server01 kernel: [ 593.591370] Killed process 12021 (apache2) total-vm:688108kB, 
anon-rss:255472kB, file-rss:3420kB > > > Jul 14 01:11:41 server01 kernel: [ 595.392920] ------------[ cut here ]------------ > > > Jul 14 01:11:41 server01 kernel: [ 595.393096] WARNING: at kernel/exit.c:888 do_exit+0x7d0/0x870() > > > Jul 14 01:11:41 server01 kernel: [ 595.393256] Hardware name: S5000VSA > > > Jul 14 01:11:41 server01 kernel: [ 595.393415] Pid: 12037, comm: apache2 Not tainted 3.2.48-grsec #1 > > > Jul 14 01:11:41 server01 kernel: [ 595.393577] Call Trace: > > > Jul 14 01:11:41 server01 kernel: [ 595.393737] [] warn_slowpath_common+0x7a/0xb0 > > > Jul 14 01:11:41 server01 kernel: [ 595.393903] [] warn_slowpath_null+0x1a/0x20 > > > Jul 14 01:11:41 server01 kernel: [ 595.394068] [] do_exit+0x7d0/0x870 > > > Jul 14 01:11:41 server01 kernel: [ 595.394231] [] ? thread_group_times+0x44/0xb0 > > > Jul 14 01:11:41 server01 kernel: [ 595.394392] [] do_group_exit+0x51/0xc0 > > > Jul 14 01:11:41 server01 kernel: [ 595.394551] [] sys_exit_group+0x17/0x20 > > > Jul 14 01:11:41 server01 kernel: [ 595.394714] [] system_call_fastpath+0x18/0x1d > > > Jul 14 01:11:41 server01 kernel: [ 595.394921] ---[ end trace 738570e688acf099 ]--- > > > > > > OK, so you had an OOM which has been handled by in-kernel oom handler > > > (it killed 12021) and 12037 was in the same group. The warning tells us > > > that it went through mem_cgroup_oom as well (otherwise it wouldn't have > > > memcg_oom.wait_on_memcg set and the warning wouldn't trigger) and then > > > it exited on the userspace request (by exit syscall). > > > > > > I do not see any way how, this could happen though. If mem_cgroup_oom > > > is called then we always return CHARGE_NOMEM which turns into ENOMEM > > > returned by __mem_cgroup_try_charge (invoke_oom must have been set to > > > true). So if nobody screwed the return value on the way up to page > > > fault handler then there is no way to escape. > > > > > > I will check the code. 
> > > > OK, I guess I found it: > > __do_fault > > fault = filemap_fault > > do_async_mmap_readahead > > page_cache_async_readahead > > ondemand_readahead > > __do_page_cache_readahead > > read_pages > > readpages = ext3_readpages > > mpage_readpages # Doesn't propagate ENOMEM > > add_to_page_cache_lru > > add_to_page_cache > > add_to_page_cache_locked > > mem_cgroup_cache_charge > > > > So the read ahead most probably. Again! Duhhh. I will try to think > > about a fix for this. One obvious place is mpage_readpages but > > __do_page_cache_readahead ignores read_pages return value as well and > > page_cache_async_readahead, even worse, is just void and exported as > > such. > > > > So this smells like a hard to fix bugger. One possible, and really ugly > > way would be calling mem_cgroup_oom_synchronize even if handle_mm_fault > > doesn't return VM_FAULT_ERROR, but that is a crude hack. > > Ouch, good spot. > > I don't think we need to handle an OOM from the readahead code. If > readahead does not produce the desired page, we retry synchroneously > in page_cache_read() and handle the OOM properly. We should not > signal an OOM for optional pages anyway. > > So either we pass a flag from the readahead code down to > add_to_page_cache and mem_cgroup_cache_charge that tells the charge > code to ignore OOM conditions and do not set up an OOM context. That was my previous attempt and it was sooo painful. > Or we DO call mem_cgroup_oom_synchronize() from the read_cache_pages, > with an argument that makes it only clean up the context and not wait. Yes, I was playing with this idea as well. I just do not like how fragile this is. We need some way to catch all possible places which might leak it. > It would not be completely outlandish to place it there, since it's > right next to where an error from add_to_page_cache() is not further > propagated back through the fault stack. > > I'm travelling right now, I'll send a patch when I get back > (Thursday). 
Unless you beat me to it :) I can cook something up but there is quite a big pile on my desk currently (as always :/). -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: Johannes Weiner Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM Date: Tue, 16 Jul 2013 12:48:30 -0400 Message-ID: <20130716164830.GZ17812@cmpxchg.org> References: <20130709151921.5160C199@pobox.sk> <20130709135450.GI20281@dhcp22.suse.cz> <20130710182506.F25DF461@pobox.sk> <20130711072507.GA21667@dhcp22.suse.cz> <20130714012641.C2DA4E05@pobox.sk> <20130714015112.FFCB7AF7@pobox.sk> <20130715154119.GA32435@dhcp22.suse.cz> <20130715160006.GB32435@dhcp22.suse.cz> <20130716153544.GX17812@cmpxchg.org> <20130716160905.GA20018@dhcp22.suse.cz> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20130716160905.GA20018-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Michal Hocko Cc: azurIt , linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist , KAMEZAWA Hiroyuki , righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org On Tue, Jul 16, 2013 at 06:09:05PM +0200, Michal Hocko wrote: > On Tue 16-07-13 11:35:44, Johannes Weiner wrote: > > On Mon, Jul 15, 2013 at 06:00:06PM +0200, Michal Hocko wrote: > > > On Mon 15-07-13 17:41:19, Michal Hocko wrote: > > > > On Sun 14-07-13 01:51:12, azurIt wrote: > > > > > > CC: "Johannes Weiner" , linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org > > > > > >> CC: "Johannes Weiner" , linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , 
righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org > > > > > >>On Wed 10-07-13 18:25:06, azurIt wrote: > > > > > >>> >> Now i realized that i forgot to remove UID from that cgroup before > > > > > >>> >> trying to remove it, so cgroup cannot be removed anyway (we are using > > > > > >>> >> third party cgroup called cgroup-uid from Andrea Righi, which is able > > > > > >>> >> to associate all user's processes with target cgroup). Look here for > > > > > >>> >> cgroup-uid patch: > > > > > >>> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch > > > > > >>> >> > > > > > >>> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was > > > > > >>> >> permanently '1'. > > > > > >>> > > > > > > >>> >This is really strange. Could you post the whole diff against stable > > > > > >>> >tree you are using (except for grsecurity stuff and the above cgroup-uid > > > > > >>> >patch)? > > > > > >>> > > > > > >>> > > > > > >>> Here are all patches which i applied to kernel 3.2.48 in my last test: > > > > > >>> http://watchdog.sk/lkml/patches3/ > > > > > >> > > > > > >>The two patches from Johannes seem correct. > > > > > >> > > > > > >>From a quick look even grsecurity patchset shouldn't interfere as it > > > > > >>doesn't seem to put any code between handle_mm_fault and mm_fault_error > > > > > >>and there also doesn't seem to be any new handle_mm_fault call sites. > > > > > >> > > > > > >>But I cannot tell there aren't other code paths which would lead to a > > > > > >>memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling. > > > > > > > > > > > > > > > > > >Michal, > > > > > > > > > > > >now i can definitely confirm that problem with unremovable cgroups > > > > > >persists. What info do you need from me? I applied also your little > > > > > >'WARN_ON' patch. 
> > > > > > > > > > Ok, i think you want this: > > > > > http://watchdog.sk/lkml/kern4.log > > > > > > > > Jul 14 01:11:39 server01 kernel: [ 593.589087] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name > > > > Jul 14 01:11:39 server01 kernel: [ 593.589451] [12021] 1333 12021 172027 64723 4 0 0 apache2 > > > > Jul 14 01:11:39 server01 kernel: [ 593.589647] [12030] 1333 12030 172030 64748 2 0 0 apache2 > > > > Jul 14 01:11:39 server01 kernel: [ 593.589836] [12031] 1333 12031 172030 64749 3 0 0 apache2 > > > > Jul 14 01:11:39 server01 kernel: [ 593.590025] [12032] 1333 12032 170619 63428 3 0 0 apache2 > > > > Jul 14 01:11:39 server01 kernel: [ 593.590213] [12033] 1333 12033 167934 60524 2 0 0 apache2 > > > > Jul 14 01:11:39 server01 kernel: [ 593.590401] [12034] 1333 12034 170747 63496 4 0 0 apache2 > > > > Jul 14 01:11:39 server01 kernel: [ 593.590588] [12035] 1333 12035 169659 62451 1 0 0 apache2 > > > > Jul 14 01:11:39 server01 kernel: [ 593.590776] [12036] 1333 12036 167614 60384 3 0 0 apache2 > > > > Jul 14 01:11:39 server01 kernel: [ 593.590984] [12037] 1333 12037 166342 58964 3 0 0 apache2 > > > > Jul 14 01:11:39 server01 kernel: [ 593.591178] Memory cgroup out of memory: Kill process 12021 (apache2) score 847 or sacrifice child > > > > Jul 14 01:11:39 server01 kernel: [ 593.591370] Killed process 12021 (apache2) total-vm:688108kB, anon-rss:255472kB, file-rss:3420kB > > > > Jul 14 01:11:41 server01 kernel: [ 595.392920] ------------[ cut here ]------------ > > > > Jul 14 01:11:41 server01 kernel: [ 595.393096] WARNING: at kernel/exit.c:888 do_exit+0x7d0/0x870() > > > > Jul 14 01:11:41 server01 kernel: [ 595.393256] Hardware name: S5000VSA > > > > Jul 14 01:11:41 server01 kernel: [ 595.393415] Pid: 12037, comm: apache2 Not tainted 3.2.48-grsec #1 > > > > Jul 14 01:11:41 server01 kernel: [ 595.393577] Call Trace: > > > > Jul 14 01:11:41 server01 kernel: [ 595.393737] [] warn_slowpath_common+0x7a/0xb0 > > > > Jul 14 01:11:41 server01 kernel: [ 
595.393903] [] warn_slowpath_null+0x1a/0x20 > > > > Jul 14 01:11:41 server01 kernel: [ 595.394068] [] do_exit+0x7d0/0x870 > > > > Jul 14 01:11:41 server01 kernel: [ 595.394231] [] ? thread_group_times+0x44/0xb0 > > > > Jul 14 01:11:41 server01 kernel: [ 595.394392] [] do_group_exit+0x51/0xc0 > > > > Jul 14 01:11:41 server01 kernel: [ 595.394551] [] sys_exit_group+0x17/0x20 > > > > Jul 14 01:11:41 server01 kernel: [ 595.394714] [] system_call_fastpath+0x18/0x1d > > > > Jul 14 01:11:41 server01 kernel: [ 595.394921] ---[ end trace 738570e688acf099 ]--- > > > > > > > > OK, so you had an OOM which has been handled by in-kernel oom handler > > > > (it killed 12021) and 12037 was in the same group. The warning tells us > > > > that it went through mem_cgroup_oom as well (otherwise it wouldn't have > > > > memcg_oom.wait_on_memcg set and the warning wouldn't trigger) and then > > > > it exited on the userspace request (by exit syscall). > > > > > > > > I do not see any way how, this could happen though. If mem_cgroup_oom > > > > is called then we always return CHARGE_NOMEM which turns into ENOMEM > > > > returned by __mem_cgroup_try_charge (invoke_oom must have been set to > > > > true). So if nobody screwed the return value on the way up to page > > > > fault handler then there is no way to escape. > > > > > > > > I will check the code. > > > > > > OK, I guess I found it: > > > __do_fault > > > fault = filemap_fault > > > do_async_mmap_readahead > > > page_cache_async_readahead > > > ondemand_readahead > > > __do_page_cache_readahead > > > read_pages > > > readpages = ext3_readpages > > > mpage_readpages # Doesn't propagate ENOMEM > > > add_to_page_cache_lru > > > add_to_page_cache > > > add_to_page_cache_locked > > > mem_cgroup_cache_charge > > > > > > So the read ahead most probably. Again! Duhhh. I will try to think > > > about a fix for this. 
One obvious place is mpage_readpages but > > > __do_page_cache_readahead ignores read_pages return value as well and > > > page_cache_async_readahead, even worse, is just void and exported as > > > such. > > > > > > So this smells like a hard to fix bugger. One possible, and really ugly > > > way would be calling mem_cgroup_oom_synchronize even if handle_mm_fault > > > doesn't return VM_FAULT_ERROR, but that is a crude hack. > > > > Ouch, good spot. > > > > I don't think we need to handle an OOM from the readahead code. If > > readahead does not produce the desired page, we retry synchroneously > > in page_cache_read() and handle the OOM properly. We should not > > signal an OOM for optional pages anyway. > > > > So either we pass a flag from the readahead code down to > > add_to_page_cache and mem_cgroup_cache_charge that tells the charge > > code to ignore OOM conditions and do not set up an OOM context. > > That was my previous attempt and it was sooo painful. > > > Or we DO call mem_cgroup_oom_synchronize() from the read_cache_pages, > > with an argument that makes it only clean up the context and not wait. > > Yes, I was playing with this idea as well. I just do not like how > fragile this is. We need some way to catch all possible places which > might leak it. I don't think this is necessary, but we could add a sanity check in/near mem_cgroup_clear_userfault() that makes sure the OOM context is only set up when an error is returned. > > It would not be completely outlandish to place it there, since it's > > right next to where an error from add_to_page_cache() is not further > > propagated back through the fault stack. > > > > I'm travelling right now, I'll send a patch when I get back > > (Thursday). Unless you beat me to it :) > > I can cook something up but there is quite a big pile on my desk > currently (as always :/). No worries, I'll send an update. 
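[Illustration only, not part of the original message.] The leak discussed above can be modelled outside the kernel. The sketch below is a user-space toy, not kernel code: struct task_model, MODEL_FAULT_OOM and every model_* function are hypothetical names invented here. It assumes only what the thread states: a failed charge records an OOM context on the task, the readahead path drops the resulting error, and a sanity check can flag a context that is set while no OOM is being reported.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-task memcg OOM context
 * (the thread calls the real field memcg_oom.wait_on_memcg). */
struct task_model {
    bool oom_context_set;
};

/* Illustrative flag value; stands in for VM_FAULT_OOM. */
#define MODEL_FAULT_OOM 0x1u

/* A charge that fails over the limit: record the OOM context
 * on the task and report -ENOMEM to the caller. */
int model_charge(struct task_model *t, bool over_limit)
{
    if (!over_limit)
        return 0;
    t->oom_context_set = true;
    return -ENOMEM;
}

/* Readahead-like path: the charge error is swallowed, so the
 * fault reports success even though the context was set up. */
unsigned int model_readahead_fault(struct task_model *t, bool over_limit)
{
    (void)model_charge(t, over_limit);
    return 0;
}

/* Synchronous path: the charge error is propagated as an OOM fault. */
unsigned int model_sync_fault(struct task_model *t, bool over_limit)
{
    return model_charge(t, over_limit) ? MODEL_FAULT_OOM : 0;
}

/* Sanity check: an OOM context without an OOM fault result is a
 * leak. Report it and clear the stale context. */
bool model_check_leak(struct task_model *t, unsigned int fault)
{
    bool leaked = t->oom_context_set && !(fault & MODEL_FAULT_OOM);

    if (leaked)
        t->oom_context_set = false;
    return leaked;
}
```

Calling model_check_leak() on the result of model_readahead_fault() with over_limit set reports a leak, while the same check after model_sync_fault() does not, which is the invariant the proposed check would enforce.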
From mboxrd@z Thu Jan 1 00:00:00 1970 From: Johannes Weiner Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM Date: Fri, 19 Jul 2013 00:21:24 -0400 Message-ID: <20130719042124.GC17812@cmpxchg.org> References: <20130709135450.GI20281@dhcp22.suse.cz> <20130710182506.F25DF461@pobox.sk> <20130711072507.GA21667@dhcp22.suse.cz> <20130714012641.C2DA4E05@pobox.sk> <20130714015112.FFCB7AF7@pobox.sk> <20130715154119.GA32435@dhcp22.suse.cz> <20130715160006.GB32435@dhcp22.suse.cz> <20130716153544.GX17812@cmpxchg.org> <20130716160905.GA20018@dhcp22.suse.cz> <20130716164830.GZ17812@cmpxchg.org> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20130716164830.GZ17812@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Michal Hocko Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , righi.andrea@gmail.com On Tue, Jul 16, 2013 at 12:48:30PM -0400, Johannes Weiner wrote: > On Tue, Jul 16, 2013 at 06:09:05PM +0200, Michal Hocko wrote: > > On Tue 16-07-13 11:35:44, Johannes Weiner wrote: > > > On Mon, Jul 15, 2013 at 06:00:06PM +0200, Michal Hocko wrote: > > > > On Mon 15-07-13 17:41:19, Michal Hocko wrote: > > > > > On Sun 14-07-13 01:51:12, azurIt wrote: > > > > > > > CC: "Johannes Weiner" , linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , righi.andrea@gmail.com > > > > > > >> CC: "Johannes Weiner" , linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , righi.andrea@gmail.com > > > > > > >>On Wed 10-07-13 18:25:06, azurIt wrote: > > > > > > >>> >> Now i realized that i forgot to remove UID from that cgroup before > > > > > > >>> >> trying to remove it, so cgroup cannot be removed anyway (we are using > > > > > > >>> >> third party cgroup called cgroup-uid from Andrea Righi, which is able > > > > > > >>> 
>> to associate all user's processes with target cgroup). Look here for > > > > > > >>> >> cgroup-uid patch: > > > > > > >>> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch > > > > > > >>> >> > > > > > > >>> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was > > > > > > >>> >> permanently '1'. > > > > > > >>> > > > > > > > >>> >This is really strange. Could you post the whole diff against stable > > > > > > >>> >tree you are using (except for grsecurity stuff and the above cgroup-uid > > > > > > >>> >patch)? > > > > > > >>> > > > > > > >>> > > > > > > >>> Here are all patches which i applied to kernel 3.2.48 in my last test: > > > > > > >>> http://watchdog.sk/lkml/patches3/ > > > > > > >> > > > > > > >>The two patches from Johannes seem correct. > > > > > > >> > > > > > > >>From a quick look even grsecurity patchset shouldn't interfere as it > > > > > > >>doesn't seem to put any code between handle_mm_fault and mm_fault_error > > > > > > >>and there also doesn't seem to be any new handle_mm_fault call sites. > > > > > > >> > > > > > > >>But I cannot tell there aren't other code paths which would lead to a > > > > > > >>memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling. > > > > > > > > > > > > > > > > > > > > >Michal, > > > > > > > > > > > > > >now i can definitely confirm that problem with unremovable cgroups > > > > > > >persists. What info do you need from me? I applied also your little > > > > > > >'WARN_ON' patch. 
> > > > > > > > > > > > Ok, i think you want this: > > > > > > http://watchdog.sk/lkml/kern4.log > > > > > > > > > > Jul 14 01:11:39 server01 kernel: [ 593.589087] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name > > > > > Jul 14 01:11:39 server01 kernel: [ 593.589451] [12021] 1333 12021 172027 64723 4 0 0 apache2 > > > > > Jul 14 01:11:39 server01 kernel: [ 593.589647] [12030] 1333 12030 172030 64748 2 0 0 apache2 > > > > > Jul 14 01:11:39 server01 kernel: [ 593.589836] [12031] 1333 12031 172030 64749 3 0 0 apache2 > > > > > Jul 14 01:11:39 server01 kernel: [ 593.590025] [12032] 1333 12032 170619 63428 3 0 0 apache2 > > > > > Jul 14 01:11:39 server01 kernel: [ 593.590213] [12033] 1333 12033 167934 60524 2 0 0 apache2 > > > > > Jul 14 01:11:39 server01 kernel: [ 593.590401] [12034] 1333 12034 170747 63496 4 0 0 apache2 > > > > > Jul 14 01:11:39 server01 kernel: [ 593.590588] [12035] 1333 12035 169659 62451 1 0 0 apache2 > > > > > Jul 14 01:11:39 server01 kernel: [ 593.590776] [12036] 1333 12036 167614 60384 3 0 0 apache2 > > > > > Jul 14 01:11:39 server01 kernel: [ 593.590984] [12037] 1333 12037 166342 58964 3 0 0 apache2 > > > > > Jul 14 01:11:39 server01 kernel: [ 593.591178] Memory cgroup out of memory: Kill process 12021 (apache2) score 847 or sacrifice child > > > > > Jul 14 01:11:39 server01 kernel: [ 593.591370] Killed process 12021 (apache2) total-vm:688108kB, anon-rss:255472kB, file-rss:3420kB > > > > > Jul 14 01:11:41 server01 kernel: [ 595.392920] ------------[ cut here ]------------ > > > > > Jul 14 01:11:41 server01 kernel: [ 595.393096] WARNING: at kernel/exit.c:888 do_exit+0x7d0/0x870() > > > > > Jul 14 01:11:41 server01 kernel: [ 595.393256] Hardware name: S5000VSA > > > > > Jul 14 01:11:41 server01 kernel: [ 595.393415] Pid: 12037, comm: apache2 Not tainted 3.2.48-grsec #1 > > > > > Jul 14 01:11:41 server01 kernel: [ 595.393577] Call Trace: > > > > > Jul 14 01:11:41 server01 kernel: [ 595.393737] [] warn_slowpath_common+0x7a/0xb0 > > > 
> > Jul 14 01:11:41 server01 kernel: [ 595.393903] [] warn_slowpath_null+0x1a/0x20 > > > > > Jul 14 01:11:41 server01 kernel: [ 595.394068] [] do_exit+0x7d0/0x870 > > > > > Jul 14 01:11:41 server01 kernel: [ 595.394231] [] ? thread_group_times+0x44/0xb0 > > > > > Jul 14 01:11:41 server01 kernel: [ 595.394392] [] do_group_exit+0x51/0xc0 > > > > > Jul 14 01:11:41 server01 kernel: [ 595.394551] [] sys_exit_group+0x17/0x20 > > > > > Jul 14 01:11:41 server01 kernel: [ 595.394714] [] system_call_fastpath+0x18/0x1d > > > > > Jul 14 01:11:41 server01 kernel: [ 595.394921] ---[ end trace 738570e688acf099 ]--- > > > > > > > > > > OK, so you had an OOM which has been handled by in-kernel oom handler > > > > > (it killed 12021) and 12037 was in the same group. The warning tells us > > > > > that it went through mem_cgroup_oom as well (otherwise it wouldn't have > > > > > memcg_oom.wait_on_memcg set and the warning wouldn't trigger) and then > > > > > it exited on the userspace request (by exit syscall). > > > > > > > > > > I do not see any way how, this could happen though. If mem_cgroup_oom > > > > > is called then we always return CHARGE_NOMEM which turns into ENOMEM > > > > > returned by __mem_cgroup_try_charge (invoke_oom must have been set to > > > > > true). So if nobody screwed the return value on the way up to page > > > > > fault handler then there is no way to escape. > > > > > > > > > > I will check the code. > > > > > > > > OK, I guess I found it: > > > > __do_fault > > > > fault = filemap_fault > > > > do_async_mmap_readahead > > > > page_cache_async_readahead > > > > ondemand_readahead > > > > __do_page_cache_readahead > > > > read_pages > > > > readpages = ext3_readpages > > > > mpage_readpages # Doesn't propagate ENOMEM > > > > add_to_page_cache_lru > > > > add_to_page_cache > > > > add_to_page_cache_locked > > > > mem_cgroup_cache_charge > > > > > > > > So the read ahead most probably. Again! Duhhh. I will try to think > > > > about a fix for this. 
One obvious place is mpage_readpages but
> > > > __do_page_cache_readahead ignores read_pages return value as well and
> > > > page_cache_async_readahead, even worse, is just void and exported as
> > > > such.
> > > >
> > > > So this smells like a hard to fix bugger. One possible, and really ugly
> > > > way would be calling mem_cgroup_oom_synchronize even if handle_mm_fault
> > > > doesn't return VM_FAULT_ERROR, but that is a crude hack.

I fixed it by disabling the OOM killer altogether for readahead code.
We don't do it globally, so we should not do it in the memcg either;
these are optional allocations/charges.

I also disabled it for kernel faults triggered from within a syscall
(copy_*user, get_user_pages), which should just return -ENOMEM as usual
(unless it's nested inside a userspace fault). The only downside is
that we can't get around annotating userspace faults anymore, so every
architecture fault handler now passes FAULT_FLAG_USER to
handle_mm_fault(). This makes the series a little less self-contained,
but it's not unreasonable.

It's easy to detect leaks now by checking if the memcg OOM context is
set up and we are not returning VM_FAULT_OOM.

Here is a combined diff based on 3.2. azurIt, any chance you could
give this a shot? I tested it on my local machines, but you have a
known reproducer of fairly unlikely scenarios...

Thanks!
Johannes diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c index fadd5f8..fa6b4e4 100644 --- a/arch/alpha/mm/fault.c +++ b/arch/alpha/mm/fault.c @@ -89,6 +89,7 @@ do_page_fault(unsigned long address, unsigned long mmcsr, struct mm_struct *mm = current->mm; const struct exception_table_entry *fixup; int fault, si_code = SEGV_MAPERR; + unsigned long flags = 0; siginfo_t info; /* As of EV6, a load into $31/$f31 is a prefetch, and never faults @@ -142,10 +143,15 @@ do_page_fault(unsigned long address, unsigned long mmcsr, goto bad_area; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (cause > 0) + flags |= FAULT_FLAG_WRITE; + /* If for any reason at all we couldn't handle the fault, make sure we exit gracefully rather than endlessly redo the fault. */ - fault = handle_mm_fault(mm, vma, address, cause > 0 ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); up_read(&mm->mmap_sem); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c index aa33949..31b1e69 100644 --- a/arch/arm/mm/fault.c +++ b/arch/arm/mm/fault.c @@ -231,9 +231,10 @@ static inline bool access_error(unsigned int fsr, struct vm_area_struct *vma) static int __kprobes __do_page_fault(struct mm_struct *mm, unsigned long addr, unsigned int fsr, - struct task_struct *tsk) + struct task_struct *tsk, struct pt_regs *regs) { struct vm_area_struct *vma; + unsigned long flags = 0; int fault; vma = find_vma(mm, addr); @@ -253,11 +254,16 @@ good_area: goto out; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (fsr & FSR_WRITE) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, make * sure we exit gracefully rather than endlessly redo the fault. */ - fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, (fsr & FSR_WRITE) ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, flags); if (unlikely(fault & VM_FAULT_ERROR)) return fault; if (fault & VM_FAULT_MAJOR) @@ -320,7 +326,7 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs) #endif } - fault = __do_page_fault(mm, addr, fsr, tsk); + fault = __do_page_fault(mm, addr, fsr, tsk, regs); up_read(&mm->mmap_sem); perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr); diff --git a/arch/avr32/mm/fault.c b/arch/avr32/mm/fault.c index f7040a1..ada6237 100644 --- a/arch/avr32/mm/fault.c +++ b/arch/avr32/mm/fault.c @@ -59,6 +59,7 @@ asmlinkage void do_page_fault(unsigned long ecr, struct pt_regs *regs) struct mm_struct *mm; struct vm_area_struct *vma; const struct exception_table_entry *fixup; + unsigned long flags = 0; unsigned long address; unsigned long page; int writeaccess; @@ -127,12 +128,17 @@ good_area: panic("Unhandled case %lu in do_page_fault!", ecr); } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (writeaccess) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, make * sure we exit gracefully rather than endlessly redo the * fault. */ - fault = handle_mm_fault(mm, vma, address, writeaccess ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/cris/mm/fault.c b/arch/cris/mm/fault.c index 9dcac8e..35d096a 100644 --- a/arch/cris/mm/fault.c +++ b/arch/cris/mm/fault.c @@ -55,6 +55,7 @@ do_page_fault(unsigned long address, struct pt_regs *regs, struct task_struct *tsk; struct mm_struct *mm; struct vm_area_struct * vma; + unsigned long flags = 0; siginfo_t info; int fault; @@ -156,13 +157,18 @@ do_page_fault(unsigned long address, struct pt_regs *regs, goto bad_area; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (writeaccess & 1) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, (writeaccess & 1) ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/frv/mm/fault.c b/arch/frv/mm/fault.c index a325d57..2dbf219 100644 --- a/arch/frv/mm/fault.c +++ b/arch/frv/mm/fault.c @@ -35,6 +35,7 @@ asmlinkage void do_page_fault(int datammu, unsigned long esr0, unsigned long ear struct vm_area_struct *vma; struct mm_struct *mm; unsigned long _pme, lrai, lrad, fixup; + unsigned long flags = 0; siginfo_t info; pgd_t *pge; pud_t *pue; @@ -158,12 +159,17 @@ asmlinkage void do_page_fault(int datammu, unsigned long esr0, unsigned long ear break; } + if (user_mode(__frame)) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, ear0, write ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, ear0, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/hexagon/mm/vm_fault.c b/arch/hexagon/mm/vm_fault.c index c10b76f..e56baf3 100644 --- a/arch/hexagon/mm/vm_fault.c +++ b/arch/hexagon/mm/vm_fault.c @@ -52,6 +52,7 @@ void do_page_fault(unsigned long address, long cause, struct pt_regs *regs) siginfo_t info; int si_code = SEGV_MAPERR; int fault; + unsigned long flags = 0; const struct exception_table_entry *fixup; /* @@ -96,7 +97,12 @@ good_area: break; } - fault = handle_mm_fault(mm, vma, address, (cause > 0)); + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (cause > 0) + flags |= FAULT_FLAG_WRITE; + + fault = handle_mm_fault(mm, vma, address, flags); /* The most common case -- we are done. */ if (likely(!(fault & VM_FAULT_ERROR))) { diff --git a/arch/ia64/mm/fault.c b/arch/ia64/mm/fault.c index 20b3593..ad9ef9d 100644 --- a/arch/ia64/mm/fault.c +++ b/arch/ia64/mm/fault.c @@ -79,6 +79,7 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re int signal = SIGSEGV, code = SEGV_MAPERR; struct vm_area_struct *vma, *prev_vma; struct mm_struct *mm = current->mm; + unsigned long flags = 0; struct siginfo si; unsigned long mask; int fault; @@ -149,12 +150,17 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re if ((vma->vm_flags & mask) != mask) goto bad_area; + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (mask & VM_WRITE) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, make * sure we exit gracefully rather than endlessly redo the * fault. */ - fault = handle_mm_fault(mm, vma, address, (mask & VM_WRITE) ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { /* * We ran out of memory, or some other thing happened diff --git a/arch/m32r/mm/fault.c b/arch/m32r/mm/fault.c index 2c9aeb4..e74f6fa 100644 --- a/arch/m32r/mm/fault.c +++ b/arch/m32r/mm/fault.c @@ -79,6 +79,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long error_code, struct mm_struct *mm; struct vm_area_struct * vma; unsigned long page, addr; + unsigned long flags = 0; int write; int fault; siginfo_t info; @@ -188,6 +189,11 @@ good_area: if ((error_code & ACE_INSTRUCTION) && !(vma->vm_flags & VM_EXEC)) goto bad_area; + if (error_code & ACE_USERMODE) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo @@ -195,7 +201,7 @@ good_area: */ addr = (address & PAGE_MASK); set_thread_fault_code(error_code); - fault = handle_mm_fault(mm, vma, addr, write ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, addr, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c index 2db6099..ab88a91 100644 --- a/arch/m68k/mm/fault.c +++ b/arch/m68k/mm/fault.c @@ -73,6 +73,7 @@ int do_page_fault(struct pt_regs *regs, unsigned long address, { struct mm_struct *mm = current->mm; struct vm_area_struct * vma; + unsigned long flags = 0; int write, fault; #ifdef DEBUG @@ -134,13 +135,18 @@ good_area: goto acc_err; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, write ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); #ifdef DEBUG printk("handle_mm_fault returns %d\n",fault); #endif diff --git a/arch/microblaze/mm/fault.c b/arch/microblaze/mm/fault.c index ae97d2c..b002612 100644 --- a/arch/microblaze/mm/fault.c +++ b/arch/microblaze/mm/fault.c @@ -89,6 +89,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long address, { struct vm_area_struct *vma; struct mm_struct *mm = current->mm; + unsigned long flags = 0; siginfo_t info; int code = SEGV_MAPERR; int is_write = error_code & ESR_S; @@ -206,12 +207,17 @@ good_area: goto bad_area; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (is_write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, is_write ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c index 937cf33..e5b9fed 100644 --- a/arch/mips/mm/fault.c +++ b/arch/mips/mm/fault.c @@ -40,6 +40,7 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs, unsigned long writ struct task_struct *tsk = current; struct mm_struct *mm = tsk->mm; const int field = sizeof(unsigned long) * 2; + unsigned long flags = 0; siginfo_t info; int fault; @@ -139,12 +140,17 @@ good_area: } } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, write ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) diff --git a/arch/mn10300/mm/fault.c b/arch/mn10300/mm/fault.c index 0945409..031be56 100644 --- a/arch/mn10300/mm/fault.c +++ b/arch/mn10300/mm/fault.c @@ -121,6 +121,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long fault_code, { struct vm_area_struct *vma; struct task_struct *tsk; + unsigned long flags = 0; struct mm_struct *mm; unsigned long page; siginfo_t info; @@ -247,12 +248,17 @@ good_area: break; } + if ((fault_code & MMUFCR_xFC_ACCESS) == MMUFCR_xFC_ACCESS_USR) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, write ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; @@ -329,9 +335,10 @@ no_context: */ out_of_memory: up_read(&mm->mmap_sem); - printk(KERN_ALERT "VM: killing process %s\n", tsk->comm); - if ((fault_code & MMUFCR_xFC_ACCESS) == MMUFCR_xFC_ACCESS_USR) - do_exit(SIGKILL); + if ((fault_code & MMUFCR_xFC_ACCESS) == MMUFCR_xFC_ACCESS_USR) { + pagefault_out_of_memory(); + return; + } goto no_context; do_sigbus: diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c index a5dce82..d586119 100644 --- a/arch/openrisc/mm/fault.c +++ b/arch/openrisc/mm/fault.c @@ -52,6 +52,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address, struct task_struct *tsk; struct mm_struct *mm; struct vm_area_struct *vma; + unsigned long flags = 0; siginfo_t info; int fault; @@ -153,13 +154,18 @@ good_area: if ((vector == 0x400) && !(vma->vm_page_prot.pgprot & _PAGE_EXEC)) goto bad_area; + if (user_mode(regs)) + 
flags |= FAULT_FLAG_USER; + if (write_acc) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, write_acc); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; @@ -246,10 +252,10 @@ out_of_memory: __asm__ __volatile__("l.nop 1"); up_read(&mm->mmap_sem); - printk("VM: killing process %s\n", tsk->comm); - if (user_mode(regs)) - do_exit(SIGKILL); - goto no_context; + if (!user_mode(regs)) + goto no_context; + pagefault_out_of_memory(); + return; do_sigbus: up_read(&mm->mmap_sem); diff --git a/arch/parisc/mm/fault.c b/arch/parisc/mm/fault.c index 18162ce..a151e87 100644 --- a/arch/parisc/mm/fault.c +++ b/arch/parisc/mm/fault.c @@ -173,6 +173,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long code, struct vm_area_struct *vma, *prev_vma; struct task_struct *tsk = current; struct mm_struct *mm = tsk->mm; + unsigned long flags = 0; unsigned long acc_type; int fault; @@ -195,13 +196,18 @@ good_area: if ((vma->vm_flags & acc_type) != acc_type) goto bad_area; + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (acc_type & VM_WRITE) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, make * sure we exit gracefully rather than endlessly redo the * fault. */ - fault = handle_mm_fault(mm, vma, address, (acc_type & VM_WRITE) ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { /* * We hit a shared mapping outside of the file, or some diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c index 5efe8c9..2bf339c 100644 --- a/arch/powerpc/mm/fault.c +++ b/arch/powerpc/mm/fault.c @@ -122,6 +122,7 @@ int __kprobes do_page_fault(struct pt_regs *regs, unsigned long address, { struct vm_area_struct * vma; struct mm_struct *mm = current->mm; + unsigned long flags = 0; siginfo_t info; int code = SEGV_MAPERR; int is_write = 0, ret; @@ -305,12 +306,17 @@ good_area: goto bad_area; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (is_write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - ret = handle_mm_fault(mm, vma, address, is_write ? FAULT_FLAG_WRITE : 0); + ret = handle_mm_fault(mm, vma, address, flags); if (unlikely(ret & VM_FAULT_ERROR)) { if (ret & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c index a9a3018..fe6109c 100644 --- a/arch/s390/mm/fault.c +++ b/arch/s390/mm/fault.c @@ -301,6 +301,8 @@ static inline int do_exception(struct pt_regs *regs, int access, address = trans_exc_code & __FAIL_ADDR_MASK; perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); flags = FAULT_FLAG_ALLOW_RETRY; + if (regs->psw.mask & PSW_MASK_PSTATE) + flags |= FAULT_FLAG_USER; if (access == VM_WRITE || (trans_exc_code & store_indication) == 0x400) flags |= FAULT_FLAG_WRITE; down_read(&mm->mmap_sem); diff --git a/arch/score/mm/fault.c b/arch/score/mm/fault.c index 47b600e..2ca5ae5 100644 --- a/arch/score/mm/fault.c +++ b/arch/score/mm/fault.c @@ -47,6 +47,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long write, struct task_struct *tsk = current; struct mm_struct *mm = tsk->mm; const int field = sizeof(unsigned long) * 2; + unsigned long flags = 0; 
siginfo_t info; int fault; @@ -101,12 +102,16 @@ good_area: } survive: + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, write); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; @@ -172,10 +177,10 @@ out_of_memory: down_read(&mm->mmap_sem); goto survive; } - printk("VM: killing process %s\n", tsk->comm); - if (user_mode(regs)) - do_group_exit(SIGKILL); - goto no_context; + if (!user_mode(regs)) + goto no_context; + pagefault_out_of_memory(); + return; do_sigbus: up_read(&mm->mmap_sem); diff --git a/arch/sh/mm/fault_32.c b/arch/sh/mm/fault_32.c index 7bebd04..a61b803 100644 --- a/arch/sh/mm/fault_32.c +++ b/arch/sh/mm/fault_32.c @@ -126,6 +126,7 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs, struct task_struct *tsk; struct mm_struct *mm; struct vm_area_struct * vma; + unsigned long flags = 0; int si_code; int fault; siginfo_t info; @@ -195,12 +196,17 @@ good_area: goto bad_area; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (writeaccess) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, writeaccess ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/sh/mm/tlbflush_64.c b/arch/sh/mm/tlbflush_64.c index e3430e0..0a9d645 100644 --- a/arch/sh/mm/tlbflush_64.c +++ b/arch/sh/mm/tlbflush_64.c @@ -96,6 +96,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long writeaccess, struct mm_struct *mm; struct vm_area_struct * vma; const struct exception_table_entry *fixup; + unsigned long flags = 0; pte_t *pte; int fault; @@ -184,12 +185,17 @@ good_area: } } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (writeaccess) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, writeaccess ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/sparc/mm/fault_32.c b/arch/sparc/mm/fault_32.c index 8023fd7..efa3d48 100644 --- a/arch/sparc/mm/fault_32.c +++ b/arch/sparc/mm/fault_32.c @@ -222,6 +222,7 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write, struct vm_area_struct *vma; struct task_struct *tsk = current; struct mm_struct *mm = tsk->mm; + unsigned long flags = 0; unsigned int fixup; unsigned long g2; int from_user = !(regs->psr & PSR_PS); @@ -285,12 +286,17 @@ good_area: goto bad_area; } + if (from_user) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, write ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c index 504c062..bc536ea 100644 --- a/arch/sparc/mm/fault_64.c +++ b/arch/sparc/mm/fault_64.c @@ -276,6 +276,7 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs) { struct mm_struct *mm = current->mm; struct vm_area_struct *vma; + unsigned long flags = 0; unsigned int insn = 0; int si_code, fault_code, fault; unsigned long address, mm_rss; @@ -423,7 +424,12 @@ good_area: goto bad_area; } - fault = handle_mm_fault(mm, vma, address, (fault_code & FAULT_CODE_WRITE) ? FAULT_FLAG_WRITE : 0); + if (!(regs->tstate & TSTATE_PRIV)) + flags |= FAULT_FLAG_USER; + if (fault_code & FAULT_CODE_WRITE) + flags |= FAULT_FLAG_WRITE; + + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/tile/mm/fault.c b/arch/tile/mm/fault.c index 25b7b90..b2a7fd5 100644 --- a/arch/tile/mm/fault.c +++ b/arch/tile/mm/fault.c @@ -263,6 +263,7 @@ static int handle_page_fault(struct pt_regs *regs, struct mm_struct *mm; struct vm_area_struct *vma; unsigned long stack_offset; + unsigned long flags = 0; int fault; int si_code; int is_kernel_mode; @@ -415,12 +416,16 @@ good_area: } survive: + if (!is_kernel_mode) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. 
*/ - fault = handle_mm_fault(mm, vma, address, write); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; @@ -540,10 +545,10 @@ out_of_memory: down_read(&mm->mmap_sem); goto survive; } - pr_alert("VM: killing process %s\n", tsk->comm); - if (!is_kernel_mode) - do_group_exit(SIGKILL); - goto no_context; + if (is_kernel_mode) + goto no_context; + pagefault_out_of_memory(); + return 0; do_sigbus: up_read(&mm->mmap_sem); diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c index dafc947..626a85e 100644 --- a/arch/um/kernel/trap.c +++ b/arch/um/kernel/trap.c @@ -25,6 +25,7 @@ int handle_page_fault(unsigned long address, unsigned long ip, { struct mm_struct *mm = current->mm; struct vm_area_struct *vma; + unsigned long flags = 0; pgd_t *pgd; pud_t *pud; pmd_t *pmd; @@ -62,10 +63,15 @@ good_area: if (!is_write && !(vma->vm_flags & (VM_READ | VM_EXEC))) goto out; + if (is_user) + flags |= FAULT_FLAG_USER; + if (is_write) + flags |= FAULT_FLAG_WRITE; + do { int fault; - fault = handle_mm_fault(mm, vma, address, is_write ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) { goto out_of_memory; diff --git a/arch/unicore32/mm/fault.c b/arch/unicore32/mm/fault.c index 283aa4b..3026943 100644 --- a/arch/unicore32/mm/fault.c +++ b/arch/unicore32/mm/fault.c @@ -169,9 +169,10 @@ static inline bool access_error(unsigned int fsr, struct vm_area_struct *vma) } static int __do_pf(struct mm_struct *mm, unsigned long addr, unsigned int fsr, - struct task_struct *tsk) + struct task_struct *tsk, struct pt_regs *regs) { struct vm_area_struct *vma; + unsigned long flags = 0; int fault; vma = find_vma(mm, addr); @@ -191,12 +192,16 @@ good_area: goto out; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (!(fsr ^ 0x12)) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, make * sure we exit gracefully rather than endlessly redo the fault. */ - fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, - (!(fsr ^ 0x12)) ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, flags); if (unlikely(fault & VM_FAULT_ERROR)) return fault; if (fault & VM_FAULT_MAJOR) @@ -252,7 +257,7 @@ static int do_pf(unsigned long addr, unsigned int fsr, struct pt_regs *regs) #endif } - fault = __do_pf(mm, addr, fsr, tsk); + fault = __do_pf(mm, addr, fsr, tsk, regs); up_read(&mm->mmap_sem); /* diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 5db0490..90248c9 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -846,17 +846,6 @@ static noinline int mm_fault_error(struct pt_regs *regs, unsigned long error_code, unsigned long address, unsigned int fault) { - /* - * Pagefault was interrupted by SIGKILL. We have no reason to - * continue pagefault. 
- */ - if (fatal_signal_pending(current)) { - if (!(fault & VM_FAULT_RETRY)) - up_read(&current->mm->mmap_sem); - if (!(error_code & PF_USER)) - no_context(regs, error_code, address); - return 1; - } if (!(fault & VM_FAULT_ERROR)) return 0; @@ -999,8 +988,7 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code) struct mm_struct *mm; int fault; int write = error_code & PF_WRITE; - unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE | - (write ? FAULT_FLAG_WRITE : 0); + unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE; tsk = current; mm = tsk->mm; @@ -1160,6 +1148,11 @@ good_area: return; } + if (error_code & PF_USER) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo diff --git a/arch/xtensa/mm/fault.c b/arch/xtensa/mm/fault.c index e367e30..7db9fbe 100644 --- a/arch/xtensa/mm/fault.c +++ b/arch/xtensa/mm/fault.c @@ -41,6 +41,7 @@ void do_page_fault(struct pt_regs *regs) struct mm_struct *mm = current->mm; unsigned int exccause = regs->exccause; unsigned int address = regs->excvaddr; + unsigned long flags = 0; siginfo_t info; int is_write, is_exec; @@ -101,11 +102,16 @@ good_area: if (!(vma->vm_flags & (VM_READ | VM_WRITE))) goto bad_area; + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (is_write) + flags |= FAULT_FLAG_WRITE; + /* If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, is_write ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index b87068a..b92e5e7 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -120,6 +120,17 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page); extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p); +static inline unsigned int mem_cgroup_xchg_may_oom(struct task_struct *p, + unsigned int new) +{ + unsigned int old; + + old = p->memcg_oom.may_oom; + p->memcg_oom.may_oom = new; + + return old; +} +bool mem_cgroup_oom_synchronize(void); #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP extern int do_swap_account; #endif @@ -333,6 +344,17 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p) { } +static inline unsigned int mem_cgroup_xchg_may_oom(struct task_struct *p, + unsigned int new) +{ + return 0; +} + +static inline bool mem_cgroup_oom_synchronize(void) +{ + return false; +} + static inline void mem_cgroup_inc_page_stat(struct page *page, enum mem_cgroup_page_stat_item idx) { diff --git a/include/linux/mm.h b/include/linux/mm.h index 4baadd1..846b82b 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -156,6 +156,7 @@ extern pgprot_t protection_map[16]; #define FAULT_FLAG_ALLOW_RETRY 0x08 /* Retry fault if blocking */ #define FAULT_FLAG_RETRY_NOWAIT 0x10 /* Don't drop mmap_sem and wait when retrying */ #define FAULT_FLAG_KILLABLE 0x20 /* The fault task is in SIGKILL killable region */ +#define FAULT_FLAG_USER 0x40 /* The fault originated in userspace */ /* * This interface is used by x86 PAT code to identify a pfn mapping that is diff --git a/include/linux/sched.h b/include/linux/sched.h index 1c4f3e9..a77d198 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -91,6 +91,7 @@ struct sched_param { #include #include #include +#include 
#include @@ -1568,6 +1569,14 @@ struct task_struct { unsigned long nr_pages; /* uncharged usage */ unsigned long memsw_nr_pages; /* uncharged mem+swap usage */ } memcg_batch; + struct memcg_oom_info { + unsigned int may_oom:1; + unsigned int in_memcg_oom:1; + struct stack_trace trace; + unsigned long trace_entries[16]; + int wakeups; + struct mem_cgroup *wait_on_memcg; + } memcg_oom; #endif #ifdef CONFIG_HAVE_HW_BREAKPOINT atomic_t ptrace_bp_refcnt; diff --git a/mm/filemap.c b/mm/filemap.c index 5f0a3c9..d18bd47 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1660,6 +1660,7 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf) struct file_ra_state *ra = &file->f_ra; struct inode *inode = mapping->host; pgoff_t offset = vmf->pgoff; + unsigned int may_oom; struct page *page; pgoff_t size; int ret = 0; @@ -1669,7 +1670,12 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf) return VM_FAULT_SIGBUS; /* - * Do we have something in the page cache already? + * Do we have something in the page cache already? Either + * way, try readahead, but disable the memcg OOM killer for it + * as readahead is optional and no errors are propagated up + * the fault stack, which does not allow proper unwinding of a + * memcg OOM state. The OOM killer is enabled while trying to + * instantiate the faulting page individually below. */ page = find_get_page(mapping, offset); if (likely(page)) { @@ -1677,10 +1683,14 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf) * We found the page, so try async readahead before * waiting for the lock. */ + may_oom = mem_cgroup_xchg_may_oom(current, 0); do_async_mmap_readahead(vma, ra, file, page, offset); + mem_cgroup_xchg_may_oom(current, may_oom); } else { - /* No page in the page cache at all */ + /* No page in the page cache at all. 
*/ + may_oom = mem_cgroup_xchg_may_oom(current, 0); do_sync_mmap_readahead(vma, ra, file, offset); + mem_cgroup_xchg_may_oom(current, may_oom); count_vm_event(PGMAJFAULT); mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT); ret = VM_FAULT_MAJOR; diff --git a/mm/ksm.c b/mm/ksm.c index 310544a..ae7e4ae 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -338,7 +338,7 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr) break; if (PageKsm(page)) ret = handle_mm_fault(vma->vm_mm, vma, addr, - FAULT_FLAG_WRITE); + FAULT_FLAG_WRITE); else ret = VM_FAULT_WRITE; put_page(page); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index b63f5f7..c47c77e 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -49,6 +49,7 @@ #include #include #include +#include #include "internal.h" #include @@ -249,6 +250,7 @@ struct mem_cgroup { bool oom_lock; atomic_t under_oom; + atomic_t oom_wakeups; atomic_t refcnt; @@ -1846,6 +1848,7 @@ static int memcg_oom_wake_function(wait_queue_t *wait, static void memcg_wakeup_oom(struct mem_cgroup *memcg) { + atomic_inc(&memcg->oom_wakeups); /* for filtering, pass "memcg" as argument. */ __wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg); } @@ -1857,30 +1860,26 @@ static void memcg_oom_recover(struct mem_cgroup *memcg) } /* - * try to call OOM killer. returns false if we should exit memory-reclaim loop. 
+ * try to call OOM killer */ -bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask) +static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask) { - struct oom_wait_info owait; - bool locked, need_to_kill; + bool locked, need_to_kill = true; - owait.mem = memcg; - owait.wait.flags = 0; - owait.wait.func = memcg_oom_wake_function; - owait.wait.private = current; - INIT_LIST_HEAD(&owait.wait.task_list); - need_to_kill = true; - mem_cgroup_mark_under_oom(memcg); + if (!current->memcg_oom.may_oom) + return; + + current->memcg_oom.in_memcg_oom = 1; + + current->memcg_oom.trace.nr_entries = 0; + current->memcg_oom.trace.max_entries = 16; + current->memcg_oom.trace.entries = current->memcg_oom.trace_entries; + current->memcg_oom.trace.skip = 1; + save_stack_trace(&current->memcg_oom.trace); /* At first, try to OOM lock hierarchy under memcg.*/ spin_lock(&memcg_oom_lock); locked = mem_cgroup_oom_lock(memcg); - /* - * Even if signal_pending(), we can't quit charge() loop without - * accounting. So, UNINTERRUPTIBLE is appropriate. But SIGKILL - * under OOM is always welcomed, use TASK_KILLABLE here. - */ - prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE); if (!locked || memcg->oom_kill_disable) need_to_kill = false; if (locked) @@ -1888,24 +1887,86 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask) spin_unlock(&memcg_oom_lock); if (need_to_kill) { - finish_wait(&memcg_oom_waitq, &owait.wait); mem_cgroup_out_of_memory(memcg, mask); } else { - schedule(); - finish_wait(&memcg_oom_waitq, &owait.wait); + /* + * A system call can just return -ENOMEM, but if this + * is a page fault and somebody else is handling the + * OOM already, we need to sleep on the OOM waitqueue + * for this memcg until the situation is resolved. + * Which can take some time because it might be + * handled by a userspace task. 
+ * + * However, this is the charge context, which means + * that we may sit on a large call stack and hold + * various filesystem locks, the mmap_sem etc. and we + * don't want the OOM handler to deadlock on them + * while we sit here and wait. Store the current OOM + * context in the task_struct, then return -ENOMEM. + * At the end of the page fault handler, with the + * stack unwound, pagefault_out_of_memory() will check + * back with us by calling + * mem_cgroup_oom_synchronize(), possibly putting the + * task to sleep. + */ + mem_cgroup_mark_under_oom(memcg); + current->memcg_oom.wakeups = atomic_read(&memcg->oom_wakeups); + css_get(&memcg->css); + current->memcg_oom.wait_on_memcg = memcg; } - spin_lock(&memcg_oom_lock); - if (locked) + + if (locked) { + spin_lock(&memcg_oom_lock); mem_cgroup_oom_unlock(memcg); - memcg_wakeup_oom(memcg); - spin_unlock(&memcg_oom_lock); + /* + * Sleeping tasks might have been killed, make sure + * they get scheduled so they can exit. + */ + if (need_to_kill) + memcg_oom_recover(memcg); + spin_unlock(&memcg_oom_lock); + } +} - mem_cgroup_unmark_under_oom(memcg); +bool mem_cgroup_oom_synchronize(void) +{ + struct oom_wait_info owait; + struct mem_cgroup *memcg; - if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current)) + /* OOM is global, do not handle */ + if (!current->memcg_oom.in_memcg_oom) return false; - /* Give chance to dying process */ - schedule_timeout_uninterruptible(1); + + /* + * We invoked the OOM killer but there is a chance that a kill + * did not free up any charges. Everybody else might already + * be sleeping, so restart the fault and keep the rampage + * going until some charges are released. 
+ */ + memcg = current->memcg_oom.wait_on_memcg; + if (!memcg) + goto out; + + if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current)) + goto out_put; + + owait.mem = memcg; + owait.wait.flags = 0; + owait.wait.func = memcg_oom_wake_function; + owait.wait.private = current; + INIT_LIST_HEAD(&owait.wait.task_list); + + prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE); + /* Only sleep if we didn't miss any wakeups since OOM */ + if (atomic_read(&memcg->oom_wakeups) == current->memcg_oom.wakeups) + schedule(); + finish_wait(&memcg_oom_waitq, &owait.wait); +out_put: + mem_cgroup_unmark_under_oom(memcg); + css_put(&memcg->css); + current->memcg_oom.wait_on_memcg = NULL; +out: + current->memcg_oom.in_memcg_oom = 0; return true; } @@ -2195,11 +2256,10 @@ enum { CHARGE_RETRY, /* need to retry but retry is not bad */ CHARGE_NOMEM, /* we can't do more. return -ENOMEM */ CHARGE_WOULDBLOCK, /* GFP_WAIT wasn't set and no enough res. */ - CHARGE_OOM_DIE, /* the current is killed because of OOM */ }; static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, - unsigned int nr_pages, bool oom_check) + unsigned int nr_pages, bool invoke_oom) { unsigned long csize = nr_pages * PAGE_SIZE; struct mem_cgroup *mem_over_limit; @@ -2257,14 +2317,10 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, if (mem_cgroup_wait_acct_move(mem_over_limit)) return CHARGE_RETRY; - /* If we don't need to call oom-killer at el, return immediately */ - if (!oom_check) - return CHARGE_NOMEM; - /* check OOM */ - if (!mem_cgroup_handle_oom(mem_over_limit, gfp_mask)) - return CHARGE_OOM_DIE; + if (invoke_oom) + mem_cgroup_oom(mem_over_limit, gfp_mask); - return CHARGE_RETRY; + return CHARGE_NOMEM; } /* @@ -2349,7 +2405,7 @@ again: } do { - bool oom_check; + bool invoke_oom = oom && !nr_oom_retries; /* If killed, bypass charge */ if (fatal_signal_pending(current)) { @@ -2357,13 +2413,7 @@ again: goto bypass; } - oom_check = false; - if (oom && 
!nr_oom_retries) { - oom_check = true; - nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES; - } - - ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, oom_check); + ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, invoke_oom); switch (ret) { case CHARGE_OK: break; @@ -2376,16 +2426,12 @@ again: css_put(&memcg->css); goto nomem; case CHARGE_NOMEM: /* OOM routine works */ - if (!oom) { + if (!oom || invoke_oom) { css_put(&memcg->css); goto nomem; } - /* If oom, we never return -ENOMEM */ nr_oom_retries--; break; - case CHARGE_OOM_DIE: /* Killed by OOM Killer */ - css_put(&memcg->css); - goto bypass; } } while (ret != CHARGE_OK); diff --git a/mm/memory.c b/mm/memory.c index 829d437..fc6d741 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -57,6 +57,7 @@ #include #include #include +#include #include #include @@ -3439,22 +3440,14 @@ unlock: /* * By the time we get here, we already hold the mm semaphore */ -int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, - unsigned long address, unsigned int flags) +static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, unsigned int flags) { pgd_t *pgd; pud_t *pud; pmd_t *pmd; pte_t *pte; - __set_current_state(TASK_RUNNING); - - count_vm_event(PGFAULT); - mem_cgroup_count_vm_event(mm, PGFAULT); - - /* do counter updates before entering really critical section. 
*/ - check_sync_rss_stat(current); - if (unlikely(is_vm_hugetlb_page(vma))) return hugetlb_fault(mm, vma, address, flags); @@ -3503,6 +3496,39 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, return handle_pte_fault(mm, vma, address, pte, pmd, flags); } +int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, unsigned int flags) +{ + int userfault = flags & FAULT_FLAG_USER; + int ret; + + __set_current_state(TASK_RUNNING); + + count_vm_event(PGFAULT); + mem_cgroup_count_vm_event(mm, PGFAULT); + + /* do counter updates before entering really critical section. */ + check_sync_rss_stat(current); + + if (userfault) + WARN_ON(mem_cgroup_xchg_may_oom(current, 1) == 1); + + ret = __handle_mm_fault(mm, vma, address, flags); + + if (userfault) + WARN_ON(mem_cgroup_xchg_may_oom(current, 0) == 0); + +#ifdef CONFIG_CGROUP_MEM_RES_CTLR + if (WARN(!(ret & VM_FAULT_OOM) && current->memcg_oom.in_memcg_oom, + "Fixing unhandled memcg OOM context, set up from:\n")) { + print_stack_trace(&current->memcg_oom.trace, 0); + mem_cgroup_oom_synchronize(); + } +#endif + + return ret; +} + #ifndef __PAGETABLE_PUD_FOLDED /* * Allocate page upper directory. diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 069b64e..aa60863 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -785,6 +785,8 @@ out: */ void pagefault_out_of_memory(void) { + if (mem_cgroup_oom_synchronize()) + return; if (try_set_system_oom()) { out_of_memory(NULL, 0, 0, NULL); clear_system_oom();
From mboxrd@z Thu Jan 1 00:00:00 1970 From: Johannes Weiner Subject: [patch 1/5] mm: invoke oom-killer from remaining unconverted page fault handlers Date: Fri, 19 Jul 2013 00:22:38 -0400 Message-ID: <20130719042238.GD17812@cmpxchg.org> References: <20130710182506.F25DF461@pobox.sk> <20130711072507.GA21667@dhcp22.suse.cz> <20130714012641.C2DA4E05@pobox.sk> <20130714015112.FFCB7AF7@pobox.sk> <20130715154119.GA32435@dhcp22.suse.cz> <20130715160006.GB32435@dhcp22.suse.cz> <20130716153544.GX17812@cmpxchg.org> <20130716160905.GA20018@dhcp22.suse.cz> <20130716164830.GZ17812@cmpxchg.org> <20130719042124.GC17812@cmpxchg.org> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20130719042124.GC17812-druUgvl0LCNAfugRpC6u6w@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Michal Hocko Cc: azurIt , linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist , KAMEZAWA Hiroyuki , righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org [already upstream, included for 3.2 reference] A few remaining architectures directly kill the page faulting task in an out of memory situation. This is usually not a good idea since that task might not even use a significant amount of memory and so may not be the optimal victim to resolve the situation. Since 2.6.29's 1c0fe6e ("mm: invoke oom-killer from page fault") there is a hook that architecture page fault handlers are supposed to call to invoke the OOM killer and let it pick the right task to kill. Convert the remaining architectures over to this hook. To have the previous behavior of simply taking out the faulting task the vm.oom_kill_allocating_task sysctl can be set to 1.
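For illustration only (a standalone userspace sketch, not part of the patch): after the conversion, the out_of_memory label in each handler reduces to a two-way decision on who faulted. The names classify_oom_fault, OOM_NO_CONTEXT and OOM_INVOKE_KILLER are hypothetical and exist only in this example; in the real handlers the two outcomes are "goto no_context" and a call to pagefault_out_of_memory().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcome labels, used by this sketch only. */
enum oom_fault_action {
	OOM_NO_CONTEXT,     /* kernel-mode fault: take the no_context fixup path */
	OOM_INVOKE_KILLER,  /* user-mode fault: call pagefault_out_of_memory() */
};

/*
 * Models the converted out_of_memory label: the faulting task is no
 * longer killed on the spot; user-mode faults defer to the OOM killer,
 * which picks the victim.
 */
enum oom_fault_action classify_oom_fault(bool user_mode)
{
	if (!user_mode)
		return OOM_NO_CONTEXT;
	return OOM_INVOKE_KILLER;
}

/* Small usage check for the two cases. */
void classify_oom_fault_selfcheck(void)
{
	assert(classify_oom_fault(true) == OOM_INVOKE_KILLER);
	assert(classify_oom_fault(false) == OOM_NO_CONTEXT);
}
```

The sketch deliberately leaves out victim selection; that stays inside the OOM killer, which is the point of routing the fault through the hook.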
Signed-off-by: Johannes Weiner Reviewed-by: Michal Hocko Cc: KAMEZAWA Hiroyuki Acked-by: David Rientjes Acked-by: Vineet Gupta [arch/arc bits] Cc: James Hogan Cc: David Howells Cc: Jonas Bonn Cc: Chen Liqin Cc: Lennox Wu Cc: Chris Metcalf Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/mn10300/mm/fault.c | 7 ++++--- arch/openrisc/mm/fault.c | 8 ++++---- arch/score/mm/fault.c | 8 ++++---- arch/tile/mm/fault.c | 8 ++++---- 4 files changed, 16 insertions(+), 15 deletions(-) diff --git a/arch/mn10300/mm/fault.c b/arch/mn10300/mm/fault.c index 0945409..5ac4df5 100644 --- a/arch/mn10300/mm/fault.c +++ b/arch/mn10300/mm/fault.c @@ -329,9 +329,10 @@ no_context: */ out_of_memory: up_read(&mm->mmap_sem); - printk(KERN_ALERT "VM: killing process %s\n", tsk->comm); - if ((fault_code & MMUFCR_xFC_ACCESS) == MMUFCR_xFC_ACCESS_USR) - do_exit(SIGKILL); + if ((fault_code & MMUFCR_xFC_ACCESS) == MMUFCR_xFC_ACCESS_USR) { + pagefault_out_of_memory(); + return; + } goto no_context; do_sigbus: diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c index a5dce82..d78881c 100644 --- a/arch/openrisc/mm/fault.c +++ b/arch/openrisc/mm/fault.c @@ -246,10 +246,10 @@ out_of_memory: __asm__ __volatile__("l.nop 1"); up_read(&mm->mmap_sem); - printk("VM: killing process %s\n", tsk->comm); - if (user_mode(regs)) - do_exit(SIGKILL); - goto no_context; + if (!user_mode(regs)) + goto no_context; + pagefault_out_of_memory(); + return; do_sigbus: up_read(&mm->mmap_sem); diff --git a/arch/score/mm/fault.c b/arch/score/mm/fault.c index 47b600e..6b18fb0 100644 --- a/arch/score/mm/fault.c +++ b/arch/score/mm/fault.c @@ -172,10 +172,10 @@ out_of_memory: down_read(&mm->mmap_sem); goto survive; } - printk("VM: killing process %s\n", tsk->comm); - if (user_mode(regs)) - do_group_exit(SIGKILL); - goto no_context; + if (!user_mode(regs)) + goto no_context; + pagefault_out_of_memory(); + return; do_sigbus: up_read(&mm->mmap_sem); diff --git a/arch/tile/mm/fault.c 
b/arch/tile/mm/fault.c index 25b7b90..3312531 100644 --- a/arch/tile/mm/fault.c +++ b/arch/tile/mm/fault.c @@ -540,10 +540,10 @@ out_of_memory: down_read(&mm->mmap_sem); goto survive; } - pr_alert("VM: killing process %s\n", tsk->comm); - if (!is_kernel_mode) - do_group_exit(SIGKILL); - goto no_context; + if (is_kernel_mode) + goto no_context; + pagefault_out_of_memory(); + return 0; do_sigbus: up_read(&mm->mmap_sem); -- 1.8.3.2 From mboxrd@z Thu Jan 1 00:00:00 1970 From: Johannes Weiner Subject: [patch 2/5] mm: pass userspace fault flag to generic fault handler Date: Fri, 19 Jul 2013 00:24:24 -0400 Message-ID: <20130719042424.GE17812@cmpxchg.org> References: <20130710182506.F25DF461@pobox.sk> <20130711072507.GA21667@dhcp22.suse.cz> <20130714012641.C2DA4E05@pobox.sk> <20130714015112.FFCB7AF7@pobox.sk> <20130715154119.GA32435@dhcp22.suse.cz> <20130715160006.GB32435@dhcp22.suse.cz> <20130716153544.GX17812@cmpxchg.org> <20130716160905.GA20018@dhcp22.suse.cz> <20130716164830.GZ17812@cmpxchg.org> <20130719042124.GC17812@cmpxchg.org> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20130719042124.GC17812-druUgvl0LCNAfugRpC6u6w@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Michal Hocko Cc: azurIt , linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist , KAMEZAWA Hiroyuki , righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org The global OOM killer is (XXX: for most architectures) only invoked for userspace faults, not for faults from kernelspace (uaccess, gup). Memcg OOM handling is currently invoked for all faults. Allow it to behave like the global case by having the architectures pass a flag to the generic fault handler code that identifies userspace faults. 
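For illustration only (a standalone userspace sketch, not part of the patch): the per-architecture change amounts to building a flags word before calling handle_mm_fault(), and the generic code can then test for the userspace bit. The bit values below are placeholders chosen for the example, and build_fault_flags() and memcg_oom_allowed() are hypothetical helper names; the real definitions live in include/linux/mm.h and may differ.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder bit values for this sketch; not taken from the patch. */
#define FAULT_FLAG_WRITE 0x01u
#define FAULT_FLAG_USER  0x40u

/* The two flags must occupy distinct bits for the tests below to work. */
static_assert((FAULT_FLAG_WRITE & FAULT_FLAG_USER) == 0,
	      "flag bits must not overlap");

/* Arch side: build the flags word the way the converted handlers do. */
unsigned int build_fault_flags(bool user_mode, bool write)
{
	unsigned int flags = 0;

	if (user_mode)
		flags |= FAULT_FLAG_USER;
	if (write)
		flags |= FAULT_FLAG_WRITE;
	return flags;
}

/* Generic side: only userspace faults may enter memcg OOM handling. */
bool memcg_oom_allowed(unsigned int flags)
{
	return (flags & FAULT_FLAG_USER) != 0;
}
```

With this shape, a kernel-mode write fault (uaccess, gup) yields only FAULT_FLAG_WRITE, so memcg_oom_allowed() is false and the charge can simply fail with -ENOMEM.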
Signed-off-by: Johannes Weiner --- arch/alpha/mm/fault.c | 8 +++++++- arch/arm/mm/fault.c | 12 +++++++++--- arch/avr32/mm/fault.c | 8 +++++++- arch/cris/mm/fault.c | 8 +++++++- arch/frv/mm/fault.c | 8 +++++++- arch/hexagon/mm/vm_fault.c | 8 +++++++- arch/ia64/mm/fault.c | 8 +++++++- arch/m32r/mm/fault.c | 8 +++++++- arch/m68k/mm/fault.c | 8 +++++++- arch/microblaze/mm/fault.c | 8 +++++++- arch/mips/mm/fault.c | 8 +++++++- arch/mn10300/mm/fault.c | 8 +++++++- arch/openrisc/mm/fault.c | 8 +++++++- arch/parisc/mm/fault.c | 8 +++++++- arch/powerpc/mm/fault.c | 8 +++++++- arch/s390/mm/fault.c | 2 ++ arch/score/mm/fault.c | 7 ++++++- arch/sh/mm/fault_32.c | 8 +++++++- arch/sh/mm/tlbflush_64.c | 8 +++++++- arch/sparc/mm/fault_32.c | 8 +++++++- arch/sparc/mm/fault_64.c | 8 +++++++- arch/tile/mm/fault.c | 7 ++++++- arch/um/kernel/trap.c | 8 +++++++- arch/unicore32/mm/fault.c | 13 +++++++++---- arch/x86/mm/fault.c | 8 ++++++-- arch/xtensa/mm/fault.c | 8 +++++++- include/linux/mm.h | 1 + 27 files changed, 179 insertions(+), 31 deletions(-) diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c index fadd5f8..fa6b4e4 100644 --- a/arch/alpha/mm/fault.c +++ b/arch/alpha/mm/fault.c @@ -89,6 +89,7 @@ do_page_fault(unsigned long address, unsigned long mmcsr, struct mm_struct *mm = current->mm; const struct exception_table_entry *fixup; int fault, si_code = SEGV_MAPERR; + unsigned long flags = 0; siginfo_t info; /* As of EV6, a load into $31/$f31 is a prefetch, and never faults @@ -142,10 +143,15 @@ do_page_fault(unsigned long address, unsigned long mmcsr, goto bad_area; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (cause > 0) + flags |= FAULT_FLAG_WRITE; + /* If for any reason at all we couldn't handle the fault, make sure we exit gracefully rather than endlessly redo the fault. */ - fault = handle_mm_fault(mm, vma, address, cause > 0 ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); up_read(&mm->mmap_sem); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c index aa33949..31b1e69 100644 --- a/arch/arm/mm/fault.c +++ b/arch/arm/mm/fault.c @@ -231,9 +231,10 @@ static inline bool access_error(unsigned int fsr, struct vm_area_struct *vma) static int __kprobes __do_page_fault(struct mm_struct *mm, unsigned long addr, unsigned int fsr, - struct task_struct *tsk) + struct task_struct *tsk, struct pt_regs *regs) { struct vm_area_struct *vma; + unsigned long flags = 0; int fault; vma = find_vma(mm, addr); @@ -253,11 +254,16 @@ good_area: goto out; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (fsr & FSR_WRITE) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, make * sure we exit gracefully rather than endlessly redo the fault. */ - fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, (fsr & FSR_WRITE) ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, flags); if (unlikely(fault & VM_FAULT_ERROR)) return fault; if (fault & VM_FAULT_MAJOR) @@ -320,7 +326,7 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs) #endif } - fault = __do_page_fault(mm, addr, fsr, tsk); + fault = __do_page_fault(mm, addr, fsr, tsk, regs); up_read(&mm->mmap_sem); perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr); diff --git a/arch/avr32/mm/fault.c b/arch/avr32/mm/fault.c index f7040a1..ada6237 100644 --- a/arch/avr32/mm/fault.c +++ b/arch/avr32/mm/fault.c @@ -59,6 +59,7 @@ asmlinkage void do_page_fault(unsigned long ecr, struct pt_regs *regs) struct mm_struct *mm; struct vm_area_struct *vma; const struct exception_table_entry *fixup; + unsigned long flags = 0; unsigned long address; unsigned long page; int writeaccess; @@ -127,12 +128,17 @@ good_area: panic("Unhandled case %lu in do_page_fault!", ecr); } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (writeaccess) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, make * sure we exit gracefully rather than endlessly redo the * fault. */ - fault = handle_mm_fault(mm, vma, address, writeaccess ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/cris/mm/fault.c b/arch/cris/mm/fault.c index 9dcac8e..35d096a 100644 --- a/arch/cris/mm/fault.c +++ b/arch/cris/mm/fault.c @@ -55,6 +55,7 @@ do_page_fault(unsigned long address, struct pt_regs *regs, struct task_struct *tsk; struct mm_struct *mm; struct vm_area_struct * vma; + unsigned long flags = 0; siginfo_t info; int fault; @@ -156,13 +157,18 @@ do_page_fault(unsigned long address, struct pt_regs *regs, goto bad_area; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (writeaccess & 1) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, (writeaccess & 1) ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/frv/mm/fault.c b/arch/frv/mm/fault.c index a325d57..2dbf219 100644 --- a/arch/frv/mm/fault.c +++ b/arch/frv/mm/fault.c @@ -35,6 +35,7 @@ asmlinkage void do_page_fault(int datammu, unsigned long esr0, unsigned long ear struct vm_area_struct *vma; struct mm_struct *mm; unsigned long _pme, lrai, lrad, fixup; + unsigned long flags = 0; siginfo_t info; pgd_t *pge; pud_t *pue; @@ -158,12 +159,17 @@ asmlinkage void do_page_fault(int datammu, unsigned long esr0, unsigned long ear break; } + if (user_mode(__frame)) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, ear0, write ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, ear0, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/hexagon/mm/vm_fault.c b/arch/hexagon/mm/vm_fault.c index c10b76f..e56baf3 100644 --- a/arch/hexagon/mm/vm_fault.c +++ b/arch/hexagon/mm/vm_fault.c @@ -52,6 +52,7 @@ void do_page_fault(unsigned long address, long cause, struct pt_regs *regs) siginfo_t info; int si_code = SEGV_MAPERR; int fault; + unsigned long flags = 0; const struct exception_table_entry *fixup; /* @@ -96,7 +97,12 @@ good_area: break; } - fault = handle_mm_fault(mm, vma, address, (cause > 0)); + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (cause > 0) + flags |= FAULT_FLAG_WRITE; + + fault = handle_mm_fault(mm, vma, address, flags); /* The most common case -- we are done. */ if (likely(!(fault & VM_FAULT_ERROR))) { diff --git a/arch/ia64/mm/fault.c b/arch/ia64/mm/fault.c index 20b3593..ad9ef9d 100644 --- a/arch/ia64/mm/fault.c +++ b/arch/ia64/mm/fault.c @@ -79,6 +79,7 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re int signal = SIGSEGV, code = SEGV_MAPERR; struct vm_area_struct *vma, *prev_vma; struct mm_struct *mm = current->mm; + unsigned long flags = 0; struct siginfo si; unsigned long mask; int fault; @@ -149,12 +150,17 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re if ((vma->vm_flags & mask) != mask) goto bad_area; + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (mask & VM_WRITE) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, make * sure we exit gracefully rather than endlessly redo the * fault. */ - fault = handle_mm_fault(mm, vma, address, (mask & VM_WRITE) ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { /* * We ran out of memory, or some other thing happened diff --git a/arch/m32r/mm/fault.c b/arch/m32r/mm/fault.c index 2c9aeb4..e74f6fa 100644 --- a/arch/m32r/mm/fault.c +++ b/arch/m32r/mm/fault.c @@ -79,6 +79,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long error_code, struct mm_struct *mm; struct vm_area_struct * vma; unsigned long page, addr; + unsigned long flags = 0; int write; int fault; siginfo_t info; @@ -188,6 +189,11 @@ good_area: if ((error_code & ACE_INSTRUCTION) && !(vma->vm_flags & VM_EXEC)) goto bad_area; + if (error_code & ACE_USERMODE) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo @@ -195,7 +201,7 @@ good_area: */ addr = (address & PAGE_MASK); set_thread_fault_code(error_code); - fault = handle_mm_fault(mm, vma, addr, write ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, addr, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c index 2db6099..ab88a91 100644 --- a/arch/m68k/mm/fault.c +++ b/arch/m68k/mm/fault.c @@ -73,6 +73,7 @@ int do_page_fault(struct pt_regs *regs, unsigned long address, { struct mm_struct *mm = current->mm; struct vm_area_struct * vma; + unsigned long flags = 0; int write, fault; #ifdef DEBUG @@ -134,13 +135,18 @@ good_area: goto acc_err; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, write ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); #ifdef DEBUG printk("handle_mm_fault returns %d\n",fault); #endif diff --git a/arch/microblaze/mm/fault.c b/arch/microblaze/mm/fault.c index ae97d2c..b002612 100644 --- a/arch/microblaze/mm/fault.c +++ b/arch/microblaze/mm/fault.c @@ -89,6 +89,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long address, { struct vm_area_struct *vma; struct mm_struct *mm = current->mm; + unsigned long flags = 0; siginfo_t info; int code = SEGV_MAPERR; int is_write = error_code & ESR_S; @@ -206,12 +207,17 @@ good_area: goto bad_area; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (is_write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, is_write ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c index 937cf33..e5b9fed 100644 --- a/arch/mips/mm/fault.c +++ b/arch/mips/mm/fault.c @@ -40,6 +40,7 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs, unsigned long writ struct task_struct *tsk = current; struct mm_struct *mm = tsk->mm; const int field = sizeof(unsigned long) * 2; + unsigned long flags = 0; siginfo_t info; int fault; @@ -139,12 +140,17 @@ good_area: } } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, write ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) diff --git a/arch/mn10300/mm/fault.c b/arch/mn10300/mm/fault.c index 5ac4df5..031be56 100644 --- a/arch/mn10300/mm/fault.c +++ b/arch/mn10300/mm/fault.c @@ -121,6 +121,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long fault_code, { struct vm_area_struct *vma; struct task_struct *tsk; + unsigned long flags = 0; struct mm_struct *mm; unsigned long page; siginfo_t info; @@ -247,12 +248,17 @@ good_area: break; } + if ((fault_code & MMUFCR_xFC_ACCESS) == MMUFCR_xFC_ACCESS_USR) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, write ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c index d78881c..d586119 100644 --- a/arch/openrisc/mm/fault.c +++ b/arch/openrisc/mm/fault.c @@ -52,6 +52,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address, struct task_struct *tsk; struct mm_struct *mm; struct vm_area_struct *vma; + unsigned long flags = 0; siginfo_t info; int fault; @@ -153,13 +154,18 @@ good_area: if ((vector == 0x400) && !(vma->vm_page_prot.pgprot & _PAGE_EXEC)) goto bad_area; + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (write_acc) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. 
*/ - fault = handle_mm_fault(mm, vma, address, write_acc); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/parisc/mm/fault.c b/arch/parisc/mm/fault.c index 18162ce..a151e87 100644 --- a/arch/parisc/mm/fault.c +++ b/arch/parisc/mm/fault.c @@ -173,6 +173,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long code, struct vm_area_struct *vma, *prev_vma; struct task_struct *tsk = current; struct mm_struct *mm = tsk->mm; + unsigned long flags = 0; unsigned long acc_type; int fault; @@ -195,13 +196,18 @@ good_area: if ((vma->vm_flags & acc_type) != acc_type) goto bad_area; + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (acc_type & VM_WRITE) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, make * sure we exit gracefully rather than endlessly redo the * fault. */ - fault = handle_mm_fault(mm, vma, address, (acc_type & VM_WRITE) ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { /* * We hit a shared mapping outside of the file, or some diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c index 5efe8c9..2bf339c 100644 --- a/arch/powerpc/mm/fault.c +++ b/arch/powerpc/mm/fault.c @@ -122,6 +122,7 @@ int __kprobes do_page_fault(struct pt_regs *regs, unsigned long address, { struct vm_area_struct * vma; struct mm_struct *mm = current->mm; + unsigned long flags = 0; siginfo_t info; int code = SEGV_MAPERR; int is_write = 0, ret; @@ -305,12 +306,17 @@ good_area: goto bad_area; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (is_write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - ret = handle_mm_fault(mm, vma, address, is_write ? 
FAULT_FLAG_WRITE : 0); + ret = handle_mm_fault(mm, vma, address, flags); if (unlikely(ret & VM_FAULT_ERROR)) { if (ret & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c index a9a3018..fe6109c 100644 --- a/arch/s390/mm/fault.c +++ b/arch/s390/mm/fault.c @@ -301,6 +301,8 @@ static inline int do_exception(struct pt_regs *regs, int access, address = trans_exc_code & __FAIL_ADDR_MASK; perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); flags = FAULT_FLAG_ALLOW_RETRY; + if (regs->psw.mask & PSW_MASK_PSTATE) + flags |= FAULT_FLAG_USER; if (access == VM_WRITE || (trans_exc_code & store_indication) == 0x400) flags |= FAULT_FLAG_WRITE; down_read(&mm->mmap_sem); diff --git a/arch/score/mm/fault.c b/arch/score/mm/fault.c index 6b18fb0..2ca5ae5 100644 --- a/arch/score/mm/fault.c +++ b/arch/score/mm/fault.c @@ -47,6 +47,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long write, struct task_struct *tsk = current; struct mm_struct *mm = tsk->mm; const int field = sizeof(unsigned long) * 2; + unsigned long flags = 0; siginfo_t info; int fault; @@ -101,12 +102,16 @@ good_area: } survive: + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. 
*/ - fault = handle_mm_fault(mm, vma, address, write); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/sh/mm/fault_32.c b/arch/sh/mm/fault_32.c index 7bebd04..a61b803 100644 --- a/arch/sh/mm/fault_32.c +++ b/arch/sh/mm/fault_32.c @@ -126,6 +126,7 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs, struct task_struct *tsk; struct mm_struct *mm; struct vm_area_struct * vma; + unsigned long flags = 0; int si_code; int fault; siginfo_t info; @@ -195,12 +196,17 @@ good_area: goto bad_area; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (writeaccess) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, writeaccess ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/sh/mm/tlbflush_64.c b/arch/sh/mm/tlbflush_64.c index e3430e0..0a9d645 100644 --- a/arch/sh/mm/tlbflush_64.c +++ b/arch/sh/mm/tlbflush_64.c @@ -96,6 +96,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long writeaccess, struct mm_struct *mm; struct vm_area_struct * vma; const struct exception_table_entry *fixup; + unsigned long flags = 0; pte_t *pte; int fault; @@ -184,12 +185,17 @@ good_area: } } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (writeaccess) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, writeaccess ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/sparc/mm/fault_32.c b/arch/sparc/mm/fault_32.c index 8023fd7..efa3d48 100644 --- a/arch/sparc/mm/fault_32.c +++ b/arch/sparc/mm/fault_32.c @@ -222,6 +222,7 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write, struct vm_area_struct *vma; struct task_struct *tsk = current; struct mm_struct *mm = tsk->mm; + unsigned long flags = 0; unsigned int fixup; unsigned long g2; int from_user = !(regs->psr & PSR_PS); @@ -285,12 +286,17 @@ good_area: goto bad_area; } + if (from_user) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, write ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c index 504c062..bc536ea 100644 --- a/arch/sparc/mm/fault_64.c +++ b/arch/sparc/mm/fault_64.c @@ -276,6 +276,7 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs) { struct mm_struct *mm = current->mm; struct vm_area_struct *vma; + unsigned long flags = 0; unsigned int insn = 0; int si_code, fault_code, fault; unsigned long address, mm_rss; @@ -423,7 +424,12 @@ good_area: goto bad_area; } - fault = handle_mm_fault(mm, vma, address, (fault_code & FAULT_CODE_WRITE) ? 
FAULT_FLAG_WRITE : 0); + if (!(regs->tstate & TSTATE_PRIV)) + flags |= FAULT_FLAG_USER; + if (fault_code & FAULT_CODE_WRITE) + flags |= FAULT_FLAG_WRITE; + + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/tile/mm/fault.c b/arch/tile/mm/fault.c index 3312531..b2a7fd5 100644 --- a/arch/tile/mm/fault.c +++ b/arch/tile/mm/fault.c @@ -263,6 +263,7 @@ static int handle_page_fault(struct pt_regs *regs, struct mm_struct *mm; struct vm_area_struct *vma; unsigned long stack_offset; + unsigned long flags = 0; int fault; int si_code; int is_kernel_mode; @@ -415,12 +416,16 @@ good_area: } survive: + if (!is_kernel_mode) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, write); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c index dafc947..626a85e 100644 --- a/arch/um/kernel/trap.c +++ b/arch/um/kernel/trap.c @@ -25,6 +25,7 @@ int handle_page_fault(unsigned long address, unsigned long ip, { struct mm_struct *mm = current->mm; struct vm_area_struct *vma; + unsigned long flags = 0; pgd_t *pgd; pud_t *pud; pmd_t *pmd; @@ -62,10 +63,15 @@ good_area: if (!is_write && !(vma->vm_flags & (VM_READ | VM_EXEC))) goto out; + if (is_user) + flags |= FAULT_FLAG_USER; + if (is_write) + flags |= FAULT_FLAG_WRITE; + do { int fault; - fault = handle_mm_fault(mm, vma, address, is_write ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) { goto out_of_memory; diff --git a/arch/unicore32/mm/fault.c b/arch/unicore32/mm/fault.c index 283aa4b..3026943 100644 --- a/arch/unicore32/mm/fault.c +++ b/arch/unicore32/mm/fault.c @@ -169,9 +169,10 @@ static inline bool access_error(unsigned int fsr, struct vm_area_struct *vma) } static int __do_pf(struct mm_struct *mm, unsigned long addr, unsigned int fsr, - struct task_struct *tsk) + struct task_struct *tsk, struct pt_regs *regs) { struct vm_area_struct *vma; + unsigned long flags = 0; int fault; vma = find_vma(mm, addr); @@ -191,12 +192,16 @@ good_area: goto out; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (!(fsr ^ 0x12)) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, make * sure we exit gracefully rather than endlessly redo the fault. */ - fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, - (!(fsr ^ 0x12)) ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, flags); if (unlikely(fault & VM_FAULT_ERROR)) return fault; if (fault & VM_FAULT_MAJOR) @@ -252,7 +257,7 @@ static int do_pf(unsigned long addr, unsigned int fsr, struct pt_regs *regs) #endif } - fault = __do_pf(mm, addr, fsr, tsk); + fault = __do_pf(mm, addr, fsr, tsk, regs); up_read(&mm->mmap_sem); /* diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 5db0490..1cebabe 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -999,8 +999,7 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code) struct mm_struct *mm; int fault; int write = error_code & PF_WRITE; - unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE | - (write ? 
FAULT_FLAG_WRITE : 0); + unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE; tsk = current; mm = tsk->mm; @@ -1160,6 +1159,11 @@ good_area: return; } + if (error_code & PF_USER) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo diff --git a/arch/xtensa/mm/fault.c b/arch/xtensa/mm/fault.c index e367e30..7db9fbe 100644 --- a/arch/xtensa/mm/fault.c +++ b/arch/xtensa/mm/fault.c @@ -41,6 +41,7 @@ void do_page_fault(struct pt_regs *regs) struct mm_struct *mm = current->mm; unsigned int exccause = regs->exccause; unsigned int address = regs->excvaddr; + unsigned long flags = 0; siginfo_t info; int is_write, is_exec; @@ -101,11 +102,16 @@ good_area: if (!(vma->vm_flags & (VM_READ | VM_WRITE))) goto bad_area; + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (is_write) + flags |= FAULT_FLAG_WRITE; + /* If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, is_write ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/include/linux/mm.h b/include/linux/mm.h index 4baadd1..846b82b 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -156,6 +156,7 @@ extern pgprot_t protection_map[16]; #define FAULT_FLAG_ALLOW_RETRY 0x08 /* Retry fault if blocking */ #define FAULT_FLAG_RETRY_NOWAIT 0x10 /* Don't drop mmap_sem and wait when retrying */ #define FAULT_FLAG_KILLABLE 0x20 /* The fault task is in SIGKILL killable region */ +#define FAULT_FLAG_USER 0x40 /* The fault originated in userspace */ /* * This interface is used by x86 PAT code to identify a pfn mapping that is -- 1.8.3.2 From mboxrd@z Thu Jan 1 00:00:00 1970 From: Johannes Weiner Subject: [patch 3/5] x86: finish fault error path with fatal signal Date: Fri, 19 Jul 2013 00:25:02 -0400 Message-ID: <20130719042502.GF17812@cmpxchg.org> References: <20130710182506.F25DF461@pobox.sk> <20130711072507.GA21667@dhcp22.suse.cz> <20130714012641.C2DA4E05@pobox.sk> <20130714015112.FFCB7AF7@pobox.sk> <20130715154119.GA32435@dhcp22.suse.cz> <20130715160006.GB32435@dhcp22.suse.cz> <20130716153544.GX17812@cmpxchg.org> <20130716160905.GA20018@dhcp22.suse.cz> <20130716164830.GZ17812@cmpxchg.org> <20130719042124.GC17812@cmpxchg.org> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20130719042124.GC17812@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Michal Hocko Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , righi.andrea@gmail.com The x86 fault handler bails in the middle of error handling when the task has been killed. For the next patch this is a problem, because it relies on pagefault_out_of_memory() being called even when the task has been killed, to perform proper OOM state unwinding. 
This is a rather minor optimization, just remove it.

Signed-off-by: Johannes Weiner
---
 arch/x86/mm/fault.c | 11 -----------
 1 file changed, 11 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 1cebabe..90248c9 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -846,17 +846,6 @@ static noinline int
 mm_fault_error(struct pt_regs *regs, unsigned long error_code,
 	       unsigned long address, unsigned int fault)
 {
-	/*
-	 * Pagefault was interrupted by SIGKILL. We have no reason to
-	 * continue pagefault.
-	 */
-	if (fatal_signal_pending(current)) {
-		if (!(fault & VM_FAULT_RETRY))
-			up_read(&current->mm->mmap_sem);
-		if (!(error_code & PF_USER))
-			no_context(regs, error_code, address);
-		return 1;
-	}
 	if (!(fault & VM_FAULT_ERROR))
 		return 0;

--
1.8.3.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Johannes Weiner
Subject: [patch 4/5] memcg: do not trap chargers with full callstack on OOM
Date: Fri, 19 Jul 2013 00:25:47 -0400
Message-ID: <20130719042547.GG17812@cmpxchg.org>
References: <20130710182506.F25DF461@pobox.sk> <20130711072507.GA21667@dhcp22.suse.cz> <20130714012641.C2DA4E05@pobox.sk> <20130714015112.FFCB7AF7@pobox.sk> <20130715154119.GA32435@dhcp22.suse.cz> <20130715160006.GB32435@dhcp22.suse.cz> <20130716153544.GX17812@cmpxchg.org> <20130716160905.GA20018@dhcp22.suse.cz> <20130716164830.GZ17812@cmpxchg.org> <20130719042124.GC17812@cmpxchg.org>
Mime-Version: 1.0
Return-path:
Content-Disposition: inline
In-Reply-To: <20130719042124.GC17812-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
List-ID:
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: Michal Hocko
Cc: azurIt , linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org,
cgroups mailinglist , KAMEZAWA Hiroyuki , righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org The memcg OOM handling is incredibly fragile and can deadlock. When a task fails to charge memory, it invokes the OOM killer and loops right there in the charge code until it succeeds. Comparably, any other task that enters the charge path at this point will go to a waitqueue right then and there and sleep until the OOM situation is resolved. The problem is that these tasks may hold filesystem locks and the mmap_sem; locks that the selected OOM victim may need to exit. For example, in one reported case, the task invoking the OOM killer was about to charge a page cache page during a write(), which holds the i_mutex. The OOM killer selected a task that was just entering truncate() and trying to acquire the i_mutex: OOM invoking task: [] mem_cgroup_handle_oom+0x241/0x3b0 [] T.1146+0x5ab/0x5c0 [] mem_cgroup_cache_charge+0xbe/0xe0 [] add_to_page_cache_locked+0x4c/0x140 [] add_to_page_cache_lru+0x22/0x50 [] grab_cache_page_write_begin+0x8b/0xe0 [] ext3_write_begin+0x88/0x270 [] generic_file_buffered_write+0x116/0x290 [] __generic_file_aio_write+0x27c/0x480 [] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex [] do_sync_write+0xea/0x130 [] vfs_write+0xf3/0x1f0 [] sys_write+0x51/0x90 [] system_call_fastpath+0x18/0x1d [] 0xffffffffffffffff OOM kill victim: [] do_truncate+0x58/0xa0 # takes i_mutex [] do_last+0x250/0xa30 [] path_openat+0xd7/0x440 [] do_filp_open+0x49/0xa0 [] do_sys_open+0x106/0x240 [] sys_open+0x20/0x30 [] system_call_fastpath+0x18/0x1d [] 0xffffffffffffffff The OOM handling task will retry the charge indefinitely while the OOM killed task is not releasing any resources. A similar scenario can happen when the kernel OOM killer for a memcg is disabled and a userspace task is in charge of resolving OOM situations. 
In this case, ALL tasks that enter the OOM path will be made to sleep
on the OOM waitqueue and wait for userspace to free resources or
increase the group's limit.  But a userspace OOM handler is prone to
deadlock itself on the locks held by the waiting tasks.  For example,
one of the sleeping tasks may be stuck in a brk() call with the
mmap_sem held for writing, but the userspace handler, in order to pick
an optimal victim, may need to read files from /proc/, which tries to
acquire the same mmap_sem for reading and deadlocks.

This patch changes the way tasks behave after detecting a memcg OOM
and makes sure nobody loops or sleeps with locks held:

0. When OOMing in a system call (buffered IO and friends), do not
   invoke the OOM killer, do not sleep on an OOM waitqueue, just return
   -ENOMEM.  Userspace should be able to handle this and it prevents
   anybody from looping or waiting with locks held.

1. When OOMing in a kernel fault, do not invoke the OOM killer, do not
   sleep on the OOM waitqueue, just return -ENOMEM.  The kernel fault
   stack knows how to handle this.  If a kernel fault is nested inside
   a user fault, however, user fault handling applies:

2. When OOMing in a user fault, invoke the OOM killer and restart the
   fault instead of looping on the charge attempt.  This way, the OOM
   victim can not get stuck on locks the looping task may hold.

3. When OOMing in a user fault but somebody else is handling it
   (either the kernel OOM killer or a userspace handler), don't go to
   sleep in the charge context.  Instead, remember the OOMing memcg in
   the task struct and then fully unwind the page fault stack with
   -ENOMEM.  pagefault_out_of_memory() will then call back into the
   memcg code to check if the -ENOMEM came from the memcg, and then
   either put the task to sleep on the memcg's OOM waitqueue or just
   restart the fault.  The OOM victim can no longer get stuck on any
   lock a sleeping task may hold.
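
The four cases above can be read as a small decision table.  What
follows is only an illustrative userspace sketch, not code from the
patch: the enum and function names are hypothetical, and the patch
itself encodes the same decision through current->memcg_oom.may_oom,
mem_cgroup_oom() and mem_cgroup_oom_synchronize().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for illustration; none of these exist in the kernel. */
enum oom_context {
	CTX_SYSCALL,		/* charge from a system call, e.g. buffered write() */
	CTX_KERNEL_FAULT,	/* kernel fault that is not nested in a user fault */
	CTX_USER_FAULT,		/* fault that originated in userspace */
};

enum oom_action {
	ACT_RETURN_ENOMEM,	/* cases 0 and 1: no kill, no sleep, fail the charge */
	ACT_KILL_AND_RESTART,	/* case 2: invoke the OOM killer, restart the fault */
	ACT_UNWIND_THEN_WAIT,	/* case 3: unwind with -ENOMEM, then sleep lock-free */
};

/*
 * Pick the action for a failed memcg charge.
 *
 * handled_elsewhere models "somebody else is handling it": another task
 * already holds the OOM lock, or the kernel OOM killer is disabled for
 * the group and a userspace handler is responsible.
 */
enum oom_action memcg_oom_action(enum oom_context ctx, bool handled_elsewhere)
{
	assert(ctx == CTX_SYSCALL || ctx == CTX_KERNEL_FAULT ||
	       ctx == CTX_USER_FAULT);

	/* Only user faults may trigger OOM handling at all. */
	if (ctx != CTX_USER_FAULT)
		return ACT_RETURN_ENOMEM;

	return handled_elsewhere ? ACT_UNWIND_THEN_WAIT : ACT_KILL_AND_RESTART;
}
```

In every branch the charge path itself returns without sleeping; any
waiting happens only after the fault stack has been unwound, which is
the property the list above is after.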
While reworking the OOM routine, also remove a needless OOM waitqueue
wakeup when invoking the killer.  In addition to the wakeup implied in
the kill signal delivery, only uncharges and limit increases, things
that actually change the memory situation, should poke the waitqueue.

Reported-by: azurIt
Debugged-by: Michal Hocko
Signed-off-by: Johannes Weiner
---
 include/linux/memcontrol.h |  22 +++++++
 include/linux/sched.h      |   6 ++
 mm/filemap.c               |  14 ++++-
 mm/ksm.c                   |   2 +-
 mm/memcontrol.c            | 139 +++++++++++++++++++++++++++++----------------
 mm/memory.c                |  37 ++++++++----
 mm/oom_kill.c              |   2 +
 7 files changed, 159 insertions(+), 63 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b87068a..b92e5e7 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -120,6 +120,17 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page);
 extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
 					struct task_struct *p);

+static inline unsigned int mem_cgroup_xchg_may_oom(struct task_struct *p,
+						   unsigned int new)
+{
+	unsigned int old;
+
+	old = p->memcg_oom.may_oom;
+	p->memcg_oom.may_oom = new;
+
+	return old;
+}
+bool mem_cgroup_oom_synchronize(void);
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
 extern int do_swap_account;
 #endif
@@ -333,6 +344,17 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
 {
 }

+static inline unsigned int mem_cgroup_xchg_may_oom(struct task_struct *p,
+						   unsigned int new)
+{
+	return 0;
+}
+
+static inline bool mem_cgroup_oom_synchronize(void)
+{
+	return false;
+}
+
 static inline void mem_cgroup_inc_page_stat(struct page *page,
 					    enum mem_cgroup_page_stat_item idx)
 {
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1c4f3e9..7e6c9e9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1568,6 +1568,12 @@ struct task_struct {
 		unsigned long nr_pages;	/* uncharged usage */
 		unsigned long memsw_nr_pages; /* uncharged mem+swap usage */
 	}
memcg_batch; + struct memcg_oom_info { + unsigned int may_oom:1; + unsigned int in_memcg_oom:1; + int wakeups; + struct mem_cgroup *wait_on_memcg; + } memcg_oom; #endif #ifdef CONFIG_HAVE_HW_BREAKPOINT atomic_t ptrace_bp_refcnt; diff --git a/mm/filemap.c b/mm/filemap.c index 5f0a3c9..d18bd47 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1660,6 +1660,7 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf) struct file_ra_state *ra = &file->f_ra; struct inode *inode = mapping->host; pgoff_t offset = vmf->pgoff; + unsigned int may_oom; struct page *page; pgoff_t size; int ret = 0; @@ -1669,7 +1670,12 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf) return VM_FAULT_SIGBUS; /* - * Do we have something in the page cache already? + * Do we have something in the page cache already? Either + * way, try readahead, but disable the memcg OOM killer for it + * as readahead is optional and no errors are propagated up + * the fault stack, which does not allow proper unwinding of a + * memcg OOM state. The OOM killer is enabled while trying to + * instantiate the faulting page individually below. */ page = find_get_page(mapping, offset); if (likely(page)) { @@ -1677,10 +1683,14 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf) * We found the page, so try async readahead before * waiting for the lock. */ + may_oom = mem_cgroup_xchg_may_oom(current, 0); do_async_mmap_readahead(vma, ra, file, page, offset); + mem_cgroup_xchg_may_oom(current, may_oom); } else { - /* No page in the page cache at all */ + /* No page in the page cache at all. 
*/ + may_oom = mem_cgroup_xchg_may_oom(current, 0); do_sync_mmap_readahead(vma, ra, file, offset); + mem_cgroup_xchg_may_oom(current, may_oom); count_vm_event(PGMAJFAULT); mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT); ret = VM_FAULT_MAJOR; diff --git a/mm/ksm.c b/mm/ksm.c index 310544a..ae7e4ae 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -338,7 +338,7 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr) break; if (PageKsm(page)) ret = handle_mm_fault(vma->vm_mm, vma, addr, - FAULT_FLAG_WRITE); + FAULT_FLAG_WRITE); else ret = VM_FAULT_WRITE; put_page(page); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index b63f5f7..99b0101 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -249,6 +249,7 @@ struct mem_cgroup { bool oom_lock; atomic_t under_oom; + atomic_t oom_wakeups; atomic_t refcnt; @@ -1846,6 +1847,7 @@ static int memcg_oom_wake_function(wait_queue_t *wait, static void memcg_wakeup_oom(struct mem_cgroup *memcg) { + atomic_inc(&memcg->oom_wakeups); /* for filtering, pass "memcg" as argument. */ __wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg); } @@ -1857,30 +1859,20 @@ static void memcg_oom_recover(struct mem_cgroup *memcg) } /* - * try to call OOM killer. returns false if we should exit memory-reclaim loop. 
+ * try to call OOM killer */ -bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask) +static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask) { - struct oom_wait_info owait; - bool locked, need_to_kill; + bool locked, need_to_kill = true; - owait.mem = memcg; - owait.wait.flags = 0; - owait.wait.func = memcg_oom_wake_function; - owait.wait.private = current; - INIT_LIST_HEAD(&owait.wait.task_list); - need_to_kill = true; - mem_cgroup_mark_under_oom(memcg); + if (!current->memcg_oom.may_oom) + return; + + current->memcg_oom.in_memcg_oom = 1; /* At first, try to OOM lock hierarchy under memcg.*/ spin_lock(&memcg_oom_lock); locked = mem_cgroup_oom_lock(memcg); - /* - * Even if signal_pending(), we can't quit charge() loop without - * accounting. So, UNINTERRUPTIBLE is appropriate. But SIGKILL - * under OOM is always welcomed, use TASK_KILLABLE here. - */ - prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE); if (!locked || memcg->oom_kill_disable) need_to_kill = false; if (locked) @@ -1888,24 +1880,86 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask) spin_unlock(&memcg_oom_lock); if (need_to_kill) { - finish_wait(&memcg_oom_waitq, &owait.wait); mem_cgroup_out_of_memory(memcg, mask); } else { - schedule(); - finish_wait(&memcg_oom_waitq, &owait.wait); + /* + * A system call can just return -ENOMEM, but if this + * is a page fault and somebody else is handling the + * OOM already, we need to sleep on the OOM waitqueue + * for this memcg until the situation is resolved. + * Which can take some time because it might be + * handled by a userspace task. + * + * However, this is the charge context, which means + * that we may sit on a large call stack and hold + * various filesystem locks, the mmap_sem etc. and we + * don't want the OOM handler to deadlock on them + * while we sit here and wait. Store the current OOM + * context in the task_struct, then return -ENOMEM. 
+ * At the end of the page fault handler, with the + * stack unwound, pagefault_out_of_memory() will check + * back with us by calling + * mem_cgroup_oom_synchronize(), possibly putting the + * task to sleep. + */ + mem_cgroup_mark_under_oom(memcg); + current->memcg_oom.wakeups = atomic_read(&memcg->oom_wakeups); + css_get(&memcg->css); + current->memcg_oom.wait_on_memcg = memcg; } - spin_lock(&memcg_oom_lock); - if (locked) + + if (locked) { + spin_lock(&memcg_oom_lock); mem_cgroup_oom_unlock(memcg); - memcg_wakeup_oom(memcg); - spin_unlock(&memcg_oom_lock); + /* + * Sleeping tasks might have been killed, make sure + * they get scheduled so they can exit. + */ + if (need_to_kill) + memcg_oom_recover(memcg); + spin_unlock(&memcg_oom_lock); + } +} - mem_cgroup_unmark_under_oom(memcg); +bool mem_cgroup_oom_synchronize(void) +{ + struct oom_wait_info owait; + struct mem_cgroup *memcg; - if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current)) + /* OOM is global, do not handle */ + if (!current->memcg_oom.in_memcg_oom) return false; - /* Give chance to dying process */ - schedule_timeout_uninterruptible(1); + + /* + * We invoked the OOM killer but there is a chance that a kill + * did not free up any charges. Everybody else might already + * be sleeping, so restart the fault and keep the rampage + * going until some charges are released. 
+ */ + memcg = current->memcg_oom.wait_on_memcg; + if (!memcg) + goto out; + + if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current)) + goto out_put; + + owait.mem = memcg; + owait.wait.flags = 0; + owait.wait.func = memcg_oom_wake_function; + owait.wait.private = current; + INIT_LIST_HEAD(&owait.wait.task_list); + + prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE); + /* Only sleep if we didn't miss any wakeups since OOM */ + if (atomic_read(&memcg->oom_wakeups) == current->memcg_oom.wakeups) + schedule(); + finish_wait(&memcg_oom_waitq, &owait.wait); +out_put: + mem_cgroup_unmark_under_oom(memcg); + css_put(&memcg->css); + current->memcg_oom.wait_on_memcg = NULL; +out: + current->memcg_oom.in_memcg_oom = 0; return true; } @@ -2195,11 +2249,10 @@ enum { CHARGE_RETRY, /* need to retry but retry is not bad */ CHARGE_NOMEM, /* we can't do more. return -ENOMEM */ CHARGE_WOULDBLOCK, /* GFP_WAIT wasn't set and no enough res. */ - CHARGE_OOM_DIE, /* the current is killed because of OOM */ }; static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, - unsigned int nr_pages, bool oom_check) + unsigned int nr_pages, bool invoke_oom) { unsigned long csize = nr_pages * PAGE_SIZE; struct mem_cgroup *mem_over_limit; @@ -2257,14 +2310,10 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, if (mem_cgroup_wait_acct_move(mem_over_limit)) return CHARGE_RETRY; - /* If we don't need to call oom-killer at el, return immediately */ - if (!oom_check) - return CHARGE_NOMEM; - /* check OOM */ - if (!mem_cgroup_handle_oom(mem_over_limit, gfp_mask)) - return CHARGE_OOM_DIE; + if (invoke_oom) + mem_cgroup_oom(mem_over_limit, gfp_mask); - return CHARGE_RETRY; + return CHARGE_NOMEM; } /* @@ -2349,7 +2398,7 @@ again: } do { - bool oom_check; + bool invoke_oom = oom && !nr_oom_retries; /* If killed, bypass charge */ if (fatal_signal_pending(current)) { @@ -2357,13 +2406,7 @@ again: goto bypass; } - oom_check = false; - if (oom && 
!nr_oom_retries) { - oom_check = true; - nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES; - } - - ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, oom_check); + ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, invoke_oom); switch (ret) { case CHARGE_OK: break; @@ -2376,16 +2419,12 @@ again: css_put(&memcg->css); goto nomem; case CHARGE_NOMEM: /* OOM routine works */ - if (!oom) { + if (!oom || invoke_oom) { css_put(&memcg->css); goto nomem; } - /* If oom, we never return -ENOMEM */ nr_oom_retries--; break; - case CHARGE_OOM_DIE: /* Killed by OOM Killer */ - css_put(&memcg->css); - goto bypass; } } while (ret != CHARGE_OK); diff --git a/mm/memory.c b/mm/memory.c index 829d437..2be02b7 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3439,22 +3439,14 @@ unlock: /* * By the time we get here, we already hold the mm semaphore */ -int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, - unsigned long address, unsigned int flags) +static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, unsigned int flags) { pgd_t *pgd; pud_t *pud; pmd_t *pmd; pte_t *pte; - __set_current_state(TASK_RUNNING); - - count_vm_event(PGFAULT); - mem_cgroup_count_vm_event(mm, PGFAULT); - - /* do counter updates before entering really critical section. */ - check_sync_rss_stat(current); - if (unlikely(is_vm_hugetlb_page(vma))) return hugetlb_fault(mm, vma, address, flags); @@ -3503,6 +3495,31 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, return handle_pte_fault(mm, vma, address, pte, pmd, flags); } +int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, unsigned int flags) +{ + int userfault = flags & FAULT_FLAG_USER; + int ret; + + __set_current_state(TASK_RUNNING); + + count_vm_event(PGFAULT); + mem_cgroup_count_vm_event(mm, PGFAULT); + + /* do counter updates before entering really critical section. 
*/ + check_sync_rss_stat(current); + + if (userfault) + WARN_ON(mem_cgroup_xchg_may_oom(current, 1) == 1); + + ret = __handle_mm_fault(mm, vma, address, flags); + + if (userfault) + WARN_ON(mem_cgroup_xchg_may_oom(current, 0) == 0); + + return ret; +} + #ifndef __PAGETABLE_PUD_FOLDED /* * Allocate page upper directory. diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 069b64e..aa60863 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -785,6 +785,8 @@ out: */ void pagefault_out_of_memory(void) { + if (mem_cgroup_oom_synchronize()) + return; if (try_set_system_oom()) { out_of_memory(NULL, 0, 0, NULL); clear_system_oom(); -- 1.8.3.2 From mboxrd@z Thu Jan 1 00:00:00 1970 From: Johannes Weiner Subject: [patch 5/5] mm: memcontrol: sanity check memcg OOM context unwind Date: Fri, 19 Jul 2013 00:26:23 -0400 Message-ID: <20130719042623.GH17812@cmpxchg.org> References: <20130710182506.F25DF461@pobox.sk> <20130711072507.GA21667@dhcp22.suse.cz> <20130714012641.C2DA4E05@pobox.sk> <20130714015112.FFCB7AF7@pobox.sk> <20130715154119.GA32435@dhcp22.suse.cz> <20130715160006.GB32435@dhcp22.suse.cz> <20130716153544.GX17812@cmpxchg.org> <20130716160905.GA20018@dhcp22.suse.cz> <20130716164830.GZ17812@cmpxchg.org> <20130719042124.GC17812@cmpxchg.org> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20130719042124.GC17812-druUgvl0LCNAfugRpC6u6w@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Michal Hocko Cc: azurIt , linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist , KAMEZAWA Hiroyuki , righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org Catch the cases where a memcg OOM context is set up in the failed charge path but the fault handler is not actually returning VM_FAULT_ERROR, which would be required to properly finalize the OOM. 
Example output: the first trace shows the stack at the end of handle_mm_fault() where an unexpected memcg OOM context is detected. The subsequent trace is of whoever set up that OOM context. In this case it was the charging of readahead pages in a file fault, which does not propagate VM_FAULT_OOM on failure and should disable OOM: [ 27.805359] WARNING: at /home/hannes/src/linux/linux/mm/memory.c:3523 handle_mm_fault+0x1fb/0x3f0() [ 27.805360] Hardware name: PowerEdge 1950 [ 27.805361] Fixing unhandled memcg OOM context, set up from: [ 27.805362] Pid: 1599, comm: file Tainted: G W 3.2.0-00005-g6d10010 #97 [ 27.805363] Call Trace: [ 27.805365] [] warn_slowpath_common+0x6a/0xa0 [ 27.805367] [] warn_slowpath_fmt+0x41/0x50 [ 27.805369] [] handle_mm_fault+0x1fb/0x3f0 [ 27.805371] [] do_page_fault+0x140/0x4a0 [ 27.805373] [] ? do_mmap_pgoff+0x34b/0x360 [ 27.805376] [] page_fault+0x1f/0x30 [ 27.805377] ---[ end trace 305ec584fba81649 ]--- [ 27.805378] [] __mem_cgroup_try_charge+0x5c8/0x7e0 [ 27.805380] [] mem_cgroup_cache_charge+0xac/0x110 [ 27.805381] [] add_to_page_cache_locked+0x3e/0x120 [ 27.805383] [] add_to_page_cache_lru+0x15/0x40 [ 27.805385] [] mpage_readpages+0xc3/0x150 [ 27.805387] [] ext4_readpages+0x18/0x20 [ 27.805388] [] __do_page_cache_readahead+0x1c1/0x270 [ 27.805390] [] ra_submit+0x1c/0x20 [ 27.805392] [] filemap_fault+0x3f4/0x450 [ 27.805394] [] __do_fault+0x6d/0x510 [ 27.805395] [] handle_pte_fault+0x8a/0x920 [ 27.805397] [] handle_mm_fault+0x19c/0x3f0 [ 27.805398] [] do_page_fault+0x140/0x4a0 [ 27.805400] [] page_fault+0x1f/0x30 [ 27.805401] [] 0xffffffffffffffff Debug patch only. 
Not-signed-off-by: Johannes Weiner
---
 include/linux/sched.h | 3 +++
 mm/memcontrol.c       | 7 +++++++
 mm/memory.c           | 9 +++++++++
 3 files changed, 19 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7e6c9e9..a77d198 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -91,6 +91,7 @@ struct sched_param {
 #include
 #include
 #include
+#include

 #include

@@ -1571,6 +1572,8 @@ struct task_struct {
 	struct memcg_oom_info {
 		unsigned int may_oom:1;
 		unsigned int in_memcg_oom:1;
+		struct stack_trace trace;
+		unsigned long trace_entries[16];
 		int wakeups;
 		struct mem_cgroup *wait_on_memcg;
 	} memcg_oom;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 99b0101..c47c77e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -49,6 +49,7 @@
 #include
 #include
 #include
+#include
 #include "internal.h"

 #include
@@ -1870,6 +1871,12 @@ static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask)

 	current->memcg_oom.in_memcg_oom = 1;

+	current->memcg_oom.trace.nr_entries = 0;
+	current->memcg_oom.trace.max_entries = 16;
+	current->memcg_oom.trace.entries = current->memcg_oom.trace_entries;
+	current->memcg_oom.trace.skip = 1;
+	save_stack_trace(&current->memcg_oom.trace);
+
 	/* At first, try to OOM lock hierarchy under memcg.*/
 	spin_lock(&memcg_oom_lock);
 	locked = mem_cgroup_oom_lock(memcg);
diff --git a/mm/memory.c b/mm/memory.c
index 2be02b7..fc6d741 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -57,6 +57,7 @@
 #include
 #include
 #include
+#include

 #include
 #include
@@ -3517,6 +3518,14 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (userfault)
 		WARN_ON(mem_cgroup_xchg_may_oom(current, 0) == 0);

+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+	if (WARN(!(ret & VM_FAULT_OOM) && current->memcg_oom.in_memcg_oom,
+		 "Fixing unhandled memcg OOM context, set up from:\n")) {
+		print_stack_trace(&current->memcg_oom.trace, 0);
+		mem_cgroup_oom_synchronize();
+	}
+#endif
+
 	return ret;
 }

--
1.8.3.2

From mboxrd@z Thu Jan 1 00:00:00 1970
From: "azurIt"
Subject:
=?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=5D_memcg=3A_do_not_trap_chargers_with_full_callstack_on_OOM?= Date: Fri, 19 Jul 2013 10:23:39 +0200 Message-ID: <20130719102339.34DF73E5@pobox.sk> References: <20130709135450.GI20281@dhcp22.suse.cz>, <20130710182506.F25DF461@pobox.sk>, <20130711072507.GA21667@dhcp22.suse.cz>, <20130714012641.C2DA4E05@pobox.sk>, <20130714015112.FFCB7AF7@pobox.sk>, <20130715154119.GA32435@dhcp22.suse.cz>, <20130715160006.GB32435@dhcp22.suse.cz>, <20130716153544.GX17812@cmpxchg.org>, <20130716160905.GA20018@dhcp22.suse.cz>, <20130716164830.GZ17812@cmpxchg.org> <20130719042124.GC17812@cmpxchg.org> Mime-Version: 1.0 Content-Transfer-Encoding: 8bit Return-path: In-Reply-To: <20130719042124.GC17812-druUgvl0LCNAfugRpC6u6w@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" To: =?utf-8?q?Johannes_Weiner?= , =?utf-8?q?Michal_Hocko?= Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org > CC: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org >On Tue, Jul 16, 2013 at 12:48:30PM -0400, Johannes Weiner wrote: >> On Tue, Jul 16, 2013 at 06:09:05PM +0200, Michal Hocko wrote: >> > On Tue 16-07-13 11:35:44, Johannes Weiner wrote: >> > > On Mon, Jul 15, 2013 at 06:00:06PM +0200, Michal Hocko wrote: >> > > > On Mon 15-07-13 17:41:19, Michal Hocko wrote: >> > > > > On Sun 14-07-13 01:51:12, azurIt wrote: >> > > > > > > CC: "Johannes Weiner" , linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org >> > > > > > >> CC: 
"Johannes Weiner" , linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org >> > > > > > >>On Wed 10-07-13 18:25:06, azurIt wrote: >> > > > > > >>> >> Now i realized that i forgot to remove UID from that cgroup before >> > > > > > >>> >> trying to remove it, so cgroup cannot be removed anyway (we are using >> > > > > > >>> >> third party cgroup called cgroup-uid from Andrea Righi, which is able >> > > > > > >>> >> to associate all user's processes with target cgroup). Look here for >> > > > > > >>> >> cgroup-uid patch: >> > > > > > >>> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch >> > > > > > >>> >> >> > > > > > >>> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was >> > > > > > >>> >> permanently '1'. >> > > > > > >>> > >> > > > > > >>> >This is really strange. Could you post the whole diff against stable >> > > > > > >>> >tree you are using (except for grsecurity stuff and the above cgroup-uid >> > > > > > >>> >patch)? >> > > > > > >>> >> > > > > > >>> >> > > > > > >>> Here are all patches which i applied to kernel 3.2.48 in my last test: >> > > > > > >>> http://watchdog.sk/lkml/patches3/ >> > > > > > >> >> > > > > > >>The two patches from Johannes seem correct. >> > > > > > >> >> > > > > > >>From a quick look even grsecurity patchset shouldn't interfere as it >> > > > > > >>doesn't seem to put any code between handle_mm_fault and mm_fault_error >> > > > > > >>and there also doesn't seem to be any new handle_mm_fault call sites. >> > > > > > >> >> > > > > > >>But I cannot tell there aren't other code paths which would lead to a >> > > > > > >>memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling. 
>> > > > > > > >> > > > > > > >> > > > > > >Michal, >> > > > > > > >> > > > > > >now i can definitely confirm that problem with unremovable cgroups >> > > > > > >persists. What info do you need from me? I applied also your little >> > > > > > >'WARN_ON' patch. >> > > > > > >> > > > > > Ok, i think you want this: >> > > > > > http://watchdog.sk/lkml/kern4.log >> > > > > >> > > > > Jul 14 01:11:39 server01 kernel: [ 593.589087] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name >> > > > > Jul 14 01:11:39 server01 kernel: [ 593.589451] [12021] 1333 12021 172027 64723 4 0 0 apache2 >> > > > > Jul 14 01:11:39 server01 kernel: [ 593.589647] [12030] 1333 12030 172030 64748 2 0 0 apache2 >> > > > > Jul 14 01:11:39 server01 kernel: [ 593.589836] [12031] 1333 12031 172030 64749 3 0 0 apache2 >> > > > > Jul 14 01:11:39 server01 kernel: [ 593.590025] [12032] 1333 12032 170619 63428 3 0 0 apache2 >> > > > > Jul 14 01:11:39 server01 kernel: [ 593.590213] [12033] 1333 12033 167934 60524 2 0 0 apache2 >> > > > > Jul 14 01:11:39 server01 kernel: [ 593.590401] [12034] 1333 12034 170747 63496 4 0 0 apache2 >> > > > > Jul 14 01:11:39 server01 kernel: [ 593.590588] [12035] 1333 12035 169659 62451 1 0 0 apache2 >> > > > > Jul 14 01:11:39 server01 kernel: [ 593.590776] [12036] 1333 12036 167614 60384 3 0 0 apache2 >> > > > > Jul 14 01:11:39 server01 kernel: [ 593.590984] [12037] 1333 12037 166342 58964 3 0 0 apache2 >> > > > > Jul 14 01:11:39 server01 kernel: [ 593.591178] Memory cgroup out of memory: Kill process 12021 (apache2) score 847 or sacrifice child >> > > > > Jul 14 01:11:39 server01 kernel: [ 593.591370] Killed process 12021 (apache2) total-vm:688108kB, anon-rss:255472kB, file-rss:3420kB >> > > > > Jul 14 01:11:41 server01 kernel: [ 595.392920] ------------[ cut here ]------------ >> > > > > Jul 14 01:11:41 server01 kernel: [ 595.393096] WARNING: at kernel/exit.c:888 do_exit+0x7d0/0x870() >> > > > > Jul 14 01:11:41 server01 kernel: [ 595.393256] Hardware name: 
S5000VSA >> > > > > Jul 14 01:11:41 server01 kernel: [ 595.393415] Pid: 12037, comm: apache2 Not tainted 3.2.48-grsec #1 >> > > > > Jul 14 01:11:41 server01 kernel: [ 595.393577] Call Trace: >> > > > > Jul 14 01:11:41 server01 kernel: [ 595.393737] [] warn_slowpath_common+0x7a/0xb0 >> > > > > Jul 14 01:11:41 server01 kernel: [ 595.393903] [] warn_slowpath_null+0x1a/0x20 >> > > > > Jul 14 01:11:41 server01 kernel: [ 595.394068] [] do_exit+0x7d0/0x870 >> > > > > Jul 14 01:11:41 server01 kernel: [ 595.394231] [] ? thread_group_times+0x44/0xb0 >> > > > > Jul 14 01:11:41 server01 kernel: [ 595.394392] [] do_group_exit+0x51/0xc0 >> > > > > Jul 14 01:11:41 server01 kernel: [ 595.394551] [] sys_exit_group+0x17/0x20 >> > > > > Jul 14 01:11:41 server01 kernel: [ 595.394714] [] system_call_fastpath+0x18/0x1d >> > > > > Jul 14 01:11:41 server01 kernel: [ 595.394921] ---[ end trace 738570e688acf099 ]--- >> > > > > >> > > > > OK, so you had an OOM which has been handled by in-kernel oom handler >> > > > > (it killed 12021) and 12037 was in the same group. The warning tells us >> > > > > that it went through mem_cgroup_oom as well (otherwise it wouldn't have >> > > > > memcg_oom.wait_on_memcg set and the warning wouldn't trigger) and then >> > > > > it exited on the userspace request (by exit syscall). >> > > > > >> > > > > I do not see any way how, this could happen though. If mem_cgroup_oom >> > > > > is called then we always return CHARGE_NOMEM which turns into ENOMEM >> > > > > returned by __mem_cgroup_try_charge (invoke_oom must have been set to >> > > > > true). So if nobody screwed the return value on the way up to page >> > > > > fault handler then there is no way to escape. >> > > > > >> > > > > I will check the code. 
>> > > > >> > > > OK, I guess I found it: >> > > > __do_fault >> > > > fault = filemap_fault >> > > > do_async_mmap_readahead >> > > > page_cache_async_readahead >> > > > ondemand_readahead >> > > > __do_page_cache_readahead >> > > > read_pages >> > > > readpages = ext3_readpages >> > > > mpage_readpages # Doesn't propagate ENOMEM >> > > > add_to_page_cache_lru >> > > > add_to_page_cache >> > > > add_to_page_cache_locked >> > > > mem_cgroup_cache_charge >> > > > >> > > > So the read ahead most probably. Again! Duhhh. I will try to think >> > > > about a fix for this. One obvious place is mpage_readpages but >> > > > __do_page_cache_readahead ignores read_pages return value as well and >> > > > page_cache_async_readahead, even worse, is just void and exported as >> > > > such. >> > > > >> > > > So this smells like a hard to fix bugger. One possible, and really ugly >> > > > way would be calling mem_cgroup_oom_synchronize even if handle_mm_fault >> > > > doesn't return VM_FAULT_ERROR, but that is a crude hack. > >I fixed it by disabling the OOM killer altogether for readahead code. >We don't do it globally, we should not do it in the memcg, these are >optional allocations/charges. > >I also disabled it for kernel faults triggered from within a syscall >(copy_*user, get_user_pages), which should just return -ENOMEM as >usual (unless it's nested inside a userspace fault). The only >downside is that we can't get around annotating userspace faults >anymore, so every architecture fault handler now passes >FAULT_FLAG_USER to handle_mm_fault(). Makes the series a little less >self-contained, but it's not unreasonable. > >It's easy to detect leaks now by checking if the memcg OOM context is >setup and we are not returning VM_FAULT_OOM. > >Here is a combined diff based on 3.2. azurIt, any chance you could >give this a shot? I tested it on my local machines, but you have a >known reproducer of fairly unlikely scenarios... I will be out of office between 25.7. and 1.8. 
and I don't want to run anything which could potentially cause an outage of our services. I will test this patch after 2.8. Should I also use the previous patches, or is this one enough?

Thank you very much Johannes.

azur

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Johannes Weiner
Subject: Re: [patch 3/5] x86: finish fault error path with fatal signal
Date: Wed, 24 Jul 2013 16:32:05 -0400
Message-ID: <20130724203205.GL715@cmpxchg.org>
References: <20130711072507.GA21667@dhcp22.suse.cz> <20130714012641.C2DA4E05@pobox.sk> <20130714015112.FFCB7AF7@pobox.sk> <20130715154119.GA32435@dhcp22.suse.cz> <20130715160006.GB32435@dhcp22.suse.cz> <20130716153544.GX17812@cmpxchg.org> <20130716160905.GA20018@dhcp22.suse.cz> <20130716164830.GZ17812@cmpxchg.org> <20130719042124.GC17812@cmpxchg.org> <20130719042502.GF17812@cmpxchg.org>
Mime-Version: 1.0
Return-path:
Content-Disposition: inline
In-Reply-To: <20130719042502.GF17812-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
List-ID:
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: Michal Hocko
Cc: azurIt , linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist , KAMEZAWA Hiroyuki , righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org

On Fri, Jul 19, 2013 at 12:25:02AM -0400, Johannes Weiner wrote:
> The x86 fault handler bails in the middle of error handling when the
> task has been killed. For the next patch this is a problem, because
> it relies on pagefault_out_of_memory() being called even when the task
> has been killed, to perform proper OOM state unwinding.
>
> This is a rather minor optimization, just remove it.
>
> Signed-off-by: Johannes Weiner
> ---
>  arch/x86/mm/fault.c | 11 -----------
>  1 file changed, 11 deletions(-)
>
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 1cebabe..90248c9 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -846,17 +846,6 @@ static noinline int
>  mm_fault_error(struct pt_regs *regs, unsigned long error_code,
>  	       unsigned long address, unsigned int fault)
>  {
> -	/*
> -	 * Pagefault was interrupted by SIGKILL. We have no reason to
> -	 * continue pagefault.
> -	 */
> -	if (fatal_signal_pending(current)) {
> -		if (!(fault & VM_FAULT_RETRY))
> -			up_read(&current->mm->mmap_sem);
> -		if (!(error_code & PF_USER))
> -			no_context(regs, error_code, address);
> -		return 1;

This is broken but I only hit it now after testing for a while.

The patch has the right idea: in case of an OOM kill, we should
continue the fault and not abort. What I missed is that in case of a
kill during lock_page, i.e. VM_FAULT_RETRY && fatal_signal, we have to
exit the fault and not do up_read() etc. This introduced a locking
imbalance that would get everybody hung on mmap_sem.

I moved the retry handling outside of mm_fault_error() (come on...)
and stole some documentation from arm. It's now a little bit more
explicit and comparable to other architectures.

I'll send an updated series, patch for reference:

---
From: Johannes Weiner
Subject: [patch] x86: finish fault error path with fatal signal

The x86 fault handler bails in the middle of error handling when the
task has been killed. For the next patch this is a problem, because
it relies on pagefault_out_of_memory() being called even when the task
has been killed, to perform proper OOM state unwinding.

This is a rather minor optimization that cuts short the fault handling
by a few instructions in rare cases. Just remove it.

Signed-off-by: Johannes Weiner
---
 arch/x86/mm/fault.c | 33 +++++++++++++--------------------
 1 file changed, 13 insertions(+), 20 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 6d77c38..0c18beb 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -842,31 +842,17 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
 	force_sig_info_fault(SIGBUS, code, address, tsk, fault);
 }

-static noinline int
+static noinline void
 mm_fault_error(struct pt_regs *regs, unsigned long error_code,
 	       unsigned long address, unsigned int fault)
 {
-	/*
-	 * Pagefault was interrupted by SIGKILL. We have no reason to
-	 * continue pagefault.
-	 */
-	if (fatal_signal_pending(current)) {
-		if (!(fault & VM_FAULT_RETRY))
-			up_read(&current->mm->mmap_sem);
-		if (!(error_code & PF_USER))
-			no_context(regs, error_code, address, 0, 0);
-		return 1;
-	}
-	if (!(fault & VM_FAULT_ERROR))
-		return 0;
-
 	if (fault & VM_FAULT_OOM) {
 		/* Kernel mode? Handle exceptions or die: */
 		if (!(error_code & PF_USER)) {
 			up_read(&current->mm->mmap_sem);
 			no_context(regs, error_code, address,
 				   SIGSEGV, SEGV_MAPERR);
-			return 1;
+			return;
 		}

 		up_read(&current->mm->mmap_sem);
@@ -884,7 +870,6 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code,
 		else
 			BUG();
 	}
-	return 1;
 }

 static int spurious_fault_check(unsigned long error_code, pte_t *pte)
@@ -1189,9 +1174,17 @@ good_area:
 	 */
 	fault = handle_mm_fault(mm, vma, address, flags);

-	if (unlikely(fault & (VM_FAULT_RETRY|VM_FAULT_ERROR))) {
-		if (mm_fault_error(regs, error_code, address, fault))
-			return;
+	/*
+	 * If we need to retry but a fatal signal is pending, handle the
+	 * signal first. We do not need to release the mmap_sem because it
+	 * would already be released in __lock_page_or_retry in mm/filemap.c.
+	 */
+	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+		return;
+
+	if (unlikely(fault & VM_FAULT_ERROR)) {
+		mm_fault_error(regs, error_code, address, fault);
+		return;
 	}

 	/*
--
1.8.3.2

From mboxrd@z Thu Jan 1 00:00:00 1970
From: KOSAKI Motohiro
Subject: Re: [patch 3/5] x86: finish fault error path with fatal signal
Date: Thu, 25 Jul 2013 16:29:13 -0400
Message-ID: <51F18A99.7000306@gmail.com>
References: <20130711072507.GA21667@dhcp22.suse.cz> <20130714012641.C2DA4E05@pobox.sk> <20130714015112.FFCB7AF7@pobox.sk> <20130715154119.GA32435@dhcp22.suse.cz> <20130715160006.GB32435@dhcp22.suse.cz> <20130716153544.GX17812@cmpxchg.org> <20130716160905.GA20018@dhcp22.suse.cz> <20130716164830.GZ17812@cmpxchg.org> <20130719042124.GC17812@cmpxchg.org> <20130719042502.GF17812@cmpxchg.org> <20130724203205.GL715@cmpxchg.org>
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Return-path:
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=message-id:date:from:user-agent:mime-version:to:cc:subject :references:in-reply-to:content-type:content-transfer-encoding; bh=iR1ZD/w4CqYES5L61dJ7B/1wEdqySM2mB/fpPjRZ6pY=; b=D/WL9iEROciNS6YIW+CH7zqlPsmnTsvAVPWcKKGaZE/xMsjqDKg4r3O7dMObgUKW0r S/guuP3jyjfquwPpNUQ7hK8cQrsTFmSO2lgxowS59WdKN8JqsdUP/ZVD+ccYcu6qg8Qu No7BRH6zU1UAhM56xHq7HhhQ98kLA0Gm1d0blgsfFy+XNvrzX1+sByt6h/5Yp/YshuFi YiU0v9+Ui7838w0ODDY5Z4ZphRMd5FTNI3t+pX1c/mU/9qAc5+fhFsfoNPPTsQn4m7EQ lXNCtHtX3zBtW6vaGJP5kWziUeO2WHo1iQzDzNRmq2jFiAJ9rVST3FrNZqw3yoWJRQau AObQ==
In-Reply-To: <20130724203205.GL715@cmpxchg.org>
Sender: owner-linux-mm@kvack.org
List-ID:
Content-Type: text/plain; charset="us-ascii"; format="flowed"
To: Johannes Weiner
Cc: Michal Hocko , azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , righi.andrea@gmail.com, kosaki.motohiro@gmail.com

(7/24/13 4:32 PM), Johannes Weiner wrote:
> On Fri, Jul 19, 2013 at 12:25:02AM -0400, Johannes Weiner wrote:
>> The x86 fault handler bails in the
middle of error handling when the
>> task has been killed. For the next patch this is a problem, because
>> it relies on pagefault_out_of_memory() being called even when the task
>> has been killed, to perform proper OOM state unwinding.
>>
>> This is a rather minor optimization, just remove it.
>>
>> Signed-off-by: Johannes Weiner
>> ---
>>  arch/x86/mm/fault.c | 11 -----------
>>  1 file changed, 11 deletions(-)
>>
>> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
>> index 1cebabe..90248c9 100644
>> --- a/arch/x86/mm/fault.c
>> +++ b/arch/x86/mm/fault.c
>> @@ -846,17 +846,6 @@ static noinline int
>>  mm_fault_error(struct pt_regs *regs, unsigned long error_code,
>>  	       unsigned long address, unsigned int fault)
>>  {
>> -	/*
>> -	 * Pagefault was interrupted by SIGKILL. We have no reason to
>> -	 * continue pagefault.
>> -	 */
>> -	if (fatal_signal_pending(current)) {
>> -		if (!(fault & VM_FAULT_RETRY))
>> -			up_read(&current->mm->mmap_sem);
>> -		if (!(error_code & PF_USER))
>> -			no_context(regs, error_code, address);
>> -		return 1;
>
> This is broken but I only hit it now after testing for a while.
>
> The patch has the right idea: in case of an OOM kill, we should
> continue the fault and not abort. What I missed is that in case of a
> kill during lock_page, i.e. VM_FAULT_RETRY && fatal_signal, we have to
> exit the fault and not do up_read() etc. This introduced a locking
> imbalance that would get everybody hung on mmap_sem.
>
> I moved the retry handling outside of mm_fault_error() (come on...)
> and stole some documentation from arm. It's now a little bit more
> explicit and comparable to other architectures.
>
> I'll send an updated series, patch for reference:
>
> ---
> From: Johannes Weiner
> Subject: [patch] x86: finish fault error path with fatal signal
>
> The x86 fault handler bails in the middle of error handling when the
> task has been killed. For the next patch this is a problem, because
> it relies on pagefault_out_of_memory() being called even when the task
> has been killed, to perform proper OOM state unwinding.
>
> This is a rather minor optimization that cuts short the fault handling
> by a few instructions in rare cases. Just remove it.
>
> Signed-off-by: Johannes Weiner
> ---
>  arch/x86/mm/fault.c | 33 +++++++++++++--------------------
>  1 file changed, 13 insertions(+), 20 deletions(-)
>
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 6d77c38..0c18beb 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -842,31 +842,17 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
>  	force_sig_info_fault(SIGBUS, code, address, tsk, fault);
>  }
>
> -static noinline int
> +static noinline void
>  mm_fault_error(struct pt_regs *regs, unsigned long error_code,
>  	       unsigned long address, unsigned int fault)
>  {
> -	/*
> -	 * Pagefault was interrupted by SIGKILL. We have no reason to
> -	 * continue pagefault.
> -	 */
> -	if (fatal_signal_pending(current)) {
> -		if (!(fault & VM_FAULT_RETRY))
> -			up_read(&current->mm->mmap_sem);
> -		if (!(error_code & PF_USER))
> -			no_context(regs, error_code, address, 0, 0);
> -		return 1;
> -	}
> -	if (!(fault & VM_FAULT_ERROR))
> -		return 0;
> -
>  	if (fault & VM_FAULT_OOM) {
>  		/* Kernel mode? Handle exceptions or die: */
>  		if (!(error_code & PF_USER)) {
>  			up_read(&current->mm->mmap_sem);
>  			no_context(regs, error_code, address,
>  				   SIGSEGV, SEGV_MAPERR);
> -			return 1;
> +			return;
>  		}
>
>  		up_read(&current->mm->mmap_sem);
> @@ -884,7 +870,6 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code,
>  		else
>  			BUG();
>  	}
> -	return 1;
>  }
>
>  static int spurious_fault_check(unsigned long error_code, pte_t *pte)
> @@ -1189,9 +1174,17 @@ good_area:
>  	 */
>  	fault = handle_mm_fault(mm, vma, address, flags);
>
> -	if (unlikely(fault & (VM_FAULT_RETRY|VM_FAULT_ERROR))) {
> -		if (mm_fault_error(regs, error_code, address, fault))
> -			return;
> +	/*
> +	 * If we need to retry but a fatal signal is pending, handle the
> +	 * signal first. We do not need to release the mmap_sem because it
> +	 * would already be released in __lock_page_or_retry in mm/filemap.c.
> +	 */
> +	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
> +		return;
> +
> +	if (unlikely(fault & VM_FAULT_ERROR)) {
> +		mm_fault_error(regs, error_code, address, fault);
> +		return;
>  	}

When I made the patch you removed code, Ingo suggested we need put all rare case code
into if(unlikely()) block. Yes, this is purely micro optimization. But it is not costly
to maintain.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Johannes Weiner Subject: Re: [patch 3/5] x86: finish fault error path with fatal signal Date: Thu, 25 Jul 2013 17:50:33 -0400 Message-ID: <20130725215033.GP715@cmpxchg.org> References: <20130714015112.FFCB7AF7@pobox.sk> <20130715154119.GA32435@dhcp22.suse.cz> <20130715160006.GB32435@dhcp22.suse.cz> <20130716153544.GX17812@cmpxchg.org> <20130716160905.GA20018@dhcp22.suse.cz> <20130716164830.GZ17812@cmpxchg.org> <20130719042124.GC17812@cmpxchg.org> <20130719042502.GF17812@cmpxchg.org> <20130724203205.GL715@cmpxchg.org> <51F18A99.7000306@gmail.com> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <51F18A99.7000306-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: KOSAKI Motohiro Cc: Michal Hocko , azurIt , linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups mailinglist , KAMEZAWA Hiroyuki , righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org On Thu, Jul 25, 2013 at 04:29:13PM -0400, KOSAKI Motohiro wrote: > (7/24/13 4:32 PM), Johannes Weiner wrote: > >@@ -1189,9 +1174,17 @@ good_area: > > */ > > fault = handle_mm_fault(mm, vma, address, flags); > > > >- if (unlikely(fault & (VM_FAULT_RETRY|VM_FAULT_ERROR))) { > >- if (mm_fault_error(regs, error_code, address, fault)) > >- return; > >+ /* > >+ * If we need to retry but a fatal signal is pending, handle the > >+ * signal first. We do not need to release the mmap_sem because it > >+ * would already be released in __lock_page_or_retry in mm/filemap.c. 
> >+ */ > >+ if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) > >+ return; > >+ > >+ if (unlikely(fault & VM_FAULT_ERROR)) { > >+ mm_fault_error(regs, error_code, address, fault); > >+ return; > > } > > When I made the patch you removed code, Ingo suggested we need put all rare case code > into if(unlikely()) block. Yes, this is purely micro optimization. But it is not costly > to maintain. Fair enough, thanks for the heads up! From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx103.postini.com [74.125.245.103]) by kanga.kvack.org (Postfix) with SMTP id 4669D6B0072 for ; Wed, 21 Nov 2012 19:27:15 -0500 (EST) Received: from m2.gw.fujitsu.co.jp (unknown [10.0.50.72]) by fgwmail5.fujitsu.co.jp (Postfix) with ESMTP id 5F0B23EE0C5 for ; Thu, 22 Nov 2012 09:27:13 +0900 (JST) Received: from smail (m2 [127.0.0.1]) by outgoing.m2.gw.fujitsu.co.jp (Postfix) with ESMTP id 42C3A45DE4D for ; Thu, 22 Nov 2012 09:27:13 +0900 (JST) Received: from s2.gw.fujitsu.co.jp (s2.gw.fujitsu.co.jp [10.0.50.92]) by m2.gw.fujitsu.co.jp (Postfix) with ESMTP id 2CDBF45DD78 for ; Thu, 22 Nov 2012 09:27:13 +0900 (JST) Received: from s2.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s2.gw.fujitsu.co.jp (Postfix) with ESMTP id 1E9D01DB802C for ; Thu, 22 Nov 2012 09:27:13 +0900 (JST) Received: from m1001.s.css.fujitsu.com (m1001.s.css.fujitsu.com [10.240.81.139]) by s2.gw.fujitsu.co.jp (Postfix) with ESMTP id CD5F11DB8038 for ; Thu, 22 Nov 2012 09:27:12 +0900 (JST) Message-ID: <50AD713F.9030909@jp.fujitsu.com> Date: Thu, 22 Nov 2012 09:26:39 +0900 From: Kamezawa Hiroyuki MIME-Version: 1.0 Subject: Re: memory-cgroup bug References: <20121121200207.01068046@pobox.sk> In-Reply-To: <20121121200207.01068046@pobox.sk> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm (2012/11/22 4:02), azurIt wrote: > Hi, > > i'm 
using memory cgroup for limiting our users and having a really strange problem when a cgroup gets out of its memory limit. It's very strange because it happens only sometimes (about once per week on random user), out of memory is usually handled ok. This happens when problem occures: > - no new processes can be started for this cgroup > - current processes are freezed and taking 100% of CPU > - when i try to 'strace' any of current processes, the whole strace freezes until process is killed (strace cannot be terminated by CTRL-c) > - problem can be resolved by raising memory limit for cgroup or killing of few processes inside cgroup so some memory is freed > > I also garbbed the content of /proc//stack of freezed process: > [] mem_cgroup_handle_oom+0x241/0x3b0 > [] T.1146+0x5ab/0x5c0 > [] mem_cgroup_charge_common+0x56/0xa0 > [] mem_cgroup_newpage_charge+0x45/0x50 > [] do_wp_page+0x14e/0x800 > [] handle_pte_fault+0x264/0x940 > [] handle_mm_fault+0x138/0x260 > [] do_page_fault+0x13d/0x460 > [] page_fault+0x1f/0x30 > [] 0xffffffffffffffff > > I'm currently using kernel 3.2.34 but i'm having this problem since 2.6.32. > > Any ideas? Thnx. > Under OOM in memcg, only one process is allowed to work. Because processes tends to use up CPU at memory shortage. other processes are freezed. Then, the problem here is the one process which uses CPU. IIUC, 'freezed' threads are in sleep and never use CPU. It's expected oom-killer or memory-reclaim can solve the probelm. What is your memcg's memory.oom_control value ? and process's oom_adj values ? (/proc//oom_adj, /proc//oom_score_adj) Thanks, -Kame > azurIt > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx107.postini.com [74.125.245.107]) by kanga.kvack.org (Postfix) with SMTP id 81D9C6B005A for ; Thu, 22 Nov 2012 04:36:20 -0500 (EST)
Subject: =?utf-8?q?Re=3A_memory=2Dcgroup_bug?=
Date: Thu, 22 Nov 2012 10:36:18 +0100
From: "azurIt"
References: <20121121200207.01068046@pobox.sk> <50AD713F.9030909@jp.fujitsu.com>
In-Reply-To: <50AD713F.9030909@jp.fujitsu.com>
MIME-Version: 1.0
Message-Id: <20121122103618.79F03818@pobox.sk>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: =?utf-8?q?Kamezawa_Hiroyuki?=
Cc: linux-kernel@vger.kernel.org, linux-mm

______________________________________________________________
> From: "Kamezawa Hiroyuki"
> To: azurIt
> Date: 22.11.2012 01:27
> Subject: Re: memory-cgroup bug
>
> CC: linux-kernel@vger.kernel.org, "linux-mm"
>(2012/11/22 4:02), azurIt wrote:
>> Hi,
>>
>> i'm using memory cgroup for limiting our users and having a really strange problem when a cgroup gets out of its memory limit. It's very strange because it happens only sometimes (about once per week on random user), out of memory is usually handled ok.
This happens when problem occures: >> - no new processes can be started for this cgroup >> - current processes are freezed and taking 100% of CPU >> - when i try to 'strace' any of current processes, the whole strace freezes until process is killed (strace cannot be terminated by CTRL-c) >> - problem can be resolved by raising memory limit for cgroup or killing of few processes inside cgroup so some memory is freed >> >> I also garbbed the content of /proc//stack of freezed process: >> [] mem_cgroup_handle_oom+0x241/0x3b0 >> [] T.1146+0x5ab/0x5c0 >> [] mem_cgroup_charge_common+0x56/0xa0 >> [] mem_cgroup_newpage_charge+0x45/0x50 >> [] do_wp_page+0x14e/0x800 >> [] handle_pte_fault+0x264/0x940 >> [] handle_mm_fault+0x138/0x260 >> [] do_page_fault+0x13d/0x460 >> [] page_fault+0x1f/0x30 >> [] 0xffffffffffffffff >> >> I'm currently using kernel 3.2.34 but i'm having this problem since 2.6.32. >> >> Any ideas? Thnx. >> > >Under OOM in memcg, only one process is allowed to work. Because processes tends to use up >CPU at memory shortage. other processes are freezed. > > >Then, the problem here is the one process which uses CPU. IIUC, 'freezed' threads are >in sleep and never use CPU. It's expected oom-killer or memory-reclaim can solve the probelm. > >What is your memcg's memory.oom_control value ? oom_kill_disable 0 >and process's oom_adj values ? (/proc//oom_adj, /proc//oom_score_adj) when i look to a random user PID (Apache web server): oom_adj = 0 oom_score_adj = 0 I can look also to the data of 'freezed' proces if you need it but i will have to wait until problem occurs again. The main problem is that when this problem happens, it's NOT resolved automatically by kernel/OOM and user of cgroup, where it happend, has non-working services until i kill his processes by hand. I'm sure that all 'freezed' processes are taking very much CPU because also server load goes really high - next time i will make a screenshot of htop. 
I really wonder why OOM is __sometimes__ not resolving this (it usually
is, only sometimes not).

Thank you!

azur

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx170.postini.com [74.125.245.170]) by kanga.kvack.org (Postfix) with SMTP id B492D8D0003 for ; Thu, 22 Nov 2012 16:45:30 -0500 (EST)
Received: by mail-ea0-f169.google.com with SMTP id a12so3200583eaa.14 for ; Thu, 22 Nov 2012 13:45:29 -0800 (PST)
Date: Thu, 22 Nov 2012 22:45:27 +0100
From: Michal Hocko
Subject: Re: memory-cgroup bug
Message-ID: <20121122214527.GB20319@dhcp22.suse.cz>
References: <20121121200207.01068046@pobox.sk> <50AD713F.9030909@jp.fujitsu.com> <20121122103618.79F03818@pobox.sk>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20121122103618.79F03818@pobox.sk>
Sender: owner-linux-mm@kvack.org
List-ID:
To: azurIt
Cc: Kamezawa Hiroyuki , linux-kernel@vger.kernel.org, linux-mm

On Thu 22-11-12 10:36:18, azurIt wrote:
[...]
> I can look also to the data of 'frozen' process if you need it but i
> will have to wait until problem occurs again.
>
> The main problem is that when this problem happens, it's NOT resolved
> automatically by kernel/OOM and user of cgroup, where it happened, has
> non-working services until i kill his processes by hand. I'm sure
> that all 'frozen' processes are taking very much CPU because also
> server load goes really high - next time i will make a screenshot of
> htop. I really wonder why OOM is __sometimes__ not resolving this
> (it usually is, only sometimes not).

What does your kernel log say while this is happening? Are there any
memcg OOM messages showing up?
-- 
Michal Hocko
SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx148.postini.com [74.125.245.148]) by kanga.kvack.org (Postfix) with SMTP id 22D136B005A for ; Fri, 23 Nov 2012 02:40:29 -0500 (EST)
Received: by mail-vb0-f41.google.com with SMTP id v13so11275602vbk.14 for ; Thu, 22 Nov 2012 23:40:28 -0800 (PST)
Date: Fri, 23 Nov 2012 08:40:23 +0100
From: Michal Hocko
Subject: Re: memory-cgroup bug
Message-ID: <20121123074023.GA24698@dhcp22.suse.cz>
References: <20121121200207.01068046@pobox.sk> <20121122152441.GA9609@dhcp22.suse.cz> <20121122190526.390C7A28@pobox.sk> <20121122214249.GA20319@dhcp22.suse.cz> <20121122233434.3D5E35E6@pobox.sk>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20121122233434.3D5E35E6@pobox.sk>
Sender: owner-linux-mm@kvack.org
List-ID:
To: azurIt
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist

On Thu 22-11-12 23:34:34, azurIt wrote:
[...]
> >And finally could you post the disassembly of your version of
> >mem_cgroup_handle_oom, please?
>
> How can i do this?

Either use gdb YOUR_VMLINUX and disassemble mem_cgroup_handle_oom or
use objdump -d YOUR_VMLINUX and copy out only mem_cgroup_handle_oom
function.
-- 
Michal Hocko
SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx179.postini.com [74.125.245.179]) by kanga.kvack.org (Postfix) with SMTP id 199B86B004D for ; Fri, 23 Nov 2012 04:21:40 -0500 (EST)
Subject: =?utf-8?q?Re=3A_memory=2Dcgroup_bug?=
Date: Fri, 23 Nov 2012 10:21:37 +0100
From: "azurIt"
References: <20121121200207.01068046@pobox.sk>, <20121122152441.GA9609@dhcp22.suse.cz>, <20121122190526.390C7A28@pobox.sk>, <20121122214249.GA20319@dhcp22.suse.cz>, <20121122233434.3D5E35E6@pobox.sk> <20121123074023.GA24698@dhcp22.suse.cz>
In-Reply-To: <20121123074023.GA24698@dhcp22.suse.cz>
MIME-Version: 1.0
Message-Id: <20121123102137.10D6D653@pobox.sk>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: =?utf-8?q?Michal_Hocko?=
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, =?utf-8?q?cgroups_mailinglist?=

>Either use gdb YOUR_VMLINUX and disassemble mem_cgroup_handle_oom or
>use objdump -d YOUR_VMLINUX and copy out only mem_cgroup_handle_oom
>function.

If 'YOUR_VMLINUX' is supposed to be my kernel image:

# gdb vmlinuz-3.2.34-grsec-1
GNU gdb (GDB) 7.0.1-debian
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
...
"/root/bug/vmlinuz-3.2.34-grsec-1": not in executable format: File format not recognized

# objdump -d vmlinuz-3.2.34-grsec-1
objdump: vmlinuz-3.2.34-grsec-1: File format not recognized

# file vmlinuz-3.2.34-grsec-1
vmlinuz-3.2.34-grsec-1: Linux kernel x86 boot executable bzImage, version 3.2.34-grsec (root@server01) #1, RO-rootFS, swap_dev 0x3, Normal VGA

I'm probably doing something wrong :)

It, luckily, happened again so i have more info.

 - there weren't any logs in kernel from OOM for that cgroup
 - there were 16 processes in cgroup
 - processes in cgroup were taking together 100% of CPU (it
   was allowed to use only one core, so 100% of that core)
 - memory.failcnt was growing fast
 - oom_control:
   oom_kill_disable 0
   under_oom 0 (this was looping from 0 to 1)
 - limit_in_bytes was set to 157286400
 - content of stat (as you can see, the whole memory limit was used):
   cache 0
   rss 0
   mapped_file 0
   pgpgin 0
   pgpgout 0
   swap 0
   pgfault 0
   pgmajfault 0
   inactive_anon 0
   active_anon 0
   inactive_file 0
   active_file 0
   unevictable 0
   hierarchical_memory_limit 157286400
   hierarchical_memsw_limit 157286400
   total_cache 0
   total_rss 157286400
   total_mapped_file 0
   total_pgpgin 10326454
   total_pgpgout 10288054
   total_swap 0
   total_pgfault 12939677
   total_pgmajfault 4283
   total_inactive_anon 0
   total_active_anon 157286400
   total_inactive_file 0
   total_active_file 0
   total_unevictable 0

i also grabbed oom_adj, oom_score_adj and stack of all processes, here
it is:
http://www.watchdog.sk/lkml/memcg-bug.tar

Notice that stack is different for few processes. Stack for all
processes were NOT changing and was still the same.

Btw, don't know if it matters but i have several cgroup subsystems
mounted and i'm also using them (i was not activating freezer in this
case, don't know if it can be activated automatically by kernel or what,
didn't check if cgroup was frozen but i suppose it wasn't):
none /cgroups cgroup defaults,cpuacct,cpuset,memory,freezer,task,blkio 0 0

Thank you.
azur

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx131.postini.com [74.125.245.131]) by kanga.kvack.org (Postfix) with SMTP id E01E06B007B for ; Fri, 23 Nov 2012 04:44:25 -0500 (EST)
Subject: =?utf-8?q?Re=3A_memory=2Dcgroup_bug?=
Date: Fri, 23 Nov 2012 10:44:23 +0100
From: "azurIt"
References: <20121121200207.01068046@pobox.sk>, <20121122152441.GA9609@dhcp22.suse.cz>, <20121122190526.390C7A28@pobox.sk>, <20121122214249.GA20319@dhcp22.suse.cz>, <20121122233434.3D5E35E6@pobox.sk>, <20121123074023.GA24698@dhcp22.suse.cz>, <20121123102137.10D6D653@pobox.sk> <20121123092829.GE24698@dhcp22.suse.cz>
In-Reply-To: <20121123092829.GE24698@dhcp22.suse.cz>
MIME-Version: 1.0
Message-Id: <20121123104423.338C7725@pobox.sk>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: =?utf-8?q?Michal_Hocko?=
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, =?utf-8?q?cgroups_mailinglist?=

> CC: linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist"
>On Fri 23-11-12 10:21:37, azurIt wrote:
>> >Either use gdb YOUR_VMLINUX and disassemble mem_cgroup_handle_oom or
>> >use objdump -d YOUR_VMLINUX and copy out only mem_cgroup_handle_oom
>> >function.
>> If 'YOUR_VMLINUX' is supposed to be my kernel image:
>>
>> # gdb vmlinuz-3.2.34-grsec-1
>> GNU gdb (GDB) 7.0.1-debian
>> Copyright (C) 2009 Free Software Foundation, Inc.
>> License GPLv3+: GNU GPL version 3 or later
>> This is free software: you are free to change and redistribute it.
>> There is NO WARRANTY, to the extent permitted by law. Type "show copying"
>> and "show warranty" for details.
>> This GDB was configured as "x86_64-linux-gnu".
>> For bug reporting instructions, please see:
>> ...
>> "/root/bug/vmlinuz-3.2.34-grsec-1": not in executable format: File format not recognized
>>
>>
>> # objdump -d vmlinuz-3.2.34-grsec-1
>
>You need vmlinux not vmlinuz...

ok, got it but still no luck:

# gdb vmlinux
GNU gdb (GDB) 7.0.1-debian
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
...
Reading symbols from /root/bug/dddddddd/vmlinux...(no debugging symbols found)...done.
(gdb) disassemble mem_cgroup_handle_oom
No symbol table is loaded. Use the "file" command.

# objdump -d vmlinux | grep mem_cgroup_handle_oom

i can recompile the kernel if anything needs to be added into it.

azur

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx126.postini.com [74.125.245.126]) by kanga.kvack.org (Postfix) with SMTP id 127DD6B0088 for ; Fri, 23 Nov 2012 05:10:39 -0500 (EST)
Received: by mail-vc0-f169.google.com with SMTP id gb30so2920788vcb.14 for ; Fri, 23 Nov 2012 02:10:38 -0800 (PST)
Date: Fri, 23 Nov 2012 11:10:34 +0100
From: Michal Hocko
Subject: Re: memory-cgroup bug
Message-ID: <20121123101034.GG24698@dhcp22.suse.cz>
References: <20121121200207.01068046@pobox.sk> <20121122152441.GA9609@dhcp22.suse.cz> <20121122190526.390C7A28@pobox.sk> <20121122214249.GA20319@dhcp22.suse.cz> <20121122233434.3D5E35E6@pobox.sk> <20121123074023.GA24698@dhcp22.suse.cz> <20121123102137.10D6D653@pobox.sk> <20121123092829.GE24698@dhcp22.suse.cz> <20121123104423.338C7725@pobox.sk>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20121123104423.338C7725@pobox.sk>
Sender: owner-linux-mm@kvack.org
List-ID:
To: azurIt
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist

On Fri 23-11-12 10:44:23, azurIt wrote:
[...]
> # gdb vmlinux
> GNU gdb (GDB) 7.0.1-debian
> Copyright (C) 2009 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law. Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-linux-gnu".
> For bug reporting instructions, please see:
> ...
> Reading symbols from /root/bug/dddddddd/vmlinux...(no debugging symbols found)...done.
> (gdb) disassemble mem_cgroup_handle_oom
> No symbol table is loaded. Use the "file" command.
>
>
>
> # objdump -d vmlinux | grep mem_cgroup_handle_oom
>

Hmm, strange so the function is on the stack but it has been inlined?
Doesn't make much sense to me.

> i can recompile the kernel if anything needs to be added into it.

If you could instrument mem_cgroup_handle_oom with some printks (before
we take the memcg_oom_lock, before we schedule and into
mem_cgroup_out_of_memory)
-- 
Michal Hocko
SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx156.postini.com [74.125.245.156]) by kanga.kvack.org (Postfix) with SMTP id 14CCA6B005D for ; Fri, 23 Nov 2012 09:59:07 -0500 (EST)
Subject: =?utf-8?q?Re=3A_memory=2Dcgroup_bug?=
Date: Fri, 23 Nov 2012 15:59:04 +0100
From: "azurIt"
References: <20121121200207.01068046@pobox.sk>, <20121122152441.GA9609@dhcp22.suse.cz>, <20121122190526.390C7A28@pobox.sk>, <20121122214249.GA20319@dhcp22.suse.cz>, <20121122233434.3D5E35E6@pobox.sk>, <20121123074023.GA24698@dhcp22.suse.cz>, <20121123102137.10D6D653@pobox.sk> <20121123100438.GF24698@dhcp22.suse.cz>
In-Reply-To: <20121123100438.GF24698@dhcp22.suse.cz>
MIME-Version: 1.0
Message-Id: <20121123155904.490039C5@pobox.sk>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: =?utf-8?q?Michal_Hocko?=
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, =?utf-8?q?cgroups_mailinglist?=

>If you could instrument mem_cgroup_handle_oom with some printks (before
>we take the memcg_oom_lock, before we schedule and into
>mem_cgroup_out_of_memory)

If you send me patch i can do it. I'm, unfortunately, not able to code it.

>> It, luckily, happened again so i have more info.
>>
>> - there weren't any logs in kernel from OOM for that cgroup
>> - there were 16 processes in cgroup
>> - processes in cgroup were taking together 100% of CPU (it
>> was allowed to use only one core, so 100% of that core)
>> - memory.failcnt was growing fast
>> - oom_control:
>> oom_kill_disable 0
>> under_oom 0 (this was looping from 0 to 1)
>
>So there was an OOM going on but no messages in the log? Really strange.
>Kame already asked about oom_score_adj of the processes in the group but
>it didn't look like all the processes would have oom disabled, right?

There were no messages telling that some processes were killed because of OOM.

>> - limit_in_bytes was set to 157286400
>> - content of stat (as you can see, the whole memory limit was used):
>> cache 0
>> rss 0
>
>This looks like a top-level group for your user.

Yes, it was from /cgroup//

>> mapped_file 0
>> pgpgin 0
>> pgpgout 0
>> swap 0
>> pgfault 0
>> pgmajfault 0
>> inactive_anon 0
>> active_anon 0
>> inactive_file 0
>> active_file 0
>> unevictable 0
>> hierarchical_memory_limit 157286400
>> hierarchical_memsw_limit 157286400
>> total_cache 0
>> total_rss 157286400
>
>OK, so all the memory is anonymous and you have no swap so the oom is
>the only thing to do.

What will happen if the same situation occurs globally? No swap, every
bit of memory used. Will kernel be able to start OOM killer? Maybe the
same thing is happening in cgroup - there's simply no space to run OOM
killer. And maybe this is why it's happening rarely - usually there are
still at least few KBs of memory left to start OOM killer.
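
As an aside on the figures quoted above: total_rss (157286400) equals hierarchical_memory_limit (157286400 bytes, i.e. 150 MiB), while total_cache and total_swap are 0, which is why the memcg OOM killer is the only way to free memory here. A minimal sketch in C of that check, assuming the text of memory.stat has already been read into a string; the helper names memcg_stat_value and memcg_anon_at_limit are hypothetical and not part of any kernel or libcgroup API:

```c
#include <stdlib.h>
#include <string.h>

/*
 * Look up one counter in the text of a memcg memory.stat file.
 * Each line has the form "<name> <value>". The name must match the
 * whole first field, so "rss" does not match "total_rss".
 * Returns the value, or -1 if the key is not present.
 */
long long memcg_stat_value(const char *stat_text, const char *key)
{
	size_t key_len = strlen(key);
	const char *line = stat_text;

	while (line && *line) {
		if (strncmp(line, key, key_len) == 0 && line[key_len] == ' ')
			return strtoll(line + key_len + 1, NULL, 10);
		line = strchr(line, '\n');
		if (line)
			line++;	/* step past the newline to the next line */
	}
	return -1;
}

/*
 * Hypothetical helper: nonzero when anonymous memory alone (total_rss)
 * has reached the hierarchical limit, so reclaim without swap cannot help.
 */
int memcg_anon_at_limit(const char *stat_text)
{
	long long rss = memcg_stat_value(stat_text, "total_rss");
	long long limit = memcg_stat_value(stat_text, "hierarchical_memory_limit");

	return rss >= 0 && limit > 0 && rss >= limit;
}
```

Fed the stat dump from this thread, memcg_anon_at_limit() reports that the group is pinned at its limit by anonymous pages. It says nothing about why the OOM killer did not resolve the situation.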

>Hmm, all processes waiting for oom are stuck at the very same place:
>$ grep mem_cgroup_handle_oom -r [0-9]*
>30858/stack:[] mem_cgroup_handle_oom+0x241/0x3b0
>30859/stack:[] mem_cgroup_handle_oom+0x241/0x3b0
>30860/stack:[] mem_cgroup_handle_oom+0x241/0x3b0
>30892/stack:[] mem_cgroup_handle_oom+0x241/0x3b0
>30898/stack:[] mem_cgroup_handle_oom+0x241/0x3b0
>31588/stack:[] mem_cgroup_handle_oom+0x241/0x3b0
>32044/stack:[] mem_cgroup_handle_oom+0x241/0x3b0
>32358/stack:[] mem_cgroup_handle_oom+0x241/0x3b0
>6031/stack:[] mem_cgroup_handle_oom+0x241/0x3b0
>6534/stack:[] mem_cgroup_handle_oom+0x241/0x3b0
>7020/stack:[] mem_cgroup_handle_oom+0x241/0x3b0
>
>We are taking memcg_oom_lock spinlock twice in that function + we can
>schedule. As none of the tasks is scheduled this would suggest that you
>are blocked at the first lock. But who got the lock then?
>This is really strange.
>Btw. is sysrq+t resp. sysrq+w showing the same traces as
>/proc//stat?

Unfortunately i'm connecting remotely to the servers (SSH).

>> Notice that stack is different for few processes.
>
>Yes others are in VFS resp ext3. ext3_write_begin looks a bit dangerous
>but it grabs the page before it really starts a transaction.

Maybe these processes were throttled by cgroup-blkio at the same time
and are still keeping the lock? So the problem occurs when they are low
on memory and cgroup is doing IO out of its limits. Only guessing and
telling my thoughts.

>> Stack for all processes were NOT changing and was still the same.
>
>Could you take few snapshots over time?

Will do next time but i can't keep services frozen for a long time or
customers will be angry.

>> didn't check if cgroup was frozen but i suppose it wasn't):
>> none /cgroups cgroup defaults,cpuacct,cpuset,memory,freezer,task,blkio 0 0
>
>Do you see the same issue if only memory controller was mounted (resp.
>cpuset which you seem to use as well from your description).

Uh, we are using all mounted subsystems :( I will be able to umount
only freezer and maybe blkio for some time. Will it help?

>I know you said booting into a vanilla kernel would be problematic but
>could you at least rule out the cgroup patches that you have mentioned?
>If you need to move a task to a group based by an uid you can use
>cgrules daemon (libcgroup1 package) for that as well.

We are using cgroup-uid cos it's MUCH MUCH MUCH more effective and
better. For example, i don't believe that cgroup-task will work with
that daemon. What will happen if cgrules won't be able to add process
into cgroup because of task limit? Process will probably continue and
will run outside of any cgroup which is wrong. With cgroup-task +
cgroup-uid, such processes cannot be even started (and this is what we
need).

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx193.postini.com [74.125.245.193]) by kanga.kvack.org (Postfix) with SMTP id 251A06B005A for ; Sat, 24 Nov 2012 19:10:49 -0500 (EST)
Subject: =?utf-8?q?Re=3A_memory=2Dcgroup_bug?=
Date: Sun, 25 Nov 2012 01:10:47 +0100
From: "azurIt"
References: <20121121200207.01068046@pobox.sk>, <20121122152441.GA9609@dhcp22.suse.cz>, <20121122190526.390C7A28@pobox.sk>, <20121122214249.GA20319@dhcp22.suse.cz>, <20121122233434.3D5E35E6@pobox.sk>, <20121123074023.GA24698@dhcp22.suse.cz>, <20121123102137.10D6D653@pobox.sk> <20121123100438.GF24698@dhcp22.suse.cz>
In-Reply-To: <20121123100438.GF24698@dhcp22.suse.cz>
MIME-Version: 1.0
Message-Id: <20121125011047.7477BB5E@pobox.sk>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: =?utf-8?q?Michal_Hocko?=
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, =?utf-8?q?cgroups_mailinglist?=

>Could you take few snapshots over time?

Here it is, now from different server, snapshot was taken every second
for 10 minutes (hope it's enough):
www.watchdog.sk/lkml/memcg-bug-2.tar.gz

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx169.postini.com [74.125.245.169]) by kanga.kvack.org (Postfix) with SMTP id C3AFB6B005A for ; Sun, 25 Nov 2012 05:17:11 -0500 (EST)
Received: by mail-ea0-f169.google.com with SMTP id a12so4060097eaa.14 for ; Sun, 25 Nov 2012 02:17:10 -0800 (PST)
Date: Sun, 25 Nov 2012 11:17:07 +0100
From: Michal Hocko
Subject: Re: memory-cgroup bug
Message-ID: <20121125101707.GA10623@dhcp22.suse.cz>
References: <20121121200207.01068046@pobox.sk> <20121122152441.GA9609@dhcp22.suse.cz> <20121122190526.390C7A28@pobox.sk> <20121122214249.GA20319@dhcp22.suse.cz> <20121122233434.3D5E35E6@pobox.sk> <20121123074023.GA24698@dhcp22.suse.cz> <20121123102137.10D6D653@pobox.sk> <20121123100438.GF24698@dhcp22.suse.cz> <20121123155904.490039C5@pobox.sk>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20121123155904.490039C5@pobox.sk>
Sender: owner-linux-mm@kvack.org
List-ID:
To: azurIt
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist

On Fri 23-11-12 15:59:04, azurIt wrote:
> >If you could instrument mem_cgroup_handle_oom with some printks (before
> >we take the memcg_oom_lock, before we schedule and into
> >mem_cgroup_out_of_memory)
>
>
> If you send me patch i can do it. I'm, unfortunately, not able to code it.

Inlined at the end of the email. Please note I have compile tested
it. It might produce a lot of output.

> >> It, luckily, happened again so i have more info.
> >>
> >> - there weren't any logs in kernel from OOM for that cgroup
> >> - there were 16 processes in cgroup
> >> - processes in cgroup were taking together 100% of CPU (it
> >> was allowed to use only one core, so 100% of that core)
> >> - memory.failcnt was growing fast
> >> - oom_control:
> >> oom_kill_disable 0
> >> under_oom 0 (this was looping from 0 to 1)
> >
> >So there was an OOM going on but no messages in the log? Really strange.
> >Kame already asked about oom_score_adj of the processes in the group but
> >it didn't look like all the processes would have oom disabled, right?
>
>
> There were no messages telling that some processes were killed because of OOM.

dmesg | grep "Out of memory"
doesn't tell anything, right?

> >> - limit_in_bytes was set to 157286400
> >> - content of stat (as you can see, the whole memory limit was used):
> >> cache 0
> >> rss 0
> >
> >This looks like a top-level group for your user.
>
>
> Yes, it was from /cgroup//
>
>
> >> mapped_file 0
> >> pgpgin 0
> >> pgpgout 0
> >> swap 0
> >> pgfault 0
> >> pgmajfault 0
> >> inactive_anon 0
> >> active_anon 0
> >> inactive_file 0
> >> active_file 0
> >> unevictable 0
> >> hierarchical_memory_limit 157286400
> >> hierarchical_memsw_limit 157286400
> >> total_cache 0
> >> total_rss 157286400
> >
> >OK, so all the memory is anonymous and you have no swap so the oom is
> >the only thing to do.
>
>
> What will happen if the same situation occurs globally? No swap, every
> bit of memory used. Will kernel be able to start OOM killer?

OOM killer is not a task. It doesn't allocate any memory. It just walks
the process list and picks up a task with the highest score. If the
global oom is not able to find any such a task (e.g. because all of them
have oom disabled) then the system panics.

> Maybe the same thing is happening in cgroup

cgroup oom differs only in that aspect that the system doesn't panic if
there is no suitable task to kill.

[...]
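
The selection rule described above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: struct task_info, pick_oom_victim and resolve_oom are hypothetical names invented for illustration, the badness field stands in for whatever score the kernel computes, and none of this is the kernel's actual select_bad_process() code. The only value taken from the kernel is OOM_SCORE_ADJ_MIN (-1000), which marks a task as exempt from OOM killing.

```c
#include <stddef.h>

#define OOM_SCORE_ADJ_MIN (-1000)	/* oom_score_adj value that disables OOM killing */

/* Hypothetical, simplified stand-in for a task as seen by the OOM killer. */
struct task_info {
	int pid;
	long badness;		/* higher means more likely to be chosen */
	int oom_score_adj;	/* OOM_SCORE_ADJ_MIN exempts the task */
};

/* Walk the list and return the eligible task with the highest score, or NULL. */
const struct task_info *pick_oom_victim(const struct task_info *tasks, size_t n)
{
	const struct task_info *victim = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (tasks[i].oom_score_adj == OOM_SCORE_ADJ_MIN)
			continue;	/* OOM killing disabled for this task */
		if (!victim || tasks[i].badness > victim->badness)
			victim = &tasks[i];
	}
	return victim;
}

enum oom_result { OOM_KILL, OOM_PANIC, OOM_NOTHING_TO_KILL };

/*
 * With no victim, a global OOM panics the system, while a memcg OOM
 * simply has nothing to kill and does not panic.
 */
enum oom_result resolve_oom(const struct task_info *victim, int is_memcg_oom)
{
	if (victim)
		return OOM_KILL;
	return is_memcg_oom ? OOM_NOTHING_TO_KILL : OOM_PANIC;
}
```

The walk only reads existing task data and allocates nothing, which matches the point made above that running the OOM killer does not itself need free memory.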

> >> Notice that stack is different for few processes.
> >
> >Yes others are in VFS resp ext3. ext3_write_begin looks a bit dangerous
> >but it grabs the page before it really starts a transaction.
>
>
> Maybe these processes were throttled by cgroup-blkio at the same time
> and are still keeping the lock?

If you are thinking about memcg_oom_lock then this is not possible
because the lock is held only for short times. There is no other lock
that memcg oom holds.

> So the problem occurs when they are low on memory and cgroup is doing
> IO out of its limits. Only guessing and telling my thoughts.

The lockup (if this is what happens) still might be related to the IO
controller if the killed task cannot finish due to pending IO, though.

[...]
> >> didn't check if cgroup was frozen but i suppose it wasn't):
> >> none /cgroups cgroup defaults,cpuacct,cpuset,memory,freezer,task,blkio 0 0
> >
> >Do you see the same issue if only memory controller was mounted (resp.
> >cpuset which you seem to use as well from your description).
>
>
> Uh, we are using all mounted subsystems :( I will be able to umount
> only freezer and maybe blkio for some time. Will it help?

Not sure about that without further data.

> >I know you said booting into a vanilla kernel would be problematic but
> >could you at least rule out the cgroup patches that you have mentioned?
> >If you need to move a task to a group based by an uid you can use
> >cgrules daemon (libcgroup1 package) for that as well.
>
>
> We are using cgroup-uid cos it's MUCH MUCH MUCH more effective and
> better. For example, i don't believe that cgroup-task will work with
> that daemon. What will happen if cgrules won't be able to add process
> into cgroup because of task limit? Process will probably continue and
> will run outside of any cgroup which is wrong. With cgroup-task +
> cgroup-uid, such processes cannot be even started (and this is what we
> need).

I am not familiar with cgroup-task controller so I cannot comment on
that.

---
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c8425b1..7f26ec8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1863,6 +1863,7 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
 {
 	struct oom_wait_info owait;
 	bool locked, need_to_kill;
+	int ret = false;
 
 	owait.mem = memcg;
 	owait.wait.flags = 0;
@@ -1873,6 +1874,7 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
 	mem_cgroup_mark_under_oom(memcg);
 
 	/* At first, try to OOM lock hierarchy under memcg.*/
+	printk("XXX: %d waiting for memcg_oom_lock\n", current->pid);
 	spin_lock(&memcg_oom_lock);
 	locked = mem_cgroup_oom_lock(memcg);
 	/*
@@ -1887,12 +1889,14 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
 	mem_cgroup_oom_notify(memcg);
 	spin_unlock(&memcg_oom_lock);
 
+	printk("XXX: %d need_to_kill:%d locked:%d\n", current->pid, need_to_kill, locked);
 	if (need_to_kill) {
 		finish_wait(&memcg_oom_waitq, &owait.wait);
 		mem_cgroup_out_of_memory(memcg, mask);
 	} else {
 		schedule();
 		finish_wait(&memcg_oom_waitq, &owait.wait);
+		printk("XXX: %d woken up\n", current->pid);
 	}
 	spin_lock(&memcg_oom_lock);
 	if (locked)
@@ -1903,10 +1907,13 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
 	mem_cgroup_unmark_under_oom(memcg);
 
 	if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
-		return false;
+		goto out;
 	/* Give chance to dying process */
 	schedule_timeout_uninterruptible(1);
-	return true;
+	ret = true;
+out:
+	printk("XXX: %d done with %d\n", current->pid, ret);
+	return ret;
 }
 
 /*
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 069b64e..a7db813 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -568,6 +568,7 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
 	 */
 	if (fatal_signal_pending(current)) {
 		set_thread_flag(TIF_MEMDIE);
+		printk("XXX: %d skipping task with fatal signal pending\n", current->pid);
 		return;
 	}
 
@@ -576,8 +577,10 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
 	read_lock(&tasklist_lock);
 retry:
 	p = select_bad_process(&points, limit, mem, NULL);
-	if (!p || PTR_ERR(p) == -1UL)
+	if (!p || PTR_ERR(p) == -1UL) {
+		printk("XXX: %d nothing to kill\n", current->pid);
 		goto out;
+	}
 
 	if (oom_kill_process(p, gfp_mask, 0, points, limit, mem, NULL,
 				"Memory cgroup out of memory"))
-- 
Michal Hocko
SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx194.postini.com [74.125.245.194]) by kanga.kvack.org (Postfix) with SMTP id 80A826B0068 for ; Sun, 25 Nov 2012 08:02:11 -0500 (EST)
Received: by mail-ee0-f41.google.com with SMTP id d41so7192279eek.14 for ; Sun, 25 Nov 2012 05:02:09 -0800 (PST)
Date: Sun, 25 Nov 2012 14:02:08 +0100
From: Michal Hocko
Subject: Re: memory-cgroup bug
Message-ID: <20121125130208.GC10623@dhcp22.suse.cz>
References: <20121122152441.GA9609@dhcp22.suse.cz> <20121122190526.390C7A28@pobox.sk> <20121122214249.GA20319@dhcp22.suse.cz> <20121122233434.3D5E35E6@pobox.sk> <20121123074023.GA24698@dhcp22.suse.cz> <20121123102137.10D6D653@pobox.sk> <20121123100438.GF24698@dhcp22.suse.cz> <20121123155904.490039C5@pobox.sk> <20121125101707.GA10623@dhcp22.suse.cz> <20121125133953.AD1B2F0A@pobox.sk>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20121125133953.AD1B2F0A@pobox.sk>
Sender: owner-linux-mm@kvack.org
List-ID:
To: azurIt
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist

On Sun 25-11-12 13:39:53, azurIt wrote:
> >Inlined at the end of the email. Please note I have compile tested
> >it. It might produce a lot of output.
>
>
> Thank you very much, i will install it ASAP (probably this night).

Please don't.
If my analysis is correct which I am almost 100% sure it is then it would cause excessive logging. I am sorry I cannot come up with something else in the mean time. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx134.postini.com [74.125.245.134]) by kanga.kvack.org (Postfix) with SMTP id CAB626B005A for ; Sun, 25 Nov 2012 08:27:11 -0500 (EST) Subject: =?utf-8?q?Re=3A_memory=2Dcgroup_bug?= Date: Sun, 25 Nov 2012 14:27:09 +0100 From: "azurIt" References: <20121122152441.GA9609@dhcp22.suse.cz>, <20121122190526.390C7A28@pobox.sk>, <20121122214249.GA20319@dhcp22.suse.cz>, <20121122233434.3D5E35E6@pobox.sk>, <20121123074023.GA24698@dhcp22.suse.cz>, <20121123102137.10D6D653@pobox.sk>, <20121123100438.GF24698@dhcp22.suse.cz>, <20121123155904.490039C5@pobox.sk>, <20121125101707.GA10623@dhcp22.suse.cz>, <20121125133953.AD1B2F0A@pobox.sk> <20121125130208.GC10623@dhcp22.suse.cz> In-Reply-To: <20121125130208.GC10623@dhcp22.suse.cz> MIME-Version: 1.0 Message-Id: <20121125142709.19F4E8C2@pobox.sk> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org List-ID: To: =?utf-8?q?Michal_Hocko?= Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, =?utf-8?q?cgroups_mailinglist?= >> Thank you very much, i will install it ASAP (probably this night). > >Please don't. If my analysis is correct which I am almost 100% sure it >is then it would cause excessive logging. I am sorry I cannot come up >with something else in the mean time. Ok then. I will, meanwhile, try to contact Andrea Righi (author of cgroup-task etc.) and ask him to send here his opinion about relation between freezes and his patches. 
Maybe it's some kind of a bug in memcg which don't appear in current vanilla code and is triggered by conditions created by, for example, cgroup-task. I noticed that there is always the exact number of freezed processes as the limit set for number of tasks by cgroup-task (i already tried to raise this limit AFTER the cgroup was freezed, didn't change anything). I'm sure it's not the problem with cgroup-task alone, it's 100% related also to memcg (but maybe there must be the combination of both of them). Thank you so far for your time! azur -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx165.postini.com [74.125.245.165]) by kanga.kvack.org (Postfix) with SMTP id 330286B0068 for ; Sun, 25 Nov 2012 08:44:43 -0500 (EST) Received: by mail-ea0-f169.google.com with SMTP id a12so4106412eaa.14 for ; Sun, 25 Nov 2012 05:44:41 -0800 (PST) Date: Sun, 25 Nov 2012 14:44:40 +0100 From: Michal Hocko Subject: Re: memory-cgroup bug Message-ID: <20121125134440.GD10623@dhcp22.suse.cz> References: <20121122214249.GA20319@dhcp22.suse.cz> <20121122233434.3D5E35E6@pobox.sk> <20121123074023.GA24698@dhcp22.suse.cz> <20121123102137.10D6D653@pobox.sk> <20121123100438.GF24698@dhcp22.suse.cz> <20121123155904.490039C5@pobox.sk> <20121125101707.GA10623@dhcp22.suse.cz> <20121125133953.AD1B2F0A@pobox.sk> <20121125130208.GC10623@dhcp22.suse.cz> <20121125142709.19F4E8C2@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121125142709.19F4E8C2@pobox.sk> Sender: owner-linux-mm@kvack.org List-ID: To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist On Sun 25-11-12 14:27:09, azurIt wrote: > >> Thank you very much, i will install it ASAP (probably this night). > > > >Please don't. 
If my analysis is correct which I am almost 100% sure it > >is then it would cause excessive logging. I am sorry I cannot come up > >with something else in the mean time. > > > Ok then. I will, meanwhile, try to contact Andrea Righi (author of > cgroup-task etc.) and ask him to send here his opinion about relation > between freezes and his patches. As I described in other email. This seems to be a deadlock in memcg oom so I do not think that other patches influence this. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx139.postini.com [74.125.245.139]) by kanga.kvack.org (Postfix) with SMTP id 504C26B005A for ; Mon, 26 Nov 2012 02:57:10 -0500 (EST) Date: Mon, 26 Nov 2012 08:57:07 +0100 From: Michal Hocko Subject: Re: memory-cgroup bug Message-ID: <20121126075656.GA17860@dhcp22.suse.cz> References: <20121122190526.390C7A28@pobox.sk> <20121122214249.GA20319@dhcp22.suse.cz> <20121122233434.3D5E35E6@pobox.sk> <20121123074023.GA24698@dhcp22.suse.cz> <20121123102137.10D6D653@pobox.sk> <20121123100438.GF24698@dhcp22.suse.cz> <20121125011047.7477BB5E@pobox.sk> <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121126013855.AF118F5E@pobox.sk> Sender: owner-linux-mm@kvack.org List-ID: To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki On Mon 26-11-12 01:38:55, azurIt wrote: > >This is hackish but it should help you in this case. Kamezawa, what do > >you think about that? 
Should we generalize this and prepare something > >like mem_cgroup_cache_charge_locked which would add __GFP_NORETRY > >automatically and use the function whenever we are in a locked context? > >To be honest I do not like this very much but nothing more sensible > >(without touching non-memcg paths) comes to my mind. > > > I installed kernel with this patch, will report back if problem occurs > again OR in few weeks if everything will be ok. Thank you! Thanks! > Btw, will this patch be backported to 3.2? Once we agree on a proper solution it will be backported to the stable trees. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx187.postini.com [74.125.245.187]) by kanga.kvack.org (Postfix) with SMTP id 17F536B0062 for ; Mon, 26 Nov 2012 08:18:40 -0500 (EST) Date: Mon, 26 Nov 2012 14:18:37 +0100 From: Michal Hocko Subject: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121126131837.GC17860@dhcp22.suse.cz> References: <20121122190526.390C7A28@pobox.sk> <20121122214249.GA20319@dhcp22.suse.cz> <20121122233434.3D5E35E6@pobox.sk> <20121123074023.GA24698@dhcp22.suse.cz> <20121123102137.10D6D653@pobox.sk> <20121123100438.GF24698@dhcp22.suse.cz> <20121125011047.7477BB5E@pobox.sk> <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121126013855.AF118F5E@pobox.sk> Sender: owner-linux-mm@kvack.org List-ID: To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner [CCing also Johannes - the thread started here: https://lkml.org/lkml/2012/11/21/497] On Mon 26-11-12 
01:38:55, azurIt wrote:
> >This is hackish but it should help you in this case. Kamezawa, what do
> >you think about that? Should we generalize this and prepare something
> >like mem_cgroup_cache_charge_locked which would add __GFP_NORETRY
> >automatically and use the function whenever we are in a locked context?
> >To be honest I do not like this very much but nothing more sensible
> >(without touching non-memcg paths) comes to my mind.
>
>
> I installed kernel with this patch, will report back if problem occurs
> again OR in few weeks if everything will be ok. Thank you!

Now that I am looking at the patch closer it will not work because it
depends on other patch which is not merged yet and even that one wouldn't
help on its own because __GFP_NORETRY doesn't break the charge loop.
Sorry I have missed that...

The patch below should help though. (it is based on top of the current
-mm tree but I will send a backport to 3.2 in the reply as well)
---

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx139.postini.com [74.125.245.139]) by kanga.kvack.org (Postfix) with SMTP id 682986B0062 for ; Mon, 26 Nov 2012 08:21:51 -0500 (EST)
Date: Mon, 26 Nov 2012 14:21:49 +0100
From: Michal Hocko
Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
Message-ID: <20121126132149.GD17860@dhcp22.suse.cz>
References: <20121122214249.GA20319@dhcp22.suse.cz> <20121122233434.3D5E35E6@pobox.sk> <20121123074023.GA24698@dhcp22.suse.cz> <20121123102137.10D6D653@pobox.sk> <20121123100438.GF24698@dhcp22.suse.cz> <20121125011047.7477BB5E@pobox.sk> <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20121126131837.GC17860@dhcp22.suse.cz>
Sender: owner-linux-mm@kvack.org
List-ID:
To: azurIt
Cc: linux-kernel@vger.kernel.org,
linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner Here we go with the patch for 3.2.34. Could you test with this one, please? --- From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx163.postini.com [74.125.245.163]) by kanga.kvack.org (Postfix) with SMTP id 08B8C6B004D for ; Mon, 26 Nov 2012 13:24:33 -0500 (EST) Date: Mon, 26 Nov 2012 13:24:21 -0500 From: Johannes Weiner Subject: Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121126182421.GB2301@cmpxchg.org> References: <20121123074023.GA24698@dhcp22.suse.cz> <20121123102137.10D6D653@pobox.sk> <20121123100438.GF24698@dhcp22.suse.cz> <20121125011047.7477BB5E@pobox.sk> <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <20121126174622.GE2799@cmpxchg.org> <20121126180444.GA12602@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121126180444.GA12602@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki On Mon, Nov 26, 2012 at 07:04:44PM +0100, Michal Hocko wrote: > On Mon 26-11-12 12:46:22, Johannes Weiner wrote: > > On Mon, Nov 26, 2012 at 02:18:37PM +0100, Michal Hocko wrote: > > > [CCing also Johannes - the thread started here: > > > https://lkml.org/lkml/2012/11/21/497] > > > > > > On Mon 26-11-12 01:38:55, azurIt wrote: > > > > >This is hackish but it should help you in this case. Kamezawa, what do > > > > >you think about that? Should we generalize this and prepare something > > > > >like mem_cgroup_cache_charge_locked which would add __GFP_NORETRY > > > > >automatically and use the function whenever we are in a locked context? 
> > > > >To be honest I do not like this very much but nothing more sensible > > > > >(without touching non-memcg paths) comes to my mind. > > > > > > > > > > > > I installed kernel with this patch, will report back if problem occurs > > > > again OR in few weeks if everything will be ok. Thank you! > > > > > > Now that I am looking at the patch closer it will not work because it > > > depends on other patch which is not merged yet and even that one would > > > help on its own because __GFP_NORETRY doesn't break the charge loop. > > > Sorry I have missed that... > > > > > > The patch bellow should help though. (it is based on top of the current > > > -mm tree but I will send a backport to 3.2 in the reply as well) > > > --- > > > >From 7796f942d62081ad45726efd90b5292b80e7c690 Mon Sep 17 00:00:00 2001 > > > From: Michal Hocko > > > Date: Mon, 26 Nov 2012 11:47:57 +0100 > > > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked > > > > > > memcg oom killer might deadlock if the process which falls down to > > > mem_cgroup_handle_oom holds a lock which prevents other task to > > > terminate because it is blocked on the very same lock. > > > This can happen when a write system call needs to allocate a page but > > > the allocation hits the memcg hard limit and there is nothing to reclaim > > > (e.g. there is no swap or swap limit is hit as well and all cache pages > > > have been reclaimed already) and the process selected by memcg OOM > > > killer is blocked on i_mutex on the same inode (e.g. truncate it). 
> > > > > > Process A > > > [] do_truncate+0x58/0xa0 # takes i_mutex > > > [] do_last+0x250/0xa30 > > > [] path_openat+0xd7/0x440 > > > [] do_filp_open+0x49/0xa0 > > > [] do_sys_open+0x106/0x240 > > > [] sys_open+0x20/0x30 > > > [] system_call_fastpath+0x18/0x1d > > > [] 0xffffffffffffffff > > > > > > Process B > > > [] mem_cgroup_handle_oom+0x241/0x3b0 > > > [] T.1146+0x5ab/0x5c0 > > > [] mem_cgroup_cache_charge+0xbe/0xe0 > > > [] add_to_page_cache_locked+0x4c/0x140 > > > [] add_to_page_cache_lru+0x22/0x50 > > > [] grab_cache_page_write_begin+0x8b/0xe0 > > > [] ext3_write_begin+0x88/0x270 > > > [] generic_file_buffered_write+0x116/0x290 > > > [] __generic_file_aio_write+0x27c/0x480 > > > [] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex > > > [] do_sync_write+0xea/0x130 > > > [] vfs_write+0xf3/0x1f0 > > > [] sys_write+0x51/0x90 > > > [] system_call_fastpath+0x18/0x1d > > > [] 0xffffffffffffffff > > > > So process B manages to lock the hierarchy, calls > > mem_cgroup_out_of_memory() and retries the charge infinitely, waiting > > for task A to die. All while it holds the i_mutex, preventing task A > > from dying, right? > > Right. > > > I think global oom already handles this in a much better way: invoke > > the OOM killer, sleep for a second, then return to userspace to > > relinquish all kernel resources and locks. The only reason why we > > can't simply change from an endless retry loop is because we don't > > want to return VM_FAULT_OOM and invoke the global OOM killer. > > Exactly. > > > But maybe we can return a new VM_FAULT_OOM_HANDLED for memcg OOM and > > just restart the pagefault. Return -ENOMEM to the buffered IO syscall > > respectively. This way, the memcg OOM killer is invoked as it should > > but nobody gets stuck anywhere livelocking with the exiting task. > > Hmm, we would still have a problem with oom disabled (aka user space OOM > killer), right? All processes but those in mem_cgroup_handle_oom are > risky to be killed. 
Could we still let everybody get stuck in there when the OOM killer is
disabled and let userspace take care of it?

> Other POV might be, why we should trigger an OOM killer from those paths
> in the first place. Write or read (or even readahead) are all calls that
> should rather fail than cause an OOM killer in my opinion.

Readahead is arguable, but we kill globally for read() and write() and
I think we should do the same for memcg.

The OOM killer is there to resolve a problem that comes from
overcommitting the machine but the overuse does not have to be from
the application that pushes the machine over the edge, that's why we
don't just kill the allocating task but actually go look for the best
candidate. If you have one memory hog that overuses the resources,
attempted memory consumption in a different program should invoke the
OOM killer.

It does not matter if this is a page fault (would still happen with
your patch) or a buffered read/write (would no longer happen).

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx198.postini.com [74.125.245.198]) by kanga.kvack.org (Postfix) with SMTP id 6B0036B0062 for ; Mon, 26 Nov 2012 14:03:33 -0500 (EST) Received: by mail-wi0-f179.google.com with SMTP id hj6so2880227wib.8 for ; Mon, 26 Nov 2012 11:03:31 -0800 (PST) Date: Mon, 26 Nov 2012 20:03:29 +0100 From: Michal Hocko Subject: Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121126190329.GB12602@dhcp22.suse.cz> References: <20121123102137.10D6D653@pobox.sk> <20121123100438.GF24698@dhcp22.suse.cz> <20121125011047.7477BB5E@pobox.sk> <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <20121126174622.GE2799@cmpxchg.org> <20121126180444.GA12602@dhcp22.suse.cz> <20121126182421.GB2301@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121126182421.GB2301@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki On Mon 26-11-12 13:24:21, Johannes Weiner wrote: > On Mon, Nov 26, 2012 at 07:04:44PM +0100, Michal Hocko wrote: > > On Mon 26-11-12 12:46:22, Johannes Weiner wrote: [...] > > > I think global oom already handles this in a much better way: invoke > > > the OOM killer, sleep for a second, then return to userspace to > > > relinquish all kernel resources and locks. The only reason why we > > > can't simply change from an endless retry loop is because we don't > > > want to return VM_FAULT_OOM and invoke the global OOM killer. > > > > Exactly. > > > > > But maybe we can return a new VM_FAULT_OOM_HANDLED for memcg OOM and > > > just restart the pagefault. Return -ENOMEM to the buffered IO syscall > > > respectively. 
This way, the memcg OOM killer is invoked as it should > > > but nobody gets stuck anywhere livelocking with the exiting task. > > > > Hmm, we would still have a problem with oom disabled (aka user space OOM > > killer), right? All processes but those in mem_cgroup_handle_oom are > > risky to be killed. > > Could we still let everybody get stuck in there when the OOM killer is > disabled and let userspace take care of it? I am not sure what exactly you mean by "userspace take care of it" but if those processes are stuck and holding the lock then it is usually hard to find that out. Well if somebody is familiar with internal then it is doable but this makes the interface really unusable for regular usage. > > Other POV might be, why we should trigger an OOM killer from those paths > > in the first place. Write or read (or even readahead) are all calls that > > should rather fail than cause an OOM killer in my opinion. > > Readahead is arguable, but we kill globally for read() and write() and > I think we should do the same for memcg. Fair point but the global case is little bit easier than memcg in this case because nobody can hook on OOM killer and provide a userspace implementation for it which is one of the cooler feature of memcg... I am all open to any suggestions but we should somehow fix this (and backport it to stable trees as this is there for quite some time. The current report shows that the problem is not that hard to trigger). > The OOM killer is there to resolve a problem that comes from > overcommitting the machine but the overuse does not have to be from > the application that pushes the machine over the edge, that's why we > don't just kill the allocating task but actually go look for the best > candidate. If you have one memory hog that overuses the resources, > attempted memory consumption in a different program should invoke the > OOM killer. 
> It does not matter if this is a page fault (would still happen with > your patch) or a bufferd read/write (would no longer happen). true and it is sad that mmap then behaves slightly different than read/write which should I've mentioned in the changelog. As I said I am open to other suggestions. Thanks -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx163.postini.com [74.125.245.163]) by kanga.kvack.org (Postfix) with SMTP id A66686B004D for ; Mon, 26 Nov 2012 14:29:53 -0500 (EST) Date: Mon, 26 Nov 2012 14:29:41 -0500 From: Johannes Weiner Subject: Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121126192941.GC2301@cmpxchg.org> References: <20121123100438.GF24698@dhcp22.suse.cz> <20121125011047.7477BB5E@pobox.sk> <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <20121126174622.GE2799@cmpxchg.org> <20121126180444.GA12602@dhcp22.suse.cz> <20121126182421.GB2301@cmpxchg.org> <20121126190329.GB12602@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121126190329.GB12602@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki On Mon, Nov 26, 2012 at 08:03:29PM +0100, Michal Hocko wrote: > On Mon 26-11-12 13:24:21, Johannes Weiner wrote: > > On Mon, Nov 26, 2012 at 07:04:44PM +0100, Michal Hocko wrote: > > > On Mon 26-11-12 12:46:22, Johannes Weiner wrote: > [...] 
> > > > I think global oom already handles this in a much better way: invoke > > > > the OOM killer, sleep for a second, then return to userspace to > > > > relinquish all kernel resources and locks. The only reason why we > > > > can't simply change from an endless retry loop is because we don't > > > > want to return VM_FAULT_OOM and invoke the global OOM killer. > > > > > > Exactly. > > > > > > > But maybe we can return a new VM_FAULT_OOM_HANDLED for memcg OOM and > > > > just restart the pagefault. Return -ENOMEM to the buffered IO syscall > > > > respectively. This way, the memcg OOM killer is invoked as it should > > > > but nobody gets stuck anywhere livelocking with the exiting task. > > > > > > Hmm, we would still have a problem with oom disabled (aka user space OOM > > > killer), right? All processes but those in mem_cgroup_handle_oom are > > > risky to be killed. > > > > Could we still let everybody get stuck in there when the OOM killer is > > disabled and let userspace take care of it? > > I am not sure what exactly you mean by "userspace take care of it" but > if those processes are stuck and holding the lock then it is usually > hard to find that out. Well if somebody is familiar with internal then > it is doable but this makes the interface really unusable for regular > usage. If oom_kill_disable is set, then all processes get stuck all the way down in the charge stack. Whatever resource they pin, you may deadlock on if you try to touch it while handling the problem from userspace. I don't see how this is a new problem...? Or do you mean something else? > > > Other POV might be, why we should trigger an OOM killer from those paths > > > in the first place. Write or read (or even readahead) are all calls that > > > should rather fail than cause an OOM killer in my opinion. > > > > Readahead is arguable, but we kill globally for read() and write() and > > I think we should do the same for memcg. 
>
> Fair point but the global case is little bit easier than memcg in this
> case because nobody can hook on OOM killer and provide a userspace
> implementation for it which is one of the cooler feature of memcg...
> I am all open to any suggestions but we should somehow fix this (and
> backport it to stable trees as this is there for quite some time. The
> current report shows that the problem is not that hard to trigger).

As per above, the userspace OOM handling is risky as hell anyway.
What happens when an anonymous fault waits in memcg userspace OOM
while holding the mmap_sem, and a writer lines up behind it? Your
userspace OOM handler had better not look at any of the /proc files of
the stuck task that require the mmap_sem.

By the same token, it probably shouldn't touch the same files a memcg
task is stuck trying to read/write.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx109.postini.com [74.125.245.109]) by kanga.kvack.org (Postfix) with SMTP id 047186B006E for ; Mon, 26 Nov 2012 15:08:54 -0500 (EST) Received: by mail-vc0-f169.google.com with SMTP id gb30so6325061vcb.14 for ; Mon, 26 Nov 2012 12:08:53 -0800 (PST) Date: Mon, 26 Nov 2012 21:08:48 +0100 From: Michal Hocko Subject: Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121126200848.GC12602@dhcp22.suse.cz> References: <20121125011047.7477BB5E@pobox.sk> <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <20121126174622.GE2799@cmpxchg.org> <20121126180444.GA12602@dhcp22.suse.cz> <20121126182421.GB2301@cmpxchg.org> <20121126190329.GB12602@dhcp22.suse.cz> <20121126192941.GC2301@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121126192941.GC2301@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki On Mon 26-11-12 14:29:41, Johannes Weiner wrote: > On Mon, Nov 26, 2012 at 08:03:29PM +0100, Michal Hocko wrote: > > On Mon 26-11-12 13:24:21, Johannes Weiner wrote: > > > On Mon, Nov 26, 2012 at 07:04:44PM +0100, Michal Hocko wrote: > > > > On Mon 26-11-12 12:46:22, Johannes Weiner wrote: > > [...] > > > > > I think global oom already handles this in a much better way: invoke > > > > > the OOM killer, sleep for a second, then return to userspace to > > > > > relinquish all kernel resources and locks. The only reason why we > > > > > can't simply change from an endless retry loop is because we don't > > > > > want to return VM_FAULT_OOM and invoke the global OOM killer. > > > > > > > > Exactly. 
> > > > > > > > > But maybe we can return a new VM_FAULT_OOM_HANDLED for memcg OOM and > > > > > just restart the pagefault. Return -ENOMEM to the buffered IO syscall > > > > > respectively. This way, the memcg OOM killer is invoked as it should > > > > > but nobody gets stuck anywhere livelocking with the exiting task. > > > > > > > > Hmm, we would still have a problem with oom disabled (aka user space OOM > > > > killer), right? All processes but those in mem_cgroup_handle_oom are > > > > risky to be killed. > > > > > > Could we still let everybody get stuck in there when the OOM killer is > > > disabled and let userspace take care of it? > > > > I am not sure what exactly you mean by "userspace take care of it" but > > if those processes are stuck and holding the lock then it is usually > > hard to find that out. Well if somebody is familiar with internal then > > it is doable but this makes the interface really unusable for regular > > usage. > > If oom_kill_disable is set, then all processes get stuck all the way > down in the charge stack. Whatever resource they pin, you may > deadlock on if you try to touch it while handling the problem from > userspace. OK, I guess I am getting what you are trying to say. So what you are suggesting is to just let mem_cgroup_out_of_memory send the signal and move on without retry (or with few charge retries without further OOM killing) and fail the charge with your new FAULT_OOM_HANDLED (resp. something like FAULT_RETRY) error code resp. ENOMEM depending on the caller. OOM disabled case would be "you are on your own" because this has been dangerous anyway. Correct? I do agree that the current endless retry loop is far from being ideal and can see some updates but I am quite nervous about any potential regressions in this area (e.g. too aggressive OOM etc...). I have to think about it some more. Anyway if you have some more specific ideas I would be happy to review patches. [...] 
Thanks -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx144.postini.com [74.125.245.144]) by kanga.kvack.org (Postfix) with SMTP id 6DE076B0075 for ; Mon, 26 Nov 2012 15:46:40 -0500 (EST) Subject: =?utf-8?q?Re=3A_=5BPATCH_=2Dmm=5D_memcg=3A_do_not_trigger_OOM_from_add=5Fto=5Fpage=5Fcache=5Flocked?= Date: Mon, 26 Nov 2012 21:46:38 +0100 From: "azurIt" References: <20121125120524.GB10623@dhcp22.suse.cz>, <20121125135542.GE10623@dhcp22.suse.cz>, <20121126013855.AF118F5E@pobox.sk>, <20121126131837.GC17860@dhcp22.suse.cz>, <20121126174622.GE2799@cmpxchg.org>, <20121126180444.GA12602@dhcp22.suse.cz>, <20121126182421.GB2301@cmpxchg.org>, <20121126190329.GB12602@dhcp22.suse.cz>, <20121126192941.GC2301@cmpxchg.org>, <20121126200848.GC12602@dhcp22.suse.cz> <20121126201918.GD2301@cmpxchg.org> In-Reply-To: <20121126201918.GD2301@cmpxchg.org> MIME-Version: 1.0 Message-Id: <20121126214638.64723F01@pobox.sk> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org List-ID: To: =?utf-8?q?Johannes_Weiner?= , =?utf-8?q?Michal_Hocko?= Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= >This issue has been around for a while so frankly I don't think it's >urgent enough to rush things. Well, it's quite urgent at least for us :( i wasn't reported this so far cos i wasn't sure it's a kernel thing. I will be really happy and thankfull if fix for this can go to 3.2 in some near future.. Thank you very much! azur -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx105.postini.com [74.125.245.105]) by kanga.kvack.org (Postfix) with SMTP id DAF2A6B0044 for ; Mon, 26 Nov 2012 17:06:43 -0500 (EST) Received: by mail-wi0-f179.google.com with SMTP id hj6so3008857wib.8 for ; Mon, 26 Nov 2012 14:06:42 -0800 (PST) Date: Mon, 26 Nov 2012 23:06:40 +0100 From: Michal Hocko Subject: Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121126220640.GE12602@dhcp22.suse.cz> References: <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <20121126174622.GE2799@cmpxchg.org> <20121126180444.GA12602@dhcp22.suse.cz> <20121126182421.GB2301@cmpxchg.org> <20121126190329.GB12602@dhcp22.suse.cz> <20121126192941.GC2301@cmpxchg.org> <20121126200848.GC12602@dhcp22.suse.cz> <20121126201918.GD2301@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121126201918.GD2301@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki On Mon 26-11-12 15:19:18, Johannes Weiner wrote: > On Mon, Nov 26, 2012 at 09:08:48PM +0100, Michal Hocko wrote: [...] > > OK, I guess I am getting what you are trying to say. So what you are > > suggesting is to just let mem_cgroup_out_of_memory send the signal and > > move on without retry (or with few charge retries without further OOM > > killing) and fail the charge with your new FAULT_OOM_HANDLED (resp. > > something like FAULT_RETRY) error code resp. ENOMEM depending on the > > caller. OOM disabled case would be "you are on your own" because this > > has been dangerous anyway. Correct? > > Yes. 
>
> > I do agree that the current endless retry loop is far from being ideal
> > and can see some updates but I am quite nervous about any potential
> > regressions in this area (e.g. too aggressive OOM etc...). I have to
> > think about it some more.
>
> Agreed on all points. Maybe we can keep a couple of the oom retry
> iterations or something like that, which is still much more than what
> global does and I don't think the global OOM killer is overly eager.

Yes, we can offer less blood and more comfort.

>
> Testing will show more.
>
> > Anyway if you have some more specific ideas I would be happy to review
> > patches.
>
> Okay, I just wanted to check back with you before going down this
> path. What are we going to do short term, though? Do you want to
> push the disable-oom-for-pagecache for now or should we put the
> VM_FAULT_OOM_HANDLED fix in the next version and do stable backports?
>
> This issue has been around for a while so frankly I don't think it's
> urgent enough to rush things.

Yes, but now we have a real use case where this hurts AFAIU. Unless we
come up with a fix/reasonable workaround I would rather go with
something simpler for starters and more sophisticated later.

I have to double check other places where we do charging but the last
time I've checked we don't hold page locks on already visible pages (we
do precharge in __do_fault f.e.), mmap_sem for reading in the page fault
path is also safe (with oom enabled) and I guess that tmpfs is ok as
well. Then we have a page cache and that one should be covered by my
patch. So we should be covered.

But I like your idea long term.

Thanks!
--
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
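The bounded-retry idea discussed in this exchange (send the kill once, retry the charge a few times, then fail with ENOMEM so the caller can unwind and drop its locks) can be modeled in userspace roughly as follows. This is only a toy sketch and not kernel code: every name in it (sketch_memcg, sketch_try_charge, sketch_charge_bounded, SKETCH_OOM_RETRIES) and the retry count of 3 are assumptions made up for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical retry budget; the thread only says "a couple". */
#define SKETCH_OOM_RETRIES 3

/* Toy model of one memcg; all fields are made up for this sketch. */
struct sketch_memcg {
	long limit;           /* hard limit, in pages */
	long usage;           /* pages currently charged */
	long victim_pages;    /* pages the OOM victim frees when it exits */
	bool victim_killed;   /* kill signal has been sent */
	bool victim_blocked;  /* victim waits on a lock the charger holds */
};

/* A killed victim can only exit (and uncharge) once it is not blocked. */
static inline void sketch_victim_progress(struct sketch_memcg *m)
{
	if (m->victim_killed && !m->victim_blocked) {
		m->usage -= m->victim_pages;
		m->victim_pages = 0;
	}
}

/* One charge attempt against the hard limit. */
static inline bool sketch_try_charge(struct sketch_memcg *m, long nr_pages)
{
	sketch_victim_progress(m);
	if (m->usage + nr_pages > m->limit)
		return false;
	m->usage += nr_pages;
	return true;
}

/*
 * Send the kill once, retry a bounded number of times, then give up
 * with -ENOMEM so the caller can unwind and release its locks.
 */
static inline int sketch_charge_bounded(struct sketch_memcg *m, long nr_pages)
{
	int i;

	assert(nr_pages > 0);
	if (sketch_try_charge(m, nr_pages))
		return 0;

	m->victim_killed = true;
	for (i = 0; i < SKETCH_OOM_RETRIES; i++) {
		if (sketch_try_charge(m, nr_pages))
			return 0;
	}
	return -ENOMEM;
}
```

In this model a charger that holds the lock the victim needs gets -ENOMEM instead of spinning forever; once it backs out and the victim is no longer blocked, the next charge attempt succeeds.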
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx108.postini.com [74.125.245.108]) by kanga.kvack.org (Postfix) with SMTP id DB5ED6B0044 for ; Mon, 26 Nov 2012 19:06:10 -0500 (EST) Received: from m4.gw.fujitsu.co.jp (unknown [10.0.50.74]) by fgwmail5.fujitsu.co.jp (Postfix) with ESMTP id 182F63EE0AE for ; Tue, 27 Nov 2012 09:06:09 +0900 (JST) Received: from smail (m4 [127.0.0.1]) by outgoing.m4.gw.fujitsu.co.jp (Postfix) with ESMTP id E0D5A45DE56 for ; Tue, 27 Nov 2012 09:06:08 +0900 (JST) Received: from s4.gw.fujitsu.co.jp (s4.gw.fujitsu.co.jp [10.0.50.94]) by m4.gw.fujitsu.co.jp (Postfix) with ESMTP id C5F2545DE4E for ; Tue, 27 Nov 2012 09:06:08 +0900 (JST) Received: from s4.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s4.gw.fujitsu.co.jp (Postfix) with ESMTP id B61371DB8044 for ; Tue, 27 Nov 2012 09:06:08 +0900 (JST) Received: from m1001.s.css.fujitsu.com (m1001.s.css.fujitsu.com [10.240.81.139]) by s4.gw.fujitsu.co.jp (Postfix) with ESMTP id 64EB31DB803E for ; Tue, 27 Nov 2012 09:06:08 +0900 (JST) Message-ID: <50B403CA.501@jp.fujitsu.com> Date: Tue, 27 Nov 2012 09:05:30 +0900 From: Kamezawa Hiroyuki MIME-Version: 1.0 Subject: Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked References: <20121122190526.390C7A28@pobox.sk> <20121122214249.GA20319@dhcp22.suse.cz> <20121122233434.3D5E35E6@pobox.sk> <20121123074023.GA24698@dhcp22.suse.cz> <20121123102137.10D6D653@pobox.sk> <20121123100438.GF24698@dhcp22.suse.cz> <20121125011047.7477BB5E@pobox.sk> <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> In-Reply-To: <20121126131837.GC17860@dhcp22.suse.cz> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, 
cgroups mailinglist , Johannes Weiner (2012/11/26 22:18), Michal Hocko wrote: > [CCing also Johannes - the thread started here: > https://lkml.org/lkml/2012/11/21/497] > > On Mon 26-11-12 01:38:55, azurIt wrote: >>> This is hackish but it should help you in this case. Kamezawa, what do >>> you think about that? Should we generalize this and prepare something >>> like mem_cgroup_cache_charge_locked which would add __GFP_NORETRY >>> automatically and use the function whenever we are in a locked context? >>> To be honest I do not like this very much but nothing more sensible >>> (without touching non-memcg paths) comes to my mind. >> >> >> I installed kernel with this patch, will report back if problem occurs >> again OR in few weeks if everything will be ok. Thank you! > > Now that I am looking at the patch closer it will not work because it > depends on other patch which is not merged yet and even that one would > help on its own because __GFP_NORETRY doesn't break the charge loop. > Sorry I have missed that... > > The patch bellow should help though. (it is based on top of the current > -mm tree but I will send a backport to 3.2 in the reply as well) > --- > From 7796f942d62081ad45726efd90b5292b80e7c690 Mon Sep 17 00:00:00 2001 > From: Michal Hocko > Date: Mon, 26 Nov 2012 11:47:57 +0100 > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked > > memcg oom killer might deadlock if the process which falls down to > mem_cgroup_handle_oom holds a lock which prevents other task to > terminate because it is blocked on the very same lock. > This can happen when a write system call needs to allocate a page but > the allocation hits the memcg hard limit and there is nothing to reclaim > (e.g. there is no swap or swap limit is hit as well and all cache pages > have been reclaimed already) and the process selected by memcg OOM > killer is blocked on i_mutex on the same inode (e.g. truncate it). 
> > Process A > [] do_truncate+0x58/0xa0 # takes i_mutex > [] do_last+0x250/0xa30 > [] path_openat+0xd7/0x440 > [] do_filp_open+0x49/0xa0 > [] do_sys_open+0x106/0x240 > [] sys_open+0x20/0x30 > [] system_call_fastpath+0x18/0x1d > [] 0xffffffffffffffff > > Process B > [] mem_cgroup_handle_oom+0x241/0x3b0 > [] T.1146+0x5ab/0x5c0 > [] mem_cgroup_cache_charge+0xbe/0xe0 > [] add_to_page_cache_locked+0x4c/0x140 > [] add_to_page_cache_lru+0x22/0x50 > [] grab_cache_page_write_begin+0x8b/0xe0 > [] ext3_write_begin+0x88/0x270 > [] generic_file_buffered_write+0x116/0x290 > [] __generic_file_aio_write+0x27c/0x480 > [] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex > [] do_sync_write+0xea/0x130 > [] vfs_write+0xf3/0x1f0 > [] sys_write+0x51/0x90 > [] system_call_fastpath+0x18/0x1d > [] 0xffffffffffffffff > > This is not a hard deadlock though because administrator can still > intervene and increase the limit on the group which helps the writer to > finish the allocation and release the lock. > > This patch heals the problem by forbidding OOM from page cache charges > (namely add_ro_page_cache_locked). mem_cgroup_cache_charge_no_oom helper > function is defined which adds GFP_MEMCG_NO_OOM to the gfp mask which > then tells mem_cgroup_charge_common that OOM is not allowed for the > charge. No OOM from this path, except for fixing the bug, also make some > sense as we really do not want to cause an OOM because of a page cache > usage. > As a possibly visible result add_to_page_cache_lru might fail more often > with ENOMEM but this is to be expected if the limit is set and it is > preferable than OOM killer IMO. > > __GFP_NORETRY is abused for this memcg specific flag because it has been > used to prevent from OOM already (since not-merged-yet "memcg: reclaim > when more than one page needed"). The only difference is that the flag > doesn't prevent from reclaim anymore which kind of makes sense because > the global memory allocator triggers reclaim as well. 
The retry without
> any reclaim on __GFP_NORETRY doesn't make much sense anyway because this
> is effectively a busy loop with allowed OOM in this path.
>
> Reported-by: azurIt
> Signed-off-by: Michal Hocko

As a short term fix, I think this patch will work enough and seems simple enough.

Acked-by: KAMEZAWA Hiroyuki

Reading discussion between you and Johannes, to release locks, I understand
the memcg need to return "RETRY" for a long term fix. Thinking a little,
it will be simple to return "RETRY" to all processes waited on oom kill queue
of a memcg and it can be done by a small fixes to memory.c.

Thank you.
-Kame

> ---
>  include/linux/gfp.h        |  3 +++
>  include/linux/memcontrol.h | 12 ++++++++++++
>  mm/filemap.c               |  8 +++++++-
>  mm/memcontrol.c            |  5 +----
>  4 files changed, 23 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 10e667f..aac9b21 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -152,6 +152,9 @@ struct vm_area_struct;
>  /* 4GB DMA on some platforms */
>  #define GFP_DMA32	__GFP_DMA32
>
> +/* memcg oom killer is not allowed */
> +#define GFP_MEMCG_NO_OOM	__GFP_NORETRY
> +
>  /* Convert GFP flags to their corresponding migrate type */
>  static inline int allocflags_to_migratetype(gfp_t gfp_flags)
>  {
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 095d2b4..1ad4bc6 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -65,6 +65,12 @@ extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg);
>  extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
>  					gfp_t gfp_mask);
>
> +static inline int mem_cgroup_cache_charge_no_oom(struct page *page,
> +					struct mm_struct *mm, gfp_t gfp_mask)
> +{
> +	return mem_cgroup_cache_charge(page, mm, gfp_mask | GFP_MEMCG_NO_OOM);
> +}
> +
>  struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
>  struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
>
> @@ -215,6 +221,12 @@ static inline int mem_cgroup_cache_charge(struct page *page,
>  	return 0;
>  }
>
> +static inline int mem_cgroup_cache_charge_no_oom(struct page *page,
> +					struct mm_struct *mm, gfp_t gfp_mask)
> +{
> +	return 0;
> +}
> +
>  static inline int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
>  		struct page *page, gfp_t gfp_mask, struct mem_cgroup **memcgp)
>  {
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 83efee7..ef14351 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -447,7 +447,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
>  	VM_BUG_ON(!PageLocked(page));
>  	VM_BUG_ON(PageSwapBacked(page));
>
> -	error = mem_cgroup_cache_charge(page, current->mm,
> +	/*
> +	 * Cannot trigger OOM even if gfp_mask would allow that normally
> +	 * because we might be called from a locked context and that
> +	 * could lead to deadlocks if the killed process is waiting for
> +	 * the same lock.
> +	 */
> +	error = mem_cgroup_cache_charge_no_oom(page, current->mm,
>  					gfp_mask & GFP_RECLAIM_MASK);
>  	if (error)
>  		goto out;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 02ee2f7..b4754ba 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2430,9 +2430,6 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	if (!(gfp_mask & __GFP_WAIT))
>  		return CHARGE_WOULDBLOCK;
>
> -	if (gfp_mask & __GFP_NORETRY)
> -		return CHARGE_NOMEM;
> -
>  	ret = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
>  	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
>  		return CHARGE_RETRY;
> @@ -3713,7 +3710,7 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
>  {
>  	struct mem_cgroup *memcg = NULL;
>  	unsigned int nr_pages = 1;
> -	bool oom = true;
> +	bool oom = !(gfp_mask & GFP_MEMCG_NO_OOM);
>  	int ret;
>
>  	if (PageTransHuge(page)) {
>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx175.postini.com [74.125.245.175]) by kanga.kvack.org (Postfix) with SMTP id 67D2F6B006C for ; Tue, 27 Nov 2012 04:54:55 -0500 (EST) Date: Tue, 27 Nov 2012 10:54:52 +0100 From: Michal Hocko Subject: Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121127095452.GD20537@dhcp22.suse.cz> References: <20121122233434.3D5E35E6@pobox.sk> <20121123074023.GA24698@dhcp22.suse.cz> <20121123102137.10D6D653@pobox.sk> <20121123100438.GF24698@dhcp22.suse.cz> <20121125011047.7477BB5E@pobox.sk> <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <50B403CA.501@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <50B403CA.501@jp.fujitsu.com> Sender: owner-linux-mm@kvack.org List-ID: To: Kamezawa Hiroyuki Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , Johannes Weiner On Tue 27-11-12 09:05:30, KAMEZAWA Hiroyuki wrote: [...] > As a short term fix, I think this patch will work enough and seems simple enough. > Acked-by: KAMEZAWA Hiroyuki Thanks! If Johannes is also ok with this for now I will resubmit the patch to Andrew after I hear back from the reporter. > Reading discussion between you and Johannes, to release locks, I understand > the memcg need to return "RETRY" for a long term fix. Thinking a little, > it will be simple to return "RETRY" to all processes waited on oom kill queue > of a memcg and it can be done by a small fixes to memory.c. I wouldn't call it simple but it is doable. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx169.postini.com [74.125.245.169]) by kanga.kvack.org (Postfix) with SMTP id 03F416B0068 for ; Tue, 27 Nov 2012 14:48:29 -0500 (EST) Date: Tue, 27 Nov 2012 14:48:13 -0500 From: Johannes Weiner Subject: Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121127194813.GP24381@cmpxchg.org> References: <20121122233434.3D5E35E6@pobox.sk> <20121123074023.GA24698@dhcp22.suse.cz> <20121123102137.10D6D653@pobox.sk> <20121123100438.GF24698@dhcp22.suse.cz> <20121125011047.7477BB5E@pobox.sk> <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <50B403CA.501@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <50B403CA.501@jp.fujitsu.com> Sender: owner-linux-mm@kvack.org List-ID: To: Kamezawa Hiroyuki Cc: Michal Hocko , azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist On Tue, Nov 27, 2012 at 09:05:30AM +0900, Kamezawa Hiroyuki wrote: > (2012/11/26 22:18), Michal Hocko wrote: > >[CCing also Johannes - the thread started here: > >https://lkml.org/lkml/2012/11/21/497] > > > >On Mon 26-11-12 01:38:55, azurIt wrote: > >>>This is hackish but it should help you in this case. Kamezawa, what do > >>>you think about that? Should we generalize this and prepare something > >>>like mem_cgroup_cache_charge_locked which would add __GFP_NORETRY > >>>automatically and use the function whenever we are in a locked context? > >>>To be honest I do not like this very much but nothing more sensible > >>>(without touching non-memcg paths) comes to my mind. > >> > >> > >>I installed kernel with this patch, will report back if problem occurs > >>again OR in few weeks if everything will be ok. Thank you! 
> > > >Now that I am looking at the patch closer it will not work because it > >depends on other patch which is not merged yet and even that one would > >help on its own because __GFP_NORETRY doesn't break the charge loop. > >Sorry I have missed that... > > > >The patch bellow should help though. (it is based on top of the current > >-mm tree but I will send a backport to 3.2 in the reply as well) > >--- > > From 7796f942d62081ad45726efd90b5292b80e7c690 Mon Sep 17 00:00:00 2001 > >From: Michal Hocko > >Date: Mon, 26 Nov 2012 11:47:57 +0100 > >Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked > > > >memcg oom killer might deadlock if the process which falls down to > >mem_cgroup_handle_oom holds a lock which prevents other task to > >terminate because it is blocked on the very same lock. > >This can happen when a write system call needs to allocate a page but > >the allocation hits the memcg hard limit and there is nothing to reclaim > >(e.g. there is no swap or swap limit is hit as well and all cache pages > >have been reclaimed already) and the process selected by memcg OOM > >killer is blocked on i_mutex on the same inode (e.g. truncate it). 
> > > >Process A > >[] do_truncate+0x58/0xa0 # takes i_mutex > >[] do_last+0x250/0xa30 > >[] path_openat+0xd7/0x440 > >[] do_filp_open+0x49/0xa0 > >[] do_sys_open+0x106/0x240 > >[] sys_open+0x20/0x30 > >[] system_call_fastpath+0x18/0x1d > >[] 0xffffffffffffffff > > > >Process B > >[] mem_cgroup_handle_oom+0x241/0x3b0 > >[] T.1146+0x5ab/0x5c0 > >[] mem_cgroup_cache_charge+0xbe/0xe0 > >[] add_to_page_cache_locked+0x4c/0x140 > >[] add_to_page_cache_lru+0x22/0x50 > >[] grab_cache_page_write_begin+0x8b/0xe0 > >[] ext3_write_begin+0x88/0x270 > >[] generic_file_buffered_write+0x116/0x290 > >[] __generic_file_aio_write+0x27c/0x480 > >[] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex > >[] do_sync_write+0xea/0x130 > >[] vfs_write+0xf3/0x1f0 > >[] sys_write+0x51/0x90 > >[] system_call_fastpath+0x18/0x1d > >[] 0xffffffffffffffff > > > >This is not a hard deadlock though because administrator can still > >intervene and increase the limit on the group which helps the writer to > >finish the allocation and release the lock. > > > >This patch heals the problem by forbidding OOM from page cache charges > >(namely add_ro_page_cache_locked). mem_cgroup_cache_charge_no_oom helper > >function is defined which adds GFP_MEMCG_NO_OOM to the gfp mask which > >then tells mem_cgroup_charge_common that OOM is not allowed for the > >charge. No OOM from this path, except for fixing the bug, also make some > >sense as we really do not want to cause an OOM because of a page cache > >usage. > >As a possibly visible result add_to_page_cache_lru might fail more often > >with ENOMEM but this is to be expected if the limit is set and it is > >preferable than OOM killer IMO. > > > >__GFP_NORETRY is abused for this memcg specific flag because it has been > >used to prevent from OOM already (since not-merged-yet "memcg: reclaim > >when more than one page needed"). 
The only difference is that the flag > >doesn't prevent from reclaim anymore which kind of makes sense because > >the global memory allocator triggers reclaim as well. The retry without > >any reclaim on __GFP_NORETRY doesn't make much sense anyway because this > >is effectively a busy loop with allowed OOM in this path. > > > >Reported-by: azurIt > >Signed-off-by: Michal Hocko > > As a short term fix, I think this patch will work enough and seems simple enough. > Acked-by: KAMEZAWA Hiroyuki Yes, let's do this for now. > >diff --git a/include/linux/gfp.h b/include/linux/gfp.h > >index 10e667f..aac9b21 100644 > >--- a/include/linux/gfp.h > >+++ b/include/linux/gfp.h > >@@ -152,6 +152,9 @@ struct vm_area_struct; > > /* 4GB DMA on some platforms */ > > #define GFP_DMA32 __GFP_DMA32 > > > >+/* memcg oom killer is not allowed */ > >+#define GFP_MEMCG_NO_OOM __GFP_NORETRY Could we leave this within memcg, please? An extra flag to mem_cgroup_cache_charge() or the like. GFP flags are about controlling the page allocator, this seems abusive. We have an oom flag down in try_charge, maybe just propagate this up the stack? > >diff --git a/mm/filemap.c b/mm/filemap.c > >index 83efee7..ef14351 100644 > >--- a/mm/filemap.c > >+++ b/mm/filemap.c > >@@ -447,7 +447,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, > > VM_BUG_ON(!PageLocked(page)); > > VM_BUG_ON(PageSwapBacked(page)); > > > >- error = mem_cgroup_cache_charge(page, current->mm, > >+ /* > >+ * Cannot trigger OOM even if gfp_mask would allow that normally > >+ * because we might be called from a locked context and that > >+ * could lead to deadlocks if the killed process is waiting for > >+ * the same lock. > >+ */ > >+ error = mem_cgroup_cache_charge_no_oom(page, current->mm, > > gfp_mask & GFP_RECLAIM_MASK); > > if (error) > > goto out; Shmem does not use this function but also charges under the i_mutex in the write path and fallocate at least. 
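
To make the suggestion above concrete, here is a minimal user-space sketch
in C. It is not kernel code: every name (model_gfp_t, MODEL_*, model_memcg,
model_try_charge, model_cache_charge_locked) and every flag value is made up
for illustration. It contrasts recovering the OOM policy from a reused
allocator bit with passing the policy as an explicit parameter.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag values only; these are not the kernel's definitions. */
typedef unsigned int model_gfp_t;
#define MODEL_GFP_WAIT      0x10u
#define MODEL_GFP_NORETRY   0x1000u
/* v1 style: the memcg "no OOM" request shares a bit with an allocator flag. */
#define MODEL_MEMCG_NO_OOM  MODEL_GFP_NORETRY

enum model_result { MODEL_CHARGE_OK, MODEL_CHARGE_NOMEM, MODEL_CHARGE_OOM };

struct model_memcg {
	unsigned long usage;	/* pages currently charged */
	unsigned long limit;	/* hard limit in pages */
};

/* v1 style: recover the OOM policy from the mask (a bit test needs '&'). */
static bool model_oom_allowed(model_gfp_t mask)
{
	return !(mask & MODEL_MEMCG_NO_OOM);
}

/* Suggested style: the caller states the OOM policy explicitly. */
static enum model_result model_try_charge(struct model_memcg *cg,
					  unsigned long nr_pages, bool oom)
{
	assert(nr_pages > 0);
	if (cg->usage + nr_pages <= cg->limit) {
		cg->usage += nr_pages;
		return MODEL_CHARGE_OK;
	}
	/* Over the limit and, in this model, nothing left to reclaim. */
	return oom ? MODEL_CHARGE_OOM : MODEL_CHARGE_NOMEM;
}

/* A page cache style caller that may hold a lock: never ask for OOM. */
static enum model_result model_cache_charge_locked(struct model_memcg *cg)
{
	return model_try_charge(cg, 1, false);
}
```

In this model, a caller that sets MODEL_GFP_NORETRY purely for allocator
reasons also switches off the OOM path as a side effect, because the two
meanings share one bit. The explicit bool keeps the two decisions separate.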

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx204.postini.com [74.125.245.204])
	by kanga.kvack.org (Postfix) with SMTP id 3173C6B004D
	for ; Tue, 27 Nov 2012 15:54:40 -0500 (EST)
Received: by mail-ea0-f169.google.com with SMTP id a12so5237955eaa.14
	for ; Tue, 27 Nov 2012 12:54:38 -0800 (PST)
Date: Tue, 27 Nov 2012 21:54:36 +0100
From: Michal Hocko
Subject: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked
Message-ID: <20121127205431.GA2433@dhcp22.suse.cz>
References: <20121123074023.GA24698@dhcp22.suse.cz>
	<20121123102137.10D6D653@pobox.sk>
	<20121123100438.GF24698@dhcp22.suse.cz>
	<20121125011047.7477BB5E@pobox.sk>
	<20121125120524.GB10623@dhcp22.suse.cz>
	<20121125135542.GE10623@dhcp22.suse.cz>
	<20121126013855.AF118F5E@pobox.sk>
	<20121126131837.GC17860@dhcp22.suse.cz>
	<50B403CA.501@jp.fujitsu.com>
	<20121127194813.GP24381@cmpxchg.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20121127194813.GP24381@cmpxchg.org>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Johannes Weiner, KAMEZAWA Hiroyuki
Cc: azurIt, linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist

On Tue 27-11-12 14:48:13, Johannes Weiner wrote:
[...]
> > >diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > >index 10e667f..aac9b21 100644
> > >--- a/include/linux/gfp.h
> > >+++ b/include/linux/gfp.h
> > >@@ -152,6 +152,9 @@ struct vm_area_struct;
> > > /* 4GB DMA on some platforms */
> > > #define GFP_DMA32	__GFP_DMA32
> > >
> > >+/* memcg oom killer is not allowed */
> > >+#define GFP_MEMCG_NO_OOM	__GFP_NORETRY
>
> Could we leave this within memcg, please? An extra flag to
> mem_cgroup_cache_charge() or the like. GFP flags are about
> controlling the page allocator, this seems abusive. We have an oom
> flag down in try_charge, maybe just propagate this up the stack?

OK, what about the patch below? I have dropped Kame's Acked-by because
it has been reworked. The patch is the same in principle.

> > >diff --git a/mm/filemap.c b/mm/filemap.c
> > >index 83efee7..ef14351 100644
> > >--- a/mm/filemap.c
> > >+++ b/mm/filemap.c
> > >@@ -447,7 +447,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
> > > 	VM_BUG_ON(!PageLocked(page));
> > > 	VM_BUG_ON(PageSwapBacked(page));
> > >
> > >-	error = mem_cgroup_cache_charge(page, current->mm,
> > >+	/*
> > >+	 * Cannot trigger OOM even if gfp_mask would allow that normally
> > >+	 * because we might be called from a locked context and that
> > >+	 * could lead to deadlocks if the killed process is waiting for
> > >+	 * the same lock.
> > >+	 */
> > >+	error = mem_cgroup_cache_charge_no_oom(page, current->mm,
> > > 					gfp_mask & GFP_RECLAIM_MASK);
> > > 	if (error)
> > > 		goto out;
>
> Shmem does not use this function but also charges under the i_mutex in
> the write path and fallocate at least.

Right you are
---

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx153.postini.com [74.125.245.153])
	by kanga.kvack.org (Postfix) with SMTP id 940DE6B004D
	for ; Tue, 27 Nov 2012 15:59:47 -0500 (EST)
Received: by mail-ea0-f169.google.com with SMTP id a12so5240213eaa.14
	for ; Tue, 27 Nov 2012 12:59:46 -0800 (PST)
Date: Tue, 27 Nov 2012 21:59:44 +0100
From: Michal Hocko
Subject: Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked
Message-ID: <20121127205944.GB2433@dhcp22.suse.cz>
References: <20121123102137.10D6D653@pobox.sk>
	<20121123100438.GF24698@dhcp22.suse.cz>
	<20121125011047.7477BB5E@pobox.sk>
	<20121125120524.GB10623@dhcp22.suse.cz>
	<20121125135542.GE10623@dhcp22.suse.cz>
	<20121126013855.AF118F5E@pobox.sk>
	<20121126131837.GC17860@dhcp22.suse.cz>
	<50B403CA.501@jp.fujitsu.com>
	<20121127194813.GP24381@cmpxchg.org>
	<20121127205431.GA2433@dhcp22.suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20121127205431.GA2433@dhcp22.suse.cz>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Johannes Weiner, KAMEZAWA Hiroyuki
Cc: azurIt, linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist

Sorry, forgot about one shmem charge:
---

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx150.postini.com [74.125.245.150])
	by kanga.kvack.org (Postfix) with SMTP id 7D15A6B004D
	for ; Wed, 28 Nov 2012 10:26:48 -0500 (EST)
Date: Wed, 28 Nov 2012 10:26:31 -0500
From: Johannes Weiner
Subject: Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked
Message-ID: <20121128152631.GT24381@cmpxchg.org>
References: <20121123100438.GF24698@dhcp22.suse.cz>
	<20121125011047.7477BB5E@pobox.sk>
	<20121125120524.GB10623@dhcp22.suse.cz>
	<20121125135542.GE10623@dhcp22.suse.cz>
	<20121126013855.AF118F5E@pobox.sk>
	<20121126131837.GC17860@dhcp22.suse.cz>
	<50B403CA.501@jp.fujitsu.com>
<20121127194813.GP24381@cmpxchg.org> <20121127205431.GA2433@dhcp22.suse.cz> <20121127205944.GB2433@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121127205944.GB2433@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: KAMEZAWA Hiroyuki , azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist On Tue, Nov 27, 2012 at 09:59:44PM +0100, Michal Hocko wrote: > @@ -3863,7 +3862,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, > return 0; > > if (!PageSwapCache(page)) > - ret = mem_cgroup_charge_common(page, mm, gfp_mask, type); > + ret = mem_cgroup_charge_common(page, mm, gfp_mask, type, oom); > else { /* page is swapcache/shmem */ > ret = __mem_cgroup_try_charge_swapin(mm, page, > gfp_mask, &memcg); I think you need to pass it down the swapcache path too, as that is what happens when the shmem page written to is in swap and has been read into swapcache by the time of charging. > @@ -1152,8 +1152,16 @@ repeat: > goto failed; > } > > + /* > + * Cannot trigger OOM even if gfp_mask would allow that > + * normally because we might be called from a locked > + * context (i_mutex held) if this is a write lock or > + * fallocate and that could lead to deadlocks if the > + * killed process is waiting for the same lock. > + */ Indentation broken? > error = mem_cgroup_cache_charge(page, current->mm, > - gfp & GFP_RECLAIM_MASK); > + gfp & GFP_RECLAIM_MASK, > + sgp < SGP_WRITE); The code tests for read-only paths a bunch of times using sgp != SGP_WRITE && sgp != SGP_FALLOC Would probably be more consistent and more robust to use this here as well? > @@ -1209,7 +1217,8 @@ repeat: > SetPageSwapBacked(page); > __set_page_locked(page); > error = mem_cgroup_cache_charge(page, current->mm, > - gfp & GFP_RECLAIM_MASK); > + gfp & GFP_RECLAIM_MASK, > + sgp < SGP_WRITE); Same. 
Otherwise, the patch looks good to me, thanks for persisting :)

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx187.postini.com [74.125.245.187])
	by kanga.kvack.org (Postfix) with SMTP id 51D9C6B0068
	for ; Wed, 28 Nov 2012 11:04:50 -0500 (EST)
Date: Wed, 28 Nov 2012 17:04:47 +0100
From: Michal Hocko
Subject: Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked
Message-ID: <20121128160447.GH12309@dhcp22.suse.cz>
References: <20121125011047.7477BB5E@pobox.sk>
	<20121125120524.GB10623@dhcp22.suse.cz>
	<20121125135542.GE10623@dhcp22.suse.cz>
	<20121126013855.AF118F5E@pobox.sk>
	<20121126131837.GC17860@dhcp22.suse.cz>
	<50B403CA.501@jp.fujitsu.com>
	<20121127194813.GP24381@cmpxchg.org>
	<20121127205431.GA2433@dhcp22.suse.cz>
	<20121127205944.GB2433@dhcp22.suse.cz>
	<20121128152631.GT24381@cmpxchg.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20121128152631.GT24381@cmpxchg.org>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Johannes Weiner
Cc: KAMEZAWA Hiroyuki, azurIt, linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist

On Wed 28-11-12 10:26:31, Johannes Weiner wrote:
> On Tue, Nov 27, 2012 at 09:59:44PM +0100, Michal Hocko wrote:
> > @@ -3863,7 +3862,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> >  		return 0;
> >
> >  	if (!PageSwapCache(page))
> > -		ret = mem_cgroup_charge_common(page, mm, gfp_mask, type);
> > +		ret = mem_cgroup_charge_common(page, mm, gfp_mask, type, oom);
> >  	else { /* page is swapcache/shmem */
> >  		ret = __mem_cgroup_try_charge_swapin(mm, page,
> >  						     gfp_mask, &memcg);
>
> I think you need to pass it down the swapcache path too, as that is
> what happens when the shmem page written to is in swap and has been
> read into swapcache by the time of charging.

You are right, of course. I shouldn't send patches late in the evening
after staring at a crashdump for a good part of the day.
/me ashamed.

> > @@ -1152,8 +1152,16 @@ repeat:
> >  				goto failed;
> >  		}
> >
> > +	/*
> > +	 * Cannot trigger OOM even if gfp_mask would allow that
> > +	 * normally because we might be called from a locked
> > +	 * context (i_mutex held) if this is a write lock or
> > +	 * fallocate and that could lead to deadlocks if the
> > +	 * killed process is waiting for the same lock.
> > +	 */
>
> Indentation broken?

c&p

> >  		error = mem_cgroup_cache_charge(page, current->mm,
> > -						gfp & GFP_RECLAIM_MASK);
> > +						gfp & GFP_RECLAIM_MASK,
> > +						sgp < SGP_WRITE);
>
> The code tests for read-only paths a bunch of times using
>
> 	sgp != SGP_WRITE && sgp != SGP_FALLOC
>
> Would probably be more consistent and more robust to use this here as
> well?

Yes, my laziness. I was considering that but it was really long so I've
chosen the simpler way. But you are right that consistency is probably
better here.

> > @@ -1209,7 +1217,8 @@ repeat:
> >  		SetPageSwapBacked(page);
> >  		__set_page_locked(page);
> >  		error = mem_cgroup_cache_charge(page, current->mm,
> > -						gfp & GFP_RECLAIM_MASK);
> > +						gfp & GFP_RECLAIM_MASK,
> > +						sgp < SGP_WRITE);
>
> Same.
>
> Otherwise, the patch looks good to me, thanks for persisting :)

Thanks for the thorough review. Here we go with the fixed version.
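
On the sgp test discussed above, a small user-space sketch in C of why the
explicit comparison is the more robust of the two. This is not kernel code.
The enum order in model_sgp is assumed to mirror the shmem get-page modes of
that era, and model_sgp_next with its N_SGP_PEEK mode is purely hypothetical,
invented here to show what happens if a read-only mode is ever appended.

```c
#include <stdbool.h>

/*
 * Model of the shmem get-page modes. The order is an assumption about
 * mm/shmem.c at the time; treat it as illustrative.
 */
enum model_sgp {
	M_SGP_READ, M_SGP_CACHE, M_SGP_DIRTY, M_SGP_WRITE, M_SGP_FALLOC
};

/* Relies on every "may hold i_mutex" mode sorting at or after M_SGP_WRITE. */
static bool oom_allowed_by_order(enum model_sgp sgp)
{
	return sgp < M_SGP_WRITE;
}

/* Names the two modes that matter; independent of the enum order. */
static bool oom_allowed_explicit(enum model_sgp sgp)
{
	return sgp != M_SGP_WRITE && sgp != M_SGP_FALLOC;
}

/* Hypothetical future layout: a new read-only mode appended at the end. */
enum model_sgp_next {
	N_SGP_READ, N_SGP_CACHE, N_SGP_DIRTY, N_SGP_WRITE, N_SGP_FALLOC,
	N_SGP_PEEK
};

static bool next_allowed_by_order(enum model_sgp_next sgp)
{
	return sgp < N_SGP_WRITE;
}

static bool next_allowed_explicit(enum model_sgp_next sgp)
{
	return sgp != N_SGP_WRITE && sgp != N_SGP_FALLOC;
}
```

With the assumed order both predicates agree for every mode. In the
hypothetical extended enum the ordering test would silently treat the new
read-only mode like a write, while the explicit test keeps working.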
--- From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx168.postini.com [74.125.245.168]) by kanga.kvack.org (Postfix) with SMTP id 90A4C6B006C for ; Wed, 28 Nov 2012 11:37:59 -0500 (EST) Date: Wed, 28 Nov 2012 11:37:36 -0500 From: Johannes Weiner Subject: Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121128163736.GV24381@cmpxchg.org> References: <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <50B403CA.501@jp.fujitsu.com> <20121127194813.GP24381@cmpxchg.org> <20121127205431.GA2433@dhcp22.suse.cz> <20121127205944.GB2433@dhcp22.suse.cz> <20121128152631.GT24381@cmpxchg.org> <20121128160447.GH12309@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121128160447.GH12309@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: KAMEZAWA Hiroyuki , azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist On Wed, Nov 28, 2012 at 05:04:47PM +0100, Michal Hocko wrote: > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 095d2b4..5abe441 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -57,13 +57,14 @@ extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm, > gfp_t gfp_mask); > /* for swap handling */ > extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm, > - struct page *page, gfp_t mask, struct mem_cgroup **memcgp); > + struct page *page, gfp_t mask, struct mem_cgroup **memcgp, > + bool oom); Ok, now I feel almost bad for asking, but why the public interface, too? You only ever pass "true" in there and this is unlikely to change anytime soon, no? 
> @@ -3754,7 +3753,8 @@ int mem_cgroup_newpage_charge(struct page *page, > static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm, > struct page *page, > gfp_t mask, > - struct mem_cgroup **memcgp) > + struct mem_cgroup **memcgp, > + bool oom) > { > struct mem_cgroup *memcg; > struct page_cgroup *pc; > @@ -3776,20 +3776,21 @@ static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm, > if (!memcg) > goto charge_cur_mm; > *memcgp = memcg; > - ret = __mem_cgroup_try_charge(NULL, mask, 1, memcgp, true); > + ret = __mem_cgroup_try_charge(NULL, mask, 1, memcgp, oom); > css_put(&memcg->css); > if (ret == -EINTR) > ret = 0; > return ret; > charge_cur_mm: > - ret = __mem_cgroup_try_charge(mm, mask, 1, memcgp, true); > + ret = __mem_cgroup_try_charge(mm, mask, 1, memcgp, oom); > if (ret == -EINTR) > ret = 0; > return ret; > } Only this one is needed... > @@ -3851,7 +3852,7 @@ void mem_cgroup_commit_charge_swapin(struct page *page, > } > > int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, > - gfp_t gfp_mask) > + gfp_t gfp_mask, bool oom) > { > struct mem_cgroup *memcg = NULL; > enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE; > @@ -3863,10 +3864,10 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, > return 0; > > if (!PageSwapCache(page)) > - ret = mem_cgroup_charge_common(page, mm, gfp_mask, type); > + ret = mem_cgroup_charge_common(page, mm, gfp_mask, type, oom); > else { /* page is swapcache/shmem */ > ret = __mem_cgroup_try_charge_swapin(mm, page, > - gfp_mask, &memcg); > + gfp_mask, &memcg, oom); > if (!ret) > __mem_cgroup_commit_charge_swapin(page, memcg, type); > } ...for this site. 
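
The point of the two comments above can be sketched in user-space C as
follows. This is a model under stated assumptions, not kernel code: the
struct model_cg type and all model_* functions are hypothetical, and the
"OOM kill" is reduced to a counter that frees one page. Only the internal
helper takes the oom parameter; the public entry point keeps its signature
and always passes true, while the page cache path forwards its caller's
choice.

```c
#include <errno.h>
#include <stdbool.h>

struct model_cg {
	unsigned long usage;	/* pages currently charged */
	unsigned long limit;	/* hard limit in pages */
	int oom_kills;		/* how often the modelled OOM path ran */
};

/* Internal helper: the only function that needs to know the OOM policy. */
static int model_do_try_charge_swapin(struct model_cg *cg, bool oom)
{
	if (cg->usage < cg->limit) {
		cg->usage++;
		return 0;
	}
	if (!oom)
		return -ENOMEM;
	/* Stand-in for the OOM killer: one page is freed, then charged. */
	cg->oom_kills++;
	return 0;
}

/* Public entry point: signature unchanged, OOM always allowed. */
static int model_try_charge_swapin(struct model_cg *cg)
{
	return model_do_try_charge_swapin(cg, true);
}

/* Page cache entry point, swapcache case: forwards the caller's policy. */
static int model_cache_charge_swapcache(struct model_cg *cg, bool oom)
{
	return model_do_try_charge_swapin(cg, oom);
}
```

Keeping the parameter off the public function means existing callers do not
have to change, and there is no argument for them to get wrong.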
> diff --git a/mm/memory.c b/mm/memory.c > index 6891d3b..afad903 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -2991,7 +2991,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma, > } > } > > - if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr)) { > + if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr, true)) { > ret = VM_FAULT_OOM; > goto out_page; > } Can not happen for shmem, the fault handler uses vma->vm_ops->fault. > diff --git a/mm/swapfile.c b/mm/swapfile.c > index 2f8e429..8ec511e 100644 > --- a/mm/swapfile.c > +++ b/mm/swapfile.c > @@ -828,7 +828,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd, > int ret = 1; > > if (mem_cgroup_try_charge_swapin(vma->vm_mm, page, > - GFP_KERNEL, &memcg)) { > + GFP_KERNEL, &memcg, true)) { > ret = -ENOMEM; > goto out_nolock; > } Can not happen for shmem, uses shmem_unuse() instead. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx179.postini.com [74.125.245.179]) by kanga.kvack.org (Postfix) with SMTP id B60386B00A9 for ; Wed, 28 Nov 2012 11:48:26 -0500 (EST) Date: Wed, 28 Nov 2012 17:48:24 +0100 From: Michal Hocko Subject: Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121128164824.GC22201@dhcp22.suse.cz> References: <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <50B403CA.501@jp.fujitsu.com> <20121127194813.GP24381@cmpxchg.org> <20121127205431.GA2433@dhcp22.suse.cz> <20121127205944.GB2433@dhcp22.suse.cz> <20121128152631.GT24381@cmpxchg.org> <20121128160447.GH12309@dhcp22.suse.cz> <20121128163736.GV24381@cmpxchg.org> <20121128164640.GB22201@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121128164640.GB22201@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: KAMEZAWA Hiroyuki , azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist On Wed 28-11-12 17:46:40, Michal Hocko wrote: > On Wed 28-11-12 11:37:36, Johannes Weiner wrote: > > On Wed, Nov 28, 2012 at 05:04:47PM +0100, Michal Hocko wrote: > > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > > > index 095d2b4..5abe441 100644 > > > --- a/include/linux/memcontrol.h > > > +++ b/include/linux/memcontrol.h > > > @@ -57,13 +57,14 @@ extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm, > > > gfp_t gfp_mask); > > > /* for swap handling */ > > > extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm, > > > - struct page *page, gfp_t mask, struct mem_cgroup **memcgp); > > > + struct page *page, gfp_t mask, struct mem_cgroup **memcgp, > > > + bool oom); > > > > Ok, now I feel almost bad for asking, but why the public interface, > > too? 
> > Would it work out if I tell it was to double check that your review > quality is not decreased after that many revisions? :P > > Incremental update and the full patch in the reply --- From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx124.postini.com [74.125.245.124]) by kanga.kvack.org (Postfix) with SMTP id 775E36B0062 for ; Wed, 28 Nov 2012 13:44:49 -0500 (EST) Date: Wed, 28 Nov 2012 13:44:33 -0500 From: Johannes Weiner Subject: Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121128184433.GH2301@cmpxchg.org> References: <20121126131837.GC17860@dhcp22.suse.cz> <50B403CA.501@jp.fujitsu.com> <20121127194813.GP24381@cmpxchg.org> <20121127205431.GA2433@dhcp22.suse.cz> <20121127205944.GB2433@dhcp22.suse.cz> <20121128152631.GT24381@cmpxchg.org> <20121128160447.GH12309@dhcp22.suse.cz> <20121128163736.GV24381@cmpxchg.org> <20121128164640.GB22201@dhcp22.suse.cz> <20121128164824.GC22201@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121128164824.GC22201@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: KAMEZAWA Hiroyuki , azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist On Wed, Nov 28, 2012 at 05:48:24PM +0100, Michal Hocko wrote: > On Wed 28-11-12 17:46:40, Michal Hocko wrote: > > On Wed 28-11-12 11:37:36, Johannes Weiner wrote: > > > On Wed, Nov 28, 2012 at 05:04:47PM +0100, Michal Hocko wrote: > > > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > > > > index 095d2b4..5abe441 100644 > > > > --- a/include/linux/memcontrol.h > > > > +++ b/include/linux/memcontrol.h > > > > @@ -57,13 +57,14 @@ extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm, > > > > gfp_t gfp_mask); > > > > /* for swap handling */ > > > > extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm, > > > > - struct page *page, gfp_t mask, 
struct mem_cgroup **memcgp); > > > > + struct page *page, gfp_t mask, struct mem_cgroup **memcgp, > > > > + bool oom); > > > > > > Ok, now I feel almost bad for asking, but why the public interface, > > > too? > > > > Would it work out if I tell it was to double check that your review > > quality is not decreased after that many revisions? :P Deal. > >From e21bb704947e9a477ec1df9121575c606dbfcb52 Mon Sep 17 00:00:00 2001 > From: Michal Hocko > Date: Wed, 28 Nov 2012 17:46:32 +0100 > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked > > memcg oom killer might deadlock if the process which falls down to > mem_cgroup_handle_oom holds a lock which prevents other task to > terminate because it is blocked on the very same lock. > This can happen when a write system call needs to allocate a page but > the allocation hits the memcg hard limit and there is nothing to reclaim > (e.g. there is no swap or swap limit is hit as well and all cache pages > have been reclaimed already) and the process selected by memcg OOM > killer is blocked on i_mutex on the same inode (e.g. truncate it). 
>
> Process A
> [] do_truncate+0x58/0xa0 # takes i_mutex
> [] do_last+0x250/0xa30
> [] path_openat+0xd7/0x440
> [] do_filp_open+0x49/0xa0
> [] do_sys_open+0x106/0x240
> [] sys_open+0x20/0x30
> [] system_call_fastpath+0x18/0x1d
> [] 0xffffffffffffffff
>
> Process B
> [] mem_cgroup_handle_oom+0x241/0x3b0
> [] T.1146+0x5ab/0x5c0
> [] mem_cgroup_cache_charge+0xbe/0xe0
> [] add_to_page_cache_locked+0x4c/0x140
> [] add_to_page_cache_lru+0x22/0x50
> [] grab_cache_page_write_begin+0x8b/0xe0
> [] ext3_write_begin+0x88/0x270
> [] generic_file_buffered_write+0x116/0x290
> [] __generic_file_aio_write+0x27c/0x480
> [] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex
> [] do_sync_write+0xea/0x130
> [] vfs_write+0xf3/0x1f0
> [] sys_write+0x51/0x90
> [] system_call_fastpath+0x18/0x1d
> [] 0xffffffffffffffff
>
> This is not a hard deadlock though because administrator can still
> intervene and increase the limit on the group which helps the writer to
> finish the allocation and release the lock.
>
> This patch heals the problem by forbidding OOM from page cache charges
> (namely add_to_page_cache_locked). mem_cgroup_cache_charge grows oom
> argument which is pushed down the call chain.
>
> As a possibly visible result add_to_page_cache_lru might fail more often
> with ENOMEM but this is to be expected if the limit is set and it is
> preferable to the OOM killer IMO.
>
> Changes since v1
> - do not abuse gfp_flags and rather use oom parameter directly as per
>   Johannes
> - handle also shmem write faults resp. fallocate properly as per Johannes
>
> Reported-by: azurIt
> Signed-off-by: Michal Hocko

Acked-by: Johannes Weiner

Thanks, Michal!

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx161.postini.com [74.125.245.161]) by kanga.kvack.org (Postfix) with SMTP id 2EFC56B006C for ; Wed, 28 Nov 2012 15:20:43 -0500 (EST) Received: by mail-qc0-f169.google.com with SMTP id t2so12534706qcq.14 for ; Wed, 28 Nov 2012 12:20:42 -0800 (PST) Date: Wed, 28 Nov 2012 12:20:44 -0800 (PST) From: Hugh Dickins Subject: Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked In-Reply-To: <20121128164824.GC22201@dhcp22.suse.cz> Message-ID: References: <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <50B403CA.501@jp.fujitsu.com> <20121127194813.GP24381@cmpxchg.org> <20121127205431.GA2433@dhcp22.suse.cz> <20121127205944.GB2433@dhcp22.suse.cz> <20121128152631.GT24381@cmpxchg.org> <20121128160447.GH12309@dhcp22.suse.cz> <20121128163736.GV24381@cmpxchg.org> <20121128164640.GB22201@dhcp22.suse.cz> <20121128164824.GC22201@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: Johannes Weiner , KAMEZAWA Hiroyuki , azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist On Wed, 28 Nov 2012, Michal Hocko wrote: > From e21bb704947e9a477ec1df9121575c606dbfcb52 Mon Sep 17 00:00:00 2001 > From: Michal Hocko > Date: Wed, 28 Nov 2012 17:46:32 +0100 > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked > > memcg oom killer might deadlock if the process which falls down to > mem_cgroup_handle_oom holds a lock which prevents other task to > terminate because it is blocked on the very same lock. > This can happen when a write system call needs to allocate a page but > the allocation hits the memcg hard limit and there is nothing to reclaim > (e.g. 
there is no swap or swap limit is hit as well and all cache pages
> have been reclaimed already) and the process selected by memcg OOM
> killer is blocked on i_mutex on the same inode (e.g. truncate it).
>
> Process A
> [] do_truncate+0x58/0xa0 # takes i_mutex
> [] do_last+0x250/0xa30
> [] path_openat+0xd7/0x440
> [] do_filp_open+0x49/0xa0
> [] do_sys_open+0x106/0x240
> [] sys_open+0x20/0x30
> [] system_call_fastpath+0x18/0x1d
> [] 0xffffffffffffffff
>
> Process B
> [] mem_cgroup_handle_oom+0x241/0x3b0
> [] T.1146+0x5ab/0x5c0
> [] mem_cgroup_cache_charge+0xbe/0xe0
> [] add_to_page_cache_locked+0x4c/0x140
> [] add_to_page_cache_lru+0x22/0x50
> [] grab_cache_page_write_begin+0x8b/0xe0
> [] ext3_write_begin+0x88/0x270
> [] generic_file_buffered_write+0x116/0x290
> [] __generic_file_aio_write+0x27c/0x480
> [] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex
> [] do_sync_write+0xea/0x130
> [] vfs_write+0xf3/0x1f0
> [] sys_write+0x51/0x90
> [] system_call_fastpath+0x18/0x1d
> [] 0xffffffffffffffff
>
> This is not a hard deadlock though because administrator can still
> intervene and increase the limit on the group which helps the writer to
> finish the allocation and release the lock.
>
> This patch heals the problem by forbidding OOM from page cache charges
> (namely add_to_page_cache_locked). mem_cgroup_cache_charge grows oom
> argument which is pushed down the call chain.
>
> As a possibly visible result add_to_page_cache_lru might fail more often
> with ENOMEM but this is to be expected if the limit is set and it is
> preferable to the OOM killer IMO.
>
> Changes since v1
> - do not abuse gfp_flags and rather use oom parameter directly as per
>   Johannes
> - handle also shmem write faults resp.
fallocate properly as per Johannes > > Reported-by: azurIt > Signed-off-by: Michal Hocko Sorry, Michal, you've laboured hard on this: but I dislike it so much that I'm here overcoming my dread of entering an OOM-killer discussion, and the resultant deluge of unwelcome CCs for eternity afterwards. I had been relying on Johannes to repeat his "This issue has been around for a while so frankly I don't think it's urgent enough to rush things", but it looks like I have to be the one to repeat it. Your analysis of azurIt's traces may well be correct, and this patch may indeed ameliorate the situation, and it's fine as something for azurIt to try and report on and keep in his tree; but I hope that it does not go upstream and to stable. Why do I dislike it so much? I suppose because it's both too general and too limited at the same time. Too general in that it changes the behaviour on OOM for a large set of memcg charges, all those that go through add_to_page_cache_locked(), when only a subset of those have the i_mutex issue. If you're going to be that general, why not go further? Leave the mem_cgroup_cache_charge() interface as is, make it not-OOM internally, no need for SGP_WRITE,SGP_FALLOC distinctions in mm/shmem.c. No other filesystem gets the benefit of those distinctions: isn't it better to keep it simple? (And I can see a partial truncation case where shmem uses SGP_READ under i_mutex; and the change to shmem_unuse behaviour is a non-issue, since swapoff invites itself to be killed anyway.) Too limited in that i_mutex is just the held resource which azurIt's traces have led you to, but it's a general problem that the OOM-killed task might be waiting for a resource that the OOM-killing task holds. I suspect that if we try hard enough (I admit I have not), we can find an example of such a potential deadlock for almost every memcg charge site. mmap_sem? 
not as easy to invent a case with that as I thought, since it needs a down_write, and the typical page allocations happen with down_read, and I can't think of a process which does down_write on another's mm. But i_mutex is always good, once you remember the case of write to file from userspace page which got paged out, so the fault path has to read it back in, while i_mutex is still held at the outer level. An unusual case? Well, normally yes, but we're considering out-of-memory conditions, which may converge upon cases like this. Wouldn't it be nice if I could be constructive? But I'm sceptical even of Johannes's faith in what the global OOM killer would do: how does __alloc_pages_slowpath() get out of its "goto restart" loop, excepting the trivial case when the killer is the killed? I wonder why this issue has hit azurIt and no other reporter? No swap plays a part in it, but that's not so unusual. Yours glOOMily, Hugh > --- > include/linux/memcontrol.h | 5 +++-- > mm/filemap.c | 9 +++++++-- > mm/memcontrol.c | 20 ++++++++++---------- > mm/shmem.c | 17 ++++++++++++++--- > 4 files changed, 34 insertions(+), 17 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 095d2b4..8f48d5e 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -63,7 +63,7 @@ extern void mem_cgroup_commit_charge_swapin(struct page *page, > extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg); > > extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, > - gfp_t gfp_mask); > + gfp_t gfp_mask, bool oom); > > struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *); > struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *); > @@ -210,7 +210,8 @@ static inline int mem_cgroup_newpage_charge(struct page *page, > } > > static inline int mem_cgroup_cache_charge(struct page *page, > - struct mm_struct *mm, gfp_t gfp_mask) > + struct mm_struct *mm, gfp_t gfp_mask, > + 
bool oom) > { > return 0; > } > diff --git a/mm/filemap.c b/mm/filemap.c > index 83efee7..ef8fbd5 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -447,8 +447,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, > VM_BUG_ON(!PageLocked(page)); > VM_BUG_ON(PageSwapBacked(page)); > > - error = mem_cgroup_cache_charge(page, current->mm, > - gfp_mask & GFP_RECLAIM_MASK); > + /* > + * Cannot trigger OOM even if gfp_mask would allow that normally > + * because we might be called from a locked context and that > + * could lead to deadlocks if the killed process is waiting for > + * the same lock. > + */ > + error = mem_cgroup_cache_charge(page, current->mm, gfp_mask, false); > if (error) > goto out; > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 02ee2f7..3c9b1c5 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -3709,11 +3709,10 @@ out: > * < 0 if the cgroup is over its limit > */ > static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm, > - gfp_t gfp_mask, enum charge_type ctype) > + gfp_t gfp_mask, enum charge_type ctype, bool oom) > { > struct mem_cgroup *memcg = NULL; > unsigned int nr_pages = 1; > - bool oom = true; > int ret; > > if (PageTransHuge(page)) { > @@ -3742,7 +3741,7 @@ int mem_cgroup_newpage_charge(struct page *page, > VM_BUG_ON(page->mapping && !PageAnon(page)); > VM_BUG_ON(!mm); > return mem_cgroup_charge_common(page, mm, gfp_mask, > - MEM_CGROUP_CHARGE_TYPE_ANON); > + MEM_CGROUP_CHARGE_TYPE_ANON, true); > } > > /* > @@ -3754,7 +3753,8 @@ int mem_cgroup_newpage_charge(struct page *page, > static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm, > struct page *page, > gfp_t mask, > - struct mem_cgroup **memcgp) > + struct mem_cgroup **memcgp, > + bool oom) > { > struct mem_cgroup *memcg; > struct page_cgroup *pc; > @@ -3776,13 +3776,13 @@ static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm, > if (!memcg) > goto charge_cur_mm; > *memcgp = memcg; > - ret = 
__mem_cgroup_try_charge(NULL, mask, 1, memcgp, true); > + ret = __mem_cgroup_try_charge(NULL, mask, 1, memcgp, oom); > css_put(&memcg->css); > if (ret == -EINTR) > ret = 0; > return ret; > charge_cur_mm: > - ret = __mem_cgroup_try_charge(mm, mask, 1, memcgp, true); > + ret = __mem_cgroup_try_charge(mm, mask, 1, memcgp, oom); > if (ret == -EINTR) > ret = 0; > return ret; > @@ -3808,7 +3808,7 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page, > ret = 0; > return ret; > } > - return __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, memcgp); > + return __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, memcgp, true); > } > > void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg) > @@ -3851,7 +3851,7 @@ void mem_cgroup_commit_charge_swapin(struct page *page, > } > > int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, > - gfp_t gfp_mask) > + gfp_t gfp_mask, bool oom) > { > struct mem_cgroup *memcg = NULL; > enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE; > @@ -3863,10 +3863,10 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, > return 0; > > if (!PageSwapCache(page)) > - ret = mem_cgroup_charge_common(page, mm, gfp_mask, type); > + ret = mem_cgroup_charge_common(page, mm, gfp_mask, type, oom); > else { /* page is swapcache/shmem */ > ret = __mem_cgroup_try_charge_swapin(mm, page, > - gfp_mask, &memcg); > + gfp_mask, &memcg, oom); > if (!ret) > __mem_cgroup_commit_charge_swapin(page, memcg, type); > } > diff --git a/mm/shmem.c b/mm/shmem.c > index 55054a7..3b27db4 100644 > --- a/mm/shmem.c > +++ b/mm/shmem.c > @@ -760,7 +760,7 @@ int shmem_unuse(swp_entry_t swap, struct page *page) > * the shmem_swaplist_mutex which might hold up shmem_writepage(). > * Charged back to the user (not to caller) when swap account is used. 
> */ > - error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL); > + error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL, true); > if (error) > goto out; > /* No radix_tree_preload: swap entry keeps a place for page in tree */ > @@ -1152,8 +1152,17 @@ repeat: > goto failed; > } > > + /* > + * Cannot trigger OOM even if gfp_mask would allow that > + * normally because we might be called from a locked > + * context (i_mutex held) if this is a write lock or > + * fallocate and that could lead to deadlocks if the > + * killed process is waiting for the same lock. > + */ > error = mem_cgroup_cache_charge(page, current->mm, > - gfp & GFP_RECLAIM_MASK); > + gfp & GFP_RECLAIM_MASK, > + sgp != SGP_WRITE && > + sgp != SGP_FALLOC); > if (!error) { > error = shmem_add_to_page_cache(page, mapping, index, > gfp, swp_to_radix_entry(swap)); > @@ -1209,7 +1218,9 @@ repeat: > SetPageSwapBacked(page); > __set_page_locked(page); > error = mem_cgroup_cache_charge(page, current->mm, > - gfp & GFP_RECLAIM_MASK); > + gfp & GFP_RECLAIM_MASK, > + sgp != SGP_WRITE && > + sgp != SGP_FALLOC); > if (error) > goto decused; > error = radix_tree_preload(gfp & GFP_RECLAIM_MASK); > -- > 1.7.10.4 > > -- > Michal Hocko > SUSE Labs > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx134.postini.com [74.125.245.134]) by kanga.kvack.org (Postfix) with SMTP id AC6986B0044 for ; Thu, 29 Nov 2012 20:45:14 -0500 (EST)
Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
Date: Fri, 30 Nov 2012 02:45:12 +0100
From: "azurIt"
References: <20121122214249.GA20319@dhcp22.suse.cz>, <20121122233434.3D5E35E6@pobox.sk>, <20121123074023.GA24698@dhcp22.suse.cz>, <20121123102137.10D6D653@pobox.sk>, <20121123100438.GF24698@dhcp22.suse.cz>, <20121125011047.7477BB5E@pobox.sk>, <20121125120524.GB10623@dhcp22.suse.cz>, <20121125135542.GE10623@dhcp22.suse.cz>, <20121126013855.AF118F5E@pobox.sk>, <20121126131837.GC17860@dhcp22.suse.cz> <20121126132149.GD17860@dhcp22.suse.cz>
In-Reply-To: <20121126132149.GD17860@dhcp22.suse.cz>
MIME-Version: 1.0
Message-Id: <20121130024512.EBFBD851@pobox.sk>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Michal Hocko
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner

>Here we go with the patch for 3.2.34. Could you test with this one,
>please?

I installed a kernel with this patch and will report back if the problem occurs again, or in a few weeks if everything is OK.

Thank you!

azurIt

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx130.postini.com [74.125.245.130]) by kanga.kvack.org (Postfix) with SMTP id 2E90A6B00A0 for ; Fri, 30 Nov 2012 07:45:09 -0500 (EST) Date: Fri, 30 Nov 2012 13:45:06 +0100 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121130124506.GH29317@dhcp22.suse.cz> References: <20121123074023.GA24698@dhcp22.suse.cz> <20121123102137.10D6D653@pobox.sk> <20121123100438.GF24698@dhcp22.suse.cz> <20121125011047.7477BB5E@pobox.sk> <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <20121126132149.GD17860@dhcp22.suse.cz> <20121130032918.59B3F780@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121130032918.59B3F780@pobox.sk> Sender: owner-linux-mm@kvack.org List-ID: To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Fri 30-11-12 03:29:18, azurIt wrote: > >Here we go with the patch for 3.2.34. Could you test with this one, > >please? > > > Michal, unfortunately i had to boot to another kernel because the one > with this patch keeps killing my MySQL server :( it was, probably, > doing it on OOM in any cgroup - looks like OOM was not choosing > processes only from cgroup which is out of memory. Here is the log > from syslog: http://www.watchdog.sk/lkml/oom_mysqld You are seeing also global OOM: Nov 30 02:53:56 server01 kernel: [ 818.233159] Pid: 9247, comm: apache2 Not tainted 3.2.34-grsec #1 Nov 30 02:53:56 server01 kernel: [ 818.233289] Call Trace: Nov 30 02:53:56 server01 kernel: [ 818.233470] [] dump_header+0x7e/0x1e0 Nov 30 02:53:56 server01 kernel: [ 818.233600] [] ? 
find_lock_task_mm+0x2f/0x70 Nov 30 02:53:56 server01 kernel: [ 818.233721] [] oom_kill_process+0x85/0x2a0 Nov 30 02:53:56 server01 kernel: [ 818.233842] [] out_of_memory+0xe5/0x200 Nov 30 02:53:56 server01 kernel: [ 818.233963] [] ? pte_alloc_one+0x3f/0x50 Nov 30 02:53:56 server01 kernel: [ 818.234082] [] pagefault_out_of_memory+0xbd/0x110 Nov 30 02:53:56 server01 kernel: [ 818.234204] [] mm_fault_error+0xb6/0x1a0 Nov 30 02:53:56 server01 kernel: [ 818.235886] [] do_page_fault+0x3ee/0x460 Nov 30 02:53:56 server01 kernel: [ 818.236006] [] ? vma_merge+0x1f7/0x2c0 Nov 30 02:53:56 server01 kernel: [ 818.236124] [] ? do_brk+0x267/0x400 Nov 30 02:53:56 server01 kernel: [ 818.236244] [] ? gr_learn_resource+0x42/0x1e0 Nov 30 02:53:56 server01 kernel: [ 818.236367] [] page_fault+0x1f/0x30 [...] Nov 30 02:53:56 server01 kernel: [ 818.356297] Out of memory: Kill process 2188 (mysqld) score 60 or sacrifice child Nov 30 02:53:56 server01 kernel: [ 818.356493] Killed process 2188 (mysqld) total-vm:3330016kB, anon-rss:864176kB, file-rss:8072kB Then you also have memcg oom killer: Nov 30 02:53:56 server01 kernel: [ 818.375717] Task in /1037/uid killed as a result of limit of /1037 Nov 30 02:53:56 server01 kernel: [ 818.375886] memory: usage 102400kB, limit 102400kB, failcnt 736 Nov 30 02:53:56 server01 kernel: [ 818.376008] memory+swap: usage 102400kB, limit 102400kB, failcnt 0 The messages are intermixed and I guess rate limitting jumped in as well, because I cannot associate all the oom messages to a specific OOM event. Anyway your system is under both global and local memory pressure. You didn't see apache going down previously because it was probably the one which was stuck and could be killed. Anyway you need to setup your system more carefully. > Maybe i should mention that MySQL server has it's own cgroup (called > 'mysql') but with no limits to any resources. Where is that group in the hierarchy? 
>
> azurIt
> --
> To unsubscribe from this list: send the line "unsubscribe cgroups" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx155.postini.com [74.125.245.155]) by kanga.kvack.org (Postfix) with SMTP id 9D1646B00A2 for ; Fri, 30 Nov 2012 07:53:32 -0500 (EST)
Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
Date: Fri, 30 Nov 2012 13:53:30 +0100
From: "azurIt"
References: <20121123074023.GA24698@dhcp22.suse.cz>, <20121123102137.10D6D653@pobox.sk>, <20121123100438.GF24698@dhcp22.suse.cz>, <20121125011047.7477BB5E@pobox.sk>, <20121125120524.GB10623@dhcp22.suse.cz>, <20121125135542.GE10623@dhcp22.suse.cz>, <20121126013855.AF118F5E@pobox.sk>, <20121126131837.GC17860@dhcp22.suse.cz>, <20121126132149.GD17860@dhcp22.suse.cz>, <20121130032918.59B3F780@pobox.sk> <20121130124506.GH29317@dhcp22.suse.cz>
In-Reply-To: <20121130124506.GH29317@dhcp22.suse.cz>
MIME-Version: 1.0
Message-Id: <20121130135330.6D012B71@pobox.sk>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Michal Hocko
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner

>Anyway your system is under both global and local memory pressure. You
>didn't see apache going down previously because it was probably the one
>which was stuck and could be killed.
>Anyway you need to setup your system more carefully.

No, it wasn't, i'm 1000% sure (i was on SSH).
Here is the memory usage graph from that system at that time:
http://www.watchdog.sk/lkml/memory.png

The blank part is the reboot into the new kernel. The MySQL server was killed several times, then i rebooted into the previous kernel and the problem was gone (not a single MySQL kill). You can see two MySQL kills there on 03:54 and 03:04:30.

>
>> Maybe i should mention that MySQL server has it's own cgroup (called
>> 'mysql') but with no limits to any resources.
>
>Where is that group in the hierarchy?

In root.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx163.postini.com [74.125.245.163]) by kanga.kvack.org (Postfix) with SMTP id 210DD6B0093 for ; Fri, 30 Nov 2012 09:44:34 -0500 (EST)
Date: Fri, 30 Nov 2012 15:44:31 +0100
From: Michal Hocko
Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
Message-ID: <20121130144431.GI29317@dhcp22.suse.cz>
References: <20121123100438.GF24698@dhcp22.suse.cz> <20121125011047.7477BB5E@pobox.sk> <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <20121126132149.GD17860@dhcp22.suse.cz> <20121130032918.59B3F780@pobox.sk> <20121130124506.GH29317@dhcp22.suse.cz> <20121130144427.51A09169@pobox.sk>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20121130144427.51A09169@pobox.sk>
Sender: owner-linux-mm@kvack.org
List-ID:
To: azurIt
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner

On Fri 30-11-12 14:44:27, azurIt wrote:
> >Anyway your system is under both global and local memory pressure.
You > >didn't see apache going down previously because it was probably the one > >which was stuck and could be killed. > >Anyway you need to setup your system more carefully. > > > There is, also, an evidence that system has enough of memory! :) Just > take column 'rss' from process list in OOM message and sum it - you > will get 2489911. It's probably in KB so it's about 2.4 GB. System has > 14 GB of RAM so this also match data on my graph - 2.4 is about 17% of > 14. Hmm, that corresponds to the ZONE_DMA32 size pretty nicely but that zone is hardly touched: Nov 30 02:53:56 server01 kernel: [ 818.241291] DMA32 free:2523636kB min:2672kB low:3340kB high:4008kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2542248kB mlocked:0kB dirty:0kB writeback:0kB mapped:4kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no DMA32 zone is usually fills up first 4G unless your HW remaps the rest of the memory above 4G or you have a numa machine and the rest of the memory is at other node. Could you post your memory map printed during the boot? (e820: BIOS-provided physical RAM map: and following lines) There is also ZONE_NORMAL which is also not used much Nov 30 02:53:56 server01 kernel: [ 818.242163] Normal free:6924716kB min:12512kB low:15640kB high:18768kB active_anon:1463128kB inactive_anon:2072kB active_file:1803964kB inactive_file:1072628kB unevictable:3924kB isolated(anon):0kB isolated(file):0kB present:11893760kB mlocked:3924kB dirty:1000kB writeback:776kB mapped:35656kB shmem:3828kB slab_reclaimable:202560kB slab_unreclaimable:50696kB kernel_stack:2944kB pagetables:158616kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no You have mentioned that you are comounting with cpuset. 
If this happens to be a NUMA machine have you made the access to all nodes available? Also what does /proc/sys/vm/zone_reclaim_mode says? -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx182.postini.com [74.125.245.182]) by kanga.kvack.org (Postfix) with SMTP id 8F8B06B00A3 for ; Fri, 30 Nov 2012 10:03:49 -0500 (EST) Date: Fri, 30 Nov 2012 16:03:47 +0100 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121130150347.GJ29317@dhcp22.suse.cz> References: <20121125011047.7477BB5E@pobox.sk> <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <20121126132149.GD17860@dhcp22.suse.cz> <20121130032918.59B3F780@pobox.sk> <20121130124506.GH29317@dhcp22.suse.cz> <20121130144427.51A09169@pobox.sk> <20121130144431.GI29317@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121130144431.GI29317@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Fri 30-11-12 15:44:31, Michal Hocko wrote: > On Fri 30-11-12 14:44:27, azurIt wrote: > > >Anyway your system is under both global and local memory pressure. You > > >didn't see apache going down previously because it was probably the one > > >which was stuck and could be killed. > > >Anyway you need to setup your system more carefully. > > > > > > There is, also, an evidence that system has enough of memory! :) Just > > take column 'rss' from process list in OOM message and sum it - you > > will get 2489911. 
It's probably in KB so it's about 2.4 GB. System has > > 14 GB of RAM so this also match data on my graph - 2.4 is about 17% of > > 14. > > Hmm, that corresponds to the ZONE_DMA32 size pretty nicely but that zone > is hardly touched: > Nov 30 02:53:56 server01 kernel: [ 818.241291] DMA32 free:2523636kB min:2672kB low:3340kB high:4008kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2542248kB mlocked:0kB dirty:0kB writeback:0kB mapped:4kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no > > DMA32 zone is usually fills up first 4G unless your HW remaps the rest > of the memory above 4G or you have a numa machine and the rest of the > memory is at other node. Could you post your memory map printed during > the boot? (e820: BIOS-provided physical RAM map: and following lines) > > There is also ZONE_NORMAL which is also not used much > Nov 30 02:53:56 server01 kernel: [ 818.242163] Normal free:6924716kB min:12512kB low:15640kB high:18768kB active_anon:1463128kB inactive_anon:2072kB active_file:1803964kB inactive_file:1072628kB unevictable:3924kB isolated(anon):0kB isolated(file):0kB present:11893760kB mlocked:3924kB dirty:1000kB writeback:776kB mapped:35656kB shmem:3828kB slab_reclaimable:202560kB slab_unreclaimable:50696kB kernel_stack:2944kB pagetables:158616kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no > > You have mentioned that you are comounting with cpuset. If this happens > to be a NUMA machine have you made the access to all nodes available? 
And now that I am looking at the oom message more closely I can see Nov 30 02:53:56 server01 kernel: [ 818.232812] apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0 Nov 30 02:53:56 server01 kernel: [ 818.233029] apache2 cpuset=uid mems_allowed=0 Nov 30 02:53:56 server01 kernel: [ 818.233159] Pid: 9247, comm: apache2 Not tainted 3.2.34-grsec #1 Nov 30 02:53:56 server01 kernel: [ 818.233289] Call Trace: Nov 30 02:53:56 server01 kernel: [ 818.233470] [] dump_header+0x7e/0x1e0 Nov 30 02:53:56 server01 kernel: [ 818.233600] [] ? find_lock_task_mm+0x2f/0x70 Nov 30 02:53:56 server01 kernel: [ 818.233721] [] oom_kill_process+0x85/0x2a0 Nov 30 02:53:56 server01 kernel: [ 818.233842] [] out_of_memory+0xe5/0x200 Nov 30 02:53:56 server01 kernel: [ 818.233963] [] ? pte_alloc_one+0x3f/0x50 Nov 30 02:53:56 server01 kernel: [ 818.234082] [] pagefault_out_of_memory+0xbd/0x110 Nov 30 02:53:56 server01 kernel: [ 818.234204] [] mm_fault_error+0xb6/0x1a0 Nov 30 02:53:56 server01 kernel: [ 818.235886] [] do_page_fault+0x3ee/0x460 Nov 30 02:53:56 server01 kernel: [ 818.236006] [] ? vma_merge+0x1f7/0x2c0 Nov 30 02:53:56 server01 kernel: [ 818.236124] [] ? do_brk+0x267/0x400 Nov 30 02:53:56 server01 kernel: [ 818.236244] [] ? gr_learn_resource+0x42/0x1e0 Nov 30 02:53:56 server01 kernel: [ 818.236367] [] page_fault+0x1f/0x30 Which is interesting from 2 perspectives. Only the first node (Node-0) is allowed which would suggest that the cpuset controller is not configured to all nodes. It is still surprising Node 0 wouldn't have any memory (I would expect ZONE_DMA32 would be sitting there). Anyway, the more interesting thing is gfp_mask is GFP_NOWAIT allocation from the page fault? Huh this shouldn't happen - ever. > Also what does /proc/sys/vm/zone_reclaim_mode says? 
> --
> Michal Hocko
> SUSE Labs
> --
> To unsubscribe from this list: send the line "unsubscribe cgroups" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx185.postini.com [74.125.245.185]) by kanga.kvack.org (Postfix) with SMTP id E4C996B00C9 for ; Fri, 30 Nov 2012 10:59:39 -0500 (EST)
Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
Date: Fri, 30 Nov 2012 16:59:37 +0100
From: "azurIt"
References: <20121125120524.GB10623@dhcp22.suse.cz>, <20121125135542.GE10623@dhcp22.suse.cz>, <20121126013855.AF118F5E@pobox.sk>, <20121126131837.GC17860@dhcp22.suse.cz>, <20121126132149.GD17860@dhcp22.suse.cz>, <20121130032918.59B3F780@pobox.sk>, <20121130124506.GH29317@dhcp22.suse.cz>, <20121130144427.51A09169@pobox.sk>, <20121130144431.GI29317@dhcp22.suse.cz>, <20121130160811.6BB25BDD@pobox.sk> <20121130153942.GL29317@dhcp22.suse.cz>
In-Reply-To: <20121130153942.GL29317@dhcp22.suse.cz>
MIME-Version: 1.0
Message-Id: <20121130165937.F9564EBE@pobox.sk>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Michal Hocko
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner

>> Here is the full boot log:
>> www.watchdog.sk/lkml/kern.log
>
>The log is not complete. Could you paste the complete dmesg output? Or
>even better, do you have logs from the previous run?

What is missing there?
All kernel messages are logged to /var/log/kern.log (it's the same as
dmesg); dmesg itself was already overwritten by other messages. I think
that is everything the kernel printed.

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx152.postini.com [74.125.245.152]) by kanga.kvack.org (Postfix) with SMTP id D27276B00CD for ; Fri, 30 Nov 2012 11:53:50 -0500 (EST)
Date: Fri, 30 Nov 2012 17:53:47 +0100
From: Michal Hocko
Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
Message-ID: <20121130165347.GO29317@dhcp22.suse.cz>
References: <20121126132149.GD17860@dhcp22.suse.cz> <20121130032918.59B3F780@pobox.sk> <20121130124506.GH29317@dhcp22.suse.cz> <20121130144427.51A09169@pobox.sk> <20121130144431.GI29317@dhcp22.suse.cz> <20121130160811.6BB25BDD@pobox.sk> <20121130153942.GL29317@dhcp22.suse.cz> <20121130165937.F9564EBE@pobox.sk> <20121130161923.GN29317@dhcp22.suse.cz> <20121130172651.B6917602@pobox.sk>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20121130172651.B6917602@pobox.sk>
Sender: owner-linux-mm@kvack.org
List-ID:
To: azurIt
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner

On Fri 30-11-12 17:26:51, azurIt wrote:
> >Could you also post your complete containers configuration, maybe there
> >is something strange in there (basically grep . -r YOUR_CGROUP_MNT
> >except for tasks files which are of no use right now).
>
>
> Here it is:
> http://www.watchdog.sk/lkml/cgroups.gz

The only strange thing I noticed is that some groups have 0 limit. Is
this intentional?
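[Editorial aside: the per-limit tally that follows in this message comes from a grep/sed/sort/uniq pipeline. The same count can be sketched in Python, assuming the dump consists of `grep . -r` style lines of the form `path/memory.limit_in_bytes:VALUE`. The function name tally_limits and the sample paths are hypothetical.]

```python
from collections import Counter


def tally_limits(lines):
    """Count cgroups per memory.limit_in_bytes value.

    Mirrors the shell pipeline: keep memory.limit_in_bytes entries,
    skip any line mentioning "uid", and group by the limit value.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        path, sep, value = line.rpartition(":")
        if not sep or "uid" in line:
            continue
        # Compare the file name only, like sed 's@.*/@@' does.
        if path.rsplit("/", 1)[-1] != "memory.limit_in_bytes":
            continue
        counts[int(value)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample lines, not taken from the posted dump.
    sample = [
        "/cgroups/a/memory.limit_in_bytes:0",
        "/cgroups/b/memory.limit_in_bytes:104857600",
        "/cgroups/c/memory.limit_in_bytes:104857600",
        "/cgroups/uid/memory.limit_in_bytes:9223372036854775807",
        "/cgroups/a/memory.usage_in_bytes:5",
    ]
    for limit, count in sorted(tally_limits(sample).items()):
        print(count, limit, f"({limit // (1024 * 1024)} MiB)")
```

[For orientation, 104857600 bytes is 100 MiB, and 9223372036854775807 is 2^63 - 1, the largest signed 64-bit value.]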
grep memory.limit_in_bytes cgroups | grep -v uid | sed 's@.*/@@' | sort | uniq -c
      3 memory.limit_in_bytes:0
    254 memory.limit_in_bytes:104857600
    107 memory.limit_in_bytes:157286400
     68 memory.limit_in_bytes:209715200
     10 memory.limit_in_bytes:262144000
     28 memory.limit_in_bytes:314572800
      1 memory.limit_in_bytes:346030080
      1 memory.limit_in_bytes:524288000
      2 memory.limit_in_bytes:9223372036854775807
--
Michal Hocko
SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx142.postini.com [74.125.245.142]) by kanga.kvack.org (Postfix) with SMTP id 5A7256B0044 for ; Tue, 4 Dec 2012 20:36:46 -0500 (EST)
Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
Date: Wed, 05 Dec 2012 02:36:44 +0100
From: "azurIt"
References: <20121126131837.GC17860@dhcp22.suse.cz>, <20121126132149.GD17860@dhcp22.suse.cz>, <20121130032918.59B3F780@pobox.sk>, <20121130124506.GH29317@dhcp22.suse.cz>, <20121130144427.51A09169@pobox.sk>, <20121130144431.GI29317@dhcp22.suse.cz>, <20121130160811.6BB25BDD@pobox.sk>, <20121130153942.GL29317@dhcp22.suse.cz>, <20121130165937.F9564EBE@pobox.sk>, <20121130161923.GN29317@dhcp22.suse.cz> <20121203151601.GA17093@dhcp22.suse.cz>
In-Reply-To: <20121203151601.GA17093@dhcp22.suse.cz>
MIME-Version: 1.0
Message-Id: <20121205023644.18C3006B@pobox.sk>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Michal Hocko
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner

>The following should print the traces when we hand over ENOMEM to the
>caller.
It should catch all charge paths (migration is not covered but >that one is not important here). If we don't see any traces from here >and there is still global OOM striking then there must be something else >to trigger this. >Could you test this with the patch which aims at fixing your deadlock, >please? I realise that this is a production environment but I do not see >anything relevant in the code. Michal, i think/hope this is what you wanted: http://www.watchdog.sk/lkml/oom_mysqld2 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx179.postini.com [74.125.245.179]) by kanga.kvack.org (Postfix) with SMTP id 528EC6B0044 for ; Wed, 5 Dec 2012 09:17:25 -0500 (EST) Date: Wed, 5 Dec 2012 15:17:22 +0100 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121205141722.GA9714@dhcp22.suse.cz> References: <20121130032918.59B3F780@pobox.sk> <20121130124506.GH29317@dhcp22.suse.cz> <20121130144427.51A09169@pobox.sk> <20121130144431.GI29317@dhcp22.suse.cz> <20121130160811.6BB25BDD@pobox.sk> <20121130153942.GL29317@dhcp22.suse.cz> <20121130165937.F9564EBE@pobox.sk> <20121130161923.GN29317@dhcp22.suse.cz> <20121203151601.GA17093@dhcp22.suse.cz> <20121205023644.18C3006B@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121205023644.18C3006B@pobox.sk> Sender: owner-linux-mm@kvack.org List-ID: To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Wed 05-12-12 02:36:44, azurIt wrote: > >The following should print the traces when we hand over ENOMEM to the > >caller. 
It should catch all charge paths (migration is not covered but
> >that one is not important here). If we don't see any traces from here
> >and there is still global OOM striking then there must be something else
> >to trigger this.
> >Could you test this with the patch which aims at fixing your deadlock,
> >please? I realise that this is a production environment but I do not see
> >anything relevant in the code.
>
>
> Michal,
>
> i think/hope this is what you wanted:
> http://www.watchdog.sk/lkml/oom_mysqld2

Dec 5 02:20:48 server01 kernel: [ 380.995947] WARNING: at mm/memcontrol.c:2400 T.1146+0x2c1/0x5d0()
Dec 5 02:20:48 server01 kernel: [ 380.995950] Hardware name: S5000VSA
Dec 5 02:20:48 server01 kernel: [ 380.995952] Pid: 5351, comm: apache2 Not tainted 3.2.34-grsec #1
Dec 5 02:20:48 server01 kernel: [ 380.995954] Call Trace:
Dec 5 02:20:48 server01 kernel: [ 380.995960] [] warn_slowpath_common+0x7a/0xb0
Dec 5 02:20:48 server01 kernel: [ 380.995963] [] warn_slowpath_null+0x1a/0x20
Dec 5 02:20:48 server01 kernel: [ 380.995965] [] T.1146+0x2c1/0x5d0
Dec 5 02:20:48 server01 kernel: [ 380.995967] [] mem_cgroup_charge_common+0x53/0x90
Dec 5 02:20:48 server01 kernel: [ 380.995970] [] mem_cgroup_newpage_charge+0x45/0x50
Dec 5 02:20:48 server01 kernel: [ 380.995974] [] handle_pte_fault+0x609/0x940
Dec 5 02:20:48 server01 kernel: [ 380.995978] [] ? pte_alloc_one+0x3f/0x50
Dec 5 02:20:48 server01 kernel: [ 380.995981] [] handle_mm_fault+0x138/0x260
Dec 5 02:20:48 server01 kernel: [ 380.995983] [] do_page_fault+0x13d/0x460
Dec 5 02:20:48 server01 kernel: [ 380.995986] [] ? do_mmap_pgoff+0x3dc/0x430
Dec 5 02:20:48 server01 kernel: [ 380.995988] [] ? remove_vma+0x5d/0x80
Dec 5 02:20:48 server01 kernel: [ 380.995992] [] page_fault+0x1f/0x30
Dec 5 02:20:48 server01 kernel: [ 380.995994] ---[ end trace 25bbb3e634c25b7f ]---
Dec 5 02:20:48 server01 kernel: [ 380.996373] apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0
Dec 5 02:20:48 server01 kernel: [ 380.996377] apache2 cpuset=uid mems_allowed=0
Dec 5 02:20:48 server01 kernel: [ 380.996379] Pid: 5351, comm: apache2 Tainted: G W 3.2.34-grsec #1
Dec 5 02:20:48 server01 kernel: [ 380.996380] Call Trace:
Dec 5 02:20:48 server01 kernel: [ 380.996384] [] dump_header+0x7e/0x1e0
Dec 5 02:20:48 server01 kernel: [ 380.996387] [] ? find_lock_task_mm+0x2f/0x70
Dec 5 02:20:48 server01 kernel: [ 380.996389] [] oom_kill_process+0x85/0x2a0
Dec 5 02:20:48 server01 kernel: [ 380.996392] [] out_of_memory+0xe5/0x200
Dec 5 02:20:48 server01 kernel: [ 380.996394] [] ? pte_alloc_one+0x3f/0x50
Dec 5 02:20:48 server01 kernel: [ 380.996397] [] pagefault_out_of_memory+0xbd/0x110
Dec 5 02:20:48 server01 kernel: [ 380.996399] [] mm_fault_error+0xb6/0x1a0
Dec 5 02:20:48 server01 kernel: [ 380.996401] [] do_page_fault+0x3ee/0x460
Dec 5 02:20:48 server01 kernel: [ 380.996403] [] ? do_mmap_pgoff+0x3dc/0x430
Dec 5 02:20:48 server01 kernel: [ 380.996405] [] ? remove_vma+0x5d/0x80
Dec 5 02:20:48 server01 kernel: [ 380.996408] [] page_fault+0x1f/0x30

OK, so the ENOMEM seems to be leaking from mem_cgroup_newpage_charge.
This can only happen if this was an atomic allocation request
(!__GFP_WAIT) or if oom is not allowed, which is the case only for
transparent huge page allocation.
The first case can be excluded (in the clean 3.2 stable kernel) because
all callers of mem_cgroup_newpage_charge use GFP_KERNEL. The latter one
should be OK because the page fault should fall back to a regular page if
THP allocation/charge fails.
[/me goes to double check]
Hmm, do_huge_pmd_wp_page seems to charge a huge page and fails with
VM_FAULT_OOM without any fallback.
We should do_huge_pmd_wp_page_fallback instead. This has been fixed in 3.5-rc1 by 1f1d06c3 (thp, memcg: split hugepage for memcg oom on cow) but it hasn't been backported to 3.2. The patch applies to 3.2 without any further modifications. I didn't have time to test it but if it helps you we should push this to the stable tree. --- From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx181.postini.com [74.125.245.181]) by kanga.kvack.org (Postfix) with SMTP id BCDE16B005D for ; Wed, 5 Dec 2012 19:29:26 -0500 (EST) Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_from_add=5Fto=5Fpage=5Fcache=5Flocked?= Date: Thu, 06 Dec 2012 01:29:24 +0100 From: "azurIt" References: <20121130032918.59B3F780@pobox.sk>, <20121130124506.GH29317@dhcp22.suse.cz>, <20121130144427.51A09169@pobox.sk>, <20121130144431.GI29317@dhcp22.suse.cz>, <20121130160811.6BB25BDD@pobox.sk>, <20121130153942.GL29317@dhcp22.suse.cz>, <20121130165937.F9564EBE@pobox.sk>, <20121130161923.GN29317@dhcp22.suse.cz>, <20121203151601.GA17093@dhcp22.suse.cz>, <20121205023644.18C3006B@pobox.sk> <20121205141722.GA9714@dhcp22.suse.cz> In-Reply-To: <20121205141722.GA9714@dhcp22.suse.cz> MIME-Version: 1.0 Message-Id: <20121206012924.FE077FD7@pobox.sk> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org List-ID: To: =?utf-8?q?Michal_Hocko?= Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= >OK, so the ENOMEM seems to be leaking from mem_cgroup_newpage_charge. >This can only happen if this was an atomic allocation request >(!__GFP_WAIT) or if oom is not allowed which is the case only for >transparent huge page allocation. >The first case can be excluded (in the clean 3.2 stable kernel) because >all callers of mem_cgroup_newpage_charge use GFP_KERNEL. 
The later one >should be OK because the page fault should fallback to a regular page if >THP allocation/charge fails. >[/me goes to double check] >Hmm do_huge_pmd_wp_page seems to charge a huge page and fails with >VM_FAULT_OOM without any fallback. We should do_huge_pmd_wp_page_fallback >instead. This has been fixed in 3.5-rc1 by 1f1d06c3 (thp, memcg: split >hugepage for memcg oom on cow) but it hasn't been backported to 3.2. The >patch applies to 3.2 without any further modifications. I didn't have >time to test it but if it helps you we should push this to the stable >tree. This, unfortunately, didn't fix the problem :( http://www.watchdog.sk/lkml/oom_mysqld3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx130.postini.com [74.125.245.130]) by kanga.kvack.org (Postfix) with SMTP id 608FF6B0068 for ; Thu, 6 Dec 2012 04:54:26 -0500 (EST) Date: Thu, 6 Dec 2012 10:54:23 +0100 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121206095423.GB10931@dhcp22.suse.cz> References: <20121130144427.51A09169@pobox.sk> <20121130144431.GI29317@dhcp22.suse.cz> <20121130160811.6BB25BDD@pobox.sk> <20121130153942.GL29317@dhcp22.suse.cz> <20121130165937.F9564EBE@pobox.sk> <20121130161923.GN29317@dhcp22.suse.cz> <20121203151601.GA17093@dhcp22.suse.cz> <20121205023644.18C3006B@pobox.sk> <20121205141722.GA9714@dhcp22.suse.cz> <20121206012924.FE077FD7@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121206012924.FE077FD7@pobox.sk> Sender: owner-linux-mm@kvack.org List-ID: To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Thu 06-12-12 01:29:24, azurIt 
wrote:
> >OK, so the ENOMEM seems to be leaking from mem_cgroup_newpage_charge.
> >This can only happen if this was an atomic allocation request
> >(!__GFP_WAIT) or if oom is not allowed which is the case only for
> >transparent huge page allocation.
> >The first case can be excluded (in the clean 3.2 stable kernel) because
> >all callers of mem_cgroup_newpage_charge use GFP_KERNEL. The later one
> >should be OK because the page fault should fallback to a regular page if
> >THP allocation/charge fails.
> >[/me goes to double check]
> >Hmm do_huge_pmd_wp_page seems to charge a huge page and fails with
> >VM_FAULT_OOM without any fallback. We should do_huge_pmd_wp_page_fallback
> >instead. This has been fixed in 3.5-rc1 by 1f1d06c3 (thp, memcg: split
> >hugepage for memcg oom on cow) but it hasn't been backported to 3.2. The
> >patch applies to 3.2 without any further modifications. I didn't have
> >time to test it but if it helps you we should push this to the stable
> >tree.
>
>
> This, unfortunately, didn't fix the problem :(
> http://www.watchdog.sk/lkml/oom_mysqld3

Dohh. The very same stack mem_cgroup_newpage_charge called from the page
fault. The heavy inlining is not particularly helping here... So there
must be some other THP charge leaking out.
[/me is diving into the code again]

* do_huge_pmd_anonymous_page falls back to handle_pte_fault
* do_huge_pmd_wp_page_fallback falls back to simple pages so it doesn't
  charge the huge page
* do_huge_pmd_wp_page splits the huge page and retries with fallback to
  handle_pte_fault
* collapse_huge_page is not called in the page fault path
* do_wp_page, do_anonymous_page and __do_fault operate on a single page
  so the memcg charging cannot return ENOMEM

There are no other callers AFAICS so I am getting clueless. Maybe more
debugging will tell us something (the inlining has been reduced for thp
paths which can reduce performance in thp page fault heavy workloads but
this will give us better traces - I hope).
Anyway do you see the same problem if transparent huge pages are disabled? echo never > /sys/kernel/mm/transparent_hugepage/enabled) --- From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx198.postini.com [74.125.245.198]) by kanga.kvack.org (Postfix) with SMTP id 1894E6B006E for ; Thu, 6 Dec 2012 05:12:52 -0500 (EST) Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_from_add=5Fto=5Fpage=5Fcache=5Flocked?= Date: Thu, 06 Dec 2012 11:12:49 +0100 From: "azurIt" References: <20121130144427.51A09169@pobox.sk>, <20121130144431.GI29317@dhcp22.suse.cz>, <20121130160811.6BB25BDD@pobox.sk>, <20121130153942.GL29317@dhcp22.suse.cz>, <20121130165937.F9564EBE@pobox.sk>, <20121130161923.GN29317@dhcp22.suse.cz>, <20121203151601.GA17093@dhcp22.suse.cz>, <20121205023644.18C3006B@pobox.sk>, <20121205141722.GA9714@dhcp22.suse.cz>, <20121206012924.FE077FD7@pobox.sk> <20121206095423.GB10931@dhcp22.suse.cz> In-Reply-To: <20121206095423.GB10931@dhcp22.suse.cz> MIME-Version: 1.0 Message-Id: <20121206111249.58F013EA@pobox.sk> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org List-ID: To: =?utf-8?q?Michal_Hocko?= Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= >Dohh. The very same stack mem_cgroup_newpage_charge called from the page >fault. The heavy inlining is not particularly helping here... So there >must be some other THP charge leaking out. 
>[/me is diving into the code again] > >* do_huge_pmd_anonymous_page falls back to handle_pte_fault >* do_huge_pmd_wp_page_fallback falls back to simple pages so it doesn't > charge the huge page >* do_huge_pmd_wp_page splits the huge page and retries with fallback to > handle_pte_fault >* collapse_huge_page is not called in the page fault path >* do_wp_page, do_anonymous_page and __do_fault operate on a single page > so the memcg charging cannot return ENOMEM > >There are no other callers AFAICS so I am getting clueless. Maybe more >debugging will tell us something (the inlining has been reduced for thp >paths which can reduce performance in thp page fault heavy workloads but >this will give us better traces - I hope). Should i apply all patches togather? (fix for this bug, more log messages, backported fix from 3.5 and this new one) >Anyway do you see the same problem if transparent huge pages are >disabled? >echo never > /sys/kernel/mm/transparent_hugepage/enabled) # cat /sys/kernel/mm/transparent_hugepage/enabled cat: /sys/kernel/mm/transparent_hugepage/enabled: No such file or directory -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx165.postini.com [74.125.245.165]) by kanga.kvack.org (Postfix) with SMTP id 468546B005A for ; Sun, 9 Dec 2012 20:20:41 -0500 (EST) Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_from_add=5Fto=5Fpage=5Fcache=5Flocked?= Date: Mon, 10 Dec 2012 02:20:38 +0100 From: "azurIt" References: <20121130144427.51A09169@pobox.sk>, <20121130144431.GI29317@dhcp22.suse.cz>, <20121130160811.6BB25BDD@pobox.sk>, <20121130153942.GL29317@dhcp22.suse.cz>, <20121130165937.F9564EBE@pobox.sk>, <20121130161923.GN29317@dhcp22.suse.cz>, <20121203151601.GA17093@dhcp22.suse.cz>, <20121205023644.18C3006B@pobox.sk>, <20121205141722.GA9714@dhcp22.suse.cz>, <20121206012924.FE077FD7@pobox.sk> <20121206095423.GB10931@dhcp22.suse.cz> In-Reply-To: <20121206095423.GB10931@dhcp22.suse.cz> MIME-Version: 1.0 Message-Id: <20121210022038.E6570D37@pobox.sk> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org List-ID: To: =?utf-8?q?Michal_Hocko?= Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= >There are no other callers AFAICS so I am getting clueless. Maybe more >debugging will tell us something (the inlining has been reduced for thp >paths which can reduce performance in thp page fault heavy workloads but >this will give us better traces - I hope). Michal, this was printing so many debug messages to console that the whole server hangs and i had to hard reset it after several minutes :( Sorry but i cannot test such a things in production. There's no problem with one soft reset which takes 4 minutes but this hard reset creates about 20 minutes outage (mainly cos of disk quotas checking). 
Last logged message: Dec 10 02:03:29 server01 kernel: [ 220.366486] grsec: From 141.105.120.152: bruteforce prevention initiated for the next 30 minutes or until service restarted, stalling each fork 30 seconds. Please investigate the crash report for /usr/lib/apache2/mpm-itk/apache2[apache2:3586] uid/euid:1258/1258 gid/egid:100/100, parent /usr/lib/apache2/mpm-itk/apache2[apache2:2142] uid/euid:0/0 gid/egid:0/0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx140.postini.com [74.125.245.140]) by kanga.kvack.org (Postfix) with SMTP id B31356B005A for ; Mon, 10 Dec 2012 04:43:41 -0500 (EST) Date: Mon, 10 Dec 2012 10:43:38 +0100 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121210094318.GA6777@dhcp22.suse.cz> References: <20121130160811.6BB25BDD@pobox.sk> <20121130153942.GL29317@dhcp22.suse.cz> <20121130165937.F9564EBE@pobox.sk> <20121130161923.GN29317@dhcp22.suse.cz> <20121203151601.GA17093@dhcp22.suse.cz> <20121205023644.18C3006B@pobox.sk> <20121205141722.GA9714@dhcp22.suse.cz> <20121206012924.FE077FD7@pobox.sk> <20121206095423.GB10931@dhcp22.suse.cz> <20121210022038.E6570D37@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121210022038.E6570D37@pobox.sk> Sender: owner-linux-mm@kvack.org List-ID: To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Mon 10-12-12 02:20:38, azurIt wrote: [...] > Michal, Hi, > this was printing so many debug messages to console that the whole > server hangs Hmm, this is _really_ surprising. The latest patch didn't add any new logging actually. 
It just enhanced messages which were already printed out previously and
changed a few functions to be not inlined so they show up in the traces.
So the only explanation is that the workload has changed or the patches
got misapplied.

> and i had to hard reset it after several minutes :( Sorry
> but i cannot test such a things in production. There's no problem with
> one soft reset which takes 4 minutes but this hard reset creates about
> 20 minutes outage (mainly cos of disk quotas checking).

Understood.

> Last logged message:
>
> Dec 10 02:03:29 server01 kernel: [ 220.366486] grsec: From 141.105.120.152: bruteforce prevention initiated for the next 30 minutes or until service restarted, stalling each fork 30 seconds. Please investigate the crash report for /usr/lib/apache2/mpm-itk/apache2[apache2:3586] uid/euid:1258/1258 gid/egid:100/100, parent /usr/lib/apache2/mpm-itk/apache2[apache2:2142] uid/euid:0/0 gid/egid:0/0

This explains why you have seen your machine hung. I am not familiar
with grsec but stalling each fork for 30s sounds really bad.

Anyway, this will not help me much. Do you happen to still have any of
those logged traces from the last run?

Apart from that, if my current understanding is correct then this is
related to transparent huge pages (and leaking charge to the page fault
handler). Do you see the same problem if you disable THP before you
start your workload? (echo never > /sys/kernel/mm/transparent_hugepage/enabled)
--
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx196.postini.com [74.125.245.196]) by kanga.kvack.org (Postfix) with SMTP id 32E776B005A for ; Mon, 10 Dec 2012 05:18:20 -0500 (EST) Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_from_add=5Fto=5Fpage=5Fcache=5Flocked?= Date: Mon, 10 Dec 2012 11:18:17 +0100 From: "azurIt" References: <20121130160811.6BB25BDD@pobox.sk>, <20121130153942.GL29317@dhcp22.suse.cz>, <20121130165937.F9564EBE@pobox.sk>, <20121130161923.GN29317@dhcp22.suse.cz>, <20121203151601.GA17093@dhcp22.suse.cz>, <20121205023644.18C3006B@pobox.sk>, <20121205141722.GA9714@dhcp22.suse.cz>, <20121206012924.FE077FD7@pobox.sk>, <20121206095423.GB10931@dhcp22.suse.cz>, <20121210022038.E6570D37@pobox.sk> <20121210094318.GA6777@dhcp22.suse.cz> In-Reply-To: <20121210094318.GA6777@dhcp22.suse.cz> MIME-Version: 1.0 Message-Id: <20121210111817.F697F53E@pobox.sk> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org List-ID: To: =?utf-8?q?Michal_Hocko?= Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= >Hmm, this is _really_ surprising. The latest patch didn't add any new >logging actually. It just enahanced messages which were already printed >out previously + changed few functions to be not inlined so they show up >in the traces. So the only explanation is that the workload has changed >or the patches got misapplied. This time i installed 3.2.35, maybe some changes between .34 and .35 did this? Should i try .34? >> Dec 10 02:03:29 server01 kernel: [ 220.366486] grsec: From 141.105.120.152: bruteforce prevention initiated for the next 30 minutes or until service restarted, stalling each fork 30 seconds. 
Please investigate the crash report for /usr/lib/apache2/mpm-itk/apache2[apache2:3586] uid/euid:1258/1258 gid/egid:100/100, parent /usr/lib/apache2/mpm-itk/apache2[apache2:2142] uid/euid:0/0 gid/egid:0/0 > >This explains why you have seen your machine hung. I am not familiar >with grsec but stalling each fork 30s sounds really bad. Btw, i never ever saw such a message from grsecurity yet. Will write to grsec mailing list about explanation. >Anyway this will not help me much. Do you happen to still have any of >those logged traces from the last run? Unfortunately not, it didn't log anything and tons of messages were printed only to console (i was logged via IP-KVM). It looked that printing is infinite, i rebooted it after few minutes. >Apart from that. If my current understanding is correct then this is >related to transparent huge pages (and leaking charge to the page fault >handler). Do you see the same problem if you disable THP before you >start your workload? (echo never > /sys/kernel/mm/transparent_hugepage/enabled) # cat /sys/kernel/mm/transparent_hugepage/enabled cat: /sys/kernel/mm/transparent_hugepage/enabled: No such file or directory # ls -la /sys/kernel/mm total 0 drwx------ 3 root root 0 Dec 10 11:11 . drwx------ 5 root root 0 Dec 10 02:06 .. drwx------ 2 root root 0 Dec 10 11:11 cleancache -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx148.postini.com [74.125.245.148]) by kanga.kvack.org (Postfix) with SMTP id 067E26B005A for ; Mon, 10 Dec 2012 10:52:07 -0500 (EST) Date: Mon, 10 Dec 2012 16:52:05 +0100 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121210155205.GB6777@dhcp22.suse.cz> References: <20121130165937.F9564EBE@pobox.sk> <20121130161923.GN29317@dhcp22.suse.cz> <20121203151601.GA17093@dhcp22.suse.cz> <20121205023644.18C3006B@pobox.sk> <20121205141722.GA9714@dhcp22.suse.cz> <20121206012924.FE077FD7@pobox.sk> <20121206095423.GB10931@dhcp22.suse.cz> <20121210022038.E6570D37@pobox.sk> <20121210094318.GA6777@dhcp22.suse.cz> <20121210111817.F697F53E@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121210111817.F697F53E@pobox.sk> Sender: owner-linux-mm@kvack.org List-ID: To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Mon 10-12-12 11:18:17, azurIt wrote: > >Hmm, this is _really_ surprising. The latest patch didn't add any new > >logging actually. It just enahanced messages which were already printed > >out previously + changed few functions to be not inlined so they show up > >in the traces. So the only explanation is that the workload has changed > >or the patches got misapplied. > > > This time i installed 3.2.35, maybe some changes between .34 and .35 > did this? Should i try .34? I would try to limit changes to minimum. So the original kernel you were using + the first patch to prevent OOM from the write path + 2 debugging patches. > >> Dec 10 02:03:29 server01 kernel: [ 220.366486] grsec: From 141.105.120.152: bruteforce prevention initiated for the next 30 minutes or until service restarted, stalling each fork 30 seconds. 
Please investigate the crash report for /usr/lib/apache2/mpm-itk/apache2[apache2:3586] uid/euid:1258/1258 gid/egid:100/100, parent /usr/lib/apache2/mpm-itk/apache2[apache2:2142] uid/euid:0/0 gid/egid:0/0 > > > >This explains why you have seen your machine hung. I am not familiar > >with grsec but stalling each fork 30s sounds really bad. > > > Btw, i never ever saw such a message from grsecurity yet. Will write to grsec mailing list about explanation. > > > >Anyway this will not help me much. Do you happen to still have any of > >those logged traces from the last run? > > > Unfortunately not, it didn't log anything and tons of messages were > printed only to console (i was logged via IP-KVM). It looked that > printing is infinite, i rebooted it after few minutes. But was it at least related to the debugging from the patch or it was rather a totally unrelated thing? > >Apart from that. If my current understanding is correct then this is > >related to transparent huge pages (and leaking charge to the page fault > >handler). Do you see the same problem if you disable THP before you > >start your workload? (echo never > /sys/kernel/mm/transparent_hugepage/enabled) > > # cat /sys/kernel/mm/transparent_hugepage/enabled > cat: /sys/kernel/mm/transparent_hugepage/enabled: No such file or directory Weee. Then it cannot be related to THP at all. Which makes this even bigger mystery. We really need to find out who is leaking that charge. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx179.postini.com [74.125.245.179]) by kanga.kvack.org (Postfix) with SMTP id BFD4A6B005A for ; Mon, 17 Dec 2012 14:55:24 -0500 (EST)
Received: by mail-ea0-f169.google.com with SMTP id a12so2746342eaa.14 for ; Mon, 17 Dec 2012 11:55:23 -0800 (PST)
Date: Mon, 17 Dec 2012 20:55:10 +0100
From: Michal Hocko
Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
Message-ID: <20121217195510.GA16375@dhcp22.suse.cz>
References: <20121205141722.GA9714@dhcp22.suse.cz> <20121206012924.FE077FD7@pobox.sk> <20121206095423.GB10931@dhcp22.suse.cz> <20121210022038.E6570D37@pobox.sk> <20121210094318.GA6777@dhcp22.suse.cz> <20121210111817.F697F53E@pobox.sk> <20121210155205.GB6777@dhcp22.suse.cz> <20121217023430.5A390FD7@pobox.sk> <20121217163203.GD25432@dhcp22.suse.cz> <20121217192301.829A7020@pobox.sk>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20121217192301.829A7020@pobox.sk>
Sender: owner-linux-mm@kvack.org
List-ID:
To: azurIt
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner

On Mon 17-12-12 19:23:01, azurIt wrote:
> >[Ohh, I am really an idiot. I screwed the first patch]
> >- bool oom = true;
> >+ bool oom = !(gfp_mask | GFP_MEMCG_NO_OOM);
> >
> >Which obviously doesn't work. It should read !(gfp_mask &GFP_MEMCG_NO_OOM).
> > No idea how I could have missed that. I am really sorry about that.
>
>
> :D no problem :) so, now it should really work as expected and
> completely fix my original problem?

It should mitigate the problem. The real fix shouldn't be that specific
(as per discussion in other thread). The chance this will get upstream
is not big and that means that it will not get to the stable tree
either.

> is it safe to apply it on 3.2.35?
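[Editorial aside on the one-character slip quoted in this message (`|` where `&` was intended). The sketch below models both expressions on plain integers. GFP_MEMCG_NO_OOM comes from the patch discussed in the thread, but its bit value is not given there, so the constant used here is hypothetical, as is the GFP_KERNEL value and both function names.]

```python
GFP_KERNEL = 0xD0            # ASSUMPTION: illustrative value only
GFP_MEMCG_NO_OOM = 0x100000  # HYPOTHETICAL bit; the real value is not in the thread


def oom_allowed_buggy(gfp_mask):
    # OR with a non-zero constant can never be zero,
    # so the negation is False for every possible mask.
    return not (gfp_mask | GFP_MEMCG_NO_OOM)


def oom_allowed_fixed(gfp_mask):
    # AND isolates the flag: the result is True unless
    # the caller explicitly set GFP_MEMCG_NO_OOM.
    return not (gfp_mask & GFP_MEMCG_NO_OOM)


if __name__ == "__main__":
    for mask in (GFP_KERNEL, GFP_KERNEL | GFP_MEMCG_NO_OOM):
        print(hex(mask), oom_allowed_buggy(mask), oom_allowed_fixed(mask))
```

[With the `|` form the `oom` flag would be false for every charge regardless of the caller's mask, while the `&` form only disables it when the flag is actually passed. Whether that alone accounts for the traces earlier in the thread is not something this sketch establishes.]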
I didn't check what are the differences but I do not think there is anything to conflict with it. > Thank you very much! HTH -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx177.postini.com [74.125.245.177]) by kanga.kvack.org (Postfix) with SMTP id 658286B002B for ; Tue, 18 Dec 2012 09:22:25 -0500 (EST) Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_from_add=5Fto=5Fpage=5Fcache=5Flocked?= Date: Tue, 18 Dec 2012 15:22:23 +0100 From: "azurIt" References: <20121205141722.GA9714@dhcp22.suse.cz>, <20121206012924.FE077FD7@pobox.sk>, <20121206095423.GB10931@dhcp22.suse.cz>, <20121210022038.E6570D37@pobox.sk>, <20121210094318.GA6777@dhcp22.suse.cz>, <20121210111817.F697F53E@pobox.sk>, <20121210155205.GB6777@dhcp22.suse.cz>, <20121217023430.5A390FD7@pobox.sk>, <20121217163203.GD25432@dhcp22.suse.cz>, <20121217192301.829A7020@pobox.sk> <20121217195510.GA16375@dhcp22.suse.cz> In-Reply-To: <20121217195510.GA16375@dhcp22.suse.cz> MIME-Version: 1.0 Message-Id: <20121218152223.6912832C@pobox.sk> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org List-ID: To: =?utf-8?q?Michal_Hocko?= Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= >It should mitigate the problem. The real fix shouldn't be that specific >(as per discussion in other thread). The chance this will get upstream >is not big and that means that it will not get to the stable tree >either. OOM is no longer killing processes outside target cgroups, so everything looks fine so far. Will report back when i will have more info. Thnks! 
azur -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx114.postini.com [74.125.245.114]) by kanga.kvack.org (Postfix) with SMTP id 31B6A6B002B for ; Mon, 24 Dec 2012 08:25:28 -0500 (EST) Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_from_add=5Fto=5Fpage=5Fcache=5Flocked?= Date: Mon, 24 Dec 2012 14:25:26 +0100 From: "azurIt" References: <20121206095423.GB10931@dhcp22.suse.cz>, <20121210022038.E6570D37@pobox.sk>, <20121210094318.GA6777@dhcp22.suse.cz>, <20121210111817.F697F53E@pobox.sk>, <20121210155205.GB6777@dhcp22.suse.cz>, <20121217023430.5A390FD7@pobox.sk>, <20121217163203.GD25432@dhcp22.suse.cz>, <20121217192301.829A7020@pobox.sk>, <20121217195510.GA16375@dhcp22.suse.cz>, <20121218152223.6912832C@pobox.sk> <20121218152004.GA25208@dhcp22.suse.cz> In-Reply-To: <20121218152004.GA25208@dhcp22.suse.cz> MIME-Version: 1.0 Message-Id: <20121224142526.020165D3@pobox.sk> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org List-ID: To: =?utf-8?q?Michal_Hocko?= Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= >OK, good to hear and fingers crossed. I will try to get back to the >original problem and a better solution sometimes early next year when >all the things settle a bit. Michal, problem, unfortunately, happened again :( twice. When it happened first time (two days ago) i don't want to believe it so i recompiled the kernel and boot it again to be sure i really used your patch. 
Today it happened again, here is report: http://watchdog.sk/lkml/memcg-bug-3.tar.gz Here is patch which i used (kernel 3.2.35, i didn't use any other from your patches): http://watchdog.sk/lkml/5-memcg-fix.patch azur -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx150.postini.com [74.125.245.150]) by kanga.kvack.org (Postfix) with SMTP id 074546B002B for ; Mon, 24 Dec 2012 08:38:52 -0500 (EST) Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_from_add=5Fto=5Fpage=5Fcache=5Flocked?= Date: Mon, 24 Dec 2012 14:38:50 +0100 From: "azurIt" References: <20121206095423.GB10931@dhcp22.suse.cz>, <20121210022038.E6570D37@pobox.sk>, <20121210094318.GA6777@dhcp22.suse.cz>, <20121210111817.F697F53E@pobox.sk>, <20121210155205.GB6777@dhcp22.suse.cz>, <20121217023430.5A390FD7@pobox.sk>, <20121217163203.GD25432@dhcp22.suse.cz>, <20121217192301.829A7020@pobox.sk>, <20121217195510.GA16375@dhcp22.suse.cz>, <20121218152223.6912832C@pobox.sk> <20121218152004.GA25208@dhcp22.suse.cz> In-Reply-To: <20121218152004.GA25208@dhcp22.suse.cz> MIME-Version: 1.0 Message-Id: <20121224143850.B611B3C3@pobox.sk> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org List-ID: To: =?utf-8?q?Michal_Hocko?= Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= >OK, good to hear and fingers crossed. I will try to get back to the >original problem and a better solution sometimes early next year when >all the things settle a bit. Btw, i noticed one more thing when problem is happening (=when any cgroup is stucked), i fogot to mention it before, sorry :( . 
It's related to HDDs: something is slowing them down in a strange way. All services are working normally and i really cannot notice any slowness; the only thing i noticed being affected is our backup software ( www.Bacula.org ). When the problem occurs at night, while the backup is running, the backup is extremely slow and usually doesn't finish until i kill processes inside the affected cgroup (=until i resolve the problem). The backup software is NOT using much HDD bandwidth BUT it's doing a huge number of disk operations (it needs to stat every file and directory). I believe that only the speed of disk operations is affected, and they are very slow. Merry christmas! -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx113.postini.com [74.125.245.113]) by kanga.kvack.org (Postfix) with SMTP id 4CE636B002B for ; Fri, 28 Dec 2012 11:22:13 -0500 (EST) Date: Fri, 28 Dec 2012 17:22:09 +0100 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121228162209.GA1455@dhcp22.suse.cz> References: <20121210094318.GA6777@dhcp22.suse.cz> <20121210111817.F697F53E@pobox.sk> <20121210155205.GB6777@dhcp22.suse.cz> <20121217023430.5A390FD7@pobox.sk> <20121217163203.GD25432@dhcp22.suse.cz> <20121217192301.829A7020@pobox.sk> <20121217195510.GA16375@dhcp22.suse.cz> <20121218152223.6912832C@pobox.sk> <20121218152004.GA25208@dhcp22.suse.cz> <20121224142526.020165D3@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121224142526.020165D3@pobox.sk> Sender: owner-linux-mm@kvack.org List-ID: To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Mon 24-12-12 14:25:26, azurIt
wrote: > >OK, good to hear and fingers crossed. I will try to get back to the > >original problem and a better solution sometimes early next year when > >all the things settle a bit. > > > Michal, problem, unfortunately, happened again :( twice. When it > happened first time (two days ago) i don't want to believe it so i > recompiled the kernel and boot it again to be sure i really used your > patch. Today it happened again, here is report: > http://watchdog.sk/lkml/memcg-bug-3.tar.gz Hmm, 1356352982/1507/stack says [] mem_cgroup_handle_oom+0x241/0x3b0 [] T.1147+0x5ab/0x5c0 [] mem_cgroup_cache_charge+0xbe/0xe0 [] add_to_page_cache_locked+0x4f/0x140 [] add_to_page_cache_lru+0x22/0x50 [] find_or_create_page+0x73/0xb0 [] __getblk+0xea/0x2c0 [] ext3_getblk+0xeb/0x240 [] ext3_bread+0x19/0x90 [] ext3_dx_find_entry+0x83/0x1e0 [] ext3_find_entry+0x2e4/0x480 [] ext3_lookup+0x4d/0x120 [] d_alloc_and_lookup+0x45/0x90 [] do_lookup+0x278/0x390 [] path_lookupat+0xae/0x7e0 [] do_path_lookup+0x35/0xe0 [] user_path_at_empty+0x59/0xb0 [] user_path_at+0x11/0x20 [] vfs_fstatat+0x47/0x80 [] vfs_lstat+0x1e/0x20 [] sys_newlstat+0x24/0x50 [] system_call_fastpath+0x18/0x1d [] 0xffffffffffffffff which suggests that the patch is incomplete and that I am blind :/ mem_cgroup_cache_charge calls __mem_cgroup_try_charge for the page cache and that one doesn't check GFP_MEMCG_NO_OOM. So you need the following follow-up patch on top of the one you already have (which should catch all the remaining cases). Sorry about that... 
---
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 89997ac..559a54d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2779,6 +2779,7 @@ __mem_cgroup_commit_charge_lrucare(struct page *page, struct mem_cgroup *memcg,
 int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 				gfp_t gfp_mask)
 {
+	bool oom = !(gfp_mask & GFP_MEMCG_NO_OOM);
 	struct mem_cgroup *memcg = NULL;
 	int ret;
 
@@ -2791,7 +2792,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 		mm = &init_mm;
 
 	if (page_is_file_cache(page)) {
-		ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, &memcg, true);
+		ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, &memcg, oom);
 		if (ret || !memcg)
 			return ret;
 
@@ -2827,6 +2828,7 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
 				 struct page *page,
 				 gfp_t mask, struct mem_cgroup **ptr)
 {
+	bool oom = !(mask & GFP_MEMCG_NO_OOM);
 	struct mem_cgroup *memcg;
 	int ret;
 
@@ -2849,13 +2851,13 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
 	if (!memcg)
 		goto charge_cur_mm;
 	*ptr = memcg;
-	ret = __mem_cgroup_try_charge(NULL, mask, 1, ptr, true);
+	ret = __mem_cgroup_try_charge(NULL, mask, 1, ptr, oom);
 	css_put(&memcg->css);
 	return ret;
 charge_cur_mm:
 	if (unlikely(!mm))
 		mm = &init_mm;
-	return __mem_cgroup_try_charge(mm, mask, 1, ptr, true);
+	return __mem_cgroup_try_charge(mm, mask, 1, ptr, oom);
 }
 
 static void
-- 
Michal Hocko
SUSE Labs

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
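A side note on the `oom` expression used in the patch above: earlier in this thread the same test was written with `|` instead of `&`, and that form can never evaluate to true. The following is a minimal userspace sketch of the difference, not kernel code. The `gfp_t` typedef, both `DEMO_` flag values and the helper names `may_oom_buggy` / `may_oom_fixed` are made-up stand-ins for illustration and do not match the kernel's definitions.

```c
#include <stdbool.h>

/* Hypothetical stand-ins; the real gfp_t and flag bits live in the kernel. */
typedef unsigned int gfp_t;
#define DEMO_GFP_KERNEL        0x00d0u    /* illustrative value only */
#define DEMO_GFP_MEMCG_NO_OOM  0x100000u  /* illustrative value only */

/*
 * Buggy form: OR-ing in a non-zero flag always yields a non-zero value,
 * so the negation is false for every possible mask.
 */
static bool may_oom_buggy(gfp_t gfp_mask)
{
	return !(gfp_mask | DEMO_GFP_MEMCG_NO_OOM);
}

/*
 * Fixed form: AND isolates the flag bit, so the result is true exactly
 * when the caller did NOT ask to skip the OOM killer.
 */
static bool may_oom_fixed(gfp_t gfp_mask)
{
	return !(gfp_mask & DEMO_GFP_MEMCG_NO_OOM);
}
```

Under these assumptions `may_oom_buggy()` returns false for any mask, so wherever that expression was used the memcg OOM path would have been disabled for every charge, not only for the flagged ones. `may_oom_fixed()` returns false only when the no-OOM bit is set.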
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx206.postini.com [74.125.245.206]) by kanga.kvack.org (Postfix) with SMTP id BF2DF6B006C for ; Sat, 29 Dec 2012 20:09:49 -0500 (EST) Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_from_add=5Fto=5Fpage=5Fcache=5Flocked?= Date: Sun, 30 Dec 2012 02:09:47 +0100 From: "azurIt" References: <20121210094318.GA6777@dhcp22.suse.cz>, <20121210111817.F697F53E@pobox.sk>, <20121210155205.GB6777@dhcp22.suse.cz>, <20121217023430.5A390FD7@pobox.sk>, <20121217163203.GD25432@dhcp22.suse.cz>, <20121217192301.829A7020@pobox.sk>, <20121217195510.GA16375@dhcp22.suse.cz>, <20121218152223.6912832C@pobox.sk>, <20121218152004.GA25208@dhcp22.suse.cz>, <20121224142526.020165D3@pobox.sk> <20121228162209.GA1455@dhcp22.suse.cz> In-Reply-To: <20121228162209.GA1455@dhcp22.suse.cz> MIME-Version: 1.0 Message-Id: <20121230020947.AA002F34@pobox.sk> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org List-ID: To: =?utf-8?q?Michal_Hocko?= Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= >which suggests that the patch is incomplete and that I am blind :/ >mem_cgroup_cache_charge calls __mem_cgroup_try_charge for the page cache >and that one doesn't check GFP_MEMCG_NO_OOM. So you need the following >follow-up patch on top of the one you already have (which should catch >all the remaining cases). >Sorry about that... This was, again, killing my MySQL server (search for "(mysqld)"): http://www.watchdog.sk/lkml/oom_mysqld5 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx163.postini.com [74.125.245.163]) by kanga.kvack.org (Postfix) with SMTP id 934756B006C for ; Sun, 30 Dec 2012 06:08:18 -0500 (EST) Date: Sun, 30 Dec 2012 12:08:15 +0100 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121230110815.GA12940@dhcp22.suse.cz> References: <20121210155205.GB6777@dhcp22.suse.cz> <20121217023430.5A390FD7@pobox.sk> <20121217163203.GD25432@dhcp22.suse.cz> <20121217192301.829A7020@pobox.sk> <20121217195510.GA16375@dhcp22.suse.cz> <20121218152223.6912832C@pobox.sk> <20121218152004.GA25208@dhcp22.suse.cz> <20121224142526.020165D3@pobox.sk> <20121228162209.GA1455@dhcp22.suse.cz> <20121230020947.AA002F34@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121230020947.AA002F34@pobox.sk> Sender: owner-linux-mm@kvack.org List-ID: To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Sun 30-12-12 02:09:47, azurIt wrote: > >which suggests that the patch is incomplete and that I am blind :/ > >mem_cgroup_cache_charge calls __mem_cgroup_try_charge for the page cache > >and that one doesn't check GFP_MEMCG_NO_OOM. So you need the following > >follow-up patch on top of the one you already have (which should catch > >all the remaining cases). > >Sorry about that... 
> > > This was, again, killing my MySQL server (search for "(mysqld)"): > http://www.watchdog.sk/lkml/oom_mysqld5 grep "Kill process" oom_mysqld5 Dec 30 01:53:34 server01 kernel: [ 367.061801] Memory cgroup out of memory: Kill process 5512 (apache2) score 716 or sacrifice child Dec 30 01:53:35 server01 kernel: [ 367.338024] Memory cgroup out of memory: Kill process 5517 (apache2) score 718 or sacrifice child Dec 30 01:53:35 server01 kernel: [ 367.747888] Memory cgroup out of memory: Kill process 5513 (apache2) score 721 or sacrifice child Dec 30 01:53:36 server01 kernel: [ 368.159860] Memory cgroup out of memory: Kill process 5516 (apache2) score 726 or sacrifice child Dec 30 01:53:36 server01 kernel: [ 368.665606] Memory cgroup out of memory: Kill process 5520 (apache2) score 733 or sacrifice child Dec 30 01:53:36 server01 kernel: [ 368.765652] Out of memory: Kill process 1778 (mysqld) score 39 or sacrifice child Dec 30 01:53:36 server01 kernel: [ 369.101753] Memory cgroup out of memory: Kill process 5519 (apache2) score 754 or sacrifice child Dec 30 01:53:37 server01 kernel: [ 369.464262] Memory cgroup out of memory: Kill process 5583 (apache2) score 762 or sacrifice child Dec 30 01:53:37 server01 kernel: [ 369.465017] Out of memory: Kill process 5506 (apache2) score 18 or sacrifice child Dec 30 01:53:37 server01 kernel: [ 369.574932] Memory cgroup out of memory: Kill process 5523 (apache2) score 759 or sacrifice child So your mysqld has been killed by the global OOM not memcg. But why when you seem to be perfectly fine regarding memory? 
I guess the following backtrace is relevant: Dec 30 01:53:36 server01 kernel: [ 368.569720] DMA: 0*4kB 1*8kB 0*16kB 1*32kB 2*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15912kB Dec 30 01:53:36 server01 kernel: [ 368.570447] DMA32: 9*4kB 10*8kB 8*16kB 6*32kB 5*64kB 6*128kB 4*256kB 2*512kB 3*1024kB 3*2048kB 613*4096kB = 2523636kB Dec 30 01:53:36 server01 kernel: [ 368.571175] Normal: 5*4kB 2060*8kB 4122*16kB 2550*32kB 2667*64kB 722*128kB 197*256kB 68*512kB 15*1024kB 4*2048kB 1855*4096kB = 8134036kB Dec 30 01:53:36 server01 kernel: [ 368.571906] 308964 total pagecache pages Dec 30 01:53:36 server01 kernel: [ 368.572023] 0 pages in swap cache Dec 30 01:53:36 server01 kernel: [ 368.572140] Swap cache stats: add 0, delete 0, find 0/0 Dec 30 01:53:36 server01 kernel: [ 368.572260] Free swap = 0kB Dec 30 01:53:36 server01 kernel: [ 368.572375] Total swap = 0kB Dec 30 01:53:36 server01 kernel: [ 368.597836] apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0 Dec 30 01:53:36 server01 kernel: [ 368.598034] apache2 cpuset=uid mems_allowed=0 Dec 30 01:53:36 server01 kernel: [ 368.598152] Pid: 5385, comm: apache2 Not tainted 3.2.35-grsec #1 Dec 30 01:53:36 server01 kernel: [ 368.598273] Call Trace: Dec 30 01:53:36 server01 kernel: [ 368.598396] [] dump_header+0x7e/0x1e0 Dec 30 01:53:36 server01 kernel: [ 368.598516] [] ? find_lock_task_mm+0x2f/0x70 Dec 30 01:53:36 server01 kernel: [ 368.598638] [] oom_kill_process+0x85/0x2a0 Dec 30 01:53:36 server01 kernel: [ 368.598759] [] out_of_memory+0xe5/0x200 Dec 30 01:53:36 server01 kernel: [ 368.598880] [] pagefault_out_of_memory+0xbd/0x110 Dec 30 01:53:36 server01 kernel: [ 368.599006] [] mm_fault_error+0xb6/0x1a0 Dec 30 01:53:36 server01 kernel: [ 368.599127] [] do_page_fault+0x3ee/0x460 Dec 30 01:53:36 server01 kernel: [ 368.599250] [] ? mntput+0x1f/0x30 Dec 30 01:53:36 server01 kernel: [ 368.599371] [] ? 
fput+0x156/0x200 Dec 30 01:53:36 server01 kernel: [ 368.599496] [] page_fault+0x1f/0x30 This would suggest that an unexpected ENOMEM leaked during page fault path. I do not see which one could that be because you said THP (CONFIG_TRANSPARENT_HUGEPAGE) are disabled (and the other patch I have mentioned in the thread should fix that issue - btw. the patch is already scheduled for stable tree). __do_fault, do_anonymous_page and do_wp_page call mem_cgroup_newpage_charge with GFP_KERNEL which means that we do memcg OOM and never return ENOMEM. do_swap_page calls mem_cgroup_try_charge_swapin with GFP_KERNEL as well. I might have missed something but I will not get to look closer before 2nd January. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx151.postini.com [74.125.245.151]) by kanga.kvack.org (Postfix) with SMTP id 247636B0005 for ; Fri, 25 Jan 2013 10:07:26 -0500 (EST) Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_from_add=5Fto=5Fpage=5Fcache=5Flocked?= Date: Fri, 25 Jan 2013 16:07:23 +0100 From: "azurIt" References: <20121210155205.GB6777@dhcp22.suse.cz>, <20121217023430.5A390FD7@pobox.sk>, <20121217163203.GD25432@dhcp22.suse.cz>, <20121217192301.829A7020@pobox.sk>, <20121217195510.GA16375@dhcp22.suse.cz>, <20121218152223.6912832C@pobox.sk>, <20121218152004.GA25208@dhcp22.suse.cz>, <20121224142526.020165D3@pobox.sk>, <20121228162209.GA1455@dhcp22.suse.cz>, <20121230020947.AA002F34@pobox.sk> <20121230110815.GA12940@dhcp22.suse.cz> In-Reply-To: <20121230110815.GA12940@dhcp22.suse.cz> MIME-Version: 1.0 Message-Id: <20130125160723.FAE73567@pobox.sk> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org List-ID: To: 
=?utf-8?q?Michal_Hocko?= Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= Any news? Thnx! azur ______________________________________________________________ > Od: "Michal Hocko" > Komu: azurIt > Dátum: 30.12.2012 12:08 > Predmet: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked > > CC: linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , "Johannes Weiner" >On Sun 30-12-12 02:09:47, azurIt wrote: >> >which suggests that the patch is incomplete and that I am blind :/ >> >mem_cgroup_cache_charge calls __mem_cgroup_try_charge for the page cache >> >and that one doesn't check GFP_MEMCG_NO_OOM. So you need the following >> >follow-up patch on top of the one you already have (which should catch >> >all the remaining cases). >> >Sorry about that... >> >> >> This was, again, killing my MySQL server (search for "(mysqld)"): >> http://www.watchdog.sk/lkml/oom_mysqld5 > >grep "Kill process" oom_mysqld5 >Dec 30 01:53:34 server01 kernel: [ 367.061801] Memory cgroup out of memory: Kill process 5512 (apache2) score 716 or sacrifice child >Dec 30 01:53:35 server01 kernel: [ 367.338024] Memory cgroup out of memory: Kill process 5517 (apache2) score 718 or sacrifice child >Dec 30 01:53:35 server01 kernel: [ 367.747888] Memory cgroup out of memory: Kill process 5513 (apache2) score 721 or sacrifice child >Dec 30 01:53:36 server01 kernel: [ 368.159860] Memory cgroup out of memory: Kill process 5516 (apache2) score 726 or sacrifice child >Dec 30 01:53:36 server01 kernel: [ 368.665606] Memory cgroup out of memory: Kill process 5520 (apache2) score 733 or sacrifice child >Dec 30 01:53:36 server01 kernel: [ 368.765652] Out of memory: Kill process 1778 (mysqld) score 39 or sacrifice child >Dec 30 01:53:36 server01 kernel: [ 369.101753] Memory cgroup out of memory: Kill process 5519 (apache2) score 754 or sacrifice child
>Dec 30 01:53:37 server01 kernel: [ 369.464262] Memory cgroup out of memory: Kill process 5583 (apache2) score 762 or sacrifice child >Dec 30 01:53:37 server01 kernel: [ 369.465017] Out of memory: Kill process 5506 (apache2) score 18 or sacrifice child >Dec 30 01:53:37 server01 kernel: [ 369.574932] Memory cgroup out of memory: Kill process 5523 (apache2) score 759 or sacrifice child > >So your mysqld has been killed by the global OOM not memcg. But why when >you seem to be perfectly fine regarding memory? I guess the following >backtrace is relevant: >Dec 30 01:53:36 server01 kernel: [ 368.569720] DMA: 0*4kB 1*8kB 0*16kB 1*32kB 2*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15912kB >Dec 30 01:53:36 server01 kernel: [ 368.570447] DMA32: 9*4kB 10*8kB 8*16kB 6*32kB 5*64kB 6*128kB 4*256kB 2*512kB 3*1024kB 3*2048kB 613*4096kB = 2523636kB >Dec 30 01:53:36 server01 kernel: [ 368.571175] Normal: 5*4kB 2060*8kB 4122*16kB 2550*32kB 2667*64kB 722*128kB 197*256kB 68*512kB 15*1024kB 4*2048kB 1855*4096kB = 8134036kB >Dec 30 01:53:36 server01 kernel: [ 368.571906] 308964 total pagecache pages >Dec 30 01:53:36 server01 kernel: [ 368.572023] 0 pages in swap cache >Dec 30 01:53:36 server01 kernel: [ 368.572140] Swap cache stats: add 0, delete 0, find 0/0 >Dec 30 01:53:36 server01 kernel: [ 368.572260] Free swap = 0kB >Dec 30 01:53:36 server01 kernel: [ 368.572375] Total swap = 0kB >Dec 30 01:53:36 server01 kernel: [ 368.597836] apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0 >Dec 30 01:53:36 server01 kernel: [ 368.598034] apache2 cpuset=uid mems_allowed=0 >Dec 30 01:53:36 server01 kernel: [ 368.598152] Pid: 5385, comm: apache2 Not tainted 3.2.35-grsec #1 >Dec 30 01:53:36 server01 kernel: [ 368.598273] Call Trace: >Dec 30 01:53:36 server01 kernel: [ 368.598396] [] dump_header+0x7e/0x1e0 >Dec 30 01:53:36 server01 kernel: [ 368.598516] [] ? 
find_lock_task_mm+0x2f/0x70 >Dec 30 01:53:36 server01 kernel: [ 368.598638] [] oom_kill_process+0x85/0x2a0 >Dec 30 01:53:36 server01 kernel: [ 368.598759] [] out_of_memory+0xe5/0x200 >Dec 30 01:53:36 server01 kernel: [ 368.598880] [] pagefault_out_of_memory+0xbd/0x110 >Dec 30 01:53:36 server01 kernel: [ 368.599006] [] mm_fault_error+0xb6/0x1a0 >Dec 30 01:53:36 server01 kernel: [ 368.599127] [] do_page_fault+0x3ee/0x460 >Dec 30 01:53:36 server01 kernel: [ 368.599250] [] ? mntput+0x1f/0x30 >Dec 30 01:53:36 server01 kernel: [ 368.599371] [] ? fput+0x156/0x200 >Dec 30 01:53:36 server01 kernel: [ 368.599496] [] page_fault+0x1f/0x30 > >This would suggest that an unexpected ENOMEM leaked during page fault >path. I do not see which one could that be because you said THP >(CONFIG_TRANSPARENT_HUGEPAGE) are disabled (and the other patch I have >mentioned in the thread should fix that issue - btw. the patch is >already scheduled for stable tree). > __do_fault, do_anonymous_page and do_wp_page call >mem_cgroup_newpage_charge with GFP_KERNEL which means that >we do memcg OOM and never return ENOMEM. do_swap_page calls >mem_cgroup_try_charge_swapin with GFP_KERNEL as well. > >I might have missed something but I will not get to look closer before >2nd January. >-- >Michal Hocko >SUSE Labs > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
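The `grep "Kill process"` output quoted above mixes two kinds of kills, and the distinction matters for the diagnosis: lines starting with "Memory cgroup out of memory:" come from the memcg OOM killer, while plain "Out of memory:" lines come from the global one. A small sketch of telling them apart by message prefix is given below. The enum and the function name `classify_oom_line` are hypothetical, and the matching relies only on the two message strings visible in the quoted log.

```c
#include <string.h>

/* Hypothetical classification of a single kernel log line. */
enum oom_kind { OOM_NONE = 0, OOM_MEMCG, OOM_GLOBAL };

static enum oom_kind classify_oom_line(const char *line)
{
	/* The memcg message is tested first; it uses a lowercase "out". */
	if (strstr(line, "Memory cgroup out of memory: Kill process"))
		return OOM_MEMCG;
	/* The global OOM killer message starts with a capital "Out". */
	if (strstr(line, "Out of memory: Kill process"))
		return OOM_GLOBAL;
	return OOM_NONE;
}
```

Applied to the quoted log, the mysqld kill (pid 1778) and the apache2 kill of pid 5506 would classify as global, and the remaining apache2 kills as memcg.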
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx181.postini.com [74.125.245.181]) by kanga.kvack.org (Postfix) with SMTP id B26E36B0005 for ; Fri, 25 Jan 2013 11:31:40 -0500 (EST) Date: Fri, 25 Jan 2013 17:31:30 +0100 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20130125163130.GF4721@dhcp22.suse.cz> References: <20121217163203.GD25432@dhcp22.suse.cz> <20121217192301.829A7020@pobox.sk> <20121217195510.GA16375@dhcp22.suse.cz> <20121218152223.6912832C@pobox.sk> <20121218152004.GA25208@dhcp22.suse.cz> <20121224142526.020165D3@pobox.sk> <20121228162209.GA1455@dhcp22.suse.cz> <20121230020947.AA002F34@pobox.sk> <20121230110815.GA12940@dhcp22.suse.cz> <20130125160723.FAE73567@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20130125160723.FAE73567@pobox.sk> Sender: owner-linux-mm@kvack.org List-ID: To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Fri 25-01-13 16:07:23, azurIt wrote: > Any news? Thnx! Sorry, but I didn't get to this one yet. > > azur > > > > ______________________________________________________________ > > Od: "Michal Hocko" > > Komu: azurIt > > Datum: 30.12.2012 12:08 > > Predmet: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked > > > > CC: linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , "Johannes Weiner" > >On Sun 30-12-12 02:09:47, azurIt wrote: > >> >which suggests that the patch is incomplete and that I am blind :/ > >> >mem_cgroup_cache_charge calls __mem_cgroup_try_charge for the page cache > >> >and that one doesn't check GFP_MEMCG_NO_OOM. 
So you need the following > >> >follow-up patch on top of the one you already have (which should catch > >> >all the remaining cases). > >> >Sorry about that... > >> > >> > >> This was, again, killing my MySQL server (search for "(mysqld)"): > >> http://www.watchdog.sk/lkml/oom_mysqld5 > > > >grep "Kill process" oom_mysqld5 > >Dec 30 01:53:34 server01 kernel: [ 367.061801] Memory cgroup out of memory: Kill process 5512 (apache2) score 716 or sacrifice child > >Dec 30 01:53:35 server01 kernel: [ 367.338024] Memory cgroup out of memory: Kill process 5517 (apache2) score 718 or sacrifice child > >Dec 30 01:53:35 server01 kernel: [ 367.747888] Memory cgroup out of memory: Kill process 5513 (apache2) score 721 or sacrifice child > >Dec 30 01:53:36 server01 kernel: [ 368.159860] Memory cgroup out of memory: Kill process 5516 (apache2) score 726 or sacrifice child > >Dec 30 01:53:36 server01 kernel: [ 368.665606] Memory cgroup out of memory: Kill process 5520 (apache2) score 733 or sacrifice child > >Dec 30 01:53:36 server01 kernel: [ 368.765652] Out of memory: Kill process 1778 (mysqld) score 39 or sacrifice child > >Dec 30 01:53:36 server01 kernel: [ 369.101753] Memory cgroup out of memory: Kill process 5519 (apache2) score 754 or sacrifice child > >Dec 30 01:53:37 server01 kernel: [ 369.464262] Memory cgroup out of memory: Kill process 5583 (apache2) score 762 or sacrifice child > >Dec 30 01:53:37 server01 kernel: [ 369.465017] Out of memory: Kill process 5506 (apache2) score 18 or sacrifice child > >Dec 30 01:53:37 server01 kernel: [ 369.574932] Memory cgroup out of memory: Kill process 5523 (apache2) score 759 or sacrifice child > > > >So your mysqld has been killed by the global OOM not memcg. But why when > >you seem to be perfectly fine regarding memory? 
I guess the following > >backtrace is relevant: > >Dec 30 01:53:36 server01 kernel: [ 368.569720] DMA: 0*4kB 1*8kB 0*16kB 1*32kB 2*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15912kB > >Dec 30 01:53:36 server01 kernel: [ 368.570447] DMA32: 9*4kB 10*8kB 8*16kB 6*32kB 5*64kB 6*128kB 4*256kB 2*512kB 3*1024kB 3*2048kB 613*4096kB = 2523636kB > >Dec 30 01:53:36 server01 kernel: [ 368.571175] Normal: 5*4kB 2060*8kB 4122*16kB 2550*32kB 2667*64kB 722*128kB 197*256kB 68*512kB 15*1024kB 4*2048kB 1855*4096kB = 8134036kB > >Dec 30 01:53:36 server01 kernel: [ 368.571906] 308964 total pagecache pages > >Dec 30 01:53:36 server01 kernel: [ 368.572023] 0 pages in swap cache > >Dec 30 01:53:36 server01 kernel: [ 368.572140] Swap cache stats: add 0, delete 0, find 0/0 > >Dec 30 01:53:36 server01 kernel: [ 368.572260] Free swap = 0kB > >Dec 30 01:53:36 server01 kernel: [ 368.572375] Total swap = 0kB > >Dec 30 01:53:36 server01 kernel: [ 368.597836] apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0 > >Dec 30 01:53:36 server01 kernel: [ 368.598034] apache2 cpuset=uid mems_allowed=0 > >Dec 30 01:53:36 server01 kernel: [ 368.598152] Pid: 5385, comm: apache2 Not tainted 3.2.35-grsec #1 > >Dec 30 01:53:36 server01 kernel: [ 368.598273] Call Trace: > >Dec 30 01:53:36 server01 kernel: [ 368.598396] [] dump_header+0x7e/0x1e0 > >Dec 30 01:53:36 server01 kernel: [ 368.598516] [] ? find_lock_task_mm+0x2f/0x70 > >Dec 30 01:53:36 server01 kernel: [ 368.598638] [] oom_kill_process+0x85/0x2a0 > >Dec 30 01:53:36 server01 kernel: [ 368.598759] [] out_of_memory+0xe5/0x200 > >Dec 30 01:53:36 server01 kernel: [ 368.598880] [] pagefault_out_of_memory+0xbd/0x110 > >Dec 30 01:53:36 server01 kernel: [ 368.599006] [] mm_fault_error+0xb6/0x1a0 > >Dec 30 01:53:36 server01 kernel: [ 368.599127] [] do_page_fault+0x3ee/0x460 > >Dec 30 01:53:36 server01 kernel: [ 368.599250] [] ? mntput+0x1f/0x30 > >Dec 30 01:53:36 server01 kernel: [ 368.599371] [] ? 
fput+0x156/0x200 > >Dec 30 01:53:36 server01 kernel: [ 368.599496] [] page_fault+0x1f/0x30 > > > >This would suggest that an unexpected ENOMEM leaked during page fault > >path. I do not see which one could that be because you said THP > >(CONFIG_TRANSPARENT_HUGEPAGE) are disabled (and the other patch I have > >mentioned in the thread should fix that issue - btw. the patch is > >already scheduled for stable tree). > > __do_fault, do_anonymous_page and do_wp_page call > >mem_cgroup_newpage_charge with GFP_KERNEL which means that > >we do memcg OOM and never return ENOMEM. do_swap_page calls > >mem_cgroup_try_charge_swapin with GFP_KERNEL as well. > > > >I might have missed something but I will not get to look closer before > >2nd January. > >-- > >Michal Hocko > >SUSE Labs > > > -- > To unsubscribe from this list: send the line "unsubscribe cgroups" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
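The per-zone free-list lines in the quoted log (DMA, DMA32, Normal) can be checked by hand, which backs up the remark that the machine looked fine regarding memory. Each entry is a count of free blocks times a block size that doubles from 4kB upwards. The sketch below recomputes a zone total from the counts. It assumes a 4kB base page and the 11 block sizes shown in the log, and `zone_free_kb` is a hypothetical helper name.

```c
#include <stddef.h>

/*
 * counts[i] is the number of free blocks of order i, where a block of
 * order i is (4kB << i) in size. Returns the zone total in kB.
 * Assumes a 4kB base page, as in the quoted log.
 */
static unsigned long zone_free_kb(const unsigned long *counts, size_t n_orders)
{
	unsigned long total_kb = 0;

	for (size_t order = 0; order < n_orders; order++)
		total_kb += counts[order] * (4UL << order);

	return total_kb;
}
```

Feeding in the counts from the log reproduces the printed totals: 15912kB for DMA, 2523636kB for DMA32 and 8134036kB for Normal. That is roughly 10 GB free at the moment the global OOM killer fired.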
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx169.postini.com [74.125.245.169]) by kanga.kvack.org (Postfix) with SMTP id 6E76B6B0010 for ; Tue, 5 Feb 2013 08:49:47 -0500 (EST) Date: Tue, 5 Feb 2013 14:49:42 +0100 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20130205134937.GA22804@dhcp22.suse.cz> References: <20121217192301.829A7020@pobox.sk> <20121217195510.GA16375@dhcp22.suse.cz> <20121218152223.6912832C@pobox.sk> <20121218152004.GA25208@dhcp22.suse.cz> <20121224142526.020165D3@pobox.sk> <20121228162209.GA1455@dhcp22.suse.cz> <20121230020947.AA002F34@pobox.sk> <20121230110815.GA12940@dhcp22.suse.cz> <20130125160723.FAE73567@pobox.sk> <20130125163130.GF4721@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130125163130.GF4721@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Fri 25-01-13 17:31:30, Michal Hocko wrote: > On Fri 25-01-13 16:07:23, azurIt wrote: > > Any news? Thnx! > > Sorry, but I didn't get to this one yet. Sorry, to get back to this that late but I was busy as hell since the beginning of the year. Has the issue repeated since then? You said you didn't apply other than the above mentioned patch. Could you apply also debugging part of the patches I have sent? 
In case you don't have it handy, it should be this one: --- From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx198.postini.com [74.125.245.198]) by kanga.kvack.org (Postfix) with SMTP id 33AD16B0002 for ; Tue, 5 Feb 2013 11:09:38 -0500 (EST) Date: Tue, 5 Feb 2013 17:09:34 +0100 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20130205160934.GB22804@dhcp22.suse.cz> References: <20121218152223.6912832C@pobox.sk> <20121218152004.GA25208@dhcp22.suse.cz> <20121224142526.020165D3@pobox.sk> <20121228162209.GA1455@dhcp22.suse.cz> <20121230020947.AA002F34@pobox.sk> <20121230110815.GA12940@dhcp22.suse.cz> <20130125160723.FAE73567@pobox.sk> <20130125163130.GF4721@dhcp22.suse.cz> <20130205134937.GA22804@dhcp22.suse.cz> <20130205154947.CD6411E2@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130205154947.CD6411E2@pobox.sk> Sender: owner-linux-mm@kvack.org List-ID: To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Tue 05-02-13 15:49:47, azurIt wrote: [...] > Just to be sure - am i supposed to apply this two patches? > http://watchdog.sk/lkml/patches/ 5-memcg-fix-1.patch is not complete. It doesn't contain the follow-up I mentioned in a later email.
Here is the full patch: --- From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx172.postini.com [74.125.245.172]) by kanga.kvack.org (Postfix) with SMTP id 8F80D6B0005 for ; Tue, 5 Feb 2013 11:31:08 -0500 (EST) Date: Tue, 5 Feb 2013 17:31:06 +0100 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20130205163106.GC22804@dhcp22.suse.cz> References: <20121218152223.6912832C@pobox.sk> <20121218152004.GA25208@dhcp22.suse.cz> <20121224142526.020165D3@pobox.sk> <20121228162209.GA1455@dhcp22.suse.cz> <20121230020947.AA002F34@pobox.sk> <20121230110815.GA12940@dhcp22.suse.cz> <20130125160723.FAE73567@pobox.sk> <20130125163130.GF4721@dhcp22.suse.cz> <20130205134937.GA22804@dhcp22.suse.cz> <20130205154947.CD6411E2@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130205154947.CD6411E2@pobox.sk> Sender: owner-linux-mm@kvack.org List-ID: To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Tue 05-02-13 15:49:47, azurIt wrote: [...] > I have another old problem which is maybe also related to this. I > wasn't connecting it with this before but now i'm not sure. Two of our > servers, which are affected by this cgroup problem, are also randomly > freezing completely (few times per month). These are the symptoms: > - servers are answering to ping > - it is possible to connect via SSH but connection is freezed after > sending the password > - it is possible to login via console but it is freezed after typeing > the login > These symptoms are very similar to HDD problems or HDD overload (but > there is no overload for sure). The only way to fix it is, probably, > hard rebooting the server (didn't find any other way). What do you > think? Can this be related? This is hard to tell without further information. 
> Maybe HDDs are locked in the similar way the cgroups are - we already > found out that cgroup freezeing is related also to HDD activity. Maybe > there is a little chance that the whole HDD subsystem ends in > deadlock? The "HDD subsystem", whatever that means, cannot be blocked by memcg being stuck. Access to some files might be an issue because those could have locks held, but I do not see any other relation. I would start by checking the HW and then reduce the elements that could contribute, i.e. try to nail down the minimum set which reproduces the issue. I am afraid I cannot help you much with that. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx189.postini.com [74.125.245.189]) by kanga.kvack.org (Postfix) with SMTP id 90FD56B0002 for ; Tue, 5 Feb 2013 13:59:59 -0500 (EST) Received: by mail-wg0-f41.google.com with SMTP id ds1so4353395wgb.4 for ; Tue, 05 Feb 2013 10:59:58 -0800 (PST) Date: Tue, 5 Feb 2013 19:59:53 +0100 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20130205185953.GB3959@dhcp22.suse.cz> References: <20121230020947.AA002F34@pobox.sk> <20121230110815.GA12940@dhcp22.suse.cz> <20130125160723.FAE73567@pobox.sk> <20130125163130.GF4721@dhcp22.suse.cz> <20130205134937.GA22804@dhcp22.suse.cz> <20130205154947.CD6411E2@pobox.sk> <20130205160934.GB22804@dhcp22.suse.cz> <20130205174651.GA3959@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Greg Thelen Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Tue 05-02-13 10:09:57, Greg Thelen
wrote: > On Tue, Feb 05 2013, Michal Hocko wrote: > > > On Tue 05-02-13 08:48:23, Greg Thelen wrote: > >> On Tue, Feb 05 2013, Michal Hocko wrote: > >> > >> > On Tue 05-02-13 15:49:47, azurIt wrote: > >> > [...] > >> >> Just to be sure - am i supposed to apply this two patches? > >> >> http://watchdog.sk/lkml/patches/ > >> > > >> > 5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I > >> > mentioned in a follow up email. Here is the full patch: > >> > --- > >> > From f2bf8437d5b9bb38a95a432bf39f32c584955171 Mon Sep 17 00:00:00 2001 > >> > From: Michal Hocko > >> > Date: Mon, 26 Nov 2012 11:47:57 +0100 > >> > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked > >> > > >> > memcg oom killer might deadlock if the process which falls down to > >> > mem_cgroup_handle_oom holds a lock which prevents other task to > >> > terminate because it is blocked on the very same lock. > >> > This can happen when a write system call needs to allocate a page but > >> > the allocation hits the memcg hard limit and there is nothing to reclaim > >> > (e.g. there is no swap or swap limit is hit as well and all cache pages > >> > have been reclaimed already) and the process selected by memcg OOM > >> > killer is blocked on i_mutex on the same inode (e.g. truncate it). 
> >> > > >> > Process A > >> > [] do_truncate+0x58/0xa0 # takes i_mutex > >> > [] do_last+0x250/0xa30 > >> > [] path_openat+0xd7/0x440 > >> > [] do_filp_open+0x49/0xa0 > >> > [] do_sys_open+0x106/0x240 > >> > [] sys_open+0x20/0x30 > >> > [] system_call_fastpath+0x18/0x1d > >> > [] 0xffffffffffffffff > >> > > >> > Process B > >> > [] mem_cgroup_handle_oom+0x241/0x3b0 > >> > [] T.1146+0x5ab/0x5c0 > >> > [] mem_cgroup_cache_charge+0xbe/0xe0 > >> > [] add_to_page_cache_locked+0x4c/0x140 > >> > [] add_to_page_cache_lru+0x22/0x50 > >> > [] grab_cache_page_write_begin+0x8b/0xe0 > >> > [] ext3_write_begin+0x88/0x270 > >> > [] generic_file_buffered_write+0x116/0x290 > >> > [] __generic_file_aio_write+0x27c/0x480 > >> > [] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex > >> > [] do_sync_write+0xea/0x130 > >> > [] vfs_write+0xf3/0x1f0 > >> > [] sys_write+0x51/0x90 > >> > [] system_call_fastpath+0x18/0x1d > >> > [] 0xffffffffffffffff > >> > >> It looks like grab_cache_page_write_begin() passes __GFP_FS into > >> __page_cache_alloc() and mem_cgroup_cache_charge(). Which makes me > >> think that this deadlock is also possible in the page allocator even > >> before getting to add_to_page_cache_lru. no? > > > > I am not that familiar with VFS but i_mutex is a high level lock AFAIR > > and it shouldn't be called from the pageout path so __page_cache_alloc > > should be safe. > > I wasn't clear, sorry. My concern is not that pageout() grabs i_mutex. > My concern is that __page_cache_alloc() will invoke the oom killer and > select a victim which wants i_mutex. This victim will deadlock because > the oom killer caller already holds i_mutex. That would be true for the memcg oom because that one is blocking but the global oom just puts the allocator into sleep for a while and then the allocator should back off eventually (unless this is NOFAIL allocation). I would need to look closer whether this is really the case - I haven't seen that allocator code path for a while... 
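The condition discussed above can be written down as a tiny model. This is only a sketch for illustration, not kernel code: the lock identifiers, the enum, and the function name are all hypothetical, and it assumes (as stated above) that the memcg OOM path blocks until memory is freed while the global OOM path eventually backs off. It deliberately ignores NOFAIL allocations.

```c
#include <stdbool.h>

/* Hypothetical lock identifiers, invented for this sketch. */
enum {
	SKETCH_I_MUTEX  = 1u << 0,
	SKETCH_MMAP_SEM = 1u << 1,
};

/* How the task that hit OOM waits for the victim, per the discussion above. */
enum sketch_oom_wait {
	SKETCH_WAIT_BLOCKING,	/* memcg OOM: caller sleeps until memory is freed */
	SKETCH_WAIT_BACKOFF,	/* global OOM: allocator sleeps a while, then backs off */
};

/*
 * Returns true when the caller can end up waiting forever: it blocks on the
 * victim while holding at least one lock the victim must take before it can
 * exit. The back-off case is treated as safe here, which is an assumption
 * that does not hold for NOFAIL allocations.
 */
static bool sketch_oom_can_deadlock(unsigned int held_by_caller,
				    unsigned int needed_by_victim,
				    enum sketch_oom_wait wait)
{
	if (wait != SKETCH_WAIT_BLOCKING)
		return false;
	return (held_by_caller & needed_by_victim) != 0;
}
```

In terms of the traces in the patch description, Process B is the caller (it holds i_mutex and sits in mem_cgroup_handle_oom) and Process A is the victim (it needs i_mutex in do_truncate), so the model reports a possible deadlock for that pair.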
> The wild accusation I am making is that anyone who invokes the oom > killer and waits on the victim to die is essentially grabbing all of > the locks that any of the oom killer victims may grab (e.g. i_mutex). True. > To avoid deadlock the oom killer can only be called is while holding > no locks that the oom victim demands. I think some locks are grabbed > in a way that allows the lock request to fail if the task has a fatal > signal pending, so they are safe. But any locks acquisitions that > cannot fail (e.g. mutex_lock) will deadlock with the oom killing > process. So the oom killing process cannot hold any such locks which > the victim will attempt to grab. Hopefully I'm missing something. Agreed. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx197.postini.com [74.125.245.197]) by kanga.kvack.org (Postfix) with SMTP id 437CF6B0005 for ; Wed, 6 Feb 2013 09:01:23 -0500 (EST) Date: Wed, 6 Feb 2013 15:01:19 +0100 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20130206140119.GD10254@dhcp22.suse.cz> References: <20121224142526.020165D3@pobox.sk> <20121228162209.GA1455@dhcp22.suse.cz> <20121230020947.AA002F34@pobox.sk> <20121230110815.GA12940@dhcp22.suse.cz> <20130125160723.FAE73567@pobox.sk> <20130125163130.GF4721@dhcp22.suse.cz> <20130205134937.GA22804@dhcp22.suse.cz> <20130205154947.CD6411E2@pobox.sk> <20130205160934.GB22804@dhcp22.suse.cz> <20130206021721.1AE9E3C7@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130206021721.1AE9E3C7@pobox.sk> Sender: owner-linux-mm@kvack.org List-ID: To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , 
KAMEZAWA Hiroyuki , Johannes Weiner On Wed 06-02-13 02:17:21, azurIt wrote: > >5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I > >mentioned in a follow up email. Here is the full patch: > > > Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]: > http://www.watchdog.sk/lkml/oom_mysqld6 [...] WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610() Hardware name: S5000VSA gfp_mask:4304 nr_pages:1 oom:0 ret:2 Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 Call Trace: [] warn_slowpath_common+0x7a/0xb0 [] warn_slowpath_fmt+0x46/0x50 [] ? mem_cgroup_margin+0x73/0xa0 [] T.1149+0x2d9/0x610 [] ? blk_finish_plug+0x18/0x50 [] mem_cgroup_cache_charge+0xc4/0xf0 [] add_to_page_cache_locked+0x4f/0x140 [] add_to_page_cache_lru+0x22/0x50 [] filemap_fault+0x252/0x4f0 [] __do_fault+0x78/0x5a0 [] handle_pte_fault+0x84/0x940 [] ? vma_prio_tree_insert+0x30/0x50 [] ? vma_link+0x88/0xe0 [] handle_mm_fault+0x138/0x260 [] do_page_fault+0x13d/0x460 [] ? do_mmap_pgoff+0x3dc/0x430 [] page_fault+0x1f/0x30 ---[ end trace 8817670349022007 ]--- apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0 apache2 cpuset=uid mems_allowed=0 Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 Call Trace: [] dump_header+0x7e/0x1e0 [] ? find_lock_task_mm+0x2f/0x70 [] oom_kill_process+0x85/0x2a0 [] out_of_memory+0xe5/0x200 [] pagefault_out_of_memory+0xbd/0x110 [] mm_fault_error+0xb6/0x1a0 [] do_page_fault+0x3ee/0x460 [] ? do_mmap_pgoff+0x3dc/0x430 [] page_fault+0x1f/0x30 The first trace comes from the debugging WARN and it clearly points to a file fault path. __do_fault pre-charges a page in case we need to do CoW (copy-on-write) for the returned page. This one falls back to memcg OOM and never returns ENOMEM as I have mentioned earlier. However, the fs fault handler (filemap_fault here) can fallback to page_cache_read if the readahead (do_sync_mmap_readahead) fails to get page to the page cache. 
And we can see this happening in the first trace. page_cache_read then calls add_to_page_cache_lru and eventually gets to add_to_page_cache_locked which calls mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should happen. This ENOMEM gets to the fault handler and kaboom. So the fix is really much more complex than I thought. Although add_to_page_cache_locked sounded like a good place, it turned out not to be. We apparently need something more clever. One way would be to stop misusing __GFP_NORETRY for GFP_MEMCG_NO_OOM and give it a real flag. We have 32 bits for those flags in gfp_t so there should be some room there. Or we could do this with a per-task flag, the same way we do for NO_IO in the current -mm tree. The latter seems easier wrt. the gfp_mask passing horror - e.g. __generic_file_aio_write doesn't pass flags and it can be called from unlocked contexts as well. I have to think about it some more. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx104.postini.com [74.125.245.104]) by kanga.kvack.org (Postfix) with SMTP id 5488A6B0005 for ; Wed, 6 Feb 2013 09:22:23 -0500 (EST) Date: Wed, 6 Feb 2013 15:22:19 +0100 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20130206142219.GF10254@dhcp22.suse.cz> References: <20121228162209.GA1455@dhcp22.suse.cz> <20121230020947.AA002F34@pobox.sk> <20121230110815.GA12940@dhcp22.suse.cz> <20130125160723.FAE73567@pobox.sk> <20130125163130.GF4721@dhcp22.suse.cz> <20130205134937.GA22804@dhcp22.suse.cz> <20130205154947.CD6411E2@pobox.sk> <20130205160934.GB22804@dhcp22.suse.cz> <20130206021721.1AE9E3C7@pobox.sk> <20130206140119.GD10254@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130206140119.GD10254@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Wed 06-02-13 15:01:19, Michal Hocko wrote: > On Wed 06-02-13 02:17:21, azurIt wrote: > > >5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I > > >mentioned in a follow up email. Here is the full patch: > > > > > > Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]: > > http://www.watchdog.sk/lkml/oom_mysqld6 > > [...] > WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610() > Hardware name: S5000VSA > gfp_mask:4304 nr_pages:1 oom:0 ret:2 > Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 > Call Trace: > [] warn_slowpath_common+0x7a/0xb0 > [] warn_slowpath_fmt+0x46/0x50 > [] ? mem_cgroup_margin+0x73/0xa0 > [] T.1149+0x2d9/0x610 > [] ? 
blk_finish_plug+0x18/0x50 > [] mem_cgroup_cache_charge+0xc4/0xf0 > [] add_to_page_cache_locked+0x4f/0x140 > [] add_to_page_cache_lru+0x22/0x50 > [] filemap_fault+0x252/0x4f0 > [] __do_fault+0x78/0x5a0 > [] handle_pte_fault+0x84/0x940 > [] ? vma_prio_tree_insert+0x30/0x50 > [] ? vma_link+0x88/0xe0 > [] handle_mm_fault+0x138/0x260 > [] do_page_fault+0x13d/0x460 > [] ? do_mmap_pgoff+0x3dc/0x430 > [] page_fault+0x1f/0x30 > ---[ end trace 8817670349022007 ]--- > apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0 > apache2 cpuset=uid mems_allowed=0 > Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 > Call Trace: > [] dump_header+0x7e/0x1e0 > [] ? find_lock_task_mm+0x2f/0x70 > [] oom_kill_process+0x85/0x2a0 > [] out_of_memory+0xe5/0x200 > [] pagefault_out_of_memory+0xbd/0x110 > [] mm_fault_error+0xb6/0x1a0 > [] do_page_fault+0x3ee/0x460 > [] ? do_mmap_pgoff+0x3dc/0x430 > [] page_fault+0x1f/0x30 > > The first trace comes from the debugging WARN and it clearly points to > a file fault path. __do_fault pre-charges a page in case we need to > do CoW (copy-on-write) for the returned page. This one falls back to > memcg OOM and never returns ENOMEM as I have mentioned earlier. > However, the fs fault handler (filemap_fault here) can fallback to > page_cache_read if the readahead (do_sync_mmap_readahead) fails > to get page to the page cache. And we can see this happening in > the first trace. page_cache_read then calls add_to_page_cache_lru > and eventually gets to add_to_page_cache_locked which calls > mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should > happen. This ENOMEM gets to the fault handler and kaboom. > > So the fix is really much more complex than I thought. Although > add_to_page_cache_locked sounded like a good place it turned out to be > not in fact. > > We need something more clever appaerently. One way would be not misusing > __GFP_NORETRY for GFP_MEMCG_NO_OOM and give it a real flag. 
We have 32 > bits for those flags in gfp_t so there should be some room there. > Or we could do this per task flag, same we do for NO_IO in the current > -mm tree. > The later one seems easier wrt. gfp_mask passing horror - e.g. > __generic_file_aio_write doesn't pass flags and it can be called from > unlocked contexts as well. Ouch, the PF_ flag space seems to be exhausted already: task_struct::flags is just an unsigned int, so there is only one bit left. I am not sure this is the best use for it. This will be a real pain! -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx112.postini.com [74.125.245.112]) by kanga.kvack.org (Postfix) with SMTP id E7B1F6B0005 for ; Wed, 6 Feb 2013 11:00:54 -0500 (EST) Date: Wed, 6 Feb 2013 17:00:51 +0100 From: Michal Hocko Subject: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set Message-ID: <20130206160051.GG10254@dhcp22.suse.cz> References: <20121230020947.AA002F34@pobox.sk> <20121230110815.GA12940@dhcp22.suse.cz> <20130125160723.FAE73567@pobox.sk> <20130125163130.GF4721@dhcp22.suse.cz> <20130205134937.GA22804@dhcp22.suse.cz> <20130205154947.CD6411E2@pobox.sk> <20130205160934.GB22804@dhcp22.suse.cz> <20130206021721.1AE9E3C7@pobox.sk> <20130206140119.GD10254@dhcp22.suse.cz> <20130206142219.GF10254@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130206142219.GF10254@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Wed 06-02-13 15:22:19, Michal Hocko wrote: > On Wed 06-02-13 15:01:19, Michal Hocko wrote: > > On Wed 06-02-13 02:17:21, azurIt wrote: > > > 
>5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I > > > >mentioned in a follow up email. Here is the full patch: > > > > > > > > > Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]: > > > http://www.watchdog.sk/lkml/oom_mysqld6 > > > > [...] > > WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610() > > Hardware name: S5000VSA > > gfp_mask:4304 nr_pages:1 oom:0 ret:2 > > Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 > > Call Trace: > > [] warn_slowpath_common+0x7a/0xb0 > > [] warn_slowpath_fmt+0x46/0x50 > > [] ? mem_cgroup_margin+0x73/0xa0 > > [] T.1149+0x2d9/0x610 > > [] ? blk_finish_plug+0x18/0x50 > > [] mem_cgroup_cache_charge+0xc4/0xf0 > > [] add_to_page_cache_locked+0x4f/0x140 > > [] add_to_page_cache_lru+0x22/0x50 > > [] filemap_fault+0x252/0x4f0 > > [] __do_fault+0x78/0x5a0 > > [] handle_pte_fault+0x84/0x940 > > [] ? vma_prio_tree_insert+0x30/0x50 > > [] ? vma_link+0x88/0xe0 > > [] handle_mm_fault+0x138/0x260 > > [] do_page_fault+0x13d/0x460 > > [] ? do_mmap_pgoff+0x3dc/0x430 > > [] page_fault+0x1f/0x30 > > ---[ end trace 8817670349022007 ]--- > > apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0 > > apache2 cpuset=uid mems_allowed=0 > > Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 > > Call Trace: > > [] dump_header+0x7e/0x1e0 > > [] ? find_lock_task_mm+0x2f/0x70 > > [] oom_kill_process+0x85/0x2a0 > > [] out_of_memory+0xe5/0x200 > > [] pagefault_out_of_memory+0xbd/0x110 > > [] mm_fault_error+0xb6/0x1a0 > > [] do_page_fault+0x3ee/0x460 > > [] ? do_mmap_pgoff+0x3dc/0x430 > > [] page_fault+0x1f/0x30 > > > > The first trace comes from the debugging WARN and it clearly points to > > a file fault path. __do_fault pre-charges a page in case we need to > > do CoW (copy-on-write) for the returned page. This one falls back to > > memcg OOM and never returns ENOMEM as I have mentioned earlier. 
> > However, the fs fault handler (filemap_fault here) can fallback to > > page_cache_read if the readahead (do_sync_mmap_readahead) fails > > to get page to the page cache. And we can see this happening in > > the first trace. page_cache_read then calls add_to_page_cache_lru > > and eventually gets to add_to_page_cache_locked which calls > > mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should > > happen. This ENOMEM gets to the fault handler and kaboom. > > > > So the fix is really much more complex than I thought. Although > > add_to_page_cache_locked sounded like a good place it turned out to be > > not in fact. > > > > We need something more clever appaerently. One way would be not misusing > > __GFP_NORETRY for GFP_MEMCG_NO_OOM and give it a real flag. We have 32 > > bits for those flags in gfp_t so there should be some room there. > > Or we could do this per task flag, same we do for NO_IO in the current > > -mm tree. > > The later one seems easier wrt. gfp_mask passing horror - e.g. > > __generic_file_aio_write doesn't pass flags and it can be called from > > unlocked contexts as well. > > Ouch, PF_ flags space seem to be drained already because > task_struct::flags is just unsigned int so there is just one bit left. I > am not sure this is the best use for it. This will be a real pain! OK, so this is something that should help you without any risk of false OOMs. I do not believe that something like that would be accepted upstream because it is really heavy. We will need to come up with something more clever for upstream. I have also added a warning which will trigger when the charge fails. If you see too many of those messages then there is something bad going on and the lack of OOM causes userspace to loop without making any progress. So there you go - your personal patch ;) You can drop all other patches. Please note I have only compile tested it.
But it should be pretty trivial to check it is correct --- From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx181.postini.com [74.125.245.181]) by kanga.kvack.org (Postfix) with SMTP id 2F19F6B0005 for ; Thu, 7 Feb 2013 07:31:50 -0500 (EST) Date: Thu, 7 Feb 2013 13:31:40 +0100 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20130207123140.GA15820@dhcp22.suse.cz> References: <20121230020947.AA002F34@pobox.sk> <20121230110815.GA12940@dhcp22.suse.cz> <20130125160723.FAE73567@pobox.sk> <20130125163130.GF4721@dhcp22.suse.cz> <20130205134937.GA22804@dhcp22.suse.cz> <20130205154947.CD6411E2@pobox.sk> <20130205160934.GB22804@dhcp22.suse.cz> <20130206021721.1AE9E3C7@pobox.sk> <20130206140119.GD10254@dhcp22.suse.cz> <51138999.3090006@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <51138999.3090006@jp.fujitsu.com> Sender: owner-linux-mm@kvack.org List-ID: To: Kamezawa Hiroyuki Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , Johannes Weiner On Thu 07-02-13 20:01:45, KAMEZAWA Hiroyuki wrote: > (2013/02/06 23:01), Michal Hocko wrote: > >On Wed 06-02-13 02:17:21, azurIt wrote: > >>>5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I > >>>mentioned in a follow up email. Here is the full patch: > >> > >> > >>Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]: > >>http://www.watchdog.sk/lkml/oom_mysqld6 > > > >[...] > >WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610() > >Hardware name: S5000VSA > >gfp_mask:4304 nr_pages:1 oom:0 ret:2 > >Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 > >Call Trace: > > [] warn_slowpath_common+0x7a/0xb0 > > [] warn_slowpath_fmt+0x46/0x50 > > [] ? mem_cgroup_margin+0x73/0xa0 > > [] T.1149+0x2d9/0x610 > > [] ? 
blk_finish_plug+0x18/0x50 > > [] mem_cgroup_cache_charge+0xc4/0xf0 > > [] add_to_page_cache_locked+0x4f/0x140 > > [] add_to_page_cache_lru+0x22/0x50 > > [] filemap_fault+0x252/0x4f0 > > [] __do_fault+0x78/0x5a0 > > [] handle_pte_fault+0x84/0x940 > > [] ? vma_prio_tree_insert+0x30/0x50 > > [] ? vma_link+0x88/0xe0 > > [] handle_mm_fault+0x138/0x260 > > [] do_page_fault+0x13d/0x460 > > [] ? do_mmap_pgoff+0x3dc/0x430 > > [] page_fault+0x1f/0x30 > >---[ end trace 8817670349022007 ]--- > >apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0 > >apache2 cpuset=uid mems_allowed=0 > >Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 > >Call Trace: > > [] dump_header+0x7e/0x1e0 > > [] ? find_lock_task_mm+0x2f/0x70 > > [] oom_kill_process+0x85/0x2a0 > > [] out_of_memory+0xe5/0x200 > > [] pagefault_out_of_memory+0xbd/0x110 > > [] mm_fault_error+0xb6/0x1a0 > > [] do_page_fault+0x3ee/0x460 > > [] ? do_mmap_pgoff+0x3dc/0x430 > > [] page_fault+0x1f/0x30 > > > >The first trace comes from the debugging WARN and it clearly points to > >a file fault path. __do_fault pre-charges a page in case we need to > >do CoW (copy-on-write) for the returned page. This one falls back to > >memcg OOM and never returns ENOMEM as I have mentioned earlier. > >However, the fs fault handler (filemap_fault here) can fallback to > >page_cache_read if the readahead (do_sync_mmap_readahead) fails > >to get page to the page cache. And we can see this happening in > >the first trace. page_cache_read then calls add_to_page_cache_lru > >and eventually gets to add_to_page_cache_locked which calls > >mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should > >happen. This ENOMEM gets to the fault handler and kaboom. > > > > Hmm. do we need to increase the "limit" virtually at memcg oom until > the oom-killed process dies ? It may be doable by increasing stock->cache > of each cpu....I think kernel can offer extra virtual charge up to > oom-killed process's memory usage..... 
If we can guarantee that the overflow charges do not exceed the memory usage of the killed process, then this would work. The question is how we find out how much we can overflow. move_charge_at_immigrate will play some role, as will the amount of shared memory. I am afraid this would get too complex. Nevertheless the idea is nice. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx124.postini.com [74.125.245.124]) by kanga.kvack.org (Postfix) with SMTP id 7B86F6B0005 for ; Thu, 7 Feb 2013 20:40:52 -0500 (EST) Received: from m4.gw.fujitsu.co.jp (unknown [10.0.50.74]) by fgwmail6.fujitsu.co.jp (Postfix) with ESMTP id 8E4293EE0C0 for ; Fri, 8 Feb 2013 10:40:50 +0900 (JST) Received: from smail (m4 [127.0.0.1]) by outgoing.m4.gw.fujitsu.co.jp (Postfix) with ESMTP id 6A96145DE52 for ; Fri, 8 Feb 2013 10:40:50 +0900 (JST) Received: from s4.gw.fujitsu.co.jp (s4.gw.fujitsu.co.jp [10.0.50.94]) by m4.gw.fujitsu.co.jp (Postfix) with ESMTP id 4D9A645DE4F for ; Fri, 8 Feb 2013 10:40:50 +0900 (JST) Received: from s4.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s4.gw.fujitsu.co.jp (Postfix) with ESMTP id 37C4E1DB803E for ; Fri, 8 Feb 2013 10:40:50 +0900 (JST) Received: from m1000.s.css.fujitsu.com (m1000.s.css.fujitsu.com [10.240.81.136]) by s4.gw.fujitsu.co.jp (Postfix) with ESMTP id DA3721DB802F for ; Fri, 8 Feb 2013 10:40:49 +0900 (JST) Message-ID: <5114577D.70608@jp.fujitsu.com> Date: Fri, 08 Feb 2013 10:40:13 +0900 From: Kamezawa Hiroyuki MIME-Version: 1.0 Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked References: <20121224142526.020165D3@pobox.sk> <20121228162209.GA1455@dhcp22.suse.cz> <20121230020947.AA002F34@pobox.sk> <20121230110815.GA12940@dhcp22.suse.cz> 
<20130125160723.FAE73567@pobox.sk> <20130125163130.GF4721@dhcp22.suse.cz> <20130205134937.GA22804@dhcp22.suse.cz> <20130205154947.CD6411E2@pobox.sk> <20130205160934.GB22804@dhcp22.suse.cz> <20130206021721.1AE9E3C7@pobox.sk> <20130206140119.GD10254@dhcp22.suse.cz> <51138999.3090006@jp.fujitsu.com> In-Reply-To: <51138999.3090006@jp.fujitsu.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , Johannes Weiner (2013/02/07 20:01), Kamezawa Hiroyuki wrote: > (2013/02/06 23:01), Michal Hocko wrote: >> On Wed 06-02-13 02:17:21, azurIt wrote: >>>> 5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I >>>> mentioned in a follow up email. Here is the full patch: >>> >>> >>> Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]: >>> http://www.watchdog.sk/lkml/oom_mysqld6 >> >> [...] >> WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610() >> Hardware name: S5000VSA >> gfp_mask:4304 nr_pages:1 oom:0 ret:2 >> Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 >> Call Trace: >> [] warn_slowpath_common+0x7a/0xb0 >> [] warn_slowpath_fmt+0x46/0x50 >> [] ? mem_cgroup_margin+0x73/0xa0 >> [] T.1149+0x2d9/0x610 >> [] ? blk_finish_plug+0x18/0x50 >> [] mem_cgroup_cache_charge+0xc4/0xf0 >> [] add_to_page_cache_locked+0x4f/0x140 >> [] add_to_page_cache_lru+0x22/0x50 >> [] filemap_fault+0x252/0x4f0 >> [] __do_fault+0x78/0x5a0 >> [] handle_pte_fault+0x84/0x940 >> [] ? vma_prio_tree_insert+0x30/0x50 >> [] ? vma_link+0x88/0xe0 >> [] handle_mm_fault+0x138/0x260 >> [] do_page_fault+0x13d/0x460 >> [] ? 
do_mmap_pgoff+0x3dc/0x430 >> [] page_fault+0x1f/0x30 >> ---[ end trace 8817670349022007 ]--- >> apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0 >> apache2 cpuset=uid mems_allowed=0 >> Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 >> Call Trace: >> [] dump_header+0x7e/0x1e0 >> [] ? find_lock_task_mm+0x2f/0x70 >> [] oom_kill_process+0x85/0x2a0 >> [] out_of_memory+0xe5/0x200 >> [] pagefault_out_of_memory+0xbd/0x110 >> [] mm_fault_error+0xb6/0x1a0 >> [] do_page_fault+0x3ee/0x460 >> [] ? do_mmap_pgoff+0x3dc/0x430 >> [] page_fault+0x1f/0x30 >> >> The first trace comes from the debugging WARN and it clearly points to >> a file fault path. __do_fault pre-charges a page in case we need to >> do CoW (copy-on-write) for the returned page. This one falls back to >> memcg OOM and never returns ENOMEM as I have mentioned earlier. >> However, the fs fault handler (filemap_fault here) can fallback to >> page_cache_read if the readahead (do_sync_mmap_readahead) fails >> to get page to the page cache. And we can see this happening in >> the first trace. page_cache_read then calls add_to_page_cache_lru >> and eventually gets to add_to_page_cache_locked which calls >> mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should >> happen. This ENOMEM gets to the fault handler and kaboom. >> > > Hmm. do we need to increase the "limit" virtually at memcg oom until > the oom-killed process dies ? Here is my naive idea... == From 1a46318cf89e7df94bd4844f29105b61dacf335b Mon Sep 17 00:00:00 2001 From: KAMEZAWA Hiroyuki Date: Fri, 8 Feb 2013 10:43:52 +0900 Subject: [PATCH] [Don't Apply][PATCH] memcg relax resource at OOM situation. When an OOM happens, a task is killed and resources will be freed. A problem here is that a task, which is oom-killed, may wait for some other resource in which memory resource is required. Some thread waits for free memory may holds some mutex and oom-killed process wait for the mutex. 
To avoid this, relaxing charged memory by giving a virtual resource can help. The system can get it back at uncharge(). This is a sample naive implementation. Signed-off-by: KAMEZAWA Hiroyuki --- mm/memcontrol.c | 79 ++++++++++++++++++++++++++++++++++++++++++++++++++----- 1 file changed, 73 insertions(+), 6 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 25ac5f4..4dea49a 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -301,6 +301,9 @@ struct mem_cgroup { /* set when res.limit == memsw.limit */ bool memsw_is_minimum; + /* extra resource at emergency situation */ + unsigned long loan; + spinlock_t loan_lock; /* protect arrays of thresholds */ struct mutex thresholds_lock; @@ -2034,6 +2037,61 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg, mem_cgroup_iter_break(root_memcg, victim); return total; } +/* + * When a memcg is in OOM situation, this lack of resource may cause deadlock + * because of complicated lock dependency(i_mutex...). To avoid that, we + * need extra resource or avoid charging. + * + * A memcg can request resource in an emergency state. We call it a loan. + * A memcg will return a loan when it does uncharge resource. We disallow + * double-loan and moving task to other groups until the loan is fully + * returned. + * + * Note: the problem here is that we cannot know what amount of resource should + * be necessary to exit an emergency state.....
+ */ +#define LOAN_MAX (2 * 1024 * 1024) + +static void mem_cgroup_make_loan(struct mem_cgroup *memcg) +{ + u64 usage; + unsigned long amount; + + amount = LOAN_MAX; + + usage = res_counter_read_u64(&memcg->res, RES_USAGE); + if (amount > usage /2 ) + amount = usage / 2; + spin_lock(&memcg->loan_lock); + if (memcg->loan) { + spin_unlock(&memcg->loan_lock); + return; + } + memcg->loan = amount; + res_counter_uncharge(&memcg->res, amount); + if (do_swap_account) + res_counter_uncharge(&memcg->memsw, amount); + spin_unlock(&memcg->loan_lock); +} + +/* return amount of free resource which can be uncharged */ +static unsigned long +mem_cgroup_may_return_loan(struct mem_cgroup *memcg, unsigned long val) +{ + unsigned long tmp; + /* we don't care small race here */ + if (unlikely(!memcg->loan)) + return val; + spin_lock(&memcg->loan_lock); + if (memcg->loan) { + tmp = min(memcg->loan, val); + memcg->loan -= tmp; + val -= tmp; + } + spin_unlock(&memcg->loan_lock); + return val; +} + /* * Check OOM-Killer is already running under our hierarchy. 
@@ -2182,6 +2240,7 @@ static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask, if (need_to_kill) { finish_wait(&memcg_oom_waitq, &owait.wait); mem_cgroup_out_of_memory(memcg, mask, order); + mem_cgroup_make_loan(memcg); } else { schedule(); finish_wait(&memcg_oom_waitq, &owait.wait); @@ -2748,6 +2807,8 @@ static void __mem_cgroup_cancel_charge(struct mem_cgroup *memcg, if (!mem_cgroup_is_root(memcg)) { unsigned long bytes = nr_pages * PAGE_SIZE; + bytes = mem_cgroup_may_return_loan(memcg, bytes); + res_counter_uncharge(&memcg->res, bytes); if (do_swap_account) res_counter_uncharge(&memcg->memsw, bytes); @@ -3989,6 +4050,7 @@ static void mem_cgroup_do_uncharge(struct mem_cgroup *memcg, { struct memcg_batch_info *batch = NULL; bool uncharge_memsw = true; + unsigned long val; /* If swapout, usage of swap doesn't decrease */ if (!do_swap_account || ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT) @@ -4029,9 +4091,11 @@ static void mem_cgroup_do_uncharge(struct mem_cgroup *memcg, batch->memsw_nr_pages++; return; direct_uncharge: - res_counter_uncharge(&memcg->res, nr_pages * PAGE_SIZE); + val = nr_pages * PAGE_SIZE; + val = mem_cgroup_may_return_loan(memcg, val); + res_counter_uncharge(&memcg->res, val); if (uncharge_memsw) - res_counter_uncharge(&memcg->memsw, nr_pages * PAGE_SIZE); + res_counter_uncharge(&memcg->memsw, val); if (unlikely(batch->memcg != memcg)) memcg_oom_recover(memcg); } @@ -4182,6 +4246,7 @@ void mem_cgroup_uncharge_start(void) void mem_cgroup_uncharge_end(void) { struct memcg_batch_info *batch = &current->memcg_batch; + unsigned long val; if (!batch->do_batch) return; @@ -4192,16 +4257,16 @@ void mem_cgroup_uncharge_end(void) if (!batch->memcg) return; + val = batch->nr_pages * PAGE_SIZE; + val = mem_cgroup_may_return_loan(batch->memcg, val); /* * This "batch->memcg" is valid without any css_get/put etc... * bacause we hide charges behind us.
*/ if (batch->nr_pages) - res_counter_uncharge(&batch->memcg->res, - batch->nr_pages * PAGE_SIZE); + res_counter_uncharge(&batch->memcg->res, val); if (batch->memsw_nr_pages) - res_counter_uncharge(&batch->memcg->memsw, - batch->memsw_nr_pages * PAGE_SIZE); + res_counter_uncharge(&batch->memcg->memsw, val); memcg_oom_recover(batch->memcg); /* forget this pointer (for sanity check) */ batch->memcg = NULL; @@ -6291,6 +6356,8 @@ mem_cgroup_css_alloc(struct cgroup *cont) memcg->move_charge_at_immigrate = 0; mutex_init(&memcg->thresholds_lock); spin_lock_init(&memcg->move_lock); + memcg->loan = 0; + spin_lock_init(&memcg->loan_lock); return &memcg->css; -- 1.7.10.2 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx181.postini.com [74.125.245.181]) by kanga.kvack.org (Postfix) with SMTP id C5A0E6B0008 for ; Thu, 7 Feb 2013 23:16:54 -0500 (EST) Received: from m2.gw.fujitsu.co.jp (unknown [10.0.50.72]) by fgwmail6.fujitsu.co.jp (Postfix) with ESMTP id 2EF593EE0B5 for ; Fri, 8 Feb 2013 13:16:53 +0900 (JST) Received: from smail (m2 [127.0.0.1]) by outgoing.m2.gw.fujitsu.co.jp (Postfix) with ESMTP id 18B8C45DE50 for ; Fri, 8 Feb 2013 13:16:53 +0900 (JST) Received: from s2.gw.fujitsu.co.jp (s2.gw.fujitsu.co.jp [10.0.50.92]) by m2.gw.fujitsu.co.jp (Postfix) with ESMTP id 0227845DE4D for ; Fri, 8 Feb 2013 13:16:53 +0900 (JST) Received: from s2.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s2.gw.fujitsu.co.jp (Postfix) with ESMTP id EAE671DB803A for ; Fri, 8 Feb 2013 13:16:52 +0900 (JST) Received: from m1000.s.css.fujitsu.com (m1000.s.css.fujitsu.com [10.240.81.136]) by s2.gw.fujitsu.co.jp (Postfix) with ESMTP id 97AE91DB8038 for ; Fri, 8 Feb 2013 13:16:52 +0900 (JST) Message-ID: <51147C1B.1000402@jp.fujitsu.com> Date: Fri, 08 Feb 2013 13:16:27 
+0900 From: Kamezawa Hiroyuki MIME-Version: 1.0 Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked References: <20121230020947.AA002F34@pobox.sk> <20121230110815.GA12940@dhcp22.suse.cz> <20130125160723.FAE73567@pobox.sk> <20130125163130.GF4721@dhcp22.suse.cz> <20130205134937.GA22804@dhcp22.suse.cz> <20130205154947.CD6411E2@pobox.sk> <20130205160934.GB22804@dhcp22.suse.cz> <20130206021721.1AE9E3C7@pobox.sk> <20130206140119.GD10254@dhcp22.suse.cz> <51138999.3090006@jp.fujitsu.com> <20130207123140.GA15820@dhcp22.suse.cz> In-Reply-To: <20130207123140.GA15820@dhcp22.suse.cz> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , Johannes Weiner (2013/02/07 21:31), Michal Hocko wrote: > On Thu 07-02-13 20:01:45, KAMEZAWA Hiroyuki wrote: >> (2013/02/06 23:01), Michal Hocko wrote: >>> On Wed 06-02-13 02:17:21, azurIt wrote: >>>>> 5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I >>>>> mentioned in a follow up email. Here is the full patch: >>>> >>>> >>>> Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]: >>>> http://www.watchdog.sk/lkml/oom_mysqld6 >>> >>> [...] >>> WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610() >>> Hardware name: S5000VSA >>> gfp_mask:4304 nr_pages:1 oom:0 ret:2 >>> Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 >>> Call Trace: >>> [] warn_slowpath_common+0x7a/0xb0 >>> [] warn_slowpath_fmt+0x46/0x50 >>> [] ? mem_cgroup_margin+0x73/0xa0 >>> [] T.1149+0x2d9/0x610 >>> [] ? blk_finish_plug+0x18/0x50 >>> [] mem_cgroup_cache_charge+0xc4/0xf0 >>> [] add_to_page_cache_locked+0x4f/0x140 >>> [] add_to_page_cache_lru+0x22/0x50 >>> [] filemap_fault+0x252/0x4f0 >>> [] __do_fault+0x78/0x5a0 >>> [] handle_pte_fault+0x84/0x940 >>> [] ? vma_prio_tree_insert+0x30/0x50 >>> [] ? 
vma_link+0x88/0xe0 >>> [] handle_mm_fault+0x138/0x260 >>> [] do_page_fault+0x13d/0x460 >>> [] ? do_mmap_pgoff+0x3dc/0x430 >>> [] page_fault+0x1f/0x30 >>> ---[ end trace 8817670349022007 ]--- >>> apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0 >>> apache2 cpuset=uid mems_allowed=0 >>> Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 >>> Call Trace: >>> [] dump_header+0x7e/0x1e0 >>> [] ? find_lock_task_mm+0x2f/0x70 >>> [] oom_kill_process+0x85/0x2a0 >>> [] out_of_memory+0xe5/0x200 >>> [] pagefault_out_of_memory+0xbd/0x110 >>> [] mm_fault_error+0xb6/0x1a0 >>> [] do_page_fault+0x3ee/0x460 >>> [] ? do_mmap_pgoff+0x3dc/0x430 >>> [] page_fault+0x1f/0x30 >>> >>> The first trace comes from the debugging WARN and it clearly points to >>> a file fault path. __do_fault pre-charges a page in case we need to >>> do CoW (copy-on-write) for the returned page. This one falls back to >>> memcg OOM and never returns ENOMEM as I have mentioned earlier. >>> However, the fs fault handler (filemap_fault here) can fallback to >>> page_cache_read if the readahead (do_sync_mmap_readahead) fails >>> to get page to the page cache. And we can see this happening in >>> the first trace. page_cache_read then calls add_to_page_cache_lru >>> and eventually gets to add_to_page_cache_locked which calls >>> mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should >>> happen. This ENOMEM gets to the fault handler and kaboom. >>> >> >> Hmm. do we need to increase the "limit" virtually at memcg oom until >> the oom-killed process dies ? It may be doable by increasing stock->cache >> of each cpu....I think kernel can offer extra virtual charge up to >> oom-killed process's memory usage..... > > If we can guarantee that the overflow charges do not exceed the memory > usage of the killed process then this would work. The question is, how > do we find out how much we can overflow. immigrate_on_move will play > some role as well as the amount of the shared memory. 
I am afraid this > would get too complex. Nevertheless the idea is nice. > Yes, that's the problem. If we don't do it the correct way, resource usage underflow can happen. I guess we can count it per task_struct at charging page-faulted anon pages. _Or_ in other consideration, for example, we do charge 1MB per thread regardless of its memory usage. And use it as a security at OOM-killing. Implementation will be easy but explanation may be difficult. Thanks, -Kame From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx104.postini.com [74.125.245.104]) by kanga.kvack.org (Postfix) with SMTP id 85B246B000A for ; Fri, 8 Feb 2013 00:03:10 -0500 (EST) Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_if_PF=5FNO=5FMEMCG=5FOOM_is_set?= Date: Fri, 08 Feb 2013 06:03:04 +0100 From: "azurIt" References: <20121230020947.AA002F34@pobox.sk>, <20121230110815.GA12940@dhcp22.suse.cz>, <20130125160723.FAE73567@pobox.sk>, <20130125163130.GF4721@dhcp22.suse.cz>, <20130205134937.GA22804@dhcp22.suse.cz>, <20130205154947.CD6411E2@pobox.sk>, <20130205160934.GB22804@dhcp22.suse.cz>, <20130206021721.1AE9E3C7@pobox.sk>, <20130206140119.GD10254@dhcp22.suse.cz>, <20130206142219.GF10254@dhcp22.suse.cz> <20130206160051.GG10254@dhcp22.suse.cz> In-Reply-To: <20130206160051.GG10254@dhcp22.suse.cz> MIME-Version: 1.0 Message-Id: <20130208060304.799F362F@pobox.sk> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org List-ID: To: =?utf-8?q?Michal_Hocko?= Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= Michal, thank you very much but it just didn't work and broke everything :( This happened: Problem
started to occur really often immediately after booting the new kernel, every few minutes for one of my users. But everything other seems to work fine so i gave it a try for a day (which was a mistake). I grabbed some data for you and go to sleep: http://watchdog.sk/lkml/memcg-bug-4.tar.gz Few hours later i was woke up from my sweet sweet dreams by alerts smses - Apache wasn't working and our system failed to restart it. When i observed the situation, two apache processes (of that user as above) were still running and it wasn't possible to kill them by any way. I grabbed some data for you: http://watchdog.sk/lkml/memcg-bug-5.tar.gz Then I logged to the console and this was waiting for me: http://watchdog.sk/lkml/error.jpg Finally i rebooted into different kernel, wrote this e-mail and go to my lovely bed ;) ______________________________________________________________ > Od: "Michal Hocko" > Komu: azurIt > DA!tum: 06.02.2013 17:00 > Predmet: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set > > CC: linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , "Johannes Weiner" >On Wed 06-02-13 15:22:19, Michal Hocko wrote: >> On Wed 06-02-13 15:01:19, Michal Hocko wrote: >> > On Wed 06-02-13 02:17:21, azurIt wrote: >> > > >5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I >> > > >mentioned in a follow up email. Here is the full patch: >> > > >> > > >> > > Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]: >> > > http://www.watchdog.sk/lkml/oom_mysqld6 >> > >> > [...] >> > WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610() >> > Hardware name: S5000VSA >> > gfp_mask:4304 nr_pages:1 oom:0 ret:2 >> > Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 >> > Call Trace: >> > [] warn_slowpath_common+0x7a/0xb0 >> > [] warn_slowpath_fmt+0x46/0x50 >> > [] ? mem_cgroup_margin+0x73/0xa0 >> > [] T.1149+0x2d9/0x610 >> > [] ? 
blk_finish_plug+0x18/0x50 >> > [] mem_cgroup_cache_charge+0xc4/0xf0 >> > [] add_to_page_cache_locked+0x4f/0x140 >> > [] add_to_page_cache_lru+0x22/0x50 >> > [] filemap_fault+0x252/0x4f0 >> > [] __do_fault+0x78/0x5a0 >> > [] handle_pte_fault+0x84/0x940 >> > [] ? vma_prio_tree_insert+0x30/0x50 >> > [] ? vma_link+0x88/0xe0 >> > [] handle_mm_fault+0x138/0x260 >> > [] do_page_fault+0x13d/0x460 >> > [] ? do_mmap_pgoff+0x3dc/0x430 >> > [] page_fault+0x1f/0x30 >> > ---[ end trace 8817670349022007 ]--- >> > apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0 >> > apache2 cpuset=uid mems_allowed=0 >> > Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 >> > Call Trace: >> > [] dump_header+0x7e/0x1e0 >> > [] ? find_lock_task_mm+0x2f/0x70 >> > [] oom_kill_process+0x85/0x2a0 >> > [] out_of_memory+0xe5/0x200 >> > [] pagefault_out_of_memory+0xbd/0x110 >> > [] mm_fault_error+0xb6/0x1a0 >> > [] do_page_fault+0x3ee/0x460 >> > [] ? do_mmap_pgoff+0x3dc/0x430 >> > [] page_fault+0x1f/0x30 >> > >> > The first trace comes from the debugging WARN and it clearly points to >> > a file fault path. __do_fault pre-charges a page in case we need to >> > do CoW (copy-on-write) for the returned page. This one falls back to >> > memcg OOM and never returns ENOMEM as I have mentioned earlier. >> > However, the fs fault handler (filemap_fault here) can fallback to >> > page_cache_read if the readahead (do_sync_mmap_readahead) fails >> > to get page to the page cache. And we can see this happening in >> > the first trace. page_cache_read then calls add_to_page_cache_lru >> > and eventually gets to add_to_page_cache_locked which calls >> > mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should >> > happen. This ENOMEM gets to the fault handler and kaboom. >> > >> > So the fix is really much more complex than I thought. Although >> > add_to_page_cache_locked sounded like a good place it turned out to be >> > not in fact. 
>> > >> > We need something more clever appaerently. One way would be not misusing >> > __GFP_NORETRY for GFP_MEMCG_NO_OOM and give it a real flag. We have 32 >> > bits for those flags in gfp_t so there should be some room there. >> > Or we could do this per task flag, same we do for NO_IO in the current >> > -mm tree. >> > The later one seems easier wrt. gfp_mask passing horror - e.g. >> > __generic_file_aio_write doesn't pass flags and it can be called from >> > unlocked contexts as well. >> >> Ouch, PF_ flags space seem to be drained already because >> task_struct::flags is just unsigned int so there is just one bit left. I >> am not sure this is the best use for it. This will be a real pain! > >OK, so this something that should help you without any risk of false >OOMs. I do not believe that something like that would be accepted >upstream because it is really heavy. We will need to come up with >something more clever for upstream. >I have also added a warning which will trigger when the charge fails. If >you see too many of those messages then there is something bad going on >and the lack of OOM causes userspace to loop without getting any >progress. > >So there you go - your personal patch ;) You can drop all other patches. >Please note I have just compile tested it. But it should be pretty >trivial to check it is correct >--- >>From 6f155187f77c971b45caf05dbc80ca9c20bc278c Mon Sep 17 00:00:00 2001 >From: Michal Hocko >Date: Wed, 6 Feb 2013 16:45:07 +0100 >Subject: [PATCH 1/2] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set > >memcg oom killer might deadlock if the process which falls down to >mem_cgroup_handle_oom holds a lock which prevents other task to >terminate because it is blocked on the very same lock. >This can happen when a write system call needs to allocate a page but >the allocation hits the memcg hard limit and there is nothing to reclaim >(e.g. 
there is no swap or swap limit is hit as well and all cache pages >have been reclaimed already) and the process selected by memcg OOM >killer is blocked on i_mutex on the same inode (e.g. truncate it). > >Process A >[] do_truncate+0x58/0xa0 # takes i_mutex >[] do_last+0x250/0xa30 >[] path_openat+0xd7/0x440 >[] do_filp_open+0x49/0xa0 >[] do_sys_open+0x106/0x240 >[] sys_open+0x20/0x30 >[] system_call_fastpath+0x18/0x1d >[] 0xffffffffffffffff > >Process B >[] mem_cgroup_handle_oom+0x241/0x3b0 >[] T.1146+0x5ab/0x5c0 >[] mem_cgroup_cache_charge+0xbe/0xe0 >[] add_to_page_cache_locked+0x4c/0x140 >[] add_to_page_cache_lru+0x22/0x50 >[] grab_cache_page_write_begin+0x8b/0xe0 >[] ext3_write_begin+0x88/0x270 >[] generic_file_buffered_write+0x116/0x290 >[] __generic_file_aio_write+0x27c/0x480 >[] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex >[] do_sync_write+0xea/0x130 >[] vfs_write+0xf3/0x1f0 >[] sys_write+0x51/0x90 >[] system_call_fastpath+0x18/0x1d >[] 0xffffffffffffffff > >This is not a hard deadlock though because administrator can still >intervene and increase the limit on the group which helps the writer to >finish the allocation and release the lock. > >This patch heals the problem by forbidding OOM from dangerous context. >Memcg charging code has no way to find out whether it is called from a >locked context we have to help it via process flags. PF_OOM_ORIGIN flag >removed recently will be reused for PF_NO_MEMCG_OOM which signals that >the memcg OOM killer could lead to a deadlock. >Only locked callers of __generic_file_aio_write are currently marked. I >am pretty sure there are more places (I didn't check shmem and hugetlb >uses fancy instantion mutex during page fault and filesystems might >use some locks during the write) but I've ignored those as this will >probably be just a user specific patch without any way to get upstream >in the current form. 
> >Reported-by: azurIt >Signed-off-by: Michal Hocko >--- > drivers/staging/pohmelfs/inode.c | 2 ++ > include/linux/sched.h | 1 + > mm/filemap.c | 2 ++ > mm/memcontrol.c | 18 ++++++++++++++---- > 4 files changed, 19 insertions(+), 4 deletions(-) > >diff --git a/drivers/staging/pohmelfs/inode.c b/drivers/staging/pohmelfs/inode.c >index 7a19555..523de82e 100644 >--- a/drivers/staging/pohmelfs/inode.c >+++ b/drivers/staging/pohmelfs/inode.c >@@ -921,7 +921,9 @@ ssize_t pohmelfs_write(struct file *file, const char __user *buf, > if (ret) > goto err_out_unlock; > >+ current->flags |= PF_NO_MEMCG_OOM; > ret = __generic_file_aio_write(&kiocb, &iov, 1, &kiocb.ki_pos); >+ current->flags &= ~PF_NO_MEMCG_OOM; > *ppos = kiocb.ki_pos; > > mutex_unlock(&inode->i_mutex); >diff --git a/include/linux/sched.h b/include/linux/sched.h >index 1e86bb4..f275c8f 100644 >--- a/include/linux/sched.h >+++ b/include/linux/sched.h >@@ -1781,6 +1781,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t * > #define PF_FROZEN 0x00010000 /* frozen for system suspend */ > #define PF_FSTRANS 0x00020000 /* inside a filesystem transaction */ > #define PF_KSWAPD 0x00040000 /* I am kswapd */ >+#define PF_NO_MEMCG_OOM 0x00080000 /* Memcg OOM could lead to a deadlock */ > #define PF_LESS_THROTTLE 0x00100000 /* Throttle me less: I clean memory */ > #define PF_KTHREAD 0x00200000 /* I am a kernel thread */ > #define PF_RANDOMIZE 0x00400000 /* randomize virtual address space */ >diff --git a/mm/filemap.c b/mm/filemap.c >index 556858c..58a316b 100644 >--- a/mm/filemap.c >+++ b/mm/filemap.c >@@ -2617,7 +2617,9 @@ ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov, > > mutex_lock(&inode->i_mutex); > blk_start_plug(&plug); >+ current->flags |= PF_NO_MEMCG_OOM; > ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos); >+ current->flags &= ~PF_NO_MEMCG_OOM; > mutex_unlock(&inode->i_mutex); > > if (ret > 0 || ret == -EIOCBQUEUED) { >diff --git 
a/mm/memcontrol.c b/mm/memcontrol.c >index c8425b1..128b615 100644 >--- a/mm/memcontrol.c >+++ b/mm/memcontrol.c >@@ -2397,6 +2397,14 @@ done: > return 0; > nomem: > *ptr = NULL; >+ if (printk_ratelimit()) >+ printk(KERN_WARNING"%s: task:%s pid:%d got ENOMEM without OOM for memcg:%p." >+ " If this message shows up very often for the" >+ " same task then there is a risk that the" >+ " process is not able to make any progress" >+ " because of the current limit. Try to enlarge" >+ " the hard limit.\n", __FUNCTION__, >+ current->comm, current->pid, memcg); > return -ENOMEM; > bypass: > *ptr = NULL; >@@ -2703,7 +2711,7 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm, > struct mem_cgroup *memcg = NULL; > unsigned int nr_pages = 1; > struct page_cgroup *pc; >- bool oom = true; >+ bool oom = !(current->flags & PF_NO_MEMCG_OOM); > int ret; > > if (PageTransHuge(page)) { >@@ -2770,6 +2778,7 @@ __mem_cgroup_commit_charge_lrucare(struct page *page, struct mem_cgroup *memcg, > int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, > gfp_t gfp_mask) > { >+ bool oom = !(current->flags & PF_NO_MEMCG_OOM); > struct mem_cgroup *memcg = NULL; > int ret; > >@@ -2782,7 +2791,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, > mm = &init_mm; > > if (page_is_file_cache(page)) { >- ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, &memcg, true); >+ ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, &memcg, oom); > if (ret || !memcg) > return ret; > >@@ -2818,6 +2827,7 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm, > struct page *page, > gfp_t mask, struct mem_cgroup **ptr) > { >+ bool oom = !(current->flags & PF_NO_MEMCG_OOM); > struct mem_cgroup *memcg; > int ret; > >@@ -2840,13 +2850,13 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm, > if (!memcg) > goto charge_cur_mm; > *ptr = memcg; >- ret = __mem_cgroup_try_charge(NULL, mask, 1, ptr, true); >+ ret = __mem_cgroup_try_charge(NULL, mask, 1, ptr, oom); > 
css_put(&memcg->css); > return ret; > charge_cur_mm: > if (unlikely(!mm)) > mm = &init_mm; >- return __mem_cgroup_try_charge(mm, mask, 1, ptr, true); >+ return __mem_cgroup_try_charge(mm, mask, 1, ptr, oom); > } > > static void >-- >1.7.10.4 > >-- >Michal Hocko >SUSE Labs > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx194.postini.com [74.125.245.194]) by kanga.kvack.org (Postfix) with SMTP id 98C556B0005 for ; Fri, 8 Feb 2013 07:38:57 -0500 (EST) Date: Fri, 8 Feb 2013 13:38:54 +0100 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set Message-ID: <20130208123854.GB7557@dhcp22.suse.cz> References: <20130205134937.GA22804@dhcp22.suse.cz> <20130205154947.CD6411E2@pobox.sk> <20130205160934.GB22804@dhcp22.suse.cz> <20130206021721.1AE9E3C7@pobox.sk> <20130206140119.GD10254@dhcp22.suse.cz> <20130206142219.GF10254@dhcp22.suse.cz> <20130206160051.GG10254@dhcp22.suse.cz> <20130208060304.799F362F@pobox.sk> <20130208094420.GA7557@dhcp22.suse.cz> <20130208120249.FD733220@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130208120249.FD733220@pobox.sk> Sender: owner-linux-mm@kvack.org List-ID: To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Fri 08-02-13 12:02:49, azurIt wrote: > > > >Do you have logs from that time period? > > > >I have only glanced through the stacks and most of the threads are > >waiting in the mem_cgroup_handle_oom (mostly from the page fault path > >where we do not have other options than waiting) which suggests that > >your memory limit is seriously underestimated. 
If you look at the number > >of charging failures (memory.failcnt per-group file) then you will get > >9332083 failures in _average_ per group. This is a lot! > >Not all those failures end with OOM, of course. But it clearly signals > >that the workload need much more memory than the limit allows. > > > What type of logs? I have all. kernel log would be sufficient. > Memory usage graph: > http://www.watchdog.sk/lkml/memory2.png > > New kernel was booted about 1:15. Data in memcg-bug-4.tar.gz were taken about 2:35 and data in memcg-bug-5.tar.gz about 5:25. There was always lots of free memory. Higher memory consumption between 3:39 and 5:33 was caused by data backup and was completed few minutes before i restarted the server (this was just a coincidence). > > > > >There are only 5 groups in this one and all of them have no memory > >charged (so no OOM going on). All tasks are somewhere in the ptrace > >code. > > > It's all from the same cgroup but from different time. > > > > >grep cache -r . > >./1360297489/memory.stat:cache 0 > >./1360297489/memory.stat:total_cache 65642496 > >./1360297491/memory.stat:cache 0 > >./1360297491/memory.stat:total_cache 65642496 > >./1360297492/memory.stat:cache 0 > >./1360297492/memory.stat:total_cache 65642496 > >./1360297490/memory.stat:cache 0 > >./1360297490/memory.stat:total_cache 65642496 > >./1360297488/memory.stat:cache 0 > >./1360297488/memory.stat:total_cache 65642496 > > > >which suggests that this is a parent group and the memory is charged in > >a child group. I guess that all those are under OOM as the number seems > >like they have limit at 62M. > > > The cgroup has limit 330M (346030080 bytes). This limit is for top level groups, right? Those seem to children which have 62MB charged - is that a limit for those children? > As i said, these two processes Which are those two processes? > were stucked and was impossible to kill them. 
They were, > maybe, the processes which i was trying to 'strace' before - 'strace' > was freezed as always when the cgroup has this problem and i killed it > (i was just trying if it is the original cgroup problem). I have no idea what is the strace role here. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx160.postini.com [74.125.245.160]) by kanga.kvack.org (Postfix) with SMTP id 6A9206B0005 for ; Fri, 8 Feb 2013 09:47:23 -0500 (EST) Date: Fri, 8 Feb 2013 15:47:20 +0100 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set Message-ID: <20130208144720.GC7557@dhcp22.suse.cz> References: <20130205160934.GB22804@dhcp22.suse.cz> <20130206021721.1AE9E3C7@pobox.sk> <20130206140119.GD10254@dhcp22.suse.cz> <20130206142219.GF10254@dhcp22.suse.cz> <20130206160051.GG10254@dhcp22.suse.cz> <20130208060304.799F362F@pobox.sk> <20130208094420.GA7557@dhcp22.suse.cz> <20130208120249.FD733220@pobox.sk> <20130208123854.GB7557@dhcp22.suse.cz> <20130208145616.FB78CE24@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130208145616.FB78CE24@pobox.sk> Sender: owner-linux-mm@kvack.org List-ID: To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Fri 08-02-13 14:56:16, azurIt wrote: > Data are inside memcg-bug-5.tar.gz in directories bug/// ohh, I didn't get those were timestamp directories. It makes more sense now. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx198.postini.com [74.125.245.198]) by kanga.kvack.org (Postfix) with SMTP id 491936B0005 for ; Fri, 8 Feb 2013 10:24:05 -0500 (EST) Date: Fri, 8 Feb 2013 16:24:02 +0100 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set Message-ID: <20130208152402.GD7557@dhcp22.suse.cz> References: <20130205160934.GB22804@dhcp22.suse.cz> <20130206021721.1AE9E3C7@pobox.sk> <20130206140119.GD10254@dhcp22.suse.cz> <20130206142219.GF10254@dhcp22.suse.cz> <20130206160051.GG10254@dhcp22.suse.cz> <20130208060304.799F362F@pobox.sk> <20130208094420.GA7557@dhcp22.suse.cz> <20130208120249.FD733220@pobox.sk> <20130208123854.GB7557@dhcp22.suse.cz> <20130208145616.FB78CE24@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130208145616.FB78CE24@pobox.sk> Sender: owner-linux-mm@kvack.org List-ID: To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Fri 08-02-13 14:56:16, azurIt wrote: > >kernel log would be sufficient. > > > Full kernel log from kernel with you newest patch: > http://watchdog.sk/lkml/kern2.log OK, so the log says that there is a little slaughter on your yard: $ grep "Memory cgroup out of memory:" kern2.log | wc -l 220 $ grep "Memory cgroup out of memory:" kern2.log | sed 's@.*Kill process \([0-9]*\) .*@\1@' | sort -u | wc -l 220 Which means that the oom killer didn't try to kill any task more than once which is good because it tells us that the killed task manages to die before we trigger oom again. So this is definitely not a deadlock. You are just hitting OOM very often. 
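The two grep pipelines above compare the total number of memcg OOM kill messages with the number of distinct PIDs in them. A minimal Python sketch of the same check follows, in case it is useful for larger logs. The function name summarize_oom_kills and the sample lines are made up for illustration, and the regex only assumes the "Memory cgroup out of memory: Kill process <pid>" prefix that the grep/sed commands match. PIDs can be recycled, so equal counts are a strong hint rather than proof that no task was killed twice.

```python
import re
from collections import Counter

# Same prefix the grep/sed pipeline keys on; the PID is the first number after it.
KILL_RE = re.compile(r"Memory cgroup out of memory: Kill process (\d+)")


def summarize_oom_kills(lines):
    """Return (total kill messages, distinct PIDs, sorted PIDs seen more than once)."""
    kills = Counter()
    for line in lines:
        match = KILL_RE.search(line)
        if match:
            kills[int(match.group(1))] += 1
    total = sum(kills.values())
    repeated = sorted(pid for pid, count in kills.items() if count > 1)
    return total, len(kills), repeated


if __name__ == "__main__":
    # Hypothetical sample lines, not taken from kern2.log.
    sample_log = [
        "[  431.0] Memory cgroup out of memory: Kill process 6733 (apache2) score 100 or sacrifice child",
        "[  773.5] Memory cgroup out of memory: Kill process 12092 (apache2) score 90 or sacrifice child",
        "[  800.0] apache2 invoked oom-killer: gfp_mask=0x0, order=0",
    ]
    total, unique, repeated = summarize_oom_kills(sample_log)
    print("kills:", total, "unique pids:", unique, "killed more than once:", repeated)
```

For a real log, passing an open file object (for example open("kern2.log")) as lines should work as well, since the function only iterates over it. An empty "killed more than once" list corresponds to the 220 == 220 result above.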
$ grep "killed as a result of limit" kern2.log | sed 's@.*\] @@' | sort | uniq -c | sort -k1 -n
      1 Task in /1091/uid killed as a result of limit of /1091
      1 Task in /1223/uid killed as a result of limit of /1223
      1 Task in /1229/uid killed as a result of limit of /1229
      1 Task in /1255/uid killed as a result of limit of /1255
      1 Task in /1424/uid killed as a result of limit of /1424
      1 Task in /1470/uid killed as a result of limit of /1470
      1 Task in /1567/uid killed as a result of limit of /1567
      2 Task in /1080/uid killed as a result of limit of /1080
      3 Task in /1381/uid killed as a result of limit of /1381
      4 Task in /1185/uid killed as a result of limit of /1185
      4 Task in /1289/uid killed as a result of limit of /1289
      4 Task in /1709/uid killed as a result of limit of /1709
      5 Task in /1279/uid killed as a result of limit of /1279
      6 Task in /1020/uid killed as a result of limit of /1020
      6 Task in /1527/uid killed as a result of limit of /1527
      9 Task in /1388/uid killed as a result of limit of /1388
     17 Task in /1281/uid killed as a result of limit of /1281
     22 Task in /1599/uid killed as a result of limit of /1599
     30 Task in /1155/uid killed as a result of limit of /1155
     31 Task in /1258/uid killed as a result of limit of /1258
     71 Task in /1293/uid killed as a result of limit of /1293

So the group 1293 suffers the most. I would check how much memory the workload in the group really needs because this level of OOM cannot possibly be healthy.

The log also says that the deadlock prevention implemented by the patch triggered and some writes really failed due to potential OOM:

$ grep "If this message shows up" kern2.log
Feb 8 01:17:10 server01 kernel: [ 431.033593] __mem_cgroup_try_charge: task:apache2 pid:6733 got ENOMEM without OOM for memcg:ffff8803807d5600. If this message shows up very often for the same task then there is a risk that the process is not able to make any progress because of the current limit. Try to enlarge the hard limit.
Feb 8 01:22:52 server01 kernel: [ 773.556782] __mem_cgroup_try_charge: task:apache2 pid:12092 got ENOMEM without OOM for memcg:ffff8803807d5600. If this message shows up very often for the same task then there is a risk that the process is not able to make any progress because of the current limit. Try to enlarge the hard limit. Feb 8 01:22:52 server01 kernel: [ 773.567916] __mem_cgroup_try_charge: task:apache2 pid:12093 got ENOMEM without OOM for memcg:ffff8803807d5600. If this message shows up very often for the same task then there is a risk that the process is not able to make any progress because of the current limit. Try to enlarge the hard limit. Feb 8 01:29:00 server01 kernel: [ 1141.355693] __mem_cgroup_try_charge: task:apache2 pid:17734 got ENOMEM without OOM for memcg:ffff88036e956e00. If this message shows up very often for the same task then there is a risk that the process is not able to make any progress because of the current limit. Try to enlarge the hard limit. Feb 8 03:30:39 server01 kernel: [ 8440.346811] __mem_cgroup_try_charge: task:apache2 pid:8687 got ENOMEM without OOM for memcg:ffff8803654d6e00. If this message shows up very often for the same task then there is a risk that the process is not able to make any progress because of the current limit. Try to enlarge the hard limit. This doesn't look very unhealthy. I have expected that write would fail more often but it seems that the biggest memory pressure comes from mmaps and page faults which have no way other than OOM. So my suggestion would be to reconsider limits for groups to provide more realistical environment. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx173.postini.com [74.125.245.173]) by kanga.kvack.org (Postfix) with SMTP id D2BBF6B0005 for ; Fri, 8 Feb 2013 10:58:07 -0500 (EST) Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_if_PF=5FNO=5FMEMCG=5FOOM_is_set?= Date: Fri, 08 Feb 2013 16:58:05 +0100 From: "azurIt" References: <20130205160934.GB22804@dhcp22.suse.cz>, <20130206021721.1AE9E3C7@pobox.sk>, <20130206140119.GD10254@dhcp22.suse.cz>, <20130206142219.GF10254@dhcp22.suse.cz>, <20130206160051.GG10254@dhcp22.suse.cz>, <20130208060304.799F362F@pobox.sk>, <20130208094420.GA7557@dhcp22.suse.cz>, <20130208120249.FD733220@pobox.sk>, <20130208123854.GB7557@dhcp22.suse.cz>, <20130208145616.FB78CE24@pobox.sk> <20130208152402.GD7557@dhcp22.suse.cz> In-Reply-To: <20130208152402.GD7557@dhcp22.suse.cz> MIME-Version: 1.0 Message-Id: <20130208165805.8908B143@pobox.sk> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org List-ID: To: =?utf-8?q?Michal_Hocko?= Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= >Which means that the oom killer didn't try to kill any task more than >once which is good because it tells us that the killed task manages to >die before we trigger oom again. So this is definitely not a deadlock. >You are just hitting OOM very often. 
>$ grep "killed as a result of limit" kern2.log | sed 's@.*\] @@' | sort | uniq -c | sort -k1 -n > 1 Task in /1091/uid killed as a result of limit of /1091 > 1 Task in /1223/uid killed as a result of limit of /1223 > 1 Task in /1229/uid killed as a result of limit of /1229 > 1 Task in /1255/uid killed as a result of limit of /1255 > 1 Task in /1424/uid killed as a result of limit of /1424 > 1 Task in /1470/uid killed as a result of limit of /1470 > 1 Task in /1567/uid killed as a result of limit of /1567 > 2 Task in /1080/uid killed as a result of limit of /1080 > 3 Task in /1381/uid killed as a result of limit of /1381 > 4 Task in /1185/uid killed as a result of limit of /1185 > 4 Task in /1289/uid killed as a result of limit of /1289 > 4 Task in /1709/uid killed as a result of limit of /1709 > 5 Task in /1279/uid killed as a result of limit of /1279 > 6 Task in /1020/uid killed as a result of limit of /1020 > 6 Task in /1527/uid killed as a result of limit of /1527 > 9 Task in /1388/uid killed as a result of limit of /1388 > 17 Task in /1281/uid killed as a result of limit of /1281 > 22 Task in /1599/uid killed as a result of limit of /1599 > 30 Task in /1155/uid killed as a result of limit of /1155 > 31 Task in /1258/uid killed as a result of limit of /1258 > 71 Task in /1293/uid killed as a result of limit of /1293 > >So the group 1293 suffers the most. I would check how much memory the >workload in the group really needs because this level of OOM cannot >possibly be healthy.
I took the kernel log from yesterday from the same time frame: $ grep "killed as a result of limit" kern2.log | sed 's@.*\] @@' | sort | uniq -c | sort -k1 -n 1 Task in /1252/uid killed as a result of limit of /1252 1 Task in /1709/uid killed as a result of limit of /1709 2 Task in /1185/uid killed as a result of limit of /1185 2 Task in /1388/uid killed as a result of limit of /1388 2 Task in /1567/uid killed as a result of limit of /1567 2 Task in /1650/uid killed as a result of limit of /1650 3 Task in /1527/uid killed as a result of limit of /1527 5 Task in /1552/uid killed as a result of limit of /1552 1634 Task in /1258/uid killed as a result of limit of /1258 As you can see, there were much more OOM in '1258' and no such problems like this night (well, there were never such problems before :) ). As i said, cgroup 1258 were freezing every few minutes with your latest patch so there must be something wrong (it usually freezes about once per day). And it was really freezed (i checked that), the sypthoms were: - cannot strace any of cgroup processes - no new processes were started, still the same processes were 'running' - kernel was unable to resolve this by it's own - all processes togather were taking 100% CPU - the whole memory limit was used (see memcg-bug-4.tar.gz for more info) Unfortunately i forget to check if killing only few of the processes will resolve it (i always killed them all yesterday night). Don't know if is was in deadlock or not but kernel was definitely unable to resolve the problem. And there is still a mystery of two freezed processes which cannot be killed. By the way, i KNOW that so much OOM is not healthy but the client simply don't want to buy more memory. He knows about the problem of unsufficient memory limit. Thank you. azur -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx142.postini.com [74.125.245.142]) by kanga.kvack.org (Postfix) with SMTP id 33CEC6B0005 for ; Fri, 8 Feb 2013 11:01:23 -0500 (EST) Date: Fri, 8 Feb 2013 17:01:19 +0100 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20130208160119.GE7557@dhcp22.suse.cz> References: <20121230110815.GA12940@dhcp22.suse.cz> <20130125160723.FAE73567@pobox.sk> <20130125163130.GF4721@dhcp22.suse.cz> <20130205134937.GA22804@dhcp22.suse.cz> <20130205154947.CD6411E2@pobox.sk> <20130205160934.GB22804@dhcp22.suse.cz> <20130206021721.1AE9E3C7@pobox.sk> <20130206140119.GD10254@dhcp22.suse.cz> <51138999.3090006@jp.fujitsu.com> <5114577D.70608@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <5114577D.70608@jp.fujitsu.com> Sender: owner-linux-mm@kvack.org List-ID: To: Kamezawa Hiroyuki Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , Johannes Weiner On Fri 08-02-13 10:40:13, KAMEZAWA Hiroyuki wrote: > (2013/02/07 20:01), Kamezawa Hiroyuki wrote: [...] > >Hmm. do we need to increase the "limit" virtually at memcg oom until > >the oom-killed process dies ? > > Here is my naive idea... and the next step would be http://en.wikipedia.org/wiki/Credit_default_swap :P But seriously now. The idea is not bad at all. This implementation would need some tweaks to work though (e.g. you would need to wake oom sleepers when you get a loan - because those are ones which can block the resource). We should also give the borrowed charges only to those who would oom to prevent from stealing. I think that it should be mem_cgroup_out_of_memory who establishes the loan and it can have a look at how much memory the killed task frees - e.g. 
some portion of get_mm_rss() or a more precise but much more expensive traversing via private vmas and check whether they charged memory from the target memcg hierarchy (this is a slow path anyway). But who knows maybe a fixed 2MB would work out as well. Thanks! > == > From 1a46318cf89e7df94bd4844f29105b61dacf335b Mon Sep 17 00:00:00 2001 > From: KAMEZAWA Hiroyuki > Date: Fri, 8 Feb 2013 10:43:52 +0900 > Subject: [PATCH] [Don't Apply][PATCH] memcg relax resource at OOM situation. > > When an OOM happens, a task is killed and resources will be freed. > > A problem here is that a task, which is oom-killed, may wait for > some other resource in which memory resource is required. Some thread > waits for free memory may holds some mutex and oom-killed process > wait for the mutex. > > To avoid this, relaxing charged memory by giving virtual resource > can be a help. The system can get back it at uncharge(). > This is a sample native implementation. > > Signed-off-by: KAMEZAWA Hiroyuki > --- > mm/memcontrol.c | 79 ++++++++++++++++++++++++++++++++++++++++++++++++++----- > 1 file changed, 73 insertions(+), 6 deletions(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 25ac5f4..4dea49a 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -301,6 +301,9 @@ struct mem_cgroup { > /* set when res.limit == memsw.limit */ > bool memsw_is_minimum; > + /* extra resource at emergency situation */ > + unsigned long loan; > + spinlock_t loan_lock; > /* protect arrays of thresholds */ > struct mutex thresholds_lock; > @@ -2034,6 +2037,61 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg, > mem_cgroup_iter_break(root_memcg, victim); > return total; > } > +/* > + * When a memcg is in OOM situation, this lack of resource may cause deadlock > + * because of complicated lock dependency(i_mutex...). To avoid that, we > + * need extra resource or avoid charging. > + * > + * A memcg can request resource in an emergency state. We call it as loan. 
> + * A memcg will return a loan when it does uncharge resource. We disallow > + * double-loan and moving task to other groups until the loan is fully > + * returned. > + * > + * Note: the problem here is that we cannot know what amount resouce should > + * be necessary to exiting an emergency state..... > + */ > +#define LOAN_MAX (2 * 1024 * 1024) > + > +static void mem_cgroup_make_loan(struct mem_cgroup *memcg) > +{ > + u64 usage; > + unsigned long amount; > + > + amount = LOAN_MAX; > + > + usage = res_counter_read_u64(&memcg->res, RES_USAGE); > + if (amount > usage /2 ) > + amount = usage / 2; > + spin_lock(&memcg->loan_lock); > + if (memcg->loan) { > + spin_unlock(&memcg->loan_lock); > + return; > + } > + memcg->loan = amount; > + res_counter_uncharge(&memcg->res, amount); > + if (do_swap_account) > + res_counter_uncharge(&memcg->memsw, amount); > + spin_unlock(&memcg->loan_lock); > +} > + > +/* return amount of free resource which can be uncharged */ > +static unsigned long > +mem_cgroup_may_return_loan(struct mem_cgroup *memcg, unsigned long val) > +{ > + unsigned long tmp; > + /* we don't care small race here */ > + if (unlikely(!memcg->loan)) > + return val; > + spin_lock(&memcg->loan_lock); > + if (memcg->loan) { > + tmp = min(memcg->loan, val); > + memcg->loan -= tmp; > + val -= tmp; > + } > + spin_unlock(&memcg->loan_lock); > + return val; > +} > + > /* > * Check OOM-Killer is already running under our hierarchy. 
> @@ -2182,6 +2240,7 @@ static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask, > if (need_to_kill) { > finish_wait(&memcg_oom_waitq, &owait.wait); > mem_cgroup_out_of_memory(memcg, mask, order); > + mem_cgroup_make_loan(memcg); > } else { > schedule(); > finish_wait(&memcg_oom_waitq, &owait.wait); > @@ -2748,6 +2807,8 @@ static void __mem_cgroup_cancel_charge(struct mem_cgroup *memcg, > if (!mem_cgroup_is_root(memcg)) { > unsigned long bytes = nr_pages * PAGE_SIZE; > + bytes = mem_cgroup_may_return_loan(memcg, bytes); > + > res_counter_uncharge(&memcg->res, bytes); > if (do_swap_account) > res_counter_uncharge(&memcg->memsw, bytes); > @@ -3989,6 +4050,7 @@ static void mem_cgroup_do_uncharge(struct mem_cgroup *memcg, > { > struct memcg_batch_info *batch = NULL; > bool uncharge_memsw = true; > + unsigned long val; > /* If swapout, usage of swap doesn't decrease */ > if (!do_swap_account || ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT) > @@ -4029,9 +4091,11 @@ static void mem_cgroup_do_uncharge(struct mem_cgroup *memcg, > batch->memsw_nr_pages++; > return; > direct_uncharge: > - res_counter_uncharge(&memcg->res, nr_pages * PAGE_SIZE); > + val = nr_pages * PAGE_SIZE; > + val = mem_cgroup_may_return_loan(memcg, val); > + res_counter_uncharge(&memcg->res, val); > if (uncharge_memsw) > - res_counter_uncharge(&memcg->memsw, nr_pages * PAGE_SIZE); > + res_counter_uncharge(&memcg->memsw, val); > if (unlikely(batch->memcg != memcg)) > memcg_oom_recover(memcg); > } > @@ -4182,6 +4246,7 @@ void mem_cgroup_uncharge_start(void) > void mem_cgroup_uncharge_end(void) > { > struct memcg_batch_info *batch = &current->memcg_batch; > + unsigned long val; > if (!batch->do_batch) > return; > @@ -4192,16 +4257,16 @@ void mem_cgroup_uncharge_end(void) > if (!batch->memcg) > return; > + val = batch->nr_pages * PAGE_SIZE; > + val = mem_cgroup_may_return_loan(batch->memcg, val); > /* > * This "batch->memcg" is valid without any css_get/put etc... > * bacause we hide charges behind us.
> */ > if (batch->nr_pages) > - res_counter_uncharge(&batch->memcg->res, > - batch->nr_pages * PAGE_SIZE); > + res_counter_uncharge(&batch->memcg->res, val); > if (batch->memsw_nr_pages) > - res_counter_uncharge(&batch->memcg->memsw, > - batch->memsw_nr_pages * PAGE_SIZE); > + res_counter_uncharge(&batch->memcg->memsw, val); > memcg_oom_recover(batch->memcg); > /* forget this pointer (for sanity check) */ > batch->memcg = NULL; > @@ -6291,6 +6356,8 @@ mem_cgroup_css_alloc(struct cgroup *cont) > memcg->move_charge_at_immigrate = 0; > mutex_init(&memcg->thresholds_lock); > spin_lock_init(&memcg->move_lock); > + memcg->loan = 0; > + spin_lock_init(&memcg->loan_lock); > return &memcg->css; > -- > 1.7.10.2 > > > > > > > > -- > To unsubscribe from this list: send the line "unsubscribe cgroups" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx127.postini.com [74.125.245.127]) by kanga.kvack.org (Postfix) with SMTP id BB3D06B0005 for ; Fri, 8 Feb 2013 11:29:20 -0500 (EST) Date: Fri, 8 Feb 2013 17:29:18 +0100 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20130208162918.GF7557@dhcp22.suse.cz> References: <20130125160723.FAE73567@pobox.sk> <20130125163130.GF4721@dhcp22.suse.cz> <20130205134937.GA22804@dhcp22.suse.cz> <20130205154947.CD6411E2@pobox.sk> <20130205160934.GB22804@dhcp22.suse.cz> <20130205174651.GA3959@dhcp22.suse.cz> <20130205185953.GB3959@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Greg Thelen Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Thu 07-02-13 20:27:00, Greg Thelen wrote: > On Tue, Feb 05 2013, Michal Hocko wrote: > > > On Tue 05-02-13 10:09:57, Greg Thelen wrote: > >> On Tue, Feb 05 2013, Michal Hocko wrote: > >> > >> > On Tue 05-02-13 08:48:23, Greg Thelen wrote: > >> >> On Tue, Feb 05 2013, Michal Hocko wrote: > >> >> > >> >> > On Tue 05-02-13 15:49:47, azurIt wrote: > >> >> > [...] > >> >> >> Just to be sure - am i supposed to apply this two patches? > >> >> >> http://watchdog.sk/lkml/patches/ > >> >> > > >> >> > 5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I > >> >> > mentioned in a follow up email. 
Here is the full patch: > >> >> > --- > >> >> > From f2bf8437d5b9bb38a95a432bf39f32c584955171 Mon Sep 17 00:00:00 2001 > >> >> > From: Michal Hocko > >> >> > Date: Mon, 26 Nov 2012 11:47:57 +0100 > >> >> > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked > >> >> > > >> >> > memcg oom killer might deadlock if the process which falls down to > >> >> > mem_cgroup_handle_oom holds a lock which prevents other task to > >> >> > terminate because it is blocked on the very same lock. > >> >> > This can happen when a write system call needs to allocate a page but > >> >> > the allocation hits the memcg hard limit and there is nothing to reclaim > >> >> > (e.g. there is no swap or swap limit is hit as well and all cache pages > >> >> > have been reclaimed already) and the process selected by memcg OOM > >> >> > killer is blocked on i_mutex on the same inode (e.g. truncate it). > >> >> > > >> >> > Process A > >> >> > [] do_truncate+0x58/0xa0 # takes i_mutex > >> >> > [] do_last+0x250/0xa30 > >> >> > [] path_openat+0xd7/0x440 > >> >> > [] do_filp_open+0x49/0xa0 > >> >> > [] do_sys_open+0x106/0x240 > >> >> > [] sys_open+0x20/0x30 > >> >> > [] system_call_fastpath+0x18/0x1d > >> >> > [] 0xffffffffffffffff > >> >> > > >> >> > Process B > >> >> > [] mem_cgroup_handle_oom+0x241/0x3b0 > >> >> > [] T.1146+0x5ab/0x5c0 > >> >> > [] mem_cgroup_cache_charge+0xbe/0xe0 > >> >> > [] add_to_page_cache_locked+0x4c/0x140 > >> >> > [] add_to_page_cache_lru+0x22/0x50 > >> >> > [] grab_cache_page_write_begin+0x8b/0xe0 > >> >> > [] ext3_write_begin+0x88/0x270 > >> >> > [] generic_file_buffered_write+0x116/0x290 > >> >> > [] __generic_file_aio_write+0x27c/0x480 > >> >> > [] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex > >> >> > [] do_sync_write+0xea/0x130 > >> >> > [] vfs_write+0xf3/0x1f0 > >> >> > [] sys_write+0x51/0x90 > >> >> > [] system_call_fastpath+0x18/0x1d > >> >> > [] 0xffffffffffffffff > >> >> > >> >> It looks like grab_cache_page_write_begin() passes 
__GFP_FS into > >> >> __page_cache_alloc() and mem_cgroup_cache_charge(). Which makes me > >> >> think that this deadlock is also possible in the page allocator even > >> >> before getting to add_to_page_cache_lru. no? > >> > > >> > I am not that familiar with VFS but i_mutex is a high level lock AFAIR > >> > and it shouldn't be called from the pageout path so __page_cache_alloc > >> > should be safe. > >> > >> I wasn't clear, sorry. My concern is not that pageout() grabs i_mutex. > >> My concern is that __page_cache_alloc() will invoke the oom killer and > >> select a victim which wants i_mutex. This victim will deadlock because > >> the oom killer caller already holds i_mutex. > > > > That would be true for the memcg oom because that one is blocking but > > the global oom just puts the allocator into sleep for a while and then > > the allocator should back off eventually (unless this is NOFAIL > > allocation). I would need to look closer whether this is really the case > > - I haven't seen that allocator code path for a while... > > I think the page allocator can loop forever waiting for an oom victim to > terminate even without NOFAIL. Especially if the oom victim wants a > resource exclusively held by the allocating thread (e.g. i_mutex). It > looks like the same deadlock you describe is also possible (though more > rare) without memcg. OK, I have checked the allocator slow path and you are right even GFP_KERNEL will not fail. This can lead to similar deadlocks - e.g. OOM killed task blocked on down_write(mmap_sem) while the page fault handler holding mmap_sem for reading and allocating a new page without any progress. Luckily there are memory reserves where the allocator fall back eventually so the allocation should be able to get some memory and release the lock. There is still a theoretical chance this would block though. This sounds like a corner case though so I wouldn't care about it very much. > If the looping thread is an eligible oom victim (i.e. 
not oom disabled, > not an kernel thread, etc) then the page allocator can return NULL in so > long as NOFAIL is not used. So any allocator which is able to call the > oom killer and is not oom disabled (kernel thread, etc) is already > exposed to the possibility of page allocator failure. So if the page > allocator could detect the deadlock, then it could safely return NULL. > Maybe after looping N times without forward progress the page allocator > should consider failing unless NOFAIL is given. page allocator is quite tricky to touch and the chances of this deadlock are not that big. > if memcg oom kill has been tried a reasonable number of times. Simply > failing the memcg charge with ENOMEM seems easier to support than > exceeding limit (Kame's loan patch). We cannot do that in the page fault path because this would lead to a global oom killer. We would need to either retry the page fault or send KILL to the faulting process. But I do not like this much as this could lead to DoS attacks. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx164.postini.com [74.125.245.164]) by kanga.kvack.org (Postfix) with SMTP id C25666B0005 for ; Fri, 8 Feb 2013 12:10:15 -0500 (EST) Date: Fri, 8 Feb 2013 18:10:12 +0100 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set Message-ID: <20130208171012.GH7557@dhcp22.suse.cz> References: <20130206140119.GD10254@dhcp22.suse.cz> <20130206142219.GF10254@dhcp22.suse.cz> <20130206160051.GG10254@dhcp22.suse.cz> <20130208060304.799F362F@pobox.sk> <20130208094420.GA7557@dhcp22.suse.cz> <20130208120249.FD733220@pobox.sk> <20130208123854.GB7557@dhcp22.suse.cz> <20130208145616.FB78CE24@pobox.sk> <20130208152402.GD7557@dhcp22.suse.cz> <20130208165805.8908B143@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130208165805.8908B143@pobox.sk> Sender: owner-linux-mm@kvack.org List-ID: To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Fri 08-02-13 16:58:05, azurIt wrote: [...] 
> I took the kernel log from yesterday from the same time frame: > > $ grep "killed as a result of limit" kern2.log | sed 's@.*\] @@' | sort | uniq -c | sort -k1 -n > 1 Task in /1252/uid killed as a result of limit of /1252 > 1 Task in /1709/uid killed as a result of limit of /1709 > 2 Task in /1185/uid killed as a result of limit of /1185 > 2 Task in /1388/uid killed as a result of limit of /1388 > 2 Task in /1567/uid killed as a result of limit of /1567 > 2 Task in /1650/uid killed as a result of limit of /1650 > 3 Task in /1527/uid killed as a result of limit of /1527 > 5 Task in /1552/uid killed as a result of limit of /1552 > 1634 Task in /1258/uid killed as a result of limit of /1258 > > As you can see, there were much more OOM in '1258' and no such > problems like this night (well, there were never such problems before > :) ). Well, all the patch does is prevent the deadlock we have seen earlier. Previously the writer would block on the oom wait queue, whereas now it fails with ENOMEM. Caller sees this as a short write which can be retried (it is a question whether userspace can cope with that properly). All other OOMs are preserved. I suspect that all the problems you are seeing now are just side effects of the OOM conditions. > As i said, cgroup 1258 were freezing every few minutes with your > latest patch so there must be something wrong (it usually freezes > about once per day). And it was really freezed (i checked that), the > sypthoms were: I assume you have checked that the killed processes eventually die, right? > - cannot strace any of cgroup processes > - no new processes were started, still the same processes were 'running' > - kernel was unable to resolve this by it's own > - all processes togather were taking 100% CPU > - the whole memory limit was used > (see memcg-bug-4.tar.gz for more info) Well, I do not see anything suspicious during that time period (timestamps translate between Fri Feb 8 02:34:05 and Fri Feb 8 02:36:48).
The kernel log shows a lot of oom during that time. All killed processes die eventually. > Unfortunately i forget to check if killing only few of the processes > will resolve it (i always killed them all yesterday night). Don't > know if is was in deadlock or not but kernel was definitely unable > to resolve the problem. Nothing shows it would be a deadlock so far. It is well possible that the userspace went mad when seeing a lot of processes dying because it doesn't expect it. > And there is still a mystery of two freezed processes which cannot be > killed. > > By the way, i KNOW that so much OOM is not healthy but the client > simply don't want to buy more memory. He knows about the problem of > unsufficient memory limit. Well, then you would see a permanent flood of OOM killing, I am afraid. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx186.postini.com [74.125.245.186]) by kanga.kvack.org (Postfix) with SMTP id 8F8076B0002 for ; Sun, 10 Feb 2013 10:03:18 -0500 (EST) Received: by mail-ee0-f53.google.com with SMTP id e53so2825180eek.40 for ; Sun, 10 Feb 2013 07:03:16 -0800 (PST) Date: Sun, 10 Feb 2013 16:03:13 +0100 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set Message-ID: <20130210150310.GA9504@dhcp22.suse.cz> References: <20130206160051.GG10254@dhcp22.suse.cz> <20130208060304.799F362F@pobox.sk> <20130208094420.GA7557@dhcp22.suse.cz> <20130208120249.FD733220@pobox.sk> <20130208123854.GB7557@dhcp22.suse.cz> <20130208145616.FB78CE24@pobox.sk> <20130208152402.GD7557@dhcp22.suse.cz> <20130208165805.8908B143@pobox.sk> <20130208171012.GH7557@dhcp22.suse.cz> <20130208220243.EDEE0825@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii 
Content-Disposition: inline In-Reply-To: <20130208220243.EDEE0825@pobox.sk> Sender: owner-linux-mm@kvack.org List-ID: To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Fri 08-02-13 22:02:43, azurIt wrote: > > > >I assume you have checked that the killed processes eventually die, > >right? > > > When i killed them by hand, yes, they dissappeard from process list (i > saw it). I don't know if they really died when OOM killed them. > > > >Well, I do not see anything suspicious during that time period > >(timestamps translate between Fri Feb 8 02:34:05 and Fri Feb 8 > >02:36:48). The kernel log shows a lot of oom during that time. All > >killed processes die eventually. > > > No, they didn't died by OOM when cgroup was freezed. Just check PIDs > from memcg-bug-4.tar.gz and try to find them in kernel log. OK, you seem to be right. My initial examination showed that each cgroup under OOM was able to move forward - in other words it was able to send SIGKILL to somebody and we didn't loop on a single task which cannot die for some reason. Now, looking closer, it seems we really have 2 tasks which didn't die after being killed by OOM killer: $ for i in `grep "Memory cgroup out of memory:" kern2.log | sed 's@.*Kill process \([0-9]*\) .*@\1@'`; do find bug -name $i; done | sed 's@.*/@@' | sort | uniq -c 141 18211 141 8102 $ md5sum bug/*/18211/stack | cut -d" " -f1 | uniq -c 141 3b8ce17e82a065a24ee046112033e1e8 So all the stacks are the same: [] ptrace_stop+0x114/0x290 [] ptrace_do_notify+0x88/0xa0 [] ptrace_notify+0x53/0x70 [] syscall_trace_enter+0xf8/0x1c0 [] tracesys+0x71/0xd7 [] 0xffffffffffffffff stuck in the ptrace code.
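The per-pid stack comparison above generalizes to any pid in the snapshot tree. A minimal sketch, assuming the bug/<timestamp>/<pid>/stack layout used by the commands above; the function name stack_variants is hypothetical:

```shell
# Count how many snapshots share each distinct stack for one pid.
# Output: one "count hash" line per distinct stack, most common first.
stack_variants() {
    # $1: snapshot root directory (e.g. bug), $2: pid
    md5sum "$1"/*/"$2"/stack |
        cut -d" " -f1 |      # keep only the hash column
        sort | uniq -c |     # sort first so identical hashes are adjacent
        sort -rn
}
```

A single output line from `stack_variants bug 18211` would mean the task sat in the same place in every snapshot; several lines mean it moved at least once.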
The other task is more interesting: $ md5sum bug/*/8102/stack | cut -d" " -f1 | sort | uniq -c 135 042e893c0e6657ed321ea9045e528f3e 6 dc7e71ce73be2a5c73404b565926e709 All snapshots with 042e893c0e6657ed321ea9045e528f3e are in: [] mem_cgroup_handle_oom+0x241/0x3b0 [] T.1149+0x5f3/0x600 [] mem_cgroup_charge_common+0x6c/0xb0 [] mem_cgroup_newpage_charge+0x45/0x50 [] handle_pte_fault+0x609/0x940 [] handle_mm_fault+0x138/0x260 [] do_page_fault+0x13d/0x460 [] page_fault+0x1f/0x30 [] 0xffffffffffffffff While the others do not show any stack: cat 1360287257/8102/stack [] 0xffffffffffffffff Which is quite interesting because we are talking about snapshots starting at 1360287245 (which maps to 02:34:05) but the kern2.log tells us that this process has been killed much earlier at: Feb 8 01:18:30 server01 kernel: [ 511.139921] Task in /1293/uid killed as a result of limit of /1293 [...] Feb 8 01:18:30 server01 kernel: [ 511.229755] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name Feb 8 01:18:30 server01 kernel: [ 511.230146] [ 8102] 1293 8102 170258 65869 7 0 0 apache2 Feb 8 01:18:30 server01 kernel: [ 511.230339] [ 8113] 1293 8113 163756 59442 5 0 0 apache2 Feb 8 01:18:30 server01 kernel: [ 511.230528] [ 8116] 1293 8116 170094 65675 2 0 0 apache2 Feb 8 01:18:30 server01 kernel: [ 511.230726] [ 8119] 1293 8119 170094 65675 6 0 0 apache2 Feb 8 01:18:30 server01 kernel: [ 511.230924] [ 8123] 1293 8123 169070 64612 7 0 0 apache2 Feb 8 01:18:30 server01 kernel: [ 511.231132] [ 8124] 1293 8124 170094 65675 5 0 0 apache2 Feb 8 01:18:30 server01 kernel: [ 511.231321] [ 8125] 1293 8125 170094 65673 1 0 0 apache2 Feb 8 01:18:30 server01 kernel: [ 511.231516] Memory cgroup out of memory: Kill process 8102 (apache2) score 1000 or sacrifice child This would suggest that the task is hung and cannot be killed but if we have a look at the following OOM in the same group 1293 it was _not_ present in the process list for that group: Feb 8 01:18:33 server01 kernel: [ 514.789550] 
Task in /1293/uid killed as a result of limit of /1293 [...] Feb 8 01:18:33 server01 kernel: [ 514.893198] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name Feb 8 01:18:33 server01 kernel: [ 514.893594] [ 8113] 1293 8113 168212 64036 1 0 0 apache2 Feb 8 01:18:33 server01 kernel: [ 514.893786] [ 8116] 1293 8116 170258 65870 6 0 0 apache2 Feb 8 01:18:33 server01 kernel: [ 514.893976] [ 8119] 1293 8119 170258 65870 7 0 0 apache2 Feb 8 01:18:33 server01 kernel: [ 514.894166] [ 8123] 1293 8123 170158 65824 6 0 0 apache2 Feb 8 01:18:33 server01 kernel: [ 514.894356] [ 8124] 1293 8124 170258 65870 5 0 0 apache2 Feb 8 01:18:33 server01 kernel: [ 514.894547] [ 8125] 1293 8125 170158 65824 1 0 0 apache2 Feb 8 01:18:33 server01 kernel: [ 514.894749] [ 8149] 1293 8149 163989 59647 7 0 0 apache2 Feb 8 01:18:33 server01 kernel: [ 514.894944] Memory cgroup out of memory: Kill process 8113 (apache2) score 1000 or sacrifice child This is all _before_ you started collecting stacks and it also says that 8102 is gone. This all suggests that a) the stack unwinder which displays /proc//stack is somehow confused and it doesn't show the correct stack for this process and b) the two processes cannot terminate due to some issue related to ptrace (stracing) the dying process. The above oom list doesn't include any processes which already released the memory which would explain why you still can see it as a member of the group (when looking into cgroup/tasks file). My guess would be that there is a bug in ptrace which doesn't free a reference to the task so it cannot go away although it has dropped all the resources already. > Why are all PIDs waiting on 'mem_cgroup_handle_oom' and there is no > OOM message in the log? I am not sure what you mean here but there are $ grep "Memory cgroup out of memory:" kern2.collected.log | wc -l 16 OOM killer events during the time you were gathering memcg-bug-4 data.
> Data in memcg-bug-4.tar.gz are only for 2
> minutes but i let it run for about 15-20 minutes, no single process
> killed by OOM.

I can see
$ grep "Memory cgroup out of memory:" kern2.after.log | wc -l
57

killed after 02:38:47 when you stopped gathering data for memcg-bug-4

> I'm 100% sure that OOM was not killing them (maybe it was trying to
> but it didn't happen).

OK, let's do a little exercise. The processes eligible for OOM are
listed before any task is killed. So if we collect both the pid list
entries and the "Kill process" messages per pid, then no pid list entry
should appear after the message that kills that pid.

$ mkdir out
$ for i in `grep "Memory cgroup out of memory: Kill process" kern2.log | sed 's@.*Kill process \([0-9]*\) .*@\1@'`
do
	grep -e "Memory cgroup out of memory: Kill process $i" \
		-e "\[ *\<$i\]" kern2.log > out/$i
done
$ for i in out/*
do
	tail -n1 $i | grep "Memory cgroup out of memory:" >/dev/null|| echo "$i has already killed tasks"
done
out/6698 has already killed tasks
out/6703 has already killed tasks

OK, so there are two pids which were listed after they have been
killed. Let's have a look at them.
$ cat out/6698 Feb 8 01:17:04 server01 kernel: [ 425.497924] [ 6698] 1293 6698 170258 65846 1 0 0 apache2 Feb 8 01:17:05 server01 kernel: [ 426.079010] [ 6698] 1293 6698 170258 65846 1 0 0 apache2 Feb 8 01:17:10 server01 kernel: [ 431.144460] [ 6698] 1293 6698 169358 65220 1 0 0 apache2 Feb 8 01:17:10 server01 kernel: [ 431.146058] Memory cgroup out of memory: Kill process 6698 (apache2) score 1000 or sacrifice child Feb 8 03:27:57 server01 kernel: [ 8278.439896] [ 6698] 1020 6698 168518 64219 0 0 0 apache2 Feb 8 03:27:57 server01 kernel: [ 8278.879439] [ 6698] 1020 6698 168518 64218 6 0 0 apache2 Feb 8 03:27:59 server01 kernel: [ 8280.023944] [ 6698] 1020 6698 168816 64540 7 0 0 apache2 Feb 8 03:28:02 server01 kernel: [ 8283.242282] [ 6698] 1020 6698 171953 67751 6 0 0 apache2 $ cat out/6703 Feb 8 01:17:04 server01 kernel: [ 425.498118] [ 6703] 1293 6703 170258 65844 6 0 0 apache2 Feb 8 01:17:05 server01 kernel: [ 426.079206] [ 6703] 1293 6703 170258 65844 6 0 0 apache2 Feb 8 01:17:10 server01 kernel: [ 431.144653] [ 6703] 1293 6703 169358 65219 2 0 0 apache2 Feb 8 01:17:10 server01 kernel: [ 431.258924] [ 6703] 1293 6703 169358 65219 5 0 0 apache2 Feb 8 01:17:10 server01 kernel: [ 431.260282] Memory cgroup out of memory: Kill process 6703 (apache2) score 1000 or sacrifice child Feb 8 03:27:57 server01 kernel: [ 8278.440043] [ 6703] 1020 6703 166286 61978 7 0 0 apache2 Feb 8 03:27:57 server01 kernel: [ 8278.879587] [ 6703] 1020 6703 166286 61977 7 0 0 apache2 Feb 8 03:27:59 server01 kernel: [ 8280.024091] [ 6703] 1020 6703 166484 62233 7 0 0 apache2 Feb 8 03:28:02 server01 kernel: [ 8283.242429] [ 6703] 1020 6703 167402 63118 0 0 0 apache2 Lists have the following columns: [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name As we can see the uid changed for both pids after it has been killed (from 1293 to 1020) which suggests that the pid has been reused later for a different user (which is a clear sign that those pids died) - thus different group in your 
setup. So those two died as well, apparently. > >Nothing shows it would be a deadlock so far. It is well possible that > >the userspace went mad when seeing a lot of processes dying because it > >doesn't expect it. > > Lots of processes are dying also now, without your latest patch, and > no such things are happening. I'm sure there is something more it > this, maybe it revealed another bug? So far nothing shows that there would be anything broken wrt. memcg OOM killer. The ptrace issue sounds strange, all right, but that is another story and worth a separate investigation. I would be interested whether you still see anything wrong going on without that in game. You can get pretty nice overview of what is going on wrt. OOM from the log. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx144.postini.com [74.125.245.144]) by kanga.kvack.org (Postfix) with SMTP id E982E6B0002 for ; Sun, 10 Feb 2013 11:46:21 -0500 (EST) Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_if_PF=5FNO=5FMEMCG=5FOOM_is_set?= Date: Sun, 10 Feb 2013 17:46:19 +0100 From: "azurIt" References: <20130206160051.GG10254@dhcp22.suse.cz>, <20130208060304.799F362F@pobox.sk>, <20130208094420.GA7557@dhcp22.suse.cz>, <20130208120249.FD733220@pobox.sk>, <20130208123854.GB7557@dhcp22.suse.cz>, <20130208145616.FB78CE24@pobox.sk>, <20130208152402.GD7557@dhcp22.suse.cz>, <20130208165805.8908B143@pobox.sk>, <20130208171012.GH7557@dhcp22.suse.cz>, <20130208220243.EDEE0825@pobox.sk> <20130210150310.GA9504@dhcp22.suse.cz> In-Reply-To: <20130210150310.GA9504@dhcp22.suse.cz> MIME-Version: 1.0 Message-Id: <20130210174619.24F20488@pobox.sk> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org 
List-ID: To: =?utf-8?q?Michal_Hocko?= Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= >stuck in the ptrace code. But this happens _after_ the cgroup was freezed and i tried to strace one of it's processes (to see what's happening): Feb 8 01:29:46 server01 kernel: [ 1187.540672] grsec: From 178.40.250.111: process /usr/lib/apache2/mpm-itk/apache2(apache2:18211) attached to via ptrace by /usr/bin/strace[strace:18258] uid/euid:0/0 gid/egid:0/0, parent /usr/bin/htop[htop:2901] uid/euid:0/0 gid/egid:0/0 >> Why are all PIDs waiting on 'mem_cgroup_handle_oom' and there is no >> OOM message in the log? > >I am not sure what you mean here but there are >$ grep "Memory cgroup out of memory:" kern2.collected.log | wc -l >16 > >OOM killer events during the time you were gathering memcg-bug-4 data. > >> Data in memcg-bug-4.tar.gz are only for 2 >> minutes but i let it run for about 15-20 minutes, no single process >> killed by OOM. > >I can see >$ grep "Memory cgroup out of memory:" kern2.after.log | wc -l >57 > >killed after 02:38:47 when you stopped gathering data for memcg-bug-4 I meant no single process was killed inside cgroup 1258 (data from this cgroup are in memcg-bug-4.tar.gz). Just get data from memcg-bug-4.tar.gz which were taken from cgroup 1258. Almost all processes are in 'mem_cgroup_handle_oom' so cgroup is under OOM. I assume that this is suppose to take only few seconds while kernel finds any process and kill it (and maybe do it again until enough of memory is freed). I was gathering the data for about 2 and a half minutes and NO SINGLE process was killed (just compate list of PIDs from the first and the last directory inside memcg-bug-4.tar.gz). Even more, no single process was killed in cgroup 1258 also after i stopped gathering the data. 
You can also take the list od PID from memcg-bug-4.tar.gz and you will find only 18211 and 8102 (which are the two stucked processes). So my question is: Why no process was killed inside cgroup 1258 while it was under OOM? It was under OOM for at least 2 and a half of minutes while i was gathering the data (then i let it run for additional, cca, 10 minutes and then killed processes by hand but i cannot proof this). Why kernel didn't kill any process for so long and ends the OOM? Btw, processes in cgroup 1258 (memcg-bug-4.tar.gz) are looping in this two tasks (i pasted only first line of stack): mem_cgroup_handle_oom+0x241/0x3b0 0xffffffffffffffff Some of them are in 'poll_schedule_timeout' and then they start to loop as above. Is this correct behavior? For example, do (first line of stack from process 7710 from all timestamps): for i in */7710/stack; do head -n1 $i; done -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx157.postini.com [74.125.245.157]) by kanga.kvack.org (Postfix) with SMTP id A6C366B0002 for ; Mon, 11 Feb 2013 06:22:43 -0500 (EST) Date: Mon, 11 Feb 2013 12:22:40 +0100 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set Message-ID: <20130211112240.GC19922@dhcp22.suse.cz> References: <20130208094420.GA7557@dhcp22.suse.cz> <20130208120249.FD733220@pobox.sk> <20130208123854.GB7557@dhcp22.suse.cz> <20130208145616.FB78CE24@pobox.sk> <20130208152402.GD7557@dhcp22.suse.cz> <20130208165805.8908B143@pobox.sk> <20130208171012.GH7557@dhcp22.suse.cz> <20130208220243.EDEE0825@pobox.sk> <20130210150310.GA9504@dhcp22.suse.cz> <20130210174619.24F20488@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130210174619.24F20488@pobox.sk> Sender: owner-linux-mm@kvack.org List-ID: To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Sun 10-02-13 17:46:19, azurIt wrote: > >stuck in the ptrace code. 
>
>
> But this happens _after_ the cgroup was freezed and i tried to strace
> one of it's processes (to see what's happening):
>
> Feb 8 01:29:46 server01 kernel: [ 1187.540672] grsec: From 178.40.250.111: process /usr/lib/apache2/mpm-itk/apache2(apache2:18211) attached to via ptrace by /usr/bin/strace[strace:18258] uid/euid:0/0 gid/egid:0/0, parent /usr/bin/htop[htop:2901] uid/euid:0/0 gid/egid:0/0

Hmmm,
Feb 8 01:39:16 server01 kernel: [ 1757.266678] Memory cgroup out of memory: Kill process 18211 (apache2) score 725 or sacrifice child

So the process was selected by the OOM killer about 10 minutes after
strace attached to it, and this was really the last OOM event for group
/1258:
$ grep "Task in /1258/uid killed" kern2.log | tail -n2
Feb 8 01:39:16 server01 kernel: [ 1757.045021] Task in /1258/uid killed as a result of limit of /1258
Feb 8 01:39:16 server01 kernel: [ 1757.167984] Task in /1258/uid killed as a result of limit of /1258

But this was still before you started collecting data for memcg-bug-4
(2:34) so unfortunately we do not know what the previous stack was.

> >> Why are all PIDs waiting on 'mem_cgroup_handle_oom' and there is no
> >> OOM message in the log?
> >
> >I am not sure what you mean here but there are
> >$ grep "Memory cgroup out of memory:" kern2.collected.log | wc -l
> >16
> >
> >OOM killer events during the time you were gathering memcg-bug-4 data.
> >
> >> Data in memcg-bug-4.tar.gz are only for 2
> >> minutes but i let it run for about 15-20 minutes, no single process
> >> killed by OOM.
> >
> >I can see
> >$ grep "Memory cgroup out of memory:" kern2.after.log | wc -l
> >57
> >
> >killed after 02:38:47 when you stopped gathering data for memcg-bug-4
>
>
> I meant no single process was killed inside cgroup 1258 (data from
> this cgroup are in memcg-bug-4.tar.gz).
>
> Just get data from memcg-bug-4.tar.gz which were taken from cgroup
> 1258.

Are you sure about that?
When I extracted all pids from timestamp directories and greped them in the log I got this: for i in `cat bug/pids` ; do grep "\[ *\<$i\>\]" kern2.log ; done Feb 8 01:31:02 server01 kernel: [ 1263.429212] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:31:15 server01 kernel: [ 1276.655241] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:32:29 server01 kernel: [ 1350.797835] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:32:42 server01 kernel: [ 1363.662242] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:32:46 server01 kernel: [ 1367.181798] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:32:46 server01 kernel: [ 1367.381627] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:32:46 server01 kernel: [ 1367.490896] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:33:02 server01 kernel: [ 1383.709652] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:36:26 server01 kernel: [ 1587.458967] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:36:26 server01 kernel: [ 1587.558419] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:36:26 server01 kernel: [ 1587.652474] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:39:02 server01 kernel: [ 1743.107086] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:39:16 server01 kernel: [ 1757.015359] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:39:16 server01 kernel: [ 1757.133998] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:39:16 server01 kernel: [ 1757.262992] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:18:12 server01 kernel: [ 493.156641] [ 7888] 1293 7888 169326 64876 3 0 0 apache2 Feb 8 01:18:12 server01 kernel: [ 493.269129] [ 7888] 1293 7888 169390 64876 4 0 0 apache2 Feb 8 01:18:21 server01 kernel: [ 502.384221] [ 8011] 1293 8011 170094 65675 5 0 0 apache2 Feb 8 01:18:24 server01 kernel: [ 505.052600] [ 8011] 1293 8011 170260 65854 2 0 0 apache2 Feb 8 01:18:24 server01 kernel: [ 505.200454] [ 8011] 1293 8011 170260 65854 2 0 0 
apache2 Feb 8 01:18:33 server01 kernel: [ 514.538637] [ 8054] 1258 8054 164404 60618 1 0 0 apache2 Feb 8 01:18:30 server01 kernel: [ 511.230146] [ 8102] 1293 8102 170258 65869 7 0 0 apache2 So at least 7888, 8011 and 8102 were from a different group (1293). Others were never listed in the eligible processes list which is a bit unexpected. It is also unfortunate because I cannot match them to their groups from the log. $ for i in `cat bug/pids` ; do grep "\[ *\<$i\>\]" kern2.log >/dev/null || echo "$i not listed" ; done 7265 not listed 7474 not listed 7710 not listed 7969 not listed 7988 not listed 7997 not listed 8000 not listed 8014 not listed 8016 not listed 8019 not listed 8057 not listed 8058 not listed 8059 not listed 8063 not listed 8064 not listed 8066 not listed 8067 not listed 8069 not listed 8070 not listed 8071 not listed 8072 not listed 8075 not listed 8091 not listed 8092 not listed 8094 not listed 8098 not listed 8099 not listed 8100 not listed Are you sure all of them belong to 1258 group? > Almost all processes are in 'mem_cgroup_handle_oom' so cgroup > is under OOM. You are right, almost all of them are waiting in mem_cgroup_handle_oom which suggest that they should be listed in a per group eligible tasks list. One way how this might happen is when a process which manages to get oom_lock has a fatal signal pending. Then we wouldn't get to oom_kill_process and no OOM messages would get printed. This is correct because such a task would terminate soon anyway and all the waiters would wake up eventually. If not enough memory would be freed another task would get the oom_lock and this one would trigger OOM (unless it has fatal signal pending as well). Another option would be that no task could be selected - e.g. because select_bad_process sees TIF_MEMDIE marked task - the one already killed by OOM killer but that wasn't able to terminate for some reason. 18211 could be such a task. 
But we do not know what was going on with it before strace attached to it. Finally it is possible that the OOM header (everything up to Kill process) was suppressed because of rate limiting. But $ grep -B1 "Kill process" kern2.log Feb 8 01:15:02 server01 kernel: [ 304.000402] [ 4969] 1258 4969 163761 59554 6 0 0 apache2 Feb 8 01:15:02 server01 kernel: [ 304.000649] Memory cgroup out of memory: Kill process 4816 (apache2) score 1000 or sacrifice child -- Feb 8 01:15:51 server01 kernel: [ 352.924573] [ 5847] 1709 5847 163433 58952 6 0 0 apache2 Feb 8 01:15:51 server01 kernel: [ 352.924761] Memory cgroup out of memory: Kill process 5212 (apache2) score 1000 or sacrifice child [...] says that the message was preceded by a process list so we can exclude rate limiting. > I assume that this is suppose to take only few seconds > while kernel finds any process and kill it (and maybe do it again > until enough of memory is freed). I was gathering the data for > about 2 and a half minutes and NO SINGLE process was killed (just > compate list of PIDs from the first and the last directory inside > memcg-bug-4.tar.gz). Even more, no single process was killed in cgroup > 1258 also after i stopped gathering the data. You can also take the > list od PID from memcg-bug-4.tar.gz and you will find only 18211 and > 8102 (which are the two stucked processes). > > So my question is: Why no process was killed inside cgroup 1258 > while it was under OOM? I would bet that there is something weird going on with pid:18211. But I do not have enough information to find out what and why. > It was under OOM for at least 2 and a half of minutes while i was > gathering the data (then i let it run for additional, cca, 10 minutes > and then killed processes by hand but i cannot proof this). Why kernel > didn't kill any process for so long and ends the OOM? As already mentioned above, select_bad_process doesn't select any task if there is one which is on the way out. 
Maybe this is what is going on here.

> Btw, processes in cgroup 1258 (memcg-bug-4.tar.gz) are looping in this
> two tasks (i pasted only first line of stack):
> mem_cgroup_handle_oom+0x241/0x3b0
> 0xffffffffffffffff

0xffffffffffffffff is just a bogus entry. No idea why this happens.

> Some of them are in 'poll_schedule_timeout' and then they start to
> loop as above. Is this correct behavior?
> For example, do (first line of stack from process 7710 from all
> timestamps): for i in */7710/stack; do head -n1 $i; done

Yes, this is perfectly ok, because that task starts with:
$ cat bug/1360287245/7710/stack
[] poll_schedule_timeout+0x49/0x70
[] do_sys_poll+0x54b/0x680
[] sys_poll+0x7c/0xf0
[] system_call_fastpath+0x18/0x1d
[] 0xffffffffffffffff

and then later on it gets into OOM because of a page fault:
$ cat bug/1360287250/7710/stack
[] mem_cgroup_handle_oom+0x241/0x3b0
[] T.1149+0x5f3/0x600
[] mem_cgroup_charge_common+0x6c/0xb0
[] mem_cgroup_newpage_charge+0x45/0x50
[] do_wp_page+0x14e/0x800
[] handle_pte_fault+0x264/0x940
[] handle_mm_fault+0x138/0x260
[] do_page_fault+0x13d/0x460
[] page_fault+0x1f/0x30
[] 0xffffffffffffffff

And it loops in it until the end, which is possible as well if the
group is under a permanent OOM condition and the task is not selected
to be killed.

Unfortunately I am not able to reproduce this behavior even if I try
to hammer OOM like mad so I am afraid I cannot help you much without
further debugging patches.
I do realize that experimenting in your environment is a problem but I
do not have many options left. Please do not use strace and rather
collect /proc/pid/stack instead.
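Collecting the stacks without strace could look roughly like the sketch below. It is only an illustration under stated assumptions: the function name snapshot_stacks is made up, the output layout simply mirrors the <timestamp>/<pid>/stack directories seen in the memcg-bug-4 data, and reading /proc/<pid>/stack normally requires root. The sketch also stores a copy of the group's tasks file next to the stacks.

```shell
# snapshot_stacks TASKS_FILE OUTDIR
# Hypothetical helper: saves one snapshot of the kernel stacks of all tasks
# listed in TASKS_FILE (e.g. the "tasks" file of a memory cgroup) into
# OUTDIR/<unix timestamp>/<pid>/stack and prints the snapshot directory.
snapshot_stacks() {
    snap_tasks=$1
    snap_dir=$2/$(date +%s)
    mkdir -p "$snap_dir" || return 1
    # Keep the task list itself so pids can be matched to the group later.
    cat "$snap_tasks" > "$snap_dir/tasks" || return 1
    while read -r snap_pid; do
        [ -n "$snap_pid" ] || continue
        mkdir -p "$snap_dir/$snap_pid"
        # A task may exit before its stack is read (or the stack may not be
        # readable without root); an empty stack file is left in that case.
        cat "/proc/$snap_pid/stack" > "$snap_dir/$snap_pid/stack" 2>/dev/null || true
    done < "$snap_dir/tasks"
    echo "$snap_dir"
}
```

A loop such as `while sleep 5; do snapshot_stacks /path/to/cgroup/1258/tasks bug; done` would then take a snapshot every few seconds; the cgroup path here is a placeholder and depends on where the memory controller is mounted.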
It would be also helpful to get group/tasks file to have a full list of tasks in the group --- From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx149.postini.com [74.125.245.149]) by kanga.kvack.org (Postfix) with SMTP id 8AC5F6B0002 for ; Fri, 22 Feb 2013 03:23:34 -0500 (EST) Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_if_PF=5FNO=5FMEMCG=5FOOM_is_set?= Date: Fri, 22 Feb 2013 09:23:32 +0100 From: "azurIt" References: <20130208094420.GA7557@dhcp22.suse.cz>, <20130208120249.FD733220@pobox.sk>, <20130208123854.GB7557@dhcp22.suse.cz>, <20130208145616.FB78CE24@pobox.sk>, <20130208152402.GD7557@dhcp22.suse.cz>, <20130208165805.8908B143@pobox.sk>, <20130208171012.GH7557@dhcp22.suse.cz>, <20130208220243.EDEE0825@pobox.sk>, <20130210150310.GA9504@dhcp22.suse.cz>, <20130210174619.24F20488@pobox.sk> <20130211112240.GC19922@dhcp22.suse.cz> In-Reply-To: <20130211112240.GC19922@dhcp22.suse.cz> MIME-Version: 1.0 Message-Id: <20130222092332.4001E4B6@pobox.sk> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org List-ID: To: =?utf-8?q?Michal_Hocko?= Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= >Unfortunately I am not able to reproduce this behavior even if I try >to hammer OOM like mad so I am afraid I cannot help you much without >further debugging patches. >I do realize that experimenting in your environment is a problem but I >do not many options left. Please do not use strace and rather collect >/proc/pid/stack instead. It would be also helpful to get group/tasks >file to have a full list of tasks in the group Hi Michal, sorry that i didn't response for a while. Today i installed kernel with your two patches and i'm running it now. I'm still having problems with OOM which is not able to handle low memory and is not killing processes. 
Here is some info: - data from cgroup 1258 while it was under OOM and no processes were killed (so OOM don't stop and cgroup was freezed) http://watchdog.sk/lkml/memcg-bug-6.tar.gz I noticed problem about on 8:39 and waited until 8:57 (nothing happend). Then i killed process 19864 which seems to help and other processes probably ends and cgroup started to work. But problem accoured again about 20 seconds later, so i killed all processes at 8:58. The problem is occuring all the time since then. All processes (in that cgroup) are always in state 'D' when it occurs. - kernel log from boot until now http://watchdog.sk/lkml/kern3.gz Btw, something probably happened also at about 3:09 but i wasn't able to gather any data because my 'load check script' killed all apache processes (load was more than 100). azur -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx103.postini.com [74.125.245.103]) by kanga.kvack.org (Postfix) with SMTP id 0244F6B0002 for ; Fri, 22 Feb 2013 07:52:21 -0500 (EST) Date: Fri, 22 Feb 2013 13:52:17 +0100 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set Message-ID: <20130222125217.GA32285@dhcp22.suse.cz> References: <20130208123854.GB7557@dhcp22.suse.cz> <20130208145616.FB78CE24@pobox.sk> <20130208152402.GD7557@dhcp22.suse.cz> <20130208165805.8908B143@pobox.sk> <20130208171012.GH7557@dhcp22.suse.cz> <20130208220243.EDEE0825@pobox.sk> <20130210150310.GA9504@dhcp22.suse.cz> <20130210174619.24F20488@pobox.sk> <20130211112240.GC19922@dhcp22.suse.cz> <20130222092332.4001E4B6@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130222092332.4001E4B6@pobox.sk> Sender: owner-linux-mm@kvack.org List-ID: To: azurIt Cc: 
linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner Hi, On Fri 22-02-13 09:23:32, azurIt wrote: [...] > sorry that i didn't response for a while. Today i installed kernel > with your two patches and i'm running it now. I am not sure how much time I'll have for this today but just to make sure we are on the same page, could you point me to the two patches you have applied in the mean time? -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx151.postini.com [74.125.245.151]) by kanga.kvack.org (Postfix) with SMTP id 394DF6B0002 for ; Fri, 22 Feb 2013 08:00:20 -0500 (EST) Date: Fri, 22 Feb 2013 14:00:17 +0100 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set Message-ID: <20130222130017.GB32285@dhcp22.suse.cz> References: <20130222125217.GA32285@dhcp22.suse.cz> <20130222135442.ADFFF498@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130222135442.ADFFF498@pobox.sk> Sender: owner-linux-mm@kvack.org List-ID: To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Fri 22-02-13 13:54:42, azurIt wrote: > >I am not sure how much time I'll have for this today but just to make > >sure we are on the same page, could you point me to the two patches you > >have applied in the mean time? > > > Here: > http://watchdog.sk/lkml/patches2 OK, looks correct. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx151.postini.com [74.125.245.151]) by kanga.kvack.org (Postfix) with SMTP id 29C8D6B0032 for ; Thu, 6 Jun 2013 12:04:51 -0400 (EDT) Date: Thu, 6 Jun 2013 18:04:46 +0200 From: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set Message-ID: <20130606160446.GE24115@dhcp22.suse.cz> References: <20130208123854.GB7557@dhcp22.suse.cz> <20130208145616.FB78CE24@pobox.sk> <20130208152402.GD7557@dhcp22.suse.cz> <20130208165805.8908B143@pobox.sk> <20130208171012.GH7557@dhcp22.suse.cz> <20130208220243.EDEE0825@pobox.sk> <20130210150310.GA9504@dhcp22.suse.cz> <20130210174619.24F20488@pobox.sk> <20130211112240.GC19922@dhcp22.suse.cz> <20130222092332.4001E4B6@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130222092332.4001E4B6@pobox.sk> Sender: owner-linux-mm@kvack.org List-ID: To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner Hi, I am really sorry it took so long but I was constantly preempted by other stuff. I hope I have a good news for you, though. Johannes has found a nice way how to overcome deadlock issues from memcg OOM which might help you. Would you be willing to test with his patch (http://permalink.gmane.org/gmane.linux.kernel.mm/101437). Unlike my patch which handles just the i_mutex case his patch solved all possible locks. I can backport the patch for your kernel (are you still using 3.2 kernel or you have moved to a newer one?). On Fri 22-02-13 09:23:32, azurIt wrote: > >Unfortunately I am not able to reproduce this behavior even if I try > >to hammer OOM like mad so I am afraid I cannot help you much without > >further debugging patches. > >I do realize that experimenting in your environment is a problem but I > >do not many options left. 
Please do not use strace and rather collect > >/proc/pid/stack instead. It would be also helpful to get group/tasks > >file to have a full list of tasks in the group > > > > Hi Michal, > > > sorry that i didn't response for a while. Today i installed kernel with your two patches and i'm running it now. I'm still having problems with OOM which is not able to handle low memory and is not killing processes. Here is some info: > > - data from cgroup 1258 while it was under OOM and no processes were killed (so OOM don't stop and cgroup was freezed) > http://watchdog.sk/lkml/memcg-bug-6.tar.gz > > I noticed problem about on 8:39 and waited until 8:57 (nothing happend). Then i killed process 19864 which seems to help and other processes probably ends and cgroup started to work. But problem accoured again about 20 seconds later, so i killed all processes at 8:58. The problem is occuring all the time since then. All processes (in that cgroup) are always in state 'D' when it occurs. > > > - kernel log from boot until now > http://watchdog.sk/lkml/kern3.gz > > > Btw, something probably happened also at about 3:09 but i wasn't able to gather any data because my 'load check script' killed all apache processes (load was more than 100). > > > > azur > -- > To unsubscribe from this list: send the line "unsubscribe cgroups" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx158.postini.com [74.125.245.158]) by kanga.kvack.org (Postfix) with SMTP id C76D46B0032 for ; Thu, 6 Jun 2013 12:16:35 -0400 (EDT) Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_if_PF=5FNO=5FMEMCG=5FOOM_is_set?= Date: Thu, 06 Jun 2013 18:16:33 +0200 From: "azurIt" References: <20130208123854.GB7557@dhcp22.suse.cz>, <20130208145616.FB78CE24@pobox.sk>, <20130208152402.GD7557@dhcp22.suse.cz>, <20130208165805.8908B143@pobox.sk>, <20130208171012.GH7557@dhcp22.suse.cz>, <20130208220243.EDEE0825@pobox.sk>, <20130210150310.GA9504@dhcp22.suse.cz>, <20130210174619.24F20488@pobox.sk>, <20130211112240.GC19922@dhcp22.suse.cz>, <20130222092332.4001E4B6@pobox.sk> <20130606160446.GE24115@dhcp22.suse.cz> In-Reply-To: <20130606160446.GE24115@dhcp22.suse.cz> MIME-Version: 1.0 Message-Id: <20130606181633.BCC3E02E@pobox.sk> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org List-ID: To: =?utf-8?q?Michal_Hocko?= Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= Hello Michal, nice to read you! :) Yes, i'm still on 3.2. Could you be so kind and try to backport it? Thank you very much! azur ______________________________________________________________ > Od: "Michal Hocko" > Komu: azurIt > DA!tum: 06.06.2013 18:04 > Predmet: Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set > > CC: linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , "Johannes Weiner" >Hi, > >I am really sorry it took so long but I was constantly preempted by >other stuff. I hope I have a good news for you, though. Johannes has >found a nice way how to overcome deadlock issues from memcg OOM which >might help you. 
Would you be willing to test with his patch >(http://permalink.gmane.org/gmane.linux.kernel.mm/101437). Unlike my >patch which handles just the i_mutex case his patch solved all possible >locks. > >I can backport the patch for your kernel (are you still using 3.2 kernel >or you have moved to a newer one?). > >On Fri 22-02-13 09:23:32, azurIt wrote: >> >Unfortunately I am not able to reproduce this behavior even if I try >> >to hammer OOM like mad so I am afraid I cannot help you much without >> >further debugging patches. >> >I do realize that experimenting in your environment is a problem but I >> >do not many options left. Please do not use strace and rather collect >> >/proc/pid/stack instead. It would be also helpful to get group/tasks >> >file to have a full list of tasks in the group >> >> >> >> Hi Michal, >> >> >> sorry that i didn't response for a while. Today i installed kernel with your two patches and i'm running it now. I'm still having problems with OOM which is not able to handle low memory and is not killing processes. Here is some info: >> >> - data from cgroup 1258 while it was under OOM and no processes were killed (so OOM don't stop and cgroup was freezed) >> http://watchdog.sk/lkml/memcg-bug-6.tar.gz >> >> I noticed problem about on 8:39 and waited until 8:57 (nothing happend). Then i killed process 19864 which seems to help and other processes probably ends and cgroup started to work. But problem accoured again about 20 seconds later, so i killed all processes at 8:58. The problem is occuring all the time since then. All processes (in that cgroup) are always in state 'D' when it occurs. >> >> >> - kernel log from boot until now >> http://watchdog.sk/lkml/kern3.gz >> >> >> Btw, something probably happened also at about 3:09 but i wasn't able to gather any data because my 'load check script' killed all apache processes (load was more than 100). 
>> >> >> >> azur >> -- >> To unsubscribe from this list: send the line "unsubscribe cgroups" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html > >-- >Michal Hocko >SUSE Labs > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx199.postini.com [74.125.245.199]) by kanga.kvack.org (Postfix) with SMTP id B7CF46B0032 for ; Fri, 7 Jun 2013 09:12:00 -0400 (EDT) Date: Fri, 7 Jun 2013 15:11:57 +0200 From: Michal Hocko Subject: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM Message-ID: <20130607131157.GF8117@dhcp22.suse.cz> References: <20130208152402.GD7557@dhcp22.suse.cz> <20130208165805.8908B143@pobox.sk> <20130208171012.GH7557@dhcp22.suse.cz> <20130208220243.EDEE0825@pobox.sk> <20130210150310.GA9504@dhcp22.suse.cz> <20130210174619.24F20488@pobox.sk> <20130211112240.GC19922@dhcp22.suse.cz> <20130222092332.4001E4B6@pobox.sk> <20130606160446.GE24115@dhcp22.suse.cz> <20130606181633.BCC3E02E@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130606181633.BCC3E02E@pobox.sk> Sender: owner-linux-mm@kvack.org List-ID: To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner On Thu 06-06-13 18:16:33, azurIt wrote: > Hello Michal, > > nice to read you! :) Yes, i'm still on 3.2. Could you be so kind and > try to backport it? Thank you very much! Here we go. I hope I didn't screw anything (Johannes might double check) because there were quite some changes in the area since 3.2. Nothing earth shattering though. Please note that I have only compile tested this. Also make sure you remove the previous patches you have from me. 
--- From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx149.postini.com [74.125.245.149]) by kanga.kvack.org (Postfix) with SMTP id A512A6B0036 for ; Mon, 24 Jun 2013 16:13:57 -0400 (EDT) Date: Mon, 24 Jun 2013 16:13:45 -0400 From: Johannes Weiner Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM Message-ID: <20130624201345.GA21822@cmpxchg.org> References: <20130210150310.GA9504@dhcp22.suse.cz> <20130210174619.24F20488@pobox.sk> <20130211112240.GC19922@dhcp22.suse.cz> <20130222092332.4001E4B6@pobox.sk> <20130606160446.GE24115@dhcp22.suse.cz> <20130606181633.BCC3E02E@pobox.sk> <20130607131157.GF8117@dhcp22.suse.cz> <20130617122134.2E072BA8@pobox.sk> <20130619132614.GC16457@dhcp22.suse.cz> <20130622220958.D10567A4@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130622220958.D10567A4@pobox.sk> Sender: owner-linux-mm@kvack.org List-ID: To: azurIt Cc: Michal Hocko , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki Hi guys, On Sat, Jun 22, 2013 at 10:09:58PM +0200, azurIt wrote: > >> But i'm sure of one thing - when problem occurs, nothing is able to > >> access hard drives (every process which tries it is freezed until > >> problem is resolved or server is rebooted). > > > >I would be really interesting to see what those tasks are blocked on. > > I'm trying to get it, stay tuned :) > > Today i noticed one bug, not 100% sure it is related to 'your' patch > but i didn't seen this before. I noticed that i have lots of cgroups > which cannot be removed - if i do 'rmdir ', it > just hangs and never complete. Even more, it's not possible to > access the whole cgroup filesystem until i kill that rmdir > (anything, which tries it, just hangs). 
All unremoveable cgroups has > this in 'memory.oom_control': oom_kill_disable 0 under_oom 1 Somebody acquires the OOM wait reference to the memcg and marks it under oom but then does not call into mem_cgroup_oom_synchronize() to clean up. That's why under_oom is set and the rmdir waits for outstanding references. > And, yes, 'tasks' file is empty. It's not a kernel thread that does it because all kernel-context handle_mm_fault() are annotated properly, which means the task must be userspace and, since tasks is empty, have exited before synchronizing. Can you try with the following patch on top?

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 5db0490..9a0b152 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -846,17 +846,6 @@ static noinline int
 mm_fault_error(struct pt_regs *regs, unsigned long error_code,
 	       unsigned long address, unsigned int fault)
 {
-	/*
-	 * Pagefault was interrupted by SIGKILL. We have no reason to
-	 * continue pagefault.
-	 */
-	if (fatal_signal_pending(current)) {
-		if (!(fault & VM_FAULT_RETRY))
-			up_read(&current->mm->mmap_sem);
-		if (!(error_code & PF_USER))
-			no_context(regs, error_code, address);
-		return 1;
-	}
 	if (!(fault & VM_FAULT_ERROR))
 		return 0;

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx129.postini.com [74.125.245.129]) by kanga.kvack.org (Postfix) with SMTP id CA3C06B0032 for ; Fri, 28 Jun 2013 06:06:15 -0400 (EDT) Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=5D_memcg=3A_do_not_trap_chargers_with_full_callstack_on_OOM?= Date: Fri, 28 Jun 2013 12:06:13 +0200 From: "azurIt" References: <20130210150310.GA9504@dhcp22.suse.cz>, <20130210174619.24F20488@pobox.sk>, <20130211112240.GC19922@dhcp22.suse.cz>, <20130222092332.4001E4B6@pobox.sk>, <20130606160446.GE24115@dhcp22.suse.cz>, <20130606181633.BCC3E02E@pobox.sk>, <20130607131157.GF8117@dhcp22.suse.cz>, <20130617122134.2E072BA8@pobox.sk>, <20130619132614.GC16457@dhcp22.suse.cz>, <20130622220958.D10567A4@pobox.sk> <20130624201345.GA21822@cmpxchg.org> In-Reply-To: <20130624201345.GA21822@cmpxchg.org> MIME-Version: 1.0 Message-Id: <20130628120613.6D6CAD21@pobox.sk> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org List-ID: To: =?utf-8?q?Johannes_Weiner?= Cc: =?utf-8?q?Michal_Hocko?= , linux-kernel@vger.kernel.org, linux-mm@kvack.org, =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= >It's not a kernel thread that does it because all kernel-context >handle_mm_fault() are annotated properly, which means the task must be >userspace and, since tasks is empty, have exited before synchronizing. > >Can you try with the following patch on top? Michal and Johannes, i have some observations which i made: Original patch from Johannes was really fixing something but definitely not everything and was introducing new problems. I'm running unpatched kernel from time i send my last message and problems with freezing cgroups are occuring very often (several times per day) - they were, on the other hand, quite rare with patch from Johannes. Johannes, i didn't try your last patch yet. 
I would like to wait until you or Michal look at my last message which contained detailed information about freezing of cgroups on kernel running your original patch (which was suppose to fix it for good). Even more, i would like to hear your opinion about that stucked processes which was holding web server port and which forced me to reboot production server at the middle of the day :( more information was in my last message. Thank you very much for your time. azur -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx129.postini.com [74.125.245.129]) by kanga.kvack.org (Postfix) with SMTP id BA8086B0033 for ; Fri, 5 Jul 2013 14:17:39 -0400 (EDT) Date: Fri, 5 Jul 2013 14:17:28 -0400 From: Johannes Weiner Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM Message-ID: <20130705181728.GQ17812@cmpxchg.org> References: <20130211112240.GC19922@dhcp22.suse.cz> <20130222092332.4001E4B6@pobox.sk> <20130606160446.GE24115@dhcp22.suse.cz> <20130606181633.BCC3E02E@pobox.sk> <20130607131157.GF8117@dhcp22.suse.cz> <20130617122134.2E072BA8@pobox.sk> <20130619132614.GC16457@dhcp22.suse.cz> <20130622220958.D10567A4@pobox.sk> <20130624201345.GA21822@cmpxchg.org> <20130628120613.6D6CAD21@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130628120613.6D6CAD21@pobox.sk> Sender: owner-linux-mm@kvack.org List-ID: To: azurIt Cc: Michal Hocko , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki Hi azurIt, On Fri, Jun 28, 2013 at 12:06:13PM +0200, azurIt wrote: > >It's not a kernel thread that does it because all kernel-context > >handle_mm_fault() are annotated properly, which means the task must be > >userspace and, since tasks is empty, 
have exited before synchronizing. > > > >Can you try with the following patch on top? > > > Michal and Johannes, > > i have some observations which i made: Original patch from Johannes > was really fixing something but definitely not everything and was > introducing new problems. I'm running unpatched kernel from time i > send my last message and problems with freezing cgroups are occuring > very often (several times per day) - they were, on the other hand, > quite rare with patch from Johannes. That's good! > Johannes, i didn't try your last patch yet. I would like to wait > until you or Michal look at my last message which contained detailed > information about freezing of cgroups on kernel running your > original patch (which was suppose to fix it for good). Even more, i > would like to hear your opinion about that stucked processes which > was holding web server port and which forced me to reboot production > server at the middle of the day :( more information was in my last > message. Thank you very much for your time. I looked at your debug messages but could not find anything that would hint at a deadlock. All tasks are stuck in the refrigerator, so I assume you use the freezer cgroup and enabled it somehow? Sorry about your production server locking up, but from the stacks I don't see any connection to the OOM problems you were having... :/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx113.postini.com [74.125.245.113]) by kanga.kvack.org (Postfix) with SMTP id F07D16B0033 for ; Fri, 5 Jul 2013 15:19:05 -0400 (EDT) Date: Fri, 5 Jul 2013 15:18:54 -0400 From: Johannes Weiner Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM Message-ID: <20130705191854.GR17812@cmpxchg.org> References: <20130606160446.GE24115@dhcp22.suse.cz> <20130606181633.BCC3E02E@pobox.sk> <20130607131157.GF8117@dhcp22.suse.cz> <20130617122134.2E072BA8@pobox.sk> <20130619132614.GC16457@dhcp22.suse.cz> <20130622220958.D10567A4@pobox.sk> <20130624201345.GA21822@cmpxchg.org> <20130628120613.6D6CAD21@pobox.sk> <20130705181728.GQ17812@cmpxchg.org> <20130705210246.11D2135A@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130705210246.11D2135A@pobox.sk> Sender: owner-linux-mm@kvack.org List-ID: To: azurIt Cc: Michal Hocko , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki On Fri, Jul 05, 2013 at 09:02:46PM +0200, azurIt wrote: > >I looked at your debug messages but could not find anything that would > >hint at a deadlock. All tasks are stuck in the refrigerator, so I > >assume you use the freezer cgroup and enabled it somehow? > > > Yes, i'm really using freezer cgroup BUT i was checking if it's not > doing problems - unfortunately, several days passed from that day > and now i don't fully remember if i was checking it for both cases > (unremoveabled cgroups and these freezed processes holding web > server port). I'm 100% sure i was checking it for unremoveable > cgroups but not so sure for the other problem (i had to act quickly > in that case). Are you sure (from stacks) that freezer cgroup was > enabled there? 
Yeah, all the traces without exception look like this:

1372089762/23433/stack:[] refrigerator+0x95/0x160
1372089762/23433/stack:[] get_signal_to_deliver+0x1cb/0x540
1372089762/23433/stack:[] do_signal+0x6b/0x750
1372089762/23433/stack:[] do_notify_resume+0x55/0x80
1372089762/23433/stack:[] int_signal+0x12/0x17
1372089762/23433/stack:[] 0xffffffffffffffff

so the freezer was already enabled when you took the backtraces. > Btw, what about that other stacks? I mean this file: > http://watchdog.sk/lkml/memcg-bug-7.tar.gz > > It was taken while running the kernel with your patch and from > cgroup which was under unresolveable OOM (just like my very original > problem). I looked at these traces too, but none of the tasks are stuck in rmdir or the OOM path. Some /are/ in the page fault path, but they are happily doing reclaim and don't appear to be stuck. So I'm having a hard time matching this data to what you otherwise observed. However, based on what you reported the most likely explanation for the continued hangs is the unfinished OOM handling for which I sent the followup patch for arch/x86/mm/fault.c. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx103.postini.com [74.125.245.103]) by kanga.kvack.org (Postfix) with SMTP id 4888E6B0031 for ; Tue, 9 Jul 2013 09:00:22 -0400 (EDT) Date: Tue, 9 Jul 2013 15:00:17 +0200 From: Michal Hocko Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM Message-ID: <20130709130017.GE20281@dhcp22.suse.cz> References: <20130210174619.24F20488@pobox.sk> <20130211112240.GC19922@dhcp22.suse.cz> <20130222092332.4001E4B6@pobox.sk> <20130606160446.GE24115@dhcp22.suse.cz> <20130606181633.BCC3E02E@pobox.sk> <20130607131157.GF8117@dhcp22.suse.cz> <20130617122134.2E072BA8@pobox.sk> <20130619132614.GC16457@dhcp22.suse.cz> <20130622220958.D10567A4@pobox.sk> <20130624201345.GA21822@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130624201345.GA21822@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki On Mon 24-06-13 16:13:45, Johannes Weiner wrote: > Hi guys, > > On Sat, Jun 22, 2013 at 10:09:58PM +0200, azurIt wrote: > > >> But i'm sure of one thing - when problem occurs, nothing is able to > > >> access hard drives (every process which tries it is freezed until > > >> problem is resolved or server is rebooted). > > > > > >I would be really interesting to see what those tasks are blocked on. > > > > I'm trying to get it, stay tuned :) > > > > Today i noticed one bug, not 100% sure it is related to 'your' patch > > but i didn't seen this before. I noticed that i have lots of cgroups > > which cannot be removed - if i do 'rmdir ', it > > just hangs and never complete. Even more, it's not possible to > > access the whole cgroup filesystem until i kill that rmdir > > (anything, which tries it, just hangs). 
All unremoveable cgroups has > > this in 'memory.oom_control': oom_kill_disable 0 under_oom 1 > > Somebody acquires the OOM wait reference to the memcg and marks it > under oom but then does not call into mem_cgroup_oom_synchronize() to > clean up. That's why under_oom is set and the rmdir waits for > outstanding references. > > > And, yes, 'tasks' file is empty. > > It's not a kernel thread that does it because all kernel-context > handle_mm_fault() are annotated properly, which means the task must be > userspace and, since tasks is empty, have exited before synchronizing. Yes, well spotted. I have missed that while reviewing your patch. The follow up fix looks correct.

> Can you try with the following patch on top?
>
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 5db0490..9a0b152 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -846,17 +846,6 @@ static noinline int
>  mm_fault_error(struct pt_regs *regs, unsigned long error_code,
>  	       unsigned long address, unsigned int fault)
>  {
> -	/*
> -	 * Pagefault was interrupted by SIGKILL. We have no reason to
> -	 * continue pagefault.
> -	 */
> -	if (fatal_signal_pending(current)) {
> -		if (!(fault & VM_FAULT_RETRY))
> -			up_read(&current->mm->mmap_sem);
> -		if (!(error_code & PF_USER))
> -			no_context(regs, error_code, address);
> -		return 1;
> -	}
>  	if (!(fault & VM_FAULT_ERROR))
>  		return 0;
> -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx135.postini.com [74.125.245.135]) by kanga.kvack.org (Postfix) with SMTP id 4A2D56B0031 for ; Tue, 9 Jul 2013 09:08:11 -0400 (EDT) Date: Tue, 9 Jul 2013 15:08:08 +0200 From: Michal Hocko Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM Message-ID: <20130709130808.GF20281@dhcp22.suse.cz> References: <20130211112240.GC19922@dhcp22.suse.cz> <20130222092332.4001E4B6@pobox.sk> <20130606160446.GE24115@dhcp22.suse.cz> <20130606181633.BCC3E02E@pobox.sk> <20130607131157.GF8117@dhcp22.suse.cz> <20130617122134.2E072BA8@pobox.sk> <20130619132614.GC16457@dhcp22.suse.cz> <20130622220958.D10567A4@pobox.sk> <20130624201345.GA21822@cmpxchg.org> <20130709130017.GE20281@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130709130017.GE20281@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki On Tue 09-07-13 15:00:17, Michal Hocko wrote: > On Mon 24-06-13 16:13:45, Johannes Weiner wrote: > > Hi guys, > > > > On Sat, Jun 22, 2013 at 10:09:58PM +0200, azurIt wrote: > > > >> But i'm sure of one thing - when problem occurs, nothing is able to > > > >> access hard drives (every process which tries it is freezed until > > > >> problem is resolved or server is rebooted). > > > > > > > >I would be really interesting to see what those tasks are blocked on. > > > > > > I'm trying to get it, stay tuned :) > > > > > > Today i noticed one bug, not 100% sure it is related to 'your' patch > > > but i didn't seen this before. I noticed that i have lots of cgroups > > > which cannot be removed - if i do 'rmdir ', it > > > just hangs and never complete. 
Even more, it's not possible to > > > access the whole cgroup filesystem until i kill that rmdir > > > (anything, which tries it, just hangs). All unremoveable cgroups has > > > this in 'memory.oom_control': oom_kill_disable 0 under_oom 1 > > > > Somebody acquires the OOM wait reference to the memcg and marks it > > under oom but then does not call into mem_cgroup_oom_synchronize() to > > clean up. That's why under_oom is set and the rmdir waits for > > outstanding references. > > > > > And, yes, 'tasks' file is empty. > > > > It's not a kernel thread that does it because all kernel-context > > handle_mm_fault() are annotated properly, which means the task must be > > userspace and, since tasks is empty, have exited before synchronizing. > > Yes, well spotted. I have missed that while reviewing your patch. > The follow up fix looks correct. Hmm, I guess you wanted to remove !(fault & VM_FAULT_ERROR) test as well otherwise the else BUG() path would be unreachable and we wouldn't know that something fishy is going on. > > Can you try with the following patch on top? > > > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c > > index 5db0490..9a0b152 100644 > > --- a/arch/x86/mm/fault.c > > +++ b/arch/x86/mm/fault.c > > @@ -846,17 +846,6 @@ static noinline int > > mm_fault_error(struct pt_regs *regs, unsigned long error_code, > > unsigned long address, unsigned int fault) > > { > > - /* > > - * Pagefault was interrupted by SIGKILL. We have no reason to > > - * continue pagefault. 
> > -	 */
> > -	if (fatal_signal_pending(current)) {
> > -		if (!(fault & VM_FAULT_RETRY))
> > -			up_read(&current->mm->mmap_sem);
> > -		if (!(error_code & PF_USER))
> > -			no_context(regs, error_code, address);
> > -		return 1;
> > -	}
> >  	if (!(fault & VM_FAULT_ERROR))
> >  		return 0;
> > > > -- > Michal Hocko > SUSE Labs > -- > To unsubscribe from this list: send the line "unsubscribe cgroups" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx110.postini.com [74.125.245.110]) by kanga.kvack.org (Postfix) with SMTP id C0BB16B0031 for ; Tue, 9 Jul 2013 09:10:01 -0400 (EDT) Date: Tue, 9 Jul 2013 15:10:00 +0200 From: Michal Hocko Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM Message-ID: <20130709131000.GG20281@dhcp22.suse.cz> References: <20130222092332.4001E4B6@pobox.sk> <20130606160446.GE24115@dhcp22.suse.cz> <20130606181633.BCC3E02E@pobox.sk> <20130607131157.GF8117@dhcp22.suse.cz> <20130617122134.2E072BA8@pobox.sk> <20130619132614.GC16457@dhcp22.suse.cz> <20130622220958.D10567A4@pobox.sk> <20130624201345.GA21822@cmpxchg.org> <20130709130017.GE20281@dhcp22.suse.cz> <20130709130808.GF20281@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130709130808.GF20281@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki On Tue 09-07-13 15:08:08, Michal Hocko wrote: > On Tue 09-07-13 15:00:17, Michal Hocko wrote: > > On Mon 24-06-13 16:13:45, Johannes Weiner wrote: > > > Hi guys, > > > >
> > On Sat, Jun 22, 2013 at 10:09:58PM +0200, azurIt wrote: > > > > >> But i'm sure of one thing - when problem occurs, nothing is able to > > > > >> access hard drives (every process which tries it is freezed until > > > > >> problem is resolved or server is rebooted). > > > > > > > > > >I would be really interesting to see what those tasks are blocked on. > > > > > > > > I'm trying to get it, stay tuned :) > > > > > > > > Today i noticed one bug, not 100% sure it is related to 'your' patch > > > > but i didn't seen this before. I noticed that i have lots of cgroups > > > > which cannot be removed - if i do 'rmdir ', it > > > > just hangs and never complete. Even more, it's not possible to > > > > access the whole cgroup filesystem until i kill that rmdir > > > > (anything, which tries it, just hangs). All unremoveable cgroups has > > > > this in 'memory.oom_control': oom_kill_disable 0 under_oom 1 > > > > > > Somebody acquires the OOM wait reference to the memcg and marks it > > > under oom but then does not call into mem_cgroup_oom_synchronize() to > > > clean up. That's why under_oom is set and the rmdir waits for > > > outstanding references. > > > > > > > And, yes, 'tasks' file is empty. > > > > > > It's not a kernel thread that does it because all kernel-context > > > handle_mm_fault() are annotated properly, which means the task must be > > > userspace and, since tasks is empty, have exited before synchronizing. > > > > Yes, well spotted. I have missed that while reviewing your patch. > > The follow up fix looks correct. > > Hmm, I guess you wanted to remove !(fault & VM_FAULT_ERROR) test as well > otherwise the else BUG() path would be unreachable and we wouldn't know > that something fishy is going on. No, scratch it! We need it for VM_FAULT_RETRY. Sorry about the noise. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx159.postini.com [74.125.245.159]) by kanga.kvack.org (Postfix) with SMTP id CF9CB6B0031 for ; Tue, 9 Jul 2013 09:54:52 -0400 (EDT) Date: Tue, 9 Jul 2013 15:54:50 +0200 From: Michal Hocko Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM Message-ID: <20130709135450.GI20281@dhcp22.suse.cz> References: <20130619132614.GC16457@dhcp22.suse.cz> <20130622220958.D10567A4@pobox.sk> <20130624201345.GA21822@cmpxchg.org> <20130628120613.6D6CAD21@pobox.sk> <20130705181728.GQ17812@cmpxchg.org> <20130705210246.11D2135A@pobox.sk> <20130705191854.GR17812@cmpxchg.org> <20130708014224.50F06960@pobox.sk> <20130709131029.GH20281@dhcp22.suse.cz> <20130709151921.5160C199@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130709151921.5160C199@pobox.sk> Sender: owner-linux-mm@kvack.org List-ID: To: azurIt Cc: Johannes Weiner , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki On Tue 09-07-13 15:19:21, azurIt wrote: [...] > Now i realized that i forgot to remove UID from that cgroup before > trying to remove it, so cgroup cannot be removed anyway (we are using > third party cgroup called cgroup-uid from Andrea Righi, which is able > to associate all user's processes with target cgroup). Look here for > cgroup-uid patch: > https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch > > ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was > permanently '1'. This is really strange. Could you post the whole diff against stable tree you are using (except for grsecurity stuff and the above cgroup-uid patch)? Btw. 
the patch below might help us to point to the exit path which leaves wait_on_memcg set without calling mem_cgroup_oom_synchronize:
---
diff --git a/kernel/exit.c b/kernel/exit.c
index e6e01b9..ad472e0 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -895,6 +895,7 @@ NORET_TYPE void do_exit(long code)
 
 	profile_task_exit(tsk);
 
+	WARN_ON(current->memcg_oom.wait_on_memcg);
 	WARN_ON(blk_needs_flush_plug(tsk));
 
 	if (unlikely(in_interrupt()))
-- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx153.postini.com [74.125.245.153]) by kanga.kvack.org (Postfix) with SMTP id B3CBE6B0032 for ; Thu, 11 Jul 2013 03:25:11 -0400 (EDT) Date: Thu, 11 Jul 2013 09:25:07 +0200 From: Michal Hocko Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM Message-ID: <20130711072507.GA21667@dhcp22.suse.cz> References: <20130624201345.GA21822@cmpxchg.org> <20130628120613.6D6CAD21@pobox.sk> <20130705181728.GQ17812@cmpxchg.org> <20130705210246.11D2135A@pobox.sk> <20130705191854.GR17812@cmpxchg.org> <20130708014224.50F06960@pobox.sk> <20130709131029.GH20281@dhcp22.suse.cz> <20130709151921.5160C199@pobox.sk> <20130709135450.GI20281@dhcp22.suse.cz> <20130710182506.F25DF461@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130710182506.F25DF461@pobox.sk> Sender: owner-linux-mm@kvack.org List-ID: To: azurIt Cc: Johannes Weiner , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , righi.andrea@gmail.com On Wed 10-07-13 18:25:06, azurIt wrote: > >> Now i realized that i forgot to remove UID from that cgroup before > >> trying to remove it, so cgroup cannot be removed anyway (we are using > >> third party cgroup called cgroup-uid from Andrea
Righi, which is able > >> to associate all user's processes with target cgroup). Look here for > >> cgroup-uid patch: > >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch > >> > >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was > >> permanently '1'. > > > >This is really strange. Could you post the whole diff against stable > >tree you are using (except for grsecurity stuff and the above cgroup-uid > >patch)? > > > Here are all patches which i applied to kernel 3.2.48 in my last test: > http://watchdog.sk/lkml/patches3/ The two patches from Johannes seem correct. From a quick look even grsecurity patchset shouldn't interfere as it doesn't seem to put any code between handle_mm_fault and mm_fault_error and there also doesn't seem to be any new handle_mm_fault call sites. But I cannot tell there aren't other code paths which would lead to a memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx205.postini.com [74.125.245.205]) by kanga.kvack.org (Postfix) with SMTP id D36DF6B0031 for ; Sat, 13 Jul 2013 19:26:43 -0400 (EDT) Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=5D_memcg=3A_do_not_trap_chargers_with_full_callstack_on_OOM?= Date: Sun, 14 Jul 2013 01:26:41 +0200 From: "azurIt" References: <20130624201345.GA21822@cmpxchg.org>, <20130628120613.6D6CAD21@pobox.sk>, <20130705181728.GQ17812@cmpxchg.org>, <20130705210246.11D2135A@pobox.sk>, <20130705191854.GR17812@cmpxchg.org>, <20130708014224.50F06960@pobox.sk>, <20130709131029.GH20281@dhcp22.suse.cz>, <20130709151921.5160C199@pobox.sk>, <20130709135450.GI20281@dhcp22.suse.cz>, <20130710182506.F25DF461@pobox.sk> <20130711072507.GA21667@dhcp22.suse.cz> In-Reply-To: <20130711072507.GA21667@dhcp22.suse.cz> MIME-Version: 1.0 Message-Id: <20130714012641.C2DA4E05@pobox.sk> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org List-ID: To: =?utf-8?q?Michal_Hocko?= Cc: =?utf-8?q?Johannes_Weiner?= , linux-kernel@vger.kernel.org, linux-mm@kvack.org, =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , righi.andrea@gmail.com > CC: "Johannes Weiner" , linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , righi.andrea@gmail.com >On Wed 10-07-13 18:25:06, azurIt wrote: >> >> Now i realized that i forgot to remove UID from that cgroup before >> >> trying to remove it, so cgroup cannot be removed anyway (we are using >> >> third party cgroup called cgroup-uid from Andrea Righi, which is able >> >> to associate all user's processes with target cgroup). Look here for >> >> cgroup-uid patch: >> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch >> >> >> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was >> >> permanently '1'. 
>> > >> >This is really strange. Could you post the whole diff against stable >> >tree you are using (except for grsecurity stuff and the above cgroup-uid >> >patch)? >> >> >> Here are all patches which i applied to kernel 3.2.48 in my last test: >> http://watchdog.sk/lkml/patches3/ > >The two patches from Johannes seem correct. > >From a quick look even grsecurity patchset shouldn't interfere as it >doesn't seem to put any code between handle_mm_fault and mm_fault_error >and there also doesn't seem to be any new handle_mm_fault call sites. > >But I cannot tell there aren't other code paths which would lead to a >memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling. Michal, now i can definitely confirm that problem with unremovable cgroups persists. What info do you need from me? I applied also your little 'WARN_ON' patch. azur -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx181.postini.com [74.125.245.181]) by kanga.kvack.org (Postfix) with SMTP id F0C906B0031 for ; Mon, 15 Jul 2013 11:41:21 -0400 (EDT) Date: Mon, 15 Jul 2013 17:41:19 +0200 From: Michal Hocko Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM Message-ID: <20130715154119.GA32435@dhcp22.suse.cz> References: <20130705210246.11D2135A@pobox.sk> <20130705191854.GR17812@cmpxchg.org> <20130708014224.50F06960@pobox.sk> <20130709131029.GH20281@dhcp22.suse.cz> <20130709151921.5160C199@pobox.sk> <20130709135450.GI20281@dhcp22.suse.cz> <20130710182506.F25DF461@pobox.sk> <20130711072507.GA21667@dhcp22.suse.cz> <20130714012641.C2DA4E05@pobox.sk> <20130714015112.FFCB7AF7@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130714015112.FFCB7AF7@pobox.sk> Sender: owner-linux-mm@kvack.org List-ID: To: azurIt Cc: Johannes Weiner , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , righi.andrea@gmail.com On Sun 14-07-13 01:51:12, azurIt wrote: > > CC: "Johannes Weiner" , linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , righi.andrea@gmail.com > >> CC: "Johannes Weiner" , linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , righi.andrea@gmail.com > >>On Wed 10-07-13 18:25:06, azurIt wrote: > >>> >> Now i realized that i forgot to remove UID from that cgroup before > >>> >> trying to remove it, so cgroup cannot be removed anyway (we are using > >>> >> third party cgroup called cgroup-uid from Andrea Righi, which is able > >>> >> to associate all user's processes with target cgroup). 
Look here for > >>> >> cgroup-uid patch: > >>> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch > >>> >> > >>> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was > >>> >> permanently '1'. > >>> > > >>> >This is really strange. Could you post the whole diff against stable > >>> >tree you are using (except for grsecurity stuff and the above cgroup-uid > >>> >patch)? > >>> > >>> > >>> Here are all patches which i applied to kernel 3.2.48 in my last test: > >>> http://watchdog.sk/lkml/patches3/ > >> > >>The two patches from Johannes seem correct. > >> > >>From a quick look even grsecurity patchset shouldn't interfere as it > >>doesn't seem to put any code between handle_mm_fault and mm_fault_error > >>and there also doesn't seem to be any new handle_mm_fault call sites. > >> > >>But I cannot tell there aren't other code paths which would lead to a > >>memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling. > > > > > >Michal, > > > >now i can definitely confirm that problem with unremovable cgroups > >persists. What info do you need from me? I applied also your little > >'WARN_ON' patch. 
> > Ok, i think you want this: > http://watchdog.sk/lkml/kern4.log Jul 14 01:11:39 server01 kernel: [ 593.589087] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name Jul 14 01:11:39 server01 kernel: [ 593.589451] [12021] 1333 12021 172027 64723 4 0 0 apache2 Jul 14 01:11:39 server01 kernel: [ 593.589647] [12030] 1333 12030 172030 64748 2 0 0 apache2 Jul 14 01:11:39 server01 kernel: [ 593.589836] [12031] 1333 12031 172030 64749 3 0 0 apache2 Jul 14 01:11:39 server01 kernel: [ 593.590025] [12032] 1333 12032 170619 63428 3 0 0 apache2 Jul 14 01:11:39 server01 kernel: [ 593.590213] [12033] 1333 12033 167934 60524 2 0 0 apache2 Jul 14 01:11:39 server01 kernel: [ 593.590401] [12034] 1333 12034 170747 63496 4 0 0 apache2 Jul 14 01:11:39 server01 kernel: [ 593.590588] [12035] 1333 12035 169659 62451 1 0 0 apache2 Jul 14 01:11:39 server01 kernel: [ 593.590776] [12036] 1333 12036 167614 60384 3 0 0 apache2 Jul 14 01:11:39 server01 kernel: [ 593.590984] [12037] 1333 12037 166342 58964 3 0 0 apache2 Jul 14 01:11:39 server01 kernel: [ 593.591178] Memory cgroup out of memory: Kill process 12021 (apache2) score 847 or sacrifice child Jul 14 01:11:39 server01 kernel: [ 593.591370] Killed process 12021 (apache2) total-vm:688108kB, anon-rss:255472kB, file-rss:3420kB Jul 14 01:11:41 server01 kernel: [ 595.392920] ------------[ cut here ]------------ Jul 14 01:11:41 server01 kernel: [ 595.393096] WARNING: at kernel/exit.c:888 do_exit+0x7d0/0x870() Jul 14 01:11:41 server01 kernel: [ 595.393256] Hardware name: S5000VSA Jul 14 01:11:41 server01 kernel: [ 595.393415] Pid: 12037, comm: apache2 Not tainted 3.2.48-grsec #1 Jul 14 01:11:41 server01 kernel: [ 595.393577] Call Trace: Jul 14 01:11:41 server01 kernel: [ 595.393737] [] warn_slowpath_common+0x7a/0xb0 Jul 14 01:11:41 server01 kernel: [ 595.393903] [] warn_slowpath_null+0x1a/0x20 Jul 14 01:11:41 server01 kernel: [ 595.394068] [] do_exit+0x7d0/0x870 Jul 14 01:11:41 server01 kernel: [ 595.394231] [] ? 
thread_group_times+0x44/0xb0
Jul 14 01:11:41 server01 kernel: [ 595.394392] [] do_group_exit+0x51/0xc0
Jul 14 01:11:41 server01 kernel: [ 595.394551] [] sys_exit_group+0x17/0x20
Jul 14 01:11:41 server01 kernel: [ 595.394714] [] system_call_fastpath+0x18/0x1d
Jul 14 01:11:41 server01 kernel: [ 595.394921] ---[ end trace 738570e688acf099 ]---

OK, so you had an OOM which has been handled by the in-kernel OOM handler
(it killed 12021) and 12037 was in the same group. The warning tells us
that it went through mem_cgroup_oom as well (otherwise it wouldn't have
memcg_oom.wait_on_memcg set and the warning wouldn't trigger) and then
it exited on the userspace request (by exit syscall).

I do not see how this could happen, though. If mem_cgroup_oom
is called then we always return CHARGE_NOMEM which turns into ENOMEM
returned by __mem_cgroup_try_charge (invoke_oom must have been set to
true). So if nobody screwed up the return value on the way up to the page
fault handler then there is no way to escape.

I will check the code.
--
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx121.postini.com [74.125.245.121]) by kanga.kvack.org (Postfix) with SMTP id E006B6B0038 for ; Mon, 15 Jul 2013 12:00:07 -0400 (EDT) Date: Mon, 15 Jul 2013 18:00:06 +0200 From: Michal Hocko Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM Message-ID: <20130715160006.GB32435@dhcp22.suse.cz> References: <20130705191854.GR17812@cmpxchg.org> <20130708014224.50F06960@pobox.sk> <20130709131029.GH20281@dhcp22.suse.cz> <20130709151921.5160C199@pobox.sk> <20130709135450.GI20281@dhcp22.suse.cz> <20130710182506.F25DF461@pobox.sk> <20130711072507.GA21667@dhcp22.suse.cz> <20130714012641.C2DA4E05@pobox.sk> <20130714015112.FFCB7AF7@pobox.sk> <20130715154119.GA32435@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130715154119.GA32435@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: azurIt Cc: Johannes Weiner , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , righi.andrea@gmail.com On Mon 15-07-13 17:41:19, Michal Hocko wrote: > On Sun 14-07-13 01:51:12, azurIt wrote: > > > CC: "Johannes Weiner" , linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , righi.andrea@gmail.com > > >> CC: "Johannes Weiner" , linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , righi.andrea@gmail.com > > >>On Wed 10-07-13 18:25:06, azurIt wrote: > > >>> >> Now i realized that i forgot to remove UID from that cgroup before > > >>> >> trying to remove it, so cgroup cannot be removed anyway (we are using > > >>> >> third party cgroup called cgroup-uid from Andrea Righi, which is able > > >>> >> to associate all user's processes with target cgroup). 
Look here for > > >>> >> cgroup-uid patch: > > >>> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch > > >>> >> > > >>> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was > > >>> >> permanently '1'. > > >>> > > > >>> >This is really strange. Could you post the whole diff against stable > > >>> >tree you are using (except for grsecurity stuff and the above cgroup-uid > > >>> >patch)? > > >>> > > >>> > > >>> Here are all patches which i applied to kernel 3.2.48 in my last test: > > >>> http://watchdog.sk/lkml/patches3/ > > >> > > >>The two patches from Johannes seem correct. > > >> > > >>From a quick look even grsecurity patchset shouldn't interfere as it > > >>doesn't seem to put any code between handle_mm_fault and mm_fault_error > > >>and there also doesn't seem to be any new handle_mm_fault call sites. > > >> > > >>But I cannot tell there aren't other code paths which would lead to a > > >>memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling. > > > > > > > > >Michal, > > > > > >now i can definitely confirm that problem with unremovable cgroups > > >persists. What info do you need from me? I applied also your little > > >'WARN_ON' patch. 
> > > > Ok, i think you want this: > > http://watchdog.sk/lkml/kern4.log > > Jul 14 01:11:39 server01 kernel: [ 593.589087] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name > Jul 14 01:11:39 server01 kernel: [ 593.589451] [12021] 1333 12021 172027 64723 4 0 0 apache2 > Jul 14 01:11:39 server01 kernel: [ 593.589647] [12030] 1333 12030 172030 64748 2 0 0 apache2 > Jul 14 01:11:39 server01 kernel: [ 593.589836] [12031] 1333 12031 172030 64749 3 0 0 apache2 > Jul 14 01:11:39 server01 kernel: [ 593.590025] [12032] 1333 12032 170619 63428 3 0 0 apache2 > Jul 14 01:11:39 server01 kernel: [ 593.590213] [12033] 1333 12033 167934 60524 2 0 0 apache2 > Jul 14 01:11:39 server01 kernel: [ 593.590401] [12034] 1333 12034 170747 63496 4 0 0 apache2 > Jul 14 01:11:39 server01 kernel: [ 593.590588] [12035] 1333 12035 169659 62451 1 0 0 apache2 > Jul 14 01:11:39 server01 kernel: [ 593.590776] [12036] 1333 12036 167614 60384 3 0 0 apache2 > Jul 14 01:11:39 server01 kernel: [ 593.590984] [12037] 1333 12037 166342 58964 3 0 0 apache2 > Jul 14 01:11:39 server01 kernel: [ 593.591178] Memory cgroup out of memory: Kill process 12021 (apache2) score 847 or sacrifice child > Jul 14 01:11:39 server01 kernel: [ 593.591370] Killed process 12021 (apache2) total-vm:688108kB, anon-rss:255472kB, file-rss:3420kB > Jul 14 01:11:41 server01 kernel: [ 595.392920] ------------[ cut here ]------------ > Jul 14 01:11:41 server01 kernel: [ 595.393096] WARNING: at kernel/exit.c:888 do_exit+0x7d0/0x870() > Jul 14 01:11:41 server01 kernel: [ 595.393256] Hardware name: S5000VSA > Jul 14 01:11:41 server01 kernel: [ 595.393415] Pid: 12037, comm: apache2 Not tainted 3.2.48-grsec #1 > Jul 14 01:11:41 server01 kernel: [ 595.393577] Call Trace: > Jul 14 01:11:41 server01 kernel: [ 595.393737] [] warn_slowpath_common+0x7a/0xb0 > Jul 14 01:11:41 server01 kernel: [ 595.393903] [] warn_slowpath_null+0x1a/0x20 > Jul 14 01:11:41 server01 kernel: [ 595.394068] [] do_exit+0x7d0/0x870 > Jul 14 01:11:41 server01 
kernel: [ 595.394231] [] ? thread_group_times+0x44/0xb0
> Jul 14 01:11:41 server01 kernel: [ 595.394392] [] do_group_exit+0x51/0xc0
> Jul 14 01:11:41 server01 kernel: [ 595.394551] [] sys_exit_group+0x17/0x20
> Jul 14 01:11:41 server01 kernel: [ 595.394714] [] system_call_fastpath+0x18/0x1d
> Jul 14 01:11:41 server01 kernel: [ 595.394921] ---[ end trace 738570e688acf099 ]---
>
> OK, so you had an OOM which has been handled by the in-kernel OOM handler
> (it killed 12021) and 12037 was in the same group. The warning tells us
> that it went through mem_cgroup_oom as well (otherwise it wouldn't have
> memcg_oom.wait_on_memcg set and the warning wouldn't trigger) and then
> it exited on the userspace request (by exit syscall).
>
> I do not see how this could happen, though. If mem_cgroup_oom
> is called then we always return CHARGE_NOMEM which turns into ENOMEM
> returned by __mem_cgroup_try_charge (invoke_oom must have been set to
> true). So if nobody screwed up the return value on the way up to the page
> fault handler then there is no way to escape.
>
> I will check the code.

OK, I guess I found it:
__do_fault
  fault = filemap_fault
    do_async_mmap_readahead
      page_cache_async_readahead
        ondemand_readahead
          __do_page_cache_readahead
            read_pages
              readpages = ext3_readpages
                mpage_readpages        # Doesn't propagate ENOMEM
                  add_to_page_cache_lru
                    add_to_page_cache
                      add_to_page_cache_locked
                        mem_cgroup_cache_charge

So it is most probably the readahead. Again! Duhhh. I will try to think
about a fix for this. One obvious place is mpage_readpages, but
__do_page_cache_readahead ignores the read_pages return value as well, and
page_cache_async_readahead, even worse, is just void and exported as
such.

So this smells like a hard to fix bugger. One possible, and really ugly,
way would be calling mem_cgroup_oom_synchronize even if handle_mm_fault
doesn't return VM_FAULT_ERROR, but that is a crude hack.
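
To make the leak above concrete, here is a minimal userspace sketch of the same pattern. It is only an illustration under stated assumptions: struct task, charge_page(), async_readahead(), handle_fault() and exit_would_warn() are hypothetical stand-ins, not the kernel functions, and the single boolean only models the role of memcg_oom.wait_on_memcg. The point it shows is that a void readahead step drops the -ENOMEM, so the fault handler never sees an error and never clears the OOM context.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-task OOM context (memcg_oom.wait_on_memcg). */
struct task {
    bool wait_on_memcg;   /* armed when a charge hits the memcg limit */
};

/* Models a memcg charge: on failure it arms the OOM context and returns
 * -ENOMEM, expecting the page fault handler to clean up later. */
static int charge_page(struct task *t, bool over_limit)
{
    if (!over_limit)
        return 0;
    t->wait_on_memcg = true;
    return -ENOMEM;
}

/* Models the readahead path: it is void, so the charge error is dropped. */
static void async_readahead(struct task *t, bool over_limit)
{
    (void)charge_page(t, over_limit);
}

/* Models the fault handler: it only synchronizes (clears the context)
 * when the fault itself reports an error. */
static int handle_fault(struct task *t, bool readahead_over_limit)
{
    assert(t != NULL);
    async_readahead(t, readahead_over_limit);

    int ret = charge_page(t, false);   /* the faulting page itself fits */
    if (ret) {
        t->wait_on_memcg = false;      /* stand-in for mem_cgroup_oom_synchronize() */
        return ret;
    }
    return 0;                          /* success path: nothing is cleaned up */
}

/* Models the WARN_ON added to do_exit(): true means the context leaked. */
static bool exit_would_warn(const struct task *t)
{
    return t->wait_on_memcg;
}
```

In this model, handle_fault(&t, true) returns 0 yet leaves t.wait_on_memcg set, which is the state the do_exit warning in the log reports for pid 12037.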
-- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx167.postini.com [74.125.245.167]) by kanga.kvack.org (Postfix) with SMTP id 46C9B6B0031 for ; Tue, 16 Jul 2013 11:36:19 -0400 (EDT) Date: Tue, 16 Jul 2013 11:35:44 -0400 From: Johannes Weiner Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM Message-ID: <20130716153544.GX17812@cmpxchg.org> References: <20130708014224.50F06960@pobox.sk> <20130709131029.GH20281@dhcp22.suse.cz> <20130709151921.5160C199@pobox.sk> <20130709135450.GI20281@dhcp22.suse.cz> <20130710182506.F25DF461@pobox.sk> <20130711072507.GA21667@dhcp22.suse.cz> <20130714012641.C2DA4E05@pobox.sk> <20130714015112.FFCB7AF7@pobox.sk> <20130715154119.GA32435@dhcp22.suse.cz> <20130715160006.GB32435@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130715160006.GB32435@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , righi.andrea@gmail.com On Mon, Jul 15, 2013 at 06:00:06PM +0200, Michal Hocko wrote: > On Mon 15-07-13 17:41:19, Michal Hocko wrote: > > On Sun 14-07-13 01:51:12, azurIt wrote: > > > > CC: "Johannes Weiner" , linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , righi.andrea@gmail.com > > > >> CC: "Johannes Weiner" , linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , righi.andrea@gmail.com > > > >>On Wed 10-07-13 18:25:06, azurIt wrote: > > > >>> >> Now i realized that i forgot to remove UID from that cgroup before > > > >>> >> trying to remove it, so cgroup cannot be removed anyway (we 
are using > > > >>> >> third party cgroup called cgroup-uid from Andrea Righi, which is able > > > >>> >> to associate all user's processes with target cgroup). Look here for > > > >>> >> cgroup-uid patch: > > > >>> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch > > > >>> >> > > > >>> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was > > > >>> >> permanently '1'. > > > >>> > > > > >>> >This is really strange. Could you post the whole diff against stable > > > >>> >tree you are using (except for grsecurity stuff and the above cgroup-uid > > > >>> >patch)? > > > >>> > > > >>> > > > >>> Here are all patches which i applied to kernel 3.2.48 in my last test: > > > >>> http://watchdog.sk/lkml/patches3/ > > > >> > > > >>The two patches from Johannes seem correct. > > > >> > > > >>From a quick look even grsecurity patchset shouldn't interfere as it > > > >>doesn't seem to put any code between handle_mm_fault and mm_fault_error > > > >>and there also doesn't seem to be any new handle_mm_fault call sites. > > > >> > > > >>But I cannot tell there aren't other code paths which would lead to a > > > >>memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling. > > > > > > > > > > > >Michal, > > > > > > > >now i can definitely confirm that problem with unremovable cgroups > > > >persists. What info do you need from me? I applied also your little > > > >'WARN_ON' patch. 
> > > > > > Ok, i think you want this: > > > http://watchdog.sk/lkml/kern4.log > > > > Jul 14 01:11:39 server01 kernel: [ 593.589087] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name > > Jul 14 01:11:39 server01 kernel: [ 593.589451] [12021] 1333 12021 172027 64723 4 0 0 apache2 > > Jul 14 01:11:39 server01 kernel: [ 593.589647] [12030] 1333 12030 172030 64748 2 0 0 apache2 > > Jul 14 01:11:39 server01 kernel: [ 593.589836] [12031] 1333 12031 172030 64749 3 0 0 apache2 > > Jul 14 01:11:39 server01 kernel: [ 593.590025] [12032] 1333 12032 170619 63428 3 0 0 apache2 > > Jul 14 01:11:39 server01 kernel: [ 593.590213] [12033] 1333 12033 167934 60524 2 0 0 apache2 > > Jul 14 01:11:39 server01 kernel: [ 593.590401] [12034] 1333 12034 170747 63496 4 0 0 apache2 > > Jul 14 01:11:39 server01 kernel: [ 593.590588] [12035] 1333 12035 169659 62451 1 0 0 apache2 > > Jul 14 01:11:39 server01 kernel: [ 593.590776] [12036] 1333 12036 167614 60384 3 0 0 apache2 > > Jul 14 01:11:39 server01 kernel: [ 593.590984] [12037] 1333 12037 166342 58964 3 0 0 apache2 > > Jul 14 01:11:39 server01 kernel: [ 593.591178] Memory cgroup out of memory: Kill process 12021 (apache2) score 847 or sacrifice child > > Jul 14 01:11:39 server01 kernel: [ 593.591370] Killed process 12021 (apache2) total-vm:688108kB, anon-rss:255472kB, file-rss:3420kB > > Jul 14 01:11:41 server01 kernel: [ 595.392920] ------------[ cut here ]------------ > > Jul 14 01:11:41 server01 kernel: [ 595.393096] WARNING: at kernel/exit.c:888 do_exit+0x7d0/0x870() > > Jul 14 01:11:41 server01 kernel: [ 595.393256] Hardware name: S5000VSA > > Jul 14 01:11:41 server01 kernel: [ 595.393415] Pid: 12037, comm: apache2 Not tainted 3.2.48-grsec #1 > > Jul 14 01:11:41 server01 kernel: [ 595.393577] Call Trace: > > Jul 14 01:11:41 server01 kernel: [ 595.393737] [] warn_slowpath_common+0x7a/0xb0 > > Jul 14 01:11:41 server01 kernel: [ 595.393903] [] warn_slowpath_null+0x1a/0x20 > > Jul 14 01:11:41 server01 kernel: [ 595.394068] [] 
do_exit+0x7d0/0x870 > > Jul 14 01:11:41 server01 kernel: [ 595.394231] [] ? thread_group_times+0x44/0xb0 > > Jul 14 01:11:41 server01 kernel: [ 595.394392] [] do_group_exit+0x51/0xc0 > > Jul 14 01:11:41 server01 kernel: [ 595.394551] [] sys_exit_group+0x17/0x20 > > Jul 14 01:11:41 server01 kernel: [ 595.394714] [] system_call_fastpath+0x18/0x1d > > Jul 14 01:11:41 server01 kernel: [ 595.394921] ---[ end trace 738570e688acf099 ]--- > > > > OK, so you had an OOM which has been handled by in-kernel oom handler > > (it killed 12021) and 12037 was in the same group. The warning tells us > > that it went through mem_cgroup_oom as well (otherwise it wouldn't have > > memcg_oom.wait_on_memcg set and the warning wouldn't trigger) and then > > it exited on the userspace request (by exit syscall). > > > > I do not see any way how, this could happen though. If mem_cgroup_oom > > is called then we always return CHARGE_NOMEM which turns into ENOMEM > > returned by __mem_cgroup_try_charge (invoke_oom must have been set to > > true). So if nobody screwed the return value on the way up to page > > fault handler then there is no way to escape. > > > > I will check the code. > > OK, I guess I found it: > __do_fault > fault = filemap_fault > do_async_mmap_readahead > page_cache_async_readahead > ondemand_readahead > __do_page_cache_readahead > read_pages > readpages = ext3_readpages > mpage_readpages # Doesn't propagate ENOMEM > add_to_page_cache_lru > add_to_page_cache > add_to_page_cache_locked > mem_cgroup_cache_charge > > So the read ahead most probably. Again! Duhhh. I will try to think > about a fix for this. One obvious place is mpage_readpages but > __do_page_cache_readahead ignores read_pages return value as well and > page_cache_async_readahead, even worse, is just void and exported as > such. > > So this smells like a hard to fix bugger. 
One possible, and really ugly
> way would be calling mem_cgroup_oom_synchronize even if handle_mm_fault
> doesn't return VM_FAULT_ERROR, but that is a crude hack.

Ouch, good spot.

I don't think we need to handle an OOM from the readahead code. If
readahead does not produce the desired page, we retry synchronously
in page_cache_read() and handle the OOM properly. We should not
signal an OOM for optional pages anyway.

So either we pass a flag from the readahead code down to
add_to_page_cache and mem_cgroup_cache_charge that tells the charge
code to ignore OOM conditions and not set up an OOM context.

Or we DO call mem_cgroup_oom_synchronize() from read_cache_pages,
with an argument that makes it only clean up the context and not wait.
It would not be completely outlandish to place it there, since it's
right next to where an error from add_to_page_cache() is not further
propagated back through the fault stack.

I'm travelling right now, I'll send a patch when I get back
(Thursday). Unless you beat me to it :)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx203.postini.com [74.125.245.203]) by kanga.kvack.org (Postfix) with SMTP id 592BA6B0034 for ; Tue, 16 Jul 2013 12:09:07 -0400 (EDT) Date: Tue, 16 Jul 2013 18:09:05 +0200 From: Michal Hocko Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM Message-ID: <20130716160905.GA20018@dhcp22.suse.cz> References: <20130709131029.GH20281@dhcp22.suse.cz> <20130709151921.5160C199@pobox.sk> <20130709135450.GI20281@dhcp22.suse.cz> <20130710182506.F25DF461@pobox.sk> <20130711072507.GA21667@dhcp22.suse.cz> <20130714012641.C2DA4E05@pobox.sk> <20130714015112.FFCB7AF7@pobox.sk> <20130715154119.GA32435@dhcp22.suse.cz> <20130715160006.GB32435@dhcp22.suse.cz> <20130716153544.GX17812@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130716153544.GX17812@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , righi.andrea@gmail.com On Tue 16-07-13 11:35:44, Johannes Weiner wrote: > On Mon, Jul 15, 2013 at 06:00:06PM +0200, Michal Hocko wrote: > > On Mon 15-07-13 17:41:19, Michal Hocko wrote: > > > On Sun 14-07-13 01:51:12, azurIt wrote: > > > > > CC: "Johannes Weiner" , linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , righi.andrea@gmail.com > > > > >> CC: "Johannes Weiner" , linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , righi.andrea@gmail.com > > > > >>On Wed 10-07-13 18:25:06, azurIt wrote: > > > > >>> >> Now i realized that i forgot to remove UID from that cgroup before > > > > >>> >> trying to remove it, so cgroup cannot be removed anyway (we are using > > > > >>> >> third party cgroup called cgroup-uid from Andrea Righi, which is able > > > > >>> >> to 
associate all user's processes with target cgroup). Look here for > > > > >>> >> cgroup-uid patch: > > > > >>> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch > > > > >>> >> > > > > >>> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was > > > > >>> >> permanently '1'. > > > > >>> > > > > > >>> >This is really strange. Could you post the whole diff against stable > > > > >>> >tree you are using (except for grsecurity stuff and the above cgroup-uid > > > > >>> >patch)? > > > > >>> > > > > >>> > > > > >>> Here are all patches which i applied to kernel 3.2.48 in my last test: > > > > >>> http://watchdog.sk/lkml/patches3/ > > > > >> > > > > >>The two patches from Johannes seem correct. > > > > >> > > > > >>From a quick look even grsecurity patchset shouldn't interfere as it > > > > >>doesn't seem to put any code between handle_mm_fault and mm_fault_error > > > > >>and there also doesn't seem to be any new handle_mm_fault call sites. > > > > >> > > > > >>But I cannot tell there aren't other code paths which would lead to a > > > > >>memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling. > > > > > > > > > > > > > > >Michal, > > > > > > > > > >now i can definitely confirm that problem with unremovable cgroups > > > > >persists. What info do you need from me? I applied also your little > > > > >'WARN_ON' patch. 
> > > > > > > > Ok, i think you want this: > > > > http://watchdog.sk/lkml/kern4.log > > > > > > Jul 14 01:11:39 server01 kernel: [ 593.589087] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name > > > Jul 14 01:11:39 server01 kernel: [ 593.589451] [12021] 1333 12021 172027 64723 4 0 0 apache2 > > > Jul 14 01:11:39 server01 kernel: [ 593.589647] [12030] 1333 12030 172030 64748 2 0 0 apache2 > > > Jul 14 01:11:39 server01 kernel: [ 593.589836] [12031] 1333 12031 172030 64749 3 0 0 apache2 > > > Jul 14 01:11:39 server01 kernel: [ 593.590025] [12032] 1333 12032 170619 63428 3 0 0 apache2 > > > Jul 14 01:11:39 server01 kernel: [ 593.590213] [12033] 1333 12033 167934 60524 2 0 0 apache2 > > > Jul 14 01:11:39 server01 kernel: [ 593.590401] [12034] 1333 12034 170747 63496 4 0 0 apache2 > > > Jul 14 01:11:39 server01 kernel: [ 593.590588] [12035] 1333 12035 169659 62451 1 0 0 apache2 > > > Jul 14 01:11:39 server01 kernel: [ 593.590776] [12036] 1333 12036 167614 60384 3 0 0 apache2 > > > Jul 14 01:11:39 server01 kernel: [ 593.590984] [12037] 1333 12037 166342 58964 3 0 0 apache2 > > > Jul 14 01:11:39 server01 kernel: [ 593.591178] Memory cgroup out of memory: Kill process 12021 (apache2) score 847 or sacrifice child > > > Jul 14 01:11:39 server01 kernel: [ 593.591370] Killed process 12021 (apache2) total-vm:688108kB, anon-rss:255472kB, file-rss:3420kB > > > Jul 14 01:11:41 server01 kernel: [ 595.392920] ------------[ cut here ]------------ > > > Jul 14 01:11:41 server01 kernel: [ 595.393096] WARNING: at kernel/exit.c:888 do_exit+0x7d0/0x870() > > > Jul 14 01:11:41 server01 kernel: [ 595.393256] Hardware name: S5000VSA > > > Jul 14 01:11:41 server01 kernel: [ 595.393415] Pid: 12037, comm: apache2 Not tainted 3.2.48-grsec #1 > > > Jul 14 01:11:41 server01 kernel: [ 595.393577] Call Trace: > > > Jul 14 01:11:41 server01 kernel: [ 595.393737] [] warn_slowpath_common+0x7a/0xb0 > > > Jul 14 01:11:41 server01 kernel: [ 595.393903] [] warn_slowpath_null+0x1a/0x20 > > > Jul 
14 01:11:41 server01 kernel: [ 595.394068] [] do_exit+0x7d0/0x870 > > > Jul 14 01:11:41 server01 kernel: [ 595.394231] [] ? thread_group_times+0x44/0xb0 > > > Jul 14 01:11:41 server01 kernel: [ 595.394392] [] do_group_exit+0x51/0xc0 > > > Jul 14 01:11:41 server01 kernel: [ 595.394551] [] sys_exit_group+0x17/0x20 > > > Jul 14 01:11:41 server01 kernel: [ 595.394714] [] system_call_fastpath+0x18/0x1d > > > Jul 14 01:11:41 server01 kernel: [ 595.394921] ---[ end trace 738570e688acf099 ]--- > > > > > > OK, so you had an OOM which has been handled by in-kernel oom handler > > > (it killed 12021) and 12037 was in the same group. The warning tells us > > > that it went through mem_cgroup_oom as well (otherwise it wouldn't have > > > memcg_oom.wait_on_memcg set and the warning wouldn't trigger) and then > > > it exited on the userspace request (by exit syscall). > > > > > > I do not see any way how, this could happen though. If mem_cgroup_oom > > > is called then we always return CHARGE_NOMEM which turns into ENOMEM > > > returned by __mem_cgroup_try_charge (invoke_oom must have been set to > > > true). So if nobody screwed the return value on the way up to page > > > fault handler then there is no way to escape. > > > > > > I will check the code. > > > > OK, I guess I found it: > > __do_fault > > fault = filemap_fault > > do_async_mmap_readahead > > page_cache_async_readahead > > ondemand_readahead > > __do_page_cache_readahead > > read_pages > > readpages = ext3_readpages > > mpage_readpages # Doesn't propagate ENOMEM > > add_to_page_cache_lru > > add_to_page_cache > > add_to_page_cache_locked > > mem_cgroup_cache_charge > > > > So the read ahead most probably. Again! Duhhh. I will try to think > > about a fix for this. One obvious place is mpage_readpages but > > __do_page_cache_readahead ignores read_pages return value as well and > > page_cache_async_readahead, even worse, is just void and exported as > > such. > > > > So this smells like a hard to fix bugger. 
One possible, and really ugly
> > way would be calling mem_cgroup_oom_synchronize even if handle_mm_fault
> > doesn't return VM_FAULT_ERROR, but that is a crude hack.
>
> Ouch, good spot.
>
> I don't think we need to handle an OOM from the readahead code. If
> readahead does not produce the desired page, we retry synchronously
> in page_cache_read() and handle the OOM properly. We should not
> signal an OOM for optional pages anyway.
>
> So either we pass a flag from the readahead code down to
> add_to_page_cache and mem_cgroup_cache_charge that tells the charge
> code to ignore OOM conditions and not set up an OOM context.

That was my previous attempt and it was sooo painful.

> Or we DO call mem_cgroup_oom_synchronize() from read_cache_pages,
> with an argument that makes it only clean up the context and not wait.

Yes, I was playing with this idea as well. I just do not like how
fragile this is. We need some way to catch all possible places which
might leak it.

> It would not be completely outlandish to place it there, since it's
> right next to where an error from add_to_page_cache() is not further
> propagated back through the fault stack.
>
> I'm travelling right now, I'll send a patch when I get back
> (Thursday). Unless you beat me to it :)

I can cook something up but there is quite a big pile on my desk
currently (as always :/).
--
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx202.postini.com [74.125.245.202])
	by kanga.kvack.org (Postfix) with SMTP id DCA9E6B0031
	for ; Tue, 16 Jul 2013 12:48:50 -0400 (EDT)
Date: Tue, 16 Jul 2013 12:48:30 -0400
From: Johannes Weiner
Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM
Message-ID: <20130716164830.GZ17812@cmpxchg.org>
References: <20130709151921.5160C199@pobox.sk>
	<20130709135450.GI20281@dhcp22.suse.cz>
	<20130710182506.F25DF461@pobox.sk>
	<20130711072507.GA21667@dhcp22.suse.cz>
	<20130714012641.C2DA4E05@pobox.sk>
	<20130714015112.FFCB7AF7@pobox.sk>
	<20130715154119.GA32435@dhcp22.suse.cz>
	<20130715160006.GB32435@dhcp22.suse.cz>
	<20130716153544.GX17812@cmpxchg.org>
	<20130716160905.GA20018@dhcp22.suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20130716160905.GA20018@dhcp22.suse.cz>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Michal Hocko
Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , righi.andrea@gmail.com

On Tue, Jul 16, 2013 at 06:09:05PM +0200, Michal Hocko wrote:
> On Tue 16-07-13 11:35:44, Johannes Weiner wrote:
> > On Mon, Jul 15, 2013 at 06:00:06PM +0200, Michal Hocko wrote:
> > > On Mon 15-07-13 17:41:19, Michal Hocko wrote:
> > > > On Sun 14-07-13 01:51:12, azurIt wrote:
> > > > > > CC: "Johannes Weiner" , linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , righi.andrea@gmail.com
> > > > > >> CC: "Johannes Weiner" , linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , righi.andrea@gmail.com
> > > > > >>On Wed 10-07-13 18:25:06, azurIt wrote:
> > > > > >>> >> Now i realized that i forgot to remove UID from that cgroup before
> > > > > >>> >> trying to remove it, so cgroup cannot be removed anyway (we are using
> > > > > >>> >> third party cgroup called cgroup-uid from Andrea Righi, which is able
> > > > > >>> >> to associate all user's processes with target cgroup). Look here for
> > > > > >>> >> cgroup-uid patch:
> > > > > >>> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch
> > > > > >>> >>
> > > > > >>> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was
> > > > > >>> >> permanently '1'.
> > > > > >>> >
> > > > > >>> >This is really strange. Could you post the whole diff against stable
> > > > > >>> >tree you are using (except for grsecurity stuff and the above cgroup-uid
> > > > > >>> >patch)?
> > > > > >>>
> > > > > >>>
> > > > > >>> Here are all patches which i applied to kernel 3.2.48 in my last test:
> > > > > >>> http://watchdog.sk/lkml/patches3/
> > > > > >>
> > > > > >>The two patches from Johannes seem correct.
> > > > > >>
> > > > > >>From a quick look even grsecurity patchset shouldn't interfere as it
> > > > > >>doesn't seem to put any code between handle_mm_fault and mm_fault_error
> > > > > >>and there also doesn't seem to be any new handle_mm_fault call sites.
> > > > > >>
> > > > > >>But I cannot tell there aren't other code paths which would lead to a
> > > > > >>memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling.
> > > > > >
> > > > > >
> > > > > >Michal,
> > > > > >
> > > > > >now i can definitely confirm that problem with unremovable cgroups
> > > > > >persists. What info do you need from me? I applied also your little
> > > > > >'WARN_ON' patch.
> > > > >
> > > > > Ok, i think you want this:
> > > > > http://watchdog.sk/lkml/kern4.log
> > > >
> > > > Jul 14 01:11:39 server01 kernel: [ 593.589087] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
> > > > Jul 14 01:11:39 server01 kernel: [ 593.589451] [12021] 1333 12021 172027 64723 4 0 0 apache2
> > > > Jul 14 01:11:39 server01 kernel: [ 593.589647] [12030] 1333 12030 172030 64748 2 0 0 apache2
> > > > Jul 14 01:11:39 server01 kernel: [ 593.589836] [12031] 1333 12031 172030 64749 3 0 0 apache2
> > > > Jul 14 01:11:39 server01 kernel: [ 593.590025] [12032] 1333 12032 170619 63428 3 0 0 apache2
> > > > Jul 14 01:11:39 server01 kernel: [ 593.590213] [12033] 1333 12033 167934 60524 2 0 0 apache2
> > > > Jul 14 01:11:39 server01 kernel: [ 593.590401] [12034] 1333 12034 170747 63496 4 0 0 apache2
> > > > Jul 14 01:11:39 server01 kernel: [ 593.590588] [12035] 1333 12035 169659 62451 1 0 0 apache2
> > > > Jul 14 01:11:39 server01 kernel: [ 593.590776] [12036] 1333 12036 167614 60384 3 0 0 apache2
> > > > Jul 14 01:11:39 server01 kernel: [ 593.590984] [12037] 1333 12037 166342 58964 3 0 0 apache2
> > > > Jul 14 01:11:39 server01 kernel: [ 593.591178] Memory cgroup out of memory: Kill process 12021 (apache2) score 847 or sacrifice child
> > > > Jul 14 01:11:39 server01 kernel: [ 593.591370] Killed process 12021 (apache2) total-vm:688108kB, anon-rss:255472kB, file-rss:3420kB
> > > > Jul 14 01:11:41 server01 kernel: [ 595.392920] ------------[ cut here ]------------
> > > > Jul 14 01:11:41 server01 kernel: [ 595.393096] WARNING: at kernel/exit.c:888 do_exit+0x7d0/0x870()
> > > > Jul 14 01:11:41 server01 kernel: [ 595.393256] Hardware name: S5000VSA
> > > > Jul 14 01:11:41 server01 kernel: [ 595.393415] Pid: 12037, comm: apache2 Not tainted 3.2.48-grsec #1
> > > > Jul 14 01:11:41 server01 kernel: [ 595.393577] Call Trace:
> > > > Jul 14 01:11:41 server01 kernel: [ 595.393737] [] warn_slowpath_common+0x7a/0xb0
> > > > Jul 14 01:11:41 server01 kernel: [ 595.393903] [] warn_slowpath_null+0x1a/0x20
> > > > Jul 14 01:11:41 server01 kernel: [ 595.394068] [] do_exit+0x7d0/0x870
> > > > Jul 14 01:11:41 server01 kernel: [ 595.394231] [] ? thread_group_times+0x44/0xb0
> > > > Jul 14 01:11:41 server01 kernel: [ 595.394392] [] do_group_exit+0x51/0xc0
> > > > Jul 14 01:11:41 server01 kernel: [ 595.394551] [] sys_exit_group+0x17/0x20
> > > > Jul 14 01:11:41 server01 kernel: [ 595.394714] [] system_call_fastpath+0x18/0x1d
> > > > Jul 14 01:11:41 server01 kernel: [ 595.394921] ---[ end trace 738570e688acf099 ]---
> > > >
> > > > OK, so you had an OOM which has been handled by in-kernel oom handler
> > > > (it killed 12021) and 12037 was in the same group. The warning tells us
> > > > that it went through mem_cgroup_oom as well (otherwise it wouldn't have
> > > > memcg_oom.wait_on_memcg set and the warning wouldn't trigger) and then
> > > > it exited on the userspace request (by exit syscall).
> > > >
> > > > I do not see any way how this could happen though. If mem_cgroup_oom
> > > > is called then we always return CHARGE_NOMEM which turns into ENOMEM
> > > > returned by __mem_cgroup_try_charge (invoke_oom must have been set to
> > > > true). So if nobody screwed the return value on the way up to page
> > > > fault handler then there is no way to escape.
> > > >
> > > > I will check the code.
> > >
> > > OK, I guess I found it:
> > > __do_fault
> > >   fault = filemap_fault
> > >     do_async_mmap_readahead
> > >       page_cache_async_readahead
> > >         ondemand_readahead
> > >           __do_page_cache_readahead
> > >             read_pages
> > >               readpages = ext3_readpages
> > >                 mpage_readpages # Doesn't propagate ENOMEM
> > >                   add_to_page_cache_lru
> > >                     add_to_page_cache
> > >                       add_to_page_cache_locked
> > >                         mem_cgroup_cache_charge
> > >
> > > So the read ahead most probably. Again! Duhhh. I will try to think
> > > about a fix for this. One obvious place is mpage_readpages but
> > > __do_page_cache_readahead ignores read_pages return value as well and
> > > page_cache_async_readahead, even worse, is just void and exported as
> > > such.
> > >
> > > So this smells like a hard to fix bugger. One possible, and really ugly
> > > way would be calling mem_cgroup_oom_synchronize even if handle_mm_fault
> > > doesn't return VM_FAULT_ERROR, but that is a crude hack.
> >
> > Ouch, good spot.
> >
> > I don't think we need to handle an OOM from the readahead code. If
> > readahead does not produce the desired page, we retry synchronously
> > in page_cache_read() and handle the OOM properly. We should not
> > signal an OOM for optional pages anyway.
> >
> > So either we pass a flag from the readahead code down to
> > add_to_page_cache and mem_cgroup_cache_charge that tells the charge
> > code to ignore OOM conditions and not set up an OOM context.
>
> That was my previous attempt and it was sooo painful.
>
> > Or we DO call mem_cgroup_oom_synchronize() from the read_cache_pages,
> > with an argument that makes it only clean up the context and not wait.
>
> Yes, I was playing with this idea as well. I just do not like how
> fragile this is. We need some way to catch all possible places which
> might leak it.

I don't think this is necessary, but we could add a sanity check
in/near mem_cgroup_clear_userfault() that makes sure the OOM context
is only set up when an error is returned.

> > It would not be completely outlandish to place it there, since it's
> > right next to where an error from add_to_page_cache() is not further
> > propagated back through the fault stack.
> >
> > I'm travelling right now, I'll send a patch when I get back
> > (Thursday). Unless you beat me to it :)
>
> I can cook something up but there is quite a big pile on my desk
> currently (as always :/).

No worries, I'll send an update.

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx191.postini.com [74.125.245.191])
	by kanga.kvack.org (Postfix) with SMTP id 7A4586B0033
	for ; Fri, 19 Jul 2013 00:22:43 -0400 (EDT)
Date: Fri, 19 Jul 2013 00:22:38 -0400
From: Johannes Weiner
Subject: [patch 1/5] mm: invoke oom-killer from remaining unconverted page fault handlers
Message-ID: <20130719042238.GD17812@cmpxchg.org>
References: <20130710182506.F25DF461@pobox.sk>
	<20130711072507.GA21667@dhcp22.suse.cz>
	<20130714012641.C2DA4E05@pobox.sk>
	<20130714015112.FFCB7AF7@pobox.sk>
	<20130715154119.GA32435@dhcp22.suse.cz>
	<20130715160006.GB32435@dhcp22.suse.cz>
	<20130716153544.GX17812@cmpxchg.org>
	<20130716160905.GA20018@dhcp22.suse.cz>
	<20130716164830.GZ17812@cmpxchg.org>
	<20130719042124.GC17812@cmpxchg.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20130719042124.GC17812@cmpxchg.org>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Michal Hocko
Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , righi.andrea@gmail.com

[already upstream, included for 3.2 reference]

A few remaining architectures directly kill the page faulting task in
an out of memory situation. This is usually not a good idea since
that task might not even use a significant amount of memory and so may
not be the optimal victim to resolve the situation.

Since 2.6.29's 1c0fe6e ("mm: invoke oom-killer from page fault") there
is a hook that architecture page fault handlers are supposed to call to
invoke the OOM killer and let it pick the right task to kill. Convert
the remaining architectures over to this hook.

To have the previous behavior of simply taking out the faulting task
the vm.oom_kill_allocating_task sysctl can be set to 1.

Signed-off-by: Johannes Weiner
Reviewed-by: Michal Hocko
Cc: KAMEZAWA Hiroyuki
Acked-by: David Rientjes
Acked-by: Vineet Gupta [arch/arc bits]
Cc: James Hogan
Cc: David Howells
Cc: Jonas Bonn
Cc: Chen Liqin
Cc: Lennox Wu
Cc: Chris Metcalf
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 arch/mn10300/mm/fault.c  | 7 ++++---
 arch/openrisc/mm/fault.c | 8 ++++----
 arch/score/mm/fault.c    | 8 ++++----
 arch/tile/mm/fault.c     | 8 ++++----
 4 files changed, 16 insertions(+), 15 deletions(-)

diff --git a/arch/mn10300/mm/fault.c b/arch/mn10300/mm/fault.c
index 0945409..5ac4df5 100644
--- a/arch/mn10300/mm/fault.c
+++ b/arch/mn10300/mm/fault.c
@@ -329,9 +329,10 @@ no_context:
  */
 out_of_memory:
 	up_read(&mm->mmap_sem);
-	printk(KERN_ALERT "VM: killing process %s\n", tsk->comm);
-	if ((fault_code & MMUFCR_xFC_ACCESS) == MMUFCR_xFC_ACCESS_USR)
-		do_exit(SIGKILL);
+	if ((fault_code & MMUFCR_xFC_ACCESS) == MMUFCR_xFC_ACCESS_USR) {
+		pagefault_out_of_memory();
+		return;
+	}
 	goto no_context;
 
 do_sigbus:
diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c
index a5dce82..d78881c 100644
--- a/arch/openrisc/mm/fault.c
+++ b/arch/openrisc/mm/fault.c
@@ -246,10 +246,10 @@ out_of_memory:
 	__asm__ __volatile__("l.nop 1");
 
 	up_read(&mm->mmap_sem);
-	printk("VM: killing process %s\n", tsk->comm);
-	if (user_mode(regs))
-		do_exit(SIGKILL);
-	goto no_context;
+	if (!user_mode(regs))
+		goto no_context;
+	pagefault_out_of_memory();
+	return;
 
 do_sigbus:
 	up_read(&mm->mmap_sem);
diff --git a/arch/score/mm/fault.c b/arch/score/mm/fault.c
index 47b600e..6b18fb0 100644
--- a/arch/score/mm/fault.c
+++ b/arch/score/mm/fault.c
@@ -172,10 +172,10 @@ out_of_memory:
 		down_read(&mm->mmap_sem);
 		goto survive;
 	}
-	printk("VM: killing process %s\n", tsk->comm);
-	if (user_mode(regs))
-		do_group_exit(SIGKILL);
-	goto no_context;
+	if (!user_mode(regs))
+		goto no_context;
+	pagefault_out_of_memory();
+	return;
 
 do_sigbus:
 	up_read(&mm->mmap_sem);
diff --git a/arch/tile/mm/fault.c b/arch/tile/mm/fault.c
index 25b7b90..3312531 100644
--- a/arch/tile/mm/fault.c
+++ b/arch/tile/mm/fault.c
@@ -540,10 +540,10 @@ out_of_memory:
 		down_read(&mm->mmap_sem);
 		goto survive;
 	}
-	pr_alert("VM: killing process %s\n", tsk->comm);
-	if (!is_kernel_mode)
-		do_group_exit(SIGKILL);
-	goto no_context;
+	if (is_kernel_mode)
+		goto no_context;
+	pagefault_out_of_memory();
+	return 0;
 
 do_sigbus:
 	up_read(&mm->mmap_sem);
-- 
1.8.3.2

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx126.postini.com [74.125.245.126])
	by kanga.kvack.org (Postfix) with SMTP id 4E7246B0031
	for ; Fri, 19 Jul 2013 00:24:30 -0400 (EDT)
Date: Fri, 19 Jul 2013 00:24:24 -0400
From: Johannes Weiner
Subject: [patch 2/5] mm: pass userspace fault flag to generic fault handler
Message-ID: <20130719042424.GE17812@cmpxchg.org>
References: <20130710182506.F25DF461@pobox.sk>
	<20130711072507.GA21667@dhcp22.suse.cz>
	<20130714012641.C2DA4E05@pobox.sk>
	<20130714015112.FFCB7AF7@pobox.sk>
	<20130715154119.GA32435@dhcp22.suse.cz>
	<20130715160006.GB32435@dhcp22.suse.cz>
	<20130716153544.GX17812@cmpxchg.org>
	<20130716160905.GA20018@dhcp22.suse.cz>
	<20130716164830.GZ17812@cmpxchg.org>
	<20130719042124.GC17812@cmpxchg.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20130719042124.GC17812@cmpxchg.org>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Michal Hocko
Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , righi.andrea@gmail.com

The global OOM killer is (XXX: for most architectures) only invoked
for userspace faults, not for faults from kernelspace (uaccess, gup).
Memcg OOM handling is currently invoked for all faults.
Allow it to behave like the global case by having the architectures pass a flag to the generic fault handler code that identifies userspace faults. Signed-off-by: Johannes Weiner --- arch/alpha/mm/fault.c | 8 +++++++- arch/arm/mm/fault.c | 12 +++++++++--- arch/avr32/mm/fault.c | 8 +++++++- arch/cris/mm/fault.c | 8 +++++++- arch/frv/mm/fault.c | 8 +++++++- arch/hexagon/mm/vm_fault.c | 8 +++++++- arch/ia64/mm/fault.c | 8 +++++++- arch/m32r/mm/fault.c | 8 +++++++- arch/m68k/mm/fault.c | 8 +++++++- arch/microblaze/mm/fault.c | 8 +++++++- arch/mips/mm/fault.c | 8 +++++++- arch/mn10300/mm/fault.c | 8 +++++++- arch/openrisc/mm/fault.c | 8 +++++++- arch/parisc/mm/fault.c | 8 +++++++- arch/powerpc/mm/fault.c | 8 +++++++- arch/s390/mm/fault.c | 2 ++ arch/score/mm/fault.c | 7 ++++++- arch/sh/mm/fault_32.c | 8 +++++++- arch/sh/mm/tlbflush_64.c | 8 +++++++- arch/sparc/mm/fault_32.c | 8 +++++++- arch/sparc/mm/fault_64.c | 8 +++++++- arch/tile/mm/fault.c | 7 ++++++- arch/um/kernel/trap.c | 8 +++++++- arch/unicore32/mm/fault.c | 13 +++++++++---- arch/x86/mm/fault.c | 8 ++++++-- arch/xtensa/mm/fault.c | 8 +++++++- include/linux/mm.h | 1 + 27 files changed, 179 insertions(+), 31 deletions(-) diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c index fadd5f8..fa6b4e4 100644 --- a/arch/alpha/mm/fault.c +++ b/arch/alpha/mm/fault.c @@ -89,6 +89,7 @@ do_page_fault(unsigned long address, unsigned long mmcsr, struct mm_struct *mm = current->mm; const struct exception_table_entry *fixup; int fault, si_code = SEGV_MAPERR; + unsigned long flags = 0; siginfo_t info; /* As of EV6, a load into $31/$f31 is a prefetch, and never faults @@ -142,10 +143,15 @@ do_page_fault(unsigned long address, unsigned long mmcsr, goto bad_area; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (cause > 0) + flags |= FAULT_FLAG_WRITE; + /* If for any reason at all we couldn't handle the fault, make sure we exit gracefully rather than endlessly redo the fault. 
*/ - fault = handle_mm_fault(mm, vma, address, cause > 0 ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); up_read(&mm->mmap_sem); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c index aa33949..31b1e69 100644 --- a/arch/arm/mm/fault.c +++ b/arch/arm/mm/fault.c @@ -231,9 +231,10 @@ static inline bool access_error(unsigned int fsr, struct vm_area_struct *vma) static int __kprobes __do_page_fault(struct mm_struct *mm, unsigned long addr, unsigned int fsr, - struct task_struct *tsk) + struct task_struct *tsk, struct pt_regs *regs) { struct vm_area_struct *vma; + unsigned long flags = 0; int fault; vma = find_vma(mm, addr); @@ -253,11 +254,16 @@ good_area: goto out; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (fsr & FSR_WRITE) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, make * sure we exit gracefully rather than endlessly redo the fault. */ - fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, (fsr & FSR_WRITE) ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, flags); if (unlikely(fault & VM_FAULT_ERROR)) return fault; if (fault & VM_FAULT_MAJOR) @@ -320,7 +326,7 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs) #endif } - fault = __do_page_fault(mm, addr, fsr, tsk); + fault = __do_page_fault(mm, addr, fsr, tsk, regs); up_read(&mm->mmap_sem); perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr); diff --git a/arch/avr32/mm/fault.c b/arch/avr32/mm/fault.c index f7040a1..ada6237 100644 --- a/arch/avr32/mm/fault.c +++ b/arch/avr32/mm/fault.c @@ -59,6 +59,7 @@ asmlinkage void do_page_fault(unsigned long ecr, struct pt_regs *regs) struct mm_struct *mm; struct vm_area_struct *vma; const struct exception_table_entry *fixup; + unsigned long flags = 0; unsigned long address; unsigned long page; int writeaccess; @@ -127,12 +128,17 @@ good_area: panic("Unhandled case %lu in do_page_fault!", ecr); } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (writeaccess) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, make * sure we exit gracefully rather than endlessly redo the * fault. */ - fault = handle_mm_fault(mm, vma, address, writeaccess ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/cris/mm/fault.c b/arch/cris/mm/fault.c index 9dcac8e..35d096a 100644 --- a/arch/cris/mm/fault.c +++ b/arch/cris/mm/fault.c @@ -55,6 +55,7 @@ do_page_fault(unsigned long address, struct pt_regs *regs, struct task_struct *tsk; struct mm_struct *mm; struct vm_area_struct * vma; + unsigned long flags = 0; siginfo_t info; int fault; @@ -156,13 +157,18 @@ do_page_fault(unsigned long address, struct pt_regs *regs, goto bad_area; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (writeaccess & 1) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, (writeaccess & 1) ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/frv/mm/fault.c b/arch/frv/mm/fault.c index a325d57..2dbf219 100644 --- a/arch/frv/mm/fault.c +++ b/arch/frv/mm/fault.c @@ -35,6 +35,7 @@ asmlinkage void do_page_fault(int datammu, unsigned long esr0, unsigned long ear struct vm_area_struct *vma; struct mm_struct *mm; unsigned long _pme, lrai, lrad, fixup; + unsigned long flags = 0; siginfo_t info; pgd_t *pge; pud_t *pue; @@ -158,12 +159,17 @@ asmlinkage void do_page_fault(int datammu, unsigned long esr0, unsigned long ear break; } + if (user_mode(__frame)) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, ear0, write ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, ear0, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/hexagon/mm/vm_fault.c b/arch/hexagon/mm/vm_fault.c index c10b76f..e56baf3 100644 --- a/arch/hexagon/mm/vm_fault.c +++ b/arch/hexagon/mm/vm_fault.c @@ -52,6 +52,7 @@ void do_page_fault(unsigned long address, long cause, struct pt_regs *regs) siginfo_t info; int si_code = SEGV_MAPERR; int fault; + unsigned long flags = 0; const struct exception_table_entry *fixup; /* @@ -96,7 +97,12 @@ good_area: break; } - fault = handle_mm_fault(mm, vma, address, (cause > 0)); + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (cause > 0) + flags |= FAULT_FLAG_WRITE; + + fault = handle_mm_fault(mm, vma, address, flags); /* The most common case -- we are done. */ if (likely(!(fault & VM_FAULT_ERROR))) { diff --git a/arch/ia64/mm/fault.c b/arch/ia64/mm/fault.c index 20b3593..ad9ef9d 100644 --- a/arch/ia64/mm/fault.c +++ b/arch/ia64/mm/fault.c @@ -79,6 +79,7 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re int signal = SIGSEGV, code = SEGV_MAPERR; struct vm_area_struct *vma, *prev_vma; struct mm_struct *mm = current->mm; + unsigned long flags = 0; struct siginfo si; unsigned long mask; int fault; @@ -149,12 +150,17 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re if ((vma->vm_flags & mask) != mask) goto bad_area; + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (mask & VM_WRITE) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, make * sure we exit gracefully rather than endlessly redo the * fault. */ - fault = handle_mm_fault(mm, vma, address, (mask & VM_WRITE) ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { /* * We ran out of memory, or some other thing happened diff --git a/arch/m32r/mm/fault.c b/arch/m32r/mm/fault.c index 2c9aeb4..e74f6fa 100644 --- a/arch/m32r/mm/fault.c +++ b/arch/m32r/mm/fault.c @@ -79,6 +79,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long error_code, struct mm_struct *mm; struct vm_area_struct * vma; unsigned long page, addr; + unsigned long flags = 0; int write; int fault; siginfo_t info; @@ -188,6 +189,11 @@ good_area: if ((error_code & ACE_INSTRUCTION) && !(vma->vm_flags & VM_EXEC)) goto bad_area; + if (error_code & ACE_USERMODE) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo @@ -195,7 +201,7 @@ good_area: */ addr = (address & PAGE_MASK); set_thread_fault_code(error_code); - fault = handle_mm_fault(mm, vma, addr, write ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, addr, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c index 2db6099..ab88a91 100644 --- a/arch/m68k/mm/fault.c +++ b/arch/m68k/mm/fault.c @@ -73,6 +73,7 @@ int do_page_fault(struct pt_regs *regs, unsigned long address, { struct mm_struct *mm = current->mm; struct vm_area_struct * vma; + unsigned long flags = 0; int write, fault; #ifdef DEBUG @@ -134,13 +135,18 @@ good_area: goto acc_err; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, write ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); #ifdef DEBUG printk("handle_mm_fault returns %d\n",fault); #endif diff --git a/arch/microblaze/mm/fault.c b/arch/microblaze/mm/fault.c index ae97d2c..b002612 100644 --- a/arch/microblaze/mm/fault.c +++ b/arch/microblaze/mm/fault.c @@ -89,6 +89,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long address, { struct vm_area_struct *vma; struct mm_struct *mm = current->mm; + unsigned long flags = 0; siginfo_t info; int code = SEGV_MAPERR; int is_write = error_code & ESR_S; @@ -206,12 +207,17 @@ good_area: goto bad_area; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (is_write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, is_write ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c index 937cf33..e5b9fed 100644 --- a/arch/mips/mm/fault.c +++ b/arch/mips/mm/fault.c @@ -40,6 +40,7 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs, unsigned long writ struct task_struct *tsk = current; struct mm_struct *mm = tsk->mm; const int field = sizeof(unsigned long) * 2; + unsigned long flags = 0; siginfo_t info; int fault; @@ -139,12 +140,17 @@ good_area: } } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, write ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) diff --git a/arch/mn10300/mm/fault.c b/arch/mn10300/mm/fault.c index 5ac4df5..031be56 100644 --- a/arch/mn10300/mm/fault.c +++ b/arch/mn10300/mm/fault.c @@ -121,6 +121,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long fault_code, { struct vm_area_struct *vma; struct task_struct *tsk; + unsigned long flags = 0; struct mm_struct *mm; unsigned long page; siginfo_t info; @@ -247,12 +248,17 @@ good_area: break; } + if ((fault_code & MMUFCR_xFC_ACCESS) == MMUFCR_xFC_ACCESS_USR) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, write ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c index d78881c..d586119 100644 --- a/arch/openrisc/mm/fault.c +++ b/arch/openrisc/mm/fault.c @@ -52,6 +52,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address, struct task_struct *tsk; struct mm_struct *mm; struct vm_area_struct *vma; + unsigned long flags = 0; siginfo_t info; int fault; @@ -153,13 +154,18 @@ good_area: if ((vector == 0x400) && !(vma->vm_page_prot.pgprot & _PAGE_EXEC)) goto bad_area; + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (write_acc) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. 
*/ - fault = handle_mm_fault(mm, vma, address, write_acc); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/parisc/mm/fault.c b/arch/parisc/mm/fault.c index 18162ce..a151e87 100644 --- a/arch/parisc/mm/fault.c +++ b/arch/parisc/mm/fault.c @@ -173,6 +173,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long code, struct vm_area_struct *vma, *prev_vma; struct task_struct *tsk = current; struct mm_struct *mm = tsk->mm; + unsigned long flags = 0; unsigned long acc_type; int fault; @@ -195,13 +196,18 @@ good_area: if ((vma->vm_flags & acc_type) != acc_type) goto bad_area; + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (acc_type & VM_WRITE) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, make * sure we exit gracefully rather than endlessly redo the * fault. */ - fault = handle_mm_fault(mm, vma, address, (acc_type & VM_WRITE) ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { /* * We hit a shared mapping outside of the file, or some diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c index 5efe8c9..2bf339c 100644 --- a/arch/powerpc/mm/fault.c +++ b/arch/powerpc/mm/fault.c @@ -122,6 +122,7 @@ int __kprobes do_page_fault(struct pt_regs *regs, unsigned long address, { struct vm_area_struct * vma; struct mm_struct *mm = current->mm; + unsigned long flags = 0; siginfo_t info; int code = SEGV_MAPERR; int is_write = 0, ret; @@ -305,12 +306,17 @@ good_area: goto bad_area; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (is_write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - ret = handle_mm_fault(mm, vma, address, is_write ? 
FAULT_FLAG_WRITE : 0); + ret = handle_mm_fault(mm, vma, address, flags); if (unlikely(ret & VM_FAULT_ERROR)) { if (ret & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c index a9a3018..fe6109c 100644 --- a/arch/s390/mm/fault.c +++ b/arch/s390/mm/fault.c @@ -301,6 +301,8 @@ static inline int do_exception(struct pt_regs *regs, int access, address = trans_exc_code & __FAIL_ADDR_MASK; perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); flags = FAULT_FLAG_ALLOW_RETRY; + if (regs->psw.mask & PSW_MASK_PSTATE) + flags |= FAULT_FLAG_USER; if (access == VM_WRITE || (trans_exc_code & store_indication) == 0x400) flags |= FAULT_FLAG_WRITE; down_read(&mm->mmap_sem); diff --git a/arch/score/mm/fault.c b/arch/score/mm/fault.c index 6b18fb0..2ca5ae5 100644 --- a/arch/score/mm/fault.c +++ b/arch/score/mm/fault.c @@ -47,6 +47,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long write, struct task_struct *tsk = current; struct mm_struct *mm = tsk->mm; const int field = sizeof(unsigned long) * 2; + unsigned long flags = 0; siginfo_t info; int fault; @@ -101,12 +102,16 @@ good_area: } survive: + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. 
*/ - fault = handle_mm_fault(mm, vma, address, write); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/sh/mm/fault_32.c b/arch/sh/mm/fault_32.c index 7bebd04..a61b803 100644 --- a/arch/sh/mm/fault_32.c +++ b/arch/sh/mm/fault_32.c @@ -126,6 +126,7 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs, struct task_struct *tsk; struct mm_struct *mm; struct vm_area_struct * vma; + unsigned long flags = 0; int si_code; int fault; siginfo_t info; @@ -195,12 +196,17 @@ good_area: goto bad_area; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (writeaccess) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, writeaccess ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/sh/mm/tlbflush_64.c b/arch/sh/mm/tlbflush_64.c index e3430e0..0a9d645 100644 --- a/arch/sh/mm/tlbflush_64.c +++ b/arch/sh/mm/tlbflush_64.c @@ -96,6 +96,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long writeaccess, struct mm_struct *mm; struct vm_area_struct * vma; const struct exception_table_entry *fixup; + unsigned long flags = 0; pte_t *pte; int fault; @@ -184,12 +185,17 @@ good_area: } } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (writeaccess) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, writeaccess ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/sparc/mm/fault_32.c b/arch/sparc/mm/fault_32.c index 8023fd7..efa3d48 100644 --- a/arch/sparc/mm/fault_32.c +++ b/arch/sparc/mm/fault_32.c @@ -222,6 +222,7 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write, struct vm_area_struct *vma; struct task_struct *tsk = current; struct mm_struct *mm = tsk->mm; + unsigned long flags = 0; unsigned int fixup; unsigned long g2; int from_user = !(regs->psr & PSR_PS); @@ -285,12 +286,17 @@ good_area: goto bad_area; } + if (from_user) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, write ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c index 504c062..bc536ea 100644 --- a/arch/sparc/mm/fault_64.c +++ b/arch/sparc/mm/fault_64.c @@ -276,6 +276,7 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs) { struct mm_struct *mm = current->mm; struct vm_area_struct *vma; + unsigned long flags = 0; unsigned int insn = 0; int si_code, fault_code, fault; unsigned long address, mm_rss; @@ -423,7 +424,12 @@ good_area: goto bad_area; } - fault = handle_mm_fault(mm, vma, address, (fault_code & FAULT_CODE_WRITE) ? 
FAULT_FLAG_WRITE : 0); + if (!(regs->tstate & TSTATE_PRIV)) + flags |= FAULT_FLAG_USER; + if (fault_code & FAULT_CODE_WRITE) + flags |= FAULT_FLAG_WRITE; + + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/tile/mm/fault.c b/arch/tile/mm/fault.c index 3312531..b2a7fd5 100644 --- a/arch/tile/mm/fault.c +++ b/arch/tile/mm/fault.c @@ -263,6 +263,7 @@ static int handle_page_fault(struct pt_regs *regs, struct mm_struct *mm; struct vm_area_struct *vma; unsigned long stack_offset; + unsigned long flags = 0; int fault; int si_code; int is_kernel_mode; @@ -415,12 +416,16 @@ good_area: } survive: + if (!is_kernel_mode) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, write); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c index dafc947..626a85e 100644 --- a/arch/um/kernel/trap.c +++ b/arch/um/kernel/trap.c @@ -25,6 +25,7 @@ int handle_page_fault(unsigned long address, unsigned long ip, { struct mm_struct *mm = current->mm; struct vm_area_struct *vma; + unsigned long flags = 0; pgd_t *pgd; pud_t *pud; pmd_t *pmd; @@ -62,10 +63,15 @@ good_area: if (!is_write && !(vma->vm_flags & (VM_READ | VM_EXEC))) goto out; + if (is_user) + flags |= FAULT_FLAG_USER; + if (is_write) + flags |= FAULT_FLAG_WRITE; + do { int fault; - fault = handle_mm_fault(mm, vma, address, is_write ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) { goto out_of_memory; diff --git a/arch/unicore32/mm/fault.c b/arch/unicore32/mm/fault.c index 283aa4b..3026943 100644 --- a/arch/unicore32/mm/fault.c +++ b/arch/unicore32/mm/fault.c @@ -169,9 +169,10 @@ static inline bool access_error(unsigned int fsr, struct vm_area_struct *vma) } static int __do_pf(struct mm_struct *mm, unsigned long addr, unsigned int fsr, - struct task_struct *tsk) + struct task_struct *tsk, struct pt_regs *regs) { struct vm_area_struct *vma; + unsigned long flags = 0; int fault; vma = find_vma(mm, addr); @@ -191,12 +192,16 @@ good_area: goto out; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (!(fsr ^ 0x12)) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, make * sure we exit gracefully rather than endlessly redo the fault. */ - fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, - (!(fsr ^ 0x12)) ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, flags); if (unlikely(fault & VM_FAULT_ERROR)) return fault; if (fault & VM_FAULT_MAJOR) @@ -252,7 +257,7 @@ static int do_pf(unsigned long addr, unsigned int fsr, struct pt_regs *regs) #endif } - fault = __do_pf(mm, addr, fsr, tsk); + fault = __do_pf(mm, addr, fsr, tsk, regs); up_read(&mm->mmap_sem); /* diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 5db0490..1cebabe 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -999,8 +999,7 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code) struct mm_struct *mm; int fault; int write = error_code & PF_WRITE; - unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE | - (write ? 
FAULT_FLAG_WRITE : 0); + unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE; tsk = current; mm = tsk->mm; @@ -1160,6 +1159,11 @@ good_area: return; } + if (error_code & PF_USER) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo diff --git a/arch/xtensa/mm/fault.c b/arch/xtensa/mm/fault.c index e367e30..7db9fbe 100644 --- a/arch/xtensa/mm/fault.c +++ b/arch/xtensa/mm/fault.c @@ -41,6 +41,7 @@ void do_page_fault(struct pt_regs *regs) struct mm_struct *mm = current->mm; unsigned int exccause = regs->exccause; unsigned int address = regs->excvaddr; + unsigned long flags = 0; siginfo_t info; int is_write, is_exec; @@ -101,11 +102,16 @@ good_area: if (!(vma->vm_flags & (VM_READ | VM_WRITE))) goto bad_area; + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (is_write) + flags |= FAULT_FLAG_WRITE; + /* If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, is_write ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/include/linux/mm.h b/include/linux/mm.h index 4baadd1..846b82b 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -156,6 +156,7 @@ extern pgprot_t protection_map[16]; #define FAULT_FLAG_ALLOW_RETRY 0x08 /* Retry fault if blocking */ #define FAULT_FLAG_RETRY_NOWAIT 0x10 /* Don't drop mmap_sem and wait when retrying */ #define FAULT_FLAG_KILLABLE 0x20 /* The fault task is in SIGKILL killable region */ +#define FAULT_FLAG_USER 0x40 /* The fault originated in userspace */ /* * This interface is used by x86 PAT code to identify a pfn mapping that is -- 1.8.3.2 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx142.postini.com [74.125.245.142]) by kanga.kvack.org (Postfix) with SMTP id C30FA6B0034 for ; Fri, 19 Jul 2013 00:25:52 -0400 (EDT) Date: Fri, 19 Jul 2013 00:25:47 -0400 From: Johannes Weiner Subject: [patch 4/5] memcg: do not trap chargers with full callstack on OOM Message-ID: <20130719042547.GG17812@cmpxchg.org> References: <20130710182506.F25DF461@pobox.sk> <20130711072507.GA21667@dhcp22.suse.cz> <20130714012641.C2DA4E05@pobox.sk> <20130714015112.FFCB7AF7@pobox.sk> <20130715154119.GA32435@dhcp22.suse.cz> <20130715160006.GB32435@dhcp22.suse.cz> <20130716153544.GX17812@cmpxchg.org> <20130716160905.GA20018@dhcp22.suse.cz> <20130716164830.GZ17812@cmpxchg.org> <20130719042124.GC17812@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130719042124.GC17812@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , righi.andrea@gmail.com The memcg OOM handling is incredibly fragile and can deadlock. When a task fails to charge memory, it invokes the OOM killer and loops right there in the charge code until it succeeds. Comparably, any other task that enters the charge path at this point will go to a waitqueue right then and there and sleep until the OOM situation is resolved. The problem is that these tasks may hold filesystem locks and the mmap_sem; locks that the selected OOM victim may need to exit. For example, in one reported case, the task invoking the OOM killer was about to charge a page cache page during a write(), which holds the i_mutex. 
The OOM killer selected a task that was just entering truncate() and
trying to acquire the i_mutex:

OOM invoking task:
[] mem_cgroup_handle_oom+0x241/0x3b0
[] T.1146+0x5ab/0x5c0
[] mem_cgroup_cache_charge+0xbe/0xe0
[] add_to_page_cache_locked+0x4c/0x140
[] add_to_page_cache_lru+0x22/0x50
[] grab_cache_page_write_begin+0x8b/0xe0
[] ext3_write_begin+0x88/0x270
[] generic_file_buffered_write+0x116/0x290
[] __generic_file_aio_write+0x27c/0x480
[] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex
[] do_sync_write+0xea/0x130
[] vfs_write+0xf3/0x1f0
[] sys_write+0x51/0x90
[] system_call_fastpath+0x18/0x1d
[] 0xffffffffffffffff

OOM kill victim:
[] do_truncate+0x58/0xa0 # takes i_mutex
[] do_last+0x250/0xa30
[] path_openat+0xd7/0x440
[] do_filp_open+0x49/0xa0
[] do_sys_open+0x106/0x240
[] sys_open+0x20/0x30
[] system_call_fastpath+0x18/0x1d
[] 0xffffffffffffffff

The OOM handling task will retry the charge indefinitely while the OOM
killed task is not releasing any resources.

A similar scenario can happen when the kernel OOM killer for a memcg
is disabled and a userspace task is in charge of resolving OOM
situations. In this case, ALL tasks that enter the OOM path will be
made to sleep on the OOM waitqueue and wait for userspace to free
resources or increase the group's limit. But a userspace OOM handler
is prone to deadlock itself on the locks held by the waiting tasks.
For example, one of the sleeping tasks may be stuck in a brk() call
with the mmap_sem held for writing but the userspace handler, in order
to pick an optimal victim, may need to read files from /proc/, which
tries to acquire the same mmap_sem for reading and deadlocks.

This patch changes the way tasks behave after detecting a memcg OOM
and makes sure nobody loops or sleeps with locks held:

0. When OOMing in a system call (buffered IO and friends), do not
   invoke the OOM killer, do not sleep on an OOM waitqueue, just
   return -ENOMEM.
   Userspace should be able to handle this and it prevents anybody
   from looping or waiting with locks held.

1. When OOMing in a kernel fault, do not invoke the OOM killer, do not
   sleep on the OOM waitqueue, just return -ENOMEM. The kernel fault
   stack knows how to handle this. If a kernel fault is nested inside
   a user fault, however, user fault handling applies:

2. When OOMing in a user fault, invoke the OOM killer and restart the
   fault instead of looping on the charge attempt. This way, the OOM
   victim can not get stuck on locks the looping task may hold.

3. When OOMing in a user fault but somebody else is handling it
   (either the kernel OOM killer or a userspace handler), don't go to
   sleep in the charge context. Instead, remember the OOMing memcg in
   the task struct and then fully unwind the page fault stack with
   -ENOMEM. pagefault_out_of_memory() will then call back into the
   memcg code to check if the -ENOMEM came from the memcg, and then
   either put the task to sleep on the memcg's OOM waitqueue or just
   restart the fault. The OOM victim can no longer get stuck on any
   lock a sleeping task may hold.

While reworking the OOM routine, also remove a needless OOM waitqueue
wakeup when invoking the killer. In addition to the wakeup implied in
the kill signal delivery, only uncharges and limit increases, things
that actually change the memory situation, should poke the waitqueue.
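
For readers following along, the four cases above can be condensed into
a small decision function. This is only a userspace sketch of the
policy as described in this changelog, not kernel code: the enum names
and memcg_oom_action() are made up for illustration, and the real patch
spreads this logic over mem_cgroup_oom(), handle_mm_fault() and
mem_cgroup_oom_synchronize().

```c
#include <assert.h>
#include <stdbool.h>

/* Where the failed charge happened (hypothetical model names). */
enum charge_ctx {
	CTX_SYSCALL,		/* case 0: e.g. buffered write() */
	CTX_KERNEL_FAULT,	/* case 1: kernel fault, not nested in a user fault */
	CTX_USER_FAULT,		/* cases 2 and 3 */
};

/* What the charging task does about the memcg OOM. */
enum oom_action {
	ACT_RETURN_ENOMEM,		/* no kill, no sleep */
	ACT_KILL_AND_RESTART_FAULT,	/* invoke the OOM killer, restart the fault */
	ACT_UNWIND_THEN_WAIT,		/* unwind the fault stack, then wait */
};

/*
 * handled_elsewhere: somebody else (kernel OOM killer or a userspace
 * handler) is already dealing with this memcg's OOM.
 */
static enum oom_action memcg_oom_action(enum charge_ctx ctx,
					bool handled_elsewhere)
{
	switch (ctx) {
	case CTX_SYSCALL:
	case CTX_KERNEL_FAULT:
		return ACT_RETURN_ENOMEM;
	case CTX_USER_FAULT:
		return handled_elsewhere ? ACT_UNWIND_THEN_WAIT
					 : ACT_KILL_AND_RESTART_FAULT;
	}
	assert(!"unknown charge context");
	return ACT_RETURN_ENOMEM;
}
```

The point of the model is that no branch sleeps or loops in the charge
context itself; waiting, where it happens at all, is deferred until the
fault stack has been unwound and its locks released.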
Reported-by: azurIt
Debugged-by: Michal Hocko
Signed-off-by: Johannes Weiner
---
 include/linux/memcontrol.h | 22 +++++++
 include/linux/sched.h | 6 ++
 mm/filemap.c | 14 ++++-
 mm/ksm.c | 2 +-
 mm/memcontrol.c | 139 +++++++++++++++++++++++++++++----------------
 mm/memory.c | 37 ++++++++----
 mm/oom_kill.c | 2 +
 7 files changed, 159 insertions(+), 63 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index b87068a..b92e5e7 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -120,6 +120,17 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page); extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p); +static inline unsigned int mem_cgroup_xchg_may_oom(struct task_struct *p, + unsigned int new) +{ + unsigned int old; + + old = p->memcg_oom.may_oom; + p->memcg_oom.may_oom = new; + + return old; +} +bool mem_cgroup_oom_synchronize(void); #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP extern int do_swap_account; #endif @@ -333,6 +344,17 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p) { } +static inline unsigned int mem_cgroup_xchg_may_oom(struct task_struct *p, + unsigned int new) +{ + return 0; +} + +static inline bool mem_cgroup_oom_synchronize(void) +{ + return false; +} + static inline void mem_cgroup_inc_page_stat(struct page *page, enum mem_cgroup_page_stat_item idx) { diff --git a/include/linux/sched.h b/include/linux/sched.h index 1c4f3e9..7e6c9e9 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1568,6 +1568,12 @@ struct task_struct { unsigned long nr_pages; /* uncharged usage */ unsigned long memsw_nr_pages; /* uncharged mem+swap usage */ } memcg_batch; + struct memcg_oom_info { + unsigned int may_oom:1; + unsigned int in_memcg_oom:1; + int wakeups; + struct mem_cgroup *wait_on_memcg; + } memcg_oom; #endif #ifdef CONFIG_HAVE_HW_BREAKPOINT atomic_t ptrace_bp_refcnt; diff --git a/mm/filemap.c b/mm/filemap.c index
5f0a3c9..d18bd47 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1660,6 +1660,7 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf) struct file_ra_state *ra = &file->f_ra; struct inode *inode = mapping->host; pgoff_t offset = vmf->pgoff; + unsigned int may_oom; struct page *page; pgoff_t size; int ret = 0; @@ -1669,7 +1670,12 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf) return VM_FAULT_SIGBUS; /* - * Do we have something in the page cache already? + * Do we have something in the page cache already? Either + * way, try readahead, but disable the memcg OOM killer for it + * as readahead is optional and no errors are propagated up + * the fault stack, which does not allow proper unwinding of a + * memcg OOM state. The OOM killer is enabled while trying to + * instantiate the faulting page individually below. */ page = find_get_page(mapping, offset); if (likely(page)) { @@ -1677,10 +1683,14 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf) * We found the page, so try async readahead before * waiting for the lock. */ + may_oom = mem_cgroup_xchg_may_oom(current, 0); do_async_mmap_readahead(vma, ra, file, page, offset); + mem_cgroup_xchg_may_oom(current, may_oom); } else { - /* No page in the page cache at all */ + /* No page in the page cache at all. 
*/ + may_oom = mem_cgroup_xchg_may_oom(current, 0); do_sync_mmap_readahead(vma, ra, file, offset); + mem_cgroup_xchg_may_oom(current, may_oom); count_vm_event(PGMAJFAULT); mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT); ret = VM_FAULT_MAJOR; diff --git a/mm/ksm.c b/mm/ksm.c index 310544a..ae7e4ae 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -338,7 +338,7 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr) break; if (PageKsm(page)) ret = handle_mm_fault(vma->vm_mm, vma, addr, - FAULT_FLAG_WRITE); + FAULT_FLAG_WRITE); else ret = VM_FAULT_WRITE; put_page(page); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index b63f5f7..99b0101 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -249,6 +249,7 @@ struct mem_cgroup { bool oom_lock; atomic_t under_oom; + atomic_t oom_wakeups; atomic_t refcnt; @@ -1846,6 +1847,7 @@ static int memcg_oom_wake_function(wait_queue_t *wait, static void memcg_wakeup_oom(struct mem_cgroup *memcg) { + atomic_inc(&memcg->oom_wakeups); /* for filtering, pass "memcg" as argument. */ __wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg); } @@ -1857,30 +1859,20 @@ static void memcg_oom_recover(struct mem_cgroup *memcg) } /* - * try to call OOM killer. returns false if we should exit memory-reclaim loop. 
+ * try to call OOM killer */ -bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask) +static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask) { - struct oom_wait_info owait; - bool locked, need_to_kill; + bool locked, need_to_kill = true; - owait.mem = memcg; - owait.wait.flags = 0; - owait.wait.func = memcg_oom_wake_function; - owait.wait.private = current; - INIT_LIST_HEAD(&owait.wait.task_list); - need_to_kill = true; - mem_cgroup_mark_under_oom(memcg); + if (!current->memcg_oom.may_oom) + return; + + current->memcg_oom.in_memcg_oom = 1; /* At first, try to OOM lock hierarchy under memcg.*/ spin_lock(&memcg_oom_lock); locked = mem_cgroup_oom_lock(memcg); - /* - * Even if signal_pending(), we can't quit charge() loop without - * accounting. So, UNINTERRUPTIBLE is appropriate. But SIGKILL - * under OOM is always welcomed, use TASK_KILLABLE here. - */ - prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE); if (!locked || memcg->oom_kill_disable) need_to_kill = false; if (locked) @@ -1888,24 +1880,86 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask) spin_unlock(&memcg_oom_lock); if (need_to_kill) { - finish_wait(&memcg_oom_waitq, &owait.wait); mem_cgroup_out_of_memory(memcg, mask); } else { - schedule(); - finish_wait(&memcg_oom_waitq, &owait.wait); + /* + * A system call can just return -ENOMEM, but if this + * is a page fault and somebody else is handling the + * OOM already, we need to sleep on the OOM waitqueue + * for this memcg until the situation is resolved. + * Which can take some time because it might be + * handled by a userspace task. + * + * However, this is the charge context, which means + * that we may sit on a large call stack and hold + * various filesystem locks, the mmap_sem etc. and we + * don't want the OOM handler to deadlock on them + * while we sit here and wait. Store the current OOM + * context in the task_struct, then return -ENOMEM. 
+ * At the end of the page fault handler, with the + * stack unwound, pagefault_out_of_memory() will check + * back with us by calling + * mem_cgroup_oom_synchronize(), possibly putting the + * task to sleep. + */ + mem_cgroup_mark_under_oom(memcg); + current->memcg_oom.wakeups = atomic_read(&memcg->oom_wakeups); + css_get(&memcg->css); + current->memcg_oom.wait_on_memcg = memcg; } - spin_lock(&memcg_oom_lock); - if (locked) + + if (locked) { + spin_lock(&memcg_oom_lock); mem_cgroup_oom_unlock(memcg); - memcg_wakeup_oom(memcg); - spin_unlock(&memcg_oom_lock); + /* + * Sleeping tasks might have been killed, make sure + * they get scheduled so they can exit. + */ + if (need_to_kill) + memcg_oom_recover(memcg); + spin_unlock(&memcg_oom_lock); + } +} - mem_cgroup_unmark_under_oom(memcg); +bool mem_cgroup_oom_synchronize(void) +{ + struct oom_wait_info owait; + struct mem_cgroup *memcg; - if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current)) + /* OOM is global, do not handle */ + if (!current->memcg_oom.in_memcg_oom) return false; - /* Give chance to dying process */ - schedule_timeout_uninterruptible(1); + + /* + * We invoked the OOM killer but there is a chance that a kill + * did not free up any charges. Everybody else might already + * be sleeping, so restart the fault and keep the rampage + * going until some charges are released. 
+ */ + memcg = current->memcg_oom.wait_on_memcg; + if (!memcg) + goto out; + + if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current)) + goto out_put; + + owait.mem = memcg; + owait.wait.flags = 0; + owait.wait.func = memcg_oom_wake_function; + owait.wait.private = current; + INIT_LIST_HEAD(&owait.wait.task_list); + + prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE); + /* Only sleep if we didn't miss any wakeups since OOM */ + if (atomic_read(&memcg->oom_wakeups) == current->memcg_oom.wakeups) + schedule(); + finish_wait(&memcg_oom_waitq, &owait.wait); +out_put: + mem_cgroup_unmark_under_oom(memcg); + css_put(&memcg->css); + current->memcg_oom.wait_on_memcg = NULL; +out: + current->memcg_oom.in_memcg_oom = 0; return true; } @@ -2195,11 +2249,10 @@ enum { CHARGE_RETRY, /* need to retry but retry is not bad */ CHARGE_NOMEM, /* we can't do more. return -ENOMEM */ CHARGE_WOULDBLOCK, /* GFP_WAIT wasn't set and no enough res. */ - CHARGE_OOM_DIE, /* the current is killed because of OOM */ }; static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, - unsigned int nr_pages, bool oom_check) + unsigned int nr_pages, bool invoke_oom) { unsigned long csize = nr_pages * PAGE_SIZE; struct mem_cgroup *mem_over_limit; @@ -2257,14 +2310,10 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, if (mem_cgroup_wait_acct_move(mem_over_limit)) return CHARGE_RETRY; - /* If we don't need to call oom-killer at el, return immediately */ - if (!oom_check) - return CHARGE_NOMEM; - /* check OOM */ - if (!mem_cgroup_handle_oom(mem_over_limit, gfp_mask)) - return CHARGE_OOM_DIE; + if (invoke_oom) + mem_cgroup_oom(mem_over_limit, gfp_mask); - return CHARGE_RETRY; + return CHARGE_NOMEM; } /* @@ -2349,7 +2398,7 @@ again: } do { - bool oom_check; + bool invoke_oom = oom && !nr_oom_retries; /* If killed, bypass charge */ if (fatal_signal_pending(current)) { @@ -2357,13 +2406,7 @@ again: goto bypass; } - oom_check = false; - if (oom && 
!nr_oom_retries) { - oom_check = true; - nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES; - } - - ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, oom_check); + ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, invoke_oom); switch (ret) { case CHARGE_OK: break; @@ -2376,16 +2419,12 @@ again: css_put(&memcg->css); goto nomem; case CHARGE_NOMEM: /* OOM routine works */ - if (!oom) { + if (!oom || invoke_oom) { css_put(&memcg->css); goto nomem; } - /* If oom, we never return -ENOMEM */ nr_oom_retries--; break; - case CHARGE_OOM_DIE: /* Killed by OOM Killer */ - css_put(&memcg->css); - goto bypass; } } while (ret != CHARGE_OK); diff --git a/mm/memory.c b/mm/memory.c index 829d437..2be02b7 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3439,22 +3439,14 @@ unlock: /* * By the time we get here, we already hold the mm semaphore */ -int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, - unsigned long address, unsigned int flags) +static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, unsigned int flags) { pgd_t *pgd; pud_t *pud; pmd_t *pmd; pte_t *pte; - __set_current_state(TASK_RUNNING); - - count_vm_event(PGFAULT); - mem_cgroup_count_vm_event(mm, PGFAULT); - - /* do counter updates before entering really critical section. */ - check_sync_rss_stat(current); - if (unlikely(is_vm_hugetlb_page(vma))) return hugetlb_fault(mm, vma, address, flags); @@ -3503,6 +3495,31 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, return handle_pte_fault(mm, vma, address, pte, pmd, flags); } +int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, unsigned int flags) +{ + int userfault = flags & FAULT_FLAG_USER; + int ret; + + __set_current_state(TASK_RUNNING); + + count_vm_event(PGFAULT); + mem_cgroup_count_vm_event(mm, PGFAULT); + + /* do counter updates before entering really critical section. 
*/ + check_sync_rss_stat(current); + + if (userfault) + WARN_ON(mem_cgroup_xchg_may_oom(current, 1) == 1); + + ret = __handle_mm_fault(mm, vma, address, flags); + + if (userfault) + WARN_ON(mem_cgroup_xchg_may_oom(current, 0) == 0); + + return ret; +} + #ifndef __PAGETABLE_PUD_FOLDED /* * Allocate page upper directory. diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 069b64e..aa60863 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -785,6 +785,8 @@ out: */ void pagefault_out_of_memory(void) { + if (mem_cgroup_oom_synchronize()) + return; if (try_set_system_oom()) { out_of_memory(NULL, 0, 0, NULL); clear_system_oom(); -- 1.8.3.2 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx205.postini.com [74.125.245.205]) by kanga.kvack.org (Postfix) with SMTP id 04B846B0034 for ; Fri, 19 Jul 2013 00:26:27 -0400 (EDT) Date: Fri, 19 Jul 2013 00:26:23 -0400 From: Johannes Weiner Subject: [patch 5/5] mm: memcontrol: sanity check memcg OOM context unwind Message-ID: <20130719042623.GH17812@cmpxchg.org> References: <20130710182506.F25DF461@pobox.sk> <20130711072507.GA21667@dhcp22.suse.cz> <20130714012641.C2DA4E05@pobox.sk> <20130714015112.FFCB7AF7@pobox.sk> <20130715154119.GA32435@dhcp22.suse.cz> <20130715160006.GB32435@dhcp22.suse.cz> <20130716153544.GX17812@cmpxchg.org> <20130716160905.GA20018@dhcp22.suse.cz> <20130716164830.GZ17812@cmpxchg.org> <20130719042124.GC17812@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130719042124.GC17812@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , righi.andrea@gmail.com Catch the cases where a memcg OOM context is set up in
the failed charge path but the fault handler is not actually returning
VM_FAULT_ERROR, which would be required to properly finalize the OOM.

Example output: the first trace shows the stack at the end of
handle_mm_fault() where an unexpected memcg OOM context is detected.
The subsequent trace is of whoever set up that OOM context. In this
case it was the charging of readahead pages in a file fault, which
does not propagate VM_FAULT_OOM on failure and should disable OOM:

[ 27.805359] WARNING: at /home/hannes/src/linux/linux/mm/memory.c:3523 handle_mm_fault+0x1fb/0x3f0()
[ 27.805360] Hardware name: PowerEdge 1950
[ 27.805361] Fixing unhandled memcg OOM context, set up from:
[ 27.805362] Pid: 1599, comm: file Tainted: G W 3.2.0-00005-g6d10010 #97
[ 27.805363] Call Trace:
[ 27.805365] [] warn_slowpath_common+0x6a/0xa0
[ 27.805367] [] warn_slowpath_fmt+0x41/0x50
[ 27.805369] [] handle_mm_fault+0x1fb/0x3f0
[ 27.805371] [] do_page_fault+0x140/0x4a0
[ 27.805373] [] ? do_mmap_pgoff+0x34b/0x360
[ 27.805376] [] page_fault+0x1f/0x30
[ 27.805377] ---[ end trace 305ec584fba81649 ]---
[ 27.805378] [] __mem_cgroup_try_charge+0x5c8/0x7e0
[ 27.805380] [] mem_cgroup_cache_charge+0xac/0x110
[ 27.805381] [] add_to_page_cache_locked+0x3e/0x120
[ 27.805383] [] add_to_page_cache_lru+0x15/0x40
[ 27.805385] [] mpage_readpages+0xc3/0x150
[ 27.805387] [] ext4_readpages+0x18/0x20
[ 27.805388] [] __do_page_cache_readahead+0x1c1/0x270
[ 27.805390] [] ra_submit+0x1c/0x20
[ 27.805392] [] filemap_fault+0x3f4/0x450
[ 27.805394] [] __do_fault+0x6d/0x510
[ 27.805395] [] handle_pte_fault+0x8a/0x920
[ 27.805397] [] handle_mm_fault+0x19c/0x3f0
[ 27.805398] [] do_page_fault+0x140/0x4a0
[ 27.805400] [] page_fault+0x1f/0x30
[ 27.805401] [] 0xffffffffffffffff

Debug patch only.
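
The check this debug patch adds can be modelled in a few lines of
userspace C. This is a sketch under assumptions, not the kernel code:
MODEL_VM_FAULT_OOM, struct model_oom_ctx and model_fixup_dangling_oom()
are hypothetical stand-ins. The real patch also prints the saved stack
trace and calls mem_cgroup_oom_synchronize(), where this model merely
clears a flag.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's VM_FAULT_OOM return bit. */
#define MODEL_VM_FAULT_OOM 0x0001u

/* Hypothetical stand-in for the per-task memcg OOM state. */
struct model_oom_ctx {
	bool in_memcg_oom;	/* set by the failed charge path */
};

/*
 * Called at the end of the modelled fault handler. Returns true when
 * the fault did NOT report OOM although an OOM context is still set
 * up, i.e. the context would otherwise be left dangling. In that case
 * the context is cleared so that later faults start from a clean state.
 */
static bool model_fixup_dangling_oom(struct model_oom_ctx *ctx,
				     unsigned int fault_ret)
{
	if ((fault_ret & MODEL_VM_FAULT_OOM) || !ctx->in_memcg_oom)
		return false;

	ctx->in_memcg_oom = false;
	return true;
}
```

A fault that does return the OOM bit is left alone here because, per
the previous patch, pagefault_out_of_memory() finalizes that context.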
Not-signed-off-by: Johannes Weiner
---
 include/linux/sched.h | 3 +++
 mm/memcontrol.c | 7 +++++++
 mm/memory.c | 9 +++++++++
 3 files changed, 19 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h index 7e6c9e9..a77d198 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -91,6 +91,7 @@ struct sched_param { #include #include #include +#include #include @@ -1571,6 +1572,8 @@ struct task_struct { struct memcg_oom_info { unsigned int may_oom:1; unsigned int in_memcg_oom:1; + struct stack_trace trace; + unsigned long trace_entries[16]; int wakeups; struct mem_cgroup *wait_on_memcg; } memcg_oom; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 99b0101..c47c77e 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -49,6 +49,7 @@ #include #include #include +#include #include "internal.h" #include @@ -1870,6 +1871,12 @@ static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask) current->memcg_oom.in_memcg_oom = 1; + current->memcg_oom.trace.nr_entries = 0; + current->memcg_oom.trace.max_entries = 16; + current->memcg_oom.trace.entries = current->memcg_oom.trace_entries; + current->memcg_oom.trace.skip = 1; + save_stack_trace(&current->memcg_oom.trace); + /* At first, try to OOM lock hierarchy under memcg.*/ spin_lock(&memcg_oom_lock); locked = mem_cgroup_oom_lock(memcg); diff --git a/mm/memory.c b/mm/memory.c index 2be02b7..fc6d741 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -57,6 +57,7 @@ #include #include #include +#include #include #include @@ -3517,6 +3518,14 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, if (userfault) WARN_ON(mem_cgroup_xchg_may_oom(current, 0) == 0); +#ifdef CONFIG_CGROUP_MEM_RES_CTLR + if (WARN(!(ret & VM_FAULT_OOM) && current->memcg_oom.in_memcg_oom, + "Fixing unhandled memcg OOM context, set up from:\n")) { + print_stack_trace(&current->memcg_oom.trace, 0); + mem_cgroup_oom_synchronize(); + } +#endif + return ret; } -- 1.8.3.2 -- To unsubscribe, send a message with 'unsubscribe linux-mm'
in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx196.postini.com [74.125.245.196]) by kanga.kvack.org (Postfix) with SMTP id 3EA6E6B006C for ; Fri, 19 Jul 2013 04:23:41 -0400 (EDT) Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=5D_memcg=3A_do_not_trap_chargers_with_full_callstack_on_OOM?= Date: Fri, 19 Jul 2013 10:23:39 +0200 From: "azurIt" References: <20130709135450.GI20281@dhcp22.suse.cz>, <20130710182506.F25DF461@pobox.sk>, <20130711072507.GA21667@dhcp22.suse.cz>, <20130714012641.C2DA4E05@pobox.sk>, <20130714015112.FFCB7AF7@pobox.sk>, <20130715154119.GA32435@dhcp22.suse.cz>, <20130715160006.GB32435@dhcp22.suse.cz>, <20130716153544.GX17812@cmpxchg.org>, <20130716160905.GA20018@dhcp22.suse.cz>, <20130716164830.GZ17812@cmpxchg.org> <20130719042124.GC17812@cmpxchg.org> In-Reply-To: <20130719042124.GC17812@cmpxchg.org> MIME-Version: 1.0 Message-Id: <20130719102339.34DF73E5@pobox.sk> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org List-ID: To: =?utf-8?q?Johannes_Weiner?= , =?utf-8?q?Michal_Hocko?= Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , righi.andrea@gmail.com > CC: linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , righi.andrea@gmail.com >On Tue, Jul 16, 2013 at 12:48:30PM -0400, Johannes Weiner wrote: >> On Tue, Jul 16, 2013 at 06:09:05PM +0200, Michal Hocko wrote: >> > On Tue 16-07-13 11:35:44, Johannes Weiner wrote: >> > > On Mon, Jul 15, 2013 at 06:00:06PM +0200, Michal Hocko wrote: >> > > > On Mon 15-07-13 17:41:19, Michal Hocko wrote: >> > > > > On Sun 14-07-13 01:51:12, azurIt wrote: >> > > > > > > CC: "Johannes Weiner" , linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , 
righi.andrea@gmail.com >> > > > > > >> CC: "Johannes Weiner" , linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , righi.andrea@gmail.com >> > > > > > >>On Wed 10-07-13 18:25:06, azurIt wrote: >> > > > > > >>> >> Now i realized that i forgot to remove UID from that cgroup before >> > > > > > >>> >> trying to remove it, so cgroup cannot be removed anyway (we are using >> > > > > > >>> >> third party cgroup called cgroup-uid from Andrea Righi, which is able >> > > > > > >>> >> to associate all user's processes with target cgroup). Look here for >> > > > > > >>> >> cgroup-uid patch: >> > > > > > >>> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch >> > > > > > >>> >> >> > > > > > >>> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was >> > > > > > >>> >> permanently '1'. >> > > > > > >>> > >> > > > > > >>> >This is really strange. Could you post the whole diff against stable >> > > > > > >>> >tree you are using (except for grsecurity stuff and the above cgroup-uid >> > > > > > >>> >patch)? >> > > > > > >>> >> > > > > > >>> >> > > > > > >>> Here are all patches which i applied to kernel 3.2.48 in my last test: >> > > > > > >>> http://watchdog.sk/lkml/patches3/ >> > > > > > >> >> > > > > > >>The two patches from Johannes seem correct. >> > > > > > >> >> > > > > > >>From a quick look even grsecurity patchset shouldn't interfere as it >> > > > > > >>doesn't seem to put any code between handle_mm_fault and mm_fault_error >> > > > > > >>and there also doesn't seem to be any new handle_mm_fault call sites. >> > > > > > >> >> > > > > > >>But I cannot tell there aren't other code paths which would lead to a >> > > > > > >>memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling. >> > > > > > > >> > > > > > > >> > > > > > >Michal, >> > > > > > > >> > > > > > >now i can definitely confirm that problem with unremovable cgroups >> > > > > > >persists. 
What info do you need from me? I applied also your little
>> > > > > > >'WARN_ON' patch.
>> > > > > >
>> > > > > > Ok, i think you want this:
>> > > > > > http://watchdog.sk/lkml/kern4.log
>> > > > >
>> > > > > Jul 14 01:11:39 server01 kernel: [ 593.589087] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
>> > > > > Jul 14 01:11:39 server01 kernel: [ 593.589451] [12021] 1333 12021 172027 64723 4 0 0 apache2
>> > > > > Jul 14 01:11:39 server01 kernel: [ 593.589647] [12030] 1333 12030 172030 64748 2 0 0 apache2
>> > > > > Jul 14 01:11:39 server01 kernel: [ 593.589836] [12031] 1333 12031 172030 64749 3 0 0 apache2
>> > > > > Jul 14 01:11:39 server01 kernel: [ 593.590025] [12032] 1333 12032 170619 63428 3 0 0 apache2
>> > > > > Jul 14 01:11:39 server01 kernel: [ 593.590213] [12033] 1333 12033 167934 60524 2 0 0 apache2
>> > > > > Jul 14 01:11:39 server01 kernel: [ 593.590401] [12034] 1333 12034 170747 63496 4 0 0 apache2
>> > > > > Jul 14 01:11:39 server01 kernel: [ 593.590588] [12035] 1333 12035 169659 62451 1 0 0 apache2
>> > > > > Jul 14 01:11:39 server01 kernel: [ 593.590776] [12036] 1333 12036 167614 60384 3 0 0 apache2
>> > > > > Jul 14 01:11:39 server01 kernel: [ 593.590984] [12037] 1333 12037 166342 58964 3 0 0 apache2
>> > > > > Jul 14 01:11:39 server01 kernel: [ 593.591178] Memory cgroup out of memory: Kill process 12021 (apache2) score 847 or sacrifice child
>> > > > > Jul 14 01:11:39 server01 kernel: [ 593.591370] Killed process 12021 (apache2) total-vm:688108kB, anon-rss:255472kB, file-rss:3420kB
>> > > > > Jul 14 01:11:41 server01 kernel: [ 595.392920] ------------[ cut here ]------------
>> > > > > Jul 14 01:11:41 server01 kernel: [ 595.393096] WARNING: at kernel/exit.c:888 do_exit+0x7d0/0x870()
>> > > > > Jul 14 01:11:41 server01 kernel: [ 595.393256] Hardware name: S5000VSA
>> > > > > Jul 14 01:11:41 server01 kernel: [ 595.393415] Pid: 12037, comm: apache2 Not tainted 3.2.48-grsec #1
>> > > > > Jul 14 01:11:41 server01 kernel: [ 595.393577] Call Trace:
>> > > > > Jul 14 01:11:41 server01 kernel: [ 595.393737] [] warn_slowpath_common+0x7a/0xb0
>> > > > > Jul 14 01:11:41 server01 kernel: [ 595.393903] [] warn_slowpath_null+0x1a/0x20
>> > > > > Jul 14 01:11:41 server01 kernel: [ 595.394068] [] do_exit+0x7d0/0x870
>> > > > > Jul 14 01:11:41 server01 kernel: [ 595.394231] [] ? thread_group_times+0x44/0xb0
>> > > > > Jul 14 01:11:41 server01 kernel: [ 595.394392] [] do_group_exit+0x51/0xc0
>> > > > > Jul 14 01:11:41 server01 kernel: [ 595.394551] [] sys_exit_group+0x17/0x20
>> > > > > Jul 14 01:11:41 server01 kernel: [ 595.394714] [] system_call_fastpath+0x18/0x1d
>> > > > > Jul 14 01:11:41 server01 kernel: [ 595.394921] ---[ end trace 738570e688acf099 ]---
>> > > > >
>> > > > > OK, so you had an OOM which has been handled by in-kernel oom handler
>> > > > > (it killed 12021) and 12037 was in the same group. The warning tells us
>> > > > > that it went through mem_cgroup_oom as well (otherwise it wouldn't have
>> > > > > memcg_oom.wait_on_memcg set and the warning wouldn't trigger) and then
>> > > > > it exited on the userspace request (by exit syscall).
>> > > > >
>> > > > > I do not see any way how, this could happen though. If mem_cgroup_oom
>> > > > > is called then we always return CHARGE_NOMEM which turns into ENOMEM
>> > > > > returned by __mem_cgroup_try_charge (invoke_oom must have been set to
>> > > > > true). So if nobody screwed the return value on the way up to page
>> > > > > fault handler then there is no way to escape.
>> > > > >
>> > > > > I will check the code.
>> > > > OK, I guess I found it:
>> > > > __do_fault
>> > > >   fault = filemap_fault
>> > > >     do_async_mmap_readahead
>> > > >       page_cache_async_readahead
>> > > >         ondemand_readahead
>> > > >           __do_page_cache_readahead
>> > > >             read_pages
>> > > >               readpages = ext3_readpages
>> > > >                 mpage_readpages # Doesn't propagate ENOMEM
>> > > >                   add_to_page_cache_lru
>> > > >                     add_to_page_cache
>> > > >                       add_to_page_cache_locked
>> > > >                         mem_cgroup_cache_charge
>> > > >
>> > > > So the read ahead most probably. Again! Duhhh. I will try to think
>> > > > about a fix for this. One obvious place is mpage_readpages but
>> > > > __do_page_cache_readahead ignores read_pages return value as well and
>> > > > page_cache_async_readahead, even worse, is just void and exported as
>> > > > such.
>> > > >
>> > > > So this smells like a hard to fix bugger. One possible, and really ugly
>> > > > way would be calling mem_cgroup_oom_synchronize even if handle_mm_fault
>> > > > doesn't return VM_FAULT_ERROR, but that is a crude hack.
>
>I fixed it by disabling the OOM killer altogether for readahead code.
>We don't do it globally, we should not do it in the memcg, these are
>optional allocations/charges.
>
>I also disabled it for kernel faults triggered from within a syscall
>(copy_*user, get_user_pages), which should just return -ENOMEM as
>usual (unless it's nested inside a userspace fault). The only
>downside is that we can't get around annotating userspace faults
>anymore, so every architecture fault handler now passes
>FAULT_FLAG_USER to handle_mm_fault(). Makes the series a little less
>self-contained, but it's not unreasonable.
>
>It's easy to detect leaks now by checking if the memcg OOM context is
>setup and we are not returning VM_FAULT_OOM.
>
>Here is a combined diff based on 3.2. azurIt, any chance you could
>give this a shot? I tested it on my local machines, but you have a
>known reproducer of fairly unlikely scenarios...

I will be out of office between 25.7. and 1.8.
and I don't want to run anything that could cause an outage of our
services. I will test this patch after 2.8. Should I also use the
previous patches, or is this one enough?

Thank you very much Johannes.

azur

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx151.postini.com [74.125.245.151]) by kanga.kvack.org (Postfix) with SMTP id 856C16B0031 for ; Wed, 24 Jul 2013 16:32:15 -0400 (EDT)
Date: Wed, 24 Jul 2013 16:32:05 -0400
From: Johannes Weiner
Subject: Re: [patch 3/5] x86: finish fault error path with fatal signal
Message-ID: <20130724203205.GL715@cmpxchg.org>
References: <20130711072507.GA21667@dhcp22.suse.cz> <20130714012641.C2DA4E05@pobox.sk> <20130714015112.FFCB7AF7@pobox.sk> <20130715154119.GA32435@dhcp22.suse.cz> <20130715160006.GB32435@dhcp22.suse.cz> <20130716153544.GX17812@cmpxchg.org> <20130716160905.GA20018@dhcp22.suse.cz> <20130716164830.GZ17812@cmpxchg.org> <20130719042124.GC17812@cmpxchg.org> <20130719042502.GF17812@cmpxchg.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20130719042502.GF17812@cmpxchg.org>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Michal Hocko
Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , righi.andrea@gmail.com

On Fri, Jul 19, 2013 at 12:25:02AM -0400, Johannes Weiner wrote:
> The x86 fault handler bails in the middle of error handling when the
> task has been killed. For the next patch this is a problem, because
> it relies on pagefault_out_of_memory() being called even when the task
> has been killed, to perform proper OOM state unwinding.
>
> This is a rather minor optimization, just remove it.
>
> Signed-off-by: Johannes Weiner
> ---
>  arch/x86/mm/fault.c | 11 -----------
>  1 file changed, 11 deletions(-)
>
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 1cebabe..90248c9 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -846,17 +846,6 @@ static noinline int
>  mm_fault_error(struct pt_regs *regs, unsigned long error_code,
>  	       unsigned long address, unsigned int fault)
>  {
> -	/*
> -	 * Pagefault was interrupted by SIGKILL. We have no reason to
> -	 * continue pagefault.
> -	 */
> -	if (fatal_signal_pending(current)) {
> -		if (!(fault & VM_FAULT_RETRY))
> -			up_read(&current->mm->mmap_sem);
> -		if (!(error_code & PF_USER))
> -			no_context(regs, error_code, address);
> -		return 1;

This is broken but I only hit it now after testing for a while.

The patch has the right idea: in case of an OOM kill, we should
continue the fault and not abort. What I missed is that in case of a
kill during lock_page, i.e. VM_FAULT_RETRY && fatal_signal, we have to
exit the fault and not do up_read() etc. This introduced a locking
imbalance that would get everybody hung on mmap_sem.

I moved the retry handling outside of mm_fault_error() (come on...)
and stole some documentation from arm. It's now a little bit more
explicit and comparable to other architectures.

I'll send an updated series, patch for reference:

---
From: Johannes Weiner
Subject: [patch] x86: finish fault error path with fatal signal

The x86 fault handler bails in the middle of error handling when the
task has been killed. For the next patch this is a problem, because
it relies on pagefault_out_of_memory() being called even when the task
has been killed, to perform proper OOM state unwinding.

This is a rather minor optimization that cuts short the fault handling
by a few instructions in rare cases. Just remove it.
Signed-off-by: Johannes Weiner
---
 arch/x86/mm/fault.c | 33 +++++++++++++--------------------
 1 file changed, 13 insertions(+), 20 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 6d77c38..0c18beb 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -842,31 +842,17 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
 	force_sig_info_fault(SIGBUS, code, address, tsk, fault);
 }
 
-static noinline int
+static noinline void
 mm_fault_error(struct pt_regs *regs, unsigned long error_code,
 	       unsigned long address, unsigned int fault)
 {
-	/*
-	 * Pagefault was interrupted by SIGKILL. We have no reason to
-	 * continue pagefault.
-	 */
-	if (fatal_signal_pending(current)) {
-		if (!(fault & VM_FAULT_RETRY))
-			up_read(&current->mm->mmap_sem);
-		if (!(error_code & PF_USER))
-			no_context(regs, error_code, address, 0, 0);
-		return 1;
-	}
-	if (!(fault & VM_FAULT_ERROR))
-		return 0;
-
 	if (fault & VM_FAULT_OOM) {
 		/* Kernel mode? Handle exceptions or die: */
 		if (!(error_code & PF_USER)) {
 			up_read(&current->mm->mmap_sem);
 			no_context(regs, error_code, address,
 				   SIGSEGV, SEGV_MAPERR);
-			return 1;
+			return;
 		}
 
 		up_read(&current->mm->mmap_sem);
@@ -884,7 +870,6 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code,
 		else
 			BUG();
 	}
-	return 1;
 }
 
 static int spurious_fault_check(unsigned long error_code, pte_t *pte)
@@ -1189,9 +1174,17 @@ good_area:
 	 */
 	fault = handle_mm_fault(mm, vma, address, flags);
 
-	if (unlikely(fault & (VM_FAULT_RETRY|VM_FAULT_ERROR))) {
-		if (mm_fault_error(regs, error_code, address, fault))
-			return;
+	/*
+	 * If we need to retry but a fatal signal is pending, handle the
+	 * signal first. We do not need to release the mmap_sem because it
+	 * would already be released in __lock_page_or_retry in mm/filemap.c.
+	 */
+	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+		return;
+
+	if (unlikely(fault & VM_FAULT_ERROR)) {
+		mm_fault_error(regs, error_code, address, fault);
+		return;
 	}
 
 	/*
--
1.8.3.2

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx125.postini.com [74.125.245.125]) by kanga.kvack.org (Postfix) with SMTP id 7AFAB6B0031 for ; Thu, 25 Jul 2013 17:50:44 -0400 (EDT)
Date: Thu, 25 Jul 2013 17:50:33 -0400
From: Johannes Weiner
Subject: Re: [patch 3/5] x86: finish fault error path with fatal signal
Message-ID: <20130725215033.GP715@cmpxchg.org>
References: <20130714015112.FFCB7AF7@pobox.sk> <20130715154119.GA32435@dhcp22.suse.cz> <20130715160006.GB32435@dhcp22.suse.cz> <20130716153544.GX17812@cmpxchg.org> <20130716160905.GA20018@dhcp22.suse.cz> <20130716164830.GZ17812@cmpxchg.org> <20130719042124.GC17812@cmpxchg.org> <20130719042502.GF17812@cmpxchg.org> <20130724203205.GL715@cmpxchg.org> <51F18A99.7000306@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <51F18A99.7000306@gmail.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: KOSAKI Motohiro
Cc: Michal Hocko , azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , righi.andrea@gmail.com

On Thu, Jul 25, 2013 at 04:29:13PM -0400, KOSAKI Motohiro wrote:
> (7/24/13 4:32 PM), Johannes Weiner wrote:
> >@@ -1189,9 +1174,17 @@ good_area:
> >  	 */
> >  	fault = handle_mm_fault(mm, vma, address, flags);
> >
> >-	if (unlikely(fault & (VM_FAULT_RETRY|VM_FAULT_ERROR))) {
> >-		if (mm_fault_error(regs, error_code, address, fault))
> >-			return;
> >+	/*
> >+	 * If we need to retry but a fatal signal is pending, handle the
> >+	 * signal first.
We do not need to release the mmap_sem because it > >+ * would already be released in __lock_page_or_retry in mm/filemap.c. > >+ */ > >+ if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) > >+ return; > >+ > >+ if (unlikely(fault & VM_FAULT_ERROR)) { > >+ mm_fault_error(regs, error_code, address, fault); > >+ return; > > } > > When I made the patch you removed code, Ingo suggested we need put all rare case code > into if(unlikely()) block. Yes, this is purely micro optimization. But it is not costly > to maintain. Fair enough, thanks for the heads up! -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755743Ab2KUTMv (ORCPT ); Wed, 21 Nov 2012 14:12:51 -0500 Received: from gmmr5.centrum.cz ([46.255.225.250]:59156 "EHLO gmmr5.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755720Ab2KUTMu (ORCPT ); Wed, 21 Nov 2012 14:12:50 -0500 X-Greylist: delayed 640 seconds by postgrey-1.27 at vger.kernel.org; Wed, 21 Nov 2012 14:12:50 EST To: Subject: =?utf-8?q?memory=2Dcgroup_bug?= Date: Wed, 21 Nov 2012 20:02:07 +0100 From: "azurIt" X-Mailer: Centrum Email 5.3 X-Priority: 3 X-Original-From: azurit@pobox.sk MIME-Version: 1.0 Message-Id: <20121121200207.01068046@pobox.sk> X-Maser: oho Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi, i'm using memory cgroup for limiting our users and having a really strange problem when a cgroup gets out of its memory limit. It's very strange because it happens only sometimes (about once per week on random user), out of memory is usually handled ok. 
This happens when problem occures: - no new processes can be started for this cgroup - current processes are freezed and taking 100% of CPU - when i try to 'strace' any of current processes, the whole strace freezes until process is killed (strace cannot be terminated by CTRL-c) - problem can be resolved by raising memory limit for cgroup or killing of few processes inside cgroup so some memory is freed I also garbbed the content of /proc//stack of freezed process: [] mem_cgroup_handle_oom+0x241/0x3b0 [] T.1146+0x5ab/0x5c0 [] mem_cgroup_charge_common+0x56/0xa0 [] mem_cgroup_newpage_charge+0x45/0x50 [] do_wp_page+0x14e/0x800 [] handle_pte_fault+0x264/0x940 [] handle_mm_fault+0x138/0x260 [] do_page_fault+0x13d/0x460 [] page_fault+0x1f/0x30 [] 0xffffffffffffffff I'm currently using kernel 3.2.34 but i'm having this problem since 2.6.32. Any ideas? Thnx. azurIt From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752073Ab2KVS1x (ORCPT ); Thu, 22 Nov 2012 13:27:53 -0500 Received: from gmmr7.centrum.cz ([46.255.225.249]:51625 "EHLO gmmr7.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751790Ab2KVS1u (ORCPT ); Thu, 22 Nov 2012 13:27:50 -0500 To: =?utf-8?q?Michal_Hocko?= Subject: =?utf-8?q?Re=3A_memory=2Dcgroup_bug?= Date: Thu, 22 Nov 2012 19:05:26 +0100 From: "azurIt" Cc: , , =?utf-8?q?cgroups_mailinglist?= References: <20121121200207.01068046@pobox.sk> <20121122152441.GA9609@dhcp22.suse.cz> In-Reply-To: <20121122152441.GA9609@dhcp22.suse.cz> X-Mailer: Centrum Email 5.3 X-Priority: 3 X-Original-From: azurit@pobox.sk MIME-Version: 1.0 Message-Id: <20121122190526.390C7A28@pobox.sk> X-Maser: Georgo Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org >> i'm using memory cgroup for limiting our users and having a really >> strange problem when a cgroup gets out of 
its memory limit. It's very >> strange because it happens only sometimes (about once per week on >> random user), out of memory is usually handled ok. > >What is your memcg configuration? Do you use deeper hierarchies, is >use_hierarchy enabled? Is the memcg oom (aka memory.oom_control) >enabled? Do you use soft limit for those groups? Is memcg swap >accounting enabled and memsw limits in place? >Is the machine under global memory pressure as well? >Could you post sysrq+t or sysrq+w? My cgroups hierarchy: /cgroups//uid/ where '' is system user id and 'uid' is just word 'uid'. Memory limits are set in /cgroups// and hierarchy is enabled. Processes are inside /cgroups//uid/ . I'm using hard limits for memory and swap BUT system has no swap at all (it has 'only' 16 GB of real RAM). memory.oom_control is set to 'oom_kill_disable 0'. Server has enough of free memory when problem occurs. >> This happens when problem occures: >> - no new processes can be started for this cgroup >> - current processes are freezed and taking 100% of CPU >> - when i try to 'strace' any of current processes, the whole strace >> freezes until process is killed (strace cannot be terminated by >> CTRL-c) >> - problem can be resolved by raising memory limit for cgroup or >> killing of few processes inside cgroup so some memory is freed >> >> I also garbbed the content of /proc//stack of freezed process: >> [] mem_cgroup_handle_oom+0x241/0x3b0 >> [] T.1146+0x5ab/0x5c0 > >Hmm what is this? Really doesn't know, i will get stack of all freezed processes next time so we can compare it. >> [] mem_cgroup_charge_common+0x56/0xa0 >> [] mem_cgroup_newpage_charge+0x45/0x50 >> [] do_wp_page+0x14e/0x800 >> [] handle_pte_fault+0x264/0x940 >> [] handle_mm_fault+0x138/0x260 >> [] do_page_fault+0x13d/0x460 >> [] page_fault+0x1f/0x30 >> [] 0xffffffffffffffff >> > >How many tasks are hung in mem_cgroup_handle_oom? 
If there were many >of them then it'd smell like an issue fixed by 79dfdaccd1d5 (memcg: >make oom_lock 0 and 1 based rather than counter) and its follow up fix >23751be00940 (memcg: fix hierarchical oom locking) but you are saying >that you can reproduce with 3.2 and those went in for 3.1. 2.6.32 would >make more sense. Usually maximum of several 10s of processes but i will check it next time. I was having much worse problems in 2.6.32 - when freezing happens, the whole server was affected (i wasn't able to do anything and needs to wait until my scripts takes case of it and killed apache, so i don't have any detailed info). In 3.2 only target cgroup is affected. >> I'm currently using kernel 3.2.34 but i'm having this problem since 2.6.32. > >I guess this is a clean vanilla (stable) kernel, right? Are you able to >reproduce with the latest Linus tree? Well, no. I'm using, for example, newest stable grsecurity patch. I'm also using few of Andrea Righi's cgroup subsystems but i don't believe these are doing problems: - cgroup-uid which is moving processes into cgroups based on UID - cgroup-task which can limit number of tasks in cgroup (i already tried to disable this one, it didn't help) http://www.develer.com/~arighi/linux/patches/ Unfortunately i cannot just install new and untested kernel version cos i'm not able to reproduce this problem anytime (it's happening randomly in production environment). Could it be that OOM cannot start and kill processes because there's no free memory in cgroup? Thank you! 
azur From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S965214Ab2KVT3D (ORCPT ); Thu, 22 Nov 2012 14:29:03 -0500 Received: from fgwmail7.fujitsu.co.jp ([192.51.44.37]:58755 "EHLO fgwmail7.fujitsu.co.jp" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S965183Ab2KVT27 (ORCPT ); Thu, 22 Nov 2012 14:28:59 -0500 X-Greylist: delayed 3611 seconds by postgrey-1.27 at vger.kernel.org; Thu, 22 Nov 2012 14:28:59 EST X-SecurityPolicyCheck: OK by SHieldMailChecker v1.8.4 Message-ID: <50AD713F.9030909@jp.fujitsu.com> Date: Thu, 22 Nov 2012 09:26:39 +0900 From: Kamezawa Hiroyuki User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:16.0) Gecko/20121026 Thunderbird/16.0.2 MIME-Version: 1.0 To: azurIt CC: linux-kernel@vger.kernel.org, linux-mm Subject: Re: memory-cgroup bug References: <20121121200207.01068046@pobox.sk> In-Reply-To: <20121121200207.01068046@pobox.sk> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org (2012/11/22 4:02), azurIt wrote: > Hi, > > i'm using memory cgroup for limiting our users and having a really strange problem when a cgroup gets out of its memory limit. It's very strange because it happens only sometimes (about once per week on random user), out of memory is usually handled ok. 
This happens when problem occures:
> - no new processes can be started for this cgroup
> - current processes are freezed and taking 100% of CPU
> - when i try to 'strace' any of current processes, the whole strace freezes until process is killed (strace cannot be terminated by CTRL-c)
> - problem can be resolved by raising memory limit for cgroup or killing of few processes inside cgroup so some memory is freed
>
> I also garbbed the content of /proc//stack of freezed process:
> [] mem_cgroup_handle_oom+0x241/0x3b0
> [] T.1146+0x5ab/0x5c0
> [] mem_cgroup_charge_common+0x56/0xa0
> [] mem_cgroup_newpage_charge+0x45/0x50
> [] do_wp_page+0x14e/0x800
> [] handle_pte_fault+0x264/0x940
> [] handle_mm_fault+0x138/0x260
> [] do_page_fault+0x13d/0x460
> [] page_fault+0x1f/0x30
> [] 0xffffffffffffffff
>
> I'm currently using kernel 3.2.34 but i'm having this problem since 2.6.32.
>
> Any ideas? Thnx.
>

Under OOM in memcg, only one process is allowed to work, because
processes tend to use up CPU during a memory shortage; the other
processes are frozen.

So the problem here is the one process that is using CPU. IIUC,
'freezed' threads are asleep and never use CPU. It's expected that the
oom-killer or memory reclaim can solve the problem.

What is your memcg's memory.oom_control value? And the processes'
oom_adj values?
(/proc//oom_adj, /proc//oom_score_adj) Thanks, -Kame > azurIt > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757160Ab2KVTfD (ORCPT ); Thu, 22 Nov 2012 14:35:03 -0500 Received: from cantor2.suse.de ([195.135.220.15]:40522 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757139Ab2KVTe5 (ORCPT ); Thu, 22 Nov 2012 14:34:57 -0500 Date: Thu, 22 Nov 2012 16:24:41 +0100 From: Michal Hocko To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist Subject: Re: memory-cgroup bug Message-ID: <20121122152441.GA9609@dhcp22.suse.cz> References: <20121121200207.01068046@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: attachment In-Reply-To: <20121121200207.01068046@pobox.sk> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed 21-11-12 20:02:07, azurIt wrote: > Hi, > > i'm using memory cgroup for limiting our users and having a really > strange problem when a cgroup gets out of its memory limit. It's very > strange because it happens only sometimes (about once per week on > random user), out of memory is usually handled ok. What is your memcg configuration? Do you use deeper hierarchies, is use_hierarchy enabled? Is the memcg oom (aka memory.oom_control) enabled? Do you use soft limit for those groups? Is memcg swap accounting enabled and memsw limits in place? Is the machine under global memory pressure as well? Could you post sysrq+t or sysrq+w? 
> This happens when problem occures: > - no new processes can be started for this cgroup > - current processes are freezed and taking 100% of CPU > - when i try to 'strace' any of current processes, the whole strace > freezes until process is killed (strace cannot be terminated by > CTRL-c) > - problem can be resolved by raising memory limit for cgroup or > killing of few processes inside cgroup so some memory is freed > > I also garbbed the content of /proc//stack of freezed process: > [] mem_cgroup_handle_oom+0x241/0x3b0 > [] T.1146+0x5ab/0x5c0 Hmm what is this? > [] mem_cgroup_charge_common+0x56/0xa0 > [] mem_cgroup_newpage_charge+0x45/0x50 > [] do_wp_page+0x14e/0x800 > [] handle_pte_fault+0x264/0x940 > [] handle_mm_fault+0x138/0x260 > [] do_page_fault+0x13d/0x460 > [] page_fault+0x1f/0x30 > [] 0xffffffffffffffff > How many tasks are hung in mem_cgroup_handle_oom? If there were many of them then it'd smell like an issue fixed by 79dfdaccd1d5 (memcg: make oom_lock 0 and 1 based rather than counter) and its follow up fix 23751be00940 (memcg: fix hierarchical oom locking) but you are saying that you can reproduce with 3.2 and those went in for 3.1. 2.6.32 would make more sense. > I'm currently using kernel 3.2.34 but i'm having this problem since 2.6.32. I guess this is a clean vanilla (stable) kernel, right? Are you able to reproduce with the latest Linus tree? 
--
Michal Hocko
SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932575Ab2KVVm7 (ORCPT ); Thu, 22 Nov 2012 16:42:59 -0500
Received: from mail-ee0-f46.google.com ([74.125.83.46]:56770 "EHLO mail-ee0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932311Ab2KVVmz (ORCPT ); Thu, 22 Nov 2012 16:42:55 -0500
Date: Thu, 22 Nov 2012 22:42:52 +0100
From: Michal Hocko
To: azurIt
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist
Subject: Re: memory-cgroup bug
Message-ID: <20121122214249.GA20319@dhcp22.suse.cz>
References: <20121121200207.01068046@pobox.sk> <20121122152441.GA9609@dhcp22.suse.cz> <20121122190526.390C7A28@pobox.sk>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20121122190526.390C7A28@pobox.sk>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu 22-11-12 19:05:26, azurIt wrote:
[...]
> My cgroups hierarchy:
> /cgroups//uid/
>
> where '' is system user id and 'uid' is just word 'uid'.
>
> Memory limits are set in /cgroups// and hierarchy is
> enabled. Processes are inside /cgroups//uid/ . I'm using
> hard limits for memory and swap BUT system has no swap at all
> (it has 'only' 16 GB of real RAM). memory.oom_control is set to
> 'oom_kill_disable 0'. Server has enough of free memory when problem
> occurs.

OK, so the global reclaim shouldn't be active. This is definitely good
to know.
> >> This happens when problem occures: > >> - no new processes can be started for this cgroup > >> - current processes are freezed and taking 100% of CPU > >> - when i try to 'strace' any of current processes, the whole strace > >> freezes until process is killed (strace cannot be terminated by > >> CTRL-c) > >> - problem can be resolved by raising memory limit for cgroup or > >> killing of few processes inside cgroup so some memory is freed > >> > >> I also garbbed the content of /proc//stack of freezed process: > >> [] mem_cgroup_handle_oom+0x241/0x3b0 > >> [] T.1146+0x5ab/0x5c0 > > > >Hmm what is this? > > Really doesn't know, i will get stack of all freezed processes next > time so we can compare it. > > >> [] mem_cgroup_charge_common+0x56/0xa0 > >> [] mem_cgroup_newpage_charge+0x45/0x50 > >> [] do_wp_page+0x14e/0x800 > >> [] handle_pte_fault+0x264/0x940 > >> [] handle_mm_fault+0x138/0x260 > >> [] do_page_fault+0x13d/0x460 > >> [] page_fault+0x1f/0x30 > >> [] 0xffffffffffffffff Btw. is this stack stable or is the task bouncing in some loop? And finally could you post the disassembly of your version of mem_cgroup_handle_oom, please? > >How many tasks are hung in mem_cgroup_handle_oom? If there were many > >of them then it'd smell like an issue fixed by 79dfdaccd1d5 (memcg: > >make oom_lock 0 and 1 based rather than counter) and its follow up fix > >23751be00940 (memcg: fix hierarchical oom locking) but you are saying > >that you can reproduce with 3.2 and those went in for 3.1. 2.6.32 would > >make more sense. > > > Usually maximum of several 10s of processes but i will check it next > time. I was having much worse problems in 2.6.32 - when freezing > happens, the whole server was affected (i wasn't able to do anything > and needs to wait until my scripts takes case of it and killed apache, > so i don't have any detailed info). Hmm, maybe the issue fixed by 1d65f86d (mm: preallocate page before lock_page() at filemap COW) which was merged in 3.1. 
> In 3.2 only target cgroup is affected.
>
> >> I'm currently using kernel 3.2.34 but i'm having this problem since 2.6.32.
> >
> >I guess this is a clean vanilla (stable) kernel, right? Are you able to
> >reproduce with the latest Linus tree?
>
>
> Well, no. I'm using, for example, newest stable grsecurity patch.

That shouldn't be related.

> I'm also using few of Andrea Righi's cgroup subsystems but i don't
> believe
> these are doing problems:
> - cgroup-uid which is moving processes into cgroups based on UID
> - cgroup-task which can limit number of tasks in cgroup (i already
> tried to disable this one, it didn't help)
> http://www.develer.com/~arighi/linux/patches/

I am not familiar with those patches but I will double check.

> Unfortunately i cannot just install new and untested kernel version
> cos i'm not able to reproduce this problem anytime (it's happening
> randomly in production environment).

This will make it a bit harder to debug, but let's see, maybe the new
traces would help...

> Could it be that OOM cannot start and kill processes because there's
> no free memory in cgroup?

That shouldn't happen.
--
Michal Hocko
SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932664Ab2KVVpd (ORCPT ); Thu, 22 Nov 2012 16:45:33 -0500
Received: from mail-ee0-f46.google.com ([74.125.83.46]:34930 "EHLO mail-ee0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932125Ab2KVVpa (ORCPT ); Thu, 22 Nov 2012 16:45:30 -0500
Date: Thu, 22 Nov 2012 22:45:27 +0100
From: Michal Hocko
To: azurIt
Cc: Kamezawa Hiroyuki , linux-kernel@vger.kernel.org, linux-mm
Subject: Re: memory-cgroup bug
Message-ID: <20121122214527.GB20319@dhcp22.suse.cz>
References: <20121121200207.01068046@pobox.sk> <50AD713F.9030909@jp.fujitsu.com> <20121122103618.79F03818@pobox.sk>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20121122103618.79F03818@pobox.sk>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu 22-11-12 10:36:18, azurIt wrote:
[...]
> I can look also to the data of 'freezed' proces if you need it but i
> will have to wait until problem occurs again.
>
> The main problem is that when this problem happens, it's NOT resolved
> automatically by kernel/OOM and user of cgroup, where it happend, has
> non-working services until i kill his processes by hand. I'm sure
> that all 'freezed' processes are taking very much CPU because also
> server load goes really high - next time i will make a screenshot of
> htop. I really wonder why OOM is __sometimes__ not resolving this
> (it's usually is, only sometimes not).

What does your kernel log say while this is happening? Are there any
memcg OOM messages showing up?
-- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755471Ab2KVWek (ORCPT ); Thu, 22 Nov 2012 17:34:40 -0500 Received: from gmmr5.centrum.cz ([46.255.225.250]:52568 "EHLO gmmr5.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750898Ab2KVWeh (ORCPT ); Thu, 22 Nov 2012 17:34:37 -0500 To: =?utf-8?q?Michal_Hocko?= Subject: =?utf-8?q?Re=3A_memory=2Dcgroup_bug?= Date: Thu, 22 Nov 2012 23:34:34 +0100 From: "azurIt" Cc: , , =?utf-8?q?cgroups_mailinglist?= References: <20121121200207.01068046@pobox.sk>, <20121122152441.GA9609@dhcp22.suse.cz>, <20121122190526.390C7A28@pobox.sk> <20121122214249.GA20319@dhcp22.suse.cz> In-Reply-To: <20121122214249.GA20319@dhcp22.suse.cz> X-Mailer: Centrum Email 5.3 X-Priority: 3 X-Original-From: azurit@pobox.sk MIME-Version: 1.0 Message-Id: <20121122233434.3D5E35E6@pobox.sk> X-Maser: oho Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org >Btw. is this stack stable or is the task bouncing in some loop? Not sure, will check it next time. >And finally could you post the disassembly of your version of >mem_cgroup_handle_oom, please? How can i do this? >What does your kernel log says while this is happening. Are there any >memcg OOM messages showing up? I will get the logs next time. Thank you! 
azur From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754628Ab2KWGNj (ORCPT ); Fri, 23 Nov 2012 01:13:39 -0500 Received: from gmmr4.centrum.cz ([46.255.227.253]:44684 "EHLO gmmr4.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753129Ab2KWGNi (ORCPT ); Fri, 23 Nov 2012 01:13:38 -0500 X-Greylist: delayed 21660 seconds by postgrey-1.27 at vger.kernel.org; Fri, 23 Nov 2012 01:13:38 EST To: =?utf-8?q?Kamezawa_Hiroyuki?= Subject: =?utf-8?q?Re=3A_memory=2Dcgroup_bug?= Date: Thu, 22 Nov 2012 10:36:18 +0100 From: "azurIt" Cc: , "linux-mm" References: <20121121200207.01068046@pobox.sk> <50AD713F.9030909@jp.fujitsu.com> In-Reply-To: <50AD713F.9030909@jp.fujitsu.com> X-Mailer: Centrum Email 5.3 X-Priority: 3 X-Original-From: azurit@pobox.sk MIME-Version: 1.0 Message-Id: <20121122103618.79F03818@pobox.sk> X-Maser: oho Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org ______________________________________________________________ > Od: "Kamezawa Hiroyuki" > Komu: azurIt > Dátum: 22.11.2012 01:27 > Predmet: Re: memory-cgroup bug > > CC: linux-kernel@vger.kernel.org, "linux-mm" >(2012/11/22 4:02), azurIt wrote: >> Hi, >> >> i'm using memory cgroup for limiting our users and having a really strange problem when a cgroup gets out of its memory limit. It's very strange because it happens only sometimes (about once per week on random user), out of memory is usually handled ok. 
This happens when problem occures: >> - no new processes can be started for this cgroup >> - current processes are freezed and taking 100% of CPU >> - when i try to 'strace' any of current processes, the whole strace freezes until process is killed (strace cannot be terminated by CTRL-c) >> - problem can be resolved by raising memory limit for cgroup or killing of few processes inside cgroup so some memory is freed >> >> I also garbbed the content of /proc//stack of freezed process: >> [] mem_cgroup_handle_oom+0x241/0x3b0 >> [] T.1146+0x5ab/0x5c0 >> [] mem_cgroup_charge_common+0x56/0xa0 >> [] mem_cgroup_newpage_charge+0x45/0x50 >> [] do_wp_page+0x14e/0x800 >> [] handle_pte_fault+0x264/0x940 >> [] handle_mm_fault+0x138/0x260 >> [] do_page_fault+0x13d/0x460 >> [] page_fault+0x1f/0x30 >> [] 0xffffffffffffffff >> >> I'm currently using kernel 3.2.34 but i'm having this problem since 2.6.32. >> >> Any ideas? Thnx. >> > >Under OOM in memcg, only one process is allowed to work. Because processes tends to use up >CPU at memory shortage. other processes are freezed. > > >Then, the problem here is the one process which uses CPU. IIUC, 'freezed' threads are >in sleep and never use CPU. It's expected oom-killer or memory-reclaim can solve the probelm. > >What is your memcg's memory.oom_control value ? oom_kill_disable 0 >and process's oom_adj values ? (/proc//oom_adj, /proc//oom_score_adj) when i look to a random user PID (Apache web server): oom_adj = 0 oom_score_adj = 0 I can look also to the data of 'freezed' proces if you need it but i will have to wait until problem occurs again. The main problem is that when this problem happens, it's NOT resolved automatically by kernel/OOM and user of cgroup, where it happend, has non-working services until i kill his processes by hand. I'm sure that all 'freezed' processes are taking very much CPU because also server load goes really high - next time i will make a screenshot of htop. 
I really wonder why OOM is __sometimes__ not resolving this (it's usually is, only sometimes not). Thank you! azur From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932933Ab2KWHka (ORCPT ); Fri, 23 Nov 2012 02:40:30 -0500 Received: from mail-vb0-f46.google.com ([209.85.212.46]:38669 "EHLO mail-vb0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932867Ab2KWHk2 (ORCPT ); Fri, 23 Nov 2012 02:40:28 -0500 Date: Fri, 23 Nov 2012 08:40:23 +0100 From: Michal Hocko To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist Subject: Re: memory-cgroup bug Message-ID: <20121123074023.GA24698@dhcp22.suse.cz> References: <20121121200207.01068046@pobox.sk> <20121122152441.GA9609@dhcp22.suse.cz> <20121122190526.390C7A28@pobox.sk> <20121122214249.GA20319@dhcp22.suse.cz> <20121122233434.3D5E35E6@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121122233434.3D5E35E6@pobox.sk> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu 22-11-12 23:34:34, azurIt wrote: [...] > >And finally could you post the disassembly of your version of > >mem_cgroup_handle_oom, please? > > How can i do this? Either use gdb YOUR_VMLINUX and disassemble mem_cgroup_handle_oom or use objdump -d YOUR_VMLINUX and copy out only mem_cgroup_handle_oom function. 
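Both routes can be sketched as follows, assuming an uncompressed vmlinux
ELF image that still carries its symbol table. extract_func is a
hypothetical helper (not part of binutils) and the vmlinux path is a
placeholder:

```shell
# Print the disassembly of a single function from `objdump -d` output on stdin.
# objdump starts each function with "<address> <name>:" and ends it with a
# blank line.
extract_func() {
    awk -v fn="$1" '
        !show && $2 == "<" fn ">:" { show = 1 }  # header of the wanted function
        show && NF == 0            { exit }      # blank line: function is over
        show                       { print }
    '
}

# Usage (the vmlinux path is a placeholder):
#   objdump -d vmlinux | extract_func mem_cgroup_handle_oom
#   gdb -batch -ex 'disassemble mem_cgroup_handle_oom' vmlinux
```

Matching on the header line rather than a plain grep keeps call sites
such as "callq ... <mem_cgroup_handle_oom>" in other functions out of
the result.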
-- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759029Ab2KWJ2g (ORCPT ); Fri, 23 Nov 2012 04:28:36 -0500 Received: from mail-vb0-f46.google.com ([209.85.212.46]:61017 "EHLO mail-vb0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758655Ab2KWJ2e (ORCPT ); Fri, 23 Nov 2012 04:28:34 -0500 Date: Fri, 23 Nov 2012 10:28:29 +0100 From: Michal Hocko To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist Subject: Re: memory-cgroup bug Message-ID: <20121123092829.GE24698@dhcp22.suse.cz> References: <20121121200207.01068046@pobox.sk> <20121122152441.GA9609@dhcp22.suse.cz> <20121122190526.390C7A28@pobox.sk> <20121122214249.GA20319@dhcp22.suse.cz> <20121122233434.3D5E35E6@pobox.sk> <20121123074023.GA24698@dhcp22.suse.cz> <20121123102137.10D6D653@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121123102137.10D6D653@pobox.sk> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri 23-11-12 10:21:37, azurIt wrote: > >Either use gdb YOUR_VMLINUX and disassemble mem_cgroup_handle_oom or > >use objdump -d YOUR_VMLINUX and copy out only mem_cgroup_handle_oom > >function. > If 'YOUR_VMLINUX' is supposed to be my kernel image: > > # gdb vmlinuz-3.2.34-grsec-1 > GNU gdb (GDB) 7.0.1-debian > Copyright (C) 2009 Free Software Foundation, Inc. > License GPLv3+: GNU GPL version 3 or later > This is free software: you are free to change and redistribute it. > There is NO WARRANTY, to the extent permitted by law. Type "show copying" > and "show warranty" for details. > This GDB was configured as "x86_64-linux-gnu". > For bug reporting instructions, please see: > ... 
> "/root/bug/vmlinuz-3.2.34-grsec-1": not in executable format: File format not recognized > > > # objdump -d vmlinuz-3.2.34-grsec-1 You need vmlinux not vmlinuz... -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758720Ab2KWJVo (ORCPT ); Fri, 23 Nov 2012 04:21:44 -0500 Received: from gmmr7.centrum.cz ([46.255.225.249]:49402 "EHLO gmmr7.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751868Ab2KWJVk (ORCPT ); Fri, 23 Nov 2012 04:21:40 -0500 To: =?utf-8?q?Michal_Hocko?= Subject: =?utf-8?q?Re=3A_memory=2Dcgroup_bug?= Date: Fri, 23 Nov 2012 10:21:37 +0100 From: "azurIt" Cc: , , =?utf-8?q?cgroups_mailinglist?= References: <20121121200207.01068046@pobox.sk>, <20121122152441.GA9609@dhcp22.suse.cz>, <20121122190526.390C7A28@pobox.sk>, <20121122214249.GA20319@dhcp22.suse.cz>, <20121122233434.3D5E35E6@pobox.sk> <20121123074023.GA24698@dhcp22.suse.cz> In-Reply-To: <20121123074023.GA24698@dhcp22.suse.cz> X-Mailer: Centrum Email 5.3 X-Priority: 3 X-Original-From: azurit@pobox.sk MIME-Version: 1.0 Message-Id: <20121123102137.10D6D653@pobox.sk> X-Maser: Georgo Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org >Either use gdb YOUR_VMLINUX and disassemble mem_cgroup_handle_oom or >use objdump -d YOUR_VMLINUX and copy out only mem_cgroup_handle_oom >function. If 'YOUR_VMLINUX' is supposed to be my kernel image: # gdb vmlinuz-3.2.34-grsec-1 GNU gdb (GDB) 7.0.1-debian Copyright (C) 2009 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-linux-gnu". For bug reporting instructions, please see: ... 
"/root/bug/vmlinuz-3.2.34-grsec-1": not in executable format: File format not recognized # objdump -d vmlinuz-3.2.34-grsec-1 objdump: vmlinuz-3.2.34-grsec-1: File format not recognized # file vmlinuz-3.2.34-grsec-1 vmlinuz-3.2.34-grsec-1: Linux kernel x86 boot executable bzImage, version 3.2.34-grsec (root@server01) #1, RO-rootFS, swap_dev 0x3, Normal VGA I'm probably doing something wrong :) It, luckily, happend again so i have more info. - there wasn't any logs in kernel from OOM for that cgroup - there were 16 processes in cgroup - processes in cgroup were taking togather 100% of CPU (it was allowed to use only one core, so 100% of that core) - memory.failcnt was groving fast - oom_control: oom_kill_disable 0 under_oom 0 (this was looping from 0 to 1) - limit_in_bytes was set to 157286400 - content of stat (as you can see, the whole memory limit was used): cache 0 rss 0 mapped_file 0 pgpgin 0 pgpgout 0 swap 0 pgfault 0 pgmajfault 0 inactive_anon 0 active_anon 0 inactive_file 0 active_file 0 unevictable 0 hierarchical_memory_limit 157286400 hierarchical_memsw_limit 157286400 total_cache 0 total_rss 157286400 total_mapped_file 0 total_pgpgin 10326454 total_pgpgout 10288054 total_swap 0 total_pgfault 12939677 total_pgmajfault 4283 total_inactive_anon 0 total_active_anon 157286400 total_inactive_file 0 total_active_file 0 total_unevictable 0 i also grabber oom_adj, oom_score_adj and stack of all processes, here it is: http://www.watchdog.sk/lkml/memcg-bug.tar Notice that stack is different for few processes. Stack for all processes were NOT chaging and was still the same. Btw, don't know if it matters but i was several cgroup subsystems mounted and i'm also using them (i was not activating freezer in this case, don't know if it can be active automatically by kernel or what, didn't checked if cgroup was freezed but i suppose it wasn't): none /cgroups cgroup defaults,cpuacct,cpuset,memory,freezer,task,blkio 0 0 Thank you. 
azur From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759121Ab2KWJfS (ORCPT ); Fri, 23 Nov 2012 04:35:18 -0500 Received: from mx2.parallels.com ([64.131.90.16]:45335 "EHLO mx2.parallels.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1759105Ab2KWJfJ (ORCPT ); Fri, 23 Nov 2012 04:35:09 -0500 Message-ID: <50AF4343.6070002@parallels.com> Date: Fri, 23 Nov 2012 13:34:59 +0400 From: Glauber Costa User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:16.0) Gecko/20121029 Thunderbird/16.0.2 MIME-Version: 1.0 To: azurIt CC: Michal Hocko , , , cgroups mailinglist Subject: Re: memory-cgroup bug References: <20121121200207.01068046@pobox.sk>, <20121122152441.GA9609@dhcp22.suse.cz>, <20121122190526.390C7A28@pobox.sk>, <20121122214249.GA20319@dhcp22.suse.cz>, <20121122233434.3D5E35E6@pobox.sk> <20121123074023.GA24698@dhcp22.suse.cz> <20121123102137.10D6D653@pobox.sk> In-Reply-To: <20121123102137.10D6D653@pobox.sk> Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 11/23/2012 01:21 PM, azurIt wrote: >> Either use gdb YOUR_VMLINUX and disassemble mem_cgroup_handle_oom or >> use objdump -d YOUR_VMLINUX and copy out only mem_cgroup_handle_oom >> function. > If 'YOUR_VMLINUX' is supposed to be my kernel image: > > # gdb vmlinuz-3.2.34-grsec-1 this is vmlinuz, not vmlinux. This is the compressed image. > > # file vmlinuz-3.2.34-grsec-1 > vmlinuz-3.2.34-grsec-1: Linux kernel x86 boot executable bzImage, version 3.2.34-grsec (root@server01) #1, RO-rootFS, swap_dev 0x3, Normal VGA > > I'm probably doing something wrong :) You need this: [glauber@straightjacket linux-glommer]$ file vmlinux vmlinux: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, BuildID[sha1]=0xba936ee6b6096f9bc4c663f2a2ee0c2d2481c408, not stripped instead of bzImage. 
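The same distinction can be checked without file(1) by looking at the
first four bytes. A small sketch, assuming only that an ELF image starts
with the magic bytes 0x7f 'E' 'L' 'F'; is_elf is a hypothetical helper
name:

```shell
# Succeed if the file starts with the ELF magic bytes 0x7f 'E' 'L' 'F'.
is_elf() {
    [ "$(od -An -tx1 -N4 "$1" | tr -d ' \n')" = '7f454c46' ]
}

# Usage (path is a placeholder):
#   is_elf ./vmlinux || echo "not an ELF image; gdb and objdump will refuse it"
```

Passing this check only means gdb and objdump can open the file; a
stripped image will still have no symbols to look up.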
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759171Ab2KWJo2 (ORCPT ); Fri, 23 Nov 2012 04:44:28 -0500 Received: from gmmr7.centrum.cz ([46.255.225.249]:57559 "EHLO gmmr7.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1759104Ab2KWJo0 (ORCPT ); Fri, 23 Nov 2012 04:44:26 -0500 To: =?utf-8?q?Michal_Hocko?= Subject: =?utf-8?q?Re=3A_memory=2Dcgroup_bug?= Date: Fri, 23 Nov 2012 10:44:23 +0100 From: "azurIt" Cc: , , =?utf-8?q?cgroups_mailinglist?= References: <20121121200207.01068046@pobox.sk>, <20121122152441.GA9609@dhcp22.suse.cz>, <20121122190526.390C7A28@pobox.sk>, <20121122214249.GA20319@dhcp22.suse.cz>, <20121122233434.3D5E35E6@pobox.sk>, <20121123074023.GA24698@dhcp22.suse.cz>, <20121123102137.10D6D653@pobox.sk> <20121123092829.GE24698@dhcp22.suse.cz> In-Reply-To: <20121123092829.GE24698@dhcp22.suse.cz> X-Mailer: Centrum Email 5.3 X-Priority: 3 X-Original-From: azurit@pobox.sk MIME-Version: 1.0 Message-Id: <20121123104423.338C7725@pobox.sk> X-Maser: oho Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org > CC: linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" >On Fri 23-11-12 10:21:37, azurIt wrote: >> >Either use gdb YOUR_VMLINUX and disassemble mem_cgroup_handle_oom or >> >use objdump -d YOUR_VMLINUX and copy out only mem_cgroup_handle_oom >> >function. >> If 'YOUR_VMLINUX' is supposed to be my kernel image: >> >> # gdb vmlinuz-3.2.34-grsec-1 >> GNU gdb (GDB) 7.0.1-debian >> Copyright (C) 2009 Free Software Foundation, Inc. >> License GPLv3+: GNU GPL version 3 or later >> This is free software: you are free to change and redistribute it. >> There is NO WARRANTY, to the extent permitted by law. Type "show copying" >> and "show warranty" for details. >> This GDB was configured as "x86_64-linux-gnu". 
>> For bug reporting instructions, please see: >> ... >> "/root/bug/vmlinuz-3.2.34-grsec-1": not in executable format: File format not recognized >> >> >> # objdump -d vmlinuz-3.2.34-grsec-1 > >You need vmlinux not vmlinuz... ok, got it but still no luck: # gdb vmlinux GNU gdb (GDB) 7.0.1-debian Copyright (C) 2009 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-linux-gnu". For bug reporting instructions, please see: ... Reading symbols from /root/bug/dddddddd/vmlinux...(no debugging symbols found)...done. (gdb) disassemble mem_cgroup_handle_oom No symbol table is loaded. Use the "file" command. # objdump -d vmlinux | grep mem_cgroup_handle_oom i can recompile the kernel if anything needs to be added into it. azur From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1161097Ab2KWKEr (ORCPT ); Fri, 23 Nov 2012 05:04:47 -0500 Received: from mail-vc0-f174.google.com ([209.85.220.174]:55158 "EHLO mail-vc0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1161040Ab2KWKEo (ORCPT ); Fri, 23 Nov 2012 05:04:44 -0500 Date: Fri, 23 Nov 2012 11:04:38 +0100 From: Michal Hocko To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist Subject: Re: memory-cgroup bug Message-ID: <20121123100438.GF24698@dhcp22.suse.cz> References: <20121121200207.01068046@pobox.sk> <20121122152441.GA9609@dhcp22.suse.cz> <20121122190526.390C7A28@pobox.sk> <20121122214249.GA20319@dhcp22.suse.cz> <20121122233434.3D5E35E6@pobox.sk> <20121123074023.GA24698@dhcp22.suse.cz> <20121123102137.10D6D653@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121123102137.10D6D653@pobox.sk> 
User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri 23-11-12 10:21:37, azurIt wrote: [...] > It, luckily, happend again so i have more info. > > - there wasn't any logs in kernel from OOM for that cgroup > - there were 16 processes in cgroup > - processes in cgroup were taking togather 100% of CPU (it > was allowed to use only one core, so 100% of that core) > - memory.failcnt was groving fast > - oom_control: > oom_kill_disable 0 > under_oom 0 (this was looping from 0 to 1) So there was an OOM going on but no messages in the log? Really strange. Kame already asked about oom_score_adj of the processes in the group but it didn't look like all the processes would have oom disabled, right? > - limit_in_bytes was set to 157286400 > - content of stat (as you can see, the whole memory limit was used): > cache 0 > rss 0 This looks like a top-level group for your user. > mapped_file 0 > pgpgin 0 > pgpgout 0 > swap 0 > pgfault 0 > pgmajfault 0 > inactive_anon 0 > active_anon 0 > inactive_file 0 > active_file 0 > unevictable 0 > hierarchical_memory_limit 157286400 > hierarchical_memsw_limit 157286400 > total_cache 0 > total_rss 157286400 OK, so all the memory is anonymous and you have no swap so the oom is the only thing to do. 
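That reading of the stat file can be checked mechanically. A rough
sketch, assuming the "key value" per line layout of the cgroup v1
memory.stat quoted here; memcg_pressure is a hypothetical helper name,
and the conclusion it prints only holds when no swap is usable:

```shell
# Summarise a cgroup v1 memory.stat file given on stdin.
memcg_pressure() {
    awk '
        $1 == "hierarchical_memory_limit" { limit = $2 }
        $1 == "total_rss"                 { rss   = $2 }
        $1 == "total_cache"               { cache = $2 }
        $1 == "total_swap"                { swap  = $2 }
        END {
            printf "limit=%d anon=%d cache=%d swap=%d\n", limit, rss, cache, swap
            # At the limit with no page cache there is nothing to reclaim.
            if (limit > 0 && rss + cache >= limit && cache == 0)
                print "at limit, no page cache to reclaim: without swap only an OOM kill frees memory"
        }
    '
}

# Usage (path is a placeholder):
#   memcg_pressure < /cgroups/GROUP/memory.stat
```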
> total_mapped_file 0 > total_pgpgin 10326454 > total_pgpgout 10288054 > total_swap 0 > total_pgfault 12939677 > total_pgmajfault 4283 > total_inactive_anon 0 > total_active_anon 157286400 > total_inactive_file 0 > total_active_file 0 > total_unevictable 0 > > > i also grabber oom_adj, oom_score_adj and stack of all processes, here > it is: > http://www.watchdog.sk/lkml/memcg-bug.tar Hmm, all processes waiting for oom are stuck at the very same place: $ grep mem_cgroup_handle_oom -r [0-9]* 30858/stack:[] mem_cgroup_handle_oom+0x241/0x3b0 30859/stack:[] mem_cgroup_handle_oom+0x241/0x3b0 30860/stack:[] mem_cgroup_handle_oom+0x241/0x3b0 30892/stack:[] mem_cgroup_handle_oom+0x241/0x3b0 30898/stack:[] mem_cgroup_handle_oom+0x241/0x3b0 31588/stack:[] mem_cgroup_handle_oom+0x241/0x3b0 32044/stack:[] mem_cgroup_handle_oom+0x241/0x3b0 32358/stack:[] mem_cgroup_handle_oom+0x241/0x3b0 6031/stack:[] mem_cgroup_handle_oom+0x241/0x3b0 6534/stack:[] mem_cgroup_handle_oom+0x241/0x3b0 7020/stack:[] mem_cgroup_handle_oom+0x241/0x3b0 We are taking memcg_oom_lock spinlock twice in that function + we can schedule. As none of the tasks is scheduled this would suggest that you are blocked at the first lock. But who got the lock then? This is really strange. Btw. is sysrq+t resp. sysrq+w showing the same traces as /proc//stat? > Notice that stack is different for few processes. Yes others are in VFS resp ext3. ext3_write_begin looks a bit dangerous but it grabs the page before it really starts a transaction. > Stack for all processes were NOT chaging and was still the same. Could you take few snapshots over time? 
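A small loop is enough for that. The sketch below assumes the cgroup's
tasks file lists one PID per line and that the per-task stack file under
/proc is readable (root, kernel built with stack tracing);
snapshot_stacks and its arguments are hypothetical names:

```shell
# Copy the kernel stack of every task listed in a tasks file into
# <outdir>/<timestamp>-<n>/<pid>.stack, once per interval.
# $1 = tasks file, $2 = output dir, $3 = number of snapshots,
# $4 = interval in seconds, $5 = proc root (optional, defaults to /proc)
snapshot_stacks() {
    tasks=$1 outdir=$2 count=$3 interval=$4 proc=${5:-/proc}
    i=0
    while [ "$i" -lt "$count" ]; do
        dir="$outdir/$(date +%s)-$i"
        mkdir -p "$dir"
        while read -r pid; do
            # A task may exit between reading the list and reading its stack.
            cat "$proc/$pid/stack" > "$dir/$pid.stack" 2>/dev/null || true
        done < "$tasks"
        i=$((i + 1))
        [ "$i" -lt "$count" ] && sleep "$interval"
    done
    return 0
}

# Usage (paths are placeholders): one snapshot per second for ten minutes
#   snapshot_stacks /cgroups/GROUP/tasks /root/bug/snapshots 600 1
```

Comparing consecutive snapshot directories with diff -r shows whether
any task moved between two samples.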
> Btw, don't know if it matters but i was several cgroup subsystems
> mounted and i'm also using them (i was not activating freezer in this
> case, don't know if it can be active automatically by kernel or what,

No

> didn't checked if cgroup was freezed but i suppose it wasn't):
> none /cgroups cgroup defaults,cpuacct,cpuset,memory,freezer,task,blkio 0 0

Do you see the same issue if only the memory controller is mounted
(resp. cpuset, which you seem to use as well from your description)?

I know you said booting into a vanilla kernel would be problematic but
could you at least rule out the cgroup patches that you have mentioned?
If you need to move a task to a group based on a UID you can use the
cgrules daemon (libcgroup1 package) for that as well.
--
Michal Hocko
SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1161122Ab2KWKKl (ORCPT ); Fri, 23 Nov 2012 05:10:41 -0500
Received: from mail-vc0-f174.google.com ([209.85.220.174]:60539 "EHLO mail-vc0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1161082Ab2KWKKi (ORCPT ); Fri, 23 Nov 2012 05:10:38 -0500
Date: Fri, 23 Nov 2012 11:10:34 +0100
From: Michal Hocko
To: azurIt
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist
Subject: Re: memory-cgroup bug
Message-ID: <20121123101034.GG24698@dhcp22.suse.cz>
References: <20121121200207.01068046@pobox.sk> <20121122152441.GA9609@dhcp22.suse.cz> <20121122190526.390C7A28@pobox.sk> <20121122214249.GA20319@dhcp22.suse.cz> <20121122233434.3D5E35E6@pobox.sk> <20121123074023.GA24698@dhcp22.suse.cz> <20121123102137.10D6D653@pobox.sk> <20121123092829.GE24698@dhcp22.suse.cz> <20121123104423.338C7725@pobox.sk>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20121123104423.338C7725@pobox.sk>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List:
linux-kernel@vger.kernel.org On Fri 23-11-12 10:44:23, azurIt wrote: [...] > # gdb vmlinux > GNU gdb (GDB) 7.0.1-debian > Copyright (C) 2009 Free Software Foundation, Inc. > License GPLv3+: GNU GPL version 3 or later > This is free software: you are free to change and redistribute it. > There is NO WARRANTY, to the extent permitted by law. Type "show copying" > and "show warranty" for details. > This GDB was configured as "x86_64-linux-gnu". > For bug reporting instructions, please see: > ... > Reading symbols from /root/bug/dddddddd/vmlinux...(no debugging symbols found)...done. > (gdb) disassemble mem_cgroup_handle_oom > No symbol table is loaded. Use the "file" command. > > > > # objdump -d vmlinux | grep mem_cgroup_handle_oom > Hmm, strange so the function is on the stack but it has been inlined? Doesn't make much sense to me. > i can recompile the kernel if anything needs to be added into it. If you could instrument mem_cgroup_handle_oom with some printks (before we take the memcg_oom_lock, before we schedule and into mem_cgroup_out_of_memory) -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752314Ab2KWO7K (ORCPT ); Fri, 23 Nov 2012 09:59:10 -0500 Received: from gmmr5.centrum.cz ([46.255.225.250]:44340 "EHLO gmmr5.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750714Ab2KWO7I (ORCPT ); Fri, 23 Nov 2012 09:59:08 -0500 To: =?utf-8?q?Michal_Hocko?= Subject: =?utf-8?q?Re=3A_memory=2Dcgroup_bug?= Date: Fri, 23 Nov 2012 15:59:04 +0100 From: "azurIt" Cc: , , =?utf-8?q?cgroups_mailinglist?= References: <20121121200207.01068046@pobox.sk>, <20121122152441.GA9609@dhcp22.suse.cz>, <20121122190526.390C7A28@pobox.sk>, <20121122214249.GA20319@dhcp22.suse.cz>, <20121122233434.3D5E35E6@pobox.sk>, <20121123074023.GA24698@dhcp22.suse.cz>, <20121123102137.10D6D653@pobox.sk> <20121123100438.GF24698@dhcp22.suse.cz> In-Reply-To: 
<20121123100438.GF24698@dhcp22.suse.cz> X-Mailer: Centrum Email 5.3 X-Priority: 3 X-Original-From: azurit@pobox.sk MIME-Version: 1.0 Message-Id: <20121123155904.490039C5@pobox.sk> X-Maser: oho Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org >If you could instrument mem_cgroup_handle_oom with some printks (before >we take the memcg_oom_lock, before we schedule and into >mem_cgroup_out_of_memory) If you send me patch i can do it. I'm, unfortunately, not able to code it. >> It, luckily, happend again so i have more info. >> >> - there wasn't any logs in kernel from OOM for that cgroup >> - there were 16 processes in cgroup >> - processes in cgroup were taking togather 100% of CPU (it >> was allowed to use only one core, so 100% of that core) >> - memory.failcnt was groving fast >> - oom_control: >> oom_kill_disable 0 >> under_oom 0 (this was looping from 0 to 1) > >So there was an OOM going on but no messages in the log? Really strange. >Kame already asked about oom_score_adj of the processes in the group but >it didn't look like all the processes would have oom disabled, right? There were no messages telling that some processes were killed because of OOM. >> - limit_in_bytes was set to 157286400 >> - content of stat (as you can see, the whole memory limit was used): >> cache 0 >> rss 0 > >This looks like a top-level group for your user. Yes, it was from /cgroup// >> mapped_file 0 >> pgpgin 0 >> pgpgout 0 >> swap 0 >> pgfault 0 >> pgmajfault 0 >> inactive_anon 0 >> active_anon 0 >> inactive_file 0 >> active_file 0 >> unevictable 0 >> hierarchical_memory_limit 157286400 >> hierarchical_memsw_limit 157286400 >> total_cache 0 >> total_rss 157286400 > >OK, so all the memory is anonymous and you have no swap so the oom is >the only thing to do. What will happen if the same situation occurs globally? No swap, every bit of memory used. 
Will kernel be able to start OOM killer? Maybe the same thing is happening in cgroup - there's simply no space to run OOM killer. And maybe this is why it's happening rarely - usually there are still at least few KBs of memory left to start OOM killer. >Hmm, all processes waiting for oom are stuck at the very same place: >$ grep mem_cgroup_handle_oom -r [0-9]* >30858/stack:[] mem_cgroup_handle_oom+0x241/0x3b0 >30859/stack:[] mem_cgroup_handle_oom+0x241/0x3b0 >30860/stack:[] mem_cgroup_handle_oom+0x241/0x3b0 >30892/stack:[] mem_cgroup_handle_oom+0x241/0x3b0 >30898/stack:[] mem_cgroup_handle_oom+0x241/0x3b0 >31588/stack:[] mem_cgroup_handle_oom+0x241/0x3b0 >32044/stack:[] mem_cgroup_handle_oom+0x241/0x3b0 >32358/stack:[] mem_cgroup_handle_oom+0x241/0x3b0 >6031/stack:[] mem_cgroup_handle_oom+0x241/0x3b0 >6534/stack:[] mem_cgroup_handle_oom+0x241/0x3b0 >7020/stack:[] mem_cgroup_handle_oom+0x241/0x3b0 > >We are taking memcg_oom_lock spinlock twice in that function + we can >schedule. As none of the tasks is scheduled this would suggest that you >are blocked at the first lock. But who got the lock then? >This is really strange. >Btw. is sysrq+t resp. sysrq+w showing the same traces as >/proc//stat? Unfortunately i'm connecting remotely to the servers (SSH). >> Notice that stack is different for few processes. > >Yes others are in VFS resp ext3. ext3_write_begin looks a bit dangerous >but it grabs the page before it really starts a transaction. Maybe these processes were throttled by cgroup-blkio at the same time and are still keeping the lock? So the problem occurs when there are low on memory and cgroup is doing IO out of it's limits. Only guessing and telling my thoughts. >> Stack for all processes were NOT chaging and was still the same. > >Could you take few snapshots over time? Will do next time but i can't keep services freezed for a long time or customers will be angry. 
>> didn't checked if cgroup was freezed but i suppose it wasn't): >> none /cgroups cgroup defaults,cpuacct,cpuset,memory,freezer,task,blkio 0 0 > >Do you see the same issue if only memory controller was mounted (resp. >cpuset which you seem to use as well from your description). Uh, we are using all mounted subsystems :( I will be able to umount only freezer and maybe blkio for some time. Will it help? >I know you said booting into a vanilla kernel would be problematic but >could you at least rule out te cgroup patches that you have mentioned? >If you need to move a task to a group based by an uid you can use >cgrules daemon (libcgroup1 package) for that as well. We are using cgroup-uid cos it's MUCH MUCH MUCH more efective and better. For example, i don't believe that cgroup-task will work with that daemon. What will happen if cgrules won't be able to add process into cgroup because of task limit? Process will probably continue and will run outside of any cgroup which is wrong. With cgroup-task + cgroup-uid, such processes cannot be even started (and this is what we need). 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752839Ab2KYAKv (ORCPT ); Sat, 24 Nov 2012 19:10:51 -0500 Received: from gmmr3.centrum.cz ([46.255.225.251]:51505 "EHLO gmmr3.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752720Ab2KYAKu (ORCPT ); Sat, 24 Nov 2012 19:10:50 -0500 To: =?utf-8?q?Michal_Hocko?= Subject: =?utf-8?q?Re=3A_memory=2Dcgroup_bug?= Date: Sun, 25 Nov 2012 01:10:47 +0100 From: "azurIt" Cc: , , =?utf-8?q?cgroups_mailinglist?= References: <20121121200207.01068046@pobox.sk>, <20121122152441.GA9609@dhcp22.suse.cz>, <20121122190526.390C7A28@pobox.sk>, <20121122214249.GA20319@dhcp22.suse.cz>, <20121122233434.3D5E35E6@pobox.sk>, <20121123074023.GA24698@dhcp22.suse.cz>, <20121123102137.10D6D653@pobox.sk> <20121123100438.GF24698@dhcp22.suse.cz> In-Reply-To: <20121123100438.GF24698@dhcp22.suse.cz> X-Mailer: Centrum Email 5.3 X-Priority: 3 X-Original-From: azurit@pobox.sk MIME-Version: 1.0 Message-Id: <20121125011047.7477BB5E@pobox.sk> X-Maser: Georgo Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org >Could you take few snapshots over time? 
Here it is, now from different server, snapshot was taken every second for 10 minutes (hope it's enough): www.watchdog.sk/lkml/memcg-bug-2.tar.gz From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753032Ab2KYKRO (ORCPT ); Sun, 25 Nov 2012 05:17:14 -0500 Received: from mail-ea0-f174.google.com ([209.85.215.174]:45764 "EHLO mail-ea0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752899Ab2KYKRL (ORCPT ); Sun, 25 Nov 2012 05:17:11 -0500 Date: Sun, 25 Nov 2012 11:17:07 +0100 From: Michal Hocko To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist Subject: Re: memory-cgroup bug Message-ID: <20121125101707.GA10623@dhcp22.suse.cz> References: <20121121200207.01068046@pobox.sk> <20121122152441.GA9609@dhcp22.suse.cz> <20121122190526.390C7A28@pobox.sk> <20121122214249.GA20319@dhcp22.suse.cz> <20121122233434.3D5E35E6@pobox.sk> <20121123074023.GA24698@dhcp22.suse.cz> <20121123102137.10D6D653@pobox.sk> <20121123100438.GF24698@dhcp22.suse.cz> <20121123155904.490039C5@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121123155904.490039C5@pobox.sk> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri 23-11-12 15:59:04, azurIt wrote: > >If you could instrument mem_cgroup_handle_oom with some printks (before > >we take the memcg_oom_lock, before we schedule and into > >mem_cgroup_out_of_memory) > > > If you send me patch i can do it. I'm, unfortunately, not able to code it. Inlined at the end of the email. Please note I have compile tested it. It might produce a lot of output. > >> It, luckily, happend again so i have more info. 
> >> > >> - there wasn't any logs in kernel from OOM for that cgroup > >> - there were 16 processes in cgroup > >> - processes in cgroup were taking togather 100% of CPU (it > >> was allowed to use only one core, so 100% of that core) > >> - memory.failcnt was groving fast > >> - oom_control: > >> oom_kill_disable 0 > >> under_oom 0 (this was looping from 0 to 1) > > > >So there was an OOM going on but no messages in the log? Really strange. > >Kame already asked about oom_score_adj of the processes in the group but > >it didn't look like all the processes would have oom disabled, right? > > > There were no messages telling that some processes were killed because of OOM. dmesg | grep "Out of memory" doesn't tell anything, right? > >> - limit_in_bytes was set to 157286400 > >> - content of stat (as you can see, the whole memory limit was used): > >> cache 0 > >> rss 0 > > > >This looks like a top-level group for your user. > > > Yes, it was from /cgroup// > > > >> mapped_file 0 > >> pgpgin 0 > >> pgpgout 0 > >> swap 0 > >> pgfault 0 > >> pgmajfault 0 > >> inactive_anon 0 > >> active_anon 0 > >> inactive_file 0 > >> active_file 0 > >> unevictable 0 > >> hierarchical_memory_limit 157286400 > >> hierarchical_memsw_limit 157286400 > >> total_cache 0 > >> total_rss 157286400 > > > >OK, so all the memory is anonymous and you have no swap so the oom is > >the only thing to do. > > > What will happen if the same situation occurs globally? No swap, every > bit of memory used. Will kernel be able to start OOM killer? OOM killer is not a task. It doesn't allocate any memory. It just walks the process list and picks up a task with the highest score. If the global oom is not able to find any such a task (e.g. because all of them have oom disabled) the the system panics. > Maybe the same thing is happening in cgroup cgroup oom differs only in that aspect that the system doesn't panic if there is no suitable task to kill. [...] 
> >> Notice that stack is different for few processes. > > > >Yes others are in VFS resp ext3. ext3_write_begin looks a bit dangerous > >but it grabs the page before it really starts a transaction. > > > Maybe these processes were throttled by cgroup-blkio at the same time > and are still keeping the lock? If you are thinking about memcg_oom_lock then this is not possible because the lock is held only for short times. There is no other lock that memcg oom holds. > So the problem occurs when there are low on memory and cgroup is doing > IO out of it's limits. Only guessing and telling my thoughts. The lockup (if this is what happens) still might be related to the IO controller if the killed task cannot finish due to pending IO, though. [...] > >> didn't checked if cgroup was freezed but i suppose it wasn't): > >> none /cgroups cgroup defaults,cpuacct,cpuset,memory,freezer,task,blkio 0 0 > > > >Do you see the same issue if only memory controller was mounted (resp. > >cpuset which you seem to use as well from your description). > > > Uh, we are using all mounted subsystems :( I will be able to umount > only freezer and maybe blkio for some time. Will it help? Not sure about that without further data. > >I know you said booting into a vanilla kernel would be problematic but > >could you at least rule out te cgroup patches that you have mentioned? > >If you need to move a task to a group based by an uid you can use > >cgrules daemon (libcgroup1 package) for that as well. > > > We are using cgroup-uid cos it's MUCH MUCH MUCH more efective and > better. For example, i don't believe that cgroup-task will work with > that daemon. What will happen if cgrules won't be able to add process > into cgroup because of task limit? Process will probably continue and > will run outside of any cgroup which is wrong. With cgroup-task + > cgroup-uid, such processes cannot be even started (and this is what we > need). 
I am not familiar with cgroup-task controller so I cannot comment on that. --- diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c8425b1..7f26ec8 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1863,6 +1863,7 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask) { struct oom_wait_info owait; bool locked, need_to_kill; + int ret = false; owait.mem = memcg; owait.wait.flags = 0; @@ -1873,6 +1874,7 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask) mem_cgroup_mark_under_oom(memcg); /* At first, try to OOM lock hierarchy under memcg.*/ + printk("XXX: %d waiting for memcg_oom_lock\n", current->pid); spin_lock(&memcg_oom_lock); locked = mem_cgroup_oom_lock(memcg); /* @@ -1887,12 +1889,14 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask) mem_cgroup_oom_notify(memcg); spin_unlock(&memcg_oom_lock); + printk("XXX: %d need_to_kill:%d locked:%d\n", current->pid, need_to_kill, locked); if (need_to_kill) { finish_wait(&memcg_oom_waitq, &owait.wait); mem_cgroup_out_of_memory(memcg, mask); } else { schedule(); finish_wait(&memcg_oom_waitq, &owait.wait); + printk("XXX: %d woken up\n", current->pid); } spin_lock(&memcg_oom_lock); if (locked) @@ -1903,10 +1907,13 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask) mem_cgroup_unmark_under_oom(memcg); if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current)) - return false; + goto out; /* Give chance to dying process */ schedule_timeout_uninterruptible(1); - return true; + ret = true; +out: + printk("XXX: %d done with %d\n", current->pid, ret); + return ret; } /* diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 069b64e..a7db813 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -568,6 +568,7 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask) */ if (fatal_signal_pending(current)) { set_thread_flag(TIF_MEMDIE); + printk("XXX: %d skipping task with fatal signal pending\n", current->pid); return; } @@ -576,8 +577,10 @@ void 
mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask) read_lock(&tasklist_lock); retry: p = select_bad_process(&points, limit, mem, NULL); - if (!p || PTR_ERR(p) == -1UL) + if (!p || PTR_ERR(p) == -1UL) { + printk("XXX: %d nothing to kill\n", current->pid); goto out; + } if (oom_kill_process(p, gfp_mask, 0, points, limit, mem, NULL, "Memory cgroup out of memory")) -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753128Ab2KYMFa (ORCPT ); Sun, 25 Nov 2012 07:05:30 -0500 Received: from mail-ea0-f174.google.com ([209.85.215.174]:39261 "EHLO mail-ea0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753064Ab2KYMF2 (ORCPT ); Sun, 25 Nov 2012 07:05:28 -0500 Date: Sun, 25 Nov 2012 13:05:24 +0100 From: Michal Hocko To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki Subject: Re: memory-cgroup bug Message-ID: <20121125120524.GB10623@dhcp22.suse.cz> References: <20121121200207.01068046@pobox.sk> <20121122152441.GA9609@dhcp22.suse.cz> <20121122190526.390C7A28@pobox.sk> <20121122214249.GA20319@dhcp22.suse.cz> <20121122233434.3D5E35E6@pobox.sk> <20121123074023.GA24698@dhcp22.suse.cz> <20121123102137.10D6D653@pobox.sk> <20121123100438.GF24698@dhcp22.suse.cz> <20121125011047.7477BB5E@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121125011047.7477BB5E@pobox.sk> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org [Adding Kamezawa into CC] On Sun 25-11-12 01:10:47, azurIt wrote: > >Could you take few snapshots over time? > > > Here it is, now from different server, snapshot was taken every second > for 10 minutes (hope it's enough): > www.watchdog.sk/lkml/memcg-bug-2.tar.gz Hmm, interesting: $ grep . 
*/memory.failcnt | cut -d: -f2 | awk 'BEGIN{min=666666}{if (prev>0) {diff=$1-prev; if (diff>max) max=diff; if (diff
min:16281 max:224048 avg:18818.943119

So there are a lot of attempts to allocate which fail, every second!
Will get to that later.

The number of tasks in the group is stable (20):
$ for i in *; do ls -d1 $i/[0-9]* | wc -l; done | sort | uniq -c
    546 20

And no task has been killed or spawned:
$ for i in *; do ls -d1 $i/[0-9]* | cut -d/ -f2; done | sort | uniq
24495
24762
24774
24796
24798
24805
24813
24827
24831
24841
24842
24863
24892
24924
24931
25130
25131
25192
25193
25243

$ for stack in [0-9]*/[0-9]*
do
        head -n1 $stack/stack
done | sort | uniq -c
   9841 [] mem_cgroup_handle_oom+0x241/0x3b0
    546 [] do_truncate+0x58/0xa0
    533 [] 0xffffffffffffffff

Tells us that the stacks are pretty much stable.
$ grep do_truncate -r [0-9]* | cut -d/ -f2 | sort | uniq -c
    546 24495

So 24495 is stuck in do_truncate
[] do_truncate+0x58/0xa0
[] do_last+0x250/0xa30
[] path_openat+0xd7/0x440
[] do_filp_open+0x49/0xa0
[] do_sys_open+0x106/0x240
[] sys_open+0x20/0x30
[] system_call_fastpath+0x18/0x1d
[] 0xffffffffffffffff

I suspect it is waiting for i_mutex. Who is holding that lock?
Other tasks are blocked in mem_cgroup_handle_oom, coming either from
the page fault path (so i_mutex can be excluded) or from vfs_write
(24796), and that one is interesting:
[] mem_cgroup_handle_oom+0x241/0x3b0
[] T.1146+0x5ab/0x5c0
[] mem_cgroup_cache_charge+0xbe/0xe0
[] add_to_page_cache_locked+0x4c/0x140
[] add_to_page_cache_lru+0x22/0x50
[] grab_cache_page_write_begin+0x8b/0xe0
[] ext3_write_begin+0x88/0x270
[] generic_file_buffered_write+0x116/0x290
[] __generic_file_aio_write+0x27c/0x480
[] generic_file_aio_write+0x76/0xf0 # takes &inode->i_mutex
[] do_sync_write+0xea/0x130
[] vfs_write+0xf3/0x1f0
[] sys_write+0x51/0x90
[] system_call_fastpath+0x18/0x1d
[] 0xffffffffffffffff

This smells like a deadlock, but a rather strange one. The rapidly
increasing failcnt suggests that somebody still tries to allocate, but
who, when all of them are hung in mem_cgroup_handle_oom? This can be
explained though.
The memcg OOM killer lets only one process (the one which is able to
lock the hierarchy by mem_cgroup_oom_lock) call mem_cgroup_out_of_memory
and kill a process, while the others wait on the wait queue. Once the
killer is done it calls memcg_wakeup_oom, which wakes up the other tasks
waiting on the queue. Those retry the charge in the hope that some
memory has been freed in the meantime, which hasn't happened, so they
get into OOM again (and again and again).
This all usually works out, except that in this particular case I would
bet my hat that the OOM-selected task is pid 24495, which is blocked on
the mutex held by one of the OOM killer tasks, so it cannot finish and
thus cannot free any memory.

It seems that the current Linus' tree is affected as well.

I will have to think about a solution but it sounds really tricky. It is
not just ext3 that is affected.

I guess we need to tell mem_cgroup_cache_charge that it should never
reach OOM from add_to_page_cache_locked. This sounds quite intrusive to
me. On the other hand it is really weird that an excessive writer might
trigger the memcg OOM killer.
--
Michal Hocko
SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752720Ab2KYMgH (ORCPT ); Sun, 25 Nov 2012 07:36:07 -0500
Received: from gmmr5.centrum.cz ([46.255.225.250]:42604 "EHLO gmmr5.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751923Ab2KYMgG (ORCPT ); Sun, 25 Nov 2012 07:36:06 -0500
To: =?utf-8?q?Michal_Hocko?=
Subject: =?utf-8?q?Re=3A_memory=2Dcgroup_bug?=
Date: Sun, 25 Nov 2012 13:36:02 +0100
From: "azurIt"
Cc: , , =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?=
References: <20121121200207.01068046@pobox.sk>, <20121122152441.GA9609@dhcp22.suse.cz>, <20121122190526.390C7A28@pobox.sk>, <20121122214249.GA20319@dhcp22.suse.cz>, <20121122233434.3D5E35E6@pobox.sk>, <20121123074023.GA24698@dhcp22.suse.cz>, <20121123102137.10D6D653@pobox.sk>, <20121123100438.GF24698@dhcp22.suse.cz>, <20121125011047.7477BB5E@pobox.sk> <20121125120524.GB10623@dhcp22.suse.cz>
In-Reply-To: <20121125120524.GB10623@dhcp22.suse.cz>
X-Mailer: Centrum Email 5.3
X-Priority: 3
X-Original-From: azurit@pobox.sk
MIME-Version: 1.0
Message-Id: <20121125133602.CF488229@pobox.sk>
X-Maser: brud
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List:
linux-kernel@vger.kernel.org >So there is a lot of attempts to allocate which fail, every second! Yes, as i said, the cgroup was taking 100% of (allocated) CPU core(s). Not sure if all processes were using CPU but _few_ of them (not only one) for sure. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752843Ab2KYMj5 (ORCPT ); Sun, 25 Nov 2012 07:39:57 -0500 Received: from gmmr2.centrum.cz ([46.255.227.252]:43597 "EHLO gmmr2.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751961Ab2KYMj4 (ORCPT ); Sun, 25 Nov 2012 07:39:56 -0500 To: =?utf-8?q?Michal_Hocko?= Subject: =?utf-8?q?Re=3A_memory=2Dcgroup_bug?= Date: Sun, 25 Nov 2012 13:39:53 +0100 From: "azurIt" Cc: , , =?utf-8?q?cgroups_mailinglist?= References: <20121121200207.01068046@pobox.sk>, <20121122152441.GA9609@dhcp22.suse.cz>, <20121122190526.390C7A28@pobox.sk>, <20121122214249.GA20319@dhcp22.suse.cz>, <20121122233434.3D5E35E6@pobox.sk>, <20121123074023.GA24698@dhcp22.suse.cz>, <20121123102137.10D6D653@pobox.sk>, <20121123100438.GF24698@dhcp22.suse.cz>, <20121123155904.490039C5@pobox.sk> <20121125101707.GA10623@dhcp22.suse.cz> In-Reply-To: <20121125101707.GA10623@dhcp22.suse.cz> X-Mailer: Centrum Email 5.3 X-Priority: 3 X-Original-From: azurit@pobox.sk MIME-Version: 1.0 Message-Id: <20121125133953.AD1B2F0A@pobox.sk> X-Maser: oho Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org >Inlined at the end of the email. Please note I have compile tested >it. It might produce a lot of output. Thank you very much, i will install it ASAP (probably this night). >dmesg | grep "Out of memory" >doesn't tell anything, right? Only messages for other cgroups but not for the freezed one (before nor after the freeze). 
azur From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752981Ab2KYNCN (ORCPT ); Sun, 25 Nov 2012 08:02:13 -0500 Received: from mail-ea0-f174.google.com ([209.85.215.174]:49736 "EHLO mail-ea0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752805Ab2KYNCL (ORCPT ); Sun, 25 Nov 2012 08:02:11 -0500 Date: Sun, 25 Nov 2012 14:02:08 +0100 From: Michal Hocko To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist Subject: Re: memory-cgroup bug Message-ID: <20121125130208.GC10623@dhcp22.suse.cz> References: <20121122152441.GA9609@dhcp22.suse.cz> <20121122190526.390C7A28@pobox.sk> <20121122214249.GA20319@dhcp22.suse.cz> <20121122233434.3D5E35E6@pobox.sk> <20121123074023.GA24698@dhcp22.suse.cz> <20121123102137.10D6D653@pobox.sk> <20121123100438.GF24698@dhcp22.suse.cz> <20121123155904.490039C5@pobox.sk> <20121125101707.GA10623@dhcp22.suse.cz> <20121125133953.AD1B2F0A@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121125133953.AD1B2F0A@pobox.sk> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sun 25-11-12 13:39:53, azurIt wrote: > >Inlined at the end of the email. Please note I have compile tested > >it. It might produce a lot of output. > > > Thank you very much, i will install it ASAP (probably this night). Please don't. If my analysis is correct which I am almost 100% sure it is then it would cause excessive logging. I am sorry I cannot come up with something else in the mean time. 
-- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753084Ab2KYNzq (ORCPT ); Sun, 25 Nov 2012 08:55:46 -0500 Received: from mail-ea0-f174.google.com ([209.85.215.174]:48669 "EHLO mail-ea0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752659Ab2KYNzp (ORCPT ); Sun, 25 Nov 2012 08:55:45 -0500 Date: Sun, 25 Nov 2012 14:55:42 +0100 From: Michal Hocko To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki Subject: Re: memory-cgroup bug Message-ID: <20121125135542.GE10623@dhcp22.suse.cz> References: <20121121200207.01068046@pobox.sk> <20121122152441.GA9609@dhcp22.suse.cz> <20121122190526.390C7A28@pobox.sk> <20121122214249.GA20319@dhcp22.suse.cz> <20121122233434.3D5E35E6@pobox.sk> <20121123074023.GA24698@dhcp22.suse.cz> <20121123102137.10D6D653@pobox.sk> <20121123100438.GF24698@dhcp22.suse.cz> <20121125011047.7477BB5E@pobox.sk> <20121125120524.GB10623@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121125120524.GB10623@dhcp22.suse.cz> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sun 25-11-12 13:05:24, Michal Hocko wrote: > [Adding Kamezawa into CC] > > On Sun 25-11-12 01:10:47, azurIt wrote: > > >Could you take few snapshots over time? > > > > > > Here it is, now from different server, snapshot was taken every second > > for 10 minutes (hope it's enough): > > www.watchdog.sk/lkml/memcg-bug-2.tar.gz > > Hmm, interesting: > $ grep . */memory.failcnt | cut -d: -f2 | awk 'BEGIN{min=666666}{if (prev>0) {diff=$1-prev; if (diff>max) max=diff; if (diff min:16281 max:224048 avg:18818.943119 > > So there is a lot of attempts to allocate which fail, every second! > Will get to that later. 
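A side note on the failcnt numbers quoted above: the awk one-liner is cut off in this archive, so what follows is only a minimal sketch of how such per-sample deltas could be computed, not the original command. It assumes one memory.failcnt value per line on stdin, already in snapshot order; the function name failcnt_stats is made up for illustration.

```shell
# failcnt_stats (hypothetical helper): read one memory.failcnt sample per
# line on stdin and print min/max/avg of the differences between
# consecutive samples. Prints nothing if fewer than two samples are given.
failcnt_stats() {
    awk '
        NR > 1 {
            diff = $1 - prev                       # growth since previous sample
            if (n == 0 || diff < min) min = diff
            if (n == 0 || diff > max) max = diff
            sum += diff
            n++
        }
        { prev = $1 }                              # remember this sample
        END {
            if (n > 0)
                printf "min:%d max:%d avg:%f\n", min, max, sum / n
        }
    '
}
```

With the snapshot layout used in this thread it could be fed as `grep . */memory.failcnt | cut -d: -f2 | failcnt_stats`, assuming the snapshot directory names sort in the order the snapshots were taken.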
> > The number of tasks in the group is stable (20): > $ for i in *; do ls -d1 $i/[0-9]* | wc -l; done | sort | uniq -c > 546 20 > > And no task has been killed or spawned: > $ for i in *; do ls -d1 $i/[0-9]* | cut -d/ -f2; done | sort | uniq > 24495 > 24762 > 24774 > 24796 > 24798 > 24805 > 24813 > 24827 > 24831 > 24841 > 24842 > 24863 > 24892 > 24924 > 24931 > 25130 > 25131 > 25192 > 25193 > 25243 > > $ for stack in [0-9]*/[0-9]* > do > head -n1 $stack/stack > done | sort | uniq -c > 9841 [] mem_cgroup_handle_oom+0x241/0x3b0 > 546 [] do_truncate+0x58/0xa0 > 533 [] 0xffffffffffffffff > > Tells us that the stacks are pretty much stable. > $ grep do_truncate -r [0-9]* | cut -d/ -f2 | sort | uniq -c > 546 24495 > > So 24495 is stuck in do_truncate > [] do_truncate+0x58/0xa0 > [] do_last+0x250/0xa30 > [] path_openat+0xd7/0x440 > [] do_filp_open+0x49/0xa0 > [] do_sys_open+0x106/0x240 > [] sys_open+0x20/0x30 > [] system_call_fastpath+0x18/0x1d > [] 0xffffffffffffffff > > I suspect it is waiting for i_mutex. Who is holding that lock? > Other tasks are blocked on the mem_cgroup_handle_oom either coming from > the page fault path so i_mutex can be exluded or vfs_write (24796) and > that one is interesting: > [] mem_cgroup_handle_oom+0x241/0x3b0 > [] T.1146+0x5ab/0x5c0 > [] mem_cgroup_cache_charge+0xbe/0xe0 > [] add_to_page_cache_locked+0x4c/0x140 > [] add_to_page_cache_lru+0x22/0x50 > [] grab_cache_page_write_begin+0x8b/0xe0 > [] ext3_write_begin+0x88/0x270 > [] generic_file_buffered_write+0x116/0x290 > [] __generic_file_aio_write+0x27c/0x480 > [] generic_file_aio_write+0x76/0xf0 # takes &inode->i_mutex > [] do_sync_write+0xea/0x130 > [] vfs_write+0xf3/0x1f0 > [] sys_write+0x51/0x90 > [] system_call_fastpath+0x18/0x1d > [] 0xffffffffffffffff > > This smells like a deadlock. But kind strange one. The rapidly > increasing failcnt suggests that somebody still tries to allocate but > who when all of them hung in the mem_cgroup_handle_oom. This can be > explained though. 
> Memcg OOM killer let's only one process (which is able to lock the > hierarchy by mem_cgroup_oom_lock) call mem_cgroup_out_of_memory and kill > a process, while others are waiting on the wait queue. Once the killer > is done it calls memcg_wakeup_oom which wakes up other tasks waiting on > the queue. Those retry the charge, in a hope there is some memory freed > in the meantime which hasn't happened so they get into OOM again (and > again and again). > This all usually works out except in this particular case I would bet > my hat that the OOM selected task is pid 24495 which is blocked on the > mutex which is held by one of the oom killer task so it cannot finish - > thus free a memory. > > It seems that the current Linus' tree is affected as well. > > I will have to think about a solution but it sounds really tricky. It is > not just ext3 that is affected. > > I guess we need to tell mem_cgroup_cache_charge that it should never > reach OOM from add_to_page_cache_locked. This sounds quite intrusive to > me. On the other hand it is really weird that an excessive writer might > trigger a memcg OOM killer. This is hackish but it should help you in this case. Kamezawa, what do you think about that? Should we generalize this and prepare something like mem_cgroup_cache_charge_locked which would add __GFP_NORETRY automatically and use the function whenever we are in a locked context? To be honest I do not like this very much but nothing more sensible (without touching non-memcg paths) comes to my mind. 
--- diff --git a/mm/filemap.c b/mm/filemap.c index 83efee7..da50c83 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -448,7 +448,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, VM_BUG_ON(PageSwapBacked(page)); error = mem_cgroup_cache_charge(page, current->mm, - gfp_mask & GFP_RECLAIM_MASK); + (gfp_mask | __GFP_NORETRY) & GFP_RECLAIM_MASK); if (error) goto out; -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753067Ab2KYN1N (ORCPT ); Sun, 25 Nov 2012 08:27:13 -0500 Received: from gmmr1.centrum.cz ([46.255.225.252]:59136 "EHLO gmmr1.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752426Ab2KYN1M (ORCPT ); Sun, 25 Nov 2012 08:27:12 -0500 To: =?utf-8?q?Michal_Hocko?= Subject: =?utf-8?q?Re=3A_memory=2Dcgroup_bug?= Date: Sun, 25 Nov 2012 14:27:09 +0100 From: "azurIt" Cc: , , =?utf-8?q?cgroups_mailinglist?= References: <20121122152441.GA9609@dhcp22.suse.cz>, <20121122190526.390C7A28@pobox.sk>, <20121122214249.GA20319@dhcp22.suse.cz>, <20121122233434.3D5E35E6@pobox.sk>, <20121123074023.GA24698@dhcp22.suse.cz>, <20121123102137.10D6D653@pobox.sk>, <20121123100438.GF24698@dhcp22.suse.cz>, <20121123155904.490039C5@pobox.sk>, <20121125101707.GA10623@dhcp22.suse.cz>, <20121125133953.AD1B2F0A@pobox.sk> <20121125130208.GC10623@dhcp22.suse.cz> In-Reply-To: <20121125130208.GC10623@dhcp22.suse.cz> X-Mailer: Centrum Email 5.3 X-Priority: 3 X-Original-From: azurit@pobox.sk MIME-Version: 1.0 Message-Id: <20121125142709.19F4E8C2@pobox.sk> X-Maser: brud Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org >> Thank you very much, i will install it ASAP (probably this night). > >Please don't. If my analysis is correct which I am almost 100% sure it >is then it would cause excessive logging. 
I am sorry I cannot come up >with something else in the mean time. Ok then. I will, meanwhile, try to contact Andrea Righi (author of cgroup-task etc.) and ask him to send here his opinion about relation between freezes and his patches. Maybe it's some kind of a bug in memcg which don't appear in current vanilla code and is triggered by conditions created by, for example, cgroup-task. I noticed that there is always the exact number of freezed processes as the limit set for number of tasks by cgroup-task (i already tried to raise this limit AFTER the cgroup was freezed, didn't change anything). I'm sure it's not the problem with cgroup-task alone, it's 100% related also to memcg (but maybe there must be the combination of both of them). Thank you so far for your time! azur From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753111Ab2KYNoo (ORCPT ); Sun, 25 Nov 2012 08:44:44 -0500 Received: from mail-ee0-f46.google.com ([74.125.83.46]:61516 "EHLO mail-ee0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752843Ab2KYNom (ORCPT ); Sun, 25 Nov 2012 08:44:42 -0500 Date: Sun, 25 Nov 2012 14:44:40 +0100 From: Michal Hocko To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist Subject: Re: memory-cgroup bug Message-ID: <20121125134440.GD10623@dhcp22.suse.cz> References: <20121122214249.GA20319@dhcp22.suse.cz> <20121122233434.3D5E35E6@pobox.sk> <20121123074023.GA24698@dhcp22.suse.cz> <20121123102137.10D6D653@pobox.sk> <20121123100438.GF24698@dhcp22.suse.cz> <20121123155904.490039C5@pobox.sk> <20121125101707.GA10623@dhcp22.suse.cz> <20121125133953.AD1B2F0A@pobox.sk> <20121125130208.GC10623@dhcp22.suse.cz> <20121125142709.19F4E8C2@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121125142709.19F4E8C2@pobox.sk> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: 
linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sun 25-11-12 14:27:09, azurIt wrote: > >> Thank you very much, i will install it ASAP (probably this night). > > > >Please don't. If my analysis is correct which I am almost 100% sure it > >is then it would cause excessive logging. I am sorry I cannot come up > >with something else in the mean time. > > > Ok then. I will, meanwhile, try to contact Andrea Righi (author of > cgroup-task etc.) and ask him to send here his opinion about relation > between freezes and his patches. As I described in other email. This seems to be a deadlock in memcg oom so I do not think that other patches influence this. -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753854Ab2KZAjA (ORCPT ); Sun, 25 Nov 2012 19:39:00 -0500 Received: from gmmr3.centrum.cz ([46.255.225.251]:59858 "EHLO gmmr3.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753686Ab2KZAi6 (ORCPT ); Sun, 25 Nov 2012 19:38:58 -0500 To: =?utf-8?q?Michal_Hocko?= Subject: =?utf-8?q?Re=3A_memory=2Dcgroup_bug?= Date: Mon, 26 Nov 2012 01:38:55 +0100 From: "azurIt" Cc: , , =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= References: <20121121200207.01068046@pobox.sk>, <20121122152441.GA9609@dhcp22.suse.cz>, <20121122190526.390C7A28@pobox.sk>, <20121122214249.GA20319@dhcp22.suse.cz>, <20121122233434.3D5E35E6@pobox.sk>, <20121123074023.GA24698@dhcp22.suse.cz>, <20121123102137.10D6D653@pobox.sk>, <20121123100438.GF24698@dhcp22.suse.cz>, <20121125011047.7477BB5E@pobox.sk>, <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> In-Reply-To: <20121125135542.GE10623@dhcp22.suse.cz> X-Mailer: Centrum Email 5.3 X-Priority: 3 X-Original-From: azurit@pobox.sk MIME-Version: 1.0 Message-Id: <20121126013855.AF118F5E@pobox.sk> X-Maser: oho Content-Type: text/plain; charset=UTF-8 
Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org >This is hackish but it should help you in this case. Kamezawa, what do >you think about that? Should we generalize this and prepare something >like mem_cgroup_cache_charge_locked which would add __GFP_NORETRY >automatically and use the function whenever we are in a locked context? >To be honest I do not like this very much but nothing more sensible >(without touching non-memcg paths) comes to my mind. I installed kernel with this patch, will report back if problem occurs again OR in few weeks if everything will be ok. Thank you! Btw, will this patch be backported to 3.2? azur From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754385Ab2KZH5L (ORCPT ); Mon, 26 Nov 2012 02:57:11 -0500 Received: from cantor2.suse.de ([195.135.220.15]:55495 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753908Ab2KZH5J (ORCPT ); Mon, 26 Nov 2012 02:57:09 -0500 Date: Mon, 26 Nov 2012 08:57:07 +0100 From: Michal Hocko To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki Subject: Re: memory-cgroup bug Message-ID: <20121126075656.GA17860@dhcp22.suse.cz> References: <20121122190526.390C7A28@pobox.sk> <20121122214249.GA20319@dhcp22.suse.cz> <20121122233434.3D5E35E6@pobox.sk> <20121123074023.GA24698@dhcp22.suse.cz> <20121123102137.10D6D653@pobox.sk> <20121123100438.GF24698@dhcp22.suse.cz> <20121125011047.7477BB5E@pobox.sk> <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121126013855.AF118F5E@pobox.sk> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon 
26-11-12 01:38:55, azurIt wrote: > >This is hackish but it should help you in this case. Kamezawa, what do > >you think about that? Should we generalize this and prepare something > >like mem_cgroup_cache_charge_locked which would add __GFP_NORETRY > >automatically and use the function whenever we are in a locked context? > >To be honest I do not like this very much but nothing more sensible > >(without touching non-memcg paths) comes to my mind. > > > I installed kernel with this patch, will report back if problem occurs > again OR in few weeks if everything will be ok. Thank you! Thanks! > Btw, will this patch be backported to 3.2? Once we agree on a proper solution it will be backported to the stable trees. -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755264Ab2KZNSl (ORCPT ); Mon, 26 Nov 2012 08:18:41 -0500 Received: from cantor2.suse.de ([195.135.220.15]:43973 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754690Ab2KZNSj (ORCPT ); Mon, 26 Nov 2012 08:18:39 -0500 Date: Mon, 26 Nov 2012 14:18:37 +0100 From: Michal Hocko To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner Subject: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121126131837.GC17860@dhcp22.suse.cz> References: <20121122190526.390C7A28@pobox.sk> <20121122214249.GA20319@dhcp22.suse.cz> <20121122233434.3D5E35E6@pobox.sk> <20121123074023.GA24698@dhcp22.suse.cz> <20121123102137.10D6D653@pobox.sk> <20121123100438.GF24698@dhcp22.suse.cz> <20121125011047.7477BB5E@pobox.sk> <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121126013855.AF118F5E@pobox.sk> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: 
linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

[CCing also Johannes - the thread started here:
https://lkml.org/lkml/2012/11/21/497]

On Mon 26-11-12 01:38:55, azurIt wrote:
> >This is hackish but it should help you in this case. Kamezawa, what do
> >you think about that? Should we generalize this and prepare something
> >like mem_cgroup_cache_charge_locked which would add __GFP_NORETRY
> >automatically and use the function whenever we are in a locked context?
> >To be honest I do not like this very much but nothing more sensible
> >(without touching non-memcg paths) comes to my mind.
>
>
> I installed kernel with this patch, will report back if problem occurs
> again OR in few weeks if everything will be ok. Thank you!

Now that I am looking at the patch more closely, it will not work
because it depends on another patch which is not merged yet, and even
that one wouldn't help on its own because __GFP_NORETRY doesn't break
the charge loop. Sorry I have missed that...

The patch below should help though. (It is based on top of the current
-mm tree but I will send a backport to 3.2 in the reply as well.)
---
>>From 7796f942d62081ad45726efd90b5292b80e7c690 Mon Sep 17 00:00:00 2001
From: Michal Hocko
Date: Mon, 26 Nov 2012 11:47:57 +0100
Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked

The memcg OOM killer might deadlock if the process which falls down to
mem_cgroup_handle_oom holds a lock which prevents another task from
terminating because that task is blocked on the very same lock.
This can happen when a write system call needs to allocate a page but
the allocation hits the memcg hard limit and there is nothing to reclaim
(e.g. there is no swap or the swap limit is hit as well and all cache
pages have been reclaimed already) and the process selected by the memcg
OOM killer is blocked on i_mutex on the same inode (e.g. truncating it).
Process A [] do_truncate+0x58/0xa0 # takes i_mutex [] do_last+0x250/0xa30 [] path_openat+0xd7/0x440 [] do_filp_open+0x49/0xa0 [] do_sys_open+0x106/0x240 [] sys_open+0x20/0x30 [] system_call_fastpath+0x18/0x1d [] 0xffffffffffffffff Process B [] mem_cgroup_handle_oom+0x241/0x3b0 [] T.1146+0x5ab/0x5c0 [] mem_cgroup_cache_charge+0xbe/0xe0 [] add_to_page_cache_locked+0x4c/0x140 [] add_to_page_cache_lru+0x22/0x50 [] grab_cache_page_write_begin+0x8b/0xe0 [] ext3_write_begin+0x88/0x270 [] generic_file_buffered_write+0x116/0x290 [] __generic_file_aio_write+0x27c/0x480 [] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex [] do_sync_write+0xea/0x130 [] vfs_write+0xf3/0x1f0 [] sys_write+0x51/0x90 [] system_call_fastpath+0x18/0x1d [] 0xffffffffffffffff This is not a hard deadlock though because the administrator can still intervene and increase the limit on the group which helps the writer to finish the allocation and release the lock. This patch heals the problem by forbidding OOM from page cache charges (namely add_to_page_cache_locked). A mem_cgroup_cache_charge_no_oom helper function is defined which adds GFP_MEMCG_NO_OOM to the gfp mask, which then tells mem_cgroup_charge_common that OOM is not allowed for the charge. No OOM from this path, apart from fixing the bug, also makes some sense as we really do not want to cause an OOM because of page cache usage. As a possibly visible result add_to_page_cache_lru might fail more often with ENOMEM but this is to be expected if the limit is set and it is preferable to the OOM killer IMO. __GFP_NORETRY is abused for this memcg specific flag because it has been used to prevent OOM already (since not-merged-yet "memcg: reclaim when more than one page needed"). The only difference is that the flag doesn't prevent reclaim anymore which kind of makes sense because the global memory allocator triggers reclaim as well.
The retry without any reclaim on __GFP_NORETRY doesn't make much sense anyway because this is effectively a busy loop with allowed OOM in this path. Reported-by: azurIt Signed-off-by: Michal Hocko --- include/linux/gfp.h | 3 +++ include/linux/memcontrol.h | 12 ++++++++++++ mm/filemap.c | 8 +++++++- mm/memcontrol.c | 5 +---- 4 files changed, 23 insertions(+), 5 deletions(-) diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 10e667f..aac9b21 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -152,6 +152,9 @@ struct vm_area_struct; /* 4GB DMA on some platforms */ #define GFP_DMA32 __GFP_DMA32 +/* memcg oom killer is not allowed */ +#define GFP_MEMCG_NO_OOM __GFP_NORETRY + /* Convert GFP flags to their corresponding migrate type */ static inline int allocflags_to_migratetype(gfp_t gfp_flags) { diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 095d2b4..1ad4bc6 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -65,6 +65,12 @@ extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg); extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask); +static inline int mem_cgroup_cache_charge_no_oom(struct page *page, + struct mm_struct *mm, gfp_t gfp_mask) +{ + return mem_cgroup_cache_charge(page, mm, gfp_mask | GFP_MEMCG_NO_OOM); +} + struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *); struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *); @@ -215,6 +221,12 @@ static inline int mem_cgroup_cache_charge(struct page *page, return 0; } +static inline int mem_cgroup_cache_charge_no_oom(struct page *page, + struct mm_struct *mm, gfp_t gfp_mask) +{ + return 0; +} + static inline int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page, gfp_t gfp_mask, struct mem_cgroup **memcgp) { diff --git a/mm/filemap.c b/mm/filemap.c index 83efee7..ef14351 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -447,7 +447,13 @@ int 
add_to_page_cache_locked(struct page *page, struct address_space *mapping, VM_BUG_ON(!PageLocked(page)); VM_BUG_ON(PageSwapBacked(page)); - error = mem_cgroup_cache_charge(page, current->mm, + /* + * Cannot trigger OOM even if gfp_mask would allow that normally + * because we might be called from a locked context and that + * could lead to deadlocks if the killed process is waiting for + * the same lock. + */ + error = mem_cgroup_cache_charge_no_oom(page, current->mm, gfp_mask & GFP_RECLAIM_MASK); if (error) goto out; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 02ee2f7..b4754ba 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2430,9 +2430,6 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, if (!(gfp_mask & __GFP_WAIT)) return CHARGE_WOULDBLOCK; - if (gfp_mask & __GFP_NORETRY) - return CHARGE_NOMEM; - ret = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags); if (mem_cgroup_margin(mem_over_limit) >= nr_pages) return CHARGE_RETRY; @@ -3713,7 +3710,7 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm, { struct mem_cgroup *memcg = NULL; unsigned int nr_pages = 1; - bool oom = true; + bool oom = !(gfp_mask & GFP_MEMCG_NO_OOM); int ret; if (PageTransHuge(page)) { -- 1.7.10.4 -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754756Ab2KZNVw (ORCPT ); Mon, 26 Nov 2012 08:21:52 -0500 Received: from cantor2.suse.de ([195.135.220.15]:44190 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754250Ab2KZNVu (ORCPT ); Mon, 26 Nov 2012 08:21:50 -0500 Date: Mon, 26 Nov 2012 14:21:49 +0100 From: Michal Hocko To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121126132149.GD17860@dhcp22.suse.cz> References:
<20121122214249.GA20319@dhcp22.suse.cz> <20121122233434.3D5E35E6@pobox.sk> <20121123074023.GA24698@dhcp22.suse.cz> <20121123102137.10D6D653@pobox.sk> <20121123100438.GF24698@dhcp22.suse.cz> <20121125011047.7477BB5E@pobox.sk> <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121126131837.GC17860@dhcp22.suse.cz> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Here we go with the patch for 3.2.34. Could you test with this one, please? --- >>From 0d2d915c16f93918051b7ab8039d30b5a922049c Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Mon, 26 Nov 2012 11:47:57 +0100 Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked memcg oom killer might deadlock if the process which falls down to mem_cgroup_handle_oom holds a lock which prevents other task to terminate because it is blocked on the very same lock. This can happen when a write system call needs to allocate a page but the allocation hits the memcg hard limit and there is nothing to reclaim (e.g. there is no swap or swap limit is hit as well and all cache pages have been reclaimed already) and the process selected by memcg OOM killer is blocked on i_mutex on the same inode (e.g. truncate it). 
Process A [] do_truncate+0x58/0xa0 # takes i_mutex [] do_last+0x250/0xa30 [] path_openat+0xd7/0x440 [] do_filp_open+0x49/0xa0 [] do_sys_open+0x106/0x240 [] sys_open+0x20/0x30 [] system_call_fastpath+0x18/0x1d [] 0xffffffffffffffff Process B [] mem_cgroup_handle_oom+0x241/0x3b0 [] T.1146+0x5ab/0x5c0 [] mem_cgroup_cache_charge+0xbe/0xe0 [] add_to_page_cache_locked+0x4c/0x140 [] add_to_page_cache_lru+0x22/0x50 [] grab_cache_page_write_begin+0x8b/0xe0 [] ext3_write_begin+0x88/0x270 [] generic_file_buffered_write+0x116/0x290 [] __generic_file_aio_write+0x27c/0x480 [] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex [] do_sync_write+0xea/0x130 [] vfs_write+0xf3/0x1f0 [] sys_write+0x51/0x90 [] system_call_fastpath+0x18/0x1d [] 0xffffffffffffffff This is not a hard deadlock though because the administrator can still intervene and increase the limit on the group which helps the writer to finish the allocation and release the lock. This patch heals the problem by forbidding OOM from page cache charges (namely add_to_page_cache_locked). A mem_cgroup_cache_charge_no_oom helper function is defined which adds GFP_MEMCG_NO_OOM to the gfp mask, which then tells mem_cgroup_charge_common that OOM is not allowed for the charge. No OOM from this path, apart from fixing the bug, also makes some sense as we really do not want to cause an OOM because of page cache usage. As a possibly visible result add_to_page_cache_lru might fail more often with ENOMEM but this is to be expected if the limit is set and it is preferable to the OOM killer IMO. __GFP_NORETRY is abused for this memcg specific flag because no user-accounted allocation uses this flag except for THP, which has memcg oom disabled already.
Reported-by: azurIt Signed-off-by: Michal Hocko --- include/linux/gfp.h | 3 +++ include/linux/memcontrol.h | 13 +++++++++++++ mm/filemap.c | 8 +++++++- mm/memcontrol.c | 2 +- 4 files changed, 24 insertions(+), 2 deletions(-) diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 3a76faf..806fb54 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -146,6 +146,9 @@ struct vm_area_struct; /* 4GB DMA on some platforms */ #define GFP_DMA32 __GFP_DMA32 +/* memcg oom killer is not allowed */ +#define GFP_MEMCG_NO_OOM __GFP_NORETRY + /* Convert GFP flags to their corresponding migrate type */ static inline int allocflags_to_migratetype(gfp_t gfp_flags) { diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 81572af..bf0e575 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -63,6 +63,13 @@ extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *ptr); extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask); + +static inline int mem_cgroup_cache_charge_no_oom(struct page *page, + struct mm_struct *mm, gfp_t gfp_mask) +{ + return mem_cgroup_cache_charge(page, mm, gfp_mask | GFP_MEMCG_NO_OOM); +} + extern void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru); extern void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru); extern void mem_cgroup_rotate_reclaimable_page(struct page *page); @@ -178,6 +185,12 @@ static inline int mem_cgroup_cache_charge(struct page *page, return 0; } +static inline int mem_cgroup_cache_charge_no_oom(struct page *page, + struct mm_struct *mm, gfp_t gfp_mask) +{ + return 0; +} + static inline int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page, gfp_t gfp_mask, struct mem_cgroup **ptr) { diff --git a/mm/filemap.c b/mm/filemap.c index 556858c..ef182a9 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -449,7 +449,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, 
VM_BUG_ON(!PageLocked(page)); VM_BUG_ON(PageSwapBacked(page)); - error = mem_cgroup_cache_charge(page, current->mm, + /* + * Cannot trigger OOM even if gfp_mask would allow that normally + * because we might be called from a locked context and that + * could lead to deadlocks if the killed process is waiting for + * the same lock. + */ + error = mem_cgroup_cache_charge_no_oom(page, current->mm, gfp_mask & GFP_RECLAIM_MASK); if (error) goto out; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c8425b1..1dbbe7f 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2703,7 +2703,7 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm, struct mem_cgroup *memcg = NULL; unsigned int nr_pages = 1; struct page_cgroup *pc; - bool oom = true; + bool oom = !(gfp_mask & GFP_MEMCG_NO_OOM); int ret; if (PageTransHuge(page)) { -- 1.7.10.4 -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756916Ab2KZRqh (ORCPT ); Mon, 26 Nov 2012 12:46:37 -0500 Received: from zene.cmpxchg.org ([85.214.230.12]:35072 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755373Ab2KZRqd (ORCPT ); Mon, 26 Nov 2012 12:46:33 -0500 Date: Mon, 26 Nov 2012 12:46:22 -0500 From: Johannes Weiner To: Michal Hocko Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki Subject: Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121126174622.GE2799@cmpxchg.org> References: <20121122214249.GA20319@dhcp22.suse.cz> <20121122233434.3D5E35E6@pobox.sk> <20121123074023.GA24698@dhcp22.suse.cz> <20121123102137.10D6D653@pobox.sk> <20121123100438.GF24698@dhcp22.suse.cz> <20121125011047.7477BB5E@pobox.sk> <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121126131837.GC17860@dhcp22.suse.cz> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Nov 26, 2012 at 02:18:37PM +0100, Michal Hocko wrote: > [CCing also Johannes - the thread started here: > https://lkml.org/lkml/2012/11/21/497] > > On Mon 26-11-12 01:38:55, azurIt wrote: > > >This is hackish but it should help you in this case. Kamezawa, what do > > >you think about that? Should we generalize this and prepare something > > >like mem_cgroup_cache_charge_locked which would add __GFP_NORETRY > > >automatically and use the function whenever we are in a locked context? > > >To be honest I do not like this very much but nothing more sensible > > >(without touching non-memcg paths) comes to my mind. > > > > > > I installed kernel with this patch, will report back if problem occurs > > again OR in few weeks if everything will be ok. Thank you! > > Now that I am looking at the patch closer it will not work because it > depends on other patch which is not merged yet and even that one would > help on its own because __GFP_NORETRY doesn't break the charge loop. > Sorry I have missed that... > > The patch bellow should help though. (it is based on top of the current > -mm tree but I will send a backport to 3.2 in the reply as well) > --- > >From 7796f942d62081ad45726efd90b5292b80e7c690 Mon Sep 17 00:00:00 2001 > From: Michal Hocko > Date: Mon, 26 Nov 2012 11:47:57 +0100 > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked > > memcg oom killer might deadlock if the process which falls down to > mem_cgroup_handle_oom holds a lock which prevents other task to > terminate because it is blocked on the very same lock. > This can happen when a write system call needs to allocate a page but > the allocation hits the memcg hard limit and there is nothing to reclaim > (e.g. 
there is no swap or swap limit is hit as well and all cache pages > have been reclaimed already) and the process selected by memcg OOM > killer is blocked on i_mutex on the same inode (e.g. truncate it). > > Process A > [] do_truncate+0x58/0xa0 # takes i_mutex > [] do_last+0x250/0xa30 > [] path_openat+0xd7/0x440 > [] do_filp_open+0x49/0xa0 > [] do_sys_open+0x106/0x240 > [] sys_open+0x20/0x30 > [] system_call_fastpath+0x18/0x1d > [] 0xffffffffffffffff > > Process B > [] mem_cgroup_handle_oom+0x241/0x3b0 > [] T.1146+0x5ab/0x5c0 > [] mem_cgroup_cache_charge+0xbe/0xe0 > [] add_to_page_cache_locked+0x4c/0x140 > [] add_to_page_cache_lru+0x22/0x50 > [] grab_cache_page_write_begin+0x8b/0xe0 > [] ext3_write_begin+0x88/0x270 > [] generic_file_buffered_write+0x116/0x290 > [] __generic_file_aio_write+0x27c/0x480 > [] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex > [] do_sync_write+0xea/0x130 > [] vfs_write+0xf3/0x1f0 > [] sys_write+0x51/0x90 > [] system_call_fastpath+0x18/0x1d > [] 0xffffffffffffffff So process B manages to lock the hierarchy, calls mem_cgroup_out_of_memory() and retries the charge infinitely, waiting for task A to die. All while it holds the i_mutex, preventing task A from dying, right? I think global oom already handles this in a much better way: invoke the OOM killer, sleep for a second, then return to userspace to relinquish all kernel resources and locks. The only reason why we can't simply change from an endless retry loop is because we don't want to return VM_FAULT_OOM and invoke the global OOM killer. But maybe we can return a new VM_FAULT_OOM_HANDLED for memcg OOM and just restart the pagefault. Return -ENOMEM to the buffered IO syscall respectively. This way, the memcg OOM killer is invoked as it should but nobody gets stuck anywhere livelocking with the exiting task. 
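The alternative described above can be pictured with a small userspace model. This is only a sketch under stated assumptions: it is not kernel code, every name ending in _sketch is made up for illustration, and it only models the control flow "invoke the killer once, then return to the caller instead of looping while locks are held". How a real VM_FAULT_OOM_HANDLED would be wired into the fault handler is not shown in this thread and is not modelled here.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins; none of these are real kernel symbols. */
enum charge_result_sketch { CHARGE_OK_SKETCH, CHARGE_OOM_HANDLED_SKETCH };

struct group_sketch {
	long usage;     /* pages currently charged to the group */
	long limit;     /* hard limit in pages */
	int oom_kills;  /* how many times the killer was invoked */
};

/* Pretend to select and kill a victim; here it only counts the call. */
static void oom_kill_sketch(struct group_sketch *g)
{
	g->oom_kills++;
}

/*
 * Try to charge nr pages. On failure, invoke the killer once and hand
 * control back to the caller rather than retrying in an endless loop.
 */
static enum charge_result_sketch charge_sketch(struct group_sketch *g, long nr)
{
	if (g->usage + nr <= g->limit) {
		g->usage += nr;
		return CHARGE_OK_SKETCH;
	}
	oom_kill_sketch(g);
	return CHARGE_OOM_HANDLED_SKETCH;
}

/* Buffered IO path: report -ENOMEM so the caller unwinds and drops its locks. */
static int write_path_sketch(struct group_sketch *g)
{
	return charge_sketch(g, 1) == CHARGE_OK_SKETCH ? 0 : -ENOMEM;
}

/* Fault path: true means "the OOM was handled, just restart the fault". */
static bool fault_path_should_restart_sketch(struct group_sketch *g)
{
	return charge_sketch(g, 1) == CHARGE_OOM_HANDLED_SKETCH;
}
```

In this model a caller that holds a lock (standing in for i_mutex) gets an error back and can release the lock, so a victim blocked on the same lock is able to exit. The oom-disabled case, where tasks are expected to wait for userspace, is deliberately left out.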
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932846Ab2KZSEy (ORCPT ); Mon, 26 Nov 2012 13:04:54 -0500 Received: from mail-ea0-f174.google.com ([209.85.215.174]:60138 "EHLO mail-ea0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932640Ab2KZSEt (ORCPT ); Mon, 26 Nov 2012 13:04:49 -0500 Date: Mon, 26 Nov 2012 19:04:44 +0100 From: Michal Hocko To: Johannes Weiner Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki Subject: Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121126180444.GA12602@dhcp22.suse.cz> References: <20121122233434.3D5E35E6@pobox.sk> <20121123074023.GA24698@dhcp22.suse.cz> <20121123102137.10D6D653@pobox.sk> <20121123100438.GF24698@dhcp22.suse.cz> <20121125011047.7477BB5E@pobox.sk> <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <20121126174622.GE2799@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121126174622.GE2799@cmpxchg.org> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon 26-11-12 12:46:22, Johannes Weiner wrote: > On Mon, Nov 26, 2012 at 02:18:37PM +0100, Michal Hocko wrote: > > [CCing also Johannes - the thread started here: > > https://lkml.org/lkml/2012/11/21/497] > > > > On Mon 26-11-12 01:38:55, azurIt wrote: > > > >This is hackish but it should help you in this case. Kamezawa, what do > > > >you think about that? Should we generalize this and prepare something > > > >like mem_cgroup_cache_charge_locked which would add __GFP_NORETRY > > > >automatically and use the function whenever we are in a locked context? 
> > > >To be honest I do not like this very much but nothing more sensible > > > >(without touching non-memcg paths) comes to my mind. > > > > > > > > > I installed kernel with this patch, will report back if problem occurs > > > again OR in few weeks if everything will be ok. Thank you! > > > > Now that I am looking at the patch closer it will not work because it > > depends on other patch which is not merged yet and even that one would > > help on its own because __GFP_NORETRY doesn't break the charge loop. > > Sorry I have missed that... > > > > The patch bellow should help though. (it is based on top of the current > > -mm tree but I will send a backport to 3.2 in the reply as well) > > --- > > >From 7796f942d62081ad45726efd90b5292b80e7c690 Mon Sep 17 00:00:00 2001 > > From: Michal Hocko > > Date: Mon, 26 Nov 2012 11:47:57 +0100 > > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked > > > > memcg oom killer might deadlock if the process which falls down to > > mem_cgroup_handle_oom holds a lock which prevents other task to > > terminate because it is blocked on the very same lock. > > This can happen when a write system call needs to allocate a page but > > the allocation hits the memcg hard limit and there is nothing to reclaim > > (e.g. there is no swap or swap limit is hit as well and all cache pages > > have been reclaimed already) and the process selected by memcg OOM > > killer is blocked on i_mutex on the same inode (e.g. truncate it). 
> > > > Process A > > [] do_truncate+0x58/0xa0 # takes i_mutex > > [] do_last+0x250/0xa30 > > [] path_openat+0xd7/0x440 > > [] do_filp_open+0x49/0xa0 > > [] do_sys_open+0x106/0x240 > > [] sys_open+0x20/0x30 > > [] system_call_fastpath+0x18/0x1d > > [] 0xffffffffffffffff > > > > Process B > > [] mem_cgroup_handle_oom+0x241/0x3b0 > > [] T.1146+0x5ab/0x5c0 > > [] mem_cgroup_cache_charge+0xbe/0xe0 > > [] add_to_page_cache_locked+0x4c/0x140 > > [] add_to_page_cache_lru+0x22/0x50 > > [] grab_cache_page_write_begin+0x8b/0xe0 > > [] ext3_write_begin+0x88/0x270 > > [] generic_file_buffered_write+0x116/0x290 > > [] __generic_file_aio_write+0x27c/0x480 > > [] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex > > [] do_sync_write+0xea/0x130 > > [] vfs_write+0xf3/0x1f0 > > [] sys_write+0x51/0x90 > > [] system_call_fastpath+0x18/0x1d > > [] 0xffffffffffffffff > > So process B manages to lock the hierarchy, calls > mem_cgroup_out_of_memory() and retries the charge infinitely, waiting > for task A to die. All while it holds the i_mutex, preventing task A > from dying, right? Right. > I think global oom already handles this in a much better way: invoke > the OOM killer, sleep for a second, then return to userspace to > relinquish all kernel resources and locks. The only reason why we > can't simply change from an endless retry loop is because we don't > want to return VM_FAULT_OOM and invoke the global OOM killer. Exactly. > But maybe we can return a new VM_FAULT_OOM_HANDLED for memcg OOM and > just restart the pagefault. Return -ENOMEM to the buffered IO syscall > respectively. This way, the memcg OOM killer is invoked as it should > but nobody gets stuck anywhere livelocking with the exiting task. Hmm, we would still have a problem with oom disabled (aka user space OOM killer), right? All processes but those in mem_cgroup_handle_oom are risky to be killed. Other POV might be, why we should trigger an OOM killer from those paths in the first place. 
Write or read (or even readahead) are all calls that should rather fail than cause an OOM killer in my opinion. -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755207Ab2KZSYd (ORCPT ); Mon, 26 Nov 2012 13:24:33 -0500 Received: from zene.cmpxchg.org ([85.214.230.12]:35079 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754042Ab2KZSYb (ORCPT ); Mon, 26 Nov 2012 13:24:31 -0500 Date: Mon, 26 Nov 2012 13:24:21 -0500 From: Johannes Weiner To: Michal Hocko Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki Subject: Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121126182421.GB2301@cmpxchg.org> References: <20121123074023.GA24698@dhcp22.suse.cz> <20121123102137.10D6D653@pobox.sk> <20121123100438.GF24698@dhcp22.suse.cz> <20121125011047.7477BB5E@pobox.sk> <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <20121126174622.GE2799@cmpxchg.org> <20121126180444.GA12602@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121126180444.GA12602@dhcp22.suse.cz> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Nov 26, 2012 at 07:04:44PM +0100, Michal Hocko wrote: > On Mon 26-11-12 12:46:22, Johannes Weiner wrote: > > On Mon, Nov 26, 2012 at 02:18:37PM +0100, Michal Hocko wrote: > > > [CCing also Johannes - the thread started here: > > > https://lkml.org/lkml/2012/11/21/497] > > > > > > On Mon 26-11-12 01:38:55, azurIt wrote: > > > > >This is hackish but it should help you in this case. Kamezawa, what do > > > > >you think about that? 
Should we generalize this and prepare something > > > > >like mem_cgroup_cache_charge_locked which would add __GFP_NORETRY > > > > >automatically and use the function whenever we are in a locked context? > > > > >To be honest I do not like this very much but nothing more sensible > > > > >(without touching non-memcg paths) comes to my mind. > > > > > > > > > > > > I installed kernel with this patch, will report back if problem occurs > > > > again OR in few weeks if everything will be ok. Thank you! > > > > > > Now that I am looking at the patch closer it will not work because it > > > depends on other patch which is not merged yet and even that one would > > > help on its own because __GFP_NORETRY doesn't break the charge loop. > > > Sorry I have missed that... > > > > > > The patch bellow should help though. (it is based on top of the current > > > -mm tree but I will send a backport to 3.2 in the reply as well) > > > --- > > > >From 7796f942d62081ad45726efd90b5292b80e7c690 Mon Sep 17 00:00:00 2001 > > > From: Michal Hocko > > > Date: Mon, 26 Nov 2012 11:47:57 +0100 > > > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked > > > > > > memcg oom killer might deadlock if the process which falls down to > > > mem_cgroup_handle_oom holds a lock which prevents other task to > > > terminate because it is blocked on the very same lock. > > > This can happen when a write system call needs to allocate a page but > > > the allocation hits the memcg hard limit and there is nothing to reclaim > > > (e.g. there is no swap or swap limit is hit as well and all cache pages > > > have been reclaimed already) and the process selected by memcg OOM > > > killer is blocked on i_mutex on the same inode (e.g. truncate it). 
> > > > > > Process A > > > [] do_truncate+0x58/0xa0 # takes i_mutex > > > [] do_last+0x250/0xa30 > > > [] path_openat+0xd7/0x440 > > > [] do_filp_open+0x49/0xa0 > > > [] do_sys_open+0x106/0x240 > > > [] sys_open+0x20/0x30 > > > [] system_call_fastpath+0x18/0x1d > > > [] 0xffffffffffffffff > > > > > > Process B > > > [] mem_cgroup_handle_oom+0x241/0x3b0 > > > [] T.1146+0x5ab/0x5c0 > > > [] mem_cgroup_cache_charge+0xbe/0xe0 > > > [] add_to_page_cache_locked+0x4c/0x140 > > > [] add_to_page_cache_lru+0x22/0x50 > > > [] grab_cache_page_write_begin+0x8b/0xe0 > > > [] ext3_write_begin+0x88/0x270 > > > [] generic_file_buffered_write+0x116/0x290 > > > [] __generic_file_aio_write+0x27c/0x480 > > > [] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex > > > [] do_sync_write+0xea/0x130 > > > [] vfs_write+0xf3/0x1f0 > > > [] sys_write+0x51/0x90 > > > [] system_call_fastpath+0x18/0x1d > > > [] 0xffffffffffffffff > > > > So process B manages to lock the hierarchy, calls > > mem_cgroup_out_of_memory() and retries the charge infinitely, waiting > > for task A to die. All while it holds the i_mutex, preventing task A > > from dying, right? > > Right. > > > I think global oom already handles this in a much better way: invoke > > the OOM killer, sleep for a second, then return to userspace to > > relinquish all kernel resources and locks. The only reason why we > > can't simply change from an endless retry loop is because we don't > > want to return VM_FAULT_OOM and invoke the global OOM killer. > > Exactly. > > > But maybe we can return a new VM_FAULT_OOM_HANDLED for memcg OOM and > > just restart the pagefault. Return -ENOMEM to the buffered IO syscall > > respectively. This way, the memcg OOM killer is invoked as it should > > but nobody gets stuck anywhere livelocking with the exiting task. > > Hmm, we would still have a problem with oom disabled (aka user space OOM > killer), right? All processes but those in mem_cgroup_handle_oom are > risky to be killed. 
Could we still let everybody get stuck in there when the OOM killer is disabled and let userspace take care of it? > Other POV might be, why we should trigger an OOM killer from those paths > in the first place. Write or read (or even readahead) are all calls that > should rather fail than cause an OOM killer in my opinion. Readahead is arguable, but we kill globally for read() and write() and I think we should do the same for memcg. The OOM killer is there to resolve a problem that comes from overcommitting the machine but the overuse does not have to be from the application that pushes the machine over the edge; that's why we don't just kill the allocating task but actually go look for the best candidate. If you have one memory hog that overuses the resources, attempted memory consumption in a different program should invoke the OOM killer. It does not matter if this is a page fault (would still happen with your patch) or a buffered read/write (would no longer happen). From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755304Ab2KZTDf (ORCPT ); Mon, 26 Nov 2012 14:03:35 -0500 Received: from mail-wg0-f44.google.com ([74.125.82.44]:60058 "EHLO mail-wg0-f44.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755186Ab2KZTDd (ORCPT ); Mon, 26 Nov 2012 14:03:33 -0500 Date: Mon, 26 Nov 2012 20:03:29 +0100 From: Michal Hocko To: Johannes Weiner Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki Subject: Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121126190329.GB12602@dhcp22.suse.cz> References: <20121123102137.10D6D653@pobox.sk> <20121123100438.GF24698@dhcp22.suse.cz> <20121125011047.7477BB5E@pobox.sk> <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz>
<20121126174622.GE2799@cmpxchg.org> <20121126180444.GA12602@dhcp22.suse.cz> <20121126182421.GB2301@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121126182421.GB2301@cmpxchg.org> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon 26-11-12 13:24:21, Johannes Weiner wrote: > On Mon, Nov 26, 2012 at 07:04:44PM +0100, Michal Hocko wrote: > > On Mon 26-11-12 12:46:22, Johannes Weiner wrote: [...] > > > I think global oom already handles this in a much better way: invoke > > > the OOM killer, sleep for a second, then return to userspace to > > > relinquish all kernel resources and locks. The only reason why we > > > can't simply change from an endless retry loop is because we don't > > > want to return VM_FAULT_OOM and invoke the global OOM killer. > > > > Exactly. > > > > > But maybe we can return a new VM_FAULT_OOM_HANDLED for memcg OOM and > > > just restart the pagefault. Return -ENOMEM to the buffered IO syscall > > > respectively. This way, the memcg OOM killer is invoked as it should > > > but nobody gets stuck anywhere livelocking with the exiting task. > > > > Hmm, we would still have a problem with oom disabled (aka user space OOM > > killer), right? All processes but those in mem_cgroup_handle_oom are > > risky to be killed. > > Could we still let everybody get stuck in there when the OOM killer is > disabled and let userspace take care of it? I am not sure what exactly you mean by "userspace take care of it" but if those processes are stuck and holding the lock then it is usually hard to find that out. Well if somebody is familiar with internal then it is doable but this makes the interface really unusable for regular usage. > > Other POV might be, why we should trigger an OOM killer from those paths > > in the first place. 
Write or read (or even readahead) are all calls that > > should rather fail than cause an OOM killer in my opinion. > > Readahead is arguable, but we kill globally for read() and write() and > I think we should do the same for memcg. Fair point, but the global case is a little bit easier than memcg here because nobody can hook into the OOM killer and provide a userspace implementation for it, which is one of the cooler features of memcg... I am open to any suggestions but we should somehow fix this (and backport it to stable trees as this has been there for quite some time. The current report shows that the problem is not that hard to trigger). > The OOM killer is there to resolve a problem that comes from > overcommitting the machine but the overuse does not have to be from > the application that pushes the machine over the edge, that's why we > don't just kill the allocating task but actually go look for the best > candidate. If you have one memory hog that overuses the resources, > attempted memory consumption in a different program should invoke the > OOM killer. > It does not matter if this is a page fault (would still happen with > your patch) or a buffered read/write (would no longer happen). True, and it is sad that mmap then behaves slightly differently than read/write, which I should have mentioned in the changelog. As I said I am open to other suggestions.
Thanks -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755574Ab2KZT3x (ORCPT ); Mon, 26 Nov 2012 14:29:53 -0500 Received: from zene.cmpxchg.org ([85.214.230.12]:35093 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755496Ab2KZT3v (ORCPT ); Mon, 26 Nov 2012 14:29:51 -0500 Date: Mon, 26 Nov 2012 14:29:41 -0500 From: Johannes Weiner To: Michal Hocko Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki Subject: Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121126192941.GC2301@cmpxchg.org> References: <20121123100438.GF24698@dhcp22.suse.cz> <20121125011047.7477BB5E@pobox.sk> <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <20121126174622.GE2799@cmpxchg.org> <20121126180444.GA12602@dhcp22.suse.cz> <20121126182421.GB2301@cmpxchg.org> <20121126190329.GB12602@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121126190329.GB12602@dhcp22.suse.cz> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Nov 26, 2012 at 08:03:29PM +0100, Michal Hocko wrote: > On Mon 26-11-12 13:24:21, Johannes Weiner wrote: > > On Mon, Nov 26, 2012 at 07:04:44PM +0100, Michal Hocko wrote: > > > On Mon 26-11-12 12:46:22, Johannes Weiner wrote: > [...] > > > > I think global oom already handles this in a much better way: invoke > > > > the OOM killer, sleep for a second, then return to userspace to > > > > relinquish all kernel resources and locks. The only reason why we > > > > can't simply change from an endless retry loop is because we don't > > > > want to return VM_FAULT_OOM and invoke the global OOM killer. > > > > > > Exactly. 
> > > > > > > But maybe we can return a new VM_FAULT_OOM_HANDLED for memcg OOM and > > > > just restart the pagefault. Return -ENOMEM to the buffered IO syscall > > > > respectively. This way, the memcg OOM killer is invoked as it should > > > > but nobody gets stuck anywhere livelocking with the exiting task. > > > > > > Hmm, we would still have a problem with oom disabled (aka user space OOM > > > killer), right? All processes but those in mem_cgroup_handle_oom are > > > risky to be killed. > > > > Could we still let everybody get stuck in there when the OOM killer is > > disabled and let userspace take care of it? > > I am not sure what exactly you mean by "userspace take care of it" but > if those processes are stuck and holding the lock then it is usually > hard to find that out. Well if somebody is familiar with internal then > it is doable but this makes the interface really unusable for regular > usage. If oom_kill_disable is set, then all processes get stuck all the way down in the charge stack. Whatever resource they pin, you may deadlock on if you try to touch it while handling the problem from userspace. I don't see how this is a new problem...? Or do you mean something else? > > > Other POV might be, why we should trigger an OOM killer from those paths > > > in the first place. Write or read (or even readahead) are all calls that > > > should rather fail than cause an OOM killer in my opinion. > > > > Readahead is arguable, but we kill globally for read() and write() and > > I think we should do the same for memcg. > > Fair point but the global case is little bit easier than memcg in this > case because nobody can hook on OOM killer and provide a userspace > implementation for it which is one of the cooler feature of memcg... > I am all open to any suggestions but we should somehow fix this (and > backport it to stable trees as this is there for quite some time. The > current report shows that the problem is not that hard to trigger). 
As per above, the userspace OOM handling is risky as hell anyway. What happens when an anonymous fault waits in memcg userspace OOM while holding the mmap_sem, and a writer lines up behind it? Your userspace OOM handler had better not look at any of the /proc files of the stuck task that require the mmap_sem. By the same token, it probably shouldn't touch the same files a memcg task is stuck trying to read/write. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755881Ab2KZUI4 (ORCPT ); Mon, 26 Nov 2012 15:08:56 -0500 Received: from mail-vb0-f46.google.com ([209.85.212.46]:35610 "EHLO mail-vb0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755782Ab2KZUIy (ORCPT ); Mon, 26 Nov 2012 15:08:54 -0500 Date: Mon, 26 Nov 2012 21:08:48 +0100 From: Michal Hocko To: Johannes Weiner Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki Subject: Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121126200848.GC12602@dhcp22.suse.cz> References: <20121125011047.7477BB5E@pobox.sk> <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <20121126174622.GE2799@cmpxchg.org> <20121126180444.GA12602@dhcp22.suse.cz> <20121126182421.GB2301@cmpxchg.org> <20121126190329.GB12602@dhcp22.suse.cz> <20121126192941.GC2301@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121126192941.GC2301@cmpxchg.org> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon 26-11-12 14:29:41, Johannes Weiner wrote: > On Mon, Nov 26, 2012 at 08:03:29PM +0100, Michal Hocko wrote: > > On Mon 26-11-12 13:24:21, Johannes Weiner wrote: > > > On Mon, Nov 26, 2012 at 07:04:44PM
+0100, Michal Hocko wrote: > > > > On Mon 26-11-12 12:46:22, Johannes Weiner wrote: > > [...] > > > > > I think global oom already handles this in a much better way: invoke > > > > > the OOM killer, sleep for a second, then return to userspace to > > > > > relinquish all kernel resources and locks. The only reason why we > > > > > can't simply change from an endless retry loop is because we don't > > > > > want to return VM_FAULT_OOM and invoke the global OOM killer. > > > > > > > > Exactly. > > > > > > > > > But maybe we can return a new VM_FAULT_OOM_HANDLED for memcg OOM and > > > > > just restart the pagefault. Return -ENOMEM to the buffered IO syscall > > > > > respectively. This way, the memcg OOM killer is invoked as it should > > > > > but nobody gets stuck anywhere livelocking with the exiting task. > > > > > > > > Hmm, we would still have a problem with oom disabled (aka user space OOM > > > > killer), right? All processes but those in mem_cgroup_handle_oom are > > > > risky to be killed. > > > > > > Could we still let everybody get stuck in there when the OOM killer is > > > disabled and let userspace take care of it? > > > > I am not sure what exactly you mean by "userspace take care of it" but > > if those processes are stuck and holding the lock then it is usually > > hard to find that out. Well if somebody is familiar with internal then > > it is doable but this makes the interface really unusable for regular > > usage. > > If oom_kill_disable is set, then all processes get stuck all the way > down in the charge stack. Whatever resource they pin, you may > deadlock on if you try to touch it while handling the problem from > userspace. OK, I guess I am getting what you are trying to say. So what you are suggesting is to just let mem_cgroup_out_of_memory send the signal and move on without retry (or with few charge retries without further OOM killing) and fail the charge with your new FAULT_OOM_HANDLED (resp. something like FAULT_RETRY) error code resp. 
ENOMEM depending on the caller. OOM disabled case would be "you are on your own" because this has been dangerous anyway. Correct? I do agree that the current endless retry loop is far from being ideal and can see some updates but I am quite nervous about any potential regressions in this area (e.g. too aggressive OOM etc...). I have to think about it some more. Anyway if you have some more specific ideas I would be happy to review patches. [...] Thanks -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756158Ab2KZUT2 (ORCPT ); Mon, 26 Nov 2012 15:19:28 -0500 Received: from zene.cmpxchg.org ([85.214.230.12]:35100 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756067Ab2KZUTZ (ORCPT ); Mon, 26 Nov 2012 15:19:25 -0500 Date: Mon, 26 Nov 2012 15:19:18 -0500 From: Johannes Weiner To: Michal Hocko Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki Subject: Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121126201918.GD2301@cmpxchg.org> References: <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <20121126174622.GE2799@cmpxchg.org> <20121126180444.GA12602@dhcp22.suse.cz> <20121126182421.GB2301@cmpxchg.org> <20121126190329.GB12602@dhcp22.suse.cz> <20121126192941.GC2301@cmpxchg.org> <20121126200848.GC12602@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121126200848.GC12602@dhcp22.suse.cz> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Nov 26, 2012 at 09:08:48PM +0100, Michal Hocko wrote: > On Mon 26-11-12 14:29:41, Johannes Weiner wrote: > > On Mon, Nov 26, 2012 at 08:03:29PM +0100, Michal Hocko wrote: > > > On Mon 
26-11-12 13:24:21, Johannes Weiner wrote: > > > > On Mon, Nov 26, 2012 at 07:04:44PM +0100, Michal Hocko wrote: > > > > > On Mon 26-11-12 12:46:22, Johannes Weiner wrote: > > > [...] > > > > > > I think global oom already handles this in a much better way: invoke > > > > > > the OOM killer, sleep for a second, then return to userspace to > > > > > > relinquish all kernel resources and locks. The only reason why we > > > > > > can't simply change from an endless retry loop is because we don't > > > > > > want to return VM_FAULT_OOM and invoke the global OOM killer. > > > > > > > > > > Exactly. > > > > > > > > > > > But maybe we can return a new VM_FAULT_OOM_HANDLED for memcg OOM and > > > > > > just restart the pagefault. Return -ENOMEM to the buffered IO syscall > > > > > > respectively. This way, the memcg OOM killer is invoked as it should > > > > > > but nobody gets stuck anywhere livelocking with the exiting task. > > > > > > > > > > Hmm, we would still have a problem with oom disabled (aka user space OOM > > > > > killer), right? All processes but those in mem_cgroup_handle_oom are > > > > > risky to be killed. > > > > > > > > Could we still let everybody get stuck in there when the OOM killer is > > > > disabled and let userspace take care of it? > > > > > > I am not sure what exactly you mean by "userspace take care of it" but > > > if those processes are stuck and holding the lock then it is usually > > > hard to find that out. Well if somebody is familiar with internal then > > > it is doable but this makes the interface really unusable for regular > > > usage. > > > > If oom_kill_disable is set, then all processes get stuck all the way > > down in the charge stack. Whatever resource they pin, you may > > deadlock on if you try to touch it while handling the problem from > > userspace. > > OK, I guess I am getting what you are trying to say. 
So what you are > suggesting is to just let mem_cgroup_out_of_memory send the signal and > move on without retry (or with few charge retries without further OOM > killing) and fail the charge with your new FAULT_OOM_HANDLED (resp. > something like FAULT_RETRY) error code resp. ENOMEM depending on the > caller. OOM disabled case would be "you are on your own" because this > has been dangerous anyway. Correct? Yes. > I do agree that the current endless retry loop is far from being ideal > and can see some updates but I am quite nervous about any potential > regressions in this area (e.g. too aggressive OOM etc...). I have to > think about it some more. Agreed on all points. Maybe we can keep a couple of the oom retry iterations or something like that, which is still much more than what global does and I don't think the global OOM killer is overly eager. Testing will show more. > Anyway if you have some more specific ideas I would be happy to review > patches. Okay, I just wanted to check back with you before going down this path. What are we going to do short term, though? Do you want to push the disable-oom-for-pagecache for now or should we put the VM_FAULT_OOM_HANDLED fix in the next version and do stable backports? This issue has been around for a while so frankly I don't think it's urgent enough to rush things. 
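A minimal userspace sketch of the bounded-retry idea discussed above, not kernel code. It assumes a toy memcg model: every toy_* name, TOY_OOM_RETRIES and TOY_CHARGE_OOM_HANDLED is hypothetical and only stands in for the proposed VM_FAULT_OOM_HANDLED (or -ENOMEM) result. What it tries to show: signal the victim once, retry the charge a few times, then fail back to the caller so it can unwind and drop its locks instead of looping forever.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result codes for this sketch. */
enum toy_charge_result {
	TOY_CHARGE_OK,          /* the pages fit under the limit */
	TOY_CHARGE_OOM_HANDLED, /* victim signalled; caller should unwind and retry */
};

/* Assumed bound on charge retries after the OOM kill. */
#define TOY_OOM_RETRIES 3

/* Toy stand-in for a memory cgroup; sizes are in pages. */
struct toy_memcg {
	size_t usage;
	size_t limit;
	size_t victim_pages;   /* memory the OOM victim frees once it exits */
	bool victim_blocked;   /* victim waits on a lock the charger holds */
	bool victim_signalled; /* the OOM killer has already picked it */
	unsigned int oom_kills;
};

static bool toy_try_charge(struct toy_memcg *cg, size_t nr_pages)
{
	if (cg->usage + nr_pages > cg->limit)
		return false;
	cg->usage += nr_pages;
	return true;
}

/* The victim can only exit, and free its memory, once it is unblocked. */
static void toy_victim_progress(struct toy_memcg *cg)
{
	if (!cg->victim_signalled || cg->victim_blocked)
		return;
	assert(cg->usage >= cg->victim_pages);
	cg->usage -= cg->victim_pages;
	cg->victim_pages = 0;
}

enum toy_charge_result toy_charge(struct toy_memcg *cg, size_t nr_pages)
{
	if (toy_try_charge(cg, nr_pages))
		return TOY_CHARGE_OK;

	/* Invoke the (toy) OOM killer at most once per victim. */
	if (!cg->victim_signalled) {
		cg->victim_signalled = true;
		cg->oom_kills++;
	}

	/* Bounded retries instead of an endless loop. */
	for (int i = 0; i < TOY_OOM_RETRIES; i++) {
		toy_victim_progress(cg);
		if (toy_try_charge(cg, nr_pages))
			return TOY_CHARGE_OK;
	}
	return TOY_CHARGE_OOM_HANDLED;
}

/* Walks through the blocked-victim case and the retry after unwinding. */
void toy_demo(void)
{
	struct toy_memcg cg = {
		.usage = 100, .limit = 100,
		.victim_pages = 10, .victim_blocked = true,
	};

	/* Victim is stuck behind our lock: give up instead of spinning. */
	assert(toy_charge(&cg, 1) == TOY_CHARGE_OOM_HANDLED);

	/* The caller unwound and dropped the lock; the restarted charge fits. */
	cg.victim_blocked = false;
	assert(toy_charge(&cg, 1) == TOY_CHARGE_OK);
}
```

Calling toy_demo() from a main() exercises both phases. The retry bound of 3 is arbitrary here; the thread only says "a couple of the oom retry iterations", and the OOM-disabled case is deliberately not modelled.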
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756379Ab2KZUx7 (ORCPT ); Mon, 26 Nov 2012 15:53:59 -0500 Received: from zene.cmpxchg.org ([85.214.230.12]:35105 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755563Ab2KZUx6 (ORCPT ); Mon, 26 Nov 2012 15:53:58 -0500 Date: Mon, 26 Nov 2012 15:53:49 -0500 From: Johannes Weiner To: azurIt Cc: Michal Hocko , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki Subject: Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121126205349.GE2301@cmpxchg.org> References: <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <20121126174622.GE2799@cmpxchg.org> <20121126180444.GA12602@dhcp22.suse.cz> <20121126182421.GB2301@cmpxchg.org> <20121126190329.GB12602@dhcp22.suse.cz> <20121126192941.GC2301@cmpxchg.org> <20121126200848.GC12602@dhcp22.suse.cz> <20121126201918.GD2301@cmpxchg.org> <20121126214638.64723F01@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121126214638.64723F01@pobox.sk> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Nov 26, 2012 at 09:46:38PM +0100, azurIt wrote: > >This issue has been around for a while so frankly I don't think it's > >urgent enough to rush things. > > > Well, it's quite urgent at least for us :( i wasn't reported this so > far cos i wasn't sure it's a kernel thing. I will be really happy > and thankfull if fix for this can go to 3.2 in some near > future.. Thank you very much! I understand and of course it's important that we get it fixed as soon as possible. All I meant was that this problem has not exactly been introduced in 3.7 and the fix is non-trivial so we should not be rushing a change like this into 3.7 just days before its release. 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756695Ab2KZUqn (ORCPT ); Mon, 26 Nov 2012 15:46:43 -0500 Received: from gmmr7.centrum.cz ([46.255.225.249]:44960 "EHLO gmmr7.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756432Ab2KZUql (ORCPT ); Mon, 26 Nov 2012 15:46:41 -0500 To: Johannes Weiner , Michal Hocko Subject: Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked Date: Mon, 26 Nov 2012 21:46:38 +0100 From: "azurIt" Cc: , , cgroups mailinglist , KAMEZAWA Hiroyuki References: <20121125120524.GB10623@dhcp22.suse.cz>, <20121125135542.GE10623@dhcp22.suse.cz>, <20121126013855.AF118F5E@pobox.sk>, <20121126131837.GC17860@dhcp22.suse.cz>, <20121126174622.GE2799@cmpxchg.org>, <20121126180444.GA12602@dhcp22.suse.cz>, <20121126182421.GB2301@cmpxchg.org>, <20121126190329.GB12602@dhcp22.suse.cz>, <20121126192941.GC2301@cmpxchg.org>, <20121126200848.GC12602@dhcp22.suse.cz> <20121126201918.GD2301@cmpxchg.org> In-Reply-To: <20121126201918.GD2301@cmpxchg.org> X-Mailer: Centrum Email 5.3 X-Priority: 3 X-Original-From: azurit@pobox.sk MIME-Version: 1.0 Message-Id: <20121126214638.64723F01@pobox.sk> X-Maser: brud Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org >This issue has been around for a while so frankly I don't think it's >urgent enough to rush things. Well, it's quite urgent, at least for us :( I haven't reported this so far because I wasn't sure it's a kernel thing. I will be really happy and thankful if a fix for this can go into 3.2 in the near future. Thank you very much!
azur From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757166Ab2KZV2a (ORCPT ); Mon, 26 Nov 2012 16:28:30 -0500 Received: from gmmr3.centrum.cz ([46.255.225.251]:58672 "EHLO gmmr3.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755263Ab2KZV23 (ORCPT ); Mon, 26 Nov 2012 16:28:29 -0500 To: Michal Hocko Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Date: Mon, 26 Nov 2012 22:28:26 +0100 From: "azurIt" Cc: , , cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner References: <20121122214249.GA20319@dhcp22.suse.cz>, <20121122233434.3D5E35E6@pobox.sk>, <20121123074023.GA24698@dhcp22.suse.cz>, <20121123102137.10D6D653@pobox.sk>, <20121123100438.GF24698@dhcp22.suse.cz>, <20121125011047.7477BB5E@pobox.sk>, <20121125120524.GB10623@dhcp22.suse.cz>, <20121125135542.GE10623@dhcp22.suse.cz>, <20121126013855.AF118F5E@pobox.sk>, <20121126131837.GC17860@dhcp22.suse.cz> <20121126132149.GD17860@dhcp22.suse.cz> In-Reply-To: <20121126132149.GD17860@dhcp22.suse.cz> X-Mailer: Centrum Email 5.3 X-Priority: 3 X-Original-From: azurit@pobox.sk MIME-Version: 1.0 Message-Id: <20121126222826.3843D563@pobox.sk> X-Maser: brud Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org >Here we go with the patch for 3.2.34. Could you test with this one, >please? Michal, regarding your conversation with Johannes Weiner, should I try this patch or not?
azur From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757306Ab2KZWGp (ORCPT ); Mon, 26 Nov 2012 17:06:45 -0500 Received: from mail-wi0-f178.google.com ([209.85.212.178]:63288 "EHLO mail-wi0-f178.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755818Ab2KZWGn (ORCPT ); Mon, 26 Nov 2012 17:06:43 -0500 Date: Mon, 26 Nov 2012 23:06:40 +0100 From: Michal Hocko To: Johannes Weiner Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki Subject: Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121126220640.GE12602@dhcp22.suse.cz> References: <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <20121126174622.GE2799@cmpxchg.org> <20121126180444.GA12602@dhcp22.suse.cz> <20121126182421.GB2301@cmpxchg.org> <20121126190329.GB12602@dhcp22.suse.cz> <20121126192941.GC2301@cmpxchg.org> <20121126200848.GC12602@dhcp22.suse.cz> <20121126201918.GD2301@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121126201918.GD2301@cmpxchg.org> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon 26-11-12 15:19:18, Johannes Weiner wrote: > On Mon, Nov 26, 2012 at 09:08:48PM +0100, Michal Hocko wrote: [...] > > OK, I guess I am getting what you are trying to say. So what you are > > suggesting is to just let mem_cgroup_out_of_memory send the signal and > > move on without retry (or with few charge retries without further OOM > > killing) and fail the charge with your new FAULT_OOM_HANDLED (resp. > > something like FAULT_RETRY) error code resp. ENOMEM depending on the > > caller. OOM disabled case would be "you are on your own" because this > > has been dangerous anyway. Correct? > > Yes. 
> > > I do agree that the current endless retry loop is far from being ideal > > and can see some updates but I am quite nervous about any potential > > regressions in this area (e.g. too aggressive OOM etc...). I have to > > think about it some more. > > Agreed on all points. Maybe we can keep a couple of the oom retry > iterations or something like that, which is still much more than what > global does and I don't think the global OOM killer is overly eager. Yes we can offer less blood and more confort > > Testing will show more. > > > Anyway if you have some more specific ideas I would be happy to review > > patches. > > Okay, I just wanted to check back with you before going down this > path. What are we going to do short term, though? Do you want to > push the disable-oom-for-pagecache for now or should we put the > VM_FAULT_OOM_HANDLED fix in the next version and do stable backports? > > This issue has been around for a while so frankly I don't think it's > urgent enough to rush things. Yes, but now we have a real usecase where this hurts AFAIU. Unless we come up with a fix/reasonable workaround I would rather go with something simpler for starter and more sofisticated later. I have to double check other places where we do charging but the last time I've checked we don't hold page locks on already visible pages (we do precharge in __do_fault f.e.), mem_map for reading in the page fault path is also safe (with oom enabled) and I guess that tmpfs is ok as well. Then we have a page cache and that one should be covered by my patch. So we should be covered. But I like your idea long term. Thanks! 
-- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932992Ab2K0AGM (ORCPT ); Mon, 26 Nov 2012 19:06:12 -0500 Received: from fgwmail5.fujitsu.co.jp ([192.51.44.35]:42592 "EHLO fgwmail5.fujitsu.co.jp" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932209Ab2K0AGL (ORCPT ); Mon, 26 Nov 2012 19:06:11 -0500 X-SecurityPolicyCheck: OK by SHieldMailChecker v1.8.4 Message-ID: <50B403CA.501@jp.fujitsu.com> Date: Tue, 27 Nov 2012 09:05:30 +0900 From: Kamezawa Hiroyuki User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:16.0) Gecko/20121026 Thunderbird/16.0.2 MIME-Version: 1.0 To: Michal Hocko CC: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , Johannes Weiner Subject: Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked References: <20121122190526.390C7A28@pobox.sk> <20121122214249.GA20319@dhcp22.suse.cz> <20121122233434.3D5E35E6@pobox.sk> <20121123074023.GA24698@dhcp22.suse.cz> <20121123102137.10D6D653@pobox.sk> <20121123100438.GF24698@dhcp22.suse.cz> <20121125011047.7477BB5E@pobox.sk> <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> In-Reply-To: <20121126131837.GC17860@dhcp22.suse.cz> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org (2012/11/26 22:18), Michal Hocko wrote: > [CCing also Johannes - the thread started here: > https://lkml.org/lkml/2012/11/21/497] > > On Mon 26-11-12 01:38:55, azurIt wrote: >>> This is hackish but it should help you in this case. Kamezawa, what do >>> you think about that? 
Should we generalize this and prepare something >>> like mem_cgroup_cache_charge_locked which would add __GFP_NORETRY >>> automatically and use the function whenever we are in a locked context? >>> To be honest I do not like this very much but nothing more sensible >>> (without touching non-memcg paths) comes to my mind. >> >> >> I installed kernel with this patch, will report back if problem occurs >> again OR in few weeks if everything will be ok. Thank you! > > Now that I am looking at the patch closer it will not work because it > depends on another patch which is not merged yet and even that one would > not help on its own because __GFP_NORETRY doesn't break the charge loop. > Sorry I have missed that... > > The patch below should help though. (it is based on top of the current > -mm tree but I will send a backport to 3.2 in the reply as well) > --- > From 7796f942d62081ad45726efd90b5292b80e7c690 Mon Sep 17 00:00:00 2001 > From: Michal Hocko > Date: Mon, 26 Nov 2012 11:47:57 +0100 > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked > > memcg oom killer might deadlock if the process which falls down to > mem_cgroup_handle_oom holds a lock which prevents another task from > terminating because it is blocked on the very same lock. > This can happen when a write system call needs to allocate a page but > the allocation hits the memcg hard limit and there is nothing to reclaim > (e.g. there is no swap or swap limit is hit as well and all cache pages > have been reclaimed already) and the process selected by memcg OOM > killer is blocked on i_mutex on the same inode (e.g. it is truncating the file).
> > Process A > [] do_truncate+0x58/0xa0 # takes i_mutex > [] do_last+0x250/0xa30 > [] path_openat+0xd7/0x440 > [] do_filp_open+0x49/0xa0 > [] do_sys_open+0x106/0x240 > [] sys_open+0x20/0x30 > [] system_call_fastpath+0x18/0x1d > [] 0xffffffffffffffff > > Process B > [] mem_cgroup_handle_oom+0x241/0x3b0 > [] T.1146+0x5ab/0x5c0 > [] mem_cgroup_cache_charge+0xbe/0xe0 > [] add_to_page_cache_locked+0x4c/0x140 > [] add_to_page_cache_lru+0x22/0x50 > [] grab_cache_page_write_begin+0x8b/0xe0 > [] ext3_write_begin+0x88/0x270 > [] generic_file_buffered_write+0x116/0x290 > [] __generic_file_aio_write+0x27c/0x480 > [] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex > [] do_sync_write+0xea/0x130 > [] vfs_write+0xf3/0x1f0 > [] sys_write+0x51/0x90 > [] system_call_fastpath+0x18/0x1d > [] 0xffffffffffffffff > > This is not a hard deadlock though because administrator can still > intervene and increase the limit on the group which helps the writer to > finish the allocation and release the lock. > > This patch heals the problem by forbidding OOM from page cache charges > (namely add_ro_page_cache_locked). mem_cgroup_cache_charge_no_oom helper > function is defined which adds GFP_MEMCG_NO_OOM to the gfp mask which > then tells mem_cgroup_charge_common that OOM is not allowed for the > charge. No OOM from this path, except for fixing the bug, also make some > sense as we really do not want to cause an OOM because of a page cache > usage. > As a possibly visible result add_to_page_cache_lru might fail more often > with ENOMEM but this is to be expected if the limit is set and it is > preferable than OOM killer IMO. > > __GFP_NORETRY is abused for this memcg specific flag because it has been > used to prevent from OOM already (since not-merged-yet "memcg: reclaim > when more than one page needed"). The only difference is that the flag > doesn't prevent from reclaim anymore which kind of makes sense because > the global memory allocator triggers reclaim as well. 
The retry without > any reclaim on __GFP_NORETRY doesn't make much sense anyway because this > is effectively a busy loop with allowed OOM in this path. > > Reported-by: azurIt > Signed-off-by: Michal Hocko As a short term fix, I think this patch will work enough and seems simple enough. Acked-by: KAMEZAWA Hiroyuki Reading discussion between you and Johannes, to release locks, I understand the memcg need to return "RETRY" for a long term fix. Thinking a little, it will be simple to return "RETRY" to all processes waited on oom kill queue of a memcg and it can be done by a small fixes to memory.c. Thank you. -Kame > --- > include/linux/gfp.h | 3 +++ > include/linux/memcontrol.h | 12 ++++++++++++ > mm/filemap.c | 8 +++++++- > mm/memcontrol.c | 5 +---- > 4 files changed, 23 insertions(+), 5 deletions(-) > > diff --git a/include/linux/gfp.h b/include/linux/gfp.h > index 10e667f..aac9b21 100644 > --- a/include/linux/gfp.h > +++ b/include/linux/gfp.h > @@ -152,6 +152,9 @@ struct vm_area_struct; > /* 4GB DMA on some platforms */ > #define GFP_DMA32 __GFP_DMA32 > > +/* memcg oom killer is not allowed */ > +#define GFP_MEMCG_NO_OOM __GFP_NORETRY > + > /* Convert GFP flags to their corresponding migrate type */ > static inline int allocflags_to_migratetype(gfp_t gfp_flags) > { > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 095d2b4..1ad4bc6 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -65,6 +65,12 @@ extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg); > extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, > gfp_t gfp_mask); > > +static inline int mem_cgroup_cache_charge_no_oom(struct page *page, > + struct mm_struct *mm, gfp_t gfp_mask) > +{ > + return mem_cgroup_cache_charge(page, mm, gfp_mask | GFP_MEMCG_NO_OOM); > +} > + > struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *); > struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone 
*); > > @@ -215,6 +221,12 @@ static inline int mem_cgroup_cache_charge(struct page *page, > return 0; > } > > +static inline int mem_cgroup_cache_charge_no_oom(struct page *page, > + struct mm_struct *mm, gfp_t gfp_mask) > +{ > + return 0; > +} > + > static inline int mem_cgroup_try_charge_swapin(struct mm_struct *mm, > struct page *page, gfp_t gfp_mask, struct mem_cgroup **memcgp) > { > diff --git a/mm/filemap.c b/mm/filemap.c > index 83efee7..ef14351 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -447,7 +447,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, > VM_BUG_ON(!PageLocked(page)); > VM_BUG_ON(PageSwapBacked(page)); > > - error = mem_cgroup_cache_charge(page, current->mm, > + /* > + * Cannot trigger OOM even if gfp_mask would allow that normally > + * because we might be called from a locked context and that > + * could lead to deadlocks if the killed process is waiting for > + * the same lock. > + */ > + error = mem_cgroup_cache_charge_no_oom(page, current->mm, > gfp_mask & GFP_RECLAIM_MASK); > if (error) > goto out; > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 02ee2f7..b4754ba 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -2430,9 +2430,6 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, > if (!(gfp_mask & __GFP_WAIT)) > return CHARGE_WOULDBLOCK; > > - if (gfp_mask & __GFP_NORETRY) > - return CHARGE_NOMEM; > - > ret = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags); > if (mem_cgroup_margin(mem_over_limit) >= nr_pages) > return CHARGE_RETRY; > @@ -3713,7 +3710,7 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm, > { > struct mem_cgroup *memcg = NULL; > unsigned int nr_pages = 1; > - bool oom = true; > + bool oom = !(gfp_mask & GFP_MEMCG_NO_OOM); > int ret; > > if (PageTransHuge(page)) { > From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933242Ab2K0Jy5
(ORCPT ); Tue, 27 Nov 2012 04:54:57 -0500 Received: from cantor2.suse.de ([195.135.220.15]:36524 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758342Ab2K0Jyy (ORCPT ); Tue, 27 Nov 2012 04:54:54 -0500 Date: Tue, 27 Nov 2012 10:54:52 +0100 From: Michal Hocko To: Kamezawa Hiroyuki Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , Johannes Weiner Subject: Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121127095452.GD20537@dhcp22.suse.cz> References: <20121122233434.3D5E35E6@pobox.sk> <20121123074023.GA24698@dhcp22.suse.cz> <20121123102137.10D6D653@pobox.sk> <20121123100438.GF24698@dhcp22.suse.cz> <20121125011047.7477BB5E@pobox.sk> <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <50B403CA.501@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <50B403CA.501@jp.fujitsu.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue 27-11-12 09:05:30, KAMEZAWA Hiroyuki wrote: [...] > As a short term fix, I think this patch will work enough and seems simple enough. > Acked-by: KAMEZAWA Hiroyuki Thanks! If Johannes is also ok with this for now I will resubmit the patch to Andrew after I hear back from the reporter. > Reading discussion between you and Johannes, to release locks, I understand > the memcg need to return "RETRY" for a long term fix. Thinking a little, > it will be simple to return "RETRY" to all processes waited on oom kill queue > of a memcg and it can be done by a small fixes to memory.c. I wouldn't call it simple but it is doable. 
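For readers following the patch under discussion, a small userspace sketch of its flag convention, under stated assumptions: the toy_* names and the bit values are invented here and are not the kernel's. It illustrates the two halves the changelog describes: the page cache wrapper ORs a no-OOM marker into the gfp mask, and the charge path has to test that marker with a bitwise AND before deciding whether OOM is allowed.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int toy_gfp_t;

/* Bit values are invented for this sketch; the kernel's differ. */
#define TOY_GFP_WAIT          0x10u
#define TOY_GFP_NORETRY       0x1000u
#define TOY_GFP_MEMCG_NO_OOM  TOY_GFP_NORETRY

/* Plays the role of the no_oom charge wrapper: tag the request. */
toy_gfp_t toy_no_oom_mask(toy_gfp_t gfp_mask)
{
	return gfp_mask | TOY_GFP_MEMCG_NO_OOM;
}

/* Plays the role of the check in the common charge path. */
bool toy_oom_allowed(toy_gfp_t gfp_mask)
{
	/* Test the bit with '&'. OR-ing in a nonzero flag never yields zero. */
	return !(gfp_mask & TOY_GFP_MEMCG_NO_OOM);
}

/* A fault-style mask may still OOM; a page-cache-style mask may not. */
void toy_flag_demo(void)
{
	toy_gfp_t fault_mask = TOY_GFP_WAIT;
	toy_gfp_t cache_mask = toy_no_oom_mask(TOY_GFP_WAIT);

	assert(toy_oom_allowed(fault_mask));
	assert(!toy_oom_allowed(cache_mask));
	assert(cache_mask & TOY_GFP_WAIT); /* other bits are preserved */
}
```

Calling toy_flag_demo() from a main() runs the checks. Reusing the NORETRY bit as the marker mirrors the patch's own "abuse" of __GFP_NORETRY; a dedicated bit would behave the same way in this sketch.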
-- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755659Ab2K0Ts3 (ORCPT ); Tue, 27 Nov 2012 14:48:29 -0500 Received: from zene.cmpxchg.org ([85.214.230.12]:35168 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753938Ab2K0Ts1 (ORCPT ); Tue, 27 Nov 2012 14:48:27 -0500 Date: Tue, 27 Nov 2012 14:48:13 -0500 From: Johannes Weiner To: Kamezawa Hiroyuki Cc: Michal Hocko , azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist Subject: Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121127194813.GP24381@cmpxchg.org> References: <20121122233434.3D5E35E6@pobox.sk> <20121123074023.GA24698@dhcp22.suse.cz> <20121123102137.10D6D653@pobox.sk> <20121123100438.GF24698@dhcp22.suse.cz> <20121125011047.7477BB5E@pobox.sk> <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <50B403CA.501@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <50B403CA.501@jp.fujitsu.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Nov 27, 2012 at 09:05:30AM +0900, Kamezawa Hiroyuki wrote: > (2012/11/26 22:18), Michal Hocko wrote: > >[CCing also Johannes - the thread started here: > >https://lkml.org/lkml/2012/11/21/497] > > > >On Mon 26-11-12 01:38:55, azurIt wrote: > >>>This is hackish but it should help you in this case. Kamezawa, what do > >>>you think about that? Should we generalize this and prepare something > >>>like mem_cgroup_cache_charge_locked which would add __GFP_NORETRY > >>>automatically and use the function whenever we are in a locked context? 
> >>>To be honest I do not like this very much but nothing more sensible > >>>(without touching non-memcg paths) comes to my mind. > >> > >> > >>I installed kernel with this patch, will report back if problem occurs > >>again OR in few weeks if everything will be ok. Thank you! > > > >Now that I am looking at the patch closer it will not work because it > >depends on other patch which is not merged yet and even that one would > >help on its own because __GFP_NORETRY doesn't break the charge loop. > >Sorry I have missed that... > > > >The patch bellow should help though. (it is based on top of the current > >-mm tree but I will send a backport to 3.2 in the reply as well) > >--- > > From 7796f942d62081ad45726efd90b5292b80e7c690 Mon Sep 17 00:00:00 2001 > >From: Michal Hocko > >Date: Mon, 26 Nov 2012 11:47:57 +0100 > >Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked > > > >memcg oom killer might deadlock if the process which falls down to > >mem_cgroup_handle_oom holds a lock which prevents other task to > >terminate because it is blocked on the very same lock. > >This can happen when a write system call needs to allocate a page but > >the allocation hits the memcg hard limit and there is nothing to reclaim > >(e.g. there is no swap or swap limit is hit as well and all cache pages > >have been reclaimed already) and the process selected by memcg OOM > >killer is blocked on i_mutex on the same inode (e.g. truncate it). 
> > > >Process A > >[] do_truncate+0x58/0xa0 # takes i_mutex > >[] do_last+0x250/0xa30 > >[] path_openat+0xd7/0x440 > >[] do_filp_open+0x49/0xa0 > >[] do_sys_open+0x106/0x240 > >[] sys_open+0x20/0x30 > >[] system_call_fastpath+0x18/0x1d > >[] 0xffffffffffffffff > > > >Process B > >[] mem_cgroup_handle_oom+0x241/0x3b0 > >[] T.1146+0x5ab/0x5c0 > >[] mem_cgroup_cache_charge+0xbe/0xe0 > >[] add_to_page_cache_locked+0x4c/0x140 > >[] add_to_page_cache_lru+0x22/0x50 > >[] grab_cache_page_write_begin+0x8b/0xe0 > >[] ext3_write_begin+0x88/0x270 > >[] generic_file_buffered_write+0x116/0x290 > >[] __generic_file_aio_write+0x27c/0x480 > >[] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex > >[] do_sync_write+0xea/0x130 > >[] vfs_write+0xf3/0x1f0 > >[] sys_write+0x51/0x90 > >[] system_call_fastpath+0x18/0x1d > >[] 0xffffffffffffffff > > > >This is not a hard deadlock though because administrator can still > >intervene and increase the limit on the group which helps the writer to > >finish the allocation and release the lock. > > > >This patch heals the problem by forbidding OOM from page cache charges > >(namely add_ro_page_cache_locked). mem_cgroup_cache_charge_no_oom helper > >function is defined which adds GFP_MEMCG_NO_OOM to the gfp mask which > >then tells mem_cgroup_charge_common that OOM is not allowed for the > >charge. No OOM from this path, except for fixing the bug, also make some > >sense as we really do not want to cause an OOM because of a page cache > >usage. > >As a possibly visible result add_to_page_cache_lru might fail more often > >with ENOMEM but this is to be expected if the limit is set and it is > >preferable than OOM killer IMO. > > > >__GFP_NORETRY is abused for this memcg specific flag because it has been > >used to prevent from OOM already (since not-merged-yet "memcg: reclaim > >when more than one page needed"). 
The only difference is that the flag > >doesn't prevent from reclaim anymore which kind of makes sense because > >the global memory allocator triggers reclaim as well. The retry without > >any reclaim on __GFP_NORETRY doesn't make much sense anyway because this > >is effectively a busy loop with allowed OOM in this path. > > > >Reported-by: azurIt > >Signed-off-by: Michal Hocko > > As a short term fix, I think this patch will work enough and seems simple enough. > Acked-by: KAMEZAWA Hiroyuki Yes, let's do this for now. > >diff --git a/include/linux/gfp.h b/include/linux/gfp.h > >index 10e667f..aac9b21 100644 > >--- a/include/linux/gfp.h > >+++ b/include/linux/gfp.h > >@@ -152,6 +152,9 @@ struct vm_area_struct; > > /* 4GB DMA on some platforms */ > > #define GFP_DMA32 __GFP_DMA32 > > > >+/* memcg oom killer is not allowed */ > >+#define GFP_MEMCG_NO_OOM __GFP_NORETRY Could we leave this within memcg, please? An extra flag to mem_cgroup_cache_charge() or the like. GFP flags are about controlling the page allocator, this seems abusive. We have an oom flag down in try_charge, maybe just propagate this up the stack? > >diff --git a/mm/filemap.c b/mm/filemap.c > >index 83efee7..ef14351 100644 > >--- a/mm/filemap.c > >+++ b/mm/filemap.c > >@@ -447,7 +447,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, > > VM_BUG_ON(!PageLocked(page)); > > VM_BUG_ON(PageSwapBacked(page)); > > > >- error = mem_cgroup_cache_charge(page, current->mm, > >+ /* > >+ * Cannot trigger OOM even if gfp_mask would allow that normally > >+ * because we might be called from a locked context and that > >+ * could lead to deadlocks if the killed process is waiting for > >+ * the same lock. > >+ */ > >+ error = mem_cgroup_cache_charge_no_oom(page, current->mm, > > gfp_mask & GFP_RECLAIM_MASK); > > if (error) > > goto out; Shmem does not use this function but also charges under the i_mutex in the write path and fallocate at least. 
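As a rough illustration of the suggestion above (an explicit OOM flag handed down the charge path rather than an overloaded GFP bit), here is a small user-space C model. It is not kernel code and not part of any posted patch: struct toy_memcg, the toy_* functions and the charge_result values are all invented for this sketch, and reclaim is assumed to always fail so that the hard limit is the only thing that decides the outcome.

```c
#include <stdbool.h>

/* Hypothetical outcomes of a charge attempt; names are illustrative only. */
enum charge_result { CHARGE_OK, CHARGE_ENOMEM, CHARGE_OOM_KILL };

/* Toy cgroup: a usage counter and a hard limit, both counted in pages. */
struct toy_memcg {
	unsigned long usage;
	unsigned long limit;
};

/*
 * Lowest level of the toy charge path. This is the only place that
 * decides whether the OOM path may be entered, based on the caller's flag.
 */
static enum charge_result toy_try_charge(struct toy_memcg *cg,
					 unsigned long nr_pages, bool oom)
{
	if (cg->usage + nr_pages <= cg->limit) {
		cg->usage += nr_pages;
		return CHARGE_OK;
	}
	/* Over the limit; this toy assumes nothing can be reclaimed. */
	return oom ? CHARGE_OOM_KILL : CHARGE_ENOMEM;
}

/* Middle layer: forwards the caller's decision without touching any mask. */
static enum charge_result toy_cache_charge(struct toy_memcg *cg, bool oom)
{
	return toy_try_charge(cg, 1, oom);
}

/* Page cache insertion may run with a lock held, so it never asks for OOM. */
static enum charge_result toy_add_to_page_cache(struct toy_memcg *cg)
{
	return toy_cache_charge(cg, false);
}

/* A path assumed to hold no such lock may still ask for OOM. */
static enum charge_result toy_anon_fault(struct toy_memcg *cg)
{
	return toy_try_charge(cg, 1, true);
}
```

With a toy cgroup already at its limit, toy_add_to_page_cache() reports CHARGE_ENOMEM while toy_anon_fault() reports CHARGE_OOM_KILL. The decision is made at the call site, which is the only place that knows which locks are held.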
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932071Ab2K0Uyl (ORCPT ); Tue, 27 Nov 2012 15:54:41 -0500 Received: from mail-ee0-f46.google.com ([74.125.83.46]:36682 "EHLO mail-ee0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754103Ab2K0Uyk (ORCPT ); Tue, 27 Nov 2012 15:54:40 -0500 Date: Tue, 27 Nov 2012 21:54:36 +0100 From: Michal Hocko To: Johannes Weiner , KAMEZAWA Hiroyuki Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist Subject: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121127205431.GA2433@dhcp22.suse.cz> References: <20121123074023.GA24698@dhcp22.suse.cz> <20121123102137.10D6D653@pobox.sk> <20121123100438.GF24698@dhcp22.suse.cz> <20121125011047.7477BB5E@pobox.sk> <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <50B403CA.501@jp.fujitsu.com> <20121127194813.GP24381@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121127194813.GP24381@cmpxchg.org> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue 27-11-12 14:48:13, Johannes Weiner wrote: [...] > > >diff --git a/include/linux/gfp.h b/include/linux/gfp.h > > >index 10e667f..aac9b21 100644 > > >--- a/include/linux/gfp.h > > >+++ b/include/linux/gfp.h > > >@@ -152,6 +152,9 @@ struct vm_area_struct; > > > /* 4GB DMA on some platforms */ > > > #define GFP_DMA32 __GFP_DMA32 > > > > > >+/* memcg oom killer is not allowed */ > > >+#define GFP_MEMCG_NO_OOM __GFP_NORETRY > > Could we leave this within memcg, please? An extra flag to > mem_cgroup_cache_charge() or the like. GFP flags are about > controlling the page allocator, this seems abusive. 
We have an oom > flag down in try_charge, maybe just propagate this up the stack?

OK, what about the patch below? I have dropped Kame's Acked-by because it
has been reworked. The patch is the same in principle.

> > >diff --git a/mm/filemap.c b/mm/filemap.c
> > >index 83efee7..ef14351 100644
> > >--- a/mm/filemap.c
> > >+++ b/mm/filemap.c
> > >@@ -447,7 +447,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
> > > VM_BUG_ON(!PageLocked(page));
> > > VM_BUG_ON(PageSwapBacked(page));
> > >
> > >- error = mem_cgroup_cache_charge(page, current->mm,
> > >+ /*
> > >+ * Cannot trigger OOM even if gfp_mask would allow that normally
> > >+ * because we might be called from a locked context and that
> > >+ * could lead to deadlocks if the killed process is waiting for
> > >+ * the same lock.
> > >+ */
> > >+ error = mem_cgroup_cache_charge_no_oom(page, current->mm,
> > > gfp_mask & GFP_RECLAIM_MASK);
> > > if (error)
> > > goto out;
>
> Shmem does not use this function but also charges under the i_mutex in
> the write path and fallocate at least.

Right you are
---
>>From 60cc8a184490d277eb24fca551b114f1e2234ce0 Mon Sep 17 00:00:00 2001
From: Michal Hocko
Date: Mon, 26 Nov 2012 11:47:57 +0100
Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked

memcg oom killer might deadlock if the process which falls down to
mem_cgroup_handle_oom holds a lock which prevents another task from
terminating because it is blocked on the very same lock.
This can happen when a write system call needs to allocate a page but
the allocation hits the memcg hard limit and there is nothing to reclaim
(e.g. there is no swap or the swap limit is hit as well and all cache pages
have been reclaimed already) and the process selected by the memcg OOM
killer is blocked on i_mutex on the same inode (e.g. truncating it).

Process A
[] do_truncate+0x58/0xa0  # takes i_mutex
[] do_last+0x250/0xa30
[] path_openat+0xd7/0x440
[] do_filp_open+0x49/0xa0
[] do_sys_open+0x106/0x240
[] sys_open+0x20/0x30
[] system_call_fastpath+0x18/0x1d
[] 0xffffffffffffffff

Process B
[] mem_cgroup_handle_oom+0x241/0x3b0
[] T.1146+0x5ab/0x5c0
[] mem_cgroup_cache_charge+0xbe/0xe0
[] add_to_page_cache_locked+0x4c/0x140
[] add_to_page_cache_lru+0x22/0x50
[] grab_cache_page_write_begin+0x8b/0xe0
[] ext3_write_begin+0x88/0x270
[] generic_file_buffered_write+0x116/0x290
[] __generic_file_aio_write+0x27c/0x480
[] generic_file_aio_write+0x76/0xf0  # takes ->i_mutex
[] do_sync_write+0xea/0x130
[] vfs_write+0xf3/0x1f0
[] sys_write+0x51/0x90
[] system_call_fastpath+0x18/0x1d
[] 0xffffffffffffffff

This is not a hard deadlock though because the administrator can still
intervene and increase the limit on the group, which helps the writer to
finish the allocation and release the lock.

This patch heals the problem by forbidding OOM from page cache charges
(namely add_to_page_cache_locked). mem_cgroup_cache_charge grows an oom
argument which is pushed down the call chain.

As a possibly visible result add_to_page_cache_lru might fail more often
with ENOMEM but this is to be expected if the limit is set and it is
preferable to the OOM killer IMO.

Changes since v1
- do not abuse gfp_flags and rather use oom parameter directly as per
  Johannes
- also handle shmem write faults and
fallocate properly as per Johannes Reported-by: azurIt Signed-off-by: Michal Hocko --- include/linux/memcontrol.h | 5 +++-- mm/filemap.c | 9 +++++++-- mm/memcontrol.c | 9 ++++----- mm/shmem.c | 14 +++++++++++--- 4 files changed, 25 insertions(+), 12 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 095d2b4..8f48d5e 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -63,7 +63,7 @@ extern void mem_cgroup_commit_charge_swapin(struct page *page, extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg); extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, - gfp_t gfp_mask); + gfp_t gfp_mask, bool oom); struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *); struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *); @@ -210,7 +210,8 @@ static inline int mem_cgroup_newpage_charge(struct page *page, } static inline int mem_cgroup_cache_charge(struct page *page, - struct mm_struct *mm, gfp_t gfp_mask) + struct mm_struct *mm, gfp_t gfp_mask, + bool oom) { return 0; } diff --git a/mm/filemap.c b/mm/filemap.c index 83efee7..ef8fbd5 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -447,8 +447,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, VM_BUG_ON(!PageLocked(page)); VM_BUG_ON(PageSwapBacked(page)); - error = mem_cgroup_cache_charge(page, current->mm, - gfp_mask & GFP_RECLAIM_MASK); + /* + * Cannot trigger OOM even if gfp_mask would allow that normally + * because we might be called from a locked context and that + * could lead to deadlocks if the killed process is waiting for + * the same lock. 
+ */ + error = mem_cgroup_cache_charge(page, current->mm, gfp_mask, false); if (error) goto out; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 02ee2f7..26690d6 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3709,11 +3709,10 @@ out: * < 0 if the cgroup is over its limit */ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm, - gfp_t gfp_mask, enum charge_type ctype) + gfp_t gfp_mask, enum charge_type ctype, bool oom) { struct mem_cgroup *memcg = NULL; unsigned int nr_pages = 1; - bool oom = true; int ret; if (PageTransHuge(page)) { @@ -3742,7 +3741,7 @@ int mem_cgroup_newpage_charge(struct page *page, VM_BUG_ON(page->mapping && !PageAnon(page)); VM_BUG_ON(!mm); return mem_cgroup_charge_common(page, mm, gfp_mask, - MEM_CGROUP_CHARGE_TYPE_ANON); + MEM_CGROUP_CHARGE_TYPE_ANON, true); } /* @@ -3851,7 +3850,7 @@ void mem_cgroup_commit_charge_swapin(struct page *page, } int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, - gfp_t gfp_mask) + gfp_t gfp_mask, bool oom) { struct mem_cgroup *memcg = NULL; enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE; @@ -3863,7 +3862,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, return 0; if (!PageSwapCache(page)) - ret = mem_cgroup_charge_common(page, mm, gfp_mask, type); + ret = mem_cgroup_charge_common(page, mm, gfp_mask, type, oom); else { /* page is swapcache/shmem */ ret = __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, &memcg); diff --git a/mm/shmem.c b/mm/shmem.c index 55054a7..cef63b5 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -760,7 +760,7 @@ int shmem_unuse(swp_entry_t swap, struct page *page) * the shmem_swaplist_mutex which might hold up shmem_writepage(). * Charged back to the user (not to caller) when swap account is used. 
*/ - error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL); + error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL, true); if (error) goto out; /* No radix_tree_preload: swap entry keeps a place for page in tree */ @@ -1152,8 +1152,16 @@ repeat: goto failed; } + /* + * Cannot trigger OOM even if gfp_mask would allow that + * normally because we might be called from a locked + * context (i_mutex held) if this is a write lock or + * fallocate and that could lead to deadlocks if the + * killed process is waiting for the same lock. + */ error = mem_cgroup_cache_charge(page, current->mm, - gfp & GFP_RECLAIM_MASK); + gfp & GFP_RECLAIM_MASK, + sgp < SGP_WRITE); if (!error) { error = shmem_add_to_page_cache(page, mapping, index, gfp, swp_to_radix_entry(swap)); @@ -1209,7 +1217,7 @@ repeat: SetPageSwapBacked(page); __set_page_locked(page); error = mem_cgroup_cache_charge(page, current->mm, - gfp & GFP_RECLAIM_MASK); + gfp & GFP_RECLAIM_MASK, true); if (error) goto decused; error = radix_tree_preload(gfp & GFP_RECLAIM_MASK); -- 1.7.10.4 -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756089Ab2K0U7t (ORCPT ); Tue, 27 Nov 2012 15:59:49 -0500 Received: from mail-ee0-f46.google.com ([74.125.83.46]:59786 "EHLO mail-ee0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752812Ab2K0U7r (ORCPT ); Tue, 27 Nov 2012 15:59:47 -0500 Date: Tue, 27 Nov 2012 21:59:44 +0100 From: Michal Hocko To: Johannes Weiner , KAMEZAWA Hiroyuki Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist Subject: Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121127205944.GB2433@dhcp22.suse.cz> References: <20121123102137.10D6D653@pobox.sk> <20121123100438.GF24698@dhcp22.suse.cz> <20121125011047.7477BB5E@pobox.sk> <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> 
<20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <50B403CA.501@jp.fujitsu.com> <20121127194813.GP24381@cmpxchg.org> <20121127205431.GA2433@dhcp22.suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20121127205431.GA2433@dhcp22.suse.cz>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

Sorry, I forgot about one shmem charge:
---
>>From 7ae29927d24471c1b1a6ceb021219c592c1ef518 Mon Sep 17 00:00:00 2001
From: Michal Hocko
Date: Tue, 27 Nov 2012 21:53:13 +0100
Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked

memcg oom killer might deadlock if the process which falls down to
mem_cgroup_handle_oom holds a lock which prevents another task from
terminating because it is blocked on the very same lock.
This can happen when a write system call needs to allocate a page but
the allocation hits the memcg hard limit and there is nothing to reclaim
(e.g. there is no swap or the swap limit is hit as well and all cache pages
have been reclaimed already) and the process selected by the memcg OOM
killer is blocked on i_mutex on the same inode (e.g. truncating it).

Process A
[] do_truncate+0x58/0xa0  # takes i_mutex
[] do_last+0x250/0xa30
[] path_openat+0xd7/0x440
[] do_filp_open+0x49/0xa0
[] do_sys_open+0x106/0x240
[] sys_open+0x20/0x30
[] system_call_fastpath+0x18/0x1d
[] 0xffffffffffffffff

Process B
[] mem_cgroup_handle_oom+0x241/0x3b0
[] T.1146+0x5ab/0x5c0
[] mem_cgroup_cache_charge+0xbe/0xe0
[] add_to_page_cache_locked+0x4c/0x140
[] add_to_page_cache_lru+0x22/0x50
[] grab_cache_page_write_begin+0x8b/0xe0
[] ext3_write_begin+0x88/0x270
[] generic_file_buffered_write+0x116/0x290
[] __generic_file_aio_write+0x27c/0x480
[] generic_file_aio_write+0x76/0xf0  # takes ->i_mutex
[] do_sync_write+0xea/0x130
[] vfs_write+0xf3/0x1f0
[] sys_write+0x51/0x90
[] system_call_fastpath+0x18/0x1d
[] 0xffffffffffffffff

This is not a hard deadlock though because the administrator can still
intervene and increase the limit on the group, which helps the writer to
finish the allocation and release the lock.

This patch heals the problem by forbidding OOM from page cache charges
(namely add_to_page_cache_locked). mem_cgroup_cache_charge grows an oom
argument which is pushed down the call chain.

As a possibly visible result add_to_page_cache_lru might fail more often
with ENOMEM but this is to be expected if the limit is set and it is
preferable to the OOM killer IMO.

Changes since v1
- do not abuse gfp_flags and rather use oom parameter directly as per
  Johannes
- also handle shmem write faults and
fallocate properly as per Johannes Reported-by: azurIt Signed-off-by: Michal Hocko --- include/linux/memcontrol.h | 5 +++-- mm/filemap.c | 9 +++++++-- mm/memcontrol.c | 9 ++++----- mm/shmem.c | 15 ++++++++++++--- 4 files changed, 26 insertions(+), 12 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 095d2b4..8f48d5e 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -63,7 +63,7 @@ extern void mem_cgroup_commit_charge_swapin(struct page *page, extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg); extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, - gfp_t gfp_mask); + gfp_t gfp_mask, bool oom); struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *); struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *); @@ -210,7 +210,8 @@ static inline int mem_cgroup_newpage_charge(struct page *page, } static inline int mem_cgroup_cache_charge(struct page *page, - struct mm_struct *mm, gfp_t gfp_mask) + struct mm_struct *mm, gfp_t gfp_mask, + bool oom) { return 0; } diff --git a/mm/filemap.c b/mm/filemap.c index 83efee7..ef8fbd5 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -447,8 +447,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, VM_BUG_ON(!PageLocked(page)); VM_BUG_ON(PageSwapBacked(page)); - error = mem_cgroup_cache_charge(page, current->mm, - gfp_mask & GFP_RECLAIM_MASK); + /* + * Cannot trigger OOM even if gfp_mask would allow that normally + * because we might be called from a locked context and that + * could lead to deadlocks if the killed process is waiting for + * the same lock. 
+ */ + error = mem_cgroup_cache_charge(page, current->mm, gfp_mask, false); if (error) goto out; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 02ee2f7..26690d6 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3709,11 +3709,10 @@ out: * < 0 if the cgroup is over its limit */ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm, - gfp_t gfp_mask, enum charge_type ctype) + gfp_t gfp_mask, enum charge_type ctype, bool oom) { struct mem_cgroup *memcg = NULL; unsigned int nr_pages = 1; - bool oom = true; int ret; if (PageTransHuge(page)) { @@ -3742,7 +3741,7 @@ int mem_cgroup_newpage_charge(struct page *page, VM_BUG_ON(page->mapping && !PageAnon(page)); VM_BUG_ON(!mm); return mem_cgroup_charge_common(page, mm, gfp_mask, - MEM_CGROUP_CHARGE_TYPE_ANON); + MEM_CGROUP_CHARGE_TYPE_ANON, true); } /* @@ -3851,7 +3850,7 @@ void mem_cgroup_commit_charge_swapin(struct page *page, } int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, - gfp_t gfp_mask) + gfp_t gfp_mask, bool oom) { struct mem_cgroup *memcg = NULL; enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE; @@ -3863,7 +3862,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, return 0; if (!PageSwapCache(page)) - ret = mem_cgroup_charge_common(page, mm, gfp_mask, type); + ret = mem_cgroup_charge_common(page, mm, gfp_mask, type, oom); else { /* page is swapcache/shmem */ ret = __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, &memcg); diff --git a/mm/shmem.c b/mm/shmem.c index 55054a7..ba59cfa 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -760,7 +760,7 @@ int shmem_unuse(swp_entry_t swap, struct page *page) * the shmem_swaplist_mutex which might hold up shmem_writepage(). * Charged back to the user (not to caller) when swap account is used. 
*/ - error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL); + error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL, true); if (error) goto out; /* No radix_tree_preload: swap entry keeps a place for page in tree */ @@ -1152,8 +1152,16 @@ repeat: goto failed; } + /* + * Cannot trigger OOM even if gfp_mask would allow that + * normally because we might be called from a locked + * context (i_mutex held) if this is a write lock or + * fallocate and that could lead to deadlocks if the + * killed process is waiting for the same lock. + */ error = mem_cgroup_cache_charge(page, current->mm, - gfp & GFP_RECLAIM_MASK); + gfp & GFP_RECLAIM_MASK, + sgp < SGP_WRITE); if (!error) { error = shmem_add_to_page_cache(page, mapping, index, gfp, swp_to_radix_entry(swap)); @@ -1209,7 +1217,8 @@ repeat: SetPageSwapBacked(page); __set_page_locked(page); error = mem_cgroup_cache_charge(page, current->mm, - gfp & GFP_RECLAIM_MASK); + gfp & GFP_RECLAIM_MASK, + sgp < SGP_WRITE); if (error) goto decused; error = radix_tree_preload(gfp & GFP_RECLAIM_MASK); -- 1.7.10.4 -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932499Ab2K1P0s (ORCPT ); Wed, 28 Nov 2012 10:26:48 -0500 Received: from zene.cmpxchg.org ([85.214.230.12]:35284 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932408Ab2K1P0q (ORCPT ); Wed, 28 Nov 2012 10:26:46 -0500 Date: Wed, 28 Nov 2012 10:26:31 -0500 From: Johannes Weiner To: Michal Hocko Cc: KAMEZAWA Hiroyuki , azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist Subject: Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121128152631.GT24381@cmpxchg.org> References: <20121123100438.GF24698@dhcp22.suse.cz> <20121125011047.7477BB5E@pobox.sk> <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> 
<20121126131837.GC17860@dhcp22.suse.cz> <50B403CA.501@jp.fujitsu.com> <20121127194813.GP24381@cmpxchg.org> <20121127205431.GA2433@dhcp22.suse.cz> <20121127205944.GB2433@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121127205944.GB2433@dhcp22.suse.cz> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Nov 27, 2012 at 09:59:44PM +0100, Michal Hocko wrote: > @@ -3863,7 +3862,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, > return 0; > > if (!PageSwapCache(page)) > - ret = mem_cgroup_charge_common(page, mm, gfp_mask, type); > + ret = mem_cgroup_charge_common(page, mm, gfp_mask, type, oom); > else { /* page is swapcache/shmem */ > ret = __mem_cgroup_try_charge_swapin(mm, page, > gfp_mask, &memcg); I think you need to pass it down the swapcache path too, as that is what happens when the shmem page written to is in swap and has been read into swapcache by the time of charging. > @@ -1152,8 +1152,16 @@ repeat: > goto failed; > } > > + /* > + * Cannot trigger OOM even if gfp_mask would allow that > + * normally because we might be called from a locked > + * context (i_mutex held) if this is a write lock or > + * fallocate and that could lead to deadlocks if the > + * killed process is waiting for the same lock. > + */ Indentation broken? > error = mem_cgroup_cache_charge(page, current->mm, > - gfp & GFP_RECLAIM_MASK); > + gfp & GFP_RECLAIM_MASK, > + sgp < SGP_WRITE); The code tests for read-only paths a bunch of times using sgp != SGP_WRITE && sgp != SGP_FALLOC Would probably be more consistent and more robust to use this here as well? > @@ -1209,7 +1217,8 @@ repeat: > SetPageSwapBacked(page); > __set_page_locked(page); > error = mem_cgroup_cache_charge(page, current->mm, > - gfp & GFP_RECLAIM_MASK); > + gfp & GFP_RECLAIM_MASK, > + sgp < SGP_WRITE); Same. 
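
To illustrate why the explicit comparison is the more robust of the two, a minimal stand-alone C sketch follows. The first enum is a local stand-in whose ordering is an assumption made for this example (it is not copied from mm/shmem.c), and the second enum, with an extra read-only mode V2_PEEK appended at the end, is purely hypothetical.

```c
#include <stdbool.h>

/* Local stand-in for the shmem lookup modes; ordering assumed for the sketch. */
enum sgp_type { SGP_READ, SGP_CACHE, SGP_DIRTY, SGP_WRITE, SGP_FALLOC };

/*
 * Ordering-based test: correct only while every read-only mode sorts
 * before SGP_WRITE and every other mode sorts at or after it.
 */
static bool may_oom_by_order(enum sgp_type sgp)
{
	return sgp < SGP_WRITE;
}

/* Explicit test: names the two modes assumed to run with i_mutex held. */
static bool may_oom_explicit(enum sgp_type sgp)
{
	return sgp != SGP_WRITE && sgp != SGP_FALLOC;
}

/* Hypothetical later layout: a new read-only mode is appended at the end. */
enum sgp_type_v2 { V2_READ, V2_CACHE, V2_DIRTY, V2_WRITE, V2_FALLOC, V2_PEEK };

/* Same ordering-based test against the hypothetical layout. */
static bool may_oom_by_order_v2(enum sgp_type_v2 sgp)
{
	return sgp < V2_WRITE;
}

/* Same explicit test against the hypothetical layout. */
static bool may_oom_explicit_v2(enum sgp_type_v2 sgp)
{
	return sgp != V2_WRITE && sgp != V2_FALLOC;
}
```

For the first enum both tests agree on every value. For the hypothetical V2_PEEK the ordering test silently classifies a read-only mode as a locked one, while the explicit test still gives the intended answer.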
Otherwise, the patch looks good to me, thanks for persisting :) From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751690Ab2K1QE6 (ORCPT ); Wed, 28 Nov 2012 11:04:58 -0500 Received: from cantor2.suse.de ([195.135.220.15]:53602 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751036Ab2K1QEu (ORCPT ); Wed, 28 Nov 2012 11:04:50 -0500 Date: Wed, 28 Nov 2012 17:04:47 +0100 From: Michal Hocko To: Johannes Weiner Cc: KAMEZAWA Hiroyuki , azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist Subject: Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121128160447.GH12309@dhcp22.suse.cz> References: <20121125011047.7477BB5E@pobox.sk> <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <50B403CA.501@jp.fujitsu.com> <20121127194813.GP24381@cmpxchg.org> <20121127205431.GA2433@dhcp22.suse.cz> <20121127205944.GB2433@dhcp22.suse.cz> <20121128152631.GT24381@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121128152631.GT24381@cmpxchg.org> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed 28-11-12 10:26:31, Johannes Weiner wrote: > On Tue, Nov 27, 2012 at 09:59:44PM +0100, Michal Hocko wrote: > > @@ -3863,7 +3862,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, > > return 0; > > > > if (!PageSwapCache(page)) > > - ret = mem_cgroup_charge_common(page, mm, gfp_mask, type); > > + ret = mem_cgroup_charge_common(page, mm, gfp_mask, type, oom); > > else { /* page is swapcache/shmem */ > > ret = __mem_cgroup_try_charge_swapin(mm, page, > > gfp_mask, &memcg); > > I think you need to pass it down the swapcache path too, as that is > what 
happens when the shmem page written to is in swap and has been
> read into swapcache by the time of charging.

You are right, of course. I shouldn't send patches late in the evening
after staring at a crashdump for a good part of the day.
/me ashamed.

> > @@ -1152,8 +1152,16 @@ repeat:
> > goto failed;
> > }
> >
> > + /*
> > + * Cannot trigger OOM even if gfp_mask would allow that
> > + * normally because we might be called from a locked
> > + * context (i_mutex held) if this is a write lock or
> > + * fallocate and that could lead to deadlocks if the
> > + * killed process is waiting for the same lock.
> > + */
>
> Indentation broken?

Copy and paste.

> > error = mem_cgroup_cache_charge(page, current->mm,
> > - gfp & GFP_RECLAIM_MASK);
> > + gfp & GFP_RECLAIM_MASK,
> > + sgp < SGP_WRITE);
>
> The code tests for read-only paths a bunch of times using
>
> sgp != SGP_WRITE && sgp != SGP_FALLOC
>
> Would probably be more consistent and more robust to use this here as
> well?

Yes, my laziness. I was considering that but it was really long so I've
chosen the simpler way. But you are right that consistency is probably
better here.

> > @@ -1209,7 +1217,8 @@ repeat:
> > SetPageSwapBacked(page);
> > __set_page_locked(page);
> > error = mem_cgroup_cache_charge(page, current->mm,
> > - gfp & GFP_RECLAIM_MASK);
> > + gfp & GFP_RECLAIM_MASK,
> > + sgp < SGP_WRITE);
>
> Same.
>
> Otherwise, the patch looks good to me, thanks for persisting :)

Thanks for the thorough review.
Here we go with the fixed version.
---
>>From 5000bf32c9c02fcd31d18e615300d8e7e7ef94a5 Mon Sep 17 00:00:00 2001
From: Michal Hocko
Date: Wed, 28 Nov 2012 16:49:46 +0100
Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked

memcg oom killer might deadlock if the process which falls down to
mem_cgroup_handle_oom holds a lock which prevents another task from
terminating because it is blocked on the very same lock.
This can happen when a write system call needs to allocate a page but
the allocation hits the memcg hard limit and there is nothing to reclaim
(e.g. there is no swap or the swap limit is hit as well and all cache pages
have been reclaimed already) and the process selected by the memcg OOM
killer is blocked on i_mutex on the same inode (e.g. truncating it).

Process A
[] do_truncate+0x58/0xa0  # takes i_mutex
[] do_last+0x250/0xa30
[] path_openat+0xd7/0x440
[] do_filp_open+0x49/0xa0
[] do_sys_open+0x106/0x240
[] sys_open+0x20/0x30
[] system_call_fastpath+0x18/0x1d
[] 0xffffffffffffffff

Process B
[] mem_cgroup_handle_oom+0x241/0x3b0
[] T.1146+0x5ab/0x5c0
[] mem_cgroup_cache_charge+0xbe/0xe0
[] add_to_page_cache_locked+0x4c/0x140
[] add_to_page_cache_lru+0x22/0x50
[] grab_cache_page_write_begin+0x8b/0xe0
[] ext3_write_begin+0x88/0x270
[] generic_file_buffered_write+0x116/0x290
[] __generic_file_aio_write+0x27c/0x480
[] generic_file_aio_write+0x76/0xf0  # takes ->i_mutex
[] do_sync_write+0xea/0x130
[] vfs_write+0xf3/0x1f0
[] sys_write+0x51/0x90
[] system_call_fastpath+0x18/0x1d
[] 0xffffffffffffffff

This is not a hard deadlock though because the administrator can still
intervene and increase the limit on the group, which helps the writer to
finish the allocation and release the lock.

This patch heals the problem by forbidding OOM from page cache charges
(namely add_to_page_cache_locked). mem_cgroup_cache_charge grows an oom
argument which is pushed down the call chain.

As a possibly visible result add_to_page_cache_lru might fail more often
with ENOMEM but this is to be expected if the limit is set and it is
preferable to the OOM killer IMO.

Changes since v1
- do not abuse gfp_flags and rather use oom parameter directly as per
  Johannes
- also handle shmem write faults and
fallocate properly as per Johannes Reported-by: azurIt Signed-off-by: Michal Hocko --- include/linux/memcontrol.h | 11 +++++++---- mm/filemap.c | 9 +++++++-- mm/memcontrol.c | 25 +++++++++++++------------ mm/memory.c | 2 +- mm/shmem.c | 17 ++++++++++++++--- mm/swapfile.c | 2 +- 6 files changed, 43 insertions(+), 23 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 095d2b4..5abe441 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -57,13 +57,14 @@ extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask); /* for swap handling */ extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm, - struct page *page, gfp_t mask, struct mem_cgroup **memcgp); + struct page *page, gfp_t mask, struct mem_cgroup **memcgp, + bool oom); extern void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *memcg); extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg); extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, - gfp_t gfp_mask); + gfp_t gfp_mask, bool oom); struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *); struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *); @@ -210,13 +211,15 @@ static inline int mem_cgroup_newpage_charge(struct page *page, } static inline int mem_cgroup_cache_charge(struct page *page, - struct mm_struct *mm, gfp_t gfp_mask) + struct mm_struct *mm, gfp_t gfp_mask, + bool oom) { return 0; } static inline int mem_cgroup_try_charge_swapin(struct mm_struct *mm, - struct page *page, gfp_t gfp_mask, struct mem_cgroup **memcgp) + struct page *page, gfp_t gfp_mask, struct mem_cgroup **memcgp, + bool oom) { return 0; } diff --git a/mm/filemap.c b/mm/filemap.c index 83efee7..ef8fbd5 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -447,8 +447,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, VM_BUG_ON(!PageLocked(page)); 
VM_BUG_ON(PageSwapBacked(page)); - error = mem_cgroup_cache_charge(page, current->mm, - gfp_mask & GFP_RECLAIM_MASK); + /* + * Cannot trigger OOM even if gfp_mask would allow that normally + * because we might be called from a locked context and that + * could lead to deadlocks if the killed process is waiting for + * the same lock. + */ + error = mem_cgroup_cache_charge(page, current->mm, gfp_mask, false); if (error) goto out; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 02ee2f7..02a6d70 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3709,11 +3709,10 @@ out: * < 0 if the cgroup is over its limit */ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm, - gfp_t gfp_mask, enum charge_type ctype) + gfp_t gfp_mask, enum charge_type ctype, bool oom) { struct mem_cgroup *memcg = NULL; unsigned int nr_pages = 1; - bool oom = true; int ret; if (PageTransHuge(page)) { @@ -3742,7 +3741,7 @@ int mem_cgroup_newpage_charge(struct page *page, VM_BUG_ON(page->mapping && !PageAnon(page)); VM_BUG_ON(!mm); return mem_cgroup_charge_common(page, mm, gfp_mask, - MEM_CGROUP_CHARGE_TYPE_ANON); + MEM_CGROUP_CHARGE_TYPE_ANON, true); } /* @@ -3754,7 +3753,8 @@ int mem_cgroup_newpage_charge(struct page *page, static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page, gfp_t mask, - struct mem_cgroup **memcgp) + struct mem_cgroup **memcgp, + bool oom) { struct mem_cgroup *memcg; struct page_cgroup *pc; @@ -3776,20 +3776,21 @@ static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm, if (!memcg) goto charge_cur_mm; *memcgp = memcg; - ret = __mem_cgroup_try_charge(NULL, mask, 1, memcgp, true); + ret = __mem_cgroup_try_charge(NULL, mask, 1, memcgp, oom); css_put(&memcg->css); if (ret == -EINTR) ret = 0; return ret; charge_cur_mm: - ret = __mem_cgroup_try_charge(mm, mask, 1, memcgp, true); + ret = __mem_cgroup_try_charge(mm, mask, 1, memcgp, oom); if (ret == -EINTR) ret = 0; return ret; } int mem_cgroup_try_charge_swapin(struct 
mm_struct *mm, struct page *page, - gfp_t gfp_mask, struct mem_cgroup **memcgp) + gfp_t gfp_mask, struct mem_cgroup **memcgp, + bool oom) { *memcgp = NULL; if (mem_cgroup_disabled()) @@ -3803,12 +3804,12 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page, if (!PageSwapCache(page)) { int ret; - ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, memcgp, true); + ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, memcgp, oom); if (ret == -EINTR) ret = 0; return ret; } - return __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, memcgp); + return __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, memcgp, oom); } void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg) @@ -3851,7 +3852,7 @@ void mem_cgroup_commit_charge_swapin(struct page *page, } int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, - gfp_t gfp_mask) + gfp_t gfp_mask, bool oom) { struct mem_cgroup *memcg = NULL; enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE; @@ -3863,10 +3864,10 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, return 0; if (!PageSwapCache(page)) - ret = mem_cgroup_charge_common(page, mm, gfp_mask, type); + ret = mem_cgroup_charge_common(page, mm, gfp_mask, type, oom); else { /* page is swapcache/shmem */ ret = __mem_cgroup_try_charge_swapin(mm, page, - gfp_mask, &memcg); + gfp_mask, &memcg, oom); if (!ret) __mem_cgroup_commit_charge_swapin(page, memcg, type); } diff --git a/mm/memory.c b/mm/memory.c index 6891d3b..afad903 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2991,7 +2991,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma, } } - if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr)) { + if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr, true)) { ret = VM_FAULT_OOM; goto out_page; } diff --git a/mm/shmem.c b/mm/shmem.c index 55054a7..3b27db4 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -760,7 +760,7 @@ int shmem_unuse(swp_entry_t swap, struct page *page) * the 
shmem_swaplist_mutex which might hold up shmem_writepage(). * Charged back to the user (not to caller) when swap account is used. */ - error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL); + error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL, true); if (error) goto out; /* No radix_tree_preload: swap entry keeps a place for page in tree */ @@ -1152,8 +1152,17 @@ repeat: goto failed; } + /* + * Cannot trigger OOM even if gfp_mask would allow that + * normally because we might be called from a locked + * context (i_mutex held) if this is a write lock or + * fallocate and that could lead to deadlocks if the + * killed process is waiting for the same lock. + */ error = mem_cgroup_cache_charge(page, current->mm, - gfp & GFP_RECLAIM_MASK); + gfp & GFP_RECLAIM_MASK, + sgp != SGP_WRITE && + sgp != SGP_FALLOC); if (!error) { error = shmem_add_to_page_cache(page, mapping, index, gfp, swp_to_radix_entry(swap)); @@ -1209,7 +1218,9 @@ repeat: SetPageSwapBacked(page); __set_page_locked(page); error = mem_cgroup_cache_charge(page, current->mm, - gfp & GFP_RECLAIM_MASK); + gfp & GFP_RECLAIM_MASK, + sgp != SGP_WRITE && + sgp != SGP_FALLOC); if (error) goto decused; error = radix_tree_preload(gfp & GFP_RECLAIM_MASK); diff --git a/mm/swapfile.c b/mm/swapfile.c index 2f8e429..8ec511e 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -828,7 +828,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd, int ret = 1; if (mem_cgroup_try_charge_swapin(vma->vm_mm, page, - GFP_KERNEL, &memcg)) { + GFP_KERNEL, &memcg, true)) { ret = -ENOMEM; goto out_nolock; } -- 1.7.10.4 -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754257Ab2K1QiA (ORCPT ); Wed, 28 Nov 2012 11:38:00 -0500 Received: from zene.cmpxchg.org ([85.214.230.12]:35293 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754117Ab2K1Qh5 (ORCPT ); Wed, 28 Nov 2012 
11:37:57 -0500 Date: Wed, 28 Nov 2012 11:37:36 -0500 From: Johannes Weiner To: Michal Hocko Cc: KAMEZAWA Hiroyuki , azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist Subject: Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121128163736.GV24381@cmpxchg.org> References: <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <50B403CA.501@jp.fujitsu.com> <20121127194813.GP24381@cmpxchg.org> <20121127205431.GA2433@dhcp22.suse.cz> <20121127205944.GB2433@dhcp22.suse.cz> <20121128152631.GT24381@cmpxchg.org> <20121128160447.GH12309@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121128160447.GH12309@dhcp22.suse.cz> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Nov 28, 2012 at 05:04:47PM +0100, Michal Hocko wrote: > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 095d2b4..5abe441 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -57,13 +57,14 @@ extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm, > gfp_t gfp_mask); > /* for swap handling */ > extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm, > - struct page *page, gfp_t mask, struct mem_cgroup **memcgp); > + struct page *page, gfp_t mask, struct mem_cgroup **memcgp, > + bool oom); Ok, now I feel almost bad for asking, but why the public interface, too? You only ever pass "true" in there and this is unlikely to change anytime soon, no? 
> @@ -3754,7 +3753,8 @@ int mem_cgroup_newpage_charge(struct page *page, > static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm, > struct page *page, > gfp_t mask, > - struct mem_cgroup **memcgp) > + struct mem_cgroup **memcgp, > + bool oom) > { > struct mem_cgroup *memcg; > struct page_cgroup *pc; > @@ -3776,20 +3776,21 @@ static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm, > if (!memcg) > goto charge_cur_mm; > *memcgp = memcg; > - ret = __mem_cgroup_try_charge(NULL, mask, 1, memcgp, true); > + ret = __mem_cgroup_try_charge(NULL, mask, 1, memcgp, oom); > css_put(&memcg->css); > if (ret == -EINTR) > ret = 0; > return ret; > charge_cur_mm: > - ret = __mem_cgroup_try_charge(mm, mask, 1, memcgp, true); > + ret = __mem_cgroup_try_charge(mm, mask, 1, memcgp, oom); > if (ret == -EINTR) > ret = 0; > return ret; > } Only this one is needed... > @@ -3851,7 +3852,7 @@ void mem_cgroup_commit_charge_swapin(struct page *page, > } > > int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, > - gfp_t gfp_mask) > + gfp_t gfp_mask, bool oom) > { > struct mem_cgroup *memcg = NULL; > enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE; > @@ -3863,10 +3864,10 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, > return 0; > > if (!PageSwapCache(page)) > - ret = mem_cgroup_charge_common(page, mm, gfp_mask, type); > + ret = mem_cgroup_charge_common(page, mm, gfp_mask, type, oom); > else { /* page is swapcache/shmem */ > ret = __mem_cgroup_try_charge_swapin(mm, page, > - gfp_mask, &memcg); > + gfp_mask, &memcg, oom); > if (!ret) > __mem_cgroup_commit_charge_swapin(page, memcg, type); > } ...for this site. 
> diff --git a/mm/memory.c b/mm/memory.c > index 6891d3b..afad903 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -2991,7 +2991,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma, > } > } > > - if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr)) { > + if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr, true)) { > ret = VM_FAULT_OOM; > goto out_page; > } Can not happen for shmem, the fault handler uses vma->vm_ops->fault. > diff --git a/mm/swapfile.c b/mm/swapfile.c > index 2f8e429..8ec511e 100644 > --- a/mm/swapfile.c > +++ b/mm/swapfile.c > @@ -828,7 +828,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd, > int ret = 1; > > if (mem_cgroup_try_charge_swapin(vma->vm_mm, page, > - GFP_KERNEL, &memcg)) { > + GFP_KERNEL, &memcg, true)) { > ret = -ENOMEM; > goto out_nolock; > } Can not happen for shmem, uses shmem_unuse() instead. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755564Ab2K1Qqp (ORCPT ); Wed, 28 Nov 2012 11:46:45 -0500 Received: from cantor2.suse.de ([195.135.220.15]:55741 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755099Ab2K1Qqn (ORCPT ); Wed, 28 Nov 2012 11:46:43 -0500 Date: Wed, 28 Nov 2012 17:46:40 +0100 From: Michal Hocko To: Johannes Weiner Cc: KAMEZAWA Hiroyuki , azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist Subject: Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121128164640.GB22201@dhcp22.suse.cz> References: <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <50B403CA.501@jp.fujitsu.com> <20121127194813.GP24381@cmpxchg.org> <20121127205431.GA2433@dhcp22.suse.cz> <20121127205944.GB2433@dhcp22.suse.cz> <20121128152631.GT24381@cmpxchg.org> <20121128160447.GH12309@dhcp22.suse.cz> <20121128163736.GV24381@cmpxchg.org> MIME-Version: 1.0 
Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121128163736.GV24381@cmpxchg.org> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed 28-11-12 11:37:36, Johannes Weiner wrote: > On Wed, Nov 28, 2012 at 05:04:47PM +0100, Michal Hocko wrote: > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > > index 095d2b4..5abe441 100644 > > --- a/include/linux/memcontrol.h > > +++ b/include/linux/memcontrol.h > > @@ -57,13 +57,14 @@ extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm, > > gfp_t gfp_mask); > > /* for swap handling */ > > extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm, > > - struct page *page, gfp_t mask, struct mem_cgroup **memcgp); > > + struct page *page, gfp_t mask, struct mem_cgroup **memcgp, > > + bool oom); > > Ok, now I feel almost bad for asking, but why the public interface, > too? Would it work out if I tell it was to double check that your review quality is not decreased after that many revisions? 
:P Incremental update and the full patch in the reply --- diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 5abe441..8f48d5e 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -57,8 +57,7 @@ extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask); /* for swap handling */ extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm, - struct page *page, gfp_t mask, struct mem_cgroup **memcgp, - bool oom); + struct page *page, gfp_t mask, struct mem_cgroup **memcgp); extern void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *memcg); extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg); @@ -218,8 +217,7 @@ static inline int mem_cgroup_cache_charge(struct page *page, } static inline int mem_cgroup_try_charge_swapin(struct mm_struct *mm, - struct page *page, gfp_t gfp_mask, struct mem_cgroup **memcgp, - bool oom) + struct page *page, gfp_t gfp_mask, struct mem_cgroup **memcgp) { return 0; } diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 02a6d70..3c9b1c5 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3789,8 +3789,7 @@ charge_cur_mm: } int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page, - gfp_t gfp_mask, struct mem_cgroup **memcgp, - bool oom) + gfp_t gfp_mask, struct mem_cgroup **memcgp) { *memcgp = NULL; if (mem_cgroup_disabled()) @@ -3804,12 +3803,12 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page, if (!PageSwapCache(page)) { int ret; - ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, memcgp, oom); + ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, memcgp, true); if (ret == -EINTR) ret = 0; return ret; } - return __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, memcgp, oom); + return __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, memcgp, true); } void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg) diff --git a/mm/memory.c b/mm/memory.c index afad903..6891d3b 100644 
--- a/mm/memory.c +++ b/mm/memory.c @@ -2991,7 +2991,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma, } } - if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr, true)) { + if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr)) { ret = VM_FAULT_OOM; goto out_page; } diff --git a/mm/swapfile.c b/mm/swapfile.c index 8ec511e..2f8e429 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -828,7 +828,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd, int ret = 1; if (mem_cgroup_try_charge_swapin(vma->vm_mm, page, - GFP_KERNEL, &memcg, true)) { + GFP_KERNEL, &memcg)) { ret = -ENOMEM; goto out_nolock; } -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755047Ab2K1Qs2 (ORCPT ); Wed, 28 Nov 2012 11:48:28 -0500 Received: from cantor2.suse.de ([195.135.220.15]:55830 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753857Ab2K1Qs0 (ORCPT ); Wed, 28 Nov 2012 11:48:26 -0500 Date: Wed, 28 Nov 2012 17:48:24 +0100 From: Michal Hocko To: Johannes Weiner Cc: KAMEZAWA Hiroyuki , azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist Subject: Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121128164824.GC22201@dhcp22.suse.cz> References: <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <50B403CA.501@jp.fujitsu.com> <20121127194813.GP24381@cmpxchg.org> <20121127205431.GA2433@dhcp22.suse.cz> <20121127205944.GB2433@dhcp22.suse.cz> <20121128152631.GT24381@cmpxchg.org> <20121128160447.GH12309@dhcp22.suse.cz> <20121128163736.GV24381@cmpxchg.org> <20121128164640.GB22201@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121128164640.GB22201@dhcp22.suse.cz> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: 
X-Mailing-List: linux-kernel@vger.kernel.org On Wed 28-11-12 17:46:40, Michal Hocko wrote: > On Wed 28-11-12 11:37:36, Johannes Weiner wrote: > > On Wed, Nov 28, 2012 at 05:04:47PM +0100, Michal Hocko wrote: > > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > > > index 095d2b4..5abe441 100644 > > > --- a/include/linux/memcontrol.h > > > +++ b/include/linux/memcontrol.h > > > @@ -57,13 +57,14 @@ extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm, > > > gfp_t gfp_mask); > > > /* for swap handling */ > > > extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm, > > > - struct page *page, gfp_t mask, struct mem_cgroup **memcgp); > > > + struct page *page, gfp_t mask, struct mem_cgroup **memcgp, > > > + bool oom); > > > > Ok, now I feel almost bad for asking, but why the public interface, > > too? > > Would it work out if I tell it was to double check that your review > quality is not decreased after that many revisions? :P > > Incremental update and the full patch in the reply --- >>From e21bb704947e9a477ec1df9121575c606dbfcb52 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Wed, 28 Nov 2012 17:46:32 +0100 Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked memcg oom killer might deadlock if the process which falls down to mem_cgroup_handle_oom holds a lock which prevents other task to terminate because it is blocked on the very same lock. This can happen when a write system call needs to allocate a page but the allocation hits the memcg hard limit and there is nothing to reclaim (e.g. there is no swap or swap limit is hit as well and all cache pages have been reclaimed already) and the process selected by memcg OOM killer is blocked on i_mutex on the same inode (e.g. truncate it). 
Process A [] do_truncate+0x58/0xa0 # takes i_mutex [] do_last+0x250/0xa30 [] path_openat+0xd7/0x440 [] do_filp_open+0x49/0xa0 [] do_sys_open+0x106/0x240 [] sys_open+0x20/0x30 [] system_call_fastpath+0x18/0x1d [] 0xffffffffffffffff Process B [] mem_cgroup_handle_oom+0x241/0x3b0 [] T.1146+0x5ab/0x5c0 [] mem_cgroup_cache_charge+0xbe/0xe0 [] add_to_page_cache_locked+0x4c/0x140 [] add_to_page_cache_lru+0x22/0x50 [] grab_cache_page_write_begin+0x8b/0xe0 [] ext3_write_begin+0x88/0x270 [] generic_file_buffered_write+0x116/0x290 [] __generic_file_aio_write+0x27c/0x480 [] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex [] do_sync_write+0xea/0x130 [] vfs_write+0xf3/0x1f0 [] sys_write+0x51/0x90 [] system_call_fastpath+0x18/0x1d [] 0xffffffffffffffff This is not a hard deadlock though because administrator can still intervene and increase the limit on the group which helps the writer to finish the allocation and release the lock. This patch heals the problem by forbidding OOM from page cache charges (namely add_to_page_cache_locked). mem_cgroup_cache_charge grows oom argument which is pushed down the call chain. As a possibly visible result add_to_page_cache_lru might fail more often with ENOMEM but this is to be expected if the limit is set and it is preferable to the OOM killer IMO. Changes since v1 - do not abuse gfp_flags and rather use oom parameter directly as per Johannes - handle also shmem write faults resp.
fallocate properly as per Johannes Reported-by: azurIt Signed-off-by: Michal Hocko --- include/linux/memcontrol.h | 5 +++-- mm/filemap.c | 9 +++++++-- mm/memcontrol.c | 20 ++++++++++---------- mm/shmem.c | 17 ++++++++++++++--- 4 files changed, 34 insertions(+), 17 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 095d2b4..8f48d5e 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -63,7 +63,7 @@ extern void mem_cgroup_commit_charge_swapin(struct page *page, extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg); extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, - gfp_t gfp_mask); + gfp_t gfp_mask, bool oom); struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *); struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *); @@ -210,7 +210,8 @@ static inline int mem_cgroup_newpage_charge(struct page *page, } static inline int mem_cgroup_cache_charge(struct page *page, - struct mm_struct *mm, gfp_t gfp_mask) + struct mm_struct *mm, gfp_t gfp_mask, + bool oom) { return 0; } diff --git a/mm/filemap.c b/mm/filemap.c index 83efee7..ef8fbd5 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -447,8 +447,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, VM_BUG_ON(!PageLocked(page)); VM_BUG_ON(PageSwapBacked(page)); - error = mem_cgroup_cache_charge(page, current->mm, - gfp_mask & GFP_RECLAIM_MASK); + /* + * Cannot trigger OOM even if gfp_mask would allow that normally + * because we might be called from a locked context and that + * could lead to deadlocks if the killed process is waiting for + * the same lock. 
+ */ + error = mem_cgroup_cache_charge(page, current->mm, gfp_mask, false); if (error) goto out; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 02ee2f7..3c9b1c5 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3709,11 +3709,10 @@ out: * < 0 if the cgroup is over its limit */ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm, - gfp_t gfp_mask, enum charge_type ctype) + gfp_t gfp_mask, enum charge_type ctype, bool oom) { struct mem_cgroup *memcg = NULL; unsigned int nr_pages = 1; - bool oom = true; int ret; if (PageTransHuge(page)) { @@ -3742,7 +3741,7 @@ int mem_cgroup_newpage_charge(struct page *page, VM_BUG_ON(page->mapping && !PageAnon(page)); VM_BUG_ON(!mm); return mem_cgroup_charge_common(page, mm, gfp_mask, - MEM_CGROUP_CHARGE_TYPE_ANON); + MEM_CGROUP_CHARGE_TYPE_ANON, true); } /* @@ -3754,7 +3753,8 @@ int mem_cgroup_newpage_charge(struct page *page, static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page, gfp_t mask, - struct mem_cgroup **memcgp) + struct mem_cgroup **memcgp, + bool oom) { struct mem_cgroup *memcg; struct page_cgroup *pc; @@ -3776,13 +3776,13 @@ static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm, if (!memcg) goto charge_cur_mm; *memcgp = memcg; - ret = __mem_cgroup_try_charge(NULL, mask, 1, memcgp, true); + ret = __mem_cgroup_try_charge(NULL, mask, 1, memcgp, oom); css_put(&memcg->css); if (ret == -EINTR) ret = 0; return ret; charge_cur_mm: - ret = __mem_cgroup_try_charge(mm, mask, 1, memcgp, true); + ret = __mem_cgroup_try_charge(mm, mask, 1, memcgp, oom); if (ret == -EINTR) ret = 0; return ret; @@ -3808,7 +3808,7 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page, ret = 0; return ret; } - return __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, memcgp); + return __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, memcgp, true); } void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg) @@ -3851,7 +3851,7 @@ void 
mem_cgroup_commit_charge_swapin(struct page *page, } int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, - gfp_t gfp_mask) + gfp_t gfp_mask, bool oom) { struct mem_cgroup *memcg = NULL; enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE; @@ -3863,10 +3863,10 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, return 0; if (!PageSwapCache(page)) - ret = mem_cgroup_charge_common(page, mm, gfp_mask, type); + ret = mem_cgroup_charge_common(page, mm, gfp_mask, type, oom); else { /* page is swapcache/shmem */ ret = __mem_cgroup_try_charge_swapin(mm, page, - gfp_mask, &memcg); + gfp_mask, &memcg, oom); if (!ret) __mem_cgroup_commit_charge_swapin(page, memcg, type); } diff --git a/mm/shmem.c b/mm/shmem.c index 55054a7..3b27db4 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -760,7 +760,7 @@ int shmem_unuse(swp_entry_t swap, struct page *page) * the shmem_swaplist_mutex which might hold up shmem_writepage(). * Charged back to the user (not to caller) when swap account is used. */ - error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL); + error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL, true); if (error) goto out; /* No radix_tree_preload: swap entry keeps a place for page in tree */ @@ -1152,8 +1152,17 @@ repeat: goto failed; } + /* + * Cannot trigger OOM even if gfp_mask would allow that + * normally because we might be called from a locked + * context (i_mutex held) if this is a write lock or + * fallocate and that could lead to deadlocks if the + * killed process is waiting for the same lock. 
+ */ error = mem_cgroup_cache_charge(page, current->mm, - gfp & GFP_RECLAIM_MASK); + gfp & GFP_RECLAIM_MASK, + sgp != SGP_WRITE && + sgp != SGP_FALLOC); if (!error) { error = shmem_add_to_page_cache(page, mapping, index, gfp, swp_to_radix_entry(swap)); @@ -1209,7 +1218,9 @@ repeat: SetPageSwapBacked(page); __set_page_locked(page); error = mem_cgroup_cache_charge(page, current->mm, - gfp & GFP_RECLAIM_MASK); + gfp & GFP_RECLAIM_MASK, + sgp != SGP_WRITE && + sgp != SGP_FALLOC); if (error) goto decused; error = radix_tree_preload(gfp & GFP_RECLAIM_MASK); -- 1.7.10.4 -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755756Ab2K1Sot (ORCPT ); Wed, 28 Nov 2012 13:44:49 -0500 Received: from zene.cmpxchg.org ([85.214.230.12]:35302 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754712Ab2K1Soq (ORCPT ); Wed, 28 Nov 2012 13:44:46 -0500 Date: Wed, 28 Nov 2012 13:44:33 -0500 From: Johannes Weiner To: Michal Hocko Cc: KAMEZAWA Hiroyuki , azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist Subject: Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121128184433.GH2301@cmpxchg.org> References: <20121126131837.GC17860@dhcp22.suse.cz> <50B403CA.501@jp.fujitsu.com> <20121127194813.GP24381@cmpxchg.org> <20121127205431.GA2433@dhcp22.suse.cz> <20121127205944.GB2433@dhcp22.suse.cz> <20121128152631.GT24381@cmpxchg.org> <20121128160447.GH12309@dhcp22.suse.cz> <20121128163736.GV24381@cmpxchg.org> <20121128164640.GB22201@dhcp22.suse.cz> <20121128164824.GC22201@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121128164824.GC22201@dhcp22.suse.cz> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Nov 28, 2012 at 05:48:24PM +0100, Michal Hocko wrote: > On Wed 28-11-12 
17:46:40, Michal Hocko wrote: > > On Wed 28-11-12 11:37:36, Johannes Weiner wrote: > > > On Wed, Nov 28, 2012 at 05:04:47PM +0100, Michal Hocko wrote: > > > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > > > > index 095d2b4..5abe441 100644 > > > > --- a/include/linux/memcontrol.h > > > > +++ b/include/linux/memcontrol.h > > > > @@ -57,13 +57,14 @@ extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm, > > > > gfp_t gfp_mask); > > > > /* for swap handling */ > > > > extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm, > > > > - struct page *page, gfp_t mask, struct mem_cgroup **memcgp); > > > > + struct page *page, gfp_t mask, struct mem_cgroup **memcgp, > > > > + bool oom); > > > > > > Ok, now I feel almost bad for asking, but why the public interface, > > > too? > > > > Would it work out if I tell it was to double check that your review > > quality is not decreased after that many revisions? :P Deal. > >From e21bb704947e9a477ec1df9121575c606dbfcb52 Mon Sep 17 00:00:00 2001 > From: Michal Hocko > Date: Wed, 28 Nov 2012 17:46:32 +0100 > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked > > memcg oom killer might deadlock if the process which falls down to > mem_cgroup_handle_oom holds a lock which prevents other task to > terminate because it is blocked on the very same lock. > This can happen when a write system call needs to allocate a page but > the allocation hits the memcg hard limit and there is nothing to reclaim > (e.g. there is no swap or swap limit is hit as well and all cache pages > have been reclaimed already) and the process selected by memcg OOM > killer is blocked on i_mutex on the same inode (e.g. truncate it). 
> > Process A > [] do_truncate+0x58/0xa0 # takes i_mutex > [] do_last+0x250/0xa30 > [] path_openat+0xd7/0x440 > [] do_filp_open+0x49/0xa0 > [] do_sys_open+0x106/0x240 > [] sys_open+0x20/0x30 > [] system_call_fastpath+0x18/0x1d > [] 0xffffffffffffffff > > Process B > [] mem_cgroup_handle_oom+0x241/0x3b0 > [] T.1146+0x5ab/0x5c0 > [] mem_cgroup_cache_charge+0xbe/0xe0 > [] add_to_page_cache_locked+0x4c/0x140 > [] add_to_page_cache_lru+0x22/0x50 > [] grab_cache_page_write_begin+0x8b/0xe0 > [] ext3_write_begin+0x88/0x270 > [] generic_file_buffered_write+0x116/0x290 > [] __generic_file_aio_write+0x27c/0x480 > [] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex > [] do_sync_write+0xea/0x130 > [] vfs_write+0xf3/0x1f0 > [] sys_write+0x51/0x90 > [] system_call_fastpath+0x18/0x1d > [] 0xffffffffffffffff > > This is not a hard deadlock though because administrator can still > intervene and increase the limit on the group which helps the writer to > finish the allocation and release the lock. > > This patch heals the problem by forbidding OOM from page cache charges > (namely add_to_page_cache_locked). mem_cgroup_cache_charge grows oom > argument which is pushed down the call chain. > > As a possibly visible result add_to_page_cache_lru might fail more often > with ENOMEM but this is to be expected if the limit is set and it is > preferable to the OOM killer IMO. > > Changes since v1 > - do not abuse gfp_flags and rather use oom parameter directly as per > Johannes > - handle also shmem write faults resp. fallocate properly as per Johannes > > Reported-by: azurIt > Signed-off-by: Michal Hocko Acked-by: Johannes Weiner Thanks, Michal!
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755992Ab2K1UUp (ORCPT ); Wed, 28 Nov 2012 15:20:45 -0500 Received: from mail-qc0-f174.google.com ([209.85.216.174]:51242 "EHLO mail-qc0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755640Ab2K1UUm (ORCPT ); Wed, 28 Nov 2012 15:20:42 -0500 Date: Wed, 28 Nov 2012 12:20:44 -0800 (PST) From: Hugh Dickins X-X-Sender: hugh@eggly.anvils To: Michal Hocko cc: Johannes Weiner , KAMEZAWA Hiroyuki , azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist Subject: Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked In-Reply-To: <20121128164824.GC22201@dhcp22.suse.cz> Message-ID: References: <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <50B403CA.501@jp.fujitsu.com> <20121127194813.GP24381@cmpxchg.org> <20121127205431.GA2433@dhcp22.suse.cz> <20121127205944.GB2433@dhcp22.suse.cz> <20121128152631.GT24381@cmpxchg.org> <20121128160447.GH12309@dhcp22.suse.cz> <20121128163736.GV24381@cmpxchg.org> <20121128164640.GB22201@dhcp22.suse.cz> <20121128164824.GC22201@dhcp22.suse.cz> User-Agent: Alpine 2.00 (LNX 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, 28 Nov 2012, Michal Hocko wrote: > From e21bb704947e9a477ec1df9121575c606dbfcb52 Mon Sep 17 00:00:00 2001 > From: Michal Hocko > Date: Wed, 28 Nov 2012 17:46:32 +0100 > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked > > memcg oom killer might deadlock if the process which falls down to > mem_cgroup_handle_oom holds a lock which prevents other task to > terminate because it is blocked on the very same lock. 
> This can happen when a write system call needs to allocate a page but
> the allocation hits the memcg hard limit and there is nothing to reclaim
> (e.g. there is no swap or swap limit is hit as well and all cache pages
> have been reclaimed already) and the process selected by memcg OOM
> killer is blocked on i_mutex on the same inode (e.g. truncate it).
>
> Process A
> [] do_truncate+0x58/0xa0		# takes i_mutex
> [] do_last+0x250/0xa30
> [] path_openat+0xd7/0x440
> [] do_filp_open+0x49/0xa0
> [] do_sys_open+0x106/0x240
> [] sys_open+0x20/0x30
> [] system_call_fastpath+0x18/0x1d
> [] 0xffffffffffffffff
>
> Process B
> [] mem_cgroup_handle_oom+0x241/0x3b0
> [] T.1146+0x5ab/0x5c0
> [] mem_cgroup_cache_charge+0xbe/0xe0
> [] add_to_page_cache_locked+0x4c/0x140
> [] add_to_page_cache_lru+0x22/0x50
> [] grab_cache_page_write_begin+0x8b/0xe0
> [] ext3_write_begin+0x88/0x270
> [] generic_file_buffered_write+0x116/0x290
> [] __generic_file_aio_write+0x27c/0x480
> [] generic_file_aio_write+0x76/0xf0	# takes ->i_mutex
> [] do_sync_write+0xea/0x130
> [] vfs_write+0xf3/0x1f0
> [] sys_write+0x51/0x90
> [] system_call_fastpath+0x18/0x1d
> [] 0xffffffffffffffff
>
> This is not a hard deadlock though because the administrator can still
> intervene and increase the limit on the group which helps the writer to
> finish the allocation and release the lock.
>
> This patch heals the problem by forbidding OOM from page cache charges
> (namely add_to_page_cache_locked). mem_cgroup_cache_charge grows an oom
> argument which is pushed down the call chain.
>
> As a possibly visible result add_to_page_cache_lru might fail more often
> with ENOMEM but this is to be expected if the limit is set and it is
> preferable to the OOM killer IMO.
>
> Changes since v1
> - do not abuse gfp_flags and rather use oom parameter directly as per
>   Johannes
> - handle also shmem write faults resp.
fallocate properly as per Johannes > > Reported-by: azurIt > Signed-off-by: Michal Hocko Sorry, Michal, you've laboured hard on this: but I dislike it so much that I'm here overcoming my dread of entering an OOM-killer discussion, and the resultant deluge of unwelcome CCs for eternity afterwards. I had been relying on Johannes to repeat his "This issue has been around for a while so frankly I don't think it's urgent enough to rush things", but it looks like I have to be the one to repeat it. Your analysis of azurIt's traces may well be correct, and this patch may indeed ameliorate the situation, and it's fine as something for azurIt to try and report on and keep in his tree; but I hope that it does not go upstream and to stable. Why do I dislike it so much? I suppose because it's both too general and too limited at the same time. Too general in that it changes the behaviour on OOM for a large set of memcg charges, all those that go through add_to_page_cache_locked(), when only a subset of those have the i_mutex issue. If you're going to be that general, why not go further? Leave the mem_cgroup_cache_charge() interface as is, make it not-OOM internally, no need for SGP_WRITE,SGP_FALLOC distinctions in mm/shmem.c. No other filesystem gets the benefit of those distinctions: isn't it better to keep it simple? (And I can see a partial truncation case where shmem uses SGP_READ under i_mutex; and the change to shmem_unuse behaviour is a non-issue, since swapoff invites itself to be killed anyway.) Too limited in that i_mutex is just the held resource which azurIt's traces have led you to, but it's a general problem that the OOM-killed task might be waiting for a resource that the OOM-killing task holds. I suspect that if we try hard enough (I admit I have not), we can find an example of such a potential deadlock for almost every memcg charge site. mmap_sem? 
not as easy to invent a case with that as I thought,
since it needs a down_write, and the typical page allocations happen
with down_read, and I can't think of a process which does down_write
on another's mm.

But i_mutex is always good, once you remember the case of write to
file from userspace page which got paged out, so the fault path has
to read it back in, while i_mutex is still held at the outer level.
An unusual case? Well, normally yes, but we're considering
out-of-memory conditions, which may converge upon cases like this.

Wouldn't it be nice if I could be constructive? But I'm sceptical
even of Johannes's faith in what the global OOM killer would do:
how does __alloc_pages_slowpath() get out of its "goto restart"
loop, excepting the trivial case when the killer is the killed?

I wonder why this issue has hit azurIt and no other reporter?
No swap plays a part in it, but that's not so unusual.

Yours glOOMily,
Hugh

> ---
>  include/linux/memcontrol.h |    5 +++--
>  mm/filemap.c               |    9 +++++++--
>  mm/memcontrol.c            |   20 ++++++++++----------
>  mm/shmem.c                 |   17 ++++++++++++++---
>  4 files changed, 34 insertions(+), 17 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 095d2b4..8f48d5e 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -63,7 +63,7 @@ extern void mem_cgroup_commit_charge_swapin(struct page *page,
>  extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg);
>
>  extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> -					gfp_t gfp_mask);
> +					gfp_t gfp_mask, bool oom);
>
>  struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
>  struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
> @@ -210,7 +210,8 @@ static inline int mem_cgroup_newpage_charge(struct page *page,
>  }
>
>  static inline int mem_cgroup_cache_charge(struct page *page,
> -					struct mm_struct *mm, gfp_t gfp_mask)
> +					struct mm_struct *mm, gfp_t gfp_mask,
> +					bool oom)
>  {
>  	return 0;
>  }
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 83efee7..ef8fbd5 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -447,8 +447,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
>  	VM_BUG_ON(!PageLocked(page));
>  	VM_BUG_ON(PageSwapBacked(page));
>
> -	error = mem_cgroup_cache_charge(page, current->mm,
> -					gfp_mask & GFP_RECLAIM_MASK);
> +	/*
> +	 * Cannot trigger OOM even if gfp_mask would allow that normally
> +	 * because we might be called from a locked context and that
> +	 * could lead to deadlocks if the killed process is waiting for
> +	 * the same lock.
> +	 */
> +	error = mem_cgroup_cache_charge(page, current->mm, gfp_mask, false);
>  	if (error)
>  		goto out;
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 02ee2f7..3c9b1c5 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3709,11 +3709,10 @@ out:
>   * < 0 if the cgroup is over its limit
>   */
>  static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
> -				gfp_t gfp_mask, enum charge_type ctype)
> +				gfp_t gfp_mask, enum charge_type ctype, bool oom)
>  {
>  	struct mem_cgroup *memcg = NULL;
>  	unsigned int nr_pages = 1;
> -	bool oom = true;
>  	int ret;
>
>  	if (PageTransHuge(page)) {
> @@ -3742,7 +3741,7 @@ int mem_cgroup_newpage_charge(struct page *page,
>  	VM_BUG_ON(page->mapping && !PageAnon(page));
>  	VM_BUG_ON(!mm);
>  	return mem_cgroup_charge_common(page, mm, gfp_mask,
> -					MEM_CGROUP_CHARGE_TYPE_ANON);
> +					MEM_CGROUP_CHARGE_TYPE_ANON, true);
>  }
>
>  /*
> @@ -3754,7 +3753,8 @@ int mem_cgroup_newpage_charge(struct page *page,
>  static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm,
>  					  struct page *page,
>  					  gfp_t mask,
> -					  struct mem_cgroup **memcgp)
> +					  struct mem_cgroup **memcgp,
> +					  bool oom)
>  {
>  	struct mem_cgroup *memcg;
>  	struct page_cgroup *pc;
> @@ -3776,13 +3776,13 @@ static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm,
>  	if (!memcg)
>  		goto charge_cur_mm;
>  	*memcgp = memcg;
> -	ret = __mem_cgroup_try_charge(NULL, mask, 1, memcgp, true);
> +	ret = __mem_cgroup_try_charge(NULL, mask, 1, memcgp, oom);
>  	css_put(&memcg->css);
>  	if (ret == -EINTR)
>  		ret = 0;
>  	return ret;
>  charge_cur_mm:
> -	ret = __mem_cgroup_try_charge(mm, mask, 1, memcgp, true);
> +	ret = __mem_cgroup_try_charge(mm, mask, 1, memcgp, oom);
>  	if (ret == -EINTR)
>  		ret = 0;
>  	return ret;
> @@ -3808,7 +3808,7 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page,
>  			ret = 0;
>  		return ret;
>  	}
> -	return __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, memcgp);
> +	return __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, memcgp, true);
>  }
>
>  void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg)
> @@ -3851,7 +3851,7 @@ void mem_cgroup_commit_charge_swapin(struct page *page,
>  }
>
>  int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> -			    gfp_t gfp_mask)
> +			    gfp_t gfp_mask, bool oom)
>  {
>  	struct mem_cgroup *memcg = NULL;
>  	enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE;
> @@ -3863,10 +3863,10 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
>  		return 0;
>
>  	if (!PageSwapCache(page))
> -		ret = mem_cgroup_charge_common(page, mm, gfp_mask, type);
> +		ret = mem_cgroup_charge_common(page, mm, gfp_mask, type, oom);
>  	else { /* page is swapcache/shmem */
>  		ret = __mem_cgroup_try_charge_swapin(mm, page,
> -						     gfp_mask, &memcg);
> +						     gfp_mask, &memcg, oom);
>  		if (!ret)
>  			__mem_cgroup_commit_charge_swapin(page, memcg, type);
>  	}
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 55054a7..3b27db4 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -760,7 +760,7 @@ int shmem_unuse(swp_entry_t swap, struct page *page)
>  	 * the shmem_swaplist_mutex which might hold up shmem_writepage().
>  	 * Charged back to the user (not to caller) when swap account is used.
>  	 */
> -	error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL);
> +	error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL, true);
>  	if (error)
>  		goto out;
>  	/* No radix_tree_preload: swap entry keeps a place for page in tree */
> @@ -1152,8 +1152,17 @@ repeat:
>  			goto failed;
>  		}
>
> +		/*
> +		 * Cannot trigger OOM even if gfp_mask would allow that
> +		 * normally because we might be called from a locked
> +		 * context (i_mutex held) if this is a write lock or
> +		 * fallocate and that could lead to deadlocks if the
> +		 * killed process is waiting for the same lock.
> +		 */
>  		error = mem_cgroup_cache_charge(page, current->mm,
> -						gfp & GFP_RECLAIM_MASK);
> +						gfp & GFP_RECLAIM_MASK,
> +						sgp != SGP_WRITE &&
> +						sgp != SGP_FALLOC);
>  		if (!error) {
>  			error = shmem_add_to_page_cache(page, mapping, index,
>  						gfp, swp_to_radix_entry(swap));
> @@ -1209,7 +1218,9 @@ repeat:
>  		SetPageSwapBacked(page);
>  		__set_page_locked(page);
>  		error = mem_cgroup_cache_charge(page, current->mm,
> -						gfp & GFP_RECLAIM_MASK);
> +						gfp & GFP_RECLAIM_MASK,
> +						sgp != SGP_WRITE &&
> +						sgp != SGP_FALLOC);
>  		if (error)
>  			goto decused;
>  		error = radix_tree_preload(gfp & GFP_RECLAIM_MASK);
> --
> 1.7.10.4
>
> --
> Michal Hocko
> SUSE Labs
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: email@kvack.org
>

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754756Ab2K2OFz (ORCPT ); Thu, 29 Nov 2012 09:05:55 -0500
Received: from cantor2.suse.de ([195.135.220.15]:48883 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752364Ab2K2OFv (ORCPT ); Thu, 29 Nov 2012 09:05:51 -0500
Date: Thu, 29 Nov 2012 15:05:49 +0100
From: Michal Hocko
To: Hugh Dickins
Cc: Johannes Weiner, KAMEZAWA Hiroyuki, azurIt, linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist
Subject: Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked
Message-ID: <20121129140549.GC27887@dhcp22.suse.cz>
References: <50B403CA.501@jp.fujitsu.com> <20121127194813.GP24381@cmpxchg.org> <20121127205431.GA2433@dhcp22.suse.cz> <20121127205944.GB2433@dhcp22.suse.cz> <20121128152631.GT24381@cmpxchg.org> <20121128160447.GH12309@dhcp22.suse.cz> <20121128163736.GV24381@cmpxchg.org> <20121128164640.GB22201@dhcp22.suse.cz> <20121128164824.GC22201@dhcp22.suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed 28-11-12 12:20:44, Hugh Dickins wrote:
[...]
> Sorry, Michal, you've laboured hard on this: but I dislike it so much
> that I'm here overcoming my dread of entering an OOM-killer discussion,
> and the resultant deluge of unwelcome CCs for eternity afterwards.
>
> I had been relying on Johannes to repeat his "This issue has been
> around for a while so frankly I don't think it's urgent enough to
> rush things", but it looks like I have to be the one to repeat it.

Well, the idea was to use this only as a temporary fix and come up with
a better solution without any hurry.
> Your analysis of azurIt's traces may well be correct, and this patch > may indeed ameliorate the situation, and it's fine as something for > azurIt to try and report on and keep in his tree; but I hope that > it does not go upstream and to stable. > > Why do I dislike it so much? I suppose because it's both too general > and too limited at the same time. > > Too general in that it changes the behaviour on OOM for a large set > of memcg charges, all those that go through add_to_page_cache_locked(), > when only a subset of those have the i_mutex issue. This is a fair point but the real fix which we were discussing with Johannes would be even more risky for stable. > If you're going to be that general, why not go further? Leave the > mem_cgroup_cache_charge() interface as is, make it not-OOM internally, > no need for SGP_WRITE,SGP_FALLOC distinctions in mm/shmem.c. No other > filesystem gets the benefit of those distinctions: isn't it better to > keep it simple? (And I can see a partial truncation case where shmem > uses SGP_READ under i_mutex; and the change to shmem_unuse behaviour > is a non-issue, since swapoff invites itself to be killed anyway.) > > Too limited in that i_mutex is just the held resource which azurIt's > traces have led you to, but it's a general problem that the OOM-killed > task might be waiting for a resource that the OOM-killing task holds. > > I suspect that if we try hard enough (I admit I have not), we can find > an example of such a potential deadlock for almost every memcg charge > site. mmap_sem? not as easy to invent a case with that as I thought, > since it needs a down_write, and the typical page allocations happen > with down_read, and I can't think of a process which does down_write > on another's mm. > > But i_mutex is always good, once you remember the case of write to > file from userspace page which got paged out, so the fault path has > to read it back in, while i_mutex is still held at the outer level. > An unusual case? 
Well, normally yes, but we're considering
> out-of-memory conditions, which may converge upon cases like this.
>
> Wouldn't it be nice if I could be constructive? But I'm sceptical
> even of Johannes's faith in what the global OOM killer would do:
> how does __alloc_pages_slowpath() get out of its "goto restart"
> loop, excepting the trivial case when the killer is the killed?

I am not sure I am following you here, but Johannes's idea was to break
out of the charge after a signal has been sent and the charge still
fails, and then either retry the fault or fail the allocation. I think
this should work but I am afraid that it needs some tuning (the number
of retries, for example) to prevent either a too aggressive OOM or too
many failures.

Do we have any other possibilities to solve this issue? Or do you think
we should ignore the problem just because nobody complained for such a
long time?
Dunno, I think we should fix this with something less risky for now and
come up with a real fix after it sees sufficient testing.

> I wonder why this issue has hit azurIt and no other reporter?
> No swap plays a part in it, but that's not so unusual.
>
> Yours glOOMily,
> Hugh
[...]
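The retry idea discussed above could look roughly like the following userspace sketch. It is an assumption-laden toy, not the approach that was eventually implemented: toy_try_charge_fn, toy_charge_bounded and toy_fits_after are hypothetical names, and max_retries stands for the tuning knob whose right value is exactly the open question.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback: returns true when the charge fits under the limit. */
typedef bool (*toy_try_charge_fn)(void *ctx);

/*
 * Toy version of "signal a victim, then give up instead of sleeping forever".
 * One initial attempt plus at most max_retries further attempts are made.
 */
static int toy_charge_bounded(toy_try_charge_fn try_charge, void *ctx,
			      unsigned int max_retries, bool *victim_signalled)
{
	unsigned int attempt;

	assert(try_charge != NULL && victim_signalled != NULL);
	*victim_signalled = false;

	for (attempt = 0; attempt <= max_retries; attempt++) {
		if (try_charge(ctx))
			return 0;
		/* A victim has been signalled; do not block waiting for it. */
		*victim_signalled = true;
	}
	/* Still over the limit: the caller retries the fault or sees ENOMEM. */
	return -ENOMEM;
}

/* Example callback: the charge starts to fit after *ctx failed attempts. */
static bool toy_fits_after(void *ctx)
{
	unsigned int *failures_left = ctx;

	if (*failures_left == 0)
		return true;
	(*failures_left)--;
	return false;
}
```

A small max_retries makes spurious ENOMEM failures more likely, while a large one approaches the original behaviour of waiting on the victim, which is the trade-off the tuning concern above refers to.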
-- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755538Ab2K3BpS (ORCPT ); Thu, 29 Nov 2012 20:45:18 -0500 Received: from gmmr7.centrum.cz ([46.255.225.249]:38212 "EHLO gmmr7.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751303Ab2K3BpQ (ORCPT ); Thu, 29 Nov 2012 20:45:16 -0500 To: =?utf-8?q?Michal_Hocko?= Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_from_add=5Fto=5Fpage=5Fcache=5Flocked?= Date: Fri, 30 Nov 2012 02:45:12 +0100 From: "azurIt" Cc: , , =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= References: <20121122214249.GA20319@dhcp22.suse.cz>, <20121122233434.3D5E35E6@pobox.sk>, <20121123074023.GA24698@dhcp22.suse.cz>, <20121123102137.10D6D653@pobox.sk>, <20121123100438.GF24698@dhcp22.suse.cz>, <20121125011047.7477BB5E@pobox.sk>, <20121125120524.GB10623@dhcp22.suse.cz>, <20121125135542.GE10623@dhcp22.suse.cz>, <20121126013855.AF118F5E@pobox.sk>, <20121126131837.GC17860@dhcp22.suse.cz> <20121126132149.GD17860@dhcp22.suse.cz> In-Reply-To: <20121126132149.GD17860@dhcp22.suse.cz> X-Mailer: Centrum Email 5.3 X-Priority: 3 X-Original-From: azurit@pobox.sk MIME-Version: 1.0 Message-Id: <20121130024512.EBFBD851@pobox.sk> X-Maser: Georgo Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org >Here we go with the patch for 3.2.34. Could you test with this one, >please? I installed kernel with this patch, will report back if problem occurs again OR in few weeks if everything will be ok. Thank you! 
azurIt From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755780Ab2K3C3X (ORCPT ); Thu, 29 Nov 2012 21:29:23 -0500 Received: from gmmr3.centrum.cz ([46.255.225.251]:57425 "EHLO gmmr3.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755115Ab2K3C3V (ORCPT ); Thu, 29 Nov 2012 21:29:21 -0500 To: =?utf-8?q?Michal_Hocko?= Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_from_add=5Fto=5Fpage=5Fcache=5Flocked?= Date: Fri, 30 Nov 2012 03:29:18 +0100 From: "azurIt" Cc: , , =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= References: <20121122214249.GA20319@dhcp22.suse.cz>, <20121122233434.3D5E35E6@pobox.sk>, <20121123074023.GA24698@dhcp22.suse.cz>, <20121123102137.10D6D653@pobox.sk>, <20121123100438.GF24698@dhcp22.suse.cz>, <20121125011047.7477BB5E@pobox.sk>, <20121125120524.GB10623@dhcp22.suse.cz>, <20121125135542.GE10623@dhcp22.suse.cz>, <20121126013855.AF118F5E@pobox.sk>, <20121126131837.GC17860@dhcp22.suse.cz> <20121126132149.GD17860@dhcp22.suse.cz> In-Reply-To: <20121126132149.GD17860@dhcp22.suse.cz> X-Mailer: Centrum Email 5.3 X-Priority: 3 X-Original-From: azurit@pobox.sk MIME-Version: 1.0 Message-Id: <20121130032918.59B3F780@pobox.sk> X-Maser: brud Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org >Here we go with the patch for 3.2.34. Could you test with this one, >please? Michal, unfortunately i had to boot to another kernel because the one with this patch keeps killing my MySQL server :( it was, probably, doing it on OOM in any cgroup - looks like OOM was not choosing processes only from cgroup which is out of memory. 
Here is the log from syslog: http://www.watchdog.sk/lkml/oom_mysqld Maybe i should mention that MySQL server has it's own cgroup (called 'mysql') but with no limits to any resources. azurIt From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758226Ab2K3MpL (ORCPT ); Fri, 30 Nov 2012 07:45:11 -0500 Received: from cantor2.suse.de ([195.135.220.15]:39399 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756630Ab2K3MpI (ORCPT ); Fri, 30 Nov 2012 07:45:08 -0500 Date: Fri, 30 Nov 2012 13:45:06 +0100 From: Michal Hocko To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121130124506.GH29317@dhcp22.suse.cz> References: <20121123074023.GA24698@dhcp22.suse.cz> <20121123102137.10D6D653@pobox.sk> <20121123100438.GF24698@dhcp22.suse.cz> <20121125011047.7477BB5E@pobox.sk> <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <20121126132149.GD17860@dhcp22.suse.cz> <20121130032918.59B3F780@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121130032918.59B3F780@pobox.sk> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri 30-11-12 03:29:18, azurIt wrote: > >Here we go with the patch for 3.2.34. Could you test with this one, > >please? > > > Michal, unfortunately i had to boot to another kernel because the one > with this patch keeps killing my MySQL server :( it was, probably, > doing it on OOM in any cgroup - looks like OOM was not choosing > processes only from cgroup which is out of memory. 
Here is the log
> from syslog: http://www.watchdog.sk/lkml/oom_mysqld

You are also seeing the global OOM killer:
Nov 30 02:53:56 server01 kernel: [ 818.233159] Pid: 9247, comm: apache2 Not tainted 3.2.34-grsec #1
Nov 30 02:53:56 server01 kernel: [ 818.233289] Call Trace:
Nov 30 02:53:56 server01 kernel: [ 818.233470] [] dump_header+0x7e/0x1e0
Nov 30 02:53:56 server01 kernel: [ 818.233600] [] ? find_lock_task_mm+0x2f/0x70
Nov 30 02:53:56 server01 kernel: [ 818.233721] [] oom_kill_process+0x85/0x2a0
Nov 30 02:53:56 server01 kernel: [ 818.233842] [] out_of_memory+0xe5/0x200
Nov 30 02:53:56 server01 kernel: [ 818.233963] [] ? pte_alloc_one+0x3f/0x50
Nov 30 02:53:56 server01 kernel: [ 818.234082] [] pagefault_out_of_memory+0xbd/0x110
Nov 30 02:53:56 server01 kernel: [ 818.234204] [] mm_fault_error+0xb6/0x1a0
Nov 30 02:53:56 server01 kernel: [ 818.235886] [] do_page_fault+0x3ee/0x460
Nov 30 02:53:56 server01 kernel: [ 818.236006] [] ? vma_merge+0x1f7/0x2c0
Nov 30 02:53:56 server01 kernel: [ 818.236124] [] ? do_brk+0x267/0x400
Nov 30 02:53:56 server01 kernel: [ 818.236244] [] ? gr_learn_resource+0x42/0x1e0
Nov 30 02:53:56 server01 kernel: [ 818.236367] [] page_fault+0x1f/0x30
[...]
Nov 30 02:53:56 server01 kernel: [ 818.356297] Out of memory: Kill process 2188 (mysqld) score 60 or sacrifice child
Nov 30 02:53:56 server01 kernel: [ 818.356493] Killed process 2188 (mysqld) total-vm:3330016kB, anon-rss:864176kB, file-rss:8072kB

Then you also have the memcg OOM killer:
Nov 30 02:53:56 server01 kernel: [ 818.375717] Task in /1037/uid killed as a result of limit of /1037
Nov 30 02:53:56 server01 kernel: [ 818.375886] memory: usage 102400kB, limit 102400kB, failcnt 736
Nov 30 02:53:56 server01 kernel: [ 818.376008] memory+swap: usage 102400kB, limit 102400kB, failcnt 0

The messages are intermixed and I guess rate limiting kicked in as
well, because I cannot associate all the OOM messages with a specific
OOM event.

Anyway your system is under both global and local memory pressure.
You didn't see apache going down previously because it was probably the one which was stuck and could be killed. Anyway you need to setup your system more carefully. > Maybe i should mention that MySQL server has it's own cgroup (called > 'mysql') but with no limits to any resources. Where is that group in the hierarchy? > > azurIt > -- > To unsubscribe from this list: send the line "unsubscribe cgroups" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752354Ab2K3Mxe (ORCPT ); Fri, 30 Nov 2012 07:53:34 -0500 Received: from gmmr7.centrum.cz ([46.255.225.249]:44086 "EHLO gmmr7.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751036Ab2K3Mxd (ORCPT ); Fri, 30 Nov 2012 07:53:33 -0500 To: =?utf-8?q?Michal_Hocko?= Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_from_add=5Fto=5Fpage=5Fcache=5Flocked?= Date: Fri, 30 Nov 2012 13:53:30 +0100 From: "azurIt" Cc: , , =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= References: <20121123074023.GA24698@dhcp22.suse.cz>, <20121123102137.10D6D653@pobox.sk>, <20121123100438.GF24698@dhcp22.suse.cz>, <20121125011047.7477BB5E@pobox.sk>, <20121125120524.GB10623@dhcp22.suse.cz>, <20121125135542.GE10623@dhcp22.suse.cz>, <20121126013855.AF118F5E@pobox.sk>, <20121126131837.GC17860@dhcp22.suse.cz>, <20121126132149.GD17860@dhcp22.suse.cz>, <20121130032918.59B3F780@pobox.sk> <20121130124506.GH29317@dhcp22.suse.cz> In-Reply-To: <20121130124506.GH29317@dhcp22.suse.cz> X-Mailer: Centrum Email 5.3 X-Priority: 3 X-Original-From: azurit@pobox.sk MIME-Version: 1.0 Message-Id: <20121130135330.6D012B71@pobox.sk> X-Maser: oho Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: 
linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org >Anyway your system is under both global and local memory pressure. You >didn't see apache going down previously because it was probably the one >which was stuck and could be killed. >Anyway you need to setup your system more carefully. No, it wasn't, i'm 1000% sure (i was on SSH). Here is the memory usage graph from that system on that time: http://www.watchdog.sk/lkml/memory.png The blank part is rebooting into new kernel. MySQL server was killed several times, then i rebooted into previous kernel and problem was gone (not a single MySQL kill). You can see two MySQL kills there on 03:54 and 03:04:30. > >> Maybe i should mention that MySQL server has it's own cgroup (called >> 'mysql') but with no limits to any resources. > >Where is that group in the hierarchy? In root. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933012Ab2K3Noc (ORCPT ); Fri, 30 Nov 2012 08:44:32 -0500 Received: from gmmr3.centrum.cz ([46.255.225.251]:55084 "EHLO gmmr3.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756623Ab2K3Noa (ORCPT ); Fri, 30 Nov 2012 08:44:30 -0500 To: =?utf-8?q?Michal_Hocko?= Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_from_add=5Fto=5Fpage=5Fcache=5Flocked?= Date: Fri, 30 Nov 2012 14:44:27 +0100 From: "azurIt" Cc: , , =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= References: <20121123074023.GA24698@dhcp22.suse.cz>, <20121123102137.10D6D653@pobox.sk>, <20121123100438.GF24698@dhcp22.suse.cz>, <20121125011047.7477BB5E@pobox.sk>, <20121125120524.GB10623@dhcp22.suse.cz>, <20121125135542.GE10623@dhcp22.suse.cz>, <20121126013855.AF118F5E@pobox.sk>, <20121126131837.GC17860@dhcp22.suse.cz>, <20121126132149.GD17860@dhcp22.suse.cz>, <20121130032918.59B3F780@pobox.sk> <20121130124506.GH29317@dhcp22.suse.cz> 
In-Reply-To: <20121130124506.GH29317@dhcp22.suse.cz> X-Mailer: Centrum Email 5.3 X-Priority: 3 X-Original-From: azurit@pobox.sk MIME-Version: 1.0 Message-Id: <20121130144427.51A09169@pobox.sk> X-Maser: Georgo Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org >Anyway your system is under both global and local memory pressure. You >didn't see apache going down previously because it was probably the one >which was stuck and could be killed. >Anyway you need to setup your system more carefully. There is, also, an evidence that system has enough of memory! :) Just take column 'rss' from process list in OOM message and sum it - you will get 2489911. It's probably in KB so it's about 2.4 GB. System has 14 GB of RAM so this also match data on my graph - 2.4 is about 17% of 14. azur From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1030461Ab2K3Ooe (ORCPT ); Fri, 30 Nov 2012 09:44:34 -0500 Received: from cantor2.suse.de ([195.135.220.15]:45357 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758383Ab2K3Ood (ORCPT ); Fri, 30 Nov 2012 09:44:33 -0500 Date: Fri, 30 Nov 2012 15:44:31 +0100 From: Michal Hocko To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121130144431.GI29317@dhcp22.suse.cz> References: <20121123100438.GF24698@dhcp22.suse.cz> <20121125011047.7477BB5E@pobox.sk> <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <20121126132149.GD17860@dhcp22.suse.cz> <20121130032918.59B3F780@pobox.sk> <20121130124506.GH29317@dhcp22.suse.cz> <20121130144427.51A09169@pobox.sk> 
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20121130144427.51A09169@pobox.sk>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri 30-11-12 14:44:27, azurIt wrote:
> >Anyway your system is under both global and local memory pressure. You
> >didn't see apache going down previously because it was probably the one
> >which was stuck and could be killed.
> >Anyway you need to setup your system more carefully.
>
>
> There is, also, an evidence that system has enough of memory! :) Just
> take column 'rss' from process list in OOM message and sum it - you
> will get 2489911. It's probably in KB so it's about 2.4 GB. System has
> 14 GB of RAM so this also match data on my graph - 2.4 is about 17% of
> 14.

Hmm, that corresponds to the ZONE_DMA32 size pretty nicely but that zone
is hardly touched:
Nov 30 02:53:56 server01 kernel: [ 818.241291] DMA32 free:2523636kB min:2672kB low:3340kB high:4008kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2542248kB mlocked:0kB dirty:0kB writeback:0kB mapped:4kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no

The DMA32 zone usually fills up the first 4G unless your HW remaps the
rest of the memory above 4G or you have a NUMA machine and the rest of
the memory is on another node. Could you post your memory map printed
during the boot?
(e820: BIOS-provided physical RAM map: and following lines) There is also ZONE_NORMAL which is also not used much Nov 30 02:53:56 server01 kernel: [ 818.242163] Normal free:6924716kB min:12512kB low:15640kB high:18768kB active_anon:1463128kB inactive_anon:2072kB active_file:1803964kB inactive_file:1072628kB unevictable:3924kB isolated(anon):0kB isolated(file):0kB present:11893760kB mlocked:3924kB dirty:1000kB writeback:776kB mapped:35656kB shmem:3828kB slab_reclaimable:202560kB slab_unreclaimable:50696kB kernel_stack:2944kB pagetables:158616kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no You have mentioned that you are comounting with cpuset. If this happens to be a NUMA machine have you made the access to all nodes available? Also what does /proc/sys/vm/zone_reclaim_mode says? -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1030534Ab2K3PDv (ORCPT ); Fri, 30 Nov 2012 10:03:51 -0500 Received: from cantor2.suse.de ([195.135.220.15]:46102 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1030520Ab2K3PDt (ORCPT ); Fri, 30 Nov 2012 10:03:49 -0500 Date: Fri, 30 Nov 2012 16:03:47 +0100 From: Michal Hocko To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121130150347.GJ29317@dhcp22.suse.cz> References: <20121125011047.7477BB5E@pobox.sk> <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <20121126132149.GD17860@dhcp22.suse.cz> <20121130032918.59B3F780@pobox.sk> <20121130124506.GH29317@dhcp22.suse.cz> <20121130144427.51A09169@pobox.sk> <20121130144431.GI29317@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii 
Content-Disposition: inline In-Reply-To: <20121130144431.GI29317@dhcp22.suse.cz> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri 30-11-12 15:44:31, Michal Hocko wrote: > On Fri 30-11-12 14:44:27, azurIt wrote: > > >Anyway your system is under both global and local memory pressure. You > > >didn't see apache going down previously because it was probably the one > > >which was stuck and could be killed. > > >Anyway you need to setup your system more carefully. > > > > > > There is, also, an evidence that system has enough of memory! :) Just > > take column 'rss' from process list in OOM message and sum it - you > > will get 2489911. It's probably in KB so it's about 2.4 GB. System has > > 14 GB of RAM so this also match data on my graph - 2.4 is about 17% of > > 14. > > Hmm, that corresponds to the ZONE_DMA32 size pretty nicely but that zone > is hardly touched: > Nov 30 02:53:56 server01 kernel: [ 818.241291] DMA32 free:2523636kB min:2672kB low:3340kB high:4008kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2542248kB mlocked:0kB dirty:0kB writeback:0kB mapped:4kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no > > DMA32 zone is usually fills up first 4G unless your HW remaps the rest > of the memory above 4G or you have a numa machine and the rest of the > memory is at other node. Could you post your memory map printed during > the boot? 
(e820: BIOS-provided physical RAM map: and following lines) > > There is also ZONE_NORMAL which is also not used much > Nov 30 02:53:56 server01 kernel: [ 818.242163] Normal free:6924716kB min:12512kB low:15640kB high:18768kB active_anon:1463128kB inactive_anon:2072kB active_file:1803964kB inactive_file:1072628kB unevictable:3924kB isolated(anon):0kB isolated(file):0kB present:11893760kB mlocked:3924kB dirty:1000kB writeback:776kB mapped:35656kB shmem:3828kB slab_reclaimable:202560kB slab_unreclaimable:50696kB kernel_stack:2944kB pagetables:158616kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no > > You have mentioned that you are comounting with cpuset. If this happens > to be a NUMA machine have you made the access to all nodes available? And now that I am looking at the oom message more closely I can see Nov 30 02:53:56 server01 kernel: [ 818.232812] apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0 Nov 30 02:53:56 server01 kernel: [ 818.233029] apache2 cpuset=uid mems_allowed=0 Nov 30 02:53:56 server01 kernel: [ 818.233159] Pid: 9247, comm: apache2 Not tainted 3.2.34-grsec #1 Nov 30 02:53:56 server01 kernel: [ 818.233289] Call Trace: Nov 30 02:53:56 server01 kernel: [ 818.233470] [] dump_header+0x7e/0x1e0 Nov 30 02:53:56 server01 kernel: [ 818.233600] [] ? find_lock_task_mm+0x2f/0x70 Nov 30 02:53:56 server01 kernel: [ 818.233721] [] oom_kill_process+0x85/0x2a0 Nov 30 02:53:56 server01 kernel: [ 818.233842] [] out_of_memory+0xe5/0x200 Nov 30 02:53:56 server01 kernel: [ 818.233963] [] ? pte_alloc_one+0x3f/0x50 Nov 30 02:53:56 server01 kernel: [ 818.234082] [] pagefault_out_of_memory+0xbd/0x110 Nov 30 02:53:56 server01 kernel: [ 818.234204] [] mm_fault_error+0xb6/0x1a0 Nov 30 02:53:56 server01 kernel: [ 818.235886] [] do_page_fault+0x3ee/0x460 Nov 30 02:53:56 server01 kernel: [ 818.236006] [] ? vma_merge+0x1f7/0x2c0 Nov 30 02:53:56 server01 kernel: [ 818.236124] [] ? 
do_brk+0x267/0x400 Nov 30 02:53:56 server01 kernel: [ 818.236244] [] ? gr_learn_resource+0x42/0x1e0 Nov 30 02:53:56 server01 kernel: [ 818.236367] [] page_fault+0x1f/0x30 Which is interesting from 2 perspectives. Only the first node (Node-0) is allowed which would suggest that the cpuset controller is not configured to all nodes. It is still surprising Node 0 wouldn't have any memory (I would expect ZONE_DMA32 would be sitting there). Anyway, the more interesting thing is gfp_mask is GFP_NOWAIT allocation from the page fault? Huh this shouldn't happen - ever. > Also what does /proc/sys/vm/zone_reclaim_mode says? > -- > Michal Hocko > SUSE Labs > -- > To unsubscribe from this list: send the line "unsubscribe cgroups" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1030611Ab2K3PIQ (ORCPT ); Fri, 30 Nov 2012 10:08:16 -0500 Received: from gmmr3.centrum.cz ([46.255.225.251]:36166 "EHLO gmmr3.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1030598Ab2K3PIO (ORCPT ); Fri, 30 Nov 2012 10:08:14 -0500 To: =?utf-8?q?Michal_Hocko?= Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_from_add=5Fto=5Fpage=5Fcache=5Flocked?= Date: Fri, 30 Nov 2012 16:08:11 +0100 From: "azurIt" Cc: , , =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= References: <20121123100438.GF24698@dhcp22.suse.cz>, <20121125011047.7477BB5E@pobox.sk>, <20121125120524.GB10623@dhcp22.suse.cz>, <20121125135542.GE10623@dhcp22.suse.cz>, <20121126013855.AF118F5E@pobox.sk>, <20121126131837.GC17860@dhcp22.suse.cz>, <20121126132149.GD17860@dhcp22.suse.cz>, <20121130032918.59B3F780@pobox.sk>, <20121130124506.GH29317@dhcp22.suse.cz>, <20121130144427.51A09169@pobox.sk> 
<20121130144431.GI29317@dhcp22.suse.cz> In-Reply-To: <20121130144431.GI29317@dhcp22.suse.cz> X-Mailer: Centrum Email 5.3 X-Priority: 3 X-Original-From: azurit@pobox.sk MIME-Version: 1.0 Message-Id: <20121130160811.6BB25BDD@pobox.sk> X-Maser: brud Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org >DMA32 zone is usually fills up first 4G unless your HW remaps the rest >of the memory above 4G or you have a numa machine and the rest of the >memory is at other node. Could you post your memory map printed during >the boot? (e820: BIOS-provided physical RAM map: and following lines) Here is the full boot log: www.watchdog.sk/lkml/kern.log >You have mentioned that you are comounting with cpuset. If this happens >to be a NUMA machine have you made the access to all nodes available? >Also what does /proc/sys/vm/zone_reclaim_mode says? Don't really know what NUMA means and which nodes are you talking about, sorry :( # cat /proc/sys/vm/zone_reclaim_mode cat: /proc/sys/vm/zone_reclaim_mode: No such file or directory azur From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1030834Ab2K3PhT (ORCPT ); Fri, 30 Nov 2012 10:37:19 -0500 Received: from cantor2.suse.de ([195.135.220.15]:47384 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1030753Ab2K3PhR (ORCPT ); Fri, 30 Nov 2012 10:37:17 -0500 Date: Fri, 30 Nov 2012 16:37:15 +0100 From: Michal Hocko To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121130153715.GK29317@dhcp22.suse.cz> References: <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> 
<20121126131837.GC17860@dhcp22.suse.cz> <20121126132149.GD17860@dhcp22.suse.cz> <20121130032918.59B3F780@pobox.sk> <20121130124506.GH29317@dhcp22.suse.cz> <20121130144427.51A09169@pobox.sk> <20121130144431.GI29317@dhcp22.suse.cz> <20121130150347.GJ29317@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121130150347.GJ29317@dhcp22.suse.cz> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri 30-11-12 16:03:47, Michal Hocko wrote: [...] > Anyway, the more interesting thing is gfp_mask is GFP_NOWAIT allocation > from the page fault? Huh this shouldn't happen - ever. OK, it starts making sense now. The message came from pagefault_out_of_memory, which no longer has the gfp mask or the required node information. This suggests that VM_FAULT_OOM has been returned by the fault handler. So this hasn't been triggered by the page fault allocator. I am wondering whether this could be caused by the patch, but the effect of that one should be limited to the write path (unlike the later version for the -mm tree, which hooks into shmem as well). Will have to think about it some more.
-- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1030876Ab2K3Pjr (ORCPT ); Fri, 30 Nov 2012 10:39:47 -0500 Received: from cantor2.suse.de ([195.135.220.15]:47488 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1030863Ab2K3Pjn (ORCPT ); Fri, 30 Nov 2012 10:39:43 -0500 Date: Fri, 30 Nov 2012 16:39:42 +0100 From: Michal Hocko To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121130153942.GL29317@dhcp22.suse.cz> References: <20121125120524.GB10623@dhcp22.suse.cz> <20121125135542.GE10623@dhcp22.suse.cz> <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <20121126132149.GD17860@dhcp22.suse.cz> <20121130032918.59B3F780@pobox.sk> <20121130124506.GH29317@dhcp22.suse.cz> <20121130144427.51A09169@pobox.sk> <20121130144431.GI29317@dhcp22.suse.cz> <20121130160811.6BB25BDD@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121130160811.6BB25BDD@pobox.sk> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri 30-11-12 16:08:11, azurIt wrote: > >The DMA32 zone usually fills up the first 4G unless your HW remaps the rest > >of the memory above 4G or you have a NUMA machine and the rest of the > >memory is on another node. Could you post your memory map printed during > >the boot? (e820: BIOS-provided physical RAM map: and following lines) > > > Here is the full boot log: > www.watchdog.sk/lkml/kern.log The log is not complete. Could you paste the complete dmesg output? Or even better, do you have logs from the previous run? > >You have mentioned that you are comounting with cpuset.
If this happens > >to be a NUMA machine have you made the access to all nodes available? > >Also what does /proc/sys/vm/zone_reclaim_mode says? > > > Don't really know what NUMA means and which nodes are you talking > about, sorry :( http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access > # cat /proc/sys/vm/zone_reclaim_mode > cat: /proc/sys/vm/zone_reclaim_mode: No such file or directory OK, so the NUMA is not enabled. -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1031018Ab2K3P7l (ORCPT ); Fri, 30 Nov 2012 10:59:41 -0500 Received: from gmmr8.centrum.cz ([46.255.227.254]:39879 "EHLO gmmr8.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1031001Ab2K3P7k (ORCPT ); Fri, 30 Nov 2012 10:59:40 -0500 To: =?utf-8?q?Michal_Hocko?= Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_from_add=5Fto=5Fpage=5Fcache=5Flocked?= Date: Fri, 30 Nov 2012 16:59:37 +0100 From: "azurIt" Cc: , , =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= References: <20121125120524.GB10623@dhcp22.suse.cz>, <20121125135542.GE10623@dhcp22.suse.cz>, <20121126013855.AF118F5E@pobox.sk>, <20121126131837.GC17860@dhcp22.suse.cz>, <20121126132149.GD17860@dhcp22.suse.cz>, <20121130032918.59B3F780@pobox.sk>, <20121130124506.GH29317@dhcp22.suse.cz>, <20121130144427.51A09169@pobox.sk>, <20121130144431.GI29317@dhcp22.suse.cz>, <20121130160811.6BB25BDD@pobox.sk> <20121130153942.GL29317@dhcp22.suse.cz> In-Reply-To: <20121130153942.GL29317@dhcp22.suse.cz> X-Mailer: Centrum Email 5.3 X-Priority: 3 X-Original-From: azurit@pobox.sk MIME-Version: 1.0 Message-Id: <20121130165937.F9564EBE@pobox.sk> X-Maser: Georgo Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org >> Here is the full boot log: >> 
www.watchdog.sk/lkml/kern.log > >The log is not complete. Could you paste the complete dmesg output? Or >even better, do you have logs from the previous run? What is missing there? All kernel messages are logged to /var/log/kern.log (it's the same as dmesg), dmesg itself was already overwritten by other messages. I think that is everything the kernel printed. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933221Ab2K3QT1 (ORCPT ); Fri, 30 Nov 2012 11:19:27 -0500 Received: from cantor2.suse.de ([195.135.220.15]:51057 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932517Ab2K3QTZ (ORCPT ); Fri, 30 Nov 2012 11:19:25 -0500 Date: Fri, 30 Nov 2012 17:19:23 +0100 From: Michal Hocko To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121130161923.GN29317@dhcp22.suse.cz> References: <20121126013855.AF118F5E@pobox.sk> <20121126131837.GC17860@dhcp22.suse.cz> <20121126132149.GD17860@dhcp22.suse.cz> <20121130032918.59B3F780@pobox.sk> <20121130124506.GH29317@dhcp22.suse.cz> <20121130144427.51A09169@pobox.sk> <20121130144431.GI29317@dhcp22.suse.cz> <20121130160811.6BB25BDD@pobox.sk> <20121130153942.GL29317@dhcp22.suse.cz> <20121130165937.F9564EBE@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121130165937.F9564EBE@pobox.sk> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri 30-11-12 16:59:37, azurIt wrote: > >> Here is the full boot log: > >> www.watchdog.sk/lkml/kern.log > > > >The log is not complete. Could you paste the complete dmesg output? Or > >even better, do you have logs from the previous run? > > > What is missing there?
All kernel messages are logged to > /var/log/kern.log (it's the same as dmesg), dmesg itself was already > overwritten by other messages. I think that is everything the kernel printed. Early boot messages are missing - so exactly the BIOS memory map I was asking for. As NUMA has been excluded it is probably not that relevant anymore. The important question is why you see VM_FAULT_OOM and whether a memcg charging failure can trigger that. I do not see how this could happen right now because __GFP_NORETRY is not used for user pages (except for THP, which disables memcg OOM already), and file backed page faults (aka __do_fault) use mem_cgroup_newpage_charge which doesn't disable OOM. This is a real head scratcher. Could you also post your complete containers configuration, maybe there is something strange in there (basically grep . -r YOUR_CGROUP_MNT except for tasks files which are of no use right now). -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758612Ab2K3Q1D (ORCPT ); Fri, 30 Nov 2012 11:27:03 -0500 Received: from gmmr3.centrum.cz ([46.255.225.251]:53654 "EHLO gmmr3.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751808Ab2K3Q07 (ORCPT ); Fri, 30 Nov 2012 11:26:59 -0500 To: =?utf-8?q?Michal_Hocko?= Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_from_add=5Fto=5Fpage=5Fcache=5Flocked?= Date: Fri, 30 Nov 2012 17:26:51 +0100 From: "azurIt" Cc: , , =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= References: <20121126013855.AF118F5E@pobox.sk>, <20121126131837.GC17860@dhcp22.suse.cz>, <20121126132149.GD17860@dhcp22.suse.cz>, <20121130032918.59B3F780@pobox.sk>, <20121130124506.GH29317@dhcp22.suse.cz>, <20121130144427.51A09169@pobox.sk>, <20121130144431.GI29317@dhcp22.suse.cz>, <20121130160811.6BB25BDD@pobox.sk>, <20121130153942.GL29317@dhcp22.suse.cz>,
<20121130165937.F9564EBE@pobox.sk> <20121130161923.GN29317@dhcp22.suse.cz> In-Reply-To: <20121130161923.GN29317@dhcp22.suse.cz> X-Mailer: Centrum Email 5.3 X-Priority: 3 X-Original-From: azurit@pobox.sk MIME-Version: 1.0 Message-Id: <20121130172651.B6917602@pobox.sk> X-Maser: oho Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org >Could you also post your complete containers configuration, maybe there >is something strange in there (basically grep . -r YOUR_CGROUP_MNT >except for tasks files which are of no use right now). Here it is: http://www.watchdog.sk/lkml/cgroups.gz From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758700Ab2K3Qxw (ORCPT ); Fri, 30 Nov 2012 11:53:52 -0500 Received: from cantor2.suse.de ([195.135.220.15]:34700 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758648Ab2K3Qxu (ORCPT ); Fri, 30 Nov 2012 11:53:50 -0500 Date: Fri, 30 Nov 2012 17:53:47 +0100 From: Michal Hocko To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121130165347.GO29317@dhcp22.suse.cz> References: <20121126132149.GD17860@dhcp22.suse.cz> <20121130032918.59B3F780@pobox.sk> <20121130124506.GH29317@dhcp22.suse.cz> <20121130144427.51A09169@pobox.sk> <20121130144431.GI29317@dhcp22.suse.cz> <20121130160811.6BB25BDD@pobox.sk> <20121130153942.GL29317@dhcp22.suse.cz> <20121130165937.F9564EBE@pobox.sk> <20121130161923.GN29317@dhcp22.suse.cz> <20121130172651.B6917602@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121130172651.B6917602@pobox.sk> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org 
List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri 30-11-12 17:26:51, azurIt wrote: > >Could you also post your complete containers configuration, maybe there > >is something strange in there (basically grep . -r YOUR_CGROUP_MNT > >except for tasks files which are of no use right now). > > > Here it is: > http://www.watchdog.sk/lkml/cgroups.gz The only strange thing I noticed is that some groups have 0 limit. Is this intentional? grep memory.limit_in_bytes cgroups | grep -v uid | sed 's@.*/@@' | sort | uniq -c 3 memory.limit_in_bytes:0 254 memory.limit_in_bytes:104857600 107 memory.limit_in_bytes:157286400 68 memory.limit_in_bytes:209715200 10 memory.limit_in_bytes:262144000 28 memory.limit_in_bytes:314572800 1 memory.limit_in_bytes:346030080 1 memory.limit_in_bytes:524288000 2 memory.limit_in_bytes:9223372036854775807 -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1031593Ab2K3UnS (ORCPT ); Fri, 30 Nov 2012 15:43:18 -0500 Received: from gmmr1.centrum.cz ([46.255.225.252]:36995 "EHLO gmmr1.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1030302Ab2K3UnO (ORCPT ); Fri, 30 Nov 2012 15:43:14 -0500 To: =?utf-8?q?Michal_Hocko?= Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_from_add=5Fto=5Fpage=5Fcache=5Flocked?= Date: Fri, 30 Nov 2012 21:43:05 +0100 From: "azurIt" Cc: , , =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= References: <20121126132149.GD17860@dhcp22.suse.cz>, <20121130032918.59B3F780@pobox.sk>, <20121130124506.GH29317@dhcp22.suse.cz>, <20121130144427.51A09169@pobox.sk>, <20121130144431.GI29317@dhcp22.suse.cz>, <20121130160811.6BB25BDD@pobox.sk>, <20121130153942.GL29317@dhcp22.suse.cz>, <20121130165937.F9564EBE@pobox.sk>, <20121130161923.GN29317@dhcp22.suse.cz>, <20121130172651.B6917602@pobox.sk> <20121130165347.GO29317@dhcp22.suse.cz> 
In-Reply-To: <20121130165347.GO29317@dhcp22.suse.cz> X-Mailer: Centrum Email 5.3 X-Priority: 3 X-Original-From: azurit@pobox.sk MIME-Version: 1.0 Message-Id: <20121130214305.6741FF64@pobox.sk> X-Maser: brud Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org >The only strange thing I noticed is that some groups have 0 limit. Is >this intentional? >grep memory.limit_in_bytes cgroups | grep -v uid | sed 's@.*/@@' | sort | uniq -c > 3 memory.limit_in_bytes:0 These are users who are not allowed to run anything. azur From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755013Ab2LCPQH (ORCPT ); Mon, 3 Dec 2012 10:16:07 -0500 Received: from cantor2.suse.de ([195.135.220.15]:35871 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752084Ab2LCPQD (ORCPT ); Mon, 3 Dec 2012 10:16:03 -0500 Date: Mon, 3 Dec 2012 16:16:01 +0100 From: Michal Hocko To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121203151601.GA17093@dhcp22.suse.cz> References: <20121126131837.GC17860@dhcp22.suse.cz> <20121126132149.GD17860@dhcp22.suse.cz> <20121130032918.59B3F780@pobox.sk> <20121130124506.GH29317@dhcp22.suse.cz> <20121130144427.51A09169@pobox.sk> <20121130144431.GI29317@dhcp22.suse.cz> <20121130160811.6BB25BDD@pobox.sk> <20121130153942.GL29317@dhcp22.suse.cz> <20121130165937.F9564EBE@pobox.sk> <20121130161923.GN29317@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121130161923.GN29317@dhcp22.suse.cz> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 
Fri 30-11-12 17:19:23, Michal Hocko wrote: [...] > The important question is why you see VM_FAULT_OOM and whether a memcg > charging failure can trigger that. I do not see how this could happen > right now because __GFP_NORETRY is not used for user pages (except for > THP, which disables memcg OOM already), and file backed page faults (aka > __do_fault) use mem_cgroup_newpage_charge which doesn't disable OOM. > This is a real head scratcher. The following should print the traces when we hand over ENOMEM to the caller. It should catch all charge paths (migration is not covered but that one is not important here). If we don't see any traces from here and there is still global OOM striking then there must be something else to trigger this. Could you test this with the patch which aims at fixing your deadlock, please? I realise that this is a production environment but I do not see anything relevant in the code. --- diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c8425b1..9e5b56b 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2397,6 +2397,7 @@ done: return 0; nomem: *ptr = NULL; + __WARN(); return -ENOMEM; bypass: *ptr = NULL; -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753018Ab2LEBgt (ORCPT ); Tue, 4 Dec 2012 20:36:49 -0500 Received: from gmmr4.centrum.cz ([46.255.227.253]:36842 "EHLO gmmr4.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752000Ab2LEBgr (ORCPT ); Tue, 4 Dec 2012 20:36:47 -0500 To: =?utf-8?q?Michal_Hocko?= Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_from_add=5Fto=5Fpage=5Fcache=5Flocked?= Date: Wed, 05 Dec 2012 02:36:44 +0100 From: "azurIt" Cc: , , =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= References: <20121126131837.GC17860@dhcp22.suse.cz>, <20121126132149.GD17860@dhcp22.suse.cz>, <20121130032918.59B3F780@pobox.sk>,
<20121130124506.GH29317@dhcp22.suse.cz>, <20121130144427.51A09169@pobox.sk>, <20121130144431.GI29317@dhcp22.suse.cz>, <20121130160811.6BB25BDD@pobox.sk>, <20121130153942.GL29317@dhcp22.suse.cz>, <20121130165937.F9564EBE@pobox.sk>, <20121130161923.GN29317@dhcp22.suse.cz> <20121203151601.GA17093@dhcp22.suse.cz> In-Reply-To: <20121203151601.GA17093@dhcp22.suse.cz> X-Mailer: Centrum Email 5.3 X-Priority: 3 X-Original-From: azurit@pobox.sk MIME-Version: 1.0 Message-Id: <20121205023644.18C3006B@pobox.sk> X-Maser: Georgo Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org >The following should print the traces when we hand over ENOMEM to the >caller. It should catch all charge paths (migration is not covered but >that one is not important here). If we don't see any traces from here >and there is still global OOM striking then there must be something else >to trigger this. >Could you test this with the patch which aims at fixing your deadlock, >please? I realise that this is a production environment but I do not see >anything relevant in the code. 
Michal, i think/hope this is what you wanted: http://www.watchdog.sk/lkml/oom_mysqld2 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753007Ab2LEOR0 (ORCPT ); Wed, 5 Dec 2012 09:17:26 -0500 Received: from cantor2.suse.de ([195.135.220.15]:43310 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751433Ab2LEORZ (ORCPT ); Wed, 5 Dec 2012 09:17:25 -0500 Date: Wed, 5 Dec 2012 15:17:22 +0100 From: Michal Hocko To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121205141722.GA9714@dhcp22.suse.cz> References: <20121130032918.59B3F780@pobox.sk> <20121130124506.GH29317@dhcp22.suse.cz> <20121130144427.51A09169@pobox.sk> <20121130144431.GI29317@dhcp22.suse.cz> <20121130160811.6BB25BDD@pobox.sk> <20121130153942.GL29317@dhcp22.suse.cz> <20121130165937.F9564EBE@pobox.sk> <20121130161923.GN29317@dhcp22.suse.cz> <20121203151601.GA17093@dhcp22.suse.cz> <20121205023644.18C3006B@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121205023644.18C3006B@pobox.sk> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed 05-12-12 02:36:44, azurIt wrote: > >The following should print the traces when we hand over ENOMEM to the > >caller. It should catch all charge paths (migration is not covered but > >that one is not important here). If we don't see any traces from here > >and there is still global OOM striking then there must be something else > >to trigger this. > >Could you test this with the patch which aims at fixing your deadlock, > >please? I realise that this is a production environment but I do not see > >anything relevant in the code. 
> > > Michal, > > i think/hope this is what you wanted: > http://www.watchdog.sk/lkml/oom_mysqld2 Dec 5 02:20:48 server01 kernel: [ 380.995947] WARNING: at mm/memcontrol.c:2400 T.1146+0x2c1/0x5d0() Dec 5 02:20:48 server01 kernel: [ 380.995950] Hardware name: S5000VSA Dec 5 02:20:48 server01 kernel: [ 380.995952] Pid: 5351, comm: apache2 Not tainted 3.2.34-grsec #1 Dec 5 02:20:48 server01 kernel: [ 380.995954] Call Trace: Dec 5 02:20:48 server01 kernel: [ 380.995960] [] warn_slowpath_common+0x7a/0xb0 Dec 5 02:20:48 server01 kernel: [ 380.995963] [] warn_slowpath_null+0x1a/0x20 Dec 5 02:20:48 server01 kernel: [ 380.995965] [] T.1146+0x2c1/0x5d0 Dec 5 02:20:48 server01 kernel: [ 380.995967] [] mem_cgroup_charge_common+0x53/0x90 Dec 5 02:20:48 server01 kernel: [ 380.995970] [] mem_cgroup_newpage_charge+0x45/0x50 Dec 5 02:20:48 server01 kernel: [ 380.995974] [] handle_pte_fault+0x609/0x940 Dec 5 02:20:48 server01 kernel: [ 380.995978] [] ? pte_alloc_one+0x3f/0x50 Dec 5 02:20:48 server01 kernel: [ 380.995981] [] handle_mm_fault+0x138/0x260 Dec 5 02:20:48 server01 kernel: [ 380.995983] [] do_page_fault+0x13d/0x460 Dec 5 02:20:48 server01 kernel: [ 380.995986] [] ? do_mmap_pgoff+0x3dc/0x430 Dec 5 02:20:48 server01 kernel: [ 380.995988] [] ? remove_vma+0x5d/0x80 Dec 5 02:20:48 server01 kernel: [ 380.995992] [] page_fault+0x1f/0x30 Dec 5 02:20:48 server01 kernel: [ 380.995994] ---[ end trace 25bbb3e634c25b7f ]--- Dec 5 02:20:48 server01 kernel: [ 380.996373] apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0 Dec 5 02:20:48 server01 kernel: [ 380.996377] apache2 cpuset=uid mems_allowed=0 Dec 5 02:20:48 server01 kernel: [ 380.996379] Pid: 5351, comm: apache2 Tainted: G W 3.2.34-grsec #1 Dec 5 02:20:48 server01 kernel: [ 380.996380] Call Trace: Dec 5 02:20:48 server01 kernel: [ 380.996384] [] dump_header+0x7e/0x1e0 Dec 5 02:20:48 server01 kernel: [ 380.996387] [] ? 
find_lock_task_mm+0x2f/0x70 Dec 5 02:20:48 server01 kernel: [ 380.996389] [] oom_kill_process+0x85/0x2a0 Dec 5 02:20:48 server01 kernel: [ 380.996392] [] out_of_memory+0xe5/0x200 Dec 5 02:20:48 server01 kernel: [ 380.996394] [] ? pte_alloc_one+0x3f/0x50 Dec 5 02:20:48 server01 kernel: [ 380.996397] [] pagefault_out_of_memory+0xbd/0x110 Dec 5 02:20:48 server01 kernel: [ 380.996399] [] mm_fault_error+0xb6/0x1a0 Dec 5 02:20:48 server01 kernel: [ 380.996401] [] do_page_fault+0x3ee/0x460 Dec 5 02:20:48 server01 kernel: [ 380.996403] [] ? do_mmap_pgoff+0x3dc/0x430 Dec 5 02:20:48 server01 kernel: [ 380.996405] [] ? remove_vma+0x5d/0x80 Dec 5 02:20:48 server01 kernel: [ 380.996408] [] page_fault+0x1f/0x30 OK, so the ENOMEM seems to be leaking from mem_cgroup_newpage_charge. This can only happen if this was an atomic allocation request (!__GFP_WAIT) or if oom is not allowed, which is the case only for transparent huge page allocation. The first case can be excluded (in the clean 3.2 stable kernel) because all callers of mem_cgroup_newpage_charge use GFP_KERNEL. The latter one should be OK because the page fault should fall back to a regular page if THP allocation/charge fails. [/me goes to double check] Hmm, do_huge_pmd_wp_page seems to charge a huge page and fails with VM_FAULT_OOM without any fallback. We should call do_huge_pmd_wp_page_fallback instead. This has been fixed in 3.5-rc1 by 1f1d06c3 (thp, memcg: split hugepage for memcg oom on cow) but it hasn't been backported to 3.2. The patch applies to 3.2 without any further modifications. I didn't have time to test it but if it helps you we should push this to the stable tree. --- >>From 765f5e0121c4410faa19c088e9ada75976bde178 Mon Sep 17 00:00:00 2001 From: David Rientjes Date: Tue, 29 May 2012 15:06:23 -0700 Subject: [PATCH] thp, memcg: split hugepage for memcg oom on cow On COW, a new hugepage is allocated and charged to the memcg.
If the system is oom or the charge to the memcg fails, however, the fault
handler will return VM_FAULT_OOM which results in an oom kill.

Instead, it's possible to fall back to splitting the hugepage so that the
COW results only in an order-0 page being allocated and charged to the
memcg which has a higher likelihood to succeed. This is expensive because
the hugepage must be split in the page fault handler, but it is much
better than unnecessarily oom killing a process.

Signed-off-by: David Rientjes
Cc: Andrea Arcangeli
Cc: Johannes Weiner
Acked-by: KAMEZAWA Hiroyuki
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
(cherry picked from commit 1f1d06c34f7675026326cd9f39ff91e4555cf355)
---
 mm/huge_memory.c | 3 +++
 mm/memory.c | 18 +++++++++++++++---
 2 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8f005e9..470cbb4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -921,6 +921,8 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		count_vm_event(THP_FAULT_FALLBACK);
 		ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
 						   pmd, orig_pmd, page, haddr);
+		if (ret & VM_FAULT_OOM)
+			split_huge_page(page);
 		put_page(page);
 		goto out;
 	}
@@ -928,6 +930,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,

 	if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
 		put_page(new_page);
+		split_huge_page(page);
 		put_page(page);
 		ret |= VM_FAULT_OOM;
 		goto out;
diff --git a/mm/memory.c b/mm/memory.c
index 70f5daf..15e686a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3469,6 +3469,7 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (unlikely(is_vm_hugetlb_page(vma)))
 		return hugetlb_fault(mm, vma, address, flags);

+retry:
 	pgd = pgd_offset(mm, address);
 	pud = pud_alloc(mm, pgd, address);
 	if (!pud)
@@ -3482,13 +3483,24 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 							  pmd, flags);
 	} else {
 		pmd_t orig_pmd = *pmd;
+		int ret;
+
 		barrier();
 		if (pmd_trans_huge(orig_pmd)) {
 			if (flags & FAULT_FLAG_WRITE &&
 			    !pmd_write(orig_pmd) &&
-			    !pmd_trans_splitting(orig_pmd))
-				return do_huge_pmd_wp_page(mm, vma, address,
-							   pmd, orig_pmd);
+			    !pmd_trans_splitting(orig_pmd)) {
+				ret = do_huge_pmd_wp_page(mm, vma, address, pmd,
+							  orig_pmd);
+				/*
+				 * If COW results in an oom, the huge pmd will
+				 * have been split, so retry the fault on the
+				 * pte for a smaller charge.
+				 */
+				if (unlikely(ret & VM_FAULT_OOM))
+					goto retry;
+				return ret;
+			}
 			return 0;
 		}
 	}
-- 
1.7.10.4

-- 
Michal Hocko
SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753611Ab2LFA33 (ORCPT ); Wed, 5 Dec 2012 19:29:29 -0500
Received: from gmmr3.centrum.cz ([46.255.225.251]:57562 "EHLO gmmr3.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752183Ab2LFA31 (ORCPT ); Wed, 5 Dec 2012 19:29:27 -0500
To: =?utf-8?q?Michal_Hocko?=
Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_from_add=5Fto=5Fpage=5Fcache=5Flocked?=
Date: Thu, 06 Dec 2012 01:29:24 +0100
From: "azurIt"
Cc: , , =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?=
References: <20121130032918.59B3F780@pobox.sk>, <20121130124506.GH29317@dhcp22.suse.cz>, <20121130144427.51A09169@pobox.sk>, <20121130144431.GI29317@dhcp22.suse.cz>, <20121130160811.6BB25BDD@pobox.sk>, <20121130153942.GL29317@dhcp22.suse.cz>, <20121130165937.F9564EBE@pobox.sk>, <20121130161923.GN29317@dhcp22.suse.cz>, <20121203151601.GA17093@dhcp22.suse.cz>, <20121205023644.18C3006B@pobox.sk> <20121205141722.GA9714@dhcp22.suse.cz>
In-Reply-To: <20121205141722.GA9714@dhcp22.suse.cz>
X-Mailer: Centrum Email 5.3
X-Priority: 3
X-Original-From: azurit@pobox.sk
MIME-Version: 1.0
Message-Id: <20121206012924.FE077FD7@pobox.sk>
X-Maser: oho
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Sender:
linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org >OK, so the ENOMEM seems to be leaking from mem_cgroup_newpage_charge. >This can only happen if this was an atomic allocation request >(!__GFP_WAIT) or if oom is not allowed which is the case only for >transparent huge page allocation. >The first case can be excluded (in the clean 3.2 stable kernel) because >all callers of mem_cgroup_newpage_charge use GFP_KERNEL. The later one >should be OK because the page fault should fallback to a regular page if >THP allocation/charge fails. >[/me goes to double check] >Hmm do_huge_pmd_wp_page seems to charge a huge page and fails with >VM_FAULT_OOM without any fallback. We should do_huge_pmd_wp_page_fallback >instead. This has been fixed in 3.5-rc1 by 1f1d06c3 (thp, memcg: split >hugepage for memcg oom on cow) but it hasn't been backported to 3.2. The >patch applies to 3.2 without any further modifications. I didn't have >time to test it but if it helps you we should push this to the stable >tree. 
This, unfortunately, didn't fix the problem :( http://www.watchdog.sk/lkml/oom_mysqld3 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S965008Ab2LFJy2 (ORCPT ); Thu, 6 Dec 2012 04:54:28 -0500 Received: from cantor2.suse.de ([195.135.220.15]:49889 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932322Ab2LFJyZ (ORCPT ); Thu, 6 Dec 2012 04:54:25 -0500 Date: Thu, 6 Dec 2012 10:54:23 +0100 From: Michal Hocko To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121206095423.GB10931@dhcp22.suse.cz> References: <20121130144427.51A09169@pobox.sk> <20121130144431.GI29317@dhcp22.suse.cz> <20121130160811.6BB25BDD@pobox.sk> <20121130153942.GL29317@dhcp22.suse.cz> <20121130165937.F9564EBE@pobox.sk> <20121130161923.GN29317@dhcp22.suse.cz> <20121203151601.GA17093@dhcp22.suse.cz> <20121205023644.18C3006B@pobox.sk> <20121205141722.GA9714@dhcp22.suse.cz> <20121206012924.FE077FD7@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121206012924.FE077FD7@pobox.sk> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu 06-12-12 01:29:24, azurIt wrote: > >OK, so the ENOMEM seems to be leaking from mem_cgroup_newpage_charge. > >This can only happen if this was an atomic allocation request > >(!__GFP_WAIT) or if oom is not allowed which is the case only for > >transparent huge page allocation. > >The first case can be excluded (in the clean 3.2 stable kernel) because > >all callers of mem_cgroup_newpage_charge use GFP_KERNEL. The later one > >should be OK because the page fault should fallback to a regular page if > >THP allocation/charge fails. 
> >[/me goes to double check] > >Hmm do_huge_pmd_wp_page seems to charge a huge page and fails with > >VM_FAULT_OOM without any fallback. We should do_huge_pmd_wp_page_fallback > >instead. This has been fixed in 3.5-rc1 by 1f1d06c3 (thp, memcg: split > >hugepage for memcg oom on cow) but it hasn't been backported to 3.2. The > >patch applies to 3.2 without any further modifications. I didn't have > >time to test it but if it helps you we should push this to the stable > >tree. > > > This, unfortunately, didn't fix the problem :( > http://www.watchdog.sk/lkml/oom_mysqld3 Dohh. The very same stack mem_cgroup_newpage_charge called from the page fault. The heavy inlining is not particularly helping here... So there must be some other THP charge leaking out. [/me is diving into the code again] * do_huge_pmd_anonymous_page falls back to handle_pte_fault * do_huge_pmd_wp_page_fallback falls back to simple pages so it doesn't charge the huge page * do_huge_pmd_wp_page splits the huge page and retries with fallback to handle_pte_fault * collapse_huge_page is not called in the page fault path * do_wp_page, do_anonymous_page and __do_fault operate on a single page so the memcg charging cannot return ENOMEM There are no other callers AFAICS so I am getting clueless. Maybe more debugging will tell us something (the inlining has been reduced for thp paths which can reduce performance in thp page fault heavy workloads but this will give us better traces - I hope). Anyway do you see the same problem if transparent huge pages are disabled? 
(echo never > /sys/kernel/mm/transparent_hugepage/enabled)
---
>>From 93a30140b50d8474a047b91c698f4880149635db Mon Sep 17 00:00:00 2001
From: Michal Hocko
Date: Thu, 6 Dec 2012 10:40:17 +0100
Subject: [PATCH] more debugging

---
 mm/huge_memory.c | 6 +++---
 mm/memcontrol.c | 2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 470cbb4..01a11f1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -671,7 +671,7 @@ static inline struct page *alloc_hugepage(int defrag)
 }
 #endif

-int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
+noinline int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			       unsigned long address, pmd_t *pmd,
 			       unsigned int flags)
 {
@@ -790,7 +790,7 @@ pgtable_t get_pmd_huge_pte(struct mm_struct *mm)
 	return pgtable;
 }

-static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
+static noinline int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 					struct vm_area_struct *vma,
 					unsigned long address,
 					pmd_t *pmd, pmd_t orig_pmd,
@@ -883,7 +883,7 @@ out_free_pages:
 	goto out;
 }

-int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
+noinline int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, pmd_t *pmd, pmd_t orig_pmd)
 {
 	int ret = 0;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9e5b56b..1986c65 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2397,7 +2397,7 @@ done:
 	return 0;
 nomem:
 	*ptr = NULL;
-	__WARN();
+	__WARN_printf("gfp_mask:%u nr_pages:%u oom:%d ret:%d\n", gfp_mask, nr_pages, oom, ret);
 	return -ENOMEM;
 bypass:
 	*ptr = NULL;
-- 
1.7.10.4

-- 
Michal Hocko
SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755204Ab2LFRHF (ORCPT ); Thu, 6 Dec 2012 12:07:05 -0500
Received: from cantor2.suse.de ([195.135.220.15]:39600 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
with ESMTP id S1751146Ab2LFRHA (ORCPT ); Thu, 6 Dec 2012 12:07:00 -0500 Date: Thu, 6 Dec 2012 18:06:58 +0100 From: Michal Hocko To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121206170658.GD10931@dhcp22.suse.cz> References: <20121130160811.6BB25BDD@pobox.sk> <20121130153942.GL29317@dhcp22.suse.cz> <20121130165937.F9564EBE@pobox.sk> <20121130161923.GN29317@dhcp22.suse.cz> <20121203151601.GA17093@dhcp22.suse.cz> <20121205023644.18C3006B@pobox.sk> <20121205141722.GA9714@dhcp22.suse.cz> <20121206012924.FE077FD7@pobox.sk> <20121206095423.GB10931@dhcp22.suse.cz> <20121206111249.58F013EA@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121206111249.58F013EA@pobox.sk> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu 06-12-12 11:12:49, azurIt wrote: > >Dohh. The very same stack mem_cgroup_newpage_charge called from the page > >fault. The heavy inlining is not particularly helping here... So there > >must be some other THP charge leaking out. > >[/me is diving into the code again] > > > >* do_huge_pmd_anonymous_page falls back to handle_pte_fault > >* do_huge_pmd_wp_page_fallback falls back to simple pages so it doesn't > > charge the huge page > >* do_huge_pmd_wp_page splits the huge page and retries with fallback to > > handle_pte_fault > >* collapse_huge_page is not called in the page fault path > >* do_wp_page, do_anonymous_page and __do_fault operate on a single page > > so the memcg charging cannot return ENOMEM > > > >There are no other callers AFAICS so I am getting clueless. 
Maybe more > >debugging will tell us something (the inlining has been reduced for thp > >paths which can reduce performance in thp page fault heavy workloads but > >this will give us better traces - I hope). > > > Should i apply all patches togather? (fix for this bug, more log > messages, backported fix from 3.5 and this new one) Yes please -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751658Ab2LJBUn (ORCPT ); Sun, 9 Dec 2012 20:20:43 -0500 Received: from gmmr4.centrum.cz ([46.255.227.253]:37883 "EHLO gmmr4.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751430Ab2LJBUm (ORCPT ); Sun, 9 Dec 2012 20:20:42 -0500 To: =?utf-8?q?Michal_Hocko?= Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_from_add=5Fto=5Fpage=5Fcache=5Flocked?= Date: Mon, 10 Dec 2012 02:20:38 +0100 From: "azurIt" Cc: , , =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= References: <20121130144427.51A09169@pobox.sk>, <20121130144431.GI29317@dhcp22.suse.cz>, <20121130160811.6BB25BDD@pobox.sk>, <20121130153942.GL29317@dhcp22.suse.cz>, <20121130165937.F9564EBE@pobox.sk>, <20121130161923.GN29317@dhcp22.suse.cz>, <20121203151601.GA17093@dhcp22.suse.cz>, <20121205023644.18C3006B@pobox.sk>, <20121205141722.GA9714@dhcp22.suse.cz>, <20121206012924.FE077FD7@pobox.sk> <20121206095423.GB10931@dhcp22.suse.cz> In-Reply-To: <20121206095423.GB10931@dhcp22.suse.cz> X-Mailer: Centrum Email 5.3 X-Priority: 3 X-Original-From: azurit@pobox.sk MIME-Version: 1.0 Message-Id: <20121210022038.E6570D37@pobox.sk> X-Maser: brud Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org >There are no other callers AFAICS so I am getting clueless. 
Maybe more >debugging will tell us something (the inlining has been reduced for thp >paths which can reduce performance in thp page fault heavy workloads but >this will give us better traces - I hope). Michal, this was printing so many debug messages to console that the whole server hangs and i had to hard reset it after several minutes :( Sorry but i cannot test such a things in production. There's no problem with one soft reset which takes 4 minutes but this hard reset creates about 20 minutes outage (mainly cos of disk quotas checking). Last logged message: Dec 10 02:03:29 server01 kernel: [ 220.366486] grsec: From 141.105.120.152: bruteforce prevention initiated for the next 30 minutes or until service restarted, stalling each fork 30 seconds. Please investigate the crash report for /usr/lib/apache2/mpm-itk/apache2[apache2:3586] uid/euid:1258/1258 gid/egid:100/100, parent /usr/lib/apache2/mpm-itk/apache2[apache2:2142] uid/euid:0/0 gid/egid:0/0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753845Ab2LJJnm (ORCPT ); Mon, 10 Dec 2012 04:43:42 -0500 Received: from cantor2.suse.de ([195.135.220.15]:43150 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752281Ab2LJJnk (ORCPT ); Mon, 10 Dec 2012 04:43:40 -0500 Date: Mon, 10 Dec 2012 10:43:38 +0100 From: Michal Hocko To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121210094318.GA6777@dhcp22.suse.cz> References: <20121130160811.6BB25BDD@pobox.sk> <20121130153942.GL29317@dhcp22.suse.cz> <20121130165937.F9564EBE@pobox.sk> <20121130161923.GN29317@dhcp22.suse.cz> <20121203151601.GA17093@dhcp22.suse.cz> <20121205023644.18C3006B@pobox.sk> <20121205141722.GA9714@dhcp22.suse.cz> <20121206012924.FE077FD7@pobox.sk> 
<20121206095423.GB10931@dhcp22.suse.cz> <20121210022038.E6570D37@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121210022038.E6570D37@pobox.sk> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon 10-12-12 02:20:38, azurIt wrote: [...] > Michal, Hi, > this was printing so many debug messages to console that the whole > server hangs Hmm, this is _really_ surprising. The latest patch didn't add any new logging actually. It just enahanced messages which were already printed out previously + changed few functions to be not inlined so they show up in the traces. So the only explanation is that the workload has changed or the patches got misapplied. > and i had to hard reset it after several minutes :( Sorry > but i cannot test such a things in production. There's no problem with > one soft reset which takes 4 minutes but this hard reset creates about > 20 minutes outage (mainly cos of disk quotas checking). Understood. > Last logged message: > > Dec 10 02:03:29 server01 kernel: [ 220.366486] grsec: From 141.105.120.152: bruteforce prevention initiated for the next 30 minutes or until service restarted, stalling each fork 30 seconds. Please investigate the crash report for /usr/lib/apache2/mpm-itk/apache2[apache2:3586] uid/euid:1258/1258 gid/egid:100/100, parent /usr/lib/apache2/mpm-itk/apache2[apache2:2142] uid/euid:0/0 gid/egid:0/0 This explains why you have seen your machine hung. I am not familiar with grsec but stalling each fork 30s sounds really bad. Anyway this will not help me much. Do you happen to still have any of those logged traces from the last run? Apart from that. If my current understanding is correct then this is related to transparent huge pages (and leaking charge to the page fault handler). Do you see the same problem if you disable THP before you start your workload? 
(echo never > /sys/kernel/mm/transparent_hugepage/enabled) -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754067Ab2LJKSW (ORCPT ); Mon, 10 Dec 2012 05:18:22 -0500 Received: from gmmr2.centrum.cz ([46.255.227.252]:56012 "EHLO gmmr2.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753677Ab2LJKSV (ORCPT ); Mon, 10 Dec 2012 05:18:21 -0500 To: =?utf-8?q?Michal_Hocko?= Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_from_add=5Fto=5Fpage=5Fcache=5Flocked?= Date: Mon, 10 Dec 2012 11:18:17 +0100 From: "azurIt" Cc: , , =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= References: <20121130160811.6BB25BDD@pobox.sk>, <20121130153942.GL29317@dhcp22.suse.cz>, <20121130165937.F9564EBE@pobox.sk>, <20121130161923.GN29317@dhcp22.suse.cz>, <20121203151601.GA17093@dhcp22.suse.cz>, <20121205023644.18C3006B@pobox.sk>, <20121205141722.GA9714@dhcp22.suse.cz>, <20121206012924.FE077FD7@pobox.sk>, <20121206095423.GB10931@dhcp22.suse.cz>, <20121210022038.E6570D37@pobox.sk> <20121210094318.GA6777@dhcp22.suse.cz> In-Reply-To: <20121210094318.GA6777@dhcp22.suse.cz> X-Mailer: Centrum Email 5.3 X-Priority: 3 X-Original-From: azurit@pobox.sk MIME-Version: 1.0 Message-Id: <20121210111817.F697F53E@pobox.sk> X-Maser: oho Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org >Hmm, this is _really_ surprising. The latest patch didn't add any new >logging actually. It just enahanced messages which were already printed >out previously + changed few functions to be not inlined so they show up >in the traces. So the only explanation is that the workload has changed >or the patches got misapplied. This time i installed 3.2.35, maybe some changes between .34 and .35 did this? Should i try .34? 
>> Dec 10 02:03:29 server01 kernel: [ 220.366486] grsec: From 141.105.120.152: bruteforce prevention initiated for the next 30 minutes or until service restarted, stalling each fork 30 seconds. Please investigate the crash report for /usr/lib/apache2/mpm-itk/apache2[apache2:3586] uid/euid:1258/1258 gid/egid:100/100, parent /usr/lib/apache2/mpm-itk/apache2[apache2:2142] uid/euid:0/0 gid/egid:0/0 > >This explains why you have seen your machine hung. I am not familiar >with grsec but stalling each fork 30s sounds really bad. Btw, i never ever saw such a message from grsecurity yet. Will write to grsec mailing list about explanation. >Anyway this will not help me much. Do you happen to still have any of >those logged traces from the last run? Unfortunately not, it didn't log anything and tons of messages were printed only to console (i was logged via IP-KVM). It looked that printing is infinite, i rebooted it after few minutes. >Apart from that. If my current understanding is correct then this is >related to transparent huge pages (and leaking charge to the page fault >handler). Do you see the same problem if you disable THP before you >start your workload? (echo never > /sys/kernel/mm/transparent_hugepage/enabled) # cat /sys/kernel/mm/transparent_hugepage/enabled cat: /sys/kernel/mm/transparent_hugepage/enabled: No such file or directory # ls -la /sys/kernel/mm total 0 drwx------ 3 root root 0 Dec 10 11:11 . drwx------ 5 root root 0 Dec 10 02:06 .. 
drwx------ 2 root root 0 Dec 10 11:11 cleancache From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754182Ab2LJPwJ (ORCPT ); Mon, 10 Dec 2012 10:52:09 -0500 Received: from cantor2.suse.de ([195.135.220.15]:60169 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752179Ab2LJPwH (ORCPT ); Mon, 10 Dec 2012 10:52:07 -0500 Date: Mon, 10 Dec 2012 16:52:05 +0100 From: Michal Hocko To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121210155205.GB6777@dhcp22.suse.cz> References: <20121130165937.F9564EBE@pobox.sk> <20121130161923.GN29317@dhcp22.suse.cz> <20121203151601.GA17093@dhcp22.suse.cz> <20121205023644.18C3006B@pobox.sk> <20121205141722.GA9714@dhcp22.suse.cz> <20121206012924.FE077FD7@pobox.sk> <20121206095423.GB10931@dhcp22.suse.cz> <20121210022038.E6570D37@pobox.sk> <20121210094318.GA6777@dhcp22.suse.cz> <20121210111817.F697F53E@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121210111817.F697F53E@pobox.sk> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon 10-12-12 11:18:17, azurIt wrote: > >Hmm, this is _really_ surprising. The latest patch didn't add any new > >logging actually. It just enahanced messages which were already printed > >out previously + changed few functions to be not inlined so they show up > >in the traces. So the only explanation is that the workload has changed > >or the patches got misapplied. > > > This time i installed 3.2.35, maybe some changes between .34 and .35 > did this? Should i try .34? I would try to limit changes to minimum. 
So the original kernel you were using + the first patch to prevent OOM from the write path + 2 debugging patches. > >> Dec 10 02:03:29 server01 kernel: [ 220.366486] grsec: From 141.105.120.152: bruteforce prevention initiated for the next 30 minutes or until service restarted, stalling each fork 30 seconds. Please investigate the crash report for /usr/lib/apache2/mpm-itk/apache2[apache2:3586] uid/euid:1258/1258 gid/egid:100/100, parent /usr/lib/apache2/mpm-itk/apache2[apache2:2142] uid/euid:0/0 gid/egid:0/0 > > > >This explains why you have seen your machine hung. I am not familiar > >with grsec but stalling each fork 30s sounds really bad. > > > Btw, i never ever saw such a message from grsecurity yet. Will write to grsec mailing list about explanation. > > > >Anyway this will not help me much. Do you happen to still have any of > >those logged traces from the last run? > > > Unfortunately not, it didn't log anything and tons of messages were > printed only to console (i was logged via IP-KVM). It looked that > printing is infinite, i rebooted it after few minutes. But was it at least related to the debugging from the patch or it was rather a totally unrelated thing? > >Apart from that. If my current understanding is correct then this is > >related to transparent huge pages (and leaking charge to the page fault > >handler). Do you see the same problem if you disable THP before you > >start your workload? (echo never > /sys/kernel/mm/transparent_hugepage/enabled) > > # cat /sys/kernel/mm/transparent_hugepage/enabled > cat: /sys/kernel/mm/transparent_hugepage/enabled: No such file or directory Weee. Then it cannot be related to THP at all. Which makes this even bigger mystery. We really need to find out who is leaking that charge. 
-- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751791Ab2LJRS7 (ORCPT ); Mon, 10 Dec 2012 12:18:59 -0500 Received: from gmmr8.centrum.cz ([46.255.227.254]:53481 "EHLO gmmr8.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750818Ab2LJRS6 (ORCPT ); Mon, 10 Dec 2012 12:18:58 -0500 To: =?utf-8?q?Michal_Hocko?= Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_from_add=5Fto=5Fpage=5Fcache=5Flocked?= Date: Mon, 10 Dec 2012 18:18:54 +0100 From: "azurIt" Cc: , , =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= References: <20121130165937.F9564EBE@pobox.sk>, <20121130161923.GN29317@dhcp22.suse.cz>, <20121203151601.GA17093@dhcp22.suse.cz>, <20121205023644.18C3006B@pobox.sk>, <20121205141722.GA9714@dhcp22.suse.cz>, <20121206012924.FE077FD7@pobox.sk>, <20121206095423.GB10931@dhcp22.suse.cz>, <20121210022038.E6570D37@pobox.sk>, <20121210094318.GA6777@dhcp22.suse.cz>, <20121210111817.F697F53E@pobox.sk> <20121210155205.GB6777@dhcp22.suse.cz> In-Reply-To: <20121210155205.GB6777@dhcp22.suse.cz> X-Mailer: Centrum Email 5.3 X-Priority: 3 X-Original-From: azurit@pobox.sk MIME-Version: 1.0 Message-Id: <20121210181854.5BE82C77@pobox.sk> X-Maser: Georgo Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org >I would try to limit changes to minimum. So the original kernel you were >using + the first patch to prevent OOM from the write path + 2 debugging >patches. ok. >But was it at least related to the debugging from the patch or it was >rather a totally unrelated thing? I wasn't reading it much but i think it looks like a traces i was sending you before. 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752199Ab2LQBee (ORCPT ); Sun, 16 Dec 2012 20:34:34 -0500 Received: from gmmr3.centrum.cz ([46.255.225.251]:44425 "EHLO gmmr3.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751397Ab2LQBec (ORCPT ); Sun, 16 Dec 2012 20:34:32 -0500 To: =?utf-8?q?Michal_Hocko?= Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_from_add=5Fto=5Fpage=5Fcache=5Flocked?= Date: Mon, 17 Dec 2012 02:34:30 +0100 From: "azurIt" Cc: , , =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= References: <20121130165937.F9564EBE@pobox.sk>, <20121130161923.GN29317@dhcp22.suse.cz>, <20121203151601.GA17093@dhcp22.suse.cz>, <20121205023644.18C3006B@pobox.sk>, <20121205141722.GA9714@dhcp22.suse.cz>, <20121206012924.FE077FD7@pobox.sk>, <20121206095423.GB10931@dhcp22.suse.cz>, <20121210022038.E6570D37@pobox.sk>, <20121210094318.GA6777@dhcp22.suse.cz>, <20121210111817.F697F53E@pobox.sk> <20121210155205.GB6777@dhcp22.suse.cz> In-Reply-To: <20121210155205.GB6777@dhcp22.suse.cz> X-Mailer: Centrum Email 5.3 X-Priority: 3 X-Original-From: azurit@pobox.sk MIME-Version: 1.0 Message-Id: <20121217023430.5A390FD7@pobox.sk> X-Maser: oho Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org >I would try to limit changes to minimum. So the original kernel you were >using + the first patch to prevent OOM from the write path + 2 debugging >patches. 
It didn't take off the whole system this time (but i was prepared to record a video of console ;) ), here it is: http://www.watchdog.sk/lkml/oom_mysqld4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754535Ab2LQQgd (ORCPT ); Mon, 17 Dec 2012 11:36:33 -0500 Received: from cantor2.suse.de ([195.135.220.15]:36911 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753746Ab2LQQcG (ORCPT ); Mon, 17 Dec 2012 11:32:06 -0500 Date: Mon, 17 Dec 2012 17:32:03 +0100 From: Michal Hocko To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121217163203.GD25432@dhcp22.suse.cz> References: <20121203151601.GA17093@dhcp22.suse.cz> <20121205023644.18C3006B@pobox.sk> <20121205141722.GA9714@dhcp22.suse.cz> <20121206012924.FE077FD7@pobox.sk> <20121206095423.GB10931@dhcp22.suse.cz> <20121210022038.E6570D37@pobox.sk> <20121210094318.GA6777@dhcp22.suse.cz> <20121210111817.F697F53E@pobox.sk> <20121210155205.GB6777@dhcp22.suse.cz> <20121217023430.5A390FD7@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121217023430.5A390FD7@pobox.sk> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon 17-12-12 02:34:30, azurIt wrote: > >I would try to limit changes to minimum. So the original kernel you were > >using + the first patch to prevent OOM from the write path + 2 debugging > >patches. > > > It didn't take off the whole system this time (but i was > prepared to record a video of console ;) ), here it is: > http://www.watchdog.sk/lkml/oom_mysqld4 [...] 
[ 1248.059429] ------------[ cut here ]------------
[ 1248.059586] WARNING: at mm/memcontrol.c:2400 T.1146+0x2d9/0x610()
[ 1248.059723] Hardware name: S5000VSA
[ 1248.059855] gfp_mask:208 nr_pages:1 oom:0 ret:2

This is a GFP_KERNEL allocation, which is expected. It is also a simple
page, which is less expected because we shouldn't return ENOMEM on those
unless this was a GFP_ATOMIC allocation (which it wasn't) or the caller
told us not to trigger OOM, which is the case only for THP pages (see
mem_cgroup_charge_common). So the big question is how we ended up with
oom=false here...

[Ohh, I am really an idiot. I screwed up the first patch]
-	bool oom = true;
+	bool oom = !(gfp_mask | GFP_MEMCG_NO_OOM);

Which obviously doesn't work. It should read !(gfp_mask & GFP_MEMCG_NO_OOM).
No idea how I could have missed that. I am really sorry about that.
---
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c04676d..1f35a74 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2704,7 +2704,7 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
 	struct mem_cgroup *memcg = NULL;
 	unsigned int nr_pages = 1;
 	struct page_cgroup *pc;
-	bool oom = !(gfp_mask | GFP_MEMCG_NO_OOM);
+	bool oom = !(gfp_mask & GFP_MEMCG_NO_OOM);
 	int ret;

 	if (PageTransHuge(page)) {
-- 
Michal Hocko
SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753801Ab2LQSXL (ORCPT ); Mon, 17 Dec 2012 13:23:11 -0500
Received: from gmmr1.centrum.cz ([46.255.225.252]:35944 "EHLO gmmr1.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753747Ab2LQSXF (ORCPT ); Mon, 17 Dec 2012 13:23:05 -0500
To: =?utf-8?q?Michal_Hocko?=
Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_from_add=5Fto=5Fpage=5Fcache=5Flocked?=
Date: Mon, 17 Dec 2012 19:23:01 +0100
From: "azurIt"
Cc: , , =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?=
References: <20121203151601.GA17093@dhcp22.suse.cz>, <20121205023644.18C3006B@pobox.sk>, <20121205141722.GA9714@dhcp22.suse.cz>, <20121206012924.FE077FD7@pobox.sk>, <20121206095423.GB10931@dhcp22.suse.cz>, <20121210022038.E6570D37@pobox.sk>, <20121210094318.GA6777@dhcp22.suse.cz>, <20121210111817.F697F53E@pobox.sk>, <20121210155205.GB6777@dhcp22.suse.cz>, <20121217023430.5A390FD7@pobox.sk> <20121217163203.GD25432@dhcp22.suse.cz> In-Reply-To: <20121217163203.GD25432@dhcp22.suse.cz> X-Mailer: Centrum Email 5.3 X-Priority: 3 X-Original-From: azurit@pobox.sk MIME-Version: 1.0 Message-Id: <20121217192301.829A7020@pobox.sk> X-Maser: brud Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org >[Ohh, I am really an idiot. I screwed the first patch] >- bool oom = true; >+ bool oom = !(gfp_mask | GFP_MEMCG_NO_OOM); > >Which obviously doesn't work. It should read !(gfp_mask &GFP_MEMCG_NO_OOM). > No idea how I could have missed that. I am really sorry about that. :D no problem :) so, now it should really work as expected and completely fix my original problem? is it safe to apply it on 3.2.35? Thank you very much! 
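For reference, a minimal sketch of why the `|` form could never enable the OOM path. The flag value `DEMO_GFP_MEMCG_NO_OOM` and the function names are made up for illustration and are not taken from the patch; 208 is the gfp_mask printed in the WARNING quoted earlier in the thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit picked only for this demo; the patch's real value may differ. */
#define DEMO_GFP_MEMCG_NO_OOM 0x100000u

/* Buggy test: OR-ing in a non-zero constant never yields zero,
 * so the negation is false for every possible gfp_mask. */
static bool oom_allowed_buggy(unsigned int gfp_mask)
{
	return !(gfp_mask | DEMO_GFP_MEMCG_NO_OOM);
}

/* Corrected test: AND isolates the flag, so OOM stays enabled
 * unless the caller explicitly set the flag. */
static bool oom_allowed_fixed(unsigned int gfp_mask)
{
	return !(gfp_mask & DEMO_GFP_MEMCG_NO_OOM);
}

/* Checks both forms against the mask from the WARNING (208). */
static void demo_self_check(void)
{
	/* The buggy form disables OOM even for a plain allocation. */
	assert(!oom_allowed_buggy(208u));
	/* The fixed form keeps OOM enabled for a plain allocation... */
	assert(oom_allowed_fixed(208u));
	/* ...and disables it only when the flag is present. */
	assert(!oom_allowed_fixed(208u | DEMO_GFP_MEMCG_NO_OOM));
}
```

Calling `demo_self_check()` from a `main` exercises both forms; with the buggy expression every charge behaves as if the caller had asked for no OOM, which matches the `oom:0` seen in the WARNING.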
azur From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753631Ab2LQTz0 (ORCPT ); Mon, 17 Dec 2012 14:55:26 -0500 Received: from mail-ee0-f46.google.com ([74.125.83.46]:54643 "EHLO mail-ee0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752276Ab2LQTzY (ORCPT ); Mon, 17 Dec 2012 14:55:24 -0500 Date: Mon, 17 Dec 2012 20:55:10 +0100 From: Michal Hocko To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121217195510.GA16375@dhcp22.suse.cz> References: <20121205141722.GA9714@dhcp22.suse.cz> <20121206012924.FE077FD7@pobox.sk> <20121206095423.GB10931@dhcp22.suse.cz> <20121210022038.E6570D37@pobox.sk> <20121210094318.GA6777@dhcp22.suse.cz> <20121210111817.F697F53E@pobox.sk> <20121210155205.GB6777@dhcp22.suse.cz> <20121217023430.5A390FD7@pobox.sk> <20121217163203.GD25432@dhcp22.suse.cz> <20121217192301.829A7020@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121217192301.829A7020@pobox.sk> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon 17-12-12 19:23:01, azurIt wrote: > >[Ohh, I am really an idiot. I screwed the first patch] > >- bool oom = true; > >+ bool oom = !(gfp_mask | GFP_MEMCG_NO_OOM); > > > >Which obviously doesn't work. It should read !(gfp_mask &GFP_MEMCG_NO_OOM). > > No idea how I could have missed that. I am really sorry about that. > > > :D no problem :) so, now it should really work as expected and > completely fix my original problem? It should mitigate the problem. The real fix shouldn't be that specific (as per discussion in other thread). 
The chance this will get upstream is not big and that means that it will not get to the stable tree either. > is it safe to apply it on 3.2.35? I didn't check what are the differences but I do not think there is anything to conflict with it. > Thank you very much! HTH -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755085Ab2LROW1 (ORCPT ); Tue, 18 Dec 2012 09:22:27 -0500 Received: from gmmr4.centrum.cz ([46.255.227.253]:41597 "EHLO gmmr4.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754869Ab2LROW0 (ORCPT ); Tue, 18 Dec 2012 09:22:26 -0500 To: =?utf-8?q?Michal_Hocko?= Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_from_add=5Fto=5Fpage=5Fcache=5Flocked?= Date: Tue, 18 Dec 2012 15:22:23 +0100 From: "azurIt" Cc: , , =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= References: <20121205141722.GA9714@dhcp22.suse.cz>, <20121206012924.FE077FD7@pobox.sk>, <20121206095423.GB10931@dhcp22.suse.cz>, <20121210022038.E6570D37@pobox.sk>, <20121210094318.GA6777@dhcp22.suse.cz>, <20121210111817.F697F53E@pobox.sk>, <20121210155205.GB6777@dhcp22.suse.cz>, <20121217023430.5A390FD7@pobox.sk>, <20121217163203.GD25432@dhcp22.suse.cz>, <20121217192301.829A7020@pobox.sk> <20121217195510.GA16375@dhcp22.suse.cz> In-Reply-To: <20121217195510.GA16375@dhcp22.suse.cz> X-Mailer: Centrum Email 5.3 X-Priority: 3 X-Original-From: azurit@pobox.sk MIME-Version: 1.0 Message-Id: <20121218152223.6912832C@pobox.sk> X-Maser: oho Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org >It should mitigate the problem. The real fix shouldn't be that specific >(as per discussion in other thread). 
The chance this will get upstream >is not big and that means that it will not get to the stable tree >either. OOM is no longer killing processes outside target cgroups, so everything looks fine so far. Will report back when i will have more info. Thnks! azur From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932178Ab2LRPUJ (ORCPT ); Tue, 18 Dec 2012 10:20:09 -0500 Received: from cantor2.suse.de ([195.135.220.15]:49215 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932067Ab2LRPUH (ORCPT ); Tue, 18 Dec 2012 10:20:07 -0500 Date: Tue, 18 Dec 2012 16:20:04 +0100 From: Michal Hocko To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121218152004.GA25208@dhcp22.suse.cz> References: <20121206095423.GB10931@dhcp22.suse.cz> <20121210022038.E6570D37@pobox.sk> <20121210094318.GA6777@dhcp22.suse.cz> <20121210111817.F697F53E@pobox.sk> <20121210155205.GB6777@dhcp22.suse.cz> <20121217023430.5A390FD7@pobox.sk> <20121217163203.GD25432@dhcp22.suse.cz> <20121217192301.829A7020@pobox.sk> <20121217195510.GA16375@dhcp22.suse.cz> <20121218152223.6912832C@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121218152223.6912832C@pobox.sk> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue 18-12-12 15:22:23, azurIt wrote: > >It should mitigate the problem. The real fix shouldn't be that specific > >(as per discussion in other thread). The chance this will get upstream > >is not big and that means that it will not get to the stable tree > >either. > > > OOM is no longer killing processes outside target cgroups, so > everything looks fine so far. 
Will report back when i will have more > info. Thnks! OK, good to hear and fingers crossed. I will try to get back to the original problem and a better solution sometimes early next year when all the things settle a bit. -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752980Ab2LXNZb (ORCPT ); Mon, 24 Dec 2012 08:25:31 -0500 Received: from gmmr5.centrum.cz ([46.255.225.250]:37807 "EHLO gmmr5.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752473Ab2LXNZ2 (ORCPT ); Mon, 24 Dec 2012 08:25:28 -0500 To: =?utf-8?q?Michal_Hocko?= Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_from_add=5Fto=5Fpage=5Fcache=5Flocked?= Date: Mon, 24 Dec 2012 14:25:26 +0100 From: "azurIt" Cc: , , =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= References: <20121206095423.GB10931@dhcp22.suse.cz>, <20121210022038.E6570D37@pobox.sk>, <20121210094318.GA6777@dhcp22.suse.cz>, <20121210111817.F697F53E@pobox.sk>, <20121210155205.GB6777@dhcp22.suse.cz>, <20121217023430.5A390FD7@pobox.sk>, <20121217163203.GD25432@dhcp22.suse.cz>, <20121217192301.829A7020@pobox.sk>, <20121217195510.GA16375@dhcp22.suse.cz>, <20121218152223.6912832C@pobox.sk> <20121218152004.GA25208@dhcp22.suse.cz> In-Reply-To: <20121218152004.GA25208@dhcp22.suse.cz> X-Mailer: Centrum Email 5.3 X-Priority: 3 X-Original-From: azurit@pobox.sk MIME-Version: 1.0 Message-Id: <20121224142526.020165D3@pobox.sk> X-Maser: brud Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org >OK, good to hear and fingers crossed. I will try to get back to the >original problem and a better solution sometimes early next year when >all the things settle a bit. Michal, problem, unfortunately, happened again :( twice. 
When it happened the first time (two days ago) i didn't want to believe it, so i recompiled the kernel and booted it again to be sure i really used your patch. Today it happened again, here is the report: http://watchdog.sk/lkml/memcg-bug-3.tar.gz Here is the patch which i used (kernel 3.2.35, i didn't use any of your other patches): http://watchdog.sk/lkml/5-memcg-fix.patch azur From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753022Ab2LXNiy (ORCPT ); Mon, 24 Dec 2012 08:38:54 -0500 Received: from gmmr4.centrum.cz ([46.255.227.253]:48528 "EHLO gmmr4.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752321Ab2LXNix (ORCPT ); Mon, 24 Dec 2012 08:38:53 -0500 To: =?utf-8?q?Michal_Hocko?= Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_from_add=5Fto=5Fpage=5Fcache=5Flocked?= Date: Mon, 24 Dec 2012 14:38:50 +0100 From: "azurIt" Cc: , , =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= References: <20121206095423.GB10931@dhcp22.suse.cz>, <20121210022038.E6570D37@pobox.sk>, <20121210094318.GA6777@dhcp22.suse.cz>, <20121210111817.F697F53E@pobox.sk>, <20121210155205.GB6777@dhcp22.suse.cz>, <20121217023430.5A390FD7@pobox.sk>, <20121217163203.GD25432@dhcp22.suse.cz>, <20121217192301.829A7020@pobox.sk>, <20121217195510.GA16375@dhcp22.suse.cz>, <20121218152223.6912832C@pobox.sk> <20121218152004.GA25208@dhcp22.suse.cz> In-Reply-To: <20121218152004.GA25208@dhcp22.suse.cz> X-Mailer: Centrum Email 5.3 X-Priority: 3 X-Original-From: azurit@pobox.sk MIME-Version: 1.0 Message-Id: <20121224143850.B611B3C3@pobox.sk> X-Maser: oho Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org >OK, good to hear and fingers crossed.
I will try to get back to the >original problem and a better solution sometimes early next year when >all the things settle a bit. Btw, i noticed one more thing when the problem is happening (=when any cgroup is stuck), i forgot to mention it before, sorry :( . It's related to HDDs, something is slowing them down in a strange way. All services are working normally and i really cannot notice any slowness, the only thing which i noticed is affected is our backup software ( www.Bacula.org ). When the problem occurs at night, so it's happening when backup is running, backup is extremely slow and usually doesn't finish until i kill processes inside the affected cgroup (=until i resolve the problem). Backup software is NOT doing big HDD bandwidth BUT it's doing quite a huge number of disk operations (it needs to stat every file and directory). I believe that only the speed of disk operations is affected and they are very slow. Merry Christmas! From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753887Ab2L1QfZ (ORCPT ); Fri, 28 Dec 2012 11:35:25 -0500 Received: from cantor2.suse.de ([195.135.220.15]:37547 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753686Ab2L1QfX (ORCPT ); Fri, 28 Dec 2012 11:35:23 -0500 Date: Fri, 28 Dec 2012 17:35:21 +0100 From: Michal Hocko To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121228163521.GB1455@dhcp22.suse.cz> References: <20121210094318.GA6777@dhcp22.suse.cz> <20121210111817.F697F53E@pobox.sk> <20121210155205.GB6777@dhcp22.suse.cz> <20121217023430.5A390FD7@pobox.sk> <20121217163203.GD25432@dhcp22.suse.cz> <20121217192301.829A7020@pobox.sk> <20121217195510.GA16375@dhcp22.suse.cz> <20121218152223.6912832C@pobox.sk> <20121218152004.GA25208@dhcp22.suse.cz>
<20121224143850.B611B3C3@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121224143850.B611B3C3@pobox.sk> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon 24-12-12 14:38:50, azurIt wrote: > >OK, good to hear and fingers crossed. I will try to get back to the > >original problem and a better solution sometimes early next year when > >all the things settle a bit. > > > Btw, i noticed one more thing when the problem is happening (=when any > cgroup is stuck), i forgot to mention it before, sorry :( . It's > related to HDDs, something is slowing them down in a strange way. All > services are working normally and i really cannot notice any slowness, > the only thing which i noticed is affected is our backup software ( > www.Bacula.org ). When the problem occurs at night, so it's happening when > backup is running, backup is extremely slow and usually doesn't finish > until i kill processes inside the affected cgroup (=until i resolve the > problem). Backup software is NOT doing big HDD bandwidth BUT it's > doing quite a huge number of disk operations (it needs to stat every > file and directory). I believe that only the speed of disk operations is > affected and they are very slow. I would bet that this is caused by the processes blocked in the memcg OOM handler, which hold i_mutex, while the backup process wants to access the same inode with an operation that requires the lock.
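A rough single-threaded sketch of the suspected sequence, under stated assumptions: `struct demo_inode` and the `demo_*` helpers are hypothetical stand-ins, and a C11 `atomic_flag` plays the role of i_mutex. The real i_mutex is a sleeping mutex, so a second taker would block instead of getting a failed try-lock; the try-lock is used here only to keep the example deterministic.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for an inode; the flag plays the role of i_mutex. */
struct demo_inode {
	atomic_flag i_mutex;
};

/* Try to take the lock without sleeping; true means the caller now holds it. */
static bool demo_trylock(struct demo_inode *inode)
{
	return !atomic_flag_test_and_set(&inode->i_mutex);
}

/* Release the lock. */
static void demo_unlock(struct demo_inode *inode)
{
	atomic_flag_clear(&inode->i_mutex);
}

/* Walks through the suspected order of events. */
static void demo_scenario(void)
{
	struct demo_inode dir = { ATOMIC_FLAG_INIT };

	/* A task in the cgroup takes the lock, then gets stuck in the OOM handler. */
	assert(demo_trylock(&dir));

	/* The backup's operation on the same inode cannot get the lock. */
	assert(!demo_trylock(&dir));

	/* Killing the stuck task (or raising the limit) lets it drop the lock... */
	demo_unlock(&dir);

	/* ...and only then can the backup proceed. */
	assert(demo_trylock(&dir));
	demo_unlock(&dir);
}
```

If this is what happens, it would be consistent with the observation that the backup only finishes once processes inside the affected cgroup are killed.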
-- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753855Ab2L1QWP (ORCPT ); Fri, 28 Dec 2012 11:22:15 -0500 Received: from cantor2.suse.de ([195.135.220.15]:37299 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753723Ab2L1QWM (ORCPT ); Fri, 28 Dec 2012 11:22:12 -0500 Date: Fri, 28 Dec 2012 17:22:09 +0100 From: Michal Hocko To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20121228162209.GA1455@dhcp22.suse.cz> References: <20121210094318.GA6777@dhcp22.suse.cz> <20121210111817.F697F53E@pobox.sk> <20121210155205.GB6777@dhcp22.suse.cz> <20121217023430.5A390FD7@pobox.sk> <20121217163203.GD25432@dhcp22.suse.cz> <20121217192301.829A7020@pobox.sk> <20121217195510.GA16375@dhcp22.suse.cz> <20121218152223.6912832C@pobox.sk> <20121218152004.GA25208@dhcp22.suse.cz> <20121224142526.020165D3@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121224142526.020165D3@pobox.sk> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon 24-12-12 14:25:26, azurIt wrote: > >OK, good to hear and fingers crossed. I will try to get back to the > >original problem and a better solution sometimes early next year when > >all the things settle a bit. > > > Michal, problem, unfortunately, happened again :( twice. When it > happened first time (two days ago) i don't want to believe it so i > recompiled the kernel and boot it again to be sure i really used your > patch. 
Today it happened again, here is report:
> http://watchdog.sk/lkml/memcg-bug-3.tar.gz

Hmm, 1356352982/1507/stack says
[] mem_cgroup_handle_oom+0x241/0x3b0
[] T.1147+0x5ab/0x5c0
[] mem_cgroup_cache_charge+0xbe/0xe0
[] add_to_page_cache_locked+0x4f/0x140
[] add_to_page_cache_lru+0x22/0x50
[] find_or_create_page+0x73/0xb0
[] __getblk+0xea/0x2c0
[] ext3_getblk+0xeb/0x240
[] ext3_bread+0x19/0x90
[] ext3_dx_find_entry+0x83/0x1e0
[] ext3_find_entry+0x2e4/0x480
[] ext3_lookup+0x4d/0x120
[] d_alloc_and_lookup+0x45/0x90
[] do_lookup+0x278/0x390
[] path_lookupat+0xae/0x7e0
[] do_path_lookup+0x35/0xe0
[] user_path_at_empty+0x59/0xb0
[] user_path_at+0x11/0x20
[] vfs_fstatat+0x47/0x80
[] vfs_lstat+0x1e/0x20
[] sys_newlstat+0x24/0x50
[] system_call_fastpath+0x18/0x1d
[] 0xffffffffffffffff

which suggests that the patch is incomplete and that I am blind :/
mem_cgroup_cache_charge calls __mem_cgroup_try_charge for the page cache and that one doesn't check GFP_MEMCG_NO_OOM. So you need the following follow-up patch on top of the one you already have (which should catch all the remaining cases).
Sorry about that...
---
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 89997ac..559a54d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2779,6 +2779,7 @@ __mem_cgroup_commit_charge_lrucare(struct page *page, struct mem_cgroup *memcg,
 int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 				gfp_t gfp_mask)
 {
+	bool oom = !(gfp_mask & GFP_MEMCG_NO_OOM);
 	struct mem_cgroup *memcg = NULL;
 	int ret;
 
@@ -2791,7 +2792,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 		mm = &init_mm;
 
 	if (page_is_file_cache(page)) {
-		ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, &memcg, true);
+		ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, &memcg, oom);
 		if (ret || !memcg)
 			return ret;
 
@@ -2827,6 +2828,7 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
 				 struct page *page,
 				 gfp_t mask, struct mem_cgroup **ptr)
 {
+	bool oom = !(mask & GFP_MEMCG_NO_OOM);
 	struct mem_cgroup *memcg;
 	int ret;
 
@@ -2849,13 +2851,13 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
 	if (!memcg)
 		goto charge_cur_mm;
 	*ptr = memcg;
-	ret = __mem_cgroup_try_charge(NULL, mask, 1, ptr, true);
+	ret = __mem_cgroup_try_charge(NULL, mask, 1, ptr, oom);
 	css_put(&memcg->css);
 	return ret;
 charge_cur_mm:
 	if (unlikely(!mm))
 		mm = &init_mm;
-	return __mem_cgroup_try_charge(mm, mask, 1, ptr, true);
+	return __mem_cgroup_try_charge(mm, mask, 1, ptr, oom);
 }
 
 static void
--
Michal Hocko
SUSE Labs
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753869Ab2L3BJx (ORCPT ); Sat, 29 Dec 2012 20:09:53 -0500 Received: from gmmr4.centrum.cz ([46.255.227.253]:48930 "EHLO gmmr4.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753614Ab2L3BJu (ORCPT ); Sat, 29 Dec 2012 20:09:50 -0500 To: =?utf-8?q?Michal_Hocko?= Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_from_add=5Fto=5Fpage=5Fcache=5Flocked?= Date: Sun, 30 Dec 2012 02:09:47 +0100 From: "azurIt" Cc: , ,
=?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= References: <20121210094318.GA6777@dhcp22.suse.cz>, <20121210111817.F697F53E@pobox.sk>, <20121210155205.GB6777@dhcp22.suse.cz>, <20121217023430.5A390FD7@pobox.sk>, <20121217163203.GD25432@dhcp22.suse.cz>, <20121217192301.829A7020@pobox.sk>, <20121217195510.GA16375@dhcp22.suse.cz>, <20121218152223.6912832C@pobox.sk>, <20121218152004.GA25208@dhcp22.suse.cz>, <20121224142526.020165D3@pobox.sk> <20121228162209.GA1455@dhcp22.suse.cz> In-Reply-To: <20121228162209.GA1455@dhcp22.suse.cz> X-Mailer: Centrum Email 5.3 X-Priority: 3 X-Original-From: azurit@pobox.sk MIME-Version: 1.0 Message-Id: <20121230020947.AA002F34@pobox.sk> X-Maser: oho Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org >which suggests that the patch is incomplete and that I am blind :/ >mem_cgroup_cache_charge calls __mem_cgroup_try_charge for the page cache >and that one doesn't check GFP_MEMCG_NO_OOM. So you need the following >follow-up patch on top of the one you already have (which should catch >all the remaining cases). >Sorry about that... 
This was, again, killing my MySQL server (search for "(mysqld)"): http://www.watchdog.sk/lkml/oom_mysqld5 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932155Ab3AYPHa (ORCPT ); Fri, 25 Jan 2013 10:07:30 -0500 Received: from gmmr3.centrum.cz ([46.255.225.251]:32923 "EHLO gmmr3.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755906Ab3AYPH1 (ORCPT ); Fri, 25 Jan 2013 10:07:27 -0500 To: =?utf-8?q?Michal_Hocko?= Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_from_add=5Fto=5Fpage=5Fcache=5Flocked?= Date: Fri, 25 Jan 2013 16:07:23 +0100 From: "azurIt" Cc: , , =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= References: <20121210155205.GB6777@dhcp22.suse.cz>, <20121217023430.5A390FD7@pobox.sk>, <20121217163203.GD25432@dhcp22.suse.cz>, <20121217192301.829A7020@pobox.sk>, <20121217195510.GA16375@dhcp22.suse.cz>, <20121218152223.6912832C@pobox.sk>, <20121218152004.GA25208@dhcp22.suse.cz>, <20121224142526.020165D3@pobox.sk>, <20121228162209.GA1455@dhcp22.suse.cz>, <20121230020947.AA002F34@pobox.sk> <20121230110815.GA12940@dhcp22.suse.cz> In-Reply-To: <20121230110815.GA12940@dhcp22.suse.cz> X-Mailer: Centrum Email 5.3 X-Priority: 3 X-Original-From: azurit@pobox.sk MIME-Version: 1.0 Message-Id: <20130125160723.FAE73567@pobox.sk> X-Maser: oho Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Any news? Thnx! 
azur

______________________________________________________________
> From: "Michal Hocko"
> To: azurIt
> Date: 30.12.2012 12:08
> Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
>
> CC: linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , "Johannes Weiner"
>On Sun 30-12-12 02:09:47, azurIt wrote:
>> >which suggests that the patch is incomplete and that I am blind :/
>> >mem_cgroup_cache_charge calls __mem_cgroup_try_charge for the page cache
>> >and that one doesn't check GFP_MEMCG_NO_OOM. So you need the following
>> >follow-up patch on top of the one you already have (which should catch
>> >all the remaining cases).
>> >Sorry about that...
>>
>>
>> This was, again, killing my MySQL server (search for "(mysqld)"):
>> http://www.watchdog.sk/lkml/oom_mysqld5
>
>grep "Kill process" oom_mysqld5
>Dec 30 01:53:34 server01 kernel: [ 367.061801] Memory cgroup out of memory: Kill process 5512 (apache2) score 716 or sacrifice child
>Dec 30 01:53:35 server01 kernel: [ 367.338024] Memory cgroup out of memory: Kill process 5517 (apache2) score 718 or sacrifice child
>Dec 30 01:53:35 server01 kernel: [ 367.747888] Memory cgroup out of memory: Kill process 5513 (apache2) score 721 or sacrifice child
>Dec 30 01:53:36 server01 kernel: [ 368.159860] Memory cgroup out of memory: Kill process 5516 (apache2) score 726 or sacrifice child
>Dec 30 01:53:36 server01 kernel: [ 368.665606] Memory cgroup out of memory: Kill process 5520 (apache2) score 733 or sacrifice child
>Dec 30 01:53:36 server01 kernel: [ 368.765652] Out of memory: Kill process 1778 (mysqld) score 39 or sacrifice child
>Dec 30 01:53:36 server01 kernel: [ 369.101753] Memory cgroup out of memory: Kill process 5519 (apache2) score 754 or sacrifice child
>Dec 30 01:53:37 server01 kernel: [ 369.464262] Memory cgroup out of memory: Kill process 5583 (apache2) score 762 or sacrifice child
>Dec 30 01:53:37 server01 kernel: [ 369.465017] Out of
memory: Kill process 5506 (apache2) score 18 or sacrifice child >Dec 30 01:53:37 server01 kernel: [ 369.574932] Memory cgroup out of memory: Kill process 5523 (apache2) score 759 or sacrifice child > >So your mysqld has been killed by the global OOM not memcg. But why when >you seem to be perfectly fine regarding memory? I guess the following >backtrace is relevant: >Dec 30 01:53:36 server01 kernel: [ 368.569720] DMA: 0*4kB 1*8kB 0*16kB 1*32kB 2*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15912kB >Dec 30 01:53:36 server01 kernel: [ 368.570447] DMA32: 9*4kB 10*8kB 8*16kB 6*32kB 5*64kB 6*128kB 4*256kB 2*512kB 3*1024kB 3*2048kB 613*4096kB = 2523636kB >Dec 30 01:53:36 server01 kernel: [ 368.571175] Normal: 5*4kB 2060*8kB 4122*16kB 2550*32kB 2667*64kB 722*128kB 197*256kB 68*512kB 15*1024kB 4*2048kB 1855*4096kB = 8134036kB >Dec 30 01:53:36 server01 kernel: [ 368.571906] 308964 total pagecache pages >Dec 30 01:53:36 server01 kernel: [ 368.572023] 0 pages in swap cache >Dec 30 01:53:36 server01 kernel: [ 368.572140] Swap cache stats: add 0, delete 0, find 0/0 >Dec 30 01:53:36 server01 kernel: [ 368.572260] Free swap = 0kB >Dec 30 01:53:36 server01 kernel: [ 368.572375] Total swap = 0kB >Dec 30 01:53:36 server01 kernel: [ 368.597836] apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0 >Dec 30 01:53:36 server01 kernel: [ 368.598034] apache2 cpuset=uid mems_allowed=0 >Dec 30 01:53:36 server01 kernel: [ 368.598152] Pid: 5385, comm: apache2 Not tainted 3.2.35-grsec #1 >Dec 30 01:53:36 server01 kernel: [ 368.598273] Call Trace: >Dec 30 01:53:36 server01 kernel: [ 368.598396] [] dump_header+0x7e/0x1e0 >Dec 30 01:53:36 server01 kernel: [ 368.598516] [] ? 
find_lock_task_mm+0x2f/0x70 >Dec 30 01:53:36 server01 kernel: [ 368.598638] [] oom_kill_process+0x85/0x2a0 >Dec 30 01:53:36 server01 kernel: [ 368.598759] [] out_of_memory+0xe5/0x200 >Dec 30 01:53:36 server01 kernel: [ 368.598880] [] pagefault_out_of_memory+0xbd/0x110 >Dec 30 01:53:36 server01 kernel: [ 368.599006] [] mm_fault_error+0xb6/0x1a0 >Dec 30 01:53:36 server01 kernel: [ 368.599127] [] do_page_fault+0x3ee/0x460 >Dec 30 01:53:36 server01 kernel: [ 368.599250] [] ? mntput+0x1f/0x30 >Dec 30 01:53:36 server01 kernel: [ 368.599371] [] ? fput+0x156/0x200 >Dec 30 01:53:36 server01 kernel: [ 368.599496] [] page_fault+0x1f/0x30 > >This would suggest that an unexpected ENOMEM leaked during page fault >path. I do not see which one could that be because you said THP >(CONFIG_TRANSPARENT_HUGEPAGE) are disabled (and the other patch I have >mentioned in the thread should fix that issue - btw. the patch is >already scheduled for stable tree). > __do_fault, do_anonymous_page and do_wp_page call >mem_cgroup_newpage_charge with GFP_KERNEL which means that >we do memcg OOM and never return ENOMEM. do_swap_page calls >mem_cgroup_try_charge_swapin with GFP_KERNEL as well. > >I might have missed something but I will not get to look closer before >2nd January. 
>-- >Michal Hocko >SUSE Labs > From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757389Ab3AYQbm (ORCPT ); Fri, 25 Jan 2013 11:31:42 -0500 Received: from cantor2.suse.de ([195.135.220.15]:41874 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756981Ab3AYQbj (ORCPT ); Fri, 25 Jan 2013 11:31:39 -0500 Date: Fri, 25 Jan 2013 17:31:30 +0100 From: Michal Hocko To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20130125163130.GF4721@dhcp22.suse.cz> References: <20121217163203.GD25432@dhcp22.suse.cz> <20121217192301.829A7020@pobox.sk> <20121217195510.GA16375@dhcp22.suse.cz> <20121218152223.6912832C@pobox.sk> <20121218152004.GA25208@dhcp22.suse.cz> <20121224142526.020165D3@pobox.sk> <20121228162209.GA1455@dhcp22.suse.cz> <20121230020947.AA002F34@pobox.sk> <20121230110815.GA12940@dhcp22.suse.cz> <20130125160723.FAE73567@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20130125160723.FAE73567@pobox.sk> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri 25-01-13 16:07:23, azurIt wrote: > Any news? Thnx! Sorry, but I didn't get to this one yet. 
> >-- > >Michal Hocko > >SUSE Labs > > > -- > To unsubscribe from this list: send the line "unsubscribe cgroups" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755498Ab3BENtv (ORCPT ); Tue, 5 Feb 2013 08:49:51 -0500 Received: from cantor2.suse.de ([195.135.220.15]:50721 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754654Ab3BENtq (ORCPT ); Tue, 5 Feb 2013 08:49:46 -0500 Date: Tue, 5 Feb 2013 14:49:42 +0100 From: Michal Hocko To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20130205134937.GA22804@dhcp22.suse.cz> References: <20121217192301.829A7020@pobox.sk> <20121217195510.GA16375@dhcp22.suse.cz> <20121218152223.6912832C@pobox.sk> <20121218152004.GA25208@dhcp22.suse.cz> <20121224142526.020165D3@pobox.sk> <20121228162209.GA1455@dhcp22.suse.cz> <20121230020947.AA002F34@pobox.sk> <20121230110815.GA12940@dhcp22.suse.cz> <20130125160723.FAE73567@pobox.sk> <20130125163130.GF4721@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130125163130.GF4721@dhcp22.suse.cz> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri 25-01-13 17:31:30, Michal Hocko wrote: > On Fri 25-01-13 16:07:23, azurIt wrote: > > Any news? Thnx! > > Sorry, but I didn't get to this one yet. Sorry, to get back to this that late but I was busy as hell since the beginning of the year. Has the issue repeated since then? You said you didn't apply other than the above mentioned patch. 
Could you apply also debugging part of the patches I have sent? In case you don't have it handy then it should be this one: --- >>From 1623420d964e7e8bc88e2a6239563052df891bf7 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Mon, 3 Dec 2012 16:16:01 +0100 Subject: [PATCH] more debugging --- mm/huge_memory.c | 6 +++--- mm/memcontrol.c | 1 + 2 files changed, 4 insertions(+), 3 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 470cbb4..01a11f1 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -671,7 +671,7 @@ static inline struct page *alloc_hugepage(int defrag) } #endif -int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma, +noinline int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, pmd_t *pmd, unsigned int flags) { @@ -790,7 +790,7 @@ pgtable_t get_pmd_huge_pte(struct mm_struct *mm) return pgtable; } -static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm, +static noinline int do_huge_pmd_wp_page_fallback(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, pmd_t *pmd, pmd_t orig_pmd, @@ -883,7 +883,7 @@ out_free_pages: goto out; } -int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, +noinline int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, pmd_t *pmd, pmd_t orig_pmd) { int ret = 0; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c8425b1..1986c65 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2397,6 +2397,7 @@ done: return 0; nomem: *ptr = NULL; + __WARN_printf("gfp_mask:%u nr_pages:%u oom:%d ret:%d\n", gfp_mask, nr_pages, oom, ret); return -ENOMEM; bypass: *ptr = NULL; -- 1.7.10.4 -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755904Ab3BEPAX (ORCPT ); Tue, 5 Feb 2013 10:00:23 -0500 Received: from gmmr2.centrum.cz ([46.255.227.252]:38901 "EHLO 
gmmr2.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755077Ab3BEPAV (ORCPT ); Tue, 5 Feb 2013 10:00:21 -0500
X-Greylist: delayed 631 seconds by postgrey-1.27 at vger.kernel.org; Tue, 05 Feb 2013 10:00:20 EST
To: =?utf-8?q?Michal_Hocko?=
Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_from_add=5Fto=5Fpage=5Fcache=5Flocked?=
Date: Tue, 05 Feb 2013 15:49:47 +0100
From: "azurIt"
Cc: , , =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?=
References: <20121217192301.829A7020@pobox.sk>, <20121217195510.GA16375@dhcp22.suse.cz>, <20121218152223.6912832C@pobox.sk>, <20121218152004.GA25208@dhcp22.suse.cz>, <20121224142526.020165D3@pobox.sk>, <20121228162209.GA1455@dhcp22.suse.cz>, <20121230020947.AA002F34@pobox.sk>, <20121230110815.GA12940@dhcp22.suse.cz>, <20130125160723.FAE73567@pobox.sk>, <20130125163130.GF4721@dhcp22.suse.cz> <20130205134937.GA22804@dhcp22.suse.cz>
In-Reply-To: <20130205134937.GA22804@dhcp22.suse.cz>
X-Mailer: Centrum Email 5.3
X-Priority: 3
X-Original-From: azurit@pobox.sk
MIME-Version: 1.0
Message-Id: <20130205154947.CD6411E2@pobox.sk>
X-Maser: oho
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

>Sorry, to get back to this that late but I was busy as hell since the
>beginning of the year.

Thank you for your time!

>Has the issue repeated since then?

Yes, it's happening all the time, but in the meantime I wrote a script
which monitors for the problem and kills the frozen processes when it
occurs. But I don't like it much, it's not a solution for me :(

I also noticed that the problem always affects the whole server, though
not as much as the frozen cgroup. It depends on the number of frozen
processes: sometimes it has almost no impact on the rest of the server,
sometimes the whole server lags a lot.
I have another old problem which may also be related to this. I hadn't
connected the two before, but now I'm not sure. Two of our servers, both
affected by this cgroup problem, also randomly freeze completely (a few
times per month). These are the symptoms:
 - the servers still answer ping
 - it is possible to connect via SSH, but the connection freezes after
   sending the password
 - it is possible to log in via console, but it freezes after typing
   the login

These symptoms are very similar to HDD problems or HDD overload (but
there is definitely no overload). The only way to fix it is, probably,
hard rebooting the server (I didn't find any other way). What do you
think? Can this be related? Maybe the HDDs are locked in a similar way
to the cgroups - we already found out that the cgroup freezing is also
related to HDD activity. Maybe there is a small chance that the whole
HDD subsystem ends up in a deadlock?

>You said you didn't apply other than the above mentioned patch. Could
>you apply also debugging part of the patches I have sent?
>In case you don't have it handy then it should be this one:

Just to be sure - am I supposed to apply these two patches?
http://watchdog.sk/lkml/patches/ azur From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756233Ab3BEQJl (ORCPT ); Tue, 5 Feb 2013 11:09:41 -0500 Received: from cantor2.suse.de ([195.135.220.15]:56875 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755247Ab3BEQJh (ORCPT ); Tue, 5 Feb 2013 11:09:37 -0500 Date: Tue, 5 Feb 2013 17:09:34 +0100 From: Michal Hocko To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20130205160934.GB22804@dhcp22.suse.cz> References: <20121218152223.6912832C@pobox.sk> <20121218152004.GA25208@dhcp22.suse.cz> <20121224142526.020165D3@pobox.sk> <20121228162209.GA1455@dhcp22.suse.cz> <20121230020947.AA002F34@pobox.sk> <20121230110815.GA12940@dhcp22.suse.cz> <20130125160723.FAE73567@pobox.sk> <20130125163130.GF4721@dhcp22.suse.cz> <20130205134937.GA22804@dhcp22.suse.cz> <20130205154947.CD6411E2@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130205154947.CD6411E2@pobox.sk> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue 05-02-13 15:49:47, azurIt wrote: [...] > Just to be sure - am i supposed to apply this two patches? > http://watchdog.sk/lkml/patches/ 5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I mentioned in a follow up email. 
Here is the full patch:
---
>>From f2bf8437d5b9bb38a95a432bf39f32c584955171 Mon Sep 17 00:00:00 2001
From: Michal Hocko
Date: Mon, 26 Nov 2012 11:47:57 +0100
Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked

memcg oom killer might deadlock if the process which falls down to
mem_cgroup_handle_oom holds a lock which prevents other task to
terminate because it is blocked on the very same lock.
This can happen when a write system call needs to allocate a page but
the allocation hits the memcg hard limit and there is nothing to reclaim
(e.g. there is no swap or swap limit is hit as well and all cache pages
have been reclaimed already) and the process selected by memcg OOM
killer is blocked on i_mutex on the same inode (e.g. truncate it).

Process A
[] do_truncate+0x58/0xa0 # takes i_mutex
[] do_last+0x250/0xa30
[] path_openat+0xd7/0x440
[] do_filp_open+0x49/0xa0
[] do_sys_open+0x106/0x240
[] sys_open+0x20/0x30
[] system_call_fastpath+0x18/0x1d
[] 0xffffffffffffffff

Process B
[] mem_cgroup_handle_oom+0x241/0x3b0
[] T.1146+0x5ab/0x5c0
[] mem_cgroup_cache_charge+0xbe/0xe0
[] add_to_page_cache_locked+0x4c/0x140
[] add_to_page_cache_lru+0x22/0x50
[] grab_cache_page_write_begin+0x8b/0xe0
[] ext3_write_begin+0x88/0x270
[] generic_file_buffered_write+0x116/0x290
[] __generic_file_aio_write+0x27c/0x480
[] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex
[] do_sync_write+0xea/0x130
[] vfs_write+0xf3/0x1f0
[] sys_write+0x51/0x90
[] system_call_fastpath+0x18/0x1d
[] 0xffffffffffffffff

This is not a hard deadlock though because administrator can still
intervene and increase the limit on the group which helps the writer to
finish the allocation and release the lock.

This patch heals the problem by forbidding OOM from page cache charges
(namely add_to_page_cache_locked). mem_cgroup_cache_charge_no_oom helper
function is defined which adds GFP_MEMCG_NO_OOM to the gfp mask which
then tells mem_cgroup_charge_common that OOM is not allowed for the charge.
No OOM from this path, except for fixing the bug, also make some sense as we really do not want to cause an OOM because of a page cache usage. As a possibly visible result add_to_page_cache_lru might fail more often with ENOMEM but this is to be expected if the limit is set and it is preferable than OOM killer IMO. __GFP_NORETRY is abused for this memcg specific flag because no user accounted allocation use this flag except for THP which have memcg oom disabled already. Reported-by: azurIt Signed-off-by: Michal Hocko --- include/linux/gfp.h | 3 +++ include/linux/memcontrol.h | 13 +++++++++++++ mm/filemap.c | 8 +++++++- mm/memcontrol.c | 10 ++++++---- 4 files changed, 29 insertions(+), 5 deletions(-) diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 3a76faf..806fb54 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -146,6 +146,9 @@ struct vm_area_struct; /* 4GB DMA on some platforms */ #define GFP_DMA32 __GFP_DMA32 +/* memcg oom killer is not allowed */ +#define GFP_MEMCG_NO_OOM __GFP_NORETRY + /* Convert GFP flags to their corresponding migrate type */ static inline int allocflags_to_migratetype(gfp_t gfp_flags) { diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 81572af..bf0e575 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -63,6 +63,13 @@ extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *ptr); extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask); + +static inline int mem_cgroup_cache_charge_no_oom(struct page *page, + struct mm_struct *mm, gfp_t gfp_mask) +{ + return mem_cgroup_cache_charge(page, mm, gfp_mask | GFP_MEMCG_NO_OOM); +} + extern void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru); extern void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru); extern void mem_cgroup_rotate_reclaimable_page(struct page *page); @@ -178,6 +185,12 @@ static inline int mem_cgroup_cache_charge(struct page *page, return 
0; } +static inline int mem_cgroup_cache_charge_no_oom(struct page *page, + struct mm_struct *mm, gfp_t gfp_mask) +{ + return 0; +} + static inline int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page, gfp_t gfp_mask, struct mem_cgroup **ptr) { diff --git a/mm/filemap.c b/mm/filemap.c index 556858c..ef182a9 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -449,7 +449,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, VM_BUG_ON(!PageLocked(page)); VM_BUG_ON(PageSwapBacked(page)); - error = mem_cgroup_cache_charge(page, current->mm, + /* + * Cannot trigger OOM even if gfp_mask would allow that normally + * because we might be called from a locked context and that + * could lead to deadlocks if the killed process is waiting for + * the same lock. + */ + error = mem_cgroup_cache_charge_no_oom(page, current->mm, gfp_mask & GFP_RECLAIM_MASK); if (error) goto out; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 1986c65..a68aa08 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2704,7 +2704,7 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm, struct mem_cgroup *memcg = NULL; unsigned int nr_pages = 1; struct page_cgroup *pc; - bool oom = true; + bool oom = !(gfp_mask & GFP_MEMCG_NO_OOM); int ret; if (PageTransHuge(page)) { @@ -2771,6 +2771,7 @@ __mem_cgroup_commit_charge_lrucare(struct page *page, struct mem_cgroup *memcg, int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask) { + bool oom = !(gfp_mask & GFP_MEMCG_NO_OOM); struct mem_cgroup *memcg = NULL; int ret; @@ -2783,7 +2784,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, mm = &init_mm; if (page_is_file_cache(page)) { - ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, &memcg, true); + ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, &memcg, oom); if (ret || !memcg) return ret; @@ -2819,6 +2820,7 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page, 
gfp_t mask, struct mem_cgroup **ptr) { + bool oom = !(mask & GFP_MEMCG_NO_OOM); struct mem_cgroup *memcg; int ret; @@ -2841,13 +2843,13 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm, if (!memcg) goto charge_cur_mm; *ptr = memcg; - ret = __mem_cgroup_try_charge(NULL, mask, 1, ptr, true); + ret = __mem_cgroup_try_charge(NULL, mask, 1, ptr, oom); css_put(&memcg->css); return ret; charge_cur_mm: if (unlikely(!mm)) mm = &init_mm; - return __mem_cgroup_try_charge(mm, mask, 1, ptr, true); + return __mem_cgroup_try_charge(mm, mask, 1, ptr, oom); } static void -- 1.7.10.4 -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754833Ab3BEQbK (ORCPT ); Tue, 5 Feb 2013 11:31:10 -0500 Received: from cantor2.suse.de ([195.135.220.15]:57709 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753986Ab3BEQbH (ORCPT ); Tue, 5 Feb 2013 11:31:07 -0500 Date: Tue, 5 Feb 2013 17:31:06 +0100 From: Michal Hocko To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20130205163106.GC22804@dhcp22.suse.cz> References: <20121218152223.6912832C@pobox.sk> <20121218152004.GA25208@dhcp22.suse.cz> <20121224142526.020165D3@pobox.sk> <20121228162209.GA1455@dhcp22.suse.cz> <20121230020947.AA002F34@pobox.sk> <20121230110815.GA12940@dhcp22.suse.cz> <20130125160723.FAE73567@pobox.sk> <20130125163130.GF4721@dhcp22.suse.cz> <20130205134937.GA22804@dhcp22.suse.cz> <20130205154947.CD6411E2@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130205154947.CD6411E2@pobox.sk> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue 05-02-13 15:49:47, azurIt wrote: 
[...]
> I have another old problem which may also be related to this. I
> hadn't connected the two before, but now I'm not sure. Two of our
> servers, both affected by this cgroup problem, also randomly freeze
> completely (a few times per month). These are the symptoms:
>  - the servers still answer ping
>  - it is possible to connect via SSH, but the connection freezes after
>    sending the password
>  - it is possible to log in via console, but it freezes after typing
>    the login
> These symptoms are very similar to HDD problems or HDD overload (but
> there is definitely no overload). The only way to fix it is, probably,
> hard rebooting the server (I didn't find any other way). What do you
> think? Can this be related?

This is hard to tell without further information.

> Maybe the HDDs are locked in a similar way to the cgroups - we already
> found out that the cgroup freezing is also related to HDD activity.
> Maybe there is a small chance that the whole HDD subsystem ends up in
> a deadlock?

The "HDD subsystem", whatever that means, cannot be blocked by a stuck
memcg. Access to certain files might be an issue because locks could be
held on them, but I do not see any other relation.

I would start by checking the HW and reducing the elements that could
contribute, i.e. try to nail down the minimum setup which reproduces the
issue. I cannot help you much with that, I am afraid.
--
Michal Hocko
SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755576Ab3BEQqm (ORCPT ); Tue, 5 Feb 2013 11:46:42 -0500
Received: from gmmr2.centrum.cz ([46.255.227.252]:51154 "EHLO gmmr2.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754083Ab3BEQqk (ORCPT ); Tue, 5 Feb 2013 11:46:40 -0500
To: =?utf-8?q?Michal_Hocko?=
Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_from_add=5Fto=5Fpage=5Fcache=5Flocked?=
Date: Tue, 05 Feb 2013 17:46:37 +0100
From: "azurIt"
Cc: , , =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?=
References: <20121218152223.6912832C@pobox.sk>, <20121218152004.GA25208@dhcp22.suse.cz>, <20121224142526.020165D3@pobox.sk>, <20121228162209.GA1455@dhcp22.suse.cz>, <20121230020947.AA002F34@pobox.sk>, <20121230110815.GA12940@dhcp22.suse.cz>, <20130125160723.FAE73567@pobox.sk>, <20130125163130.GF4721@dhcp22.suse.cz>, <20130205134937.GA22804@dhcp22.suse.cz>, <20130205154947.CD6411E2@pobox.sk> <20130205160934.GB22804@dhcp22.suse.cz>
In-Reply-To: <20130205160934.GB22804@dhcp22.suse.cz>
X-Mailer: Centrum Email 5.3
X-Priority: 3
X-Original-From: azurit@pobox.sk
MIME-Version: 1.0
Message-Id: <20130205174637.C7A8CE45@pobox.sk>
X-Maser: oho
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

>5-memcg-fix-1.patch is not complete. It doesn't contain the follow-up I
>mentioned in a follow up email.

Oh, it wasn't complete? I used it in my last test. Sorry, I'm a little
confused by all those patches. I will try it tonight and report back.
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755912Ab3BEQsv (ORCPT ); Tue, 5 Feb 2013 11:48:51 -0500 Received: from mail-gh0-f202.google.com ([209.85.160.202]:53113 "EHLO mail-gh0-f202.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754626Ab3BEQs0 (ORCPT ); Tue, 5 Feb 2013 11:48:26 -0500 From: Greg Thelen To: Michal Hocko Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked References: <20121218152223.6912832C@pobox.sk> <20121218152004.GA25208@dhcp22.suse.cz> <20121224142526.020165D3@pobox.sk> <20121228162209.GA1455@dhcp22.suse.cz> <20121230020947.AA002F34@pobox.sk> <20121230110815.GA12940@dhcp22.suse.cz> <20130125160723.FAE73567@pobox.sk> <20130125163130.GF4721@dhcp22.suse.cz> <20130205134937.GA22804@dhcp22.suse.cz> <20130205154947.CD6411E2@pobox.sk> <20130205160934.GB22804@dhcp22.suse.cz> Date: Tue, 05 Feb 2013 08:48:23 -0800 In-Reply-To: <20130205160934.GB22804@dhcp22.suse.cz> (Michal Hocko's message of "Tue, 5 Feb 2013 17:09:34 +0100") Message-ID: User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.3 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Feb 05 2013, Michal Hocko wrote: > On Tue 05-02-13 15:49:47, azurIt wrote: > [...] >> Just to be sure - am i supposed to apply this two patches? >> http://watchdog.sk/lkml/patches/ > > 5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I > mentioned in a follow up email. 
Here is the full patch: > --- > From f2bf8437d5b9bb38a95a432bf39f32c584955171 Mon Sep 17 00:00:00 2001 > From: Michal Hocko > Date: Mon, 26 Nov 2012 11:47:57 +0100 > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked > > memcg oom killer might deadlock if the process which falls down to > mem_cgroup_handle_oom holds a lock which prevents other task to > terminate because it is blocked on the very same lock. > This can happen when a write system call needs to allocate a page but > the allocation hits the memcg hard limit and there is nothing to reclaim > (e.g. there is no swap or swap limit is hit as well and all cache pages > have been reclaimed already) and the process selected by memcg OOM > killer is blocked on i_mutex on the same inode (e.g. truncate it). > > Process A > [] do_truncate+0x58/0xa0 # takes i_mutex > [] do_last+0x250/0xa30 > [] path_openat+0xd7/0x440 > [] do_filp_open+0x49/0xa0 > [] do_sys_open+0x106/0x240 > [] sys_open+0x20/0x30 > [] system_call_fastpath+0x18/0x1d > [] 0xffffffffffffffff > > Process B > [] mem_cgroup_handle_oom+0x241/0x3b0 > [] T.1146+0x5ab/0x5c0 > [] mem_cgroup_cache_charge+0xbe/0xe0 > [] add_to_page_cache_locked+0x4c/0x140 > [] add_to_page_cache_lru+0x22/0x50 > [] grab_cache_page_write_begin+0x8b/0xe0 > [] ext3_write_begin+0x88/0x270 > [] generic_file_buffered_write+0x116/0x290 > [] __generic_file_aio_write+0x27c/0x480 > [] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex > [] do_sync_write+0xea/0x130 > [] vfs_write+0xf3/0x1f0 > [] sys_write+0x51/0x90 > [] system_call_fastpath+0x18/0x1d > [] 0xffffffffffffffff It looks like grab_cache_page_write_begin() passes __GFP_FS into __page_cache_alloc() and mem_cgroup_cache_charge(). Which makes me think that this deadlock is also possible in the page allocator even before getting to add_to_page_cache_lru. no? Can callers holding fs resources (e.g. i_mutex) pass __GFP_FS into the page allocator? 
If __GFP_FS was avoided, then I think memcg user page charging would need a !__GFP_FS check to avoid invoking oom killer, but at least then we'd avoid both deadlocks and cover both page allocation and memcg page charging in similar fashion. Example from memcg_charge_kmem: may_oom = (gfp & __GFP_FS) && !(gfp & __GFP_NORETRY); From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755304Ab3BERq7 (ORCPT ); Tue, 5 Feb 2013 12:46:59 -0500 Received: from mail-wi0-f180.google.com ([209.85.212.180]:46422 "EHLO mail-wi0-f180.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753782Ab3BERq6 (ORCPT ); Tue, 5 Feb 2013 12:46:58 -0500 Date: Tue, 5 Feb 2013 18:46:51 +0100 From: Michal Hocko To: Greg Thelen Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20130205174651.GA3959@dhcp22.suse.cz> References: <20121224142526.020165D3@pobox.sk> <20121228162209.GA1455@dhcp22.suse.cz> <20121230020947.AA002F34@pobox.sk> <20121230110815.GA12940@dhcp22.suse.cz> <20130125160723.FAE73567@pobox.sk> <20130125163130.GF4721@dhcp22.suse.cz> <20130205134937.GA22804@dhcp22.suse.cz> <20130205154947.CD6411E2@pobox.sk> <20130205160934.GB22804@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue 05-02-13 08:48:23, Greg Thelen wrote: > On Tue, Feb 05 2013, Michal Hocko wrote: > > > On Tue 05-02-13 15:49:47, azurIt wrote: > > [...] > >> Just to be sure - am i supposed to apply this two patches? > >> http://watchdog.sk/lkml/patches/ > > > > 5-memcg-fix-1.patch is not complete. 
It doesn't contain the folloup I > > mentioned in a follow up email. Here is the full patch: > > --- > > From f2bf8437d5b9bb38a95a432bf39f32c584955171 Mon Sep 17 00:00:00 2001 > > From: Michal Hocko > > Date: Mon, 26 Nov 2012 11:47:57 +0100 > > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked > > > > memcg oom killer might deadlock if the process which falls down to > > mem_cgroup_handle_oom holds a lock which prevents other task to > > terminate because it is blocked on the very same lock. > > This can happen when a write system call needs to allocate a page but > > the allocation hits the memcg hard limit and there is nothing to reclaim > > (e.g. there is no swap or swap limit is hit as well and all cache pages > > have been reclaimed already) and the process selected by memcg OOM > > killer is blocked on i_mutex on the same inode (e.g. truncate it). > > > > Process A > > [] do_truncate+0x58/0xa0 # takes i_mutex > > [] do_last+0x250/0xa30 > > [] path_openat+0xd7/0x440 > > [] do_filp_open+0x49/0xa0 > > [] do_sys_open+0x106/0x240 > > [] sys_open+0x20/0x30 > > [] system_call_fastpath+0x18/0x1d > > [] 0xffffffffffffffff > > > > Process B > > [] mem_cgroup_handle_oom+0x241/0x3b0 > > [] T.1146+0x5ab/0x5c0 > > [] mem_cgroup_cache_charge+0xbe/0xe0 > > [] add_to_page_cache_locked+0x4c/0x140 > > [] add_to_page_cache_lru+0x22/0x50 > > [] grab_cache_page_write_begin+0x8b/0xe0 > > [] ext3_write_begin+0x88/0x270 > > [] generic_file_buffered_write+0x116/0x290 > > [] __generic_file_aio_write+0x27c/0x480 > > [] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex > > [] do_sync_write+0xea/0x130 > > [] vfs_write+0xf3/0x1f0 > > [] sys_write+0x51/0x90 > > [] system_call_fastpath+0x18/0x1d > > [] 0xffffffffffffffff > > It looks like grab_cache_page_write_begin() passes __GFP_FS into > __page_cache_alloc() and mem_cgroup_cache_charge(). Which makes me > think that this deadlock is also possible in the page allocator even > before getting to add_to_page_cache_lru. 
no? I am not that familiar with VFS but i_mutex is a high level lock AFAIR and it shouldn't be called from the pageout path so __page_cache_alloc should be safe. -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756583Ab3BESQg (ORCPT ); Tue, 5 Feb 2013 13:16:36 -0500 Received: from mail-gg0-f202.google.com ([209.85.161.202]:46770 "EHLO mail-gg0-f202.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756354Ab3BESQb (ORCPT ); Tue, 5 Feb 2013 13:16:31 -0500 From: Greg Thelen To: Michal Hocko Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked References: <20121224142526.020165D3@pobox.sk> <20121228162209.GA1455@dhcp22.suse.cz> <20121230020947.AA002F34@pobox.sk> <20121230110815.GA12940@dhcp22.suse.cz> <20130125160723.FAE73567@pobox.sk> <20130125163130.GF4721@dhcp22.suse.cz> <20130205134937.GA22804@dhcp22.suse.cz> <20130205154947.CD6411E2@pobox.sk> <20130205160934.GB22804@dhcp22.suse.cz> <20130205174651.GA3959@dhcp22.suse.cz> Date: Tue, 05 Feb 2013 10:09:57 -0800 In-Reply-To: <20130205174651.GA3959@dhcp22.suse.cz> (Michal Hocko's message of "Tue, 5 Feb 2013 18:46:51 +0100") Message-ID: User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.3 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Feb 05 2013, Michal Hocko wrote: > On Tue 05-02-13 08:48:23, Greg Thelen wrote: >> On Tue, Feb 05 2013, Michal Hocko wrote: >> >> > On Tue 05-02-13 15:49:47, azurIt wrote: >> > [...] >> >> Just to be sure - am i supposed to apply this two patches? >> >> http://watchdog.sk/lkml/patches/ >> > >> > 5-memcg-fix-1.patch is not complete. 
It doesn't contain the folloup I >> > mentioned in a follow up email. Here is the full patch: >> > --- >> > From f2bf8437d5b9bb38a95a432bf39f32c584955171 Mon Sep 17 00:00:00 2001 >> > From: Michal Hocko >> > Date: Mon, 26 Nov 2012 11:47:57 +0100 >> > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked >> > >> > memcg oom killer might deadlock if the process which falls down to >> > mem_cgroup_handle_oom holds a lock which prevents other task to >> > terminate because it is blocked on the very same lock. >> > This can happen when a write system call needs to allocate a page but >> > the allocation hits the memcg hard limit and there is nothing to reclaim >> > (e.g. there is no swap or swap limit is hit as well and all cache pages >> > have been reclaimed already) and the process selected by memcg OOM >> > killer is blocked on i_mutex on the same inode (e.g. truncate it). >> > >> > Process A >> > [] do_truncate+0x58/0xa0 # takes i_mutex >> > [] do_last+0x250/0xa30 >> > [] path_openat+0xd7/0x440 >> > [] do_filp_open+0x49/0xa0 >> > [] do_sys_open+0x106/0x240 >> > [] sys_open+0x20/0x30 >> > [] system_call_fastpath+0x18/0x1d >> > [] 0xffffffffffffffff >> > >> > Process B >> > [] mem_cgroup_handle_oom+0x241/0x3b0 >> > [] T.1146+0x5ab/0x5c0 >> > [] mem_cgroup_cache_charge+0xbe/0xe0 >> > [] add_to_page_cache_locked+0x4c/0x140 >> > [] add_to_page_cache_lru+0x22/0x50 >> > [] grab_cache_page_write_begin+0x8b/0xe0 >> > [] ext3_write_begin+0x88/0x270 >> > [] generic_file_buffered_write+0x116/0x290 >> > [] __generic_file_aio_write+0x27c/0x480 >> > [] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex >> > [] do_sync_write+0xea/0x130 >> > [] vfs_write+0xf3/0x1f0 >> > [] sys_write+0x51/0x90 >> > [] system_call_fastpath+0x18/0x1d >> > [] 0xffffffffffffffff >> >> It looks like grab_cache_page_write_begin() passes __GFP_FS into >> __page_cache_alloc() and mem_cgroup_cache_charge(). 
Which makes me
>> think that this deadlock is also possible in the page allocator even
>> before getting to add_to_page_cache_lru. no?
>
> I am not that familiar with VFS but i_mutex is a high level lock AFAIR
> and it shouldn't be called from the pageout path so __page_cache_alloc
> should be safe.

I wasn't clear, sorry. My concern is not that pageout() grabs i_mutex. My concern is that __page_cache_alloc() will invoke the oom killer and select a victim which wants i_mutex. This victim will deadlock because the oom killer caller already holds i_mutex.

The wild accusation I am making is that anyone who invokes the oom killer and waits on the victim to die is essentially grabbing all of the locks that any of the oom killer victims may grab (e.g. i_mutex).

To avoid deadlock the oom killer can only be called while holding no locks that the oom victim demands. I think some locks are grabbed in a way that allows the lock request to fail if the task has a fatal signal pending, so they are safe. But any lock acquisitions that cannot fail (e.g. mutex_lock) will deadlock with the oom killing process. So the oom killing process cannot hold any such locks which the victim will attempt to grab. Hopefully I'm missing something.
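
The locking rule argued above can be written down as a small userspace model. This is only a sketch, not kernel code: struct model_task, the MODEL_* lock bits and oom_wait_can_deadlock() are hypothetical names made up for illustration, and each lock is reduced to one bit so that lock sets can be intersected.

```c
#include <stdbool.h>

/*
 * Userspace model only; none of these names exist in the kernel.
 * Each lock is one bit so that lock sets can be compared with '&'.
 */
enum model_lock {
	MODEL_I_MUTEX  = 1u << 0,
	MODEL_MMAP_SEM = 1u << 1,
};

struct model_task {
	/* Locks the task holds right now. */
	unsigned int held;
	/*
	 * Locks the task still has to take before it can exit, through
	 * acquisitions that do not bail out on a fatal signal.
	 */
	unsigned int needs_unkillable;
};

/*
 * In this model, a caller that invokes the OOM killer and then waits for
 * the victim to die can deadlock when it holds a lock that the victim
 * must still take through an acquisition that cannot fail.
 */
static bool oom_wait_can_deadlock(const struct model_task *caller,
				  const struct model_task *victim)
{
	return (caller->held & victim->needs_unkillable) != 0;
}
```

With a writer that holds the modelled i_mutex and a truncating victim that needs it, the predicate is true; for a victim that only needs an unrelated lock it is false. Locks taken in a killable way would simply not be listed in needs_unkillable.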
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756501Ab3BETAC (ORCPT ); Tue, 5 Feb 2013 14:00:02 -0500 Received: from mail-wi0-f169.google.com ([209.85.212.169]:63005 "EHLO mail-wi0-f169.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754933Ab3BES77 (ORCPT ); Tue, 5 Feb 2013 13:59:59 -0500 Date: Tue, 5 Feb 2013 19:59:53 +0100 From: Michal Hocko To: Greg Thelen Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20130205185953.GB3959@dhcp22.suse.cz> References: <20121230020947.AA002F34@pobox.sk> <20121230110815.GA12940@dhcp22.suse.cz> <20130125160723.FAE73567@pobox.sk> <20130125163130.GF4721@dhcp22.suse.cz> <20130205134937.GA22804@dhcp22.suse.cz> <20130205154947.CD6411E2@pobox.sk> <20130205160934.GB22804@dhcp22.suse.cz> <20130205174651.GA3959@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue 05-02-13 10:09:57, Greg Thelen wrote: > On Tue, Feb 05 2013, Michal Hocko wrote: > > > On Tue 05-02-13 08:48:23, Greg Thelen wrote: > >> On Tue, Feb 05 2013, Michal Hocko wrote: > >> > >> > On Tue 05-02-13 15:49:47, azurIt wrote: > >> > [...] > >> >> Just to be sure - am i supposed to apply this two patches? > >> >> http://watchdog.sk/lkml/patches/ > >> > > >> > 5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I > >> > mentioned in a follow up email. 
Here is the full patch: > >> > --- > >> > From f2bf8437d5b9bb38a95a432bf39f32c584955171 Mon Sep 17 00:00:00 2001 > >> > From: Michal Hocko > >> > Date: Mon, 26 Nov 2012 11:47:57 +0100 > >> > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked > >> > > >> > memcg oom killer might deadlock if the process which falls down to > >> > mem_cgroup_handle_oom holds a lock which prevents other task to > >> > terminate because it is blocked on the very same lock. > >> > This can happen when a write system call needs to allocate a page but > >> > the allocation hits the memcg hard limit and there is nothing to reclaim > >> > (e.g. there is no swap or swap limit is hit as well and all cache pages > >> > have been reclaimed already) and the process selected by memcg OOM > >> > killer is blocked on i_mutex on the same inode (e.g. truncate it). > >> > > >> > Process A > >> > [] do_truncate+0x58/0xa0 # takes i_mutex > >> > [] do_last+0x250/0xa30 > >> > [] path_openat+0xd7/0x440 > >> > [] do_filp_open+0x49/0xa0 > >> > [] do_sys_open+0x106/0x240 > >> > [] sys_open+0x20/0x30 > >> > [] system_call_fastpath+0x18/0x1d > >> > [] 0xffffffffffffffff > >> > > >> > Process B > >> > [] mem_cgroup_handle_oom+0x241/0x3b0 > >> > [] T.1146+0x5ab/0x5c0 > >> > [] mem_cgroup_cache_charge+0xbe/0xe0 > >> > [] add_to_page_cache_locked+0x4c/0x140 > >> > [] add_to_page_cache_lru+0x22/0x50 > >> > [] grab_cache_page_write_begin+0x8b/0xe0 > >> > [] ext3_write_begin+0x88/0x270 > >> > [] generic_file_buffered_write+0x116/0x290 > >> > [] __generic_file_aio_write+0x27c/0x480 > >> > [] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex > >> > [] do_sync_write+0xea/0x130 > >> > [] vfs_write+0xf3/0x1f0 > >> > [] sys_write+0x51/0x90 > >> > [] system_call_fastpath+0x18/0x1d > >> > [] 0xffffffffffffffff > >> > >> It looks like grab_cache_page_write_begin() passes __GFP_FS into > >> __page_cache_alloc() and mem_cgroup_cache_charge(). 
Which makes me > >> think that this deadlock is also possible in the page allocator even > >> before getting to add_to_page_cache_lru. no? > > > > I am not that familiar with VFS but i_mutex is a high level lock AFAIR > > and it shouldn't be called from the pageout path so __page_cache_alloc > > should be safe. > > I wasn't clear, sorry. My concern is not that pageout() grabs i_mutex. > My concern is that __page_cache_alloc() will invoke the oom killer and > select a victim which wants i_mutex. This victim will deadlock because > the oom killer caller already holds i_mutex. That would be true for the memcg oom because that one is blocking but the global oom just puts the allocator into sleep for a while and then the allocator should back off eventually (unless this is NOFAIL allocation). I would need to look closer whether this is really the case - I haven't seen that allocator code path for a while... > The wild accusation I am making is that anyone who invokes the oom > killer and waits on the victim to die is essentially grabbing all of > the locks that any of the oom killer victims may grab (e.g. i_mutex). True. > To avoid deadlock the oom killer can only be called is while holding > no locks that the oom victim demands. I think some locks are grabbed > in a way that allows the lock request to fail if the task has a fatal > signal pending, so they are safe. But any locks acquisitions that > cannot fail (e.g. mutex_lock) will deadlock with the oom killing > process. So the oom killing process cannot hold any such locks which > the victim will attempt to grab. Hopefully I'm missing something. Agreed. 
-- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756863Ab3BFBRc (ORCPT ); Tue, 5 Feb 2013 20:17:32 -0500 Received: from gmmr4.centrum.cz ([46.255.227.253]:43281 "EHLO gmmr4.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756133Ab3BFBRa (ORCPT ); Tue, 5 Feb 2013 20:17:30 -0500 To: =?utf-8?q?Michal_Hocko?= Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_from_add=5Fto=5Fpage=5Fcache=5Flocked?= Date: Wed, 06 Feb 2013 02:17:21 +0100 From: "azurIt" Cc: , , =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= References: <20121218152223.6912832C@pobox.sk>, <20121218152004.GA25208@dhcp22.suse.cz>, <20121224142526.020165D3@pobox.sk>, <20121228162209.GA1455@dhcp22.suse.cz>, <20121230020947.AA002F34@pobox.sk>, <20121230110815.GA12940@dhcp22.suse.cz>, <20130125160723.FAE73567@pobox.sk>, <20130125163130.GF4721@dhcp22.suse.cz>, <20130205134937.GA22804@dhcp22.suse.cz>, <20130205154947.CD6411E2@pobox.sk> <20130205160934.GB22804@dhcp22.suse.cz> In-Reply-To: <20130205160934.GB22804@dhcp22.suse.cz> X-Mailer: Centrum Email 5.3 X-Priority: 3 X-Original-From: azurit@pobox.sk MIME-Version: 1.0 Message-Id: <20130206021721.1AE9E3C7@pobox.sk> X-Maser: oho Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org >5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I >mentioned in a follow up email. 
Here is the full patch: Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]: http://www.watchdog.sk/lkml/oom_mysqld6 azur From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754241Ab3BFOB0 (ORCPT ); Wed, 6 Feb 2013 09:01:26 -0500 Received: from cantor2.suse.de ([195.135.220.15]:40923 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751010Ab3BFOBW (ORCPT ); Wed, 6 Feb 2013 09:01:22 -0500 Date: Wed, 6 Feb 2013 15:01:19 +0100 From: Michal Hocko To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20130206140119.GD10254@dhcp22.suse.cz> References: <20121224142526.020165D3@pobox.sk> <20121228162209.GA1455@dhcp22.suse.cz> <20121230020947.AA002F34@pobox.sk> <20121230110815.GA12940@dhcp22.suse.cz> <20130125160723.FAE73567@pobox.sk> <20130125163130.GF4721@dhcp22.suse.cz> <20130205134937.GA22804@dhcp22.suse.cz> <20130205154947.CD6411E2@pobox.sk> <20130205160934.GB22804@dhcp22.suse.cz> <20130206021721.1AE9E3C7@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130206021721.1AE9E3C7@pobox.sk> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed 06-02-13 02:17:21, azurIt wrote: > >5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I > >mentioned in a follow up email. Here is the full patch: > > > Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]: > http://www.watchdog.sk/lkml/oom_mysqld6 [...] 
WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610() Hardware name: S5000VSA gfp_mask:4304 nr_pages:1 oom:0 ret:2 Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 Call Trace: [] warn_slowpath_common+0x7a/0xb0 [] warn_slowpath_fmt+0x46/0x50 [] ? mem_cgroup_margin+0x73/0xa0 [] T.1149+0x2d9/0x610 [] ? blk_finish_plug+0x18/0x50 [] mem_cgroup_cache_charge+0xc4/0xf0 [] add_to_page_cache_locked+0x4f/0x140 [] add_to_page_cache_lru+0x22/0x50 [] filemap_fault+0x252/0x4f0 [] __do_fault+0x78/0x5a0 [] handle_pte_fault+0x84/0x940 [] ? vma_prio_tree_insert+0x30/0x50 [] ? vma_link+0x88/0xe0 [] handle_mm_fault+0x138/0x260 [] do_page_fault+0x13d/0x460 [] ? do_mmap_pgoff+0x3dc/0x430 [] page_fault+0x1f/0x30 ---[ end trace 8817670349022007 ]--- apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0 apache2 cpuset=uid mems_allowed=0 Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 Call Trace: [] dump_header+0x7e/0x1e0 [] ? find_lock_task_mm+0x2f/0x70 [] oom_kill_process+0x85/0x2a0 [] out_of_memory+0xe5/0x200 [] pagefault_out_of_memory+0xbd/0x110 [] mm_fault_error+0xb6/0x1a0 [] do_page_fault+0x3ee/0x460 [] ? do_mmap_pgoff+0x3dc/0x430 [] page_fault+0x1f/0x30 The first trace comes from the debugging WARN and it clearly points to a file fault path. __do_fault pre-charges a page in case we need to do CoW (copy-on-write) for the returned page. This one falls back to memcg OOM and never returns ENOMEM as I have mentioned earlier. However, the fs fault handler (filemap_fault here) can fallback to page_cache_read if the readahead (do_sync_mmap_readahead) fails to get page to the page cache. And we can see this happening in the first trace. page_cache_read then calls add_to_page_cache_lru and eventually gets to add_to_page_cache_locked which calls mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should happen. This ENOMEM gets to the fault handler and kaboom. So the fix is really much more complex than I thought. 
Although add_to_page_cache_locked sounded like a good place, it turned out not to be, in fact.

We need something more clever apparently. One way would be to stop misusing __GFP_NORETRY for GFP_MEMCG_NO_OOM and give it a real flag. We have 32 bits for those flags in gfp_t so there should be some room there. Or we could do this with a per-task flag, the same as we do for NO_IO in the current -mm tree. The latter one seems easier wrt. gfp_mask passing horror - e.g. __generic_file_aio_write doesn't pass flags and it can be called from unlocked contexts as well.

I have to think about it some more.
--
Michal Hocko
SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755053Ab3BFOW0 (ORCPT ); Wed, 6 Feb 2013 09:22:26 -0500
Received: from cantor2.suse.de ([195.135.220.15]:41929 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751294Ab3BFOWW (ORCPT ); Wed, 6 Feb 2013 09:22:22 -0500
Date: Wed, 6 Feb 2013 15:22:19 +0100
From: Michal Hocko
To: azurIt
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner
Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
Message-ID: <20130206142219.GF10254@dhcp22.suse.cz>
References: <20121228162209.GA1455@dhcp22.suse.cz> <20121230020947.AA002F34@pobox.sk> <20121230110815.GA12940@dhcp22.suse.cz> <20130125160723.FAE73567@pobox.sk> <20130125163130.GF4721@dhcp22.suse.cz> <20130205134937.GA22804@dhcp22.suse.cz> <20130205154947.CD6411E2@pobox.sk> <20130205160934.GB22804@dhcp22.suse.cz> <20130206021721.1AE9E3C7@pobox.sk> <20130206140119.GD10254@dhcp22.suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20130206140119.GD10254@dhcp22.suse.cz>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed 06-02-13 15:01:19, Michal
Hocko wrote: > On Wed 06-02-13 02:17:21, azurIt wrote: > > >5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I > > >mentioned in a follow up email. Here is the full patch: > > > > > > Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]: > > http://www.watchdog.sk/lkml/oom_mysqld6 > > [...] > WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610() > Hardware name: S5000VSA > gfp_mask:4304 nr_pages:1 oom:0 ret:2 > Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 > Call Trace: > [] warn_slowpath_common+0x7a/0xb0 > [] warn_slowpath_fmt+0x46/0x50 > [] ? mem_cgroup_margin+0x73/0xa0 > [] T.1149+0x2d9/0x610 > [] ? blk_finish_plug+0x18/0x50 > [] mem_cgroup_cache_charge+0xc4/0xf0 > [] add_to_page_cache_locked+0x4f/0x140 > [] add_to_page_cache_lru+0x22/0x50 > [] filemap_fault+0x252/0x4f0 > [] __do_fault+0x78/0x5a0 > [] handle_pte_fault+0x84/0x940 > [] ? vma_prio_tree_insert+0x30/0x50 > [] ? vma_link+0x88/0xe0 > [] handle_mm_fault+0x138/0x260 > [] do_page_fault+0x13d/0x460 > [] ? do_mmap_pgoff+0x3dc/0x430 > [] page_fault+0x1f/0x30 > ---[ end trace 8817670349022007 ]--- > apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0 > apache2 cpuset=uid mems_allowed=0 > Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 > Call Trace: > [] dump_header+0x7e/0x1e0 > [] ? find_lock_task_mm+0x2f/0x70 > [] oom_kill_process+0x85/0x2a0 > [] out_of_memory+0xe5/0x200 > [] pagefault_out_of_memory+0xbd/0x110 > [] mm_fault_error+0xb6/0x1a0 > [] do_page_fault+0x3ee/0x460 > [] ? do_mmap_pgoff+0x3dc/0x430 > [] page_fault+0x1f/0x30 > > The first trace comes from the debugging WARN and it clearly points to > a file fault path. __do_fault pre-charges a page in case we need to > do CoW (copy-on-write) for the returned page. This one falls back to > memcg OOM and never returns ENOMEM as I have mentioned earlier. 
> However, the fs fault handler (filemap_fault here) can fallback to
> page_cache_read if the readahead (do_sync_mmap_readahead) fails
> to get page to the page cache. And we can see this happening in
> the first trace. page_cache_read then calls add_to_page_cache_lru
> and eventually gets to add_to_page_cache_locked which calls
> mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should
> happen. This ENOMEM gets to the fault handler and kaboom.
>
> So the fix is really much more complex than I thought. Although
> add_to_page_cache_locked sounded like a good place, it turned out not
> to be, in fact.
>
> We need something more clever apparently. One way would be to stop misusing
> __GFP_NORETRY for GFP_MEMCG_NO_OOM and give it a real flag. We have 32
> bits for those flags in gfp_t so there should be some room there.
> Or we could do this with a per-task flag, the same as we do for NO_IO in
> the current -mm tree.
> The latter one seems easier wrt. gfp_mask passing horror - e.g.
> __generic_file_aio_write doesn't pass flags and it can be called from
> unlocked contexts as well.

Ouch, the PF_ flag space seems to be drained already, because task_struct::flags is just an unsigned int, so there is only one bit left. I am not sure this is the best use for it. This will be a real pain!
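
For illustration, the per-task flag idea discussed above could look roughly like the following userspace sketch. It is not kernel code and makes several assumptions: struct model_task stands in for task_struct, the MODEL_PF_NO_MEMCG_OOM name and its bit value are placeholders rather than an agreed PF_ bit, and the enter/exit helpers are hypothetical. The point is that a charge path can consult the task instead of a gfp flag, and that saving the old bit keeps nested sections from clearing an outer caller's mark.

```c
#include <stdbool.h>

/* Placeholder bit for this model; a real PF_ bit would have to be free. */
#define MODEL_PF_NO_MEMCG_OOM 0x00080000u

struct model_task {
	unsigned int flags; /* stands in for task_struct::flags */
};

/*
 * Mark the task as "memcg OOM not allowed" and return the previous state
 * of the bit so that the matching exit can restore it.
 */
static unsigned int model_no_oom_enter(struct model_task *t)
{
	unsigned int old = t->flags & MODEL_PF_NO_MEMCG_OOM;

	t->flags |= MODEL_PF_NO_MEMCG_OOM;
	return old;
}

/* Restore the bit to whatever it was before the matching enter. */
static void model_no_oom_exit(struct model_task *t, unsigned int old)
{
	t->flags = (t->flags & ~MODEL_PF_NO_MEMCG_OOM) | old;
}

/* What a charge path would check before falling back to the OOM killer. */
static bool model_may_oom(const struct model_task *t)
{
	return !(t->flags & MODEL_PF_NO_MEMCG_OOM);
}
```

A locked caller would wrap the allocation-heavy section in model_no_oom_enter() and model_no_oom_exit(); unlocked callers of the same function never set the bit and keep the normal OOM behaviour.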
-- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757518Ab3BFQBA (ORCPT ); Wed, 6 Feb 2013 11:01:00 -0500 Received: from cantor2.suse.de ([195.135.220.15]:48049 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755083Ab3BFQAy (ORCPT ); Wed, 6 Feb 2013 11:00:54 -0500 Date: Wed, 6 Feb 2013 17:00:51 +0100 From: Michal Hocko To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner Subject: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set Message-ID: <20130206160051.GG10254@dhcp22.suse.cz> References: <20121230020947.AA002F34@pobox.sk> <20121230110815.GA12940@dhcp22.suse.cz> <20130125160723.FAE73567@pobox.sk> <20130125163130.GF4721@dhcp22.suse.cz> <20130205134937.GA22804@dhcp22.suse.cz> <20130205154947.CD6411E2@pobox.sk> <20130205160934.GB22804@dhcp22.suse.cz> <20130206021721.1AE9E3C7@pobox.sk> <20130206140119.GD10254@dhcp22.suse.cz> <20130206142219.GF10254@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130206142219.GF10254@dhcp22.suse.cz> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed 06-02-13 15:22:19, Michal Hocko wrote: > On Wed 06-02-13 15:01:19, Michal Hocko wrote: > > On Wed 06-02-13 02:17:21, azurIt wrote: > > > >5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I > > > >mentioned in a follow up email. Here is the full patch: > > > > > > > > > Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]: > > > http://www.watchdog.sk/lkml/oom_mysqld6 > > > > [...] 
> > WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610() > > Hardware name: S5000VSA > > gfp_mask:4304 nr_pages:1 oom:0 ret:2 > > Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 > > Call Trace: > > [] warn_slowpath_common+0x7a/0xb0 > > [] warn_slowpath_fmt+0x46/0x50 > > [] ? mem_cgroup_margin+0x73/0xa0 > > [] T.1149+0x2d9/0x610 > > [] ? blk_finish_plug+0x18/0x50 > > [] mem_cgroup_cache_charge+0xc4/0xf0 > > [] add_to_page_cache_locked+0x4f/0x140 > > [] add_to_page_cache_lru+0x22/0x50 > > [] filemap_fault+0x252/0x4f0 > > [] __do_fault+0x78/0x5a0 > > [] handle_pte_fault+0x84/0x940 > > [] ? vma_prio_tree_insert+0x30/0x50 > > [] ? vma_link+0x88/0xe0 > > [] handle_mm_fault+0x138/0x260 > > [] do_page_fault+0x13d/0x460 > > [] ? do_mmap_pgoff+0x3dc/0x430 > > [] page_fault+0x1f/0x30 > > ---[ end trace 8817670349022007 ]--- > > apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0 > > apache2 cpuset=uid mems_allowed=0 > > Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 > > Call Trace: > > [] dump_header+0x7e/0x1e0 > > [] ? find_lock_task_mm+0x2f/0x70 > > [] oom_kill_process+0x85/0x2a0 > > [] out_of_memory+0xe5/0x200 > > [] pagefault_out_of_memory+0xbd/0x110 > > [] mm_fault_error+0xb6/0x1a0 > > [] do_page_fault+0x3ee/0x460 > > [] ? do_mmap_pgoff+0x3dc/0x430 > > [] page_fault+0x1f/0x30 > > > > The first trace comes from the debugging WARN and it clearly points to > > a file fault path. __do_fault pre-charges a page in case we need to > > do CoW (copy-on-write) for the returned page. This one falls back to > > memcg OOM and never returns ENOMEM as I have mentioned earlier. > > However, the fs fault handler (filemap_fault here) can fallback to > > page_cache_read if the readahead (do_sync_mmap_readahead) fails > > to get page to the page cache. And we can see this happening in > > the first trace. 
page_cache_read then calls add_to_page_cache_lru
> > and eventually gets to add_to_page_cache_locked which calls
> > mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should
> > happen. This ENOMEM gets to the fault handler and kaboom.
> >
> > So the fix is really much more complex than I thought. Although
> > add_to_page_cache_locked sounded like a good place, it turned out not
> > to be, in fact.
> >
> > We need something more clever apparently. One way would be to stop misusing
> > __GFP_NORETRY for GFP_MEMCG_NO_OOM and give it a real flag. We have 32
> > bits for those flags in gfp_t so there should be some room there.
> > Or we could do this with a per-task flag, the same as we do for NO_IO in
> > the current -mm tree.
> > The latter one seems easier wrt. gfp_mask passing horror - e.g.
> > __generic_file_aio_write doesn't pass flags and it can be called from
> > unlocked contexts as well.
>
> Ouch, PF_ flags space seem to be drained already because
> task_struct::flags is just unsigned int so there is just one bit left. I
> am not sure this is the best use for it. This will be a real pain!

OK, so this is something that should help you without any risk of false OOMs. I do not believe that something like that would be accepted upstream because it is really heavy. We will need to come up with something more clever for upstream.

I have also added a warning which will trigger when the charge fails. If you see too many of those messages then there is something bad going on and the lack of OOM causes userspace to loop without making any progress.

So there you go - your personal patch ;) You can drop all other patches. Please note I have only compile-tested it.
But it should be pretty trivial to check it is correct --- >>From 6f155187f77c971b45caf05dbc80ca9c20bc278c Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Wed, 6 Feb 2013 16:45:07 +0100 Subject: [PATCH 1/2] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set memcg oom killer might deadlock if the process which falls down to mem_cgroup_handle_oom holds a lock which prevents other task to terminate because it is blocked on the very same lock. This can happen when a write system call needs to allocate a page but the allocation hits the memcg hard limit and there is nothing to reclaim (e.g. there is no swap or swap limit is hit as well and all cache pages have been reclaimed already) and the process selected by memcg OOM killer is blocked on i_mutex on the same inode (e.g. truncate it). Process A [] do_truncate+0x58/0xa0 # takes i_mutex [] do_last+0x250/0xa30 [] path_openat+0xd7/0x440 [] do_filp_open+0x49/0xa0 [] do_sys_open+0x106/0x240 [] sys_open+0x20/0x30 [] system_call_fastpath+0x18/0x1d [] 0xffffffffffffffff Process B [] mem_cgroup_handle_oom+0x241/0x3b0 [] T.1146+0x5ab/0x5c0 [] mem_cgroup_cache_charge+0xbe/0xe0 [] add_to_page_cache_locked+0x4c/0x140 [] add_to_page_cache_lru+0x22/0x50 [] grab_cache_page_write_begin+0x8b/0xe0 [] ext3_write_begin+0x88/0x270 [] generic_file_buffered_write+0x116/0x290 [] __generic_file_aio_write+0x27c/0x480 [] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex [] do_sync_write+0xea/0x130 [] vfs_write+0xf3/0x1f0 [] sys_write+0x51/0x90 [] system_call_fastpath+0x18/0x1d [] 0xffffffffffffffff This is not a hard deadlock though because administrator can still intervene and increase the limit on the group which helps the writer to finish the allocation and release the lock. This patch heals the problem by forbidding OOM from dangerous context. Memcg charging code has no way to find out whether it is called from a locked context we have to help it via process flags. 
PF_OOM_ORIGIN flag removed recently will be reused for PF_NO_MEMCG_OOM which signals that the memcg OOM killer could lead to a deadlock. Only locked callers of __generic_file_aio_write are currently marked. I am pretty sure there are more places (I didn't check shmem and hugetlb uses fancy instantion mutex during page fault and filesystems might use some locks during the write) but I've ignored those as this will probably be just a user specific patch without any way to get upstream in the current form. Reported-by: azurIt Signed-off-by: Michal Hocko --- drivers/staging/pohmelfs/inode.c | 2 ++ include/linux/sched.h | 1 + mm/filemap.c | 2 ++ mm/memcontrol.c | 18 ++++++++++++++---- 4 files changed, 19 insertions(+), 4 deletions(-) diff --git a/drivers/staging/pohmelfs/inode.c b/drivers/staging/pohmelfs/inode.c index 7a19555..523de82e 100644 --- a/drivers/staging/pohmelfs/inode.c +++ b/drivers/staging/pohmelfs/inode.c @@ -921,7 +921,9 @@ ssize_t pohmelfs_write(struct file *file, const char __user *buf, if (ret) goto err_out_unlock; + current->flags |= PF_NO_MEMCG_OOM; ret = __generic_file_aio_write(&kiocb, &iov, 1, &kiocb.ki_pos); + current->flags &= ~PF_NO_MEMCG_OOM; *ppos = kiocb.ki_pos; mutex_unlock(&inode->i_mutex); diff --git a/include/linux/sched.h b/include/linux/sched.h index 1e86bb4..f275c8f 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1781,6 +1781,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t * #define PF_FROZEN 0x00010000 /* frozen for system suspend */ #define PF_FSTRANS 0x00020000 /* inside a filesystem transaction */ #define PF_KSWAPD 0x00040000 /* I am kswapd */ +#define PF_NO_MEMCG_OOM 0x00080000 /* Memcg OOM could lead to a deadlock */ #define PF_LESS_THROTTLE 0x00100000 /* Throttle me less: I clean memory */ #define PF_KTHREAD 0x00200000 /* I am a kernel thread */ #define PF_RANDOMIZE 0x00400000 /* randomize virtual address space */ diff --git a/mm/filemap.c b/mm/filemap.c index 
556858c..58a316b 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2617,7 +2617,9 @@ ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov, mutex_lock(&inode->i_mutex); blk_start_plug(&plug); + current->flags |= PF_NO_MEMCG_OOM; ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos); + current->flags &= ~PF_NO_MEMCG_OOM; mutex_unlock(&inode->i_mutex); if (ret > 0 || ret == -EIOCBQUEUED) { diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c8425b1..128b615 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2397,6 +2397,14 @@ done: return 0; nomem: *ptr = NULL; + if (printk_ratelimit()) + printk(KERN_WARNING"%s: task:%s pid:%d got ENOMEM without OOM for memcg:%p." + " If this message shows up very often for the" + " same task then there is a risk that the" + " process is not able to make any progress" + " because of the current limit. Try to enlarge" + " the hard limit.\n", __FUNCTION__, + current->comm, current->pid, memcg); return -ENOMEM; bypass: *ptr = NULL; @@ -2703,7 +2711,7 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm, struct mem_cgroup *memcg = NULL; unsigned int nr_pages = 1; struct page_cgroup *pc; - bool oom = true; + bool oom = !(current->flags & PF_NO_MEMCG_OOM); int ret; if (PageTransHuge(page)) { @@ -2770,6 +2778,7 @@ __mem_cgroup_commit_charge_lrucare(struct page *page, struct mem_cgroup *memcg, int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask) { + bool oom = !(current->flags & PF_NO_MEMCG_OOM); struct mem_cgroup *memcg = NULL; int ret; @@ -2782,7 +2791,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, mm = &init_mm; if (page_is_file_cache(page)) { - ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, &memcg, true); + ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, &memcg, oom); if (ret || !memcg) return ret; @@ -2818,6 +2827,7 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page, gfp_t mask, struct 
mem_cgroup **ptr) { + bool oom = !(current->flags & PF_NO_MEMCG_OOM); struct mem_cgroup *memcg; int ret; @@ -2840,13 +2850,13 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm, if (!memcg) goto charge_cur_mm; *ptr = memcg; - ret = __mem_cgroup_try_charge(NULL, mask, 1, ptr, true); + ret = __mem_cgroup_try_charge(NULL, mask, 1, ptr, oom); css_put(&memcg->css); return ret; charge_cur_mm: if (unlikely(!mm)) mm = &init_mm; - return __mem_cgroup_try_charge(mm, mask, 1, ptr, true); + return __mem_cgroup_try_charge(mm, mask, 1, ptr, oom); } static void -- 1.7.10.4 -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758143Ab3BGLCK (ORCPT ); Thu, 7 Feb 2013 06:02:10 -0500 Received: from fgwmail5.fujitsu.co.jp ([192.51.44.35]:60312 "EHLO fgwmail5.fujitsu.co.jp" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755653Ab3BGLCH (ORCPT ); Thu, 7 Feb 2013 06:02:07 -0500 X-SecurityPolicyCheck: OK by SHieldMailChecker v1.8.4 Message-ID: <51138999.3090006@jp.fujitsu.com> Date: Thu, 07 Feb 2013 20:01:45 +0900 From: Kamezawa Hiroyuki User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:17.0) Gecko/20130107 Thunderbird/17.0.2 MIME-Version: 1.0 To: Michal Hocko CC: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , Johannes Weiner Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked References: <20121224142526.020165D3@pobox.sk> <20121228162209.GA1455@dhcp22.suse.cz> <20121230020947.AA002F34@pobox.sk> <20121230110815.GA12940@dhcp22.suse.cz> <20130125160723.FAE73567@pobox.sk> <20130125163130.GF4721@dhcp22.suse.cz> <20130205134937.GA22804@dhcp22.suse.cz> <20130205154947.CD6411E2@pobox.sk> <20130205160934.GB22804@dhcp22.suse.cz> <20130206021721.1AE9E3C7@pobox.sk> <20130206140119.GD10254@dhcp22.suse.cz> In-Reply-To: <20130206140119.GD10254@dhcp22.suse.cz> Content-Type: text/plain; charset=ISO-8859-1; format=flowed 
Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org (2013/02/06 23:01), Michal Hocko wrote: > On Wed 06-02-13 02:17:21, azurIt wrote: >>> 5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I >>> mentioned in a follow up email. Here is the full patch: >> >> >> Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]: >> http://www.watchdog.sk/lkml/oom_mysqld6 > > [...] > WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610() > Hardware name: S5000VSA > gfp_mask:4304 nr_pages:1 oom:0 ret:2 > Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 > Call Trace: > [] warn_slowpath_common+0x7a/0xb0 > [] warn_slowpath_fmt+0x46/0x50 > [] ? mem_cgroup_margin+0x73/0xa0 > [] T.1149+0x2d9/0x610 > [] ? blk_finish_plug+0x18/0x50 > [] mem_cgroup_cache_charge+0xc4/0xf0 > [] add_to_page_cache_locked+0x4f/0x140 > [] add_to_page_cache_lru+0x22/0x50 > [] filemap_fault+0x252/0x4f0 > [] __do_fault+0x78/0x5a0 > [] handle_pte_fault+0x84/0x940 > [] ? vma_prio_tree_insert+0x30/0x50 > [] ? vma_link+0x88/0xe0 > [] handle_mm_fault+0x138/0x260 > [] do_page_fault+0x13d/0x460 > [] ? do_mmap_pgoff+0x3dc/0x430 > [] page_fault+0x1f/0x30 > ---[ end trace 8817670349022007 ]--- > apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0 > apache2 cpuset=uid mems_allowed=0 > Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 > Call Trace: > [] dump_header+0x7e/0x1e0 > [] ? find_lock_task_mm+0x2f/0x70 > [] oom_kill_process+0x85/0x2a0 > [] out_of_memory+0xe5/0x200 > [] pagefault_out_of_memory+0xbd/0x110 > [] mm_fault_error+0xb6/0x1a0 > [] do_page_fault+0x3ee/0x460 > [] ? do_mmap_pgoff+0x3dc/0x430 > [] page_fault+0x1f/0x30 > > The first trace comes from the debugging WARN and it clearly points to > a file fault path. __do_fault pre-charges a page in case we need to > do CoW (copy-on-write) for the returned page. 
This one falls back to > memcg OOM and never returns ENOMEM as I have mentioned earlier. > However, the fs fault handler (filemap_fault here) can fallback to > page_cache_read if the readahead (do_sync_mmap_readahead) fails > to get page to the page cache. And we can see this happening in > the first trace. page_cache_read then calls add_to_page_cache_lru > and eventually gets to add_to_page_cache_locked which calls > mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should > happen. This ENOMEM gets to the fault handler and kaboom. > Hmm. do we need to increase the "limit" virtually at memcg oom until the oom-killed process dies ? It may be doable by increasing stock->cache of each cpu....I think kernel can offer extra virtual charge up to oom-killed process's memory usage..... Thanks, -Kame From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758384Ab3BGMbu (ORCPT ); Thu, 7 Feb 2013 07:31:50 -0500 Received: from cantor2.suse.de ([195.135.220.15]:55932 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755706Ab3BGMbt (ORCPT ); Thu, 7 Feb 2013 07:31:49 -0500 Date: Thu, 7 Feb 2013 13:31:40 +0100 From: Michal Hocko To: Kamezawa Hiroyuki Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , Johannes Weiner Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20130207123140.GA15820@dhcp22.suse.cz> References: <20121230020947.AA002F34@pobox.sk> <20121230110815.GA12940@dhcp22.suse.cz> <20130125160723.FAE73567@pobox.sk> <20130125163130.GF4721@dhcp22.suse.cz> <20130205134937.GA22804@dhcp22.suse.cz> <20130205154947.CD6411E2@pobox.sk> <20130205160934.GB22804@dhcp22.suse.cz> <20130206021721.1AE9E3C7@pobox.sk> <20130206140119.GD10254@dhcp22.suse.cz> <51138999.3090006@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: 
<51138999.3090006@jp.fujitsu.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu 07-02-13 20:01:45, KAMEZAWA Hiroyuki wrote: > (2013/02/06 23:01), Michal Hocko wrote: > >On Wed 06-02-13 02:17:21, azurIt wrote: > >>>5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I > >>>mentioned in a follow up email. Here is the full patch: > >> > >> > >>Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]: > >>http://www.watchdog.sk/lkml/oom_mysqld6 > > > >[...] > >WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610() > >Hardware name: S5000VSA > >gfp_mask:4304 nr_pages:1 oom:0 ret:2 > >Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 > >Call Trace: > > [] warn_slowpath_common+0x7a/0xb0 > > [] warn_slowpath_fmt+0x46/0x50 > > [] ? mem_cgroup_margin+0x73/0xa0 > > [] T.1149+0x2d9/0x610 > > [] ? blk_finish_plug+0x18/0x50 > > [] mem_cgroup_cache_charge+0xc4/0xf0 > > [] add_to_page_cache_locked+0x4f/0x140 > > [] add_to_page_cache_lru+0x22/0x50 > > [] filemap_fault+0x252/0x4f0 > > [] __do_fault+0x78/0x5a0 > > [] handle_pte_fault+0x84/0x940 > > [] ? vma_prio_tree_insert+0x30/0x50 > > [] ? vma_link+0x88/0xe0 > > [] handle_mm_fault+0x138/0x260 > > [] do_page_fault+0x13d/0x460 > > [] ? do_mmap_pgoff+0x3dc/0x430 > > [] page_fault+0x1f/0x30 > >---[ end trace 8817670349022007 ]--- > >apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0 > >apache2 cpuset=uid mems_allowed=0 > >Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 > >Call Trace: > > [] dump_header+0x7e/0x1e0 > > [] ? find_lock_task_mm+0x2f/0x70 > > [] oom_kill_process+0x85/0x2a0 > > [] out_of_memory+0xe5/0x200 > > [] pagefault_out_of_memory+0xbd/0x110 > > [] mm_fault_error+0xb6/0x1a0 > > [] do_page_fault+0x3ee/0x460 > > [] ? 
do_mmap_pgoff+0x3dc/0x430 > > [] page_fault+0x1f/0x30 > > > >The first trace comes from the debugging WARN and it clearly points to > >a file fault path. __do_fault pre-charges a page in case we need to > >do CoW (copy-on-write) for the returned page. This one falls back to > >memcg OOM and never returns ENOMEM as I have mentioned earlier. > >However, the fs fault handler (filemap_fault here) can fallback to > >page_cache_read if the readahead (do_sync_mmap_readahead) fails > >to get page to the page cache. And we can see this happening in > >the first trace. page_cache_read then calls add_to_page_cache_lru > >and eventually gets to add_to_page_cache_locked which calls > >mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should > >happen. This ENOMEM gets to the fault handler and kaboom. > > > > Hmm. do we need to increase the "limit" virtually at memcg oom until > the oom-killed process dies ? It may be doable by increasing stock->cache > of each cpu....I think kernel can offer extra virtual charge up to > oom-killed process's memory usage..... If we can guarantee that the overflow charges do not exceed the memory usage of the killed process then this would work. The question is how we find out how much we can overflow. move_charge_at_immigrate will play some role, as will the amount of shared memory. I am afraid this would get too complex. Nevertheless the idea is nice.
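The condition discussed above (overflow charges must never exceed the memory usage of the killed process) can be illustrated with a small userspace model. This is only a sketch under stated assumptions: struct counter_model and the model_* helpers are hypothetical names that do not exist in the kernel, the victim's usage is assumed to be known exactly, and shared memory and charge moving (the complications mentioned above) are deliberately not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of a limited counter; not kernel code. */
struct counter_model {
	size_t usage;     /* bytes currently charged to the group */
	size_t limit;     /* hard limit in bytes */
	size_t allowance; /* temporary overflow budget while a victim is dying */
};

/* A victim was picked: allow overflow up to the memory it will free. */
static void model_oom_victim_selected(struct counter_model *c,
				      size_t victim_usage)
{
	/* Assumption: the victim's usage is part of the group's usage. */
	assert(victim_usage <= c->usage);
	c->allowance = victim_usage;
}

/* Charge succeeds only while usage stays within limit + allowance. */
static bool model_charge(struct counter_model *c, size_t bytes)
{
	size_t ceiling = c->limit + c->allowance;

	if (c->usage > ceiling || bytes > ceiling - c->usage)
		return false;
	c->usage += bytes;
	return true;
}

/* The victim is gone: its memory is uncharged and the budget is revoked. */
static void model_victim_exited(struct counter_model *c)
{
	assert(c->allowance <= c->usage);
	c->usage -= c->allowance;
	c->allowance = 0;
}
```

In this model usage can never end up above the limit once the victim has exited, because the overflow was capped by exactly the amount that the exit gives back. The hard part raised above, knowing that amount in a real kernel, is assumed away here.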
-- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757301Ab3BHBkx (ORCPT ); Thu, 7 Feb 2013 20:40:53 -0500 Received: from fgwmail6.fujitsu.co.jp ([192.51.44.36]:55107 "EHLO fgwmail6.fujitsu.co.jp" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753294Ab3BHBkw (ORCPT ); Thu, 7 Feb 2013 20:40:52 -0500 X-SecurityPolicyCheck: OK by SHieldMailChecker v1.8.4 Message-ID: <5114577D.70608@jp.fujitsu.com> Date: Fri, 08 Feb 2013 10:40:13 +0900 From: Kamezawa Hiroyuki User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:17.0) Gecko/20130107 Thunderbird/17.0.2 MIME-Version: 1.0 To: Michal Hocko CC: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , Johannes Weiner Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked References: <20121224142526.020165D3@pobox.sk> <20121228162209.GA1455@dhcp22.suse.cz> <20121230020947.AA002F34@pobox.sk> <20121230110815.GA12940@dhcp22.suse.cz> <20130125160723.FAE73567@pobox.sk> <20130125163130.GF4721@dhcp22.suse.cz> <20130205134937.GA22804@dhcp22.suse.cz> <20130205154947.CD6411E2@pobox.sk> <20130205160934.GB22804@dhcp22.suse.cz> <20130206021721.1AE9E3C7@pobox.sk> <20130206140119.GD10254@dhcp22.suse.cz> <51138999.3090006@jp.fujitsu.com> In-Reply-To: <51138999.3090006@jp.fujitsu.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org (2013/02/07 20:01), Kamezawa Hiroyuki wrote: > (2013/02/06 23:01), Michal Hocko wrote: >> On Wed 06-02-13 02:17:21, azurIt wrote: >>>> 5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I >>>> mentioned in a follow up email. Here is the full patch: >>> >>> >>> Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]: >>> http://www.watchdog.sk/lkml/oom_mysqld6 >> >> [...] 
>> WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610() >> Hardware name: S5000VSA >> gfp_mask:4304 nr_pages:1 oom:0 ret:2 >> Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 >> Call Trace: >> [] warn_slowpath_common+0x7a/0xb0 >> [] warn_slowpath_fmt+0x46/0x50 >> [] ? mem_cgroup_margin+0x73/0xa0 >> [] T.1149+0x2d9/0x610 >> [] ? blk_finish_plug+0x18/0x50 >> [] mem_cgroup_cache_charge+0xc4/0xf0 >> [] add_to_page_cache_locked+0x4f/0x140 >> [] add_to_page_cache_lru+0x22/0x50 >> [] filemap_fault+0x252/0x4f0 >> [] __do_fault+0x78/0x5a0 >> [] handle_pte_fault+0x84/0x940 >> [] ? vma_prio_tree_insert+0x30/0x50 >> [] ? vma_link+0x88/0xe0 >> [] handle_mm_fault+0x138/0x260 >> [] do_page_fault+0x13d/0x460 >> [] ? do_mmap_pgoff+0x3dc/0x430 >> [] page_fault+0x1f/0x30 >> ---[ end trace 8817670349022007 ]--- >> apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0 >> apache2 cpuset=uid mems_allowed=0 >> Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 >> Call Trace: >> [] dump_header+0x7e/0x1e0 >> [] ? find_lock_task_mm+0x2f/0x70 >> [] oom_kill_process+0x85/0x2a0 >> [] out_of_memory+0xe5/0x200 >> [] pagefault_out_of_memory+0xbd/0x110 >> [] mm_fault_error+0xb6/0x1a0 >> [] do_page_fault+0x3ee/0x460 >> [] ? do_mmap_pgoff+0x3dc/0x430 >> [] page_fault+0x1f/0x30 >> >> The first trace comes from the debugging WARN and it clearly points to >> a file fault path. __do_fault pre-charges a page in case we need to >> do CoW (copy-on-write) for the returned page. This one falls back to >> memcg OOM and never returns ENOMEM as I have mentioned earlier. >> However, the fs fault handler (filemap_fault here) can fallback to >> page_cache_read if the readahead (do_sync_mmap_readahead) fails >> to get page to the page cache. And we can see this happening in >> the first trace. page_cache_read then calls add_to_page_cache_lru >> and eventually gets to add_to_page_cache_locked which calls >> mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should >> happen. 
This ENOMEM gets to the fault handler and kaboom. >> > > Hmm. do we need to increase the "limit" virtually at memcg oom until > the oom-killed process dies ? Here is my naive idea... == From 1a46318cf89e7df94bd4844f29105b61dacf335b Mon Sep 17 00:00:00 2001 From: KAMEZAWA Hiroyuki Date: Fri, 8 Feb 2013 10:43:52 +0900 Subject: [PATCH] [Don't Apply][PATCH] memcg relax resource at OOM situation. When an OOM happens, a task is killed and resources will be freed. A problem here is that a task, which is oom-killed, may wait for some other resource in which memory resource is required. Some thread waits for free memory may holds some mutex and oom-killed process wait for the mutex. To avoid this, relaxing charged memory by giving virtual resource can be a help. The system can get back it at uncharge(). This is a sample native implementation. Signed-off-by: KAMEZAWA Hiroyuki --- mm/memcontrol.c | 79 ++++++++++++++++++++++++++++++++++++++++++++++++++----- 1 file changed, 73 insertions(+), 6 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 25ac5f4..4dea49a 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -301,6 +301,9 @@ struct mem_cgroup { /* set when res.limit == memsw.limit */ bool memsw_is_minimum; + /* extra resource at emergency situation */ + unsigned long loan; + spinlock_t loan_lock; /* protect arrays of thresholds */ struct mutex thresholds_lock; @@ -2034,6 +2037,61 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg, mem_cgroup_iter_break(root_memcg, victim); return total; } +/* + * When a memcg is in OOM situation, this lack of resource may cause deadlock + * because of complicated lock dependency(i_mutex...). To avoid that, we + * need extra resource or avoid charging. + * + * A memcg can request resource in an emergency state. We call it as loan. + * A memcg will return a loan when it does uncharge resource. We disallow + * double-loan and moving task to other groups until the loan is fully + * returned. 
+ * + * Note: the problem here is that we cannot know what amount resouce should + * be necessary to exiting an emergency state..... + */ +#define LOAN_MAX (2 * 1024 * 1024) + +static void mem_cgroup_make_loan(struct mem_cgroup *memcg) +{ + u64 usage; + unsigned long amount; + + amount = LOAN_MAX; + + usage = res_counter_read_u64(&memcg->res, RES_USAGE); + if (amount > usage /2 ) + amount = usage / 2; + spin_lock(&memcg->loan_lock); + if (memcg->loan) { + spin_unlock(&memcg->loan_lock); + return; + } + memcg->loan = amount; + res_counter_uncharge(&memcg->res, amount); + if (do_swap_account) + res_counter_uncharge(&memcg->memsw, amount); + spin_unlock(&memcg->loan_lock); +} + +/* return amount of free resource which can be uncharged */ +static unsigned long +mem_cgroup_may_return_loan(struct mem_cgroup *memcg, unsigned long val) +{ + unsigned long tmp; + /* we don't care small race here */ + if (unlikely(!memcg->loan)) + return val; + spin_lock(&memcg->loan_lock); + if (memcg->loan) { + tmp = min(memcg->loan, val); + memcg->loan -= tmp; + val -= tmp; + } + spin_unlock(&memcg->loan_lock); + return val; +} + /* * Check OOM-Killer is already running under our hierarchy. 
@@ -2182,6 +2240,7 @@ static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask, if (need_to_kill) { finish_wait(&memcg_oom_waitq, &owait.wait); mem_cgroup_out_of_memory(memcg, mask, order); + mem_cgroup_make_loan(memcg); } else { schedule(); finish_wait(&memcg_oom_waitq, &owait.wait); @@ -2748,6 +2807,8 @@ static void __mem_cgroup_cancel_charge(struct mem_cgroup *memcg, if (!mem_cgroup_is_root(memcg)) { unsigned long bytes = nr_pages * PAGE_SIZE; + bytes = mem_cgroup_may_return_loan(memcg, bytes); + res_counter_uncharge(&memcg->res, bytes); if (do_swap_account) res_counter_uncharge(&memcg->memsw, bytes); @@ -3989,6 +4050,7 @@ static void mem_cgroup_do_uncharge(struct mem_cgroup *memcg, { struct memcg_batch_info *batch = NULL; bool uncharge_memsw = true; + unsigned long val; /* If swapout, usage of swap doesn't decrease */ if (!do_swap_account || ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT) @@ -4029,9 +4091,11 @@ static void mem_cgroup_do_uncharge(struct mem_cgroup *memcg, batch->memsw_nr_pages++; return; direct_uncharge: - res_counter_uncharge(&memcg->res, nr_pages * PAGE_SIZE); + val = nr_pages * PAGE_SIZE; + val = mem_cgroup_may_return_loan(memcg, val); + res_counter_uncharge(&memcg->res, val); if (uncharge_memsw) - res_counter_uncharge(&memcg->memsw, nr_pages * PAGE_SIZE); + res_counter_uncharge(&memcg->memsw, val); if (unlikely(batch->memcg != memcg)) memcg_oom_recover(memcg); } @@ -4182,6 +4246,7 @@ void mem_cgroup_uncharge_start(void) void mem_cgroup_uncharge_end(void) { struct memcg_batch_info *batch = &current->memcg_batch; + unsigned long val; if (!batch->do_batch) return; @@ -4192,16 +4257,16 @@ void mem_cgroup_uncharge_end(void) if (!batch->memcg) return; + val = batch->nr_pages * PAGE_SIZE; + val = mem_cgroup_may_return_loan(batch->memcg, val); /* * This "batch->memcg" is valid without any css_get/put etc... * bacause we hide charges behind us.
*/ if (batch->nr_pages) - res_counter_uncharge(&batch->memcg->res, - batch->nr_pages * PAGE_SIZE); + res_counter_uncharge(&batch->memcg->res, val); if (batch->memsw_nr_pages) - res_counter_uncharge(&batch->memcg->memsw, - batch->memsw_nr_pages * PAGE_SIZE); + res_counter_uncharge(&batch->memcg->memsw, val); memcg_oom_recover(batch->memcg); /* forget this pointer (for sanity check) */ batch->memcg = NULL; @@ -6291,6 +6356,8 @@ mem_cgroup_css_alloc(struct cgroup *cont) memcg->move_charge_at_immigrate = 0; mutex_init(&memcg->thresholds_lock); spin_lock_init(&memcg->move_lock); + memcg->loan = 0; + spin_lock_init(&memcg->loan_lock); return &memcg->css; -- 1.7.10.2 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759938Ab3BHEeZ (ORCPT ); Thu, 7 Feb 2013 23:34:25 -0500 Received: from mail-wg0-f74.google.com ([74.125.82.74]:49527 "EHLO mail-wg0-f74.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754548Ab3BHEeW (ORCPT ); Thu, 7 Feb 2013 23:34:22 -0500 From: Greg Thelen To: Michal Hocko Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked References: <20121230020947.AA002F34@pobox.sk> <20121230110815.GA12940@dhcp22.suse.cz> <20130125160723.FAE73567@pobox.sk> <20130125163130.GF4721@dhcp22.suse.cz> <20130205134937.GA22804@dhcp22.suse.cz> <20130205154947.CD6411E2@pobox.sk> <20130205160934.GB22804@dhcp22.suse.cz> <20130205174651.GA3959@dhcp22.suse.cz> <20130205185953.GB3959@dhcp22.suse.cz> Date: Thu, 07 Feb 2013 20:27:00 -0800 In-Reply-To: <20130205185953.GB3959@dhcp22.suse.cz> (Michal Hocko's message of "Tue, 5 Feb 2013 19:59:53 +0100") Message-ID: User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.3 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: linux-kernel-owner@vger.kernel.org List-ID: 
X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Feb 05 2013, Michal Hocko wrote: > On Tue 05-02-13 10:09:57, Greg Thelen wrote: >> On Tue, Feb 05 2013, Michal Hocko wrote: >> >> > On Tue 05-02-13 08:48:23, Greg Thelen wrote: >> >> On Tue, Feb 05 2013, Michal Hocko wrote: >> >> >> >> > On Tue 05-02-13 15:49:47, azurIt wrote: >> >> > [...] >> >> >> Just to be sure - am i supposed to apply this two patches? >> >> >> http://watchdog.sk/lkml/patches/ >> >> > >> >> > 5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I >> >> > mentioned in a follow up email. Here is the full patch: >> >> > --- >> >> > From f2bf8437d5b9bb38a95a432bf39f32c584955171 Mon Sep 17 00:00:00 2001 >> >> > From: Michal Hocko >> >> > Date: Mon, 26 Nov 2012 11:47:57 +0100 >> >> > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked >> >> > >> >> > memcg oom killer might deadlock if the process which falls down to >> >> > mem_cgroup_handle_oom holds a lock which prevents other task to >> >> > terminate because it is blocked on the very same lock. >> >> > This can happen when a write system call needs to allocate a page but >> >> > the allocation hits the memcg hard limit and there is nothing to reclaim >> >> > (e.g. there is no swap or swap limit is hit as well and all cache pages >> >> > have been reclaimed already) and the process selected by memcg OOM >> >> > killer is blocked on i_mutex on the same inode (e.g. truncate it). 
>> >> > >> >> > Process A >> >> > [] do_truncate+0x58/0xa0 # takes i_mutex >> >> > [] do_last+0x250/0xa30 >> >> > [] path_openat+0xd7/0x440 >> >> > [] do_filp_open+0x49/0xa0 >> >> > [] do_sys_open+0x106/0x240 >> >> > [] sys_open+0x20/0x30 >> >> > [] system_call_fastpath+0x18/0x1d >> >> > [] 0xffffffffffffffff >> >> > >> >> > Process B >> >> > [] mem_cgroup_handle_oom+0x241/0x3b0 >> >> > [] T.1146+0x5ab/0x5c0 >> >> > [] mem_cgroup_cache_charge+0xbe/0xe0 >> >> > [] add_to_page_cache_locked+0x4c/0x140 >> >> > [] add_to_page_cache_lru+0x22/0x50 >> >> > [] grab_cache_page_write_begin+0x8b/0xe0 >> >> > [] ext3_write_begin+0x88/0x270 >> >> > [] generic_file_buffered_write+0x116/0x290 >> >> > [] __generic_file_aio_write+0x27c/0x480 >> >> > [] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex >> >> > [] do_sync_write+0xea/0x130 >> >> > [] vfs_write+0xf3/0x1f0 >> >> > [] sys_write+0x51/0x90 >> >> > [] system_call_fastpath+0x18/0x1d >> >> > [] 0xffffffffffffffff >> >> >> >> It looks like grab_cache_page_write_begin() passes __GFP_FS into >> >> __page_cache_alloc() and mem_cgroup_cache_charge(). Which makes me >> >> think that this deadlock is also possible in the page allocator even >> >> before getting to add_to_page_cache_lru. no? >> > >> > I am not that familiar with VFS but i_mutex is a high level lock AFAIR >> > and it shouldn't be called from the pageout path so __page_cache_alloc >> > should be safe. >> >> I wasn't clear, sorry. My concern is not that pageout() grabs i_mutex. >> My concern is that __page_cache_alloc() will invoke the oom killer and >> select a victim which wants i_mutex. This victim will deadlock because >> the oom killer caller already holds i_mutex. > > That would be true for the memcg oom because that one is blocking but > the global oom just puts the allocator into sleep for a while and then > the allocator should back off eventually (unless this is NOFAIL > allocation). 
I would need to look closer whether this is really the case > - I haven't seen that allocator code path for a while... I think the page allocator can loop forever waiting for an oom victim to terminate even without NOFAIL. Especially if the oom victim wants a resource exclusively held by the allocating thread (e.g. i_mutex). It looks like the same deadlock you describe is also possible (though more rare) without memcg. If the looping thread is an eligible oom victim (i.e. not oom disabled, not a kernel thread, etc) then the page allocator can return NULL as long as NOFAIL is not used. So any allocator which is able to call the oom killer and is not oom disabled (kernel thread, etc) is already exposed to the possibility of page allocator failure. So if the page allocator could detect the deadlock, then it could safely return NULL. Maybe after looping N times without forward progress the page allocator should consider failing unless NOFAIL is given. Switching back to the memcg oom situation, can we similarly return NULL if memcg oom kill has been tried a reasonable number of times? Simply failing the memcg charge with ENOMEM seems easier to support than exceeding the limit (Kame's loan patch).
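The "give up after a reasonable number of OOM kill attempts" idea above can be sketched as a bounded retry loop. This is a userspace model under stated assumptions, not a proposed kernel change: struct group_model, model_try_charge() and MODEL_MAX_OOM_ATTEMPTS are hypothetical names, the value 3 is arbitrary, and an OOM kill is modelled as simply freeing a fixed number of bytes (zero for a victim that is stuck on a lock and cannot exit).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Arbitrary, hypothetical bound on OOM kill attempts per charge. */
#define MODEL_MAX_OOM_ATTEMPTS 3

/* Hypothetical userspace model of a memory-limited group. */
struct group_model {
	size_t usage;          /* bytes currently charged */
	size_t limit;          /* hard limit in bytes */
	size_t freed_per_kill; /* bytes one OOM kill frees; 0 = stuck victim */
	unsigned int kills;    /* how many OOM kills were attempted */
};

/*
 * Try to charge `bytes`. When the limit is hit, attempt an OOM kill and
 * retry, but fail with -ENOMEM after MODEL_MAX_OOM_ATTEMPTS attempts
 * instead of waiting forever.
 */
static int model_try_charge(struct group_model *g, size_t bytes)
{
	unsigned int attempts = 0;

	assert(g->usage <= g->limit);
	for (;;) {
		size_t freed;

		if (bytes <= g->limit - g->usage) {
			g->usage += bytes;
			return 0;
		}
		if (attempts == MODEL_MAX_OOM_ATTEMPTS)
			return -ENOMEM;

		/* Model one OOM kill attempt. */
		attempts++;
		g->kills++;
		freed = g->freed_per_kill < g->usage ? g->freed_per_kill
						      : g->usage;
		g->usage -= freed;
	}
}
```

With freed_per_kill set to 0 the loop models the deadlock case: the caller gets -ENOMEM after three attempts rather than looping indefinitely. Whether every caller can cope with that error is a separate question that this model does not address.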
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751509Ab3BHFDP (ORCPT ); Fri, 8 Feb 2013 00:03:15 -0500 Received: from gmmr2.centrum.cz ([46.255.227.252]:60192 "EHLO gmmr2.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750926Ab3BHFDN (ORCPT ); Fri, 8 Feb 2013 00:03:13 -0500 To: =?utf-8?q?Michal_Hocko?= Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_if_PF=5FNO=5FMEMCG=5FOOM_is_set?= Date: Fri, 08 Feb 2013 06:03:04 +0100 From: "azurIt" Cc: , , =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= References: <20121230020947.AA002F34@pobox.sk>, <20121230110815.GA12940@dhcp22.suse.cz>, <20130125160723.FAE73567@pobox.sk>, <20130125163130.GF4721@dhcp22.suse.cz>, <20130205134937.GA22804@dhcp22.suse.cz>, <20130205154947.CD6411E2@pobox.sk>, <20130205160934.GB22804@dhcp22.suse.cz>, <20130206021721.1AE9E3C7@pobox.sk>, <20130206140119.GD10254@dhcp22.suse.cz>, <20130206142219.GF10254@dhcp22.suse.cz> <20130206160051.GG10254@dhcp22.suse.cz> In-Reply-To: <20130206160051.GG10254@dhcp22.suse.cz> X-Mailer: Centrum Email 5.3 X-Priority: 3 X-Original-From: azurit@pobox.sk MIME-Version: 1.0 Message-Id: <20130208060304.799F362F@pobox.sk> X-Maser: Georgo Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Michal, thank you very much, but it just didn't work and broke everything :( This is what happened: the problem started to occur very often immediately after booting the new kernel, every few minutes for one of my users. Everything else seemed to work fine, so I gave it a try for a day (which was a mistake).
I grabbed some data for you and went to sleep: http://watchdog.sk/lkml/memcg-bug-4.tar.gz A few hours later I was woken from my sweet, sweet dreams by SMS alerts: Apache wasn't working and our system had failed to restart it. When I looked at the situation, two apache processes (belonging to the same user as above) were still running and could not be killed by any means. I grabbed some data for you: http://watchdog.sk/lkml/memcg-bug-5.tar.gz Then I logged in to the console and this was waiting for me: http://watchdog.sk/lkml/error.jpg Finally I rebooted into a different kernel, wrote this e-mail and went back to my lovely bed ;) ______________________________________________________________ > From: "Michal Hocko" > To: azurIt > Date: 06.02.2013 17:00 > Subject: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set > > CC: linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , "Johannes Weiner" >On Wed 06-02-13 15:22:19, Michal Hocko wrote: >> On Wed 06-02-13 15:01:19, Michal Hocko wrote: >> > On Wed 06-02-13 02:17:21, azurIt wrote: >> > > >5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I >> > > >mentioned in a follow up email. Here is the full patch: >> > > >> > > >> > > Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]: >> > > http://www.watchdog.sk/lkml/oom_mysqld6 >> > >> > [...] >> > WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610() >> > Hardware name: S5000VSA >> > gfp_mask:4304 nr_pages:1 oom:0 ret:2 >> > Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 >> > Call Trace: >> > [] warn_slowpath_common+0x7a/0xb0 >> > [] warn_slowpath_fmt+0x46/0x50 >> > [] ? mem_cgroup_margin+0x73/0xa0 >> > [] T.1149+0x2d9/0x610 >> > [] ?
blk_finish_plug+0x18/0x50 >> > [] mem_cgroup_cache_charge+0xc4/0xf0 >> > [] add_to_page_cache_locked+0x4f/0x140 >> > [] add_to_page_cache_lru+0x22/0x50 >> > [] filemap_fault+0x252/0x4f0 >> > [] __do_fault+0x78/0x5a0 >> > [] handle_pte_fault+0x84/0x940 >> > [] ? vma_prio_tree_insert+0x30/0x50 >> > [] ? vma_link+0x88/0xe0 >> > [] handle_mm_fault+0x138/0x260 >> > [] do_page_fault+0x13d/0x460 >> > [] ? do_mmap_pgoff+0x3dc/0x430 >> > [] page_fault+0x1f/0x30 >> > ---[ end trace 8817670349022007 ]--- >> > apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0 >> > apache2 cpuset=uid mems_allowed=0 >> > Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 >> > Call Trace: >> > [] dump_header+0x7e/0x1e0 >> > [] ? find_lock_task_mm+0x2f/0x70 >> > [] oom_kill_process+0x85/0x2a0 >> > [] out_of_memory+0xe5/0x200 >> > [] pagefault_out_of_memory+0xbd/0x110 >> > [] mm_fault_error+0xb6/0x1a0 >> > [] do_page_fault+0x3ee/0x460 >> > [] ? do_mmap_pgoff+0x3dc/0x430 >> > [] page_fault+0x1f/0x30 >> > >> > The first trace comes from the debugging WARN and it clearly points to >> > a file fault path. __do_fault pre-charges a page in case we need to >> > do CoW (copy-on-write) for the returned page. This one falls back to >> > memcg OOM and never returns ENOMEM as I have mentioned earlier. >> > However, the fs fault handler (filemap_fault here) can fallback to >> > page_cache_read if the readahead (do_sync_mmap_readahead) fails >> > to get page to the page cache. And we can see this happening in >> > the first trace. page_cache_read then calls add_to_page_cache_lru >> > and eventually gets to add_to_page_cache_locked which calls >> > mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should >> > happen. This ENOMEM gets to the fault handler and kaboom. >> > >> > So the fix is really much more complex than I thought. Although >> > add_to_page_cache_locked sounded like a good place it turned out to be >> > not in fact. 
>> > >> > We need something more clever appaerently. One way would be not misusing >> > __GFP_NORETRY for GFP_MEMCG_NO_OOM and give it a real flag. We have 32 >> > bits for those flags in gfp_t so there should be some room there. >> > Or we could do this per task flag, same we do for NO_IO in the current >> > -mm tree. >> > The later one seems easier wrt. gfp_mask passing horror - e.g. >> > __generic_file_aio_write doesn't pass flags and it can be called from >> > unlocked contexts as well. >> >> Ouch, PF_ flags space seem to be drained already because >> task_struct::flags is just unsigned int so there is just one bit left. I >> am not sure this is the best use for it. This will be a real pain! > >OK, so this something that should help you without any risk of false >OOMs. I do not believe that something like that would be accepted >upstream because it is really heavy. We will need to come up with >something more clever for upstream. >I have also added a warning which will trigger when the charge fails. If >you see too many of those messages then there is something bad going on >and the lack of OOM causes userspace to loop without getting any >progress. > >So there you go - your personal patch ;) You can drop all other patches. >Please note I have just compile tested it. But it should be pretty >trivial to check it is correct >--- >>From 6f155187f77c971b45caf05dbc80ca9c20bc278c Mon Sep 17 00:00:00 2001 >From: Michal Hocko >Date: Wed, 6 Feb 2013 16:45:07 +0100 >Subject: [PATCH 1/2] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set > >memcg oom killer might deadlock if the process which falls down to >mem_cgroup_handle_oom holds a lock which prevents other task to >terminate because it is blocked on the very same lock. >This can happen when a write system call needs to allocate a page but >the allocation hits the memcg hard limit and there is nothing to reclaim >(e.g. 
there is no swap or swap limit is hit as well and all cache pages >have been reclaimed already) and the process selected by memcg OOM >killer is blocked on i_mutex on the same inode (e.g. truncate it). > >Process A >[] do_truncate+0x58/0xa0 # takes i_mutex >[] do_last+0x250/0xa30 >[] path_openat+0xd7/0x440 >[] do_filp_open+0x49/0xa0 >[] do_sys_open+0x106/0x240 >[] sys_open+0x20/0x30 >[] system_call_fastpath+0x18/0x1d >[] 0xffffffffffffffff > >Process B >[] mem_cgroup_handle_oom+0x241/0x3b0 >[] T.1146+0x5ab/0x5c0 >[] mem_cgroup_cache_charge+0xbe/0xe0 >[] add_to_page_cache_locked+0x4c/0x140 >[] add_to_page_cache_lru+0x22/0x50 >[] grab_cache_page_write_begin+0x8b/0xe0 >[] ext3_write_begin+0x88/0x270 >[] generic_file_buffered_write+0x116/0x290 >[] __generic_file_aio_write+0x27c/0x480 >[] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex >[] do_sync_write+0xea/0x130 >[] vfs_write+0xf3/0x1f0 >[] sys_write+0x51/0x90 >[] system_call_fastpath+0x18/0x1d >[] 0xffffffffffffffff > >This is not a hard deadlock though because administrator can still >intervene and increase the limit on the group which helps the writer to >finish the allocation and release the lock. > >This patch heals the problem by forbidding OOM from dangerous context. >Memcg charging code has no way to find out whether it is called from a >locked context we have to help it via process flags. PF_OOM_ORIGIN flag >removed recently will be reused for PF_NO_MEMCG_OOM which signals that >the memcg OOM killer could lead to a deadlock. >Only locked callers of __generic_file_aio_write are currently marked. I >am pretty sure there are more places (I didn't check shmem and hugetlb >uses fancy instantion mutex during page fault and filesystems might >use some locks during the write) but I've ignored those as this will >probably be just a user specific patch without any way to get upstream >in the current form. 
> >Reported-by: azurIt >Signed-off-by: Michal Hocko >--- > drivers/staging/pohmelfs/inode.c | 2 ++ > include/linux/sched.h | 1 + > mm/filemap.c | 2 ++ > mm/memcontrol.c | 18 ++++++++++++++---- > 4 files changed, 19 insertions(+), 4 deletions(-) > >diff --git a/drivers/staging/pohmelfs/inode.c b/drivers/staging/pohmelfs/inode.c >index 7a19555..523de82e 100644 >--- a/drivers/staging/pohmelfs/inode.c >+++ b/drivers/staging/pohmelfs/inode.c >@@ -921,7 +921,9 @@ ssize_t pohmelfs_write(struct file *file, const char __user *buf, > if (ret) > goto err_out_unlock; > >+ current->flags |= PF_NO_MEMCG_OOM; > ret = __generic_file_aio_write(&kiocb, &iov, 1, &kiocb.ki_pos); >+ current->flags &= ~PF_NO_MEMCG_OOM; > *ppos = kiocb.ki_pos; > > mutex_unlock(&inode->i_mutex); >diff --git a/include/linux/sched.h b/include/linux/sched.h >index 1e86bb4..f275c8f 100644 >--- a/include/linux/sched.h >+++ b/include/linux/sched.h >@@ -1781,6 +1781,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t * > #define PF_FROZEN 0x00010000 /* frozen for system suspend */ > #define PF_FSTRANS 0x00020000 /* inside a filesystem transaction */ > #define PF_KSWAPD 0x00040000 /* I am kswapd */ >+#define PF_NO_MEMCG_OOM 0x00080000 /* Memcg OOM could lead to a deadlock */ > #define PF_LESS_THROTTLE 0x00100000 /* Throttle me less: I clean memory */ > #define PF_KTHREAD 0x00200000 /* I am a kernel thread */ > #define PF_RANDOMIZE 0x00400000 /* randomize virtual address space */ >diff --git a/mm/filemap.c b/mm/filemap.c >index 556858c..58a316b 100644 >--- a/mm/filemap.c >+++ b/mm/filemap.c >@@ -2617,7 +2617,9 @@ ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov, > > mutex_lock(&inode->i_mutex); > blk_start_plug(&plug); >+ current->flags |= PF_NO_MEMCG_OOM; > ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos); >+ current->flags &= ~PF_NO_MEMCG_OOM; > mutex_unlock(&inode->i_mutex); > > if (ret > 0 || ret == -EIOCBQUEUED) { >diff --git 
a/mm/memcontrol.c b/mm/memcontrol.c >index c8425b1..128b615 100644 >--- a/mm/memcontrol.c >+++ b/mm/memcontrol.c >@@ -2397,6 +2397,14 @@ done: > return 0; > nomem: > *ptr = NULL; >+ if (printk_ratelimit()) >+ printk(KERN_WARNING"%s: task:%s pid:%d got ENOMEM without OOM for memcg:%p." >+ " If this message shows up very often for the" >+ " same task then there is a risk that the" >+ " process is not able to make any progress" >+ " because of the current limit. Try to enlarge" >+ " the hard limit.\n", __FUNCTION__, >+ current->comm, current->pid, memcg); > return -ENOMEM; > bypass: > *ptr = NULL; >@@ -2703,7 +2711,7 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm, > struct mem_cgroup *memcg = NULL; > unsigned int nr_pages = 1; > struct page_cgroup *pc; >- bool oom = true; >+ bool oom = !(current->flags & PF_NO_MEMCG_OOM); > int ret; > > if (PageTransHuge(page)) { >@@ -2770,6 +2778,7 @@ __mem_cgroup_commit_charge_lrucare(struct page *page, struct mem_cgroup *memcg, > int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, > gfp_t gfp_mask) > { >+ bool oom = !(current->flags & PF_NO_MEMCG_OOM); > struct mem_cgroup *memcg = NULL; > int ret; > >@@ -2782,7 +2791,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, > mm = &init_mm; > > if (page_is_file_cache(page)) { >- ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, &memcg, true); >+ ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, &memcg, oom); > if (ret || !memcg) > return ret; > >@@ -2818,6 +2827,7 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm, > struct page *page, > gfp_t mask, struct mem_cgroup **ptr) > { >+ bool oom = !(current->flags & PF_NO_MEMCG_OOM); > struct mem_cgroup *memcg; > int ret; > >@@ -2840,13 +2850,13 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm, > if (!memcg) > goto charge_cur_mm; > *ptr = memcg; >- ret = __mem_cgroup_try_charge(NULL, mask, 1, ptr, true); >+ ret = __mem_cgroup_try_charge(NULL, mask, 1, ptr, oom); > 
css_put(&memcg->css); > return ret; > charge_cur_mm: > if (unlikely(!mm)) > mm = &init_mm; >- return __mem_cgroup_try_charge(mm, mask, 1, ptr, true); >+ return __mem_cgroup_try_charge(mm, mask, 1, ptr, oom); > } > > static void >-- >1.7.10.4 > >-- >Michal Hocko >SUSE Labs > From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1946169Ab3BHJo2 (ORCPT ); Fri, 8 Feb 2013 04:44:28 -0500 Received: from cantor2.suse.de ([195.135.220.15]:51248 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758207Ab3BHJoY (ORCPT ); Fri, 8 Feb 2013 04:44:24 -0500 Date: Fri, 8 Feb 2013 10:44:20 +0100 From: Michal Hocko To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set Message-ID: <20130208094420.GA7557@dhcp22.suse.cz> References: <20130125160723.FAE73567@pobox.sk> <20130125163130.GF4721@dhcp22.suse.cz> <20130205134937.GA22804@dhcp22.suse.cz> <20130205154947.CD6411E2@pobox.sk> <20130205160934.GB22804@dhcp22.suse.cz> <20130206021721.1AE9E3C7@pobox.sk> <20130206140119.GD10254@dhcp22.suse.cz> <20130206142219.GF10254@dhcp22.suse.cz> <20130206160051.GG10254@dhcp22.suse.cz> <20130208060304.799F362F@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130208060304.799F362F@pobox.sk> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri 08-02-13 06:03:04, azurIt wrote: > Michal, thank you very much but it just didn't work and broke > everything :( I am sorry to hear that. The patch should help to solve the deadlock you have seen earlier. It in no way can solve side effects of failing writes and it also cannot help much if the oom is permanent. 
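For readers following along, here is a minimal user-space sketch of the mechanism the patch relies on: a per-task flag is set around the locked write path, and the charge path derives its `oom` argument from that flag. Only the `PF_NO_MEMCG_OOM` value and the `!(flags & PF_NO_MEMCG_OOM)` test are taken from the patch. `struct task`, `try_charge()`, `charge_for_task()` and `locked_write()` are hypothetical stand-ins for illustration, not kernel API.

```c
#include <stdbool.h>
#include <errno.h>

/* Flag value taken from the patch; everything else is a hypothetical model. */
#define PF_NO_MEMCG_OOM 0x00080000u

struct task {
    unsigned int flags;      /* stands in for current->flags */
};

enum charge_result { CHARGE_OK = 0, CHARGE_OOM_WAIT = 1 };

/*
 * Model of a charge attempt. When the group is over its limit, a caller
 * that may trigger OOM ends up waiting in the OOM path; a caller that may
 * not gets -ENOMEM back immediately.
 */
static int try_charge(bool over_limit, bool oom)
{
    if (!over_limit)
        return CHARGE_OK;
    return oom ? CHARGE_OOM_WAIT : -ENOMEM;
}

/* Mirrors "bool oom = !(current->flags & PF_NO_MEMCG_OOM);" from the patch. */
static int charge_for_task(const struct task *t, bool over_limit)
{
    bool oom = !(t->flags & PF_NO_MEMCG_OOM);

    return try_charge(over_limit, oom);
}

/* Model of a write path that charges memory while holding a lock. */
static int locked_write(struct task *t, bool over_limit)
{
    int ret;

    t->flags |= PF_NO_MEMCG_OOM;    /* entering the locked section */
    ret = charge_for_task(t, over_limit);
    t->flags &= ~PF_NO_MEMCG_OOM;   /* leaving the locked section */
    return ret;
}
```

In this model a page-fault-like caller (flag clear) still ends up in the OOM path, while the locked writer fails with -ENOMEM, which is the trade-off described above: the deadlock is avoided, but the write itself can fail.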
> This happened:
> Problem started to occur really often immediately after booting the
> new kernel, every few minutes for one of my users. But everything
> other seems to work fine so i gave it a try for a day (which was a
> mistake). I grabbed some data for you and go to sleep:
> http://watchdog.sk/lkml/memcg-bug-4.tar.gz

Do you have logs from that time period?

I have only glanced through the stacks and most of the threads are
waiting in the mem_cgroup_handle_oom (mostly from the page fault path
where we do not have other options than waiting) which suggests that
your memory limit is seriously underestimated. If you look at the number
of charging failures (memory.failcnt per-group file) then you will get
9332083 failures on _average_ per group. This is a lot!
Not all those failures end with OOM, of course. But it clearly signals
that the workload needs much more memory than the limit allows.

> Few hours later i was woke up from my sweet sweet dreams by alerts
> smses - Apache wasn't working and our system failed to restart
> it. When i observed the situation, two apache processes (of that user
> as above) were still running and it wasn't possible to kill them by
> any way. I grabbed some data for you:
> http://watchdog.sk/lkml/memcg-bug-5.tar.gz

There are only 5 groups in this one and all of them have no memory
charged (so no OOM going on). All tasks are somewhere in the ptrace
code.

grep cache -r .
./1360297489/memory.stat:cache 0
./1360297489/memory.stat:total_cache 65642496
./1360297491/memory.stat:cache 0
./1360297491/memory.stat:total_cache 65642496
./1360297492/memory.stat:cache 0
./1360297492/memory.stat:total_cache 65642496
./1360297490/memory.stat:cache 0
./1360297490/memory.stat:total_cache 65642496
./1360297488/memory.stat:cache 0
./1360297488/memory.stat:total_cache 65642496

which suggests that this is a parent group and the memory is charged in
a child group. I guess that all those are under OOM as the number seems
like they have limit at 62M.
> Then I logged to the console and this was waiting for me: > http://watchdog.sk/lkml/error.jpg This is just a warning and it should be harmless. There is just one WARN in ptrace_check_attach: WARN_ON_ONCE(task_is_stopped(child)) This has been introduced by http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=321fb561 and the commit description claim this shouldn't happen. I am not familiar with this code but it sounds like a bug in the tracing code which is not related to the discussed issue. > Finally i rebooted into different kernel, wrote this e-mail and go to > my lovely bed ;) -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1946250Ab3BHLCy (ORCPT ); Fri, 8 Feb 2013 06:02:54 -0500 Received: from gmmr5.centrum.cz ([46.255.225.250]:41273 "EHLO gmmr5.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1946207Ab3BHLCw (ORCPT ); Fri, 8 Feb 2013 06:02:52 -0500 To: =?utf-8?q?Michal_Hocko?= Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_if_PF=5FNO=5FMEMCG=5FOOM_is_set?= Date: Fri, 08 Feb 2013 12:02:49 +0100 From: "azurIt" Cc: , , =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= References: <20130125160723.FAE73567@pobox.sk>, <20130125163130.GF4721@dhcp22.suse.cz>, <20130205134937.GA22804@dhcp22.suse.cz>, <20130205154947.CD6411E2@pobox.sk>, <20130205160934.GB22804@dhcp22.suse.cz>, <20130206021721.1AE9E3C7@pobox.sk>, <20130206140119.GD10254@dhcp22.suse.cz>, <20130206142219.GF10254@dhcp22.suse.cz>, <20130206160051.GG10254@dhcp22.suse.cz>, <20130208060304.799F362F@pobox.sk> <20130208094420.GA7557@dhcp22.suse.cz> In-Reply-To: <20130208094420.GA7557@dhcp22.suse.cz> X-Mailer: Centrum Email 5.3 X-Priority: 3 X-Original-From: azurit@pobox.sk MIME-Version: 1.0 Message-Id: <20130208120249.FD733220@pobox.sk> X-Maser: Georgo Content-Type: text/plain; 
charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org > >Do you have logs from that time period? > >I have only glanced through the stacks and most of the threads are >waiting in the mem_cgroup_handle_oom (mostly from the page fault path >where we do not have other options than waiting) which suggests that >your memory limit is seriously underestimated. If you look at the number >of charging failures (memory.failcnt per-group file) then you will get >9332083 failures in _average_ per group. This is a lot! >Not all those failures end with OOM, of course. But it clearly signals >that the workload need much more memory than the limit allows. What type of logs? I have all. Memory usage graph: http://www.watchdog.sk/lkml/memory2.png New kernel was booted about 1:15. Data in memcg-bug-4.tar.gz were taken about 2:35 and data in memcg-bug-5.tar.gz about 5:25. There was always lots of free memory. Higher memory consumption between 3:39 and 5:33 was caused by data backup and was completed few minutes before i restarted the server (this was just a coincidence). >There are only 5 groups in this one and all of them have no memory >charged (so no OOM going on). All tasks are somewhere in the ptrace >code. It's all from the same cgroup but from different time. >grep cache -r . >./1360297489/memory.stat:cache 0 >./1360297489/memory.stat:total_cache 65642496 >./1360297491/memory.stat:cache 0 >./1360297491/memory.stat:total_cache 65642496 >./1360297492/memory.stat:cache 0 >./1360297492/memory.stat:total_cache 65642496 >./1360297490/memory.stat:cache 0 >./1360297490/memory.stat:total_cache 65642496 >./1360297488/memory.stat:cache 0 >./1360297488/memory.stat:total_cache 65642496 > >which suggests that this is a parent group and the memory is charged in >a child group. I guess that all those are under OOM as the number seems >like they have limit at 62M. The cgroup has limit 330M (346030080 bytes). 
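As a quick sanity check on the byte counts quoted here, a small sketch of the conversion to MiB. The helper name `bytes_to_mib` is made up for this example.

```c
#include <stdint.h>

#define MIB (1024.0 * 1024.0)

/* Convert a byte count (e.g. a value read from memory.stat) to MiB. */
static double bytes_to_mib(uint64_t bytes)
{
    return (double)bytes / MIB;
}
```

With the numbers from this thread, 346030080 bytes is exactly 330 MiB, and the 65642496 bytes reported as total_cache is about 62.6 MiB, which matches the "62M" guess.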
As i said, these two processes were stucked and was impossible to kill them. They were, maybe, the processes which i was trying to 'strace' before - 'strace' was freezed as always when the cgroup has this problem and i killed it (i was just trying if it is the original cgroup problem). From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1946402Ab3BHMi6 (ORCPT ); Fri, 8 Feb 2013 07:38:58 -0500 Received: from cantor2.suse.de ([195.135.220.15]:57410 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1760023Ab3BHMi5 (ORCPT ); Fri, 8 Feb 2013 07:38:57 -0500 Date: Fri, 8 Feb 2013 13:38:54 +0100 From: Michal Hocko To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set Message-ID: <20130208123854.GB7557@dhcp22.suse.cz> References: <20130205134937.GA22804@dhcp22.suse.cz> <20130205154947.CD6411E2@pobox.sk> <20130205160934.GB22804@dhcp22.suse.cz> <20130206021721.1AE9E3C7@pobox.sk> <20130206140119.GD10254@dhcp22.suse.cz> <20130206142219.GF10254@dhcp22.suse.cz> <20130206160051.GG10254@dhcp22.suse.cz> <20130208060304.799F362F@pobox.sk> <20130208094420.GA7557@dhcp22.suse.cz> <20130208120249.FD733220@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130208120249.FD733220@pobox.sk> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri 08-02-13 12:02:49, azurIt wrote: > > > >Do you have logs from that time period? > > > >I have only glanced through the stacks and most of the threads are > >waiting in the mem_cgroup_handle_oom (mostly from the page fault path > >where we do not have other options than waiting) which suggests that > >your memory limit is seriously underestimated. 
If you look at the number > >of charging failures (memory.failcnt per-group file) then you will get > >9332083 failures in _average_ per group. This is a lot! > >Not all those failures end with OOM, of course. But it clearly signals > >that the workload need much more memory than the limit allows. > > > What type of logs? I have all. kernel log would be sufficient. > Memory usage graph: > http://www.watchdog.sk/lkml/memory2.png > > New kernel was booted about 1:15. Data in memcg-bug-4.tar.gz were taken about 2:35 and data in memcg-bug-5.tar.gz about 5:25. There was always lots of free memory. Higher memory consumption between 3:39 and 5:33 was caused by data backup and was completed few minutes before i restarted the server (this was just a coincidence). > > > > >There are only 5 groups in this one and all of them have no memory > >charged (so no OOM going on). All tasks are somewhere in the ptrace > >code. > > > It's all from the same cgroup but from different time. > > > > >grep cache -r . > >./1360297489/memory.stat:cache 0 > >./1360297489/memory.stat:total_cache 65642496 > >./1360297491/memory.stat:cache 0 > >./1360297491/memory.stat:total_cache 65642496 > >./1360297492/memory.stat:cache 0 > >./1360297492/memory.stat:total_cache 65642496 > >./1360297490/memory.stat:cache 0 > >./1360297490/memory.stat:total_cache 65642496 > >./1360297488/memory.stat:cache 0 > >./1360297488/memory.stat:total_cache 65642496 > > > >which suggests that this is a parent group and the memory is charged in > >a child group. I guess that all those are under OOM as the number seems > >like they have limit at 62M. > > > The cgroup has limit 330M (346030080 bytes). This limit is for top level groups, right? Those seem to children which have 62MB charged - is that a limit for those children? > As i said, these two processes Which are those two processes? > were stucked and was impossible to kill them. 
They were, > maybe, the processes which i was trying to 'strace' before - 'strace' > was freezed as always when the cgroup has this problem and i killed it > (i was just trying if it is the original cgroup problem). I have no idea what is the strace role here. -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759971Ab3BHN4Z (ORCPT ); Fri, 8 Feb 2013 08:56:25 -0500 Received: from gmmr3.centrum.cz ([46.255.225.251]:47415 "EHLO gmmr3.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758353Ab3BHN4Y (ORCPT ); Fri, 8 Feb 2013 08:56:24 -0500 To: =?utf-8?q?Michal_Hocko?= Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_if_PF=5FNO=5FMEMCG=5FOOM_is_set?= Date: Fri, 08 Feb 2013 14:56:16 +0100 From: "azurIt" Cc: , , =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= References: <20130205134937.GA22804@dhcp22.suse.cz>, <20130205154947.CD6411E2@pobox.sk>, <20130205160934.GB22804@dhcp22.suse.cz>, <20130206021721.1AE9E3C7@pobox.sk>, <20130206140119.GD10254@dhcp22.suse.cz>, <20130206142219.GF10254@dhcp22.suse.cz>, <20130206160051.GG10254@dhcp22.suse.cz>, <20130208060304.799F362F@pobox.sk>, <20130208094420.GA7557@dhcp22.suse.cz>, <20130208120249.FD733220@pobox.sk> <20130208123854.GB7557@dhcp22.suse.cz> In-Reply-To: <20130208123854.GB7557@dhcp22.suse.cz> X-Mailer: Centrum Email 5.3 X-Priority: 3 X-Original-From: azurit@pobox.sk MIME-Version: 1.0 Message-Id: <20130208145616.FB78CE24@pobox.sk> X-Maser: Georgo Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org >kernel log would be sufficient. Full kernel log from kernel with you newest patch: http://watchdog.sk/lkml/kern2.log >This limit is for top level groups, right? 
Those seem to children which >have 62MB charged - is that a limit for those children? It was the limit for parent cgroup and processes were in one (the same) child cgroup. Child cgroup has no memory limit set (so limit for parent was also limit for child - 330 MB). >Which are those two processes? Data are inside memcg-bug-5.tar.gz in directories bug/// >I have no idea what is the strace role here. I was stracing exactly two processes from that cgroup and exactly two processes were stucked later and was immpossible to kill them. Both of them were waiting on 'ptrace_stop'. Maybe it's completely unrelated, just guessing. azur From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1760118Ab3BHOrX (ORCPT ); Fri, 8 Feb 2013 09:47:23 -0500 Received: from cantor2.suse.de ([195.135.220.15]:34342 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753386Ab3BHOrW (ORCPT ); Fri, 8 Feb 2013 09:47:22 -0500 Date: Fri, 8 Feb 2013 15:47:20 +0100 From: Michal Hocko To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set Message-ID: <20130208144720.GC7557@dhcp22.suse.cz> References: <20130205160934.GB22804@dhcp22.suse.cz> <20130206021721.1AE9E3C7@pobox.sk> <20130206140119.GD10254@dhcp22.suse.cz> <20130206142219.GF10254@dhcp22.suse.cz> <20130206160051.GG10254@dhcp22.suse.cz> <20130208060304.799F362F@pobox.sk> <20130208094420.GA7557@dhcp22.suse.cz> <20130208120249.FD733220@pobox.sk> <20130208123854.GB7557@dhcp22.suse.cz> <20130208145616.FB78CE24@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130208145616.FB78CE24@pobox.sk> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri 08-02-13 
14:56:16, azurIt wrote: > Data are inside memcg-bug-5.tar.gz in directories bug/// ohh, I didn't get those were timestamp directories. It makes more sense now. -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1946612Ab3BHPYH (ORCPT ); Fri, 8 Feb 2013 10:24:07 -0500 Received: from cantor2.suse.de ([195.135.220.15]:36095 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1946450Ab3BHPYF (ORCPT ); Fri, 8 Feb 2013 10:24:05 -0500 Date: Fri, 8 Feb 2013 16:24:02 +0100 From: Michal Hocko To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set Message-ID: <20130208152402.GD7557@dhcp22.suse.cz> References: <20130205160934.GB22804@dhcp22.suse.cz> <20130206021721.1AE9E3C7@pobox.sk> <20130206140119.GD10254@dhcp22.suse.cz> <20130206142219.GF10254@dhcp22.suse.cz> <20130206160051.GG10254@dhcp22.suse.cz> <20130208060304.799F362F@pobox.sk> <20130208094420.GA7557@dhcp22.suse.cz> <20130208120249.FD733220@pobox.sk> <20130208123854.GB7557@dhcp22.suse.cz> <20130208145616.FB78CE24@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130208145616.FB78CE24@pobox.sk> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri 08-02-13 14:56:16, azurIt wrote: > >kernel log would be sufficient. 
>
>
> Full kernel log from kernel with you newest patch:
> http://watchdog.sk/lkml/kern2.log

OK, so the log says that there is a little slaughter on your yard:
$ grep "Memory cgroup out of memory:" kern2.log | wc -l
220

$ grep "Memory cgroup out of memory:" kern2.log | sed 's@.*Kill process \([0-9]*\) .*@\1@' | sort -u | wc -l
220

Which means that the oom killer didn't try to kill any task more than
once which is good because it tells us that the killed task manages to
die before we trigger oom again. So this is definitely not a deadlock.
You are just hitting OOM very often.

$ grep "killed as a result of limit" kern2.log | sed 's@.*\] @@' | sort | uniq -c | sort -k1 -n
      1 Task in /1091/uid killed as a result of limit of /1091
      1 Task in /1223/uid killed as a result of limit of /1223
      1 Task in /1229/uid killed as a result of limit of /1229
      1 Task in /1255/uid killed as a result of limit of /1255
      1 Task in /1424/uid killed as a result of limit of /1424
      1 Task in /1470/uid killed as a result of limit of /1470
      1 Task in /1567/uid killed as a result of limit of /1567
      2 Task in /1080/uid killed as a result of limit of /1080
      3 Task in /1381/uid killed as a result of limit of /1381
      4 Task in /1185/uid killed as a result of limit of /1185
      4 Task in /1289/uid killed as a result of limit of /1289
      4 Task in /1709/uid killed as a result of limit of /1709
      5 Task in /1279/uid killed as a result of limit of /1279
      6 Task in /1020/uid killed as a result of limit of /1020
      6 Task in /1527/uid killed as a result of limit of /1527
      9 Task in /1388/uid killed as a result of limit of /1388
     17 Task in /1281/uid killed as a result of limit of /1281
     22 Task in /1599/uid killed as a result of limit of /1599
     30 Task in /1155/uid killed as a result of limit of /1155
     31 Task in /1258/uid killed as a result of limit of /1258
     71 Task in /1293/uid killed as a result of limit of /1293

So the group 1293 suffers the most.
I would check how much memory the workload in the group really needs
because this level of OOM cannot possibly be healthy.

The log also says that the deadlock prevention implemented by the patch
triggered and some writes really failed due to potential OOM:
$ grep "If this message shows up" kern2.log
Feb 8 01:17:10 server01 kernel: [ 431.033593] __mem_cgroup_try_charge: task:apache2 pid:6733 got ENOMEM without OOM for memcg:ffff8803807d5600. If this message shows up very often for the same task then there is a risk that the process is not able to make any progress because of the current limit. Try to enlarge the hard limit.
Feb 8 01:22:52 server01 kernel: [ 773.556782] __mem_cgroup_try_charge: task:apache2 pid:12092 got ENOMEM without OOM for memcg:ffff8803807d5600. If this message shows up very often for the same task then there is a risk that the process is not able to make any progress because of the current limit. Try to enlarge the hard limit.
Feb 8 01:22:52 server01 kernel: [ 773.567916] __mem_cgroup_try_charge: task:apache2 pid:12093 got ENOMEM without OOM for memcg:ffff8803807d5600. If this message shows up very often for the same task then there is a risk that the process is not able to make any progress because of the current limit. Try to enlarge the hard limit.
Feb 8 01:29:00 server01 kernel: [ 1141.355693] __mem_cgroup_try_charge: task:apache2 pid:17734 got ENOMEM without OOM for memcg:ffff88036e956e00. If this message shows up very often for the same task then there is a risk that the process is not able to make any progress because of the current limit. Try to enlarge the hard limit.
Feb 8 03:30:39 server01 kernel: [ 8440.346811] __mem_cgroup_try_charge: task:apache2 pid:8687 got ENOMEM without OOM for memcg:ffff8803654d6e00. If this message shows up very often for the same task then there is a risk that the process is not able to make any progress because of the current limit. Try to enlarge the hard limit.

This doesn't look very unhealthy.
I have expected that write would fail more often but it seems that the biggest memory pressure comes from mmaps and page faults which have no way other than OOM. So my suggestion would be to reconsider limits for groups to provide more realistical environment. -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1946688Ab3BHP6L (ORCPT ); Fri, 8 Feb 2013 10:58:11 -0500 Received: from gmmr1.centrum.cz ([46.255.225.252]:34120 "EHLO gmmr1.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1946660Ab3BHP6I (ORCPT ); Fri, 8 Feb 2013 10:58:08 -0500 To: =?utf-8?q?Michal_Hocko?= Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_if_PF=5FNO=5FMEMCG=5FOOM_is_set?= Date: Fri, 08 Feb 2013 16:58:05 +0100 From: "azurIt" Cc: , , =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= References: <20130205160934.GB22804@dhcp22.suse.cz>, <20130206021721.1AE9E3C7@pobox.sk>, <20130206140119.GD10254@dhcp22.suse.cz>, <20130206142219.GF10254@dhcp22.suse.cz>, <20130206160051.GG10254@dhcp22.suse.cz>, <20130208060304.799F362F@pobox.sk>, <20130208094420.GA7557@dhcp22.suse.cz>, <20130208120249.FD733220@pobox.sk>, <20130208123854.GB7557@dhcp22.suse.cz>, <20130208145616.FB78CE24@pobox.sk> <20130208152402.GD7557@dhcp22.suse.cz> In-Reply-To: <20130208152402.GD7557@dhcp22.suse.cz> X-Mailer: Centrum Email 5.3 X-Priority: 3 X-Original-From: azurit@pobox.sk MIME-Version: 1.0 Message-Id: <20130208165805.8908B143@pobox.sk> X-Maser: Georgo Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org >Which means that the oom killer didn't try to kill any task more than >once which is good because it tells us that the killed task manages to >die before we trigger oom again. So this is definitely not a deadlock. 
>You are just hitting OOM very often. >$ grep "killed as a result of limit" kern2.log | sed 's@.*\] @@' | sort | uniq -c | sort -k1 -n > 1 Task in /1091/uid killed as a result of limit of /1091 > 1 Task in /1223/uid killed as a result of limit of /1223 > 1 Task in /1229/uid killed as a result of limit of /1229 > 1 Task in /1255/uid killed as a result of limit of /1255 > 1 Task in /1424/uid killed as a result of limit of /1424 > 1 Task in /1470/uid killed as a result of limit of /1470 > 1 Task in /1567/uid killed as a result of limit of /1567 > 2 Task in /1080/uid killed as a result of limit of /1080 > 3 Task in /1381/uid killed as a result of limit of /1381 > 4 Task in /1185/uid killed as a result of limit of /1185 > 4 Task in /1289/uid killed as a result of limit of /1289 > 4 Task in /1709/uid killed as a result of limit of /1709 > 5 Task in /1279/uid killed as a result of limit of /1279 > 6 Task in /1020/uid killed as a result of limit of /1020 > 6 Task in /1527/uid killed as a result of limit of /1527 > 9 Task in /1388/uid killed as a result of limit of /1388 > 17 Task in /1281/uid killed as a result of limit of /1281 > 22 Task in /1599/uid killed as a result of limit of /1599 > 30 Task in /1155/uid killed as a result of limit of /1155 > 31 Task in /1258/uid killed as a result of limit of /1258 > 71 Task in /1293/uid killed as a result of limit of /1293 > >So the group 1293 suffers the most. I would check how much memory the >worklod in the group really needs because this level of OOM cannot >possible be healthy. 
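The per-group tally produced by the grep/sed/sort/uniq pipeline quoted above can also be sketched in C, for cases where the log is processed programmatically. This is only an illustration: `struct oom_tally`, `tally_line()` and `tally_get()` are hypothetical names, the fixed table sizes are arbitrary, and the only thing taken from the thread is the "killed as a result of limit of" marker text in the kernel log lines.

```c
#include <string.h>

#define MAX_GROUPS 64   /* arbitrary table size for this sketch */
#define NAME_LEN   64

struct oom_tally {
    char name[MAX_GROUPS][NAME_LEN];
    int  count[MAX_GROUPS];
    int  n;
};

/* Marker text as it appears in the log lines quoted in this thread. */
static const char MARKER[] = "killed as a result of limit of ";

/*
 * If line is a memcg OOM report, bump the counter of the group named
 * after the marker and return 1. Return 0 for any other line, or when
 * the table is full.
 */
static int tally_line(struct oom_tally *t, const char *line)
{
    const char *p = strstr(line, MARKER);
    char group[NAME_LEN];
    size_t len;
    int i;

    if (!p)
        return 0;
    p += sizeof(MARKER) - 1;            /* skip past the marker */
    len = strcspn(p, " \t\r\n");        /* group name ends at whitespace */
    if (len == 0 || len >= NAME_LEN)
        return 0;
    memcpy(group, p, len);
    group[len] = '\0';

    for (i = 0; i < t->n; i++) {
        if (strcmp(t->name[i], group) == 0) {
            t->count[i]++;
            return 1;
        }
    }
    if (t->n == MAX_GROUPS)
        return 0;
    strcpy(t->name[t->n], group);
    t->count[t->n] = 1;
    t->n++;
    return 1;
}

/* Number of kills recorded for a group; 0 if the group was never seen. */
static int tally_get(const struct oom_tally *t, const char *group)
{
    int i;

    for (i = 0; i < t->n; i++)
        if (strcmp(t->name[i], group) == 0)
            return t->count[i];
    return 0;
}
```

A caller would zero a `struct oom_tally`, feed it every line of the kernel log through `tally_line()`, and then read the counts back per group.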
I took the kernel log from yesterday from the same time frame:

$ grep "killed as a result of limit" kern2.log | sed 's@.*\] @@' | sort | uniq -c | sort -k1 -n
      1 Task in /1252/uid killed as a result of limit of /1252
      1 Task in /1709/uid killed as a result of limit of /1709
      2 Task in /1185/uid killed as a result of limit of /1185
      2 Task in /1388/uid killed as a result of limit of /1388
      2 Task in /1567/uid killed as a result of limit of /1567
      2 Task in /1650/uid killed as a result of limit of /1650
      3 Task in /1527/uid killed as a result of limit of /1527
      5 Task in /1552/uid killed as a result of limit of /1552
   1634 Task in /1258/uid killed as a result of limit of /1258

As you can see, there were many more OOMs in '1258' and no such problems
like this night (well, there were never such problems before :) ). As i
said, cgroup 1258 was freezing every few minutes with your latest patch
so there must be something wrong (it usually freezes about once per day).
And it was really frozen (i checked that), the symptoms were:
 - cannot strace any of cgroup processes
 - no new processes were started, still the same processes were 'running'
 - kernel was unable to resolve this on its own
 - all processes together were taking 100% CPU
 - the whole memory limit was used
(see memcg-bug-4.tar.gz for more info)

Unfortunately i forgot to check if killing only a few of the processes
would resolve it (i always killed them all yesterday night). Don't know
if it was in deadlock or not but kernel was definitely unable to resolve
the problem. And there is still a mystery of two frozen processes which
cannot be killed.

By the way, i KNOW that so much OOM is not healthy but the client simply
doesn't want to buy more memory. He knows about the problem of
insufficient memory limit.

Thank you.
azur From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1946695Ab3BHQBZ (ORCPT ); Fri, 8 Feb 2013 11:01:25 -0500 Received: from cantor2.suse.de ([195.135.220.15]:37778 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1946666Ab3BHQBX (ORCPT ); Fri, 8 Feb 2013 11:01:23 -0500 Date: Fri, 8 Feb 2013 17:01:19 +0100 From: Michal Hocko To: Kamezawa Hiroyuki Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , Johannes Weiner Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20130208160119.GE7557@dhcp22.suse.cz> References: <20121230110815.GA12940@dhcp22.suse.cz> <20130125160723.FAE73567@pobox.sk> <20130125163130.GF4721@dhcp22.suse.cz> <20130205134937.GA22804@dhcp22.suse.cz> <20130205154947.CD6411E2@pobox.sk> <20130205160934.GB22804@dhcp22.suse.cz> <20130206021721.1AE9E3C7@pobox.sk> <20130206140119.GD10254@dhcp22.suse.cz> <51138999.3090006@jp.fujitsu.com> <5114577D.70608@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <5114577D.70608@jp.fujitsu.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri 08-02-13 10:40:13, KAMEZAWA Hiroyuki wrote: > (2013/02/07 20:01), Kamezawa Hiroyuki wrote: [...] > >Hmm. do we need to increase the "limit" virtually at memcg oom until > >the oom-killed process dies ? > > Here is my naive idea... and the next step would be http://en.wikipedia.org/wiki/Credit_default_swap :P But seriously now. The idea is not bad at all. This implementation would need some tweaks to work though (e.g. you would need to wake oom sleepers when you get a loan - because those are ones which can block the resource). We should also give the borrowed charges only to those who would oom to prevent from stealing. 
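To make the loan idea concrete, here is a single-threaded user-space model of the accounting in the proof-of-concept quoted below. It is a sketch under stated assumptions: `struct group`, `make_loan()`, `may_return_loan()` and `uncharge()` are hypothetical names, there is no locking, and none of the tweaks suggested above (waking OOM sleepers, restricting who may use the loan) are modelled. Only the 2MB cap and the "half of usage" bound come from the quoted patch.

```c
#include <stdint.h>

#define LOAN_MAX (2u * 1024u * 1024u)   /* same cap as the quoted patch */

/* Hypothetical stand-in for the memcg counters. */
struct group {
    uint64_t usage;   /* bytes currently accounted to the group */
    uint64_t loan;    /* bytes handed out as an emergency loan */
};

/*
 * Grant a loan at most once: lower the accounted usage by
 * min(LOAN_MAX, usage / 2) and remember the amount as outstanding.
 * Returns the amount granted, or 0 if a loan is already outstanding.
 */
static uint64_t make_loan(struct group *g)
{
    uint64_t amount = LOAN_MAX;

    if (g->loan)
        return 0;               /* no double loan */
    if (amount > g->usage / 2)
        amount = g->usage / 2;
    g->loan = amount;
    g->usage -= amount;
    return amount;
}

/*
 * On uncharge, repay the outstanding loan first. Returns the number of
 * bytes that may really be subtracted from the usage counter.
 */
static uint64_t may_return_loan(struct group *g, uint64_t val)
{
    uint64_t repay = g->loan < val ? g->loan : val;

    g->loan -= repay;
    return val - repay;
}

/* Uncharge path: only the part not absorbed by the loan lowers usage. */
static void uncharge(struct group *g, uint64_t bytes)
{
    g->usage -= may_return_loan(g, bytes);
}
```

In this model the loan makes the group look smaller than it is until later uncharges have paid it back, so the true usage is always `usage + loan`.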
I think that it should be mem_cgroup_out_of_memory who establishes the
loan and it can have a look at how much memory the killed task frees -
e.g. some portion of get_mm_rss() or a more precise but much more
expensive traversing via private vmas and check whether they charged
memory from the target memcg hierarchy (this is a slow path anyway).
But who knows maybe a fixed 2MB would work out as well.

Thanks!

> ==
> From 1a46318cf89e7df94bd4844f29105b61dacf335b Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki
> Date: Fri, 8 Feb 2013 10:43:52 +0900
> Subject: [PATCH] [Don't Apply][PATCH] memcg relax resource at OOM situation.
>
> When an OOM happens, a task is killed and resources will be freed.
>
> A problem here is that a task, which is oom-killed, may wait for
> some other resource in which memory resource is required. Some thread
> waits for free memory may holds some mutex and oom-killed process
> wait for the mutex.
>
> To avoid this, relaxing charged memory by giving virtual resource
> can be a help. The system can get back it at uncharge().
> This is a sample native implementation.
>
> Signed-off-by: KAMEZAWA Hiroyuki
> ---
>  mm/memcontrol.c | 79 ++++++++++++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 73 insertions(+), 6 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 25ac5f4..4dea49a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -301,6 +301,9 @@ struct mem_cgroup {
>  	/* set when res.limit == memsw.limit */
>  	bool memsw_is_minimum;
> +	/* extra resource at emergency situation */
> +	unsigned long loan;
> +	spinlock_t loan_lock;
>  	/* protect arrays of thresholds */
>  	struct mutex thresholds_lock;
> @@ -2034,6 +2037,61 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
>  	mem_cgroup_iter_break(root_memcg, victim);
>  	return total;
>  }
> +/*
> + * When a memcg is in OOM situation, this lack of resource may cause deadlock
> + * because of complicated lock dependency(i_mutex...). To avoid that, we
> + * need extra resource or avoid charging.
> + *
> + * A memcg can request resource in an emergency state. We call it as loan.
> + * A memcg will return a loan when it does uncharge resource. We disallow
> + * double-loan and moving task to other groups until the loan is fully
> + * returned.
> + *
> + * Note: the problem here is that we cannot know what amount resouce should
> + * be necessary to exiting an emergency state.....
> + */
> +#define LOAN_MAX (2 * 1024 * 1024)
> +
> +static void mem_cgroup_make_loan(struct mem_cgroup *memcg)
> +{
> +	u64 usage;
> +	unsigned long amount;
> +
> +	amount = LOAN_MAX;
> +
> +	usage = res_counter_read_u64(&memcg->res, RES_USAGE);
> +	if (amount > usage /2 )
> +		amount = usage / 2;
> +	spin_lock(&memcg->loan_lock);
> +	if (memcg->loan) {
> +		spin_unlock(&memcg->loan_lock);
> +		return;
> +	}
> +	memcg->loan = amount;
> +	res_counter_uncharge(&memcg->res, amount);
> +	if (do_swap_account)
> +		res_counter_uncharge(&memcg->memsw, amount);
> +	spin_unlock(&memcg->loan_lock);
> +}
> +
> +/* return amount of free resource which can be uncharged */
> +static unsigned long
> +mem_cgroup_may_return_loan(struct mem_cgroup *memcg, unsigned long val)
> +{
> +	unsigned long tmp;
> +	/* we don't care small race here */
> +	if (unlikely(!memcg->loan))
> +		return val;
> +	spin_lock(&memcg->loan_lock);
> +	if (memcg->loan) {
> +		tmp = min(memcg->loan, val);
> +		memcg->loan -= tmp;
> +		val -= tmp;
> +	}
> +	spin_unlock(&memcg->loan_lock);
> +	return val;
> +}
> +
>  /*
>   * Check OOM-Killer is already running under our hierarchy.
> @@ -2182,6 +2240,7 @@ static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask, > if (need_to_kill) { > finish_wait(&memcg_oom_waitq, &owait.wait); > mem_cgroup_out_of_memory(memcg, mask, order); > + mem_cgroup_make_loan(memcg); > } else { > schedule(); > finish_wait(&memcg_oom_waitq, &owait.wait); > @@ -2748,6 +2807,8 @@ static void __mem_cgroup_cancel_charge(struct mem_cgroup *memcg, > if (!mem_cgroup_is_root(memcg)) { > unsigned long bytes = nr_pages * PAGE_SIZE; > + bytes = mem_cgroup_may_return_loan(memcg, bytes); > + > res_counter_uncharge(&memcg->res, bytes); > if (do_swap_account) > res_counter_uncharge(&memcg->memsw, bytes); > @@ -3989,6 +4050,7 @@ static void mem_cgroup_do_uncharge(struct mem_cgroup *memcg, > { > struct memcg_batch_info *batch = NULL; > bool uncharge_memsw = true; > + unsigned long val; > /* If swapout, usage of swap doesn't decrease */ > if (!do_swap_account || ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT) > @@ -4029,9 +4091,11 @@ static void mem_cgroup_do_uncharge(struct mem_cgroup *memcg, > batch->memsw_nr_pages++; > return; > direct_uncharge: > - res_counter_uncharge(&memcg->res, nr_pages * PAGE_SIZE); > + val = nr_pages * PAGE_SIZE; > + val = mem_cgroup_may_return_loan(memcg, val); > + res_counter_uncharge(&memcg->res, val); > if (uncharge_memsw) > - res_counter_uncharge(&memcg->memsw, nr_pages * PAGE_SIZE); > + res_counter_uncharge(&memcg->memsw, val); > if (unlikely(batch->memcg != memcg)) > memcg_oom_recover(memcg); > } > @@ -4182,6 +4246,7 @@ void mem_cgroup_uncharge_start(void) > void mem_cgroup_uncharge_end(void) > { > struct memcg_batch_info *batch = ¤t->memcg_batch; > + unsigned long val; > if (!batch->do_batch) > return; > @@ -4192,16 +4257,16 @@ void mem_cgroup_uncharge_end(void) > if (!batch->memcg) > return; > + val = batch->nr_pages * PAGE_SIZE; > + val = mem_cgroup_may_return_loan(batch->memcg, val); > /* > * This "batch->memcg" is valid without any css_get/put etc... > * bacause we hide charges behind us. 
> */ > if (batch->nr_pages) > - res_counter_uncharge(&batch->memcg->res, > - batch->nr_pages * PAGE_SIZE); > + res_counter_uncharge(&batch->memcg->res, val); > if (batch->memsw_nr_pages) > - res_counter_uncharge(&batch->memcg->memsw, > - batch->memsw_nr_pages * PAGE_SIZE); > + res_counter_uncharge(&batch->memcg->memsw, val); > memcg_oom_recover(batch->memcg); > /* forget this pointer (for sanity check) */ > batch->memcg = NULL; > @@ -6291,6 +6356,8 @@ mem_cgroup_css_alloc(struct cgroup *cont) > memcg->move_charge_at_immigrate = 0; > mutex_init(&memcg->thresholds_lock); > spin_lock_init(&memcg->move_lock); > + memcg->loan = 0; > + spin_lock_init(&memcg->loan_lock); > return &memcg->css; > -- > 1.7.10.2 > > > > > > > > -- > To unsubscribe from this list: send the line "unsubscribe cgroups" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1946705Ab3BHQ3V (ORCPT ); Fri, 8 Feb 2013 11:29:21 -0500 Received: from cantor2.suse.de ([195.135.220.15]:39149 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1946552Ab3BHQ3U (ORCPT ); Fri, 8 Feb 2013 11:29:20 -0500 Date: Fri, 8 Feb 2013 17:29:18 +0100 From: Michal Hocko To: Greg Thelen Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20130208162918.GF7557@dhcp22.suse.cz> References: <20130125160723.FAE73567@pobox.sk> <20130125163130.GF4721@dhcp22.suse.cz> <20130205134937.GA22804@dhcp22.suse.cz> <20130205154947.CD6411E2@pobox.sk> <20130205160934.GB22804@dhcp22.suse.cz> <20130205174651.GA3959@dhcp22.suse.cz> <20130205185953.GB3959@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii 
Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu 07-02-13 20:27:00, Greg Thelen wrote: > On Tue, Feb 05 2013, Michal Hocko wrote: > > > On Tue 05-02-13 10:09:57, Greg Thelen wrote: > >> On Tue, Feb 05 2013, Michal Hocko wrote: > >> > >> > On Tue 05-02-13 08:48:23, Greg Thelen wrote: > >> >> On Tue, Feb 05 2013, Michal Hocko wrote: > >> >> > >> >> > On Tue 05-02-13 15:49:47, azurIt wrote: > >> >> > [...] > >> >> >> Just to be sure - am i supposed to apply this two patches? > >> >> >> http://watchdog.sk/lkml/patches/ > >> >> > > >> >> > 5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I > >> >> > mentioned in a follow up email. Here is the full patch: > >> >> > --- > >> >> > From f2bf8437d5b9bb38a95a432bf39f32c584955171 Mon Sep 17 00:00:00 2001 > >> >> > From: Michal Hocko > >> >> > Date: Mon, 26 Nov 2012 11:47:57 +0100 > >> >> > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked > >> >> > > >> >> > memcg oom killer might deadlock if the process which falls down to > >> >> > mem_cgroup_handle_oom holds a lock which prevents other task to > >> >> > terminate because it is blocked on the very same lock. > >> >> > This can happen when a write system call needs to allocate a page but > >> >> > the allocation hits the memcg hard limit and there is nothing to reclaim > >> >> > (e.g. there is no swap or swap limit is hit as well and all cache pages > >> >> > have been reclaimed already) and the process selected by memcg OOM > >> >> > killer is blocked on i_mutex on the same inode (e.g. truncate it). 
> >> >> > > >> >> > Process A > >> >> > [] do_truncate+0x58/0xa0 # takes i_mutex > >> >> > [] do_last+0x250/0xa30 > >> >> > [] path_openat+0xd7/0x440 > >> >> > [] do_filp_open+0x49/0xa0 > >> >> > [] do_sys_open+0x106/0x240 > >> >> > [] sys_open+0x20/0x30 > >> >> > [] system_call_fastpath+0x18/0x1d > >> >> > [] 0xffffffffffffffff > >> >> > > >> >> > Process B > >> >> > [] mem_cgroup_handle_oom+0x241/0x3b0 > >> >> > [] T.1146+0x5ab/0x5c0 > >> >> > [] mem_cgroup_cache_charge+0xbe/0xe0 > >> >> > [] add_to_page_cache_locked+0x4c/0x140 > >> >> > [] add_to_page_cache_lru+0x22/0x50 > >> >> > [] grab_cache_page_write_begin+0x8b/0xe0 > >> >> > [] ext3_write_begin+0x88/0x270 > >> >> > [] generic_file_buffered_write+0x116/0x290 > >> >> > [] __generic_file_aio_write+0x27c/0x480 > >> >> > [] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex > >> >> > [] do_sync_write+0xea/0x130 > >> >> > [] vfs_write+0xf3/0x1f0 > >> >> > [] sys_write+0x51/0x90 > >> >> > [] system_call_fastpath+0x18/0x1d > >> >> > [] 0xffffffffffffffff > >> >> > >> >> It looks like grab_cache_page_write_begin() passes __GFP_FS into > >> >> __page_cache_alloc() and mem_cgroup_cache_charge(). Which makes me > >> >> think that this deadlock is also possible in the page allocator even > >> >> before getting to add_to_page_cache_lru. no? > >> > > >> > I am not that familiar with VFS but i_mutex is a high level lock AFAIR > >> > and it shouldn't be called from the pageout path so __page_cache_alloc > >> > should be safe. > >> > >> I wasn't clear, sorry. My concern is not that pageout() grabs i_mutex. > >> My concern is that __page_cache_alloc() will invoke the oom killer and > >> select a victim which wants i_mutex. This victim will deadlock because > >> the oom killer caller already holds i_mutex. 
> > > > That would be true for the memcg oom because that one is blocking but > > the global oom just puts the allocator into sleep for a while and then > > the allocator should back off eventually (unless this is NOFAIL > > allocation). I would need to look closer whether this is really the case > > - I haven't seen that allocator code path for a while... > > I think the page allocator can loop forever waiting for an oom victim to > terminate even without NOFAIL. Especially if the oom victim wants a > resource exclusively held by the allocating thread (e.g. i_mutex). It > looks like the same deadlock you describe is also possible (though more > rare) without memcg. OK, I have checked the allocator slow path and you are right even GFP_KERNEL will not fail. This can lead to similar deadlocks - e.g. OOM killed task blocked on down_write(mmap_sem) while the page fault handler holding mmap_sem for reading and allocating a new page without any progress. Luckily there are memory reserves where the allocator fall back eventually so the allocation should be able to get some memory and release the lock. There is still a theoretical chance this would block though. This sounds like a corner case though so I wouldn't care about it very much. > If the looping thread is an eligible oom victim (i.e. not oom disabled, > not an kernel thread, etc) then the page allocator can return NULL in so > long as NOFAIL is not used. So any allocator which is able to call the > oom killer and is not oom disabled (kernel thread, etc) is already > exposed to the possibility of page allocator failure. So if the page > allocator could detect the deadlock, then it could safely return NULL. > Maybe after looping N times without forward progress the page allocator > should consider failing unless NOFAIL is given. page allocator is quite tricky to touch and the chances of this deadlock are not that big. > if memcg oom kill has been tried a reasonable number of times. 
Simply > failing the memcg charge with ENOMEM seems easier to support than > exceeding limit (Kame's loan patch). We cannot do that in the page fault path because this would lead to a global oom killer. We would need to either retry the page fault or send KILL to the faulting process. But I do not like this much as this could lead to DoS attacks. -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1946632Ab3BHQlA (ORCPT ); Fri, 8 Feb 2013 11:41:00 -0500 Received: from cantor2.suse.de ([195.135.220.15]:39487 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1946537Ab3BHQk6 (ORCPT ); Fri, 8 Feb 2013 11:40:58 -0500 Date: Fri, 8 Feb 2013 17:40:56 +0100 From: Michal Hocko To: Greg Thelen Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Message-ID: <20130208164056.GG7557@dhcp22.suse.cz> References: <20130125163130.GF4721@dhcp22.suse.cz> <20130205134937.GA22804@dhcp22.suse.cz> <20130205154947.CD6411E2@pobox.sk> <20130205160934.GB22804@dhcp22.suse.cz> <20130205174651.GA3959@dhcp22.suse.cz> <20130205185953.GB3959@dhcp22.suse.cz> <20130208162918.GF7557@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130208162918.GF7557@dhcp22.suse.cz> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri 08-02-13 17:29:18, Michal Hocko wrote: [...] > OK, I have checked the allocator slow path and you are right even > GFP_KERNEL will not fail. This can lead to similar deadlocks - e.g. > OOM killed task blocked on down_write(mmap_sem) while the page fault > handler holding mmap_sem for reading and allocating a new page without > any progress. 
And now that I think about it some more it sounds like it shouldn't be possible because allocator would fail because it would see TIF_MEMDIE (OOM killer kills all threads that share the same mm). But maybe there are other locks that are dangerous, but I think that the risk is pretty low. -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1760214Ab3BHRKR (ORCPT ); Fri, 8 Feb 2013 12:10:17 -0500 Received: from cantor2.suse.de ([195.135.220.15]:40639 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753555Ab3BHRKP (ORCPT ); Fri, 8 Feb 2013 12:10:15 -0500 Date: Fri, 8 Feb 2013 18:10:12 +0100 From: Michal Hocko To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set Message-ID: <20130208171012.GH7557@dhcp22.suse.cz> References: <20130206140119.GD10254@dhcp22.suse.cz> <20130206142219.GF10254@dhcp22.suse.cz> <20130206160051.GG10254@dhcp22.suse.cz> <20130208060304.799F362F@pobox.sk> <20130208094420.GA7557@dhcp22.suse.cz> <20130208120249.FD733220@pobox.sk> <20130208123854.GB7557@dhcp22.suse.cz> <20130208145616.FB78CE24@pobox.sk> <20130208152402.GD7557@dhcp22.suse.cz> <20130208165805.8908B143@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130208165805.8908B143@pobox.sk> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri 08-02-13 16:58:05, azurIt wrote: [...] 
> I took the kernel log from yesterday from the same time frame:
>
> $ grep "killed as a result of limit" kern2.log | sed 's@.*\] @@' | sort | uniq -c | sort -k1 -n
>       1 Task in /1252/uid killed as a result of limit of /1252
>       1 Task in /1709/uid killed as a result of limit of /1709
>       2 Task in /1185/uid killed as a result of limit of /1185
>       2 Task in /1388/uid killed as a result of limit of /1388
>       2 Task in /1567/uid killed as a result of limit of /1567
>       2 Task in /1650/uid killed as a result of limit of /1650
>       3 Task in /1527/uid killed as a result of limit of /1527
>       5 Task in /1552/uid killed as a result of limit of /1552
>    1634 Task in /1258/uid killed as a result of limit of /1258
>
> As you can see, there were many more OOMs in '1258' and no such
> problems like this night (well, there were never such problems before
> :) ).

Well, all the patch does is prevent the deadlock we have seen earlier.
Previously the writer would block on the oom wait queue while it fails
with ENOMEM now. Caller sees this as a short write which can be retried
(it is a question whether userspace can cope with that properly). All
other OOMs are preserved.

I suspect that all the problems you are seeing now are just side effects
of the OOM conditions.

> As i said, cgroup 1258 was freezing every few minutes with your
> latest patch so there must be something wrong (it usually freezes
> about once per day). And it was really frozen (i checked that), the
> symptoms were:

I assume you have checked that the killed processes eventually die,
right?

> - cannot strace any of cgroup processes
> - no new processes were started, still the same processes were 'running'
> - kernel was unable to resolve this on its own
> - all processes together were taking 100% CPU
> - the whole memory limit was used
> (see memcg-bug-4.tar.gz for more info)

Well, I do not see anything suspicious during that time period
(timestamps translate between Fri Feb 8 02:34:05 and Fri Feb 8 02:36:48).
The kernel log shows a lot of oom during that time. All killed processes die eventually. > Unfortunately i forget to check if killing only few of the processes > will resolve it (i always killed them all yesterday night). Don't > know if is was in deadlock or not but kernel was definitely unable > to resolve the problem. Nothing shows it would be a deadlock so far. It is well possible that the userspace went mad when seeing a lot of processes dying because it doesn't expect it. > And there is still a mystery of two freezed processes which cannot be > killed. > > By the way, i KNOW that so much OOM is not healthy but the client > simply don't want to buy more memory. He knows about the problem of > unsufficient memory limit. Well, then you would see a permanent flood of OOM killing, I am afraid. -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1947149Ab3BHVCu (ORCPT ); Fri, 8 Feb 2013 16:02:50 -0500 Received: from gmmr8.centrum.cz ([46.255.227.254]:41941 "EHLO gmmr8.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1947069Ab3BHVCq (ORCPT ); Fri, 8 Feb 2013 16:02:46 -0500 To: =?utf-8?q?Michal_Hocko?= Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_if_PF=5FNO=5FMEMCG=5FOOM_is_set?= Date: Fri, 08 Feb 2013 22:02:43 +0100 From: "azurIt" Cc: , , =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= References: <20130206140119.GD10254@dhcp22.suse.cz>, <20130206142219.GF10254@dhcp22.suse.cz>, <20130206160051.GG10254@dhcp22.suse.cz>, <20130208060304.799F362F@pobox.sk>, <20130208094420.GA7557@dhcp22.suse.cz>, <20130208120249.FD733220@pobox.sk>, <20130208123854.GB7557@dhcp22.suse.cz>, <20130208145616.FB78CE24@pobox.sk>, <20130208152402.GD7557@dhcp22.suse.cz>, <20130208165805.8908B143@pobox.sk> <20130208171012.GH7557@dhcp22.suse.cz> In-Reply-To: <20130208171012.GH7557@dhcp22.suse.cz> 
X-Mailer: Centrum Email 5.3
X-Priority: 3
X-Original-From: azurit@pobox.sk
MIME-Version: 1.0
Message-Id: <20130208220243.EDEE0825@pobox.sk>
X-Maser: Georgo
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

>
>I assume you have checked that the killed processes eventually die,
>right?
>

When i killed them by hand, yes, they disappeared from the process list (i saw it). I don't know if they really died when OOM killed them.

>Well, I do not see anything suspicious during that time period
>(timestamps translate between Fri Feb 8 02:34:05 and Fri Feb 8
>02:36:48). The kernel log shows a lot of oom during that time. All
>killed processes die eventually.

No, they were not killed by OOM while the cgroup was frozen. Just check PIDs from memcg-bug-4.tar.gz and try to find them in kernel log. Why are all PIDs waiting on 'mem_cgroup_handle_oom' and there is no OOM message in the log? Data in memcg-bug-4.tar.gz are only for 2 minutes but i let it run for about 15-20 minutes, not a single process was killed by OOM. I'm 100% sure that OOM was not killing them (maybe it was trying to but it didn't happen).

>
>Nothing shows it would be a deadlock so far. It is well possible that
>the userspace went mad when seeing a lot of processes dying because it
>doesn't expect it.
>

Lots of processes are dying also now, without your latest patch, and no such things are happening. I'm sure there is something more to this, maybe it revealed another bug?
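The cross-check suggested above (take the PIDs from the snapshots and look for them in the kernel log) could be sketched in Python roughly as follows. This is only an illustration under assumptions: the snapshots are taken to be unpacked as <timestamp>/<pid>/stack under one directory, kill events are taken to be logged as "Memory cgroup out of memory: Kill process <pid>", and all function names are made up for the example.

```python
import re
from pathlib import Path

# Assumed log format of a memcg OOM kill event (copied from the messages in this thread).
KILL_RE = re.compile(r"Memory cgroup out of memory: Kill process (\d+) ")


def killed_pids(log_lines):
    """Return the set of PIDs named in memcg OOM 'Kill process' messages."""
    pids = set()
    for line in log_lines:
        match = KILL_RE.search(line)
        if match:
            pids.add(int(match.group(1)))
    return pids


def snapshot_pids(root):
    """Return the PIDs that have a <timestamp>/<pid>/stack file under root.

    The directory layout is an assumption, not something the archive documents.
    """
    return {
        int(path.parent.name)
        for path in Path(root).glob("*/*/stack")
        if path.parent.name.isdigit()
    }


def split_by_kill(snap, killed):
    """Split snapshot PIDs into (reported as killed, never reported as killed)."""
    return sorted(snap & killed), sorted(snap - killed)
```

Feeding it the lines of kern2.log and the unpacked snapshot directory would give one list of snapshot PIDs that the OOM killer reported and one list of PIDs it never mentioned.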
azur From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759166Ab3BJPDU (ORCPT ); Sun, 10 Feb 2013 10:03:20 -0500 Received: from mail-ea0-f172.google.com ([209.85.215.172]:36657 "EHLO mail-ea0-f172.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755738Ab3BJPDS (ORCPT ); Sun, 10 Feb 2013 10:03:18 -0500 Date: Sun, 10 Feb 2013 16:03:13 +0100 From: Michal Hocko To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set Message-ID: <20130210150310.GA9504@dhcp22.suse.cz> References: <20130206160051.GG10254@dhcp22.suse.cz> <20130208060304.799F362F@pobox.sk> <20130208094420.GA7557@dhcp22.suse.cz> <20130208120249.FD733220@pobox.sk> <20130208123854.GB7557@dhcp22.suse.cz> <20130208145616.FB78CE24@pobox.sk> <20130208152402.GD7557@dhcp22.suse.cz> <20130208165805.8908B143@pobox.sk> <20130208171012.GH7557@dhcp22.suse.cz> <20130208220243.EDEE0825@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130208220243.EDEE0825@pobox.sk> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri 08-02-13 22:02:43, azurIt wrote: > > > >I assume you have checked that the killed processes eventually die, > >right? > > > When i killed them by hand, yes, they dissappeard from process list (i > saw it). I don't know if they really died when OOM killed them. > > > >Well, I do not see anything supsicious during that time period > >(timestamps translate between Fri Feb 8 02:34:05 and Fri Feb 8 > >02:36:48). The kernel log shows a lot of oom during that time. All > >killed processes die eventually. > > > No, they didn't died by OOM when cgroup was freezed. 
> Just check PIDs from memcg-bug-4.tar.gz and try to find them in kernel log.

OK, you seem to be right. My initial examination showed that each cgroup
under OOM was able to move forward - in other words it was able to send
SIGKILL to somebody and we didn't loop on a single task which cannot die
for some reason.

Now when looking closer it seems we really have 2 tasks which didn't die
after being killed by OOM killer:
$ for i in `grep "Memory cgroup out of memory:" kern2.log | sed 's@.*Kill process \([0-9]*\) .*@\1@'`; do find bug -name $i; done | sed 's@.*/@@' | sort | uniq -c
    141 18211
    141 8102

$ md5sum bug/*/18211/stack | cut -d" " -f1 | uniq -c
    141 3b8ce17e82a065a24ee046112033e1e8

So all the stacks are the same:
[] ptrace_stop+0x114/0x290
[] ptrace_do_notify+0x88/0xa0
[] ptrace_notify+0x53/0x70
[] syscall_trace_enter+0xf8/0x1c0
[] tracesys+0x71/0xd7
[] 0xffffffffffffffff

stuck in the ptrace code.

The other task is more interesting:
$ md5sum bug/*/8102/stack | cut -d" " -f1 | sort | uniq -c
    135 042e893c0e6657ed321ea9045e528f3e
      6 dc7e71ce73be2a5c73404b565926e709

All snapshots with 042e893c0e6657ed321ea9045e528f3e are in:
[] mem_cgroup_handle_oom+0x241/0x3b0
[] T.1149+0x5f3/0x600
[] mem_cgroup_charge_common+0x6c/0xb0
[] mem_cgroup_newpage_charge+0x45/0x50
[] handle_pte_fault+0x609/0x940
[] handle_mm_fault+0x138/0x260
[] do_page_fault+0x13d/0x460
[] page_fault+0x1f/0x30
[] 0xffffffffffffffff

While the others do not show any stack:
cat 1360287257/8102/stack
[] 0xffffffffffffffff

Which is quite interesting because we are talking about snapshots
starting at 1360287245 (which maps to 02:34:05) but the kern2.log tells
us that this process has been killed much earlier at:
Feb 8 01:18:30 server01 kernel: [ 511.139921] Task in /1293/uid killed as a result of limit of /1293
[...]
Feb 8 01:18:30 server01 kernel: [ 511.229755] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
Feb 8 01:18:30 server01 kernel: [ 511.230146] [ 8102] 1293 8102 170258 65869 7 0 0 apache2
Feb 8 01:18:30 server01 kernel: [ 511.230339] [ 8113] 1293 8113 163756 59442 5 0 0 apache2
Feb 8 01:18:30 server01 kernel: [ 511.230528] [ 8116] 1293 8116 170094 65675 2 0 0 apache2
Feb 8 01:18:30 server01 kernel: [ 511.230726] [ 8119] 1293 8119 170094 65675 6 0 0 apache2
Feb 8 01:18:30 server01 kernel: [ 511.230924] [ 8123] 1293 8123 169070 64612 7 0 0 apache2
Feb 8 01:18:30 server01 kernel: [ 511.231132] [ 8124] 1293 8124 170094 65675 5 0 0 apache2
Feb 8 01:18:30 server01 kernel: [ 511.231321] [ 8125] 1293 8125 170094 65673 1 0 0 apache2
Feb 8 01:18:30 server01 kernel: [ 511.231516] Memory cgroup out of memory: Kill process 8102 (apache2) score 1000 or sacrifice child

This would suggest that the task is hung and cannot be killed but if we
have a look at the following OOM in the same group 1293 it was _not_
present in the process list for that group:
Feb 8 01:18:33 server01 kernel: [ 514.789550] Task in /1293/uid killed as a result of limit of /1293
[...]
Feb 8 01:18:33 server01 kernel: [ 514.893198] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
Feb 8 01:18:33 server01 kernel: [ 514.893594] [ 8113] 1293 8113 168212 64036 1 0 0 apache2
Feb 8 01:18:33 server01 kernel: [ 514.893786] [ 8116] 1293 8116 170258 65870 6 0 0 apache2
Feb 8 01:18:33 server01 kernel: [ 514.893976] [ 8119] 1293 8119 170258 65870 7 0 0 apache2
Feb 8 01:18:33 server01 kernel: [ 514.894166] [ 8123] 1293 8123 170158 65824 6 0 0 apache2
Feb 8 01:18:33 server01 kernel: [ 514.894356] [ 8124] 1293 8124 170258 65870 5 0 0 apache2
Feb 8 01:18:33 server01 kernel: [ 514.894547] [ 8125] 1293 8125 170158 65824 1 0 0 apache2
Feb 8 01:18:33 server01 kernel: [ 514.894749] [ 8149] 1293 8149 163989 59647 7 0 0 apache2
Feb 8 01:18:33 server01 kernel: [ 514.894944] Memory cgroup out of memory: Kill process 8113 (apache2) score 1000 or sacrifice child

This is all _before_ you started collecting stacks and it also says that
8102 is gone.

This all suggests that
a) stack unwinder which displays /proc//stack is somehow confused and
   it doesn't show the correct stack for this process and
b) the two processes cannot terminate due to some issue related to
   ptrace (stracing) the dying process.

The above oom list doesn't include any processes which already released
the memory which would explain why you still can see it as a member of
the group (when looking into cgroup/tasks file). My guess would be that
there is a bug in ptrace which doesn't free a reference to the task so
it cannot go away although it has dropped all the resources already.

> Why are all PIDs waiting on 'mem_cgroup_handle_oom' and there is no
> OOM message in the log?

I am not sure what you mean here but there are
$ grep "Memory cgroup out of memory:" kern2.collected.log | wc -l
16

OOM killer events during the time you were gathering memcg-bug-4 data.

> Data in memcg-bug-4.tar.gz are only for 2
> minutes but i let it run for about 15-20 minutes, no single process
> killed by OOM.
I can see
$ grep "Memory cgroup out of memory:" kern2.after.log | wc -l
57

killed after 02:38:47 when you stopped gathering data for memcg-bug-4

> I'm 100% sure that OOM was not killing them (maybe it was trying to
> but it didn't happen).

OK, let's do a little exercise. The processes eligible for OOM are
listed before any task is killed. So if we collect both pid lists and
"Kill process" messages per pid then no entries in the pid list should
be present after the specific pid is killed.

$ mkdir out
$ for i in `grep "Memory cgroup out of memory: Kill process" kern2.log | sed 's@.*Kill process \([0-9]*\) .*@\1@'`
do
	grep -e "Memory cgroup out of memory: Kill process $i" \
		-e "\[ *\<$i\]" kern2.log > out/$i
done
$ for i in out/*
do
	tail -n1 $i | grep "Memory cgroup out of memory:" >/dev/null|| echo "$i has already killed tasks"
done
out/6698 has already killed tasks
out/6703 has already killed tasks

OK, so there are two pids which were listed after they have been killed.
Let's have a look at them.
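For reference, the same exercise can be sketched in Python in a single pass over the log. This is a rough illustration, not the method used for the numbers in this mail: it assumes the task dump lines look like "[ pid] uid tgid ..." and the kill lines like "Memory cgroup out of memory: Kill process <pid>", and the function name is made up. It also keeps the uid, so a later entry can be compared with the one seen at kill time.

```python
import re

# Assumed formats, taken from the kernel log excerpts quoted in this thread.
KILL_RE = re.compile(r"Memory cgroup out of memory: Kill process (\d+) ")
# Matches "[ 8102] 1293 8102 ..." but not the "[ 511.230146]" timestamp.
TASK_RE = re.compile(r"\[\s*(\d+)\]\s+(\d+)\s+(\d+)\s")


def listed_after_kill(log_lines):
    """Find PIDs that show up in an OOM task dump after being killed.

    Returns {pid: (uid_at_kill, [uids seen in later dumps])}.
    """
    last_uid = {}      # pid -> uid from the most recent task dump line
    uid_at_kill = {}   # pid -> uid the task had when it was selected
    later = {}         # pid -> uids seen in dumps after the kill message
    for line in log_lines:
        kill = KILL_RE.search(line)
        if kill:
            pid = int(kill.group(1))
            uid_at_kill[pid] = last_uid.get(pid)
            continue
        task = TASK_RE.search(line)
        if task:
            pid, uid = int(task.group(1)), int(task.group(2))
            last_uid[pid] = uid
            if pid in uid_at_kill:
                later.setdefault(pid, []).append(uid)
    return {pid: (uid_at_kill[pid], uids) for pid, uids in later.items()}
```

Calling listed_after_kill() on the lines of kern2.log should report the same pids as the shell loops, under the format assumptions stated above.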
$ cat out/6698
Feb 8 01:17:04 server01 kernel: [ 425.497924] [ 6698] 1293 6698 170258 65846 1 0 0 apache2
Feb 8 01:17:05 server01 kernel: [ 426.079010] [ 6698] 1293 6698 170258 65846 1 0 0 apache2
Feb 8 01:17:10 server01 kernel: [ 431.144460] [ 6698] 1293 6698 169358 65220 1 0 0 apache2
Feb 8 01:17:10 server01 kernel: [ 431.146058] Memory cgroup out of memory: Kill process 6698 (apache2) score 1000 or sacrifice child
Feb 8 03:27:57 server01 kernel: [ 8278.439896] [ 6698] 1020 6698 168518 64219 0 0 0 apache2
Feb 8 03:27:57 server01 kernel: [ 8278.879439] [ 6698] 1020 6698 168518 64218 6 0 0 apache2
Feb 8 03:27:59 server01 kernel: [ 8280.023944] [ 6698] 1020 6698 168816 64540 7 0 0 apache2
Feb 8 03:28:02 server01 kernel: [ 8283.242282] [ 6698] 1020 6698 171953 67751 6 0 0 apache2

$ cat out/6703
Feb 8 01:17:04 server01 kernel: [ 425.498118] [ 6703] 1293 6703 170258 65844 6 0 0 apache2
Feb 8 01:17:05 server01 kernel: [ 426.079206] [ 6703] 1293 6703 170258 65844 6 0 0 apache2
Feb 8 01:17:10 server01 kernel: [ 431.144653] [ 6703] 1293 6703 169358 65219 2 0 0 apache2
Feb 8 01:17:10 server01 kernel: [ 431.258924] [ 6703] 1293 6703 169358 65219 5 0 0 apache2
Feb 8 01:17:10 server01 kernel: [ 431.260282] Memory cgroup out of memory: Kill process 6703 (apache2) score 1000 or sacrifice child
Feb 8 03:27:57 server01 kernel: [ 8278.440043] [ 6703] 1020 6703 166286 61978 7 0 0 apache2
Feb 8 03:27:57 server01 kernel: [ 8278.879587] [ 6703] 1020 6703 166286 61977 7 0 0 apache2
Feb 8 03:27:59 server01 kernel: [ 8280.024091] [ 6703] 1020 6703 166484 62233 7 0 0 apache2
Feb 8 03:28:02 server01 kernel: [ 8283.242429] [ 6703] 1020 6703 167402 63118 0 0 0 apache2

Lists have the following columns:
[ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name

As we can see the uid changed for both pids after it has been killed
(from 1293 to 1020) which suggests that the pid has been reused later
for a different user (which is a clear sign that those pids died) -
thus different group in your
setup. So those two died as well, apparently. > >Nothing shows it would be a deadlock so far. It is well possible that > >the userspace went mad when seeing a lot of processes dying because it > >doesn't expect it. > > Lots of processes are dying also now, without your latest patch, and > no such things are happening. I'm sure there is something more it > this, maybe it revealed another bug? So far nothing shows that there would be anything broken wrt. memcg OOM killer. The ptrace issue sounds strange, all right, but that is another story and worth a separate investigation. I would be interested whether you still see anything wrong going on without that in game. You can get pretty nice overview of what is going on wrt. OOM from the log. -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755963Ab3BJQq0 (ORCPT ); Sun, 10 Feb 2013 11:46:26 -0500 Received: from gmmr1.centrum.cz ([46.255.225.252]:35021 "EHLO gmmr1.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755164Ab3BJQqY (ORCPT ); Sun, 10 Feb 2013 11:46:24 -0500 To: =?utf-8?q?Michal_Hocko?= Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_if_PF=5FNO=5FMEMCG=5FOOM_is_set?= Date: Sun, 10 Feb 2013 17:46:19 +0100 From: "azurIt" Cc: , , =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= References: <20130206160051.GG10254@dhcp22.suse.cz>, <20130208060304.799F362F@pobox.sk>, <20130208094420.GA7557@dhcp22.suse.cz>, <20130208120249.FD733220@pobox.sk>, <20130208123854.GB7557@dhcp22.suse.cz>, <20130208145616.FB78CE24@pobox.sk>, <20130208152402.GD7557@dhcp22.suse.cz>, <20130208165805.8908B143@pobox.sk>, <20130208171012.GH7557@dhcp22.suse.cz>, <20130208220243.EDEE0825@pobox.sk> <20130210150310.GA9504@dhcp22.suse.cz> In-Reply-To: <20130210150310.GA9504@dhcp22.suse.cz> X-Mailer: Centrum Email 5.3 X-Priority: 3 X-Original-From: 
azurit@pobox.sk MIME-Version: 1.0 Message-Id: <20130210174619.24F20488@pobox.sk> X-Maser: Georgo Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org

>stuck in the ptrace code.

But this happened _after_ the cgroup was frozen and I tried to strace one of its processes (to see what's happening):

Feb 8 01:29:46 server01 kernel: [ 1187.540672] grsec: From 178.40.250.111: process /usr/lib/apache2/mpm-itk/apache2(apache2:18211) attached to via ptrace by /usr/bin/strace[strace:18258] uid/euid:0/0 gid/egid:0/0, parent /usr/bin/htop[htop:2901] uid/euid:0/0 gid/egid:0/0

>> Why are all PIDs waiting on 'mem_cgroup_handle_oom' and there is no
>> OOM message in the log?
>
>I am not sure what you mean here but there are
>$ grep "Memory cgroup out of memory:" kern2.collected.log | wc -l
>16
>
>OOM killer events during the time you were gathering memcg-bug-4 data.
>
>> Data in memcg-bug-4.tar.gz are only for 2
>> minutes but i let it run for about 15-20 minutes, no single process
>> killed by OOM.
>
>I can see
>$ grep "Memory cgroup out of memory:" kern2.after.log | wc -l
>57
>
>killed after 02:38:47 when you stopped gathering data for memcg-bug-4

I meant that not a single process was killed inside cgroup 1258 (data from this cgroup are in memcg-bug-4.tar.gz).

Just take the data from memcg-bug-4.tar.gz, which were taken from cgroup 1258. Almost all processes are in 'mem_cgroup_handle_oom', so the cgroup is under OOM. I assume that this is supposed to take only a few seconds while the kernel finds a process and kills it (and maybe does it again until enough memory is freed). I was gathering the data for about two and a half minutes and NOT A SINGLE process was killed (just compare the list of PIDs from the first and the last directory inside memcg-bug-4.tar.gz). What is more, not a single process was killed in cgroup 1258 after I stopped gathering the data either.
You can also take the list of PIDs from memcg-bug-4.tar.gz and you will find only 18211 and 8102 (which are the two stuck processes).

So my question is: Why was no process killed inside cgroup 1258 while it was under OOM? It was under OOM for at least two and a half minutes while I was gathering the data (then I let it run for roughly 10 more minutes and then killed the processes by hand, but I cannot prove this). Why didn't the kernel kill any process for so long and end the OOM?

Btw, processes in cgroup 1258 (memcg-bug-4.tar.gz) are looping in these two states (I pasted only the first line of the stack):
mem_cgroup_handle_oom+0x241/0x3b0
0xffffffffffffffff

Some of them are in 'poll_schedule_timeout' and then they start to loop as above. Is this correct behavior? For example, do (first line of stack from process 7710 from all timestamps):
for i in */7710/stack; do head -n1 $i; done

From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755864Ab3BKLWp (ORCPT ); Mon, 11 Feb 2013 06:22:45 -0500 Received: from cantor2.suse.de ([195.135.220.15]:53789 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755449Ab3BKLWn (ORCPT ); Mon, 11 Feb 2013 06:22:43 -0500 Date: Mon, 11 Feb 2013 12:22:40 +0100 From: Michal Hocko To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set Message-ID: <20130211112240.GC19922@dhcp22.suse.cz> References: <20130208094420.GA7557@dhcp22.suse.cz> <20130208120249.FD733220@pobox.sk> <20130208123854.GB7557@dhcp22.suse.cz> <20130208145616.FB78CE24@pobox.sk> <20130208152402.GD7557@dhcp22.suse.cz> <20130208165805.8908B143@pobox.sk> <20130208171012.GH7557@dhcp22.suse.cz> <20130208220243.EDEE0825@pobox.sk> <20130210150310.GA9504@dhcp22.suse.cz> <20130210174619.24F20488@pobox.sk> MIME-Version: 1.0 Content-Type:
text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130210174619.24F20488@pobox.sk> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sun 10-02-13 17:46:19, azurIt wrote: > >stuck in the ptrace code. > > > But this happens _after_ the cgroup was freezed and i tried to strace > one of it's processes (to see what's happening): > > Feb 8 01:29:46 server01 kernel: [ 1187.540672] grsec: From 178.40.250.111: process /usr/lib/apache2/mpm-itk/apache2(apache2:18211) attached to via ptrace by /usr/bin/strace[strace:18258] uid/euid:0/0 gid/egid:0/0, parent /usr/bin/htop[htop:2901] uid/euid:0/0 gid/egid:0/0 Hmmm, Feb 8 01:39:16 server01 kernel: [ 1757.266678] Memory cgroup out of memory: Kill process 18211 (apache2) score 725 or sacrifice child) So the process has been killed 10 minutes ago and this was really the last OOM event for group /1258: $ grep "Task in /1258/uid killed" kern2.log | tail -n2 Feb 8 01:39:16 server01 kernel: [ 1757.045021] Task in /1258/uid killed as a result of limit of /1258 Feb 8 01:39:16 server01 kernel: [ 1757.167984] Task in /1258/uid killed as a result of limit of /1258 But this was still before you started collecting data for memcg-bug-4 (2:34) so we do not know what was the previous stack unfortunatelly. > >> Why are all PIDs waiting on 'mem_cgroup_handle_oom' and there is no > >> OOM message in the log? > > > >I am not sure what you mean here but there are > >$ grep "Memory cgroup out of memory:" kern2.collected.log | wc -l > >16 > > > >OOM killer events during the time you were gathering memcg-bug-4 data. > > > >> Data in memcg-bug-4.tar.gz are only for 2 > >> minutes but i let it run for about 15-20 minutes, no single process > >> killed by OOM. 
> > > >I can see > >$ grep "Memory cgroup out of memory:" kern2.after.log | wc -l > >57 > > > >killed after 02:38:47 when you stopped gathering data for memcg-bug-4 > > > I meant no single process was killed inside cgroup 1258 (data from > this cgroup are in memcg-bug-4.tar.gz). > > Just get data from memcg-bug-4.tar.gz which were taken from cgroup > 1258. Are you sure about that? When I extracted all pids from timestamp directories and greped them in the log I got this: for i in `cat bug/pids` ; do grep "\[ *\<$i\>\]" kern2.log ; done Feb 8 01:31:02 server01 kernel: [ 1263.429212] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:31:15 server01 kernel: [ 1276.655241] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:32:29 server01 kernel: [ 1350.797835] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:32:42 server01 kernel: [ 1363.662242] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:32:46 server01 kernel: [ 1367.181798] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:32:46 server01 kernel: [ 1367.381627] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:32:46 server01 kernel: [ 1367.490896] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:33:02 server01 kernel: [ 1383.709652] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:36:26 server01 kernel: [ 1587.458967] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:36:26 server01 kernel: [ 1587.558419] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:36:26 server01 kernel: [ 1587.652474] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:39:02 server01 kernel: [ 1743.107086] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:39:16 server01 kernel: [ 1757.015359] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:39:16 server01 kernel: [ 1757.133998] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:39:16 server01 kernel: [ 1757.262992] [18211] 1258 18211 164338 60950 0 0 0 apache2 Feb 8 01:18:12 server01 kernel: [ 493.156641] [ 7888] 1293 
7888 169326 64876 3 0 0 apache2 Feb 8 01:18:12 server01 kernel: [ 493.269129] [ 7888] 1293 7888 169390 64876 4 0 0 apache2 Feb 8 01:18:21 server01 kernel: [ 502.384221] [ 8011] 1293 8011 170094 65675 5 0 0 apache2 Feb 8 01:18:24 server01 kernel: [ 505.052600] [ 8011] 1293 8011 170260 65854 2 0 0 apache2 Feb 8 01:18:24 server01 kernel: [ 505.200454] [ 8011] 1293 8011 170260 65854 2 0 0 apache2 Feb 8 01:18:33 server01 kernel: [ 514.538637] [ 8054] 1258 8054 164404 60618 1 0 0 apache2 Feb 8 01:18:30 server01 kernel: [ 511.230146] [ 8102] 1293 8102 170258 65869 7 0 0 apache2 So at least 7888, 8011 and 8102 were from a different group (1293). Others were never listed in the eligible processes list which is a bit unexpected. It is also unfortunate because I cannot match them to their groups from the log. $ for i in `cat bug/pids` ; do grep "\[ *\<$i\>\]" kern2.log >/dev/null || echo "$i not listed" ; done 7265 not listed 7474 not listed 7710 not listed 7969 not listed 7988 not listed 7997 not listed 8000 not listed 8014 not listed 8016 not listed 8019 not listed 8057 not listed 8058 not listed 8059 not listed 8063 not listed 8064 not listed 8066 not listed 8067 not listed 8069 not listed 8070 not listed 8071 not listed 8072 not listed 8075 not listed 8091 not listed 8092 not listed 8094 not listed 8098 not listed 8099 not listed 8100 not listed Are you sure all of them belong to 1258 group? > Almost all processes are in 'mem_cgroup_handle_oom' so cgroup > is under OOM. You are right, almost all of them are waiting in mem_cgroup_handle_oom which suggest that they should be listed in a per group eligible tasks list. One way how this might happen is when a process which manages to get oom_lock has a fatal signal pending. Then we wouldn't get to oom_kill_process and no OOM messages would get printed. This is correct because such a task would terminate soon anyway and all the waiters would wake up eventually. 
If not enough memory would be freed another task would get the oom_lock and this one would trigger OOM (unless it has fatal signal pending as well). Another option would be that no task could be selected - e.g. because select_bad_process sees TIF_MEMDIE marked task - the one already killed by OOM killer but that wasn't able to terminate for some reason. 18211 could be such a task. But we do not know what was going on with it before strace attached to it. Finally it is possible that the OOM header (everything up to Kill process) was suppressed because of rate limiting. But $ grep -B1 "Kill process" kern2.log Feb 8 01:15:02 server01 kernel: [ 304.000402] [ 4969] 1258 4969 163761 59554 6 0 0 apache2 Feb 8 01:15:02 server01 kernel: [ 304.000649] Memory cgroup out of memory: Kill process 4816 (apache2) score 1000 or sacrifice child -- Feb 8 01:15:51 server01 kernel: [ 352.924573] [ 5847] 1709 5847 163433 58952 6 0 0 apache2 Feb 8 01:15:51 server01 kernel: [ 352.924761] Memory cgroup out of memory: Kill process 5212 (apache2) score 1000 or sacrifice child [...] says that the message was preceded by a process list so we can exclude rate limiting. > I assume that this is suppose to take only few seconds > while kernel finds any process and kill it (and maybe do it again > until enough of memory is freed). I was gathering the data for > about 2 and a half minutes and NO SINGLE process was killed (just > compate list of PIDs from the first and the last directory inside > memcg-bug-4.tar.gz). Even more, no single process was killed in cgroup > 1258 also after i stopped gathering the data. You can also take the > list od PID from memcg-bug-4.tar.gz and you will find only 18211 and > 8102 (which are the two stucked processes). > > So my question is: Why no process was killed inside cgroup 1258 > while it was under OOM? I would bet that there is something weird going on with pid:18211. But I do not have enough information to find out what and why. 
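
The rate-limiting check done by eye with `grep -B1 "Kill process"` can also be automated. The following is only a sketch: the function name `kills_without_tasklist` is made up for illustration, and the pattern for a task-list line is an assumption derived from the log lines quoted in this thread (`[ pid ] uid tgid ...` right after the kernel timestamp). It prints every "Kill process" line that is not directly preceded by a task-list entry, so empty output is consistent with the OOM header not having been suppressed.

```shell
#!/bin/sh
# kills_without_tasklist LOGFILE
# Print each memcg "Kill process" line whose previous line is NOT a
# task-list entry of the form "[ pid ] uid tgid ..." (assumed format,
# taken from the log excerpts in this thread).
kills_without_tasklist() {
    awk '
        /Memory cgroup out of memory: Kill process/ {
            # prev holds the line seen just before this one
            if (prev !~ /\] +\[ *[0-9]+\] +[0-9]+ +[0-9]+ /) print
        }
        { prev = $0 }
    ' "$1"
}
```

Run as `kills_without_tasklist kern2.log`. Any line it prints would be a kill message whose process list is missing from the log and worth a closer look.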
> It was under OOM for at least 2 and a half of minutes while i was > gathering the data (then i let it run for additional, cca, 10 minutes > and then killed processes by hand but i cannot proof this). Why kernel > didn't kill any process for so long and ends the OOM? As already mentioned above, select_bad_process doesn't select any task if there is one which is on the way out. Maybe this is what is going on here. > Btw, processes in cgroup 1258 (memcg-bug-4.tar.gz) are looping in this > two tasks (i pasted only first line of stack): > mem_cgroup_handle_oom+0x241/0x3b0 > 0xffffffffffffffff 0xffffffffffffffff is just a bogus entry. No idea why this happens. > Some of them are in 'poll_schedule_timeout' and then they start to > loop as above. Is this correct behavior? > For example, do (first line of stack from process 7710 from all > timestamps): for i in */7710/stack; do head -n1 $i; done Yes, this is perfectly ok, because that task starts with: $ cat bug/1360287245/7710/stack [] poll_schedule_timeout+0x49/0x70 [] do_sys_poll+0x54b/0x680 [] sys_poll+0x7c/0xf0 [] system_call_fastpath+0x18/0x1d [] 0xffffffffffffffff and then later on it gets into OOM because of a page fault: $ cat bug/1360287250/7710/stack [] mem_cgroup_handle_oom+0x241/0x3b0 [] T.1149+0x5f3/0x600 [] mem_cgroup_charge_common+0x6c/0xb0 [] mem_cgroup_newpage_charge+0x45/0x50 [] do_wp_page+0x14e/0x800 [] handle_pte_fault+0x264/0x940 [] handle_mm_fault+0x138/0x260 [] do_page_fault+0x13d/0x460 [] page_fault+0x1f/0x30 [] 0xffffffffffffffff And it loops in it until the end which is possible as well if the group is under permanent OOM condition and the task is not selected to be killed. Unfortunately I am not able to reproduce this behavior even if I try to hammer OOM like mad so I am afraid I cannot help you much without further debugging patches. I do realize that experimenting in your environment is a problem but I do not many options left. 
Please do not use strace and rather collect /proc/pid/stack instead. It would be also helpful to get group/tasks file to have a full list of tasks in the group --- >>From 1139745d43cc8c56bc79c219291d1e5281799dd4 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Mon, 11 Feb 2013 12:18:36 +0100 Subject: [PATCH] oom: debug skipping killing --- mm/oom_kill.c | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 069b64e..3d759f0 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -329,6 +329,8 @@ static struct task_struct *select_bad_process(unsigned int *ppoints, if (test_tsk_thread_flag(p, TIF_MEMDIE)) { if (unlikely(frozen(p))) thaw_process(p); + printk(KERN_WARNING"XXX: pid:%d (flags:%u) is TIF_MEMDIE. Waiting for it\n", + p->pid, p->flags); return ERR_PTR(-1UL); } if (!p->mm) @@ -353,8 +355,11 @@ static struct task_struct *select_bad_process(unsigned int *ppoints, * then wait for it to finish before killing * some other task unnecessarily. */ - if (!(p->group_leader->ptrace & PT_TRACE_EXIT)) + if (!(p->group_leader->ptrace & PT_TRACE_EXIT)) { + printk(KERN_WARNING"XXX: pid:%d (flags:%u) is PF_EXITING. Waiting for it\n", + p->pid, p->flags); return ERR_PTR(-1UL); + } } } @@ -494,6 +499,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, * its children or threads, just set TIF_MEMDIE so it can die quickly */ if (p->flags & PF_EXITING) { + printk(KERN_WARNING"XXX: pid:%d (flags:%u). Not killing PF_EXITING\n", p->pid, p->flags); set_tsk_thread_flag(p, TIF_MEMDIE); return 0; } @@ -567,6 +573,8 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask) * its memory. */ if (fatal_signal_pending(current)) { + printk(KERN_WARNING"XXX: pid:%d (flags:%u) has fatal_signal_pending. 
Waiting for it\n", + p->pid, p->flags); set_thread_flag(TIF_MEMDIE); return; } -- 1.7.10.4 -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755615Ab3BVIXi (ORCPT ); Fri, 22 Feb 2013 03:23:38 -0500 Received: from gmmr2.centrum.cz ([46.255.227.252]:40342 "EHLO gmmr2.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754068Ab3BVIXf (ORCPT ); Fri, 22 Feb 2013 03:23:35 -0500 To: =?utf-8?q?Michal_Hocko?= Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_if_PF=5FNO=5FMEMCG=5FOOM_is_set?= Date: Fri, 22 Feb 2013 09:23:32 +0100 From: "azurIt" Cc: , , =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= References: <20130208094420.GA7557@dhcp22.suse.cz>, <20130208120249.FD733220@pobox.sk>, <20130208123854.GB7557@dhcp22.suse.cz>, <20130208145616.FB78CE24@pobox.sk>, <20130208152402.GD7557@dhcp22.suse.cz>, <20130208165805.8908B143@pobox.sk>, <20130208171012.GH7557@dhcp22.suse.cz>, <20130208220243.EDEE0825@pobox.sk>, <20130210150310.GA9504@dhcp22.suse.cz>, <20130210174619.24F20488@pobox.sk> <20130211112240.GC19922@dhcp22.suse.cz> In-Reply-To: <20130211112240.GC19922@dhcp22.suse.cz> X-Mailer: Centrum Email 5.3 X-Priority: 3 X-Original-From: azurit@pobox.sk MIME-Version: 1.0 Message-Id: <20130222092332.4001E4B6@pobox.sk> X-Maser: oho Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org >Unfortunately I am not able to reproduce this behavior even if I try >to hammer OOM like mad so I am afraid I cannot help you much without >further debugging patches. >I do realize that experimenting in your environment is a problem but I >do not many options left. Please do not use strace and rather collect >/proc/pid/stack instead. 
It would be also helpful to get group/tasks
>file to have a full list of tasks in the group

Hi Michal,

sorry that I didn't respond for a while. Today I installed a kernel with your two patches and I'm running it now. I'm still having problems with the OOM killer, which is not able to handle low memory and is not killing processes. Here is some info:

- data from cgroup 1258 while it was under OOM and no processes were killed (so the OOM didn't stop and the cgroup was frozen)
http://watchdog.sk/lkml/memcg-bug-6.tar.gz

I noticed the problem at about 8:39 and waited until 8:57 (nothing happened). Then I killed process 19864, which seemed to help: the other processes probably ended and the cgroup started to work. But the problem occurred again about 20 seconds later, so I killed all processes at 8:58. The problem has been occurring all the time since then. All processes (in that cgroup) are always in state 'D' when it occurs.

- kernel log from boot until now
http://watchdog.sk/lkml/kern3.gz

Btw, something probably also happened at about 3:09 but I wasn't able to gather any data because my 'load check script' killed all apache processes (load was more than 100).
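
For reference, the kind of collection requested in the quoted text (each task's /proc/pid/stack plus the group's tasks file) could be scripted roughly as follows. This is a sketch under assumptions, not the script that produced the memcg-bug archives: the function name `snapshot_cgroup`, its arguments and the optional third argument (an alternative proc root, useful for dry runs) are hypothetical. The output layout copies the `<timestamp>/<pid>/stack` directories seen earlier in this thread. Reading /proc/pid/stack normally requires root.

```shell
#!/bin/sh
# snapshot_cgroup CGROUP_DIR OUT_DIR [PROC_ROOT]
# Save the group's tasks file and the kernel stack of every listed task
# into OUT_DIR/<epoch-seconds>/, one sub-directory per pid.
# Prints the snapshot directory on success.
snapshot_cgroup() {
    cgroup_dir=$1
    out_dir=$2
    proc_root=${3:-/proc}
    snap="$out_dir/$(date +%s)"

    mkdir -p "$snap" || return 1
    # Full list of tasks in the group at the time of the snapshot
    cp "$cgroup_dir/tasks" "$snap/tasks" || return 1

    while read -r pid; do
        mkdir -p "$snap/$pid"
        # A task may exit between listing and reading; ignore such errors
        cat "$proc_root/$pid/stack" > "$snap/$pid/stack" 2>/dev/null
    done < "$snap/tasks"

    echo "$snap"
}
```

Calling it in a loop, for example `while sleep 1; do snapshot_cgroup /path/to/memcg/1258 bug; done` (the cgroup path is a placeholder), gives one directory per second, and the per-pid history can then be read with the `for i in */7710/stack; do head -n1 $i; done` one-liner used earlier in the thread.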
azur From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757072Ab3BVMBJ (ORCPT ); Fri, 22 Feb 2013 07:01:09 -0500 Received: from gmmr2.centrum.cz ([46.255.227.252]:50864 "EHLO gmmr2.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756432Ab3BVMBE (ORCPT ); Fri, 22 Feb 2013 07:01:04 -0500 To: =?utf-8?q?Michal_Hocko?= Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_if_PF=5FNO=5FMEMCG=5FOOM_is_set?= Date: Fri, 22 Feb 2013 13:00:55 +0100 From: "azurIt" Cc: , , =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= References: <20130208094420.GA7557@dhcp22.suse.cz>, <20130208120249.FD733220@pobox.sk>, <20130208123854.GB7557@dhcp22.suse.cz>, <20130208145616.FB78CE24@pobox.sk>, <20130208152402.GD7557@dhcp22.suse.cz>, <20130208165805.8908B143@pobox.sk>, <20130208171012.GH7557@dhcp22.suse.cz>, <20130208220243.EDEE0825@pobox.sk>, <20130210150310.GA9504@dhcp22.suse.cz>, <20130210174619.24F20488@pobox.sk> <20130211112240.GC19922@dhcp22.suse.cz> In-Reply-To: <20130211112240.GC19922@dhcp22.suse.cz> X-Mailer: Centrum Email 5.3 X-Priority: 3 X-Original-From: azurit@pobox.sk MIME-Version: 1.0 Message-Id: <20130222130055.29151595@pobox.sk> X-Maser: oho Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org >Unfortunately I am not able to reproduce this behavior even if I try >to hammer OOM like mad so I am afraid I cannot help you much without >further debugging patches. >I do realize that experimenting in your environment is a problem but I >do not many options left. Please do not use strace and rather collect >/proc/pid/stack instead. It would be also helpful to get group/tasks >file to have a full list of tasks in the group Sending new info! I found out one interesting thing. 
When the problem occurs (it probably happens when OOM is started in the target cgroup, but I'm not sure), the target cgroup somehow becomes broken. In other words, after the problem occurs once in the target cgroup, it keeps happening in this cgroup. I made this test:

1.) I created cgroup A with limits (including a memory limit).
2.) Waited until OOM started (can take hours). Processes in the target cgroup become frozen so they must be killed.
3.) After this, processes always freeze in cgroup A; it usually takes 20-30 seconds after killing the previously frozen processes.
4.) I created cgroup B with the *same* limits as cgroup A and moved the user from A to B. The problem disappears.
5.) Go to (2)

And a second thing: I got a kernel oops, look at the end of:
http://watchdog.sk/lkml/oops

From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757335Ab3BVMyw (ORCPT ); Fri, 22 Feb 2013 07:54:52 -0500 Received: from gmmr8.centrum.cz ([46.255.227.254]:41705 "EHLO gmmr8.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756590Ab3BVMyu (ORCPT ); Fri, 22 Feb 2013 07:54:50 -0500 To: =?utf-8?q?Michal_Hocko?= Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_if_PF=5FNO=5FMEMCG=5FOOM_is_set?= Date: Fri, 22 Feb 2013 13:54:42 +0100 From: "azurIt" Cc: , , =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= References: <20130208123854.GB7557@dhcp22.suse.cz>, <20130208145616.FB78CE24@pobox.sk>, <20130208152402.GD7557@dhcp22.suse.cz>, <20130208165805.8908B143@pobox.sk>, <20130208171012.GH7557@dhcp22.suse.cz>, <20130208220243.EDEE0825@pobox.sk>, <20130210150310.GA9504@dhcp22.suse.cz>, <20130210174619.24F20488@pobox.sk>, <20130211112240.GC19922@dhcp22.suse.cz>, <20130222092332.4001E4B6@pobox.sk> <20130222125217.GA32285@dhcp22.suse.cz> In-Reply-To: <20130222125217.GA32285@dhcp22.suse.cz> X-Mailer: Centrum Email 5.3 X-Priority: 3
X-Original-From: azurit@pobox.sk MIME-Version: 1.0 Message-Id: <20130222135442.ADFFF498@pobox.sk> X-Maser: Georgo Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org >I am not sure how much time I'll have for this today but just to make >sure we are on the same page, could you point me to the two patches you >have applied in the mean time? Here: http://watchdog.sk/lkml/patches2 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757329Ab3BVMwX (ORCPT ); Fri, 22 Feb 2013 07:52:23 -0500 Received: from cantor2.suse.de ([195.135.220.15]:55772 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756705Ab3BVMwU (ORCPT ); Fri, 22 Feb 2013 07:52:20 -0500 Date: Fri, 22 Feb 2013 13:52:17 +0100 From: Michal Hocko To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set Message-ID: <20130222125217.GA32285@dhcp22.suse.cz> References: <20130208123854.GB7557@dhcp22.suse.cz> <20130208145616.FB78CE24@pobox.sk> <20130208152402.GD7557@dhcp22.suse.cz> <20130208165805.8908B143@pobox.sk> <20130208171012.GH7557@dhcp22.suse.cz> <20130208220243.EDEE0825@pobox.sk> <20130210150310.GA9504@dhcp22.suse.cz> <20130210174619.24F20488@pobox.sk> <20130211112240.GC19922@dhcp22.suse.cz> <20130222092332.4001E4B6@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130222092332.4001E4B6@pobox.sk> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi, On Fri 22-02-13 09:23:32, azurIt wrote: [...] > sorry that i didn't response for a while. 
Today i installed kernel > with your two patches and i'm running it now. I am not sure how much time I'll have for this today but just to make sure we are on the same page, could you point me to the two patches you have applied in the mean time? -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757163Ab3BVNAW (ORCPT ); Fri, 22 Feb 2013 08:00:22 -0500 Received: from cantor2.suse.de ([195.135.220.15]:56066 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756872Ab3BVNAT (ORCPT ); Fri, 22 Feb 2013 08:00:19 -0500 Date: Fri, 22 Feb 2013 14:00:17 +0100 From: Michal Hocko To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set Message-ID: <20130222130017.GB32285@dhcp22.suse.cz> References: <20130222125217.GA32285@dhcp22.suse.cz> <20130222135442.ADFFF498@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130222135442.ADFFF498@pobox.sk> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri 22-02-13 13:54:42, azurIt wrote: > >I am not sure how much time I'll have for this today but just to make > >sure we are on the same page, could you point me to the two patches you > >have applied in the mean time? > > > Here: > http://watchdog.sk/lkml/patches2 OK, looks correct. 
-- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752871Ab3FFQEw (ORCPT ); Thu, 6 Jun 2013 12:04:52 -0400 Received: from cantor2.suse.de ([195.135.220.15]:41703 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751728Ab3FFQEu (ORCPT ); Thu, 6 Jun 2013 12:04:50 -0400 Date: Thu, 6 Jun 2013 18:04:46 +0200 From: Michal Hocko To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set Message-ID: <20130606160446.GE24115@dhcp22.suse.cz> References: <20130208123854.GB7557@dhcp22.suse.cz> <20130208145616.FB78CE24@pobox.sk> <20130208152402.GD7557@dhcp22.suse.cz> <20130208165805.8908B143@pobox.sk> <20130208171012.GH7557@dhcp22.suse.cz> <20130208220243.EDEE0825@pobox.sk> <20130210150310.GA9504@dhcp22.suse.cz> <20130210174619.24F20488@pobox.sk> <20130211112240.GC19922@dhcp22.suse.cz> <20130222092332.4001E4B6@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130222092332.4001E4B6@pobox.sk> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi, I am really sorry it took so long but I was constantly preempted by other stuff. I hope I have a good news for you, though. Johannes has found a nice way how to overcome deadlock issues from memcg OOM which might help you. Would you be willing to test with his patch (http://permalink.gmane.org/gmane.linux.kernel.mm/101437). Unlike my patch which handles just the i_mutex case his patch solved all possible locks. I can backport the patch for your kernel (are you still using 3.2 kernel or you have moved to a newer one?). 
On Fri 22-02-13 09:23:32, azurIt wrote: > >Unfortunately I am not able to reproduce this behavior even if I try > >to hammer OOM like mad so I am afraid I cannot help you much without > >further debugging patches. > >I do realize that experimenting in your environment is a problem but I > >do not many options left. Please do not use strace and rather collect > >/proc/pid/stack instead. It would be also helpful to get group/tasks > >file to have a full list of tasks in the group > > > > Hi Michal, > > > sorry that i didn't response for a while. Today i installed kernel with your two patches and i'm running it now. I'm still having problems with OOM which is not able to handle low memory and is not killing processes. Here is some info: > > - data from cgroup 1258 while it was under OOM and no processes were killed (so OOM don't stop and cgroup was freezed) > http://watchdog.sk/lkml/memcg-bug-6.tar.gz > > I noticed problem about on 8:39 and waited until 8:57 (nothing happend). Then i killed process 19864 which seems to help and other processes probably ends and cgroup started to work. But problem accoured again about 20 seconds later, so i killed all processes at 8:58. The problem is occuring all the time since then. All processes (in that cgroup) are always in state 'D' when it occurs. > > > - kernel log from boot until now > http://watchdog.sk/lkml/kern3.gz > > > Btw, something probably happened also at about 3:09 but i wasn't able to gather any data because my 'load check script' killed all apache processes (load was more than 100). 
> > > > azur > -- > To unsubscribe from this list: send the line "unsubscribe cgroups" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751904Ab3FFQXF (ORCPT ); Thu, 6 Jun 2013 12:23:05 -0400 Received: from gmmr3.centrum.cz ([46.255.225.251]:40248 "EHLO gmmr3.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751147Ab3FFQXD (ORCPT ); Thu, 6 Jun 2013 12:23:03 -0400 X-Greylist: delayed 386 seconds by postgrey-1.27 at vger.kernel.org; Thu, 06 Jun 2013 12:23:03 EDT To: =?utf-8?q?Michal_Hocko?= Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=2E34=5D_memcg=3A_do_not_trigger_OOM_if_PF=5FNO=5FMEMCG=5FOOM_is_set?= Date: Thu, 06 Jun 2013 18:16:33 +0200 From: "azurIt" Cc: , , =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= References: <20130208123854.GB7557@dhcp22.suse.cz>, <20130208145616.FB78CE24@pobox.sk>, <20130208152402.GD7557@dhcp22.suse.cz>, <20130208165805.8908B143@pobox.sk>, <20130208171012.GH7557@dhcp22.suse.cz>, <20130208220243.EDEE0825@pobox.sk>, <20130210150310.GA9504@dhcp22.suse.cz>, <20130210174619.24F20488@pobox.sk>, <20130211112240.GC19922@dhcp22.suse.cz>, <20130222092332.4001E4B6@pobox.sk> <20130606160446.GE24115@dhcp22.suse.cz> In-Reply-To: <20130606160446.GE24115@dhcp22.suse.cz> X-Mailer: Centrum Email 5.3 X-Priority: 3 X-Original-From: azurit@pobox.sk MIME-Version: 1.0 Message-Id: <20130606181633.BCC3E02E@pobox.sk> X-Maser: oho Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hello Michal, nice to read you! :) Yes, i'm still on 3.2. Could you be so kind and try to backport it? Thank you very much! 
azur

______________________________________________________________
> From: "Michal Hocko"
> To: azurIt
> Date: 06.06.2013 18:04
> Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set
>
> CC: linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , "Johannes Weiner"
>Hi,
>
>I am really sorry it took so long but I was constantly preempted by
>other stuff. I hope I have a good news for you, though. Johannes has
>found a nice way how to overcome deadlock issues from memcg OOM which
>might help you. Would you be willing to test with his patch
>(http://permalink.gmane.org/gmane.linux.kernel.mm/101437). Unlike my
>patch which handles just the i_mutex case his patch solved all possible
>locks.
>
>I can backport the patch for your kernel (are you still using 3.2 kernel
>or you have moved to a newer one?).
>
>On Fri 22-02-13 09:23:32, azurIt wrote:
>> >Unfortunately I am not able to reproduce this behavior even if I try
>> >to hammer OOM like mad so I am afraid I cannot help you much without
>> >further debugging patches.
>> >I do realize that experimenting in your environment is a problem but I
>> >do not many options left. Please do not use strace and rather collect
>> >/proc/pid/stack instead. It would be also helpful to get group/tasks
>> >file to have a full list of tasks in the group
>>
>>
>>
>> Hi Michal,
>>
>>
>> sorry that i didn't response for a while. Today i installed kernel with your two patches and i'm running it now. I'm still having problems with OOM which is not able to handle low memory and is not killing processes. Here is some info:
>>
>> - data from cgroup 1258 while it was under OOM and no processes were killed (so OOM don't stop and cgroup was freezed)
>> http://watchdog.sk/lkml/memcg-bug-6.tar.gz
>>
>> I noticed problem about on 8:39 and waited until 8:57 (nothing happend).
Then i killed process 19864 which seemed to help; the other processes probably ended and the cgroup started to work. But the problem occurred again about 20 seconds later, so i killed all processes at 8:58. The problem has been occurring all the time since then. All processes (in that cgroup) are always in state 'D' when it occurs. >> >> >> - kernel log from boot until now >> http://watchdog.sk/lkml/kern3.gz >> >> >> Btw, something probably happened also at about 3:09 but i wasn't able to gather any data because my 'load check script' killed all apache processes (load was more than 100). >> >> >> >> azur > >-- >Michal Hocko >SUSE Labs > From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755025Ab3FGNME (ORCPT ); Fri, 7 Jun 2013 09:12:04 -0400 Received: from cantor2.suse.de ([195.135.220.15]:54496 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753367Ab3FGNMB (ORCPT ); Fri, 7 Jun 2013 09:12:01 -0400 Date: Fri, 7 Jun 2013 15:11:57 +0200 From: Michal Hocko To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner Subject: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM Message-ID: <20130607131157.GF8117@dhcp22.suse.cz> References: <20130208152402.GD7557@dhcp22.suse.cz> <20130208165805.8908B143@pobox.sk> <20130208171012.GH7557@dhcp22.suse.cz> <20130208220243.EDEE0825@pobox.sk> <20130210150310.GA9504@dhcp22.suse.cz> <20130210174619.24F20488@pobox.sk> <20130211112240.GC19922@dhcp22.suse.cz> <20130222092332.4001E4B6@pobox.sk> <20130606160446.GE24115@dhcp22.suse.cz> <20130606181633.BCC3E02E@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To:
<20130606181633.BCC3E02E@pobox.sk> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu 06-06-13 18:16:33, azurIt wrote: > Hello Michal, > > nice to read you! :) Yes, i'm still on 3.2. Could you be so kind and > try to backport it? Thank you very much! Here we go. I hope I didn't screw anything (Johannes might double check) because there were quite some changes in the area since 3.2. Nothing earth shattering though. Please note that I have only compile tested this. Also make sure you remove the previous patches you have from me. --- >>From 9d2801c1f53147ca9134cc5f76ab28d505a37a54 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Fri, 7 Jun 2013 13:52:42 +0200 Subject: [PATCH] memcg: do not trap chargers with full callstack on OOM The memcg OOM handling is incredibly fragile and can deadlock. When a task fails to charge memory, it invokes the OOM killer and loops right there in the charge code until it succeeds. Comparably, any other task that enters the charge path at this point will go to a waitqueue right then and there and sleep until the OOM situation is resolved. The problem is that these tasks may hold filesystem locks and the mmap_sem; locks that the selected OOM victim may need to exit. For example, in one reported case, the task invoking the OOM killer was about to charge a page cache page during a write(), which holds the i_mutex. 
The OOM killer selected a task that was just entering truncate() and trying to acquire the i_mutex: OOM invoking task: [] mem_cgroup_handle_oom+0x241/0x3b0 [] T.1146+0x5ab/0x5c0 [] mem_cgroup_cache_charge+0xbe/0xe0 [] add_to_page_cache_locked+0x4c/0x140 [] add_to_page_cache_lru+0x22/0x50 [] grab_cache_page_write_begin+0x8b/0xe0 [] ext3_write_begin+0x88/0x270 [] generic_file_buffered_write+0x116/0x290 [] __generic_file_aio_write+0x27c/0x480 [] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex [] do_sync_write+0xea/0x130 [] vfs_write+0xf3/0x1f0 [] sys_write+0x51/0x90 [] system_call_fastpath+0x18/0x1d [] 0xffffffffffffffff OOM kill victim: [] do_truncate+0x58/0xa0 # takes i_mutex [] do_last+0x250/0xa30 [] path_openat+0xd7/0x440 [] do_filp_open+0x49/0xa0 [] do_sys_open+0x106/0x240 [] sys_open+0x20/0x30 [] system_call_fastpath+0x18/0x1d [] 0xffffffffffffffff The OOM handling task will retry the charge indefinitely while the OOM killed task is not releasing any resources. A similar scenario can happen when the kernel OOM killer for a memcg is disabled and a userspace task is in charge of resolving OOM situations. In this case, ALL tasks that enter the OOM path will be made to sleep on the OOM waitqueue and wait for userspace to free resources or increase the group's limit. But a userspace OOM handler is prone to deadlock itself on the locks held by the waiting tasks. For example one of the sleeping tasks may be stuck in a brk() call with the mmap_sem held for writing but the userspace handler, in order to pick an optimal victim, may need to read files from /proc/, which tries to acquire the same mmap_sem for reading and deadlocks. This patch changes the way tasks behave after detecting an OOM and makes sure nobody loops or sleeps on OOM with locks held: 1. When OOMing in a system call (buffered IO and friends), invoke the OOM killer but just return -ENOMEM, never sleep on a OOM waitqueue. 
Userspace should be able to handle this and it prevents anybody from looping or waiting with locks held. 2. When OOMing in a page fault, invoke the OOM killer and restart the fault instead of looping on the charge attempt. This way, the OOM victim can not get stuck on locks the looping task may hold. 3. When detecting an OOM in a page fault but somebody else is handling it (either the kernel OOM killer or a userspace handler), don't go to sleep in the charge context. Instead, remember the OOMing memcg in the task struct and then fully unwind the page fault stack with -ENOMEM. pagefault_out_of_memory() will then call back into the memcg code to check if the -ENOMEM came from the memcg, and then either put the task to sleep on the memcg's OOM waitqueue or just restart the fault. The OOM victim can no longer get stuck on any lock a sleeping task may hold. While reworking the OOM routine, also remove a needless OOM waitqueue wakeup when invoking the killer. Only uncharges and limit increases, things that actually change the memory situation, should do wakeups. 
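The three cases above can be sketched as a small userspace model. This is only an illustration of the control flow described in the changelog, not kernel code: every model_* name is hypothetical, and the real patch uses the task_struct, atomics, a waitqueue and css reference counting instead of the plain fields shown here. The point it tries to show is that the charge path only records the OOM context, and the decision to sleep is taken later, once the fault stack is unwound, by comparing a wakeup counter snapshot.

```c
/* Userspace model of the deferred memcg OOM wait. Hypothetical names,
 * no locking, no real sleeping; a sketch under the assumptions above. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_memcg {
	int oom_wakeups;	/* bumped on every uncharge or limit increase */
	int under_oom;		/* tasks that currently mark the group as under OOM */
};

struct model_task {
	bool in_userfault;		/* set while handling a user page fault */
	bool in_memcg_oom;		/* the failed charge was a memcg OOM */
	int wakeups;			/* snapshot of oom_wakeups at failure time */
	struct model_memcg *wait_on;	/* group to wait on, NULL if we killed */
};

enum model_action { MODEL_NOT_MEMCG, MODEL_RESTART_FAULT, MODEL_SLEEP };

/* Charge context: locks may be held, so never sleep here. */
void model_charge_failed(struct model_task *t, struct model_memcg *cg,
			 bool someone_else_handles)
{
	assert(t != NULL && cg != NULL);
	if (!t->in_userfault)
		return;		/* system call: the caller just gets -ENOMEM */
	t->in_memcg_oom = true;
	if (someone_else_handles) {
		cg->under_oom++;
		t->wakeups = cg->oom_wakeups;	/* to detect missed wakeups */
		t->wait_on = cg;
	}
}

/* An uncharge or limit increase changed the memory situation. */
void model_wakeup(struct model_memcg *cg)
{
	cg->oom_wakeups++;
}

/* End of the page fault: stack unwound, no locks held. */
enum model_action model_synchronize(struct model_task *t)
{
	enum model_action action = MODEL_RESTART_FAULT;
	struct model_memcg *cg;

	if (!t->in_memcg_oom)
		return MODEL_NOT_MEMCG;	/* leave it to the global OOM path */
	cg = t->wait_on;
	if (cg != NULL) {
		/* Only sleep if no wakeup happened since the failed charge. */
		if (cg->oom_wakeups == t->wakeups)
			action = MODEL_SLEEP;
		cg->under_oom--;
		t->wait_on = NULL;
	}
	t->in_memcg_oom = false;
	return action;
}
```

In this model MODEL_SLEEP stands for "sleep on the OOM waitqueue, then restart the fault", and every path that set under_oom clears it again in model_synchronize(). A task that records the context but never reaches the synchronize step would leave under_oom raised.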
Reported-by: Reported-by: azurIt Debugged-by: Michal Hocko Reported-by: David Rientjes Signed-off-by: Johannes Weiner Reviewed-by: Michal Hocko --- include/linux/memcontrol.h | 22 +++++++ include/linux/mm.h | 1 + include/linux/sched.h | 6 ++ mm/ksm.c | 2 +- mm/memcontrol.c | 149 ++++++++++++++++++++++++++++---------------- mm/memory.c | 40 ++++++++---- mm/oom_kill.c | 2 + 7 files changed, 156 insertions(+), 66 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index b87068a..56bfc39 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -120,6 +120,15 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page); extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p); +static inline void mem_cgroup_set_userfault(struct task_struct *p) +{ + p->memcg_oom.in_userfault = 1; +} +static inline void mem_cgroup_clear_userfault(struct task_struct *p) +{ + p->memcg_oom.in_userfault = 0; +} +bool mem_cgroup_oom_synchronize(void); #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP extern int do_swap_account; #endif @@ -333,6 +342,19 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p) { } +static inline void mem_cgroup_set_userfault(struct task_struct *p) +{ +} + +static inline void mem_cgroup_clear_userfault(struct task_struct *p) +{ +} + +static inline bool mem_cgroup_oom_synchronize(void) +{ + return false; +} + static inline void mem_cgroup_inc_page_stat(struct page *page, enum mem_cgroup_page_stat_item idx) { diff --git a/include/linux/mm.h b/include/linux/mm.h index 4baadd1..91380ef 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -156,6 +156,7 @@ extern pgprot_t protection_map[16]; #define FAULT_FLAG_ALLOW_RETRY 0x08 /* Retry fault if blocking */ #define FAULT_FLAG_RETRY_NOWAIT 0x10 /* Don't drop mmap_sem and wait when retrying */ #define FAULT_FLAG_KILLABLE 0x20 /* The fault task is in SIGKILL killable region */ +#define FAULT_FLAG_KERNEL 0x80 /* kernel-triggered 
fault (get_user_pages etc.) */ /* * This interface is used by x86 PAT code to identify a pfn mapping that is diff --git a/include/linux/sched.h b/include/linux/sched.h index 1c4f3e9..d521a70 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1568,6 +1568,12 @@ struct task_struct { unsigned long nr_pages; /* uncharged usage */ unsigned long memsw_nr_pages; /* uncharged mem+swap usage */ } memcg_batch; + struct memcg_oom_info { + unsigned int in_userfault:1; + unsigned int in_memcg_oom:1; + int wakeups; + struct mem_cgroup *wait_on_memcg; + } memcg_oom; #endif #ifdef CONFIG_HAVE_HW_BREAKPOINT atomic_t ptrace_bp_refcnt; diff --git a/mm/ksm.c b/mm/ksm.c index 310544a..3295a3b 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -338,7 +338,7 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr) break; if (PageKsm(page)) ret = handle_mm_fault(vma->vm_mm, vma, addr, - FAULT_FLAG_WRITE); + FAULT_FLAG_KERNEL | FAULT_FLAG_WRITE); else ret = VM_FAULT_WRITE; put_page(page); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index b63f5f7..67189b4 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -249,6 +249,7 @@ struct mem_cgroup { bool oom_lock; atomic_t under_oom; + atomic_t oom_wakeups; atomic_t refcnt; @@ -1846,6 +1847,7 @@ static int memcg_oom_wake_function(wait_queue_t *wait, static void memcg_wakeup_oom(struct mem_cgroup *memcg) { + atomic_inc(&memcg->oom_wakeups); /* for filtering, pass "memcg" as argument. */ __wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg); } @@ -1857,55 +1859,109 @@ static void memcg_oom_recover(struct mem_cgroup *memcg) } /* - * try to call OOM killer. returns false if we should exit memory-reclaim loop. 
+ * try to call OOM killer */ -bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask) +static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask) { - struct oom_wait_info owait; - bool locked, need_to_kill; - - owait.mem = memcg; - owait.wait.flags = 0; - owait.wait.func = memcg_oom_wake_function; - owait.wait.private = current; - INIT_LIST_HEAD(&owait.wait.task_list); - need_to_kill = true; - mem_cgroup_mark_under_oom(memcg); + bool locked, need_to_kill = true; /* At first, try to OOM lock hierarchy under memcg.*/ spin_lock(&memcg_oom_lock); locked = mem_cgroup_oom_lock(memcg); - /* - * Even if signal_pending(), we can't quit charge() loop without - * accounting. So, UNINTERRUPTIBLE is appropriate. But SIGKILL - * under OOM is always welcomed, use TASK_KILLABLE here. - */ - prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE); if (!locked || memcg->oom_kill_disable) need_to_kill = false; if (locked) mem_cgroup_oom_notify(memcg); spin_unlock(&memcg_oom_lock); - if (need_to_kill) { - finish_wait(&memcg_oom_waitq, &owait.wait); - mem_cgroup_out_of_memory(memcg, mask); - } else { - schedule(); - finish_wait(&memcg_oom_waitq, &owait.wait); + /* + * A system call can just return -ENOMEM, but if this is a + * page fault and somebody else is handling the OOM already, + * we need to sleep on the OOM waitqueue for this memcg until + * the situation is resolved. Which can take some time + * because it might be handled by a userspace task. + * + * However, this is the charge context, which means that we + * may sit on a large call stack and hold various filesystem + * locks, the mmap_sem etc. and we don't want the OOM handler + * to deadlock on them while we sit here and wait. Store the + * current OOM context in the task_struct, then return + * -ENOMEM. At the end of the page fault handler, with the + * stack unwound, pagefault_out_of_memory() will check back + * with us by calling mem_cgroup_oom_synchronize(), possibly + * putting the task to sleep. 
+ */ + if (current->memcg_oom.in_userfault) { + current->memcg_oom.in_memcg_oom = 1; + /* + * Somebody else is handling the situation. Make sure + * no wakeups are missed between now and going to + * sleep at the end of the page fault. + */ + if (!need_to_kill) { + mem_cgroup_mark_under_oom(memcg); + current->memcg_oom.wakeups = + atomic_read(&memcg->oom_wakeups); + css_get(&memcg->css); + current->memcg_oom.wait_on_memcg = memcg; + } } - spin_lock(&memcg_oom_lock); - if (locked) + + if (need_to_kill) + mem_cgroup_out_of_memory(memcg, mask); + + if (locked) { + spin_lock(&memcg_oom_lock); mem_cgroup_oom_unlock(memcg); - memcg_wakeup_oom(memcg); - spin_unlock(&memcg_oom_lock); + /* + * Sleeping tasks might have been killed, make sure + * they get scheduled so they can exit. + */ + if (need_to_kill) + memcg_oom_recover(memcg); + spin_unlock(&memcg_oom_lock); + } +} - mem_cgroup_unmark_under_oom(memcg); +bool mem_cgroup_oom_synchronize(void) +{ + struct oom_wait_info owait; + struct mem_cgroup *memcg; - if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current)) + /* OOM is global, do not handle */ + if (!current->memcg_oom.in_memcg_oom) return false; - /* Give chance to dying process */ - schedule_timeout_uninterruptible(1); + + /* + * We invoked the OOM killer but there is a chance that a kill + * did not free up any charges. Everybody else might already + * be sleeping, so restart the fault and keep the rampage + * going until some charges are released. 
+ */ + memcg = current->memcg_oom.wait_on_memcg; + if (!memcg) + goto out; + + if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current)) + goto out_put; + + owait.mem = memcg; + owait.wait.flags = 0; + owait.wait.func = memcg_oom_wake_function; + owait.wait.private = current; + INIT_LIST_HEAD(&owait.wait.task_list); + + prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE); + /* Only sleep if we didn't miss any wakeups since OOM */ + if (atomic_read(&memcg->oom_wakeups) == current->memcg_oom.wakeups) + schedule(); + finish_wait(&memcg_oom_waitq, &owait.wait); +out_put: + mem_cgroup_unmark_under_oom(memcg); + css_put(&memcg->css); + current->memcg_oom.wait_on_memcg = NULL; +out: + current->memcg_oom.in_memcg_oom = 0; return true; } @@ -2195,11 +2251,10 @@ enum { CHARGE_RETRY, /* need to retry but retry is not bad */ CHARGE_NOMEM, /* we can't do more. return -ENOMEM */ CHARGE_WOULDBLOCK, /* GFP_WAIT wasn't set and no enough res. */ - CHARGE_OOM_DIE, /* the current is killed because of OOM */ }; static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, - unsigned int nr_pages, bool oom_check) + unsigned int nr_pages, bool invoke_oom) { unsigned long csize = nr_pages * PAGE_SIZE; struct mem_cgroup *mem_over_limit; @@ -2257,14 +2312,10 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, if (mem_cgroup_wait_acct_move(mem_over_limit)) return CHARGE_RETRY; - /* If we don't need to call oom-killer at el, return immediately */ - if (!oom_check) - return CHARGE_NOMEM; - /* check OOM */ - if (!mem_cgroup_handle_oom(mem_over_limit, gfp_mask)) - return CHARGE_OOM_DIE; + if (invoke_oom) + mem_cgroup_oom(mem_over_limit, gfp_mask); - return CHARGE_RETRY; + return CHARGE_NOMEM; } /* @@ -2349,7 +2400,7 @@ again: } do { - bool oom_check; + bool invoke_oom = oom && !nr_oom_retries; /* If killed, bypass charge */ if (fatal_signal_pending(current)) { @@ -2357,13 +2408,7 @@ again: goto bypass; } - oom_check = false; - if (oom && 
!nr_oom_retries) { - oom_check = true; - nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES; - } - - ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, oom_check); + ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, invoke_oom); switch (ret) { case CHARGE_OK: break; @@ -2376,16 +2421,12 @@ again: css_put(&memcg->css); goto nomem; case CHARGE_NOMEM: /* OOM routine works */ - if (!oom) { + if (!oom || invoke_oom) { css_put(&memcg->css); goto nomem; } - /* If oom, we never return -ENOMEM */ nr_oom_retries--; break; - case CHARGE_OOM_DIE: /* Killed by OOM Killer */ - css_put(&memcg->css); - goto bypass; } } while (ret != CHARGE_OK); diff --git a/mm/memory.c b/mm/memory.c index 829d437..bee177c 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1720,7 +1720,7 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, cond_resched(); while (!(page = follow_page(vma, start, foll_flags))) { int ret; - unsigned int fault_flags = 0; + unsigned int fault_flags = FAULT_FLAG_KERNEL; /* For mlock, just skip the stack guard page. */ if (foll_flags & FOLL_MLOCK) { @@ -1842,6 +1842,7 @@ int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm, if (!vma || address < vma->vm_start) return -EFAULT; + fault_flags |= FAULT_FLAG_KERNEL; ret = handle_mm_fault(mm, vma, address, fault_flags); if (ret & VM_FAULT_ERROR) { if (ret & VM_FAULT_OOM) @@ -3439,22 +3440,14 @@ unlock: /* * By the time we get here, we already hold the mm semaphore */ -int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, - unsigned long address, unsigned int flags) +static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, unsigned int flags) { pgd_t *pgd; pud_t *pud; pmd_t *pmd; pte_t *pte; - __set_current_state(TASK_RUNNING); - - count_vm_event(PGFAULT); - mem_cgroup_count_vm_event(mm, PGFAULT); - - /* do counter updates before entering really critical section. 
*/ - check_sync_rss_stat(current); - if (unlikely(is_vm_hugetlb_page(vma))) return hugetlb_fault(mm, vma, address, flags); @@ -3503,6 +3496,31 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, return handle_pte_fault(mm, vma, address, pte, pmd, flags); } +int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, unsigned int flags) +{ + int in_userfault = !(flags & FAULT_FLAG_KERNEL); + int ret; + + __set_current_state(TASK_RUNNING); + + count_vm_event(PGFAULT); + mem_cgroup_count_vm_event(mm, PGFAULT); + + /* do counter updates before entering really critical section. */ + check_sync_rss_stat(current); + + if (in_userfault) + mem_cgroup_set_userfault(current); + + ret = __handle_mm_fault(mm, vma, address, flags); + + if (in_userfault) + mem_cgroup_clear_userfault(current); + + return ret; +} + #ifndef __PAGETABLE_PUD_FOLDED /* * Allocate page upper directory. diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 069b64e..aa60863 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -785,6 +785,8 @@ out: */ void pagefault_out_of_memory(void) { + if (mem_cgroup_oom_synchronize()) + return; if (try_set_system_oom()) { out_of_memory(NULL, 0, 0, NULL); clear_system_oom(); -- 1.7.10.4 -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932566Ab3FQKVj (ORCPT ); Mon, 17 Jun 2013 06:21:39 -0400 Received: from gmmr1.centrum.cz ([46.255.225.252]:35912 "EHLO gmmr1.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932307Ab3FQKVh (ORCPT ); Mon, 17 Jun 2013 06:21:37 -0400 To: =?utf-8?q?Michal_Hocko?= Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=5D_memcg=3A_do_not_trap_chargers_with_full_callstack_on_OOM?= Date: Mon, 17 Jun 2013 12:21:34 +0200 From: "azurIt" Cc: , , =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= References: <20130208152402.GD7557@dhcp22.suse.cz>, 
<20130208165805.8908B143@pobox.sk>, <20130208171012.GH7557@dhcp22.suse.cz>, <20130208220243.EDEE0825@pobox.sk>, <20130210150310.GA9504@dhcp22.suse.cz>, <20130210174619.24F20488@pobox.sk>, <20130211112240.GC19922@dhcp22.suse.cz>, <20130222092332.4001E4B6@pobox.sk>, <20130606160446.GE24115@dhcp22.suse.cz>, <20130606181633.BCC3E02E@pobox.sk> <20130607131157.GF8117@dhcp22.suse.cz> In-Reply-To: <20130607131157.GF8117@dhcp22.suse.cz> X-Mailer: Centrum Email 5.3 X-Priority: 3 X-Original-From: azurit@pobox.sk MIME-Version: 1.0 Message-Id: <20130617122134.2E072BA8@pobox.sk> X-Maser: Georgo Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org >Here we go. I hope I didn't screw anything (Johannes might double check) >because there were quite some changes in the area since 3.2. Nothing >earth shattering though. Please note that I have only compile tested >this. Also make sure you remove the previous patches you have from me. Hi Michal, it, unfortunately, didn't work. Everything was working fine but the original problem is still occurring. I'm unable to send you stacks or more info because the problem has been taking down the whole server for some time now (don't know what exactly caused it to start happening, maybe newer versions of 3.2.x). But i'm sure of one thing - when the problem occurs, nothing is able to access hard drives (every process which tries it is frozen until the problem is resolved or the server is rebooted). The problem is fixed after killing processes from the cgroup which caused it and everything immediately starts to work normally. I found this out by keeping a terminal open from another server to one where my problem is occurring quite often and running several apps there (htop, iotop, etc.). When the problem occurs, all apps which weren't working with the HDD were ok.
The htop proved to be very useful here because it's only reading proc filesystem and is also able to send KILL signals - i was able to resolve the problem with it without rebooting the server. I created a special daemon (about a month ago) which is able to detect and fix the problem so i'm not having server outages now. The point was to NOT access anything which is stored on HDDs, the daemon is only reading info from cgroup filesystem and sending KILL signals to processes. Maybe i should be able to also read stack files before killing, i will try it. Btw, which vanilla kernel includes this patch? Thank you and everyone involved very much for time and help. azur From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756774Ab3FSN0U (ORCPT ); Wed, 19 Jun 2013 09:26:20 -0400 Received: from cantor2.suse.de ([195.135.220.15]:39489 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756427Ab3FSN0R (ORCPT ); Wed, 19 Jun 2013 09:26:17 -0400 Date: Wed, 19 Jun 2013 15:26:14 +0200 From: Michal Hocko To: azurIt Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , Johannes Weiner Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM Message-ID: <20130619132614.GC16457@dhcp22.suse.cz> References: <20130208171012.GH7557@dhcp22.suse.cz> <20130208220243.EDEE0825@pobox.sk> <20130210150310.GA9504@dhcp22.suse.cz> <20130210174619.24F20488@pobox.sk> <20130211112240.GC19922@dhcp22.suse.cz> <20130222092332.4001E4B6@pobox.sk> <20130606160446.GE24115@dhcp22.suse.cz> <20130606181633.BCC3E02E@pobox.sk> <20130607131157.GF8117@dhcp22.suse.cz> <20130617122134.2E072BA8@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130617122134.2E072BA8@pobox.sk> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List:
linux-kernel@vger.kernel.org On Mon 17-06-13 12:21:34, azurIt wrote: > >Here we go. I hope I didn't screw anything (Johannes might double check) > >because there were quite some changes in the area since 3.2. Nothing > >earth shattering though. Please note that I have only compile tested > >this. Also make sure you remove the previous patches you have from me. > > > Hi Michal, > > it, unfortunately, didn't work. Everything was working fine but > original problem is still occurring. This would be more than surprising because tasks blocked at memcg OOM don't hold any locks anymore. Maybe I have messed something up during backport but I cannot spot anything. > I'm unable to send you stacks or more info because problem is taking > down the whole server for some time now (don't know what exactly > caused it to start happening, maybe newer versions of 3.2.x). So you are not testing with the same kernel with just the old patch replaced by the new one? > But i'm sure of one thing - when problem occurs, nothing is able to > access hard drives (every process which tries it is frozen until > problem is resolved or server is rebooted). It would be really interesting to see what those tasks are blocked on. > Problem is fixed after killing processes from cgroup which > caused it and everything immediately starts to work normally. I > found this out by keeping terminal opened from another server to one > where my problem is occurring quite often and running several apps > there (htop, iotop, etc.). When problem occurs, all apps which weren't > working with HDD were ok. The htop proved to be very useful here > because it's only reading proc filesystem and is also able to send > KILL signals - i was able to resolve the problem with it > without rebooting the server. sysrq+t will give you the list of all tasks and their traces. > I created a special daemon (about a month ago) which is able to detect > and fix the problem so i'm not having server outages now.
The point > was to NOT access anything which is stored on HDDs, the daemon is > only reading info from cgroup filesystem and sending KILL signals to > processes. Maybe i should be able to also read stack files before > killing, i will try it. > > Btw, which vanilla kernel includes this patch? None yet. But I hope it will be merged to 3.11 and backported to the stable trees. > Thank you and everyone involved very much for time and help. > > azur -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1750980Ab3FVUQF (ORCPT ); Sat, 22 Jun 2013 16:16:05 -0400 Received: from gmmr4.centrum.cz ([46.255.227.253]:57519 "EHLO gmmr4.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750756Ab3FVUQD (ORCPT ); Sat, 22 Jun 2013 16:16:03 -0400 X-Greylist: delayed 356 seconds by postgrey-1.27 at vger.kernel.org; Sat, 22 Jun 2013 16:16:02 EDT To: =?utf-8?q?Michal_Hocko?= Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=5D_memcg=3A_do_not_trap_chargers_with_full_callstack_on_OOM?= Date: Sat, 22 Jun 2013 22:09:58 +0200 From: "azurIt" Cc: , , =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= References: <20130208171012.GH7557@dhcp22.suse.cz>, <20130208220243.EDEE0825@pobox.sk>, <20130210150310.GA9504@dhcp22.suse.cz>, <20130210174619.24F20488@pobox.sk>, <20130211112240.GC19922@dhcp22.suse.cz>, <20130222092332.4001E4B6@pobox.sk>, <20130606160446.GE24115@dhcp22.suse.cz>, <20130606181633.BCC3E02E@pobox.sk>, <20130607131157.GF8117@dhcp22.suse.cz>, <20130617122134.2E072BA8@pobox.sk> <20130619132614.GC16457@dhcp22.suse.cz> In-Reply-To: <20130619132614.GC16457@dhcp22.suse.cz> X-Mailer: Centrum Email 5.3 X-Priority: 3 X-Original-From: azurit@pobox.sk MIME-Version: 1.0 Message-Id: <20130622220958.D10567A4@pobox.sk> X-Maser: brud Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: 
linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Michal, >> I'm unable to send you stacks or more info because problem is taking >> down the whole server for some time now (don't know what exactly >> caused it to start happening, maybe newer versions of 3.2.x). > >So you are not testing with the same kernel with just the old patch >replaced by the new one? No, i'm not testing with the same kernel but all are 3.2.x. I cannot even install an older 3.2.x because grsecurity is always available only for the newest kernel and there is no archive of older versions (at least i don't know about any). >> But i'm sure of one thing - when problem occurs, nothing is able to >> access hard drives (every process which tries it is frozen until >> problem is resolved or server is rebooted). > >It would be really interesting to see what those tasks are blocked on. I'm trying to get it, stay tuned :) Today i noticed one bug, not 100% sure it is related to 'your' patch but i haven't seen this before. I noticed that i have lots of cgroups which cannot be removed - if i do 'rmdir ', it just hangs and never completes. Even more, it's not possible to access the whole cgroup filesystem until i kill that rmdir (anything which tries it just hangs). All unremovable cgroups have this in 'memory.oom_control': oom_kill_disable 0 under_oom 1 And, yes, 'tasks' file is empty.
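The check described above (under_oom still set while 'tasks' is empty) could be automated along these lines. This is a minimal sketch, assuming the 'memory.oom_control' contents have already been read into a string and that the number of tasks in the group is known; parse_oom_control() and looks_stuck() are hypothetical helper names, not part of any kernel or cgroup tooling.

```c
/* Sketch: detect a cgroup that is marked under OOM but has no tasks.
 * Helper names are hypothetical; file reading is left to the caller. */
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct oom_control_state {
	int oom_kill_disable;
	int under_oom;
};

/* Parse whitespace separated "key value" pairs as shown in the report.
 * Returns 0 when both fields were found, -1 otherwise. */
int parse_oom_control(const char *text, struct oom_control_state *out)
{
	char key[32];
	int value, consumed, seen = 0;

	assert(text != NULL && out != NULL);
	while (sscanf(text, " %31s %d%n", key, &value, &consumed) == 2) {
		if (strcmp(key, "oom_kill_disable") == 0) {
			out->oom_kill_disable = value;
			seen |= 1;
		} else if (strcmp(key, "under_oom") == 0) {
			out->under_oom = value;
			seen |= 2;
		}
		text += consumed;	/* continue after the parsed pair */
	}
	return seen == 3 ? 0 : -1;
}

/* A group still under OOM with an empty 'tasks' file matches the report. */
int looks_stuck(const struct oom_control_state *st, int ntasks)
{
	return st->under_oom != 0 && ntasks == 0;
}
```

A monitoring script could call parse_oom_control() on the text of each group's 'memory.oom_control' and flag the groups for which looks_stuck() returns non-zero before attempting the rmdir.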
azur From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752961Ab3FXQst (ORCPT ); Mon, 24 Jun 2013 12:48:49 -0400 Received: from gmmr4.centrum.cz ([46.255.227.253]:55167 "EHLO gmmr4.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750958Ab3FXQsr (ORCPT ); Mon, 24 Jun 2013 12:48:47 -0400 To: =?utf-8?q?Michal_Hocko?= Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=5D_memcg=3A_do_not_trap_chargers_with_full_callstack_on_OOM?= Date: Mon, 24 Jun 2013 18:48:40 +0200 From: "azurIt" Cc: , , =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , =?utf-8?q?Johannes_Weiner?= References: <20130208171012.GH7557@dhcp22.suse.cz>, <20130208220243.EDEE0825@pobox.sk>, <20130210150310.GA9504@dhcp22.suse.cz>, <20130210174619.24F20488@pobox.sk>, <20130211112240.GC19922@dhcp22.suse.cz>, <20130222092332.4001E4B6@pobox.sk>, <20130606160446.GE24115@dhcp22.suse.cz>, <20130606181633.BCC3E02E@pobox.sk>, <20130607131157.GF8117@dhcp22.suse.cz>, <20130617122134.2E072BA8@pobox.sk> <20130619132614.GC16457@dhcp22.suse.cz> In-Reply-To: <20130619132614.GC16457@dhcp22.suse.cz> X-Mailer: Centrum Email 5.3 X-Priority: 3 X-Original-From: azurit@pobox.sk MIME-Version: 1.0 Message-Id: <20130624184840.781777E6@pobox.sk> X-Maser: Georgo Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org >I would be really interesting to see what those tasks are blocked on. Ok, i got it! Problem occurs two times and it behaves differently each time, I was running kernel with that latest patch. 1.) It doesn't have impact on the whole server, only on one cgroup. Here are stacks: http://watchdog.sk/lkml/memcg-bug-7.tar.gz 2.) It almost takes down the server because of huge I/O on HDDs. 
Unfortunately, i had a bug in my script which was supposed to gather stacks (i wasn't able to do it by hand like in (1), server was almost inoperable). But I was lucky and somehow killed processes from problematic cgroup (via htop) and server was ok again EXCEPT one important thing - processes from that cgroup were still running in D state and i wasn't able to kill them for good. They were taking web server network ports so i had to reboot the server :( BUT, before that, i gathered stacks: http://watchdog.sk/lkml/memcg-bug-8.tar.gz What do you think? azur From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751891Ab3FXUN6 (ORCPT ); Mon, 24 Jun 2013 16:13:58 -0400 Received: from zene.cmpxchg.org ([85.214.230.12]:48037 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750745Ab3FXUN4 (ORCPT ); Mon, 24 Jun 2013 16:13:56 -0400 Date: Mon, 24 Jun 2013 16:13:45 -0400 From: Johannes Weiner To: azurIt Cc: Michal Hocko , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM Message-ID: <20130624201345.GA21822@cmpxchg.org> References: <20130210150310.GA9504@dhcp22.suse.cz> <20130210174619.24F20488@pobox.sk> <20130211112240.GC19922@dhcp22.suse.cz> <20130222092332.4001E4B6@pobox.sk> <20130606160446.GE24115@dhcp22.suse.cz> <20130606181633.BCC3E02E@pobox.sk> <20130607131157.GF8117@dhcp22.suse.cz> <20130617122134.2E072BA8@pobox.sk> <20130619132614.GC16457@dhcp22.suse.cz> <20130622220958.D10567A4@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130622220958.D10567A4@pobox.sk> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi guys, On Sat, Jun 22, 2013 at 10:09:58PM +0200, azurIt wrote: > >> But i'm sure of one thing - when problem occurs,
nothing is able to > >> access hard drives (every process which tries it is freezed until > >> problem is resolved or server is rebooted). > > > >I would be really interesting to see what those tasks are blocked on. > > I'm trying to get it, stay tuned :) > > Today i noticed one bug, not 100% sure it is related to 'your' patch > but i didn't seen this before. I noticed that i have lots of cgroups > which cannot be removed - if i do 'rmdir ', it > just hangs and never complete. Even more, it's not possible to > access the whole cgroup filesystem until i kill that rmdir > (anything, which tries it, just hangs). All unremoveable cgroups has > this in 'memory.oom_control': oom_kill_disable 0 under_oom 1 Somebody acquires the OOM wait reference to the memcg and marks it under oom but then does not call into mem_cgroup_oom_synchronize() to clean up. That's why under_oom is set and the rmdir waits for outstanding references. > And, yes, 'tasks' file is empty. It's not a kernel thread that does it because all kernel-context handle_mm_fault() are annotated properly, which means the task must be userspace and, since tasks is empty, have exited before synchronizing. Can you try with the following patch on top? diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 5db0490..9a0b152 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -846,17 +846,6 @@ static noinline int mm_fault_error(struct pt_regs *regs, unsigned long error_code, unsigned long address, unsigned int fault) { - /* - * Pagefault was interrupted by SIGKILL. We have no reason to - * continue pagefault. 
- */ - if (fatal_signal_pending(current)) { - if (!(fault & VM_FAULT_RETRY)) - up_read(&current->mm->mmap_sem); - if (!(error_code & PF_USER)) - no_context(regs, error_code, address); - return 1; - } if (!(fault & VM_FAULT_ERROR)) return 0; From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755416Ab3F1KGR (ORCPT ); Fri, 28 Jun 2013 06:06:17 -0400 Received: from gmmr7.centrum.cz ([46.255.225.249]:47627 "EHLO gmmr7.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755185Ab3F1KGQ (ORCPT ); Fri, 28 Jun 2013 06:06:16 -0400 To: =?utf-8?q?Johannes_Weiner?= Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=5D_memcg=3A_do_not_trap_chargers_with_full_callstack_on_OOM?= Date: Fri, 28 Jun 2013 12:06:13 +0200 From: "azurIt" Cc: =?utf-8?q?Michal_Hocko?= , , , =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= References: <20130210150310.GA9504@dhcp22.suse.cz>, <20130210174619.24F20488@pobox.sk>, <20130211112240.GC19922@dhcp22.suse.cz>, <20130222092332.4001E4B6@pobox.sk>, <20130606160446.GE24115@dhcp22.suse.cz>, <20130606181633.BCC3E02E@pobox.sk>, <20130607131157.GF8117@dhcp22.suse.cz>, <20130617122134.2E072BA8@pobox.sk>, <20130619132614.GC16457@dhcp22.suse.cz>, <20130622220958.D10567A4@pobox.sk> <20130624201345.GA21822@cmpxchg.org> In-Reply-To: <20130624201345.GA21822@cmpxchg.org> X-Mailer: Centrum Email 5.3 X-Priority: 3 X-Original-From: azurit@pobox.sk MIME-Version: 1.0 Message-Id: <20130628120613.6D6CAD21@pobox.sk> X-Maser: Georgo Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org >It's not a kernel thread that does it because all kernel-context >handle_mm_fault() are annotated properly, which means the task must be >userspace and, since tasks is empty, have exited before synchronizing. > >Can you try with the following patch on top?
Michal and Johannes, i have some observations which i made: Original patch from Johannes was really fixing something but definitely not everything and was introducing new problems. I'm running unpatched kernel from time i send my last message and problems with freezing cgroups are occuring very often (several times per day) - they were, on the other hand, quite rare with patch from Johannes. Johannes, i didn't try your last patch yet. I would like to wait until you or Michal look at my last message which contained detailed information about freezing of cgroups on kernel running your original patch (which was suppose to fix it for good). Even more, i would like to hear your opinion about that stucked processes which was holding web server port and which forced me to reboot production server at the middle of the day :( more information was in my last message. Thank you very much for your time. azur From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S965019Ab3GESRk (ORCPT ); Fri, 5 Jul 2013 14:17:40 -0400 Received: from zene.cmpxchg.org ([85.214.230.12]:48620 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752212Ab3GESRi (ORCPT ); Fri, 5 Jul 2013 14:17:38 -0400 Date: Fri, 5 Jul 2013 14:17:28 -0400 From: Johannes Weiner To: azurIt Cc: Michal Hocko , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM Message-ID: <20130705181728.GQ17812@cmpxchg.org> References: <20130211112240.GC19922@dhcp22.suse.cz> <20130222092332.4001E4B6@pobox.sk> <20130606160446.GE24115@dhcp22.suse.cz> <20130606181633.BCC3E02E@pobox.sk> <20130607131157.GF8117@dhcp22.suse.cz> <20130617122134.2E072BA8@pobox.sk> <20130619132614.GC16457@dhcp22.suse.cz> <20130622220958.D10567A4@pobox.sk> <20130624201345.GA21822@cmpxchg.org> <20130628120613.6D6CAD21@pobox.sk> MIME-Version: 
1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130628120613.6D6CAD21@pobox.sk> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi azurIt, On Fri, Jun 28, 2013 at 12:06:13PM +0200, azurIt wrote: > >It's not a kernel thread that does it because all kernel-context > >handle_mm_fault() are annotated properly, which means the task must be > >userspace and, since tasks is empty, have exited before synchronizing. > > > >Can you try with the following patch on top? > > > Michal and Johannes, > > i have some observations which i made: Original patch from Johannes > was really fixing something but definitely not everything and was > introducing new problems. I'm running unpatched kernel from time i > send my last message and problems with freezing cgroups are occuring > very often (several times per day) - they were, on the other hand, > quite rare with patch from Johannes. That's good! > Johannes, i didn't try your last patch yet. I would like to wait > until you or Michal look at my last message which contained detailed > information about freezing of cgroups on kernel running your > original patch (which was suppose to fix it for good). Even more, i > would like to hear your opinion about that stucked processes which > was holding web server port and which forced me to reboot production > server at the middle of the day :( more information was in my last > message. Thank you very much for your time. I looked at your debug messages but could not find anything that would hint at a deadlock. All tasks are stuck in the refrigerator, so I assume you use the freezer cgroup and enabled it somehow? Sorry about your production server locking up, but from the stacks I don't see any connection to the OOM problems you were having... 
:/ From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933809Ab3GETCu (ORCPT ); Fri, 5 Jul 2013 15:02:50 -0400 Received: from gmmr2.centrum.cz ([46.255.227.252]:49696 "EHLO gmmr2.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752162Ab3GETCt (ORCPT ); Fri, 5 Jul 2013 15:02:49 -0400 To: =?utf-8?q?Johannes_Weiner?= Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=5D_memcg=3A_do_not_trap_chargers_with_full_callstack_on_OOM?= Date: Fri, 05 Jul 2013 21:02:46 +0200 From: "azurIt" Cc: =?utf-8?q?Michal_Hocko?= , , , =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= References: <20130211112240.GC19922@dhcp22.suse.cz>, <20130222092332.4001E4B6@pobox.sk>, <20130606160446.GE24115@dhcp22.suse.cz>, <20130606181633.BCC3E02E@pobox.sk>, <20130607131157.GF8117@dhcp22.suse.cz>, <20130617122134.2E072BA8@pobox.sk>, <20130619132614.GC16457@dhcp22.suse.cz>, <20130622220958.D10567A4@pobox.sk>, <20130624201345.GA21822@cmpxchg.org>, <20130628120613.6D6CAD21@pobox.sk> <20130705181728.GQ17812@cmpxchg.org> In-Reply-To: <20130705181728.GQ17812@cmpxchg.org> X-Mailer: Centrum Email 5.3 X-Priority: 3 X-Original-From: azurit@pobox.sk MIME-Version: 1.0 Message-Id: <20130705210246.11D2135A@pobox.sk> X-Maser: Georgo Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org >I looked at your debug messages but could not find anything that would >hint at a deadlock. All tasks are stuck in the refrigerator, so I >assume you use the freezer cgroup and enabled it somehow? Yes, i'm really using freezer cgroup BUT i was checking if it's not doing problems - unfortunately, several days passed from that day and now i don't fully remember if i was checking it for both cases (unremoveabled cgroups and these freezed processes holding web server port). 
I'm 100% sure i was checking it for unremoveable cgroups but not so sure for the other problem (i had to act quickly in that case). Are you sure (from stacks) that freezer cgroup was enabled there? Btw, what about that other stacks? I mean this file: http://watchdog.sk/lkml/memcg-bug-7.tar.gz It was taken while running the kernel with your patch and from cgroup which was under unresolveable OOM (just like my very original problem). Thank you! azur From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757556Ab3GETTF (ORCPT ); Fri, 5 Jul 2013 15:19:05 -0400 Received: from zene.cmpxchg.org ([85.214.230.12]:48627 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751934Ab3GETTD (ORCPT ); Fri, 5 Jul 2013 15:19:03 -0400 Date: Fri, 5 Jul 2013 15:18:54 -0400 From: Johannes Weiner To: azurIt Cc: Michal Hocko , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM Message-ID: <20130705191854.GR17812@cmpxchg.org> References: <20130606160446.GE24115@dhcp22.suse.cz> <20130606181633.BCC3E02E@pobox.sk> <20130607131157.GF8117@dhcp22.suse.cz> <20130617122134.2E072BA8@pobox.sk> <20130619132614.GC16457@dhcp22.suse.cz> <20130622220958.D10567A4@pobox.sk> <20130624201345.GA21822@cmpxchg.org> <20130628120613.6D6CAD21@pobox.sk> <20130705181728.GQ17812@cmpxchg.org> <20130705210246.11D2135A@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130705210246.11D2135A@pobox.sk> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Jul 05, 2013 at 09:02:46PM +0200, azurIt wrote: > >I looked at your debug messages but could not find anything that would > >hint at a deadlock. 
All tasks are stuck in the refrigerator, so I > >assume you use the freezer cgroup and enabled it somehow? > > > Yes, i'm really using freezer cgroup BUT i was checking if it's not > doing problems - unfortunately, several days passed from that day > and now i don't fully remember if i was checking it for both cases > (unremoveabled cgroups and these freezed processes holding web > server port). I'm 100% sure i was checking it for unremoveable > cgroups but not so sure for the other problem (i had to act quickly > in that case). Are you sure (from stacks) that freezer cgroup was > enabled there? Yeah, all the traces without exception look like this: 1372089762/23433/stack:[] refrigerator+0x95/0x160 1372089762/23433/stack:[] get_signal_to_deliver+0x1cb/0x540 1372089762/23433/stack:[] do_signal+0x6b/0x750 1372089762/23433/stack:[] do_notify_resume+0x55/0x80 1372089762/23433/stack:[] int_signal+0x12/0x17 1372089762/23433/stack:[] 0xffffffffffffffff so the freezer was already enabled when you took the backtraces. > Btw, what about that other stacks? I mean this file: > http://watchdog.sk/lkml/memcg-bug-7.tar.gz > > It was taken while running the kernel with your patch and from > cgroup which was under unresolveable OOM (just like my very original > problem). I looked at these traces too, but none of the tasks are stuck in rmdir or the OOM path. Some /are/ in the page fault path, but they are happily doing reclaim and don't appear to be stuck. So I'm having a hard time matching this data to what you otherwise observed. However, based on what you reported the most likely explanation for the continued hangs is the unfinished OOM handling for which I sent the followup patch for arch/x86/mm/fault.c. 
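For readers following along, the symptom described in this thread (a group whose 'tasks' file is empty while 'memory.oom_control' keeps reporting under_oom 1) can be checked mechanically. What follows is a minimal sketch and not something posted by anyone in the thread: the function names are hypothetical, and the mount point mentioned in the comments (/sys/fs/cgroup/memory) is an assumption about a cgroup v1 setup. A group that still has tasks is never flagged, since under_oom 1 is expected while a populated group is handling an OOM.

```python
import os


def parse_oom_control(text):
    """Parse 'key value' pairs from the content of memory.oom_control.

    Works whether the pairs are separated by newlines or by spaces.
    """
    tokens = text.split()
    return {tokens[i]: int(tokens[i + 1]) for i in range(0, len(tokens) - 1, 2)}


def looks_stuck(oom_control_text, tasks_text):
    """Return True when the group reports under_oom=1 but lists no tasks."""
    state = parse_oom_control(oom_control_text)
    return state.get("under_oom", 0) == 1 and not tasks_text.split()


def find_stuck_groups(root):
    """Walk a memcg hierarchy and collect directories that look stuck.

    'root' is assumed to be a cgroup v1 memory controller mount,
    for example /sys/fs/cgroup/memory (an assumption, adjust as needed).
    """
    stuck = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if "memory.oom_control" not in filenames or "tasks" not in filenames:
            continue
        try:
            with open(os.path.join(dirpath, "memory.oom_control")) as f:
                oom_text = f.read()
            with open(os.path.join(dirpath, "tasks")) as f:
                tasks_text = f.read()
        except OSError:
            # The group may disappear or be unreadable while we walk; skip it.
            continue
        if looks_stuck(oom_text, tasks_text):
            stuck.append(dirpath)
    return sorted(stuck)
```

Calling find_stuck_groups() on the memory controller mount a few times in a row and keeping only the directories that show up every time would help separate a transient OOM from the permanent state reported here.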
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753336Ab3GGXme (ORCPT ); Sun, 7 Jul 2013 19:42:34 -0400 Received: from gmmr7.centrum.cz ([46.255.225.249]:42814 "EHLO gmmr7.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753288Ab3GGXmc (ORCPT ); Sun, 7 Jul 2013 19:42:32 -0400 To: =?utf-8?q?Johannes_Weiner?= Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=5D_memcg=3A_do_not_trap_chargers_with_full_callstack_on_OOM?= Date: Mon, 08 Jul 2013 01:42:24 +0200 From: "azurIt" Cc: =?utf-8?q?Michal_Hocko?= , , , =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= References: <20130606160446.GE24115@dhcp22.suse.cz>, <20130606181633.BCC3E02E@pobox.sk>, <20130607131157.GF8117@dhcp22.suse.cz>, <20130617122134.2E072BA8@pobox.sk>, <20130619132614.GC16457@dhcp22.suse.cz>, <20130622220958.D10567A4@pobox.sk>, <20130624201345.GA21822@cmpxchg.org>, <20130628120613.6D6CAD21@pobox.sk>, <20130705181728.GQ17812@cmpxchg.org>, <20130705210246.11D2135A@pobox.sk> <20130705191854.GR17812@cmpxchg.org> In-Reply-To: <20130705191854.GR17812@cmpxchg.org> X-Mailer: Centrum Email 5.3 X-Priority: 3 X-Original-From: azurit@pobox.sk MIME-Version: 1.0 Message-Id: <20130708014224.50F06960@pobox.sk> X-Maser: oho Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org > CC: "Michal Hocko" , linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" >On Fri, Jul 05, 2013 at 09:02:46PM +0200, azurIt wrote: >> >I looked at your debug messages but could not find anything that would >> >hint at a deadlock. All tasks are stuck in the refrigerator, so I >> >assume you use the freezer cgroup and enabled it somehow? 
>> >> >> Yes, i'm really using freezer cgroup BUT i was checking if it's not >> doing problems - unfortunately, several days passed from that day >> and now i don't fully remember if i was checking it for both cases >> (unremoveabled cgroups and these freezed processes holding web >> server port). I'm 100% sure i was checking it for unremoveable >> cgroups but not so sure for the other problem (i had to act quickly >> in that case). Are you sure (from stacks) that freezer cgroup was >> enabled there? > >Yeah, all the traces without exception look like this: > >1372089762/23433/stack:[] refrigerator+0x95/0x160 >1372089762/23433/stack:[] get_signal_to_deliver+0x1cb/0x540 >1372089762/23433/stack:[] do_signal+0x6b/0x750 >1372089762/23433/stack:[] do_notify_resume+0x55/0x80 >1372089762/23433/stack:[] int_signal+0x12/0x17 >1372089762/23433/stack:[] 0xffffffffffffffff > >so the freezer was already enabled when you took the backtraces. > >> Btw, what about that other stacks? I mean this file: >> http://watchdog.sk/lkml/memcg-bug-7.tar.gz >> >> It was taken while running the kernel with your patch and from >> cgroup which was under unresolveable OOM (just like my very original >> problem). > >I looked at these traces too, but none of the tasks are stuck in rmdir >or the OOM path. Some /are/ in the page fault path, but they are >happily doing reclaim and don't appear to be stuck. So I'm having a >hard time matching this data to what you otherwise observed. > >However, based on what you reported the most likely explanation for >the continued hangs is the unfinished OOM handling for which I sent >the followup patch for arch/x86/mm/fault.c. > Johannes, today I tested both of your patches but problem with unremovable cgroups, unfortunately, persists. 
azur From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753911Ab3GINKe (ORCPT ); Tue, 9 Jul 2013 09:10:34 -0400 Received: from cantor2.suse.de ([195.135.220.15]:43670 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753717Ab3GINKb (ORCPT ); Tue, 9 Jul 2013 09:10:31 -0400 Date: Tue, 9 Jul 2013 15:10:29 +0200 From: Michal Hocko To: azurIt Cc: Johannes Weiner , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM Message-ID: <20130709131029.GH20281@dhcp22.suse.cz> References: <20130607131157.GF8117@dhcp22.suse.cz> <20130617122134.2E072BA8@pobox.sk> <20130619132614.GC16457@dhcp22.suse.cz> <20130622220958.D10567A4@pobox.sk> <20130624201345.GA21822@cmpxchg.org> <20130628120613.6D6CAD21@pobox.sk> <20130705181728.GQ17812@cmpxchg.org> <20130705210246.11D2135A@pobox.sk> <20130705191854.GR17812@cmpxchg.org> <20130708014224.50F06960@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130708014224.50F06960@pobox.sk> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon 08-07-13 01:42:24, azurIt wrote: > > CC: "Michal Hocko" , linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" > >On Fri, Jul 05, 2013 at 09:02:46PM +0200, azurIt wrote: > >> >I looked at your debug messages but could not find anything that would > >> >hint at a deadlock. All tasks are stuck in the refrigerator, so I > >> >assume you use the freezer cgroup and enabled it somehow? 
> >> > >> > >> Yes, i'm really using freezer cgroup BUT i was checking if it's not > >> doing problems - unfortunately, several days passed from that day > >> and now i don't fully remember if i was checking it for both cases > >> (unremoveabled cgroups and these freezed processes holding web > >> server port). I'm 100% sure i was checking it for unremoveable > >> cgroups but not so sure for the other problem (i had to act quickly > >> in that case). Are you sure (from stacks) that freezer cgroup was > >> enabled there? > > > >Yeah, all the traces without exception look like this: > > > >1372089762/23433/stack:[] refrigerator+0x95/0x160 > >1372089762/23433/stack:[] get_signal_to_deliver+0x1cb/0x540 > >1372089762/23433/stack:[] do_signal+0x6b/0x750 > >1372089762/23433/stack:[] do_notify_resume+0x55/0x80 > >1372089762/23433/stack:[] int_signal+0x12/0x17 > >1372089762/23433/stack:[] 0xffffffffffffffff > > > >so the freezer was already enabled when you took the backtraces. > > > >> Btw, what about that other stacks? I mean this file: > >> http://watchdog.sk/lkml/memcg-bug-7.tar.gz > >> > >> It was taken while running the kernel with your patch and from > >> cgroup which was under unresolveable OOM (just like my very original > >> problem). > > > >I looked at these traces too, but none of the tasks are stuck in rmdir > >or the OOM path. Some /are/ in the page fault path, but they are > >happily doing reclaim and don't appear to be stuck. So I'm having a > >hard time matching this data to what you otherwise observed. Agreed. > >However, based on what you reported the most likely explanation for > >the continued hangs is the unfinished OOM handling for which I sent > >the followup patch for arch/x86/mm/fault.c. > > Johannes, > > today I tested both of your patches but problem with unremovable > cgroups, unfortunately, persists. Is the group empty again with marked under_oom? 
-- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753871Ab3GINAY (ORCPT ); Tue, 9 Jul 2013 09:00:24 -0400 Received: from cantor2.suse.de ([195.135.220.15]:43254 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752846Ab3GINAU (ORCPT ); Tue, 9 Jul 2013 09:00:20 -0400 Date: Tue, 9 Jul 2013 15:00:17 +0200 From: Michal Hocko To: Johannes Weiner Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM Message-ID: <20130709130017.GE20281@dhcp22.suse.cz> References: <20130210174619.24F20488@pobox.sk> <20130211112240.GC19922@dhcp22.suse.cz> <20130222092332.4001E4B6@pobox.sk> <20130606160446.GE24115@dhcp22.suse.cz> <20130606181633.BCC3E02E@pobox.sk> <20130607131157.GF8117@dhcp22.suse.cz> <20130617122134.2E072BA8@pobox.sk> <20130619132614.GC16457@dhcp22.suse.cz> <20130622220958.D10567A4@pobox.sk> <20130624201345.GA21822@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130624201345.GA21822@cmpxchg.org> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon 24-06-13 16:13:45, Johannes Weiner wrote: > Hi guys, > > On Sat, Jun 22, 2013 at 10:09:58PM +0200, azurIt wrote: > > >> But i'm sure of one thing - when problem occurs, nothing is able to > > >> access hard drives (every process which tries it is freezed until > > >> problem is resolved or server is rebooted). > > > > > >I would be really interesting to see what those tasks are blocked on. > > > > I'm trying to get it, stay tuned :) > > > > Today i noticed one bug, not 100% sure it is related to 'your' patch > > but i didn't seen this before. 
I noticed that i have lots of cgroups > > which cannot be removed - if i do 'rmdir ', it > > just hangs and never complete. Even more, it's not possible to > > access the whole cgroup filesystem until i kill that rmdir > > (anything, which tries it, just hangs). All unremoveable cgroups has > > this in 'memory.oom_control': oom_kill_disable 0 under_oom 1 > > Somebody acquires the OOM wait reference to the memcg and marks it > under oom but then does not call into mem_cgroup_oom_synchronize() to > clean up. That's why under_oom is set and the rmdir waits for > outstanding references. > > > And, yes, 'tasks' file is empty. > > It's not a kernel thread that does it because all kernel-context > handle_mm_fault() are annotated properly, which means the task must be > userspace and, since tasks is empty, have exited before synchronizing. Yes, well spotted. I have missed that while reviewing your patch. The follow up fix looks correct. > Can you try with the following patch on top? > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c > index 5db0490..9a0b152 100644 > --- a/arch/x86/mm/fault.c > +++ b/arch/x86/mm/fault.c > @@ -846,17 +846,6 @@ static noinline int > mm_fault_error(struct pt_regs *regs, unsigned long error_code, > unsigned long address, unsigned int fault) > { > - /* > - * Pagefault was interrupted by SIGKILL. We have no reason to > - * continue pagefault. 
> - */ > - if (fatal_signal_pending(current)) { > - if (!(fault & VM_FAULT_RETRY)) > - up_read(&current->mm->mmap_sem); > - if (!(error_code & PF_USER)) > - no_context(regs, error_code, address); > - return 1; > - } > if (!(fault & VM_FAULT_ERROR)) > return 0; > -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753878Ab3GINIQ (ORCPT ); Tue, 9 Jul 2013 09:08:16 -0400 Received: from cantor2.suse.de ([195.135.220.15]:43608 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753782Ab3GINIL (ORCPT ); Tue, 9 Jul 2013 09:08:11 -0400 Date: Tue, 9 Jul 2013 15:08:08 +0200 From: Michal Hocko To: Johannes Weiner Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM Message-ID: <20130709130808.GF20281@dhcp22.suse.cz> References: <20130211112240.GC19922@dhcp22.suse.cz> <20130222092332.4001E4B6@pobox.sk> <20130606160446.GE24115@dhcp22.suse.cz> <20130606181633.BCC3E02E@pobox.sk> <20130607131157.GF8117@dhcp22.suse.cz> <20130617122134.2E072BA8@pobox.sk> <20130619132614.GC16457@dhcp22.suse.cz> <20130622220958.D10567A4@pobox.sk> <20130624201345.GA21822@cmpxchg.org> <20130709130017.GE20281@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130709130017.GE20281@dhcp22.suse.cz> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue 09-07-13 15:00:17, Michal Hocko wrote: > On Mon 24-06-13 16:13:45, Johannes Weiner wrote: > > Hi guys, > > > > On Sat, Jun 22, 2013 at 10:09:58PM +0200, azurIt wrote: > > > >> But i'm sure of one thing - when problem occurs, nothing is able to > > > >> access hard drives (every process which tries it is freezed until > > > >> problem is resolved or
server is rebooted). > > > > > > > >I would be really interesting to see what those tasks are blocked on. > > > > > > I'm trying to get it, stay tuned :) > > > > > > Today i noticed one bug, not 100% sure it is related to 'your' patch > > > but i didn't seen this before. I noticed that i have lots of cgroups > > > which cannot be removed - if i do 'rmdir ', it > > > just hangs and never complete. Even more, it's not possible to > > > access the whole cgroup filesystem until i kill that rmdir > > > (anything, which tries it, just hangs). All unremoveable cgroups has > > > this in 'memory.oom_control': oom_kill_disable 0 under_oom 1 > > > > Somebody acquires the OOM wait reference to the memcg and marks it > > under oom but then does not call into mem_cgroup_oom_synchronize() to > > clean up. That's why under_oom is set and the rmdir waits for > > outstanding references. > > > > > And, yes, 'tasks' file is empty. > > > > It's not a kernel thread that does it because all kernel-context > > handle_mm_fault() are annotated properly, which means the task must be > > userspace and, since tasks is empty, have exited before synchronizing. > > Yes, well spotted. I have missed that while reviewing your patch. > The follow up fix looks correct. Hmm, I guess you wanted to remove !(fault & VM_FAULT_ERROR) test as well otherwise the else BUG() path would be unreachable and we wouldn't know that something fishy is going on. > > Can you try with the following patch on top? > > > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c > > index 5db0490..9a0b152 100644 > > --- a/arch/x86/mm/fault.c > > +++ b/arch/x86/mm/fault.c > > @@ -846,17 +846,6 @@ static noinline int > > mm_fault_error(struct pt_regs *regs, unsigned long error_code, > > unsigned long address, unsigned int fault) > > { > > - /* > > - * Pagefault was interrupted by SIGKILL. We have no reason to > > - * continue pagefault. 
> > - */ > > - if (fatal_signal_pending(current)) { > > - if (!(fault & VM_FAULT_RETRY)) > > - up_read(&current->mm->mmap_sem); > > - if (!(error_code & PF_USER)) > > - no_context(regs, error_code, address); > > - return 1; > > - } > > if (!(fault & VM_FAULT_ERROR)) > > return 0; > > > > -- > Michal Hocko > SUSE Labs > -- > To unsubscribe from this list: send the line "unsubscribe cgroups" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753893Ab3GINKF (ORCPT ); Tue, 9 Jul 2013 09:10:05 -0400 Received: from cantor2.suse.de ([195.135.220.15]:43646 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753717Ab3GINKB (ORCPT ); Tue, 9 Jul 2013 09:10:01 -0400 Date: Tue, 9 Jul 2013 15:10:00 +0200 From: Michal Hocko To: Johannes Weiner Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM Message-ID: <20130709131000.GG20281@dhcp22.suse.cz> References: <20130222092332.4001E4B6@pobox.sk> <20130606160446.GE24115@dhcp22.suse.cz> <20130606181633.BCC3E02E@pobox.sk> <20130607131157.GF8117@dhcp22.suse.cz> <20130617122134.2E072BA8@pobox.sk> <20130619132614.GC16457@dhcp22.suse.cz> <20130622220958.D10567A4@pobox.sk> <20130624201345.GA21822@cmpxchg.org> <20130709130017.GE20281@dhcp22.suse.cz> <20130709130808.GF20281@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130709130808.GF20281@dhcp22.suse.cz> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue 09-07-13 15:08:08, Michal Hocko wrote: > On Tue 09-07-13 15:00:17, Michal Hocko wrote: > > On Mon
24-06-13 16:13:45, Johannes Weiner wrote: > > > Hi guys, > > > > > > On Sat, Jun 22, 2013 at 10:09:58PM +0200, azurIt wrote: > > > > >> But i'm sure of one thing - when problem occurs, nothing is able to > > > > >> access hard drives (every process which tries it is freezed until > > > > >> problem is resolved or server is rebooted). > > > > > > > > > >I would be really interesting to see what those tasks are blocked on. > > > > > > > > I'm trying to get it, stay tuned :) > > > > > > > > Today i noticed one bug, not 100% sure it is related to 'your' patch > > > > but i didn't seen this before. I noticed that i have lots of cgroups > > > > which cannot be removed - if i do 'rmdir ', it > > > > just hangs and never complete. Even more, it's not possible to > > > > access the whole cgroup filesystem until i kill that rmdir > > > > (anything, which tries it, just hangs). All unremoveable cgroups has > > > > this in 'memory.oom_control': oom_kill_disable 0 under_oom 1 > > > > > > Somebody acquires the OOM wait reference to the memcg and marks it > > > under oom but then does not call into mem_cgroup_oom_synchronize() to > > > clean up. That's why under_oom is set and the rmdir waits for > > > outstanding references. > > > > > > > And, yes, 'tasks' file is empty. > > > > > > It's not a kernel thread that does it because all kernel-context > > > handle_mm_fault() are annotated properly, which means the task must be > > > userspace and, since tasks is empty, have exited before synchronizing. > > > > Yes, well spotted. I have missed that while reviewing your patch. > > The follow up fix looks correct. > > Hmm, I guess you wanted to remove !(fault & VM_FAULT_ERROR) test as well > otherwise the else BUG() path would be unreachable and we wouldn't know > that something fishy is going on. No, scratch it! We need it for VM_FAULT_RETRY. Sorry about the noise. 
-- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753853Ab3GINT2 (ORCPT ); Tue, 9 Jul 2013 09:19:28 -0400 Received: from gmmr8.centrum.cz ([46.255.227.254]:43145 "EHLO gmmr8.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752846Ab3GINTY (ORCPT ); Tue, 9 Jul 2013 09:19:24 -0400 To: =?utf-8?q?Michal_Hocko?= Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=5D_memcg=3A_do_not_trap_chargers_with_full_callstack_on_OOM?= Date: Tue, 09 Jul 2013 15:19:21 +0200 From: "azurIt" Cc: =?utf-8?q?Johannes_Weiner?= , , , =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= References: <20130607131157.GF8117@dhcp22.suse.cz>, <20130617122134.2E072BA8@pobox.sk>, <20130619132614.GC16457@dhcp22.suse.cz>, <20130622220958.D10567A4@pobox.sk>, <20130624201345.GA21822@cmpxchg.org>, <20130628120613.6D6CAD21@pobox.sk>, <20130705181728.GQ17812@cmpxchg.org>, <20130705210246.11D2135A@pobox.sk>, <20130705191854.GR17812@cmpxchg.org>, <20130708014224.50F06960@pobox.sk> <20130709131029.GH20281@dhcp22.suse.cz> In-Reply-To: <20130709131029.GH20281@dhcp22.suse.cz> X-Mailer: Centrum Email 5.3 X-Priority: 3 X-Original-From: azurit@pobox.sk MIME-Version: 1.0 Message-Id: <20130709151921.5160C199@pobox.sk> X-Maser: brud Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org >On Mon 08-07-13 01:42:24, azurIt wrote: >> > CC: "Michal Hocko" , linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" >> >On Fri, Jul 05, 2013 at 09:02:46PM +0200, azurIt wrote: >> >> >I looked at your debug messages but could not find anything that would >> >> >hint at a deadlock. All tasks are stuck in the refrigerator, so I >> >> >assume you use the freezer cgroup and enabled it somehow? 
>> >> >> >> >> >> Yes, i'm really using freezer cgroup BUT i was checking if it's not >> >> doing problems - unfortunately, several days passed from that day >> >> and now i don't fully remember if i was checking it for both cases >> >> (unremoveabled cgroups and these freezed processes holding web >> >> server port). I'm 100% sure i was checking it for unremoveable >> >> cgroups but not so sure for the other problem (i had to act quickly >> >> in that case). Are you sure (from stacks) that freezer cgroup was >> >> enabled there? >> > >> >Yeah, all the traces without exception look like this: >> > >> >1372089762/23433/stack:[] refrigerator+0x95/0x160 >> >1372089762/23433/stack:[] get_signal_to_deliver+0x1cb/0x540 >> >1372089762/23433/stack:[] do_signal+0x6b/0x750 >> >1372089762/23433/stack:[] do_notify_resume+0x55/0x80 >> >1372089762/23433/stack:[] int_signal+0x12/0x17 >> >1372089762/23433/stack:[] 0xffffffffffffffff >> > >> >so the freezer was already enabled when you took the backtraces. >> > >> >> Btw, what about that other stacks? I mean this file: >> >> http://watchdog.sk/lkml/memcg-bug-7.tar.gz >> >> >> >> It was taken while running the kernel with your patch and from >> >> cgroup which was under unresolveable OOM (just like my very original >> >> problem). >> > >> >I looked at these traces too, but none of the tasks are stuck in rmdir >> >or the OOM path. Some /are/ in the page fault path, but they are >> >happily doing reclaim and don't appear to be stuck. So I'm having a >> >hard time matching this data to what you otherwise observed. > >Agreed. > >> >However, based on what you reported the most likely explanation for >> >the continued hangs is the unfinished OOM handling for which I sent >> >the followup patch for arch/x86/mm/fault.c. >> >> Johannes, >> >> today I tested both of your patches but problem with unremovable >> cgroups, unfortunately, persists. > >Is the group empty again with marked under_oom? 
Now i realized that i forgot to remove UID from that cgroup before trying to remove it, so cgroup cannot be removed anyway (we are using third party cgroup called cgroup-uid from Andrea Righi, which is able to associate all user's processes with target cgroup). Look here for cgroup-uid patch: https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was permanently '1'. azur From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753666Ab3GINyy (ORCPT ); Tue, 9 Jul 2013 09:54:54 -0400 Received: from cantor2.suse.de ([195.135.220.15]:45123 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752734Ab3GINyx (ORCPT ); Tue, 9 Jul 2013 09:54:53 -0400 Date: Tue, 9 Jul 2013 15:54:50 +0200 From: Michal Hocko To: azurIt Cc: Johannes Weiner , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM Message-ID: <20130709135450.GI20281@dhcp22.suse.cz> References: <20130619132614.GC16457@dhcp22.suse.cz> <20130622220958.D10567A4@pobox.sk> <20130624201345.GA21822@cmpxchg.org> <20130628120613.6D6CAD21@pobox.sk> <20130705181728.GQ17812@cmpxchg.org> <20130705210246.11D2135A@pobox.sk> <20130705191854.GR17812@cmpxchg.org> <20130708014224.50F06960@pobox.sk> <20130709131029.GH20281@dhcp22.suse.cz> <20130709151921.5160C199@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130709151921.5160C199@pobox.sk> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue 09-07-13 15:19:21, azurIt wrote: [...] 
> Now i realized that i forgot to remove UID from that cgroup before
> trying to remove it, so cgroup cannot be removed anyway (we are using
> third party cgroup called cgroup-uid from Andrea Righi, which is able
> to associate all user's processes with target cgroup). Look here for
> cgroup-uid patch:
> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch
>
> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was
> permanently '1'.

This is really strange. Could you post the whole diff against the stable
tree you are using (except for the grsecurity stuff and the above
cgroup-uid patch)?

Btw. the patch below might help us pinpoint the exit path which leaves
wait_on_memcg set without calling mem_cgroup_oom_synchronize:
---
diff --git a/kernel/exit.c b/kernel/exit.c
index e6e01b9..ad472e0 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -895,6 +895,7 @@ NORET_TYPE void do_exit(long code)
 
 	profile_task_exit(tsk);
 
+	WARN_ON(current->memcg_oom.wait_on_memcg);
 	WARN_ON(blk_needs_flush_plug(tsk));
 
 	if (unlikely(in_interrupt()))
-- 
Michal Hocko
SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754997Ab3GJQZL (ORCPT ); Wed, 10 Jul 2013 12:25:11 -0400 Received: from gmmr2.centrum.cz ([46.255.227.252]:57034 "EHLO gmmr2.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754864Ab3GJQZJ (ORCPT ); Wed, 10 Jul 2013 12:25:09 -0400 To: =?utf-8?q?Michal_Hocko?= Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=5D_memcg=3A_do_not_trap_chargers_with_full_callstack_on_OOM?= Date: Wed, 10 Jul 2013 18:25:06 +0200 From: "azurIt" Cc: =?utf-8?q?Johannes_Weiner?= , , , =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , References: <20130619132614.GC16457@dhcp22.suse.cz>, <20130622220958.D10567A4@pobox.sk>, <20130624201345.GA21822@cmpxchg.org>, <20130628120613.6D6CAD21@pobox.sk>, <20130705181728.GQ17812@cmpxchg.org>, <20130705210246.11D2135A@pobox.sk>,
<20130705191854.GR17812@cmpxchg.org>, <20130708014224.50F06960@pobox.sk>, <20130709131029.GH20281@dhcp22.suse.cz>, <20130709151921.5160C199@pobox.sk> <20130709135450.GI20281@dhcp22.suse.cz> In-Reply-To: <20130709135450.GI20281@dhcp22.suse.cz> X-Mailer: Centrum Email 5.3 X-Priority: 3 X-Original-From: azurit@pobox.sk MIME-Version: 1.0 Message-Id: <20130710182506.F25DF461@pobox.sk> X-Maser: oho Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org >> Now i realized that i forgot to remove UID from that cgroup before >> trying to remove it, so cgroup cannot be removed anyway (we are using >> third party cgroup called cgroup-uid from Andrea Righi, which is able >> to associate all user's processes with target cgroup). Look here for >> cgroup-uid patch: >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch >> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was >> permanently '1'. > >This is really strange. Could you post the whole diff against stable >tree you are using (except for grsecurity stuff and the above cgroup-uid >patch)? Here are all patches which i applied to kernel 3.2.48 in my last test: http://watchdog.sk/lkml/patches3/ Patches marked as 7-* are from Johannes. I'm appling them in order except the grsecurity - it goes as first. azur >Btw. 
the below patch might help us to point to the exit path which
>leaves wait_on_memcg without mem_cgroup_oom_synchronize:
>---
>diff --git a/kernel/exit.c b/kernel/exit.c
>index e6e01b9..ad472e0 100644
>--- a/kernel/exit.c
>+++ b/kernel/exit.c
>@@ -895,6 +895,7 @@ NORET_TYPE void do_exit(long code)
> 
> 	profile_task_exit(tsk);
> 
>+	WARN_ON(current->memcg_oom.wait_on_memcg);
> 	WARN_ON(blk_needs_flush_plug(tsk));
> 
> 	if (unlikely(in_interrupt()))
>-- 
>Michal Hocko
>SUSE Labs
>

From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755711Ab3GKHZO (ORCPT ); Thu, 11 Jul 2013 03:25:14 -0400 Received: from cantor2.suse.de ([195.135.220.15]:60570 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755626Ab3GKHZM (ORCPT ); Thu, 11 Jul 2013 03:25:12 -0400 Date: Thu, 11 Jul 2013 09:25:07 +0200 From: Michal Hocko To: azurIt Cc: Johannes Weiner , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , righi.andrea@gmail.com Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM Message-ID: <20130711072507.GA21667@dhcp22.suse.cz> References: <20130624201345.GA21822@cmpxchg.org> <20130628120613.6D6CAD21@pobox.sk> <20130705181728.GQ17812@cmpxchg.org> <20130705210246.11D2135A@pobox.sk> <20130705191854.GR17812@cmpxchg.org> <20130708014224.50F06960@pobox.sk> <20130709131029.GH20281@dhcp22.suse.cz> <20130709151921.5160C199@pobox.sk> <20130709135450.GI20281@dhcp22.suse.cz> <20130710182506.F25DF461@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130710182506.F25DF461@pobox.sk> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org

On Wed 10-07-13 18:25:06, azurIt wrote:
> >> Now i realized that i forgot to remove UID from that cgroup before
> >> trying to remove it, so cgroup cannot
be removed anyway (we are using > >> third party cgroup called cgroup-uid from Andrea Righi, which is able > >> to associate all user's processes with target cgroup). Look here for > >> cgroup-uid patch: > >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch > >> > >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was > >> permanently '1'. > > > >This is really strange. Could you post the whole diff against stable > >tree you are using (except for grsecurity stuff and the above cgroup-uid > >patch)? > > > Here are all patches which i applied to kernel 3.2.48 in my last test: > http://watchdog.sk/lkml/patches3/ The two patches from Johannes seem correct. >>From a quick look even grsecurity patchset shouldn't interfere as it doesn't seem to put any code between handle_mm_fault and mm_fault_error and there also doesn't seem to be any new handle_mm_fault call sites. But I cannot tell there aren't other code paths which would lead to a memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling. 
-- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752020Ab3GMXvV (ORCPT ); Sat, 13 Jul 2013 19:51:21 -0400 Received: from gmmr2.centrum.cz ([46.255.227.252]:59670 "EHLO gmmr2.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751684Ab3GMXvU (ORCPT ); Sat, 13 Jul 2013 19:51:20 -0400 To: =?utf-8?q?Michal_Hocko?= Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=5D_memcg=3A_do_not_trap_chargers_with_full_callstack_on_OOM?= Date: Sun, 14 Jul 2013 01:51:12 +0200 From: "azurIt" Cc: =?utf-8?q?Johannes_Weiner?= , , , =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , References: <20130624201345.GA21822@cmpxchg.org>, <20130628120613.6D6CAD21@pobox.sk>, <20130705181728.GQ17812@cmpxchg.org>, <20130705210246.11D2135A@pobox.sk>, <20130705191854.GR17812@cmpxchg.org>, <20130708014224.50F06960@pobox.sk>, <20130709131029.GH20281@dhcp22.suse.cz>, <20130709151921.5160C199@pobox.sk>, <20130709135450.GI20281@dhcp22.suse.cz>, <20130710182506.F25DF461@pobox.sk>, <20130711072507.GA21667@dhcp22.suse.cz> <20130714012641.C2DA4E05@pobox.sk> In-Reply-To: <20130714012641.C2DA4E05@pobox.sk> X-Mailer: Centrum Email 5.3 X-Priority: 3 X-Original-From: azurit@pobox.sk MIME-Version: 1.0 Message-Id: <20130714015112.FFCB7AF7@pobox.sk> X-Maser: brud Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org > CC: "Johannes Weiner" , linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , righi.andrea@gmail.com >> CC: "Johannes Weiner" , linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , righi.andrea@gmail.com >>On Wed 10-07-13 18:25:06, azurIt wrote: >>> >> Now i realized that i forgot to remove UID from that cgroup before >>> >> trying to remove it, so cgroup cannot be removed 
anyway (we are using >>> >> third party cgroup called cgroup-uid from Andrea Righi, which is able >>> >> to associate all user's processes with target cgroup). Look here for >>> >> cgroup-uid patch: >>> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch >>> >> >>> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was >>> >> permanently '1'. >>> > >>> >This is really strange. Could you post the whole diff against stable >>> >tree you are using (except for grsecurity stuff and the above cgroup-uid >>> >patch)? >>> >>> >>> Here are all patches which i applied to kernel 3.2.48 in my last test: >>> http://watchdog.sk/lkml/patches3/ >> >>The two patches from Johannes seem correct. >> >>>From a quick look even grsecurity patchset shouldn't interfere as it >>doesn't seem to put any code between handle_mm_fault and mm_fault_error >>and there also doesn't seem to be any new handle_mm_fault call sites. >> >>But I cannot tell there aren't other code paths which would lead to a >>memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling. > > >Michal, > >now i can definitely confirm that problem with unremovable cgroups persists. What info do you need from me? I applied also your little 'WARN_ON' patch. 
> >azur Ok, i think you want this: http://watchdog.sk/lkml/kern4.log From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751931Ab3GMX0p (ORCPT ); Sat, 13 Jul 2013 19:26:45 -0400 Received: from gmmr3.centrum.cz ([46.255.225.251]:34208 "EHLO gmmr3.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751127Ab3GMX0n (ORCPT ); Sat, 13 Jul 2013 19:26:43 -0400 To: =?utf-8?q?Michal_Hocko?= Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=5D_memcg=3A_do_not_trap_chargers_with_full_callstack_on_OOM?= Date: Sun, 14 Jul 2013 01:26:41 +0200 From: "azurIt" Cc: =?utf-8?q?Johannes_Weiner?= , , , =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , References: <20130624201345.GA21822@cmpxchg.org>, <20130628120613.6D6CAD21@pobox.sk>, <20130705181728.GQ17812@cmpxchg.org>, <20130705210246.11D2135A@pobox.sk>, <20130705191854.GR17812@cmpxchg.org>, <20130708014224.50F06960@pobox.sk>, <20130709131029.GH20281@dhcp22.suse.cz>, <20130709151921.5160C199@pobox.sk>, <20130709135450.GI20281@dhcp22.suse.cz>, <20130710182506.F25DF461@pobox.sk> <20130711072507.GA21667@dhcp22.suse.cz> In-Reply-To: <20130711072507.GA21667@dhcp22.suse.cz> X-Mailer: Centrum Email 5.3 X-Priority: 3 X-Original-From: azurit@pobox.sk MIME-Version: 1.0 Message-Id: <20130714012641.C2DA4E05@pobox.sk> X-Maser: brud Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org > CC: "Johannes Weiner" , linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , righi.andrea@gmail.com >On Wed 10-07-13 18:25:06, azurIt wrote: >> >> Now i realized that i forgot to remove UID from that cgroup before >> >> trying to remove it, so cgroup cannot be removed anyway (we are using >> >> third party cgroup called cgroup-uid from Andrea Righi, which is able >> >> to associate all user's processes 
with target cgroup). Look here for >> >> cgroup-uid patch: >> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch >> >> >> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was >> >> permanently '1'. >> > >> >This is really strange. Could you post the whole diff against stable >> >tree you are using (except for grsecurity stuff and the above cgroup-uid >> >patch)? >> >> >> Here are all patches which i applied to kernel 3.2.48 in my last test: >> http://watchdog.sk/lkml/patches3/ > >The two patches from Johannes seem correct. > >>From a quick look even grsecurity patchset shouldn't interfere as it >doesn't seem to put any code between handle_mm_fault and mm_fault_error >and there also doesn't seem to be any new handle_mm_fault call sites. > >But I cannot tell there aren't other code paths which would lead to a >memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling. Michal, now i can definitely confirm that problem with unremovable cgroups persists. What info do you need from me? I applied also your little 'WARN_ON' patch. 
azur From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752808Ab3GNRHe (ORCPT ); Sun, 14 Jul 2013 13:07:34 -0400 Received: from gmmr3.centrum.cz ([46.255.225.251]:38177 "EHLO gmmr3.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752772Ab3GNRHc (ORCPT ); Sun, 14 Jul 2013 13:07:32 -0400 To: =?utf-8?q?Johannes_Weiner?= Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=5D_memcg=3A_do_not_trap_chargers_with_full_callstack_on_OOM?= Date: Sun, 14 Jul 2013 19:07:23 +0200 From: "azurIt" Cc: =?utf-8?q?Michal_Hocko?= , , , =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= References: <20130606160446.GE24115@dhcp22.suse.cz>, <20130606181633.BCC3E02E@pobox.sk>, <20130607131157.GF8117@dhcp22.suse.cz>, <20130617122134.2E072BA8@pobox.sk>, <20130619132614.GC16457@dhcp22.suse.cz>, <20130622220958.D10567A4@pobox.sk>, <20130624201345.GA21822@cmpxchg.org>, <20130628120613.6D6CAD21@pobox.sk>, <20130705181728.GQ17812@cmpxchg.org>, <20130705210246.11D2135A@pobox.sk> <20130705191854.GR17812@cmpxchg.org> In-Reply-To: <20130705191854.GR17812@cmpxchg.org> X-Mailer: Centrum Email 5.3 X-Priority: 3 X-Original-From: azurit@pobox.sk MIME-Version: 1.0 Message-Id: <20130714190723.BF406E48@pobox.sk> X-Maser: brud Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org > CC: "Michal Hocko" , linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" >On Fri, Jul 05, 2013 at 09:02:46PM +0200, azurIt wrote: >> >I looked at your debug messages but could not find anything that would >> >hint at a deadlock. All tasks are stuck in the refrigerator, so I >> >assume you use the freezer cgroup and enabled it somehow? 
>> >> >> Yes, i'm really using freezer cgroup BUT i was checking if it's not >> doing problems - unfortunately, several days passed from that day >> and now i don't fully remember if i was checking it for both cases >> (unremoveabled cgroups and these freezed processes holding web >> server port). I'm 100% sure i was checking it for unremoveable >> cgroups but not so sure for the other problem (i had to act quickly >> in that case). Are you sure (from stacks) that freezer cgroup was >> enabled there? > >Yeah, all the traces without exception look like this: > >1372089762/23433/stack:[] refrigerator+0x95/0x160 >1372089762/23433/stack:[] get_signal_to_deliver+0x1cb/0x540 >1372089762/23433/stack:[] do_signal+0x6b/0x750 >1372089762/23433/stack:[] do_notify_resume+0x55/0x80 >1372089762/23433/stack:[] int_signal+0x12/0x17 >1372089762/23433/stack:[] 0xffffffffffffffff > >so the freezer was already enabled when you took the backtraces. > >> Btw, what about that other stacks? I mean this file: >> http://watchdog.sk/lkml/memcg-bug-7.tar.gz >> >> It was taken while running the kernel with your patch and from >> cgroup which was under unresolveable OOM (just like my very original >> problem). > >I looked at these traces too, but none of the tasks are stuck in rmdir >or the OOM path. Some /are/ in the page fault path, but they are >happily doing reclaim and don't appear to be stuck. So I'm having a >hard time matching this data to what you otherwise observed. > >However, based on what you reported the most likely explanation for >the continued hangs is the unfinished OOM handling for which I sent >the followup patch for arch/x86/mm/fault.c. Johannes, this problem happened again but was even worse, now i'm sure it wasn't my fault. This time I even wasn't able to access /proc/ of hanged apache process (which was, again, helding web server port and forced me to reboot the server). Everything which tried to access /proc/ just hanged. 
Server even wasn't able to reboot correctly, it hanged and then done a hard reboot after few minutes. azur From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933241Ab3GOPlY (ORCPT ); Mon, 15 Jul 2013 11:41:24 -0400 Received: from cantor2.suse.de ([195.135.220.15]:47061 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933046Ab3GOPlV (ORCPT ); Mon, 15 Jul 2013 11:41:21 -0400 Date: Mon, 15 Jul 2013 17:41:19 +0200 From: Michal Hocko To: azurIt Cc: Johannes Weiner , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , righi.andrea@gmail.com Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM Message-ID: <20130715154119.GA32435@dhcp22.suse.cz> References: <20130705210246.11D2135A@pobox.sk> <20130705191854.GR17812@cmpxchg.org> <20130708014224.50F06960@pobox.sk> <20130709131029.GH20281@dhcp22.suse.cz> <20130709151921.5160C199@pobox.sk> <20130709135450.GI20281@dhcp22.suse.cz> <20130710182506.F25DF461@pobox.sk> <20130711072507.GA21667@dhcp22.suse.cz> <20130714012641.C2DA4E05@pobox.sk> <20130714015112.FFCB7AF7@pobox.sk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130714015112.FFCB7AF7@pobox.sk> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sun 14-07-13 01:51:12, azurIt wrote: > > CC: "Johannes Weiner" , linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , righi.andrea@gmail.com > >> CC: "Johannes Weiner" , linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , righi.andrea@gmail.com > >>On Wed 10-07-13 18:25:06, azurIt wrote: > >>> >> Now i realized that i forgot to remove UID from that cgroup before > >>> >> trying to remove it, so cgroup cannot be removed anyway (we are 
using > >>> >> third party cgroup called cgroup-uid from Andrea Righi, which is able > >>> >> to associate all user's processes with target cgroup). Look here for > >>> >> cgroup-uid patch: > >>> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch > >>> >> > >>> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was > >>> >> permanently '1'. > >>> > > >>> >This is really strange. Could you post the whole diff against stable > >>> >tree you are using (except for grsecurity stuff and the above cgroup-uid > >>> >patch)? > >>> > >>> > >>> Here are all patches which i applied to kernel 3.2.48 in my last test: > >>> http://watchdog.sk/lkml/patches3/ > >> > >>The two patches from Johannes seem correct. > >> > >>From a quick look even grsecurity patchset shouldn't interfere as it > >>doesn't seem to put any code between handle_mm_fault and mm_fault_error > >>and there also doesn't seem to be any new handle_mm_fault call sites. > >> > >>But I cannot tell there aren't other code paths which would lead to a > >>memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling. > > > > > >Michal, > > > >now i can definitely confirm that problem with unremovable cgroups > >persists. What info do you need from me? I applied also your little > >'WARN_ON' patch. 
> > Ok, i think you want this: > http://watchdog.sk/lkml/kern4.log Jul 14 01:11:39 server01 kernel: [ 593.589087] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name Jul 14 01:11:39 server01 kernel: [ 593.589451] [12021] 1333 12021 172027 64723 4 0 0 apache2 Jul 14 01:11:39 server01 kernel: [ 593.589647] [12030] 1333 12030 172030 64748 2 0 0 apache2 Jul 14 01:11:39 server01 kernel: [ 593.589836] [12031] 1333 12031 172030 64749 3 0 0 apache2 Jul 14 01:11:39 server01 kernel: [ 593.590025] [12032] 1333 12032 170619 63428 3 0 0 apache2 Jul 14 01:11:39 server01 kernel: [ 593.590213] [12033] 1333 12033 167934 60524 2 0 0 apache2 Jul 14 01:11:39 server01 kernel: [ 593.590401] [12034] 1333 12034 170747 63496 4 0 0 apache2 Jul 14 01:11:39 server01 kernel: [ 593.590588] [12035] 1333 12035 169659 62451 1 0 0 apache2 Jul 14 01:11:39 server01 kernel: [ 593.590776] [12036] 1333 12036 167614 60384 3 0 0 apache2 Jul 14 01:11:39 server01 kernel: [ 593.590984] [12037] 1333 12037 166342 58964 3 0 0 apache2 Jul 14 01:11:39 server01 kernel: [ 593.591178] Memory cgroup out of memory: Kill process 12021 (apache2) score 847 or sacrifice child Jul 14 01:11:39 server01 kernel: [ 593.591370] Killed process 12021 (apache2) total-vm:688108kB, anon-rss:255472kB, file-rss:3420kB Jul 14 01:11:41 server01 kernel: [ 595.392920] ------------[ cut here ]------------ Jul 14 01:11:41 server01 kernel: [ 595.393096] WARNING: at kernel/exit.c:888 do_exit+0x7d0/0x870() Jul 14 01:11:41 server01 kernel: [ 595.393256] Hardware name: S5000VSA Jul 14 01:11:41 server01 kernel: [ 595.393415] Pid: 12037, comm: apache2 Not tainted 3.2.48-grsec #1 Jul 14 01:11:41 server01 kernel: [ 595.393577] Call Trace: Jul 14 01:11:41 server01 kernel: [ 595.393737] [] warn_slowpath_common+0x7a/0xb0 Jul 14 01:11:41 server01 kernel: [ 595.393903] [] warn_slowpath_null+0x1a/0x20 Jul 14 01:11:41 server01 kernel: [ 595.394068] [] do_exit+0x7d0/0x870 Jul 14 01:11:41 server01 kernel: [ 595.394231] [] ? 
thread_group_times+0x44/0xb0
Jul 14 01:11:41 server01 kernel: [ 595.394392] [] do_group_exit+0x51/0xc0
Jul 14 01:11:41 server01 kernel: [ 595.394551] [] sys_exit_group+0x17/0x20
Jul 14 01:11:41 server01 kernel: [ 595.394714] [] system_call_fastpath+0x18/0x1d
Jul 14 01:11:41 server01 kernel: [ 595.394921] ---[ end trace 738570e688acf099 ]---

OK, so you had an OOM which was handled by the in-kernel oom handler
(it killed 12021), and 12037 was in the same group. The warning tells us
that 12037 went through mem_cgroup_oom as well (otherwise it wouldn't have
memcg_oom.wait_on_memcg set and the warning wouldn't trigger) and then
it exited on a userspace request (by the exit syscall).

I do not see how this could happen, though. If mem_cgroup_oom
is called then we always return CHARGE_NOMEM, which turns into ENOMEM
returned by __mem_cgroup_try_charge (invoke_oom must have been set to
true). So unless somebody screwed up the return value on the way up to
the page fault handler, there is no way to escape.

I will check the code.
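
The state this reasoning is about can be sketched in a few lines of userspace C. This is only a toy model under stated assumptions: memcg_model, task_model and every *_model function are hypothetical stand-ins invented for illustration, not the kernel's structures or the code from the patches. It shows why a task that records a pending memcg OOM and then exits without the synchronize step both trips the WARN_ON in do_exit and leaves the group marked under_oom, which is the state in which rmdir was reported to hang.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a memory cgroup; NOT the kernel's struct mem_cgroup. */
struct memcg_model {
	int under_oom;	/* how many tasks currently mark the group as under OOM */
};

/* Toy stand-in for the per-task memcg_oom.wait_on_memcg pointer. */
struct task_model {
	struct memcg_model *wait_on_memcg;
};

/* Charge path: on OOM, mark the group, remember it in the task, fail with -ENOMEM. */
int charge_model(struct task_model *t, struct memcg_model *m, bool over_limit)
{
	if (!over_limit)
		return 0;
	m->under_oom++;
	t->wait_on_memcg = m;
	return -ENOMEM;
}

/* Cleanup the fault handler is expected to run after seeing the error. */
void oom_synchronize_model(struct task_model *t)
{
	if (t->wait_on_memcg == NULL)
		return;
	assert(t->wait_on_memcg->under_oom > 0);
	t->wait_on_memcg->under_oom--;
	t->wait_on_memcg = NULL;
}

/* Mirrors the condition checked by the debugging WARN_ON in do_exit(). */
bool exit_would_warn(const struct task_model *t)
{
	return t->wait_on_memcg != NULL;
}

/* A group still marked under OOM is what keeps rmdir waiting in this model. */
bool rmdir_would_wait(const struct memcg_model *m)
{
	return m->under_oom > 0;
}
```

In this model, calling charge_model() with over_limit set and then checking exit_would_warn() without an intervening oom_synchronize_model() reproduces the reported combination: a warning at exit and a group that stays under_oom.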
-- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933367Ab3GOQAL (ORCPT ); Mon, 15 Jul 2013 12:00:11 -0400 Received: from cantor2.suse.de ([195.135.220.15]:47752 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932635Ab3GOQAI (ORCPT ); Mon, 15 Jul 2013 12:00:08 -0400 Date: Mon, 15 Jul 2013 18:00:06 +0200 From: Michal Hocko To: azurIt Cc: Johannes Weiner , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , righi.andrea@gmail.com Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM Message-ID: <20130715160006.GB32435@dhcp22.suse.cz> References: <20130705191854.GR17812@cmpxchg.org> <20130708014224.50F06960@pobox.sk> <20130709131029.GH20281@dhcp22.suse.cz> <20130709151921.5160C199@pobox.sk> <20130709135450.GI20281@dhcp22.suse.cz> <20130710182506.F25DF461@pobox.sk> <20130711072507.GA21667@dhcp22.suse.cz> <20130714012641.C2DA4E05@pobox.sk> <20130714015112.FFCB7AF7@pobox.sk> <20130715154119.GA32435@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130715154119.GA32435@dhcp22.suse.cz> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon 15-07-13 17:41:19, Michal Hocko wrote: > On Sun 14-07-13 01:51:12, azurIt wrote: > > > CC: "Johannes Weiner" , linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , righi.andrea@gmail.com > > >> CC: "Johannes Weiner" , linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , righi.andrea@gmail.com > > >>On Wed 10-07-13 18:25:06, azurIt wrote: > > >>> >> Now i realized that i forgot to remove UID from that cgroup before > > >>> >> trying to remove it, so cgroup cannot be removed anyway (we are using > > 
>>> >> third party cgroup called cgroup-uid from Andrea Righi, which is able > > >>> >> to associate all user's processes with target cgroup). Look here for > > >>> >> cgroup-uid patch: > > >>> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch > > >>> >> > > >>> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was > > >>> >> permanently '1'. > > >>> > > > >>> >This is really strange. Could you post the whole diff against stable > > >>> >tree you are using (except for grsecurity stuff and the above cgroup-uid > > >>> >patch)? > > >>> > > >>> > > >>> Here are all patches which i applied to kernel 3.2.48 in my last test: > > >>> http://watchdog.sk/lkml/patches3/ > > >> > > >>The two patches from Johannes seem correct. > > >> > > >>From a quick look even grsecurity patchset shouldn't interfere as it > > >>doesn't seem to put any code between handle_mm_fault and mm_fault_error > > >>and there also doesn't seem to be any new handle_mm_fault call sites. > > >> > > >>But I cannot tell there aren't other code paths which would lead to a > > >>memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling. > > > > > > > > >Michal, > > > > > >now i can definitely confirm that problem with unremovable cgroups > > >persists. What info do you need from me? I applied also your little > > >'WARN_ON' patch. 
> > > > Ok, i think you want this: > > http://watchdog.sk/lkml/kern4.log > > Jul 14 01:11:39 server01 kernel: [ 593.589087] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name > Jul 14 01:11:39 server01 kernel: [ 593.589451] [12021] 1333 12021 172027 64723 4 0 0 apache2 > Jul 14 01:11:39 server01 kernel: [ 593.589647] [12030] 1333 12030 172030 64748 2 0 0 apache2 > Jul 14 01:11:39 server01 kernel: [ 593.589836] [12031] 1333 12031 172030 64749 3 0 0 apache2 > Jul 14 01:11:39 server01 kernel: [ 593.590025] [12032] 1333 12032 170619 63428 3 0 0 apache2 > Jul 14 01:11:39 server01 kernel: [ 593.590213] [12033] 1333 12033 167934 60524 2 0 0 apache2 > Jul 14 01:11:39 server01 kernel: [ 593.590401] [12034] 1333 12034 170747 63496 4 0 0 apache2 > Jul 14 01:11:39 server01 kernel: [ 593.590588] [12035] 1333 12035 169659 62451 1 0 0 apache2 > Jul 14 01:11:39 server01 kernel: [ 593.590776] [12036] 1333 12036 167614 60384 3 0 0 apache2 > Jul 14 01:11:39 server01 kernel: [ 593.590984] [12037] 1333 12037 166342 58964 3 0 0 apache2 > Jul 14 01:11:39 server01 kernel: [ 593.591178] Memory cgroup out of memory: Kill process 12021 (apache2) score 847 or sacrifice child > Jul 14 01:11:39 server01 kernel: [ 593.591370] Killed process 12021 (apache2) total-vm:688108kB, anon-rss:255472kB, file-rss:3420kB > Jul 14 01:11:41 server01 kernel: [ 595.392920] ------------[ cut here ]------------ > Jul 14 01:11:41 server01 kernel: [ 595.393096] WARNING: at kernel/exit.c:888 do_exit+0x7d0/0x870() > Jul 14 01:11:41 server01 kernel: [ 595.393256] Hardware name: S5000VSA > Jul 14 01:11:41 server01 kernel: [ 595.393415] Pid: 12037, comm: apache2 Not tainted 3.2.48-grsec #1 > Jul 14 01:11:41 server01 kernel: [ 595.393577] Call Trace: > Jul 14 01:11:41 server01 kernel: [ 595.393737] [] warn_slowpath_common+0x7a/0xb0 > Jul 14 01:11:41 server01 kernel: [ 595.393903] [] warn_slowpath_null+0x1a/0x20 > Jul 14 01:11:41 server01 kernel: [ 595.394068] [] do_exit+0x7d0/0x870 > Jul 14 01:11:41 server01 
kernel: [ 595.394231] [] ? thread_group_times+0x44/0xb0 > Jul 14 01:11:41 server01 kernel: [ 595.394392] [] do_group_exit+0x51/0xc0 > Jul 14 01:11:41 server01 kernel: [ 595.394551] [] sys_exit_group+0x17/0x20 > Jul 14 01:11:41 server01 kernel: [ 595.394714] [] system_call_fastpath+0x18/0x1d > Jul 14 01:11:41 server01 kernel: [ 595.394921] ---[ end trace 738570e688acf099 ]--- > > OK, so you had an OOM which has been handled by in-kernel oom handler > (it killed 12021) and 12037 was in the same group. The warning tells us > that it went through mem_cgroup_oom as well (otherwise it wouldn't have > memcg_oom.wait_on_memcg set and the warning wouldn't trigger) and then > it exited on the userspace request (by exit syscall). > > I do not see any way how, this could happen though. If mem_cgroup_oom > is called then we always return CHARGE_NOMEM which turns into ENOMEM > returned by __mem_cgroup_try_charge (invoke_oom must have been set to > true). So if nobody screwed the return value on the way up to page > fault handler then there is no way to escape. > > I will check the code. OK, I guess I found it: __do_fault fault = filemap_fault do_async_mmap_readahead page_cache_async_readahead ondemand_readahead __do_page_cache_readahead read_pages readpages = ext3_readpages mpage_readpages # Doesn't propagate ENOMEM add_to_page_cache_lru add_to_page_cache add_to_page_cache_locked mem_cgroup_cache_charge So the read ahead most probably. Again! Duhhh. I will try to think about a fix for this. One obvious place is mpage_readpages but __do_page_cache_readahead ignores read_pages return value as well and page_cache_async_readahead, even worse, is just void and exported as such. So this smells like a hard to fix bugger. One possible, and really ugly way would be calling mem_cgroup_oom_synchronize even if handle_mm_fault doesn't return VM_FAULT_ERROR, but that is a crude hack. 
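To make the suspected leak concrete, here is a minimal user-space sketch of the pattern described above. It is not kernel code: task_model, charge_model, readahead_model and fault_model are hypothetical stand-ins invented for illustration, and the only assumption taken from the thread is that a failed charge arms a per-task OOM context which is consumed only when the fault path itself sees an error.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-task OOM state (not the kernel struct). */
struct task_model {
    bool oom_context_armed;   /* plays the role of memcg_oom.wait_on_memcg */
};

/* Models a charge: on failure, arm the OOM context and return -ENOMEM. */
int charge_model(struct task_model *t, bool over_limit)
{
    if (!over_limit)
        return 0;
    t->oom_context_armed = true;
    return -ENOMEM;
}

/* Models a readahead helper that is void, so the charge error is dropped. */
void readahead_model(struct task_model *t, bool over_limit)
{
    (void)charge_model(t, over_limit);   /* -ENOMEM is lost here */
}

/* Models the fault path: the context is only cleaned up on an error return. */
int fault_model(struct task_model *t, bool page_over_limit, bool ra_over_limit)
{
    int ret;

    readahead_model(t, ra_over_limit);        /* optional pages */
    ret = charge_model(t, page_over_limit);   /* the page actually needed */
    if (ret == -ENOMEM)
        t->oom_context_armed = false;         /* synchronize and clean up */
    return ret;
}

/* Returns true if the task leaves the fault with the context still armed. */
bool leaks_oom_context(bool page_over_limit, bool ra_over_limit)
{
    struct task_model t = { .oom_context_armed = false };

    (void)fault_model(&t, page_over_limit, ra_over_limit);
    return t.oom_context_armed;
}

/* Exercises the four combinations; only "readahead fails, fault succeeds" leaks. */
void check_model(void)
{
    assert(leaks_oom_context(false, true));
    assert(!leaks_oom_context(false, false));
    assert(!leaks_oom_context(true, true));
    assert(!leaks_oom_context(true, false));
}
```

Calling check_model() from a small main() runs all four cases. The one leaking case corresponds to a task that later reaches exit with the context still set, which is the state the WARN_ON in do_exit reports.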
-- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932993Ab3GPPgU (ORCPT ); Tue, 16 Jul 2013 11:36:20 -0400 Received: from zene.cmpxchg.org ([85.214.230.12]:50309 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932540Ab3GPPgR (ORCPT ); Tue, 16 Jul 2013 11:36:17 -0400 Date: Tue, 16 Jul 2013 11:35:44 -0400 From: Johannes Weiner To: Michal Hocko Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , righi.andrea@gmail.com Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM Message-ID: <20130716153544.GX17812@cmpxchg.org> References: <20130708014224.50F06960@pobox.sk> <20130709131029.GH20281@dhcp22.suse.cz> <20130709151921.5160C199@pobox.sk> <20130709135450.GI20281@dhcp22.suse.cz> <20130710182506.F25DF461@pobox.sk> <20130711072507.GA21667@dhcp22.suse.cz> <20130714012641.C2DA4E05@pobox.sk> <20130714015112.FFCB7AF7@pobox.sk> <20130715154119.GA32435@dhcp22.suse.cz> <20130715160006.GB32435@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130715160006.GB32435@dhcp22.suse.cz> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Jul 15, 2013 at 06:00:06PM +0200, Michal Hocko wrote: > On Mon 15-07-13 17:41:19, Michal Hocko wrote: > > On Sun 14-07-13 01:51:12, azurIt wrote: > > > > CC: "Johannes Weiner" , linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , righi.andrea@gmail.com > > > >> CC: "Johannes Weiner" , linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , righi.andrea@gmail.com > > > >>On Wed 10-07-13 18:25:06, azurIt wrote: > > > >>> >> Now i realized that i forgot to remove UID from that cgroup before > > > >>> >> trying to remove it, so cgroup 
cannot be removed anyway (we are using > > > >>> >> third party cgroup called cgroup-uid from Andrea Righi, which is able > > > >>> >> to associate all user's processes with target cgroup). Look here for > > > >>> >> cgroup-uid patch: > > > >>> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch > > > >>> >> > > > >>> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was > > > >>> >> permanently '1'. > > > >>> > > > > >>> >This is really strange. Could you post the whole diff against stable > > > >>> >tree you are using (except for grsecurity stuff and the above cgroup-uid > > > >>> >patch)? > > > >>> > > > >>> > > > >>> Here are all patches which i applied to kernel 3.2.48 in my last test: > > > >>> http://watchdog.sk/lkml/patches3/ > > > >> > > > >>The two patches from Johannes seem correct. > > > >> > > > >>From a quick look even grsecurity patchset shouldn't interfere as it > > > >>doesn't seem to put any code between handle_mm_fault and mm_fault_error > > > >>and there also doesn't seem to be any new handle_mm_fault call sites. > > > >> > > > >>But I cannot tell there aren't other code paths which would lead to a > > > >>memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling. > > > > > > > > > > > >Michal, > > > > > > > >now i can definitely confirm that problem with unremovable cgroups > > > >persists. What info do you need from me? I applied also your little > > > >'WARN_ON' patch. 
> > > > > > Ok, i think you want this: > > > http://watchdog.sk/lkml/kern4.log > > > > Jul 14 01:11:39 server01 kernel: [ 593.589087] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name > > Jul 14 01:11:39 server01 kernel: [ 593.589451] [12021] 1333 12021 172027 64723 4 0 0 apache2 > > Jul 14 01:11:39 server01 kernel: [ 593.589647] [12030] 1333 12030 172030 64748 2 0 0 apache2 > > Jul 14 01:11:39 server01 kernel: [ 593.589836] [12031] 1333 12031 172030 64749 3 0 0 apache2 > > Jul 14 01:11:39 server01 kernel: [ 593.590025] [12032] 1333 12032 170619 63428 3 0 0 apache2 > > Jul 14 01:11:39 server01 kernel: [ 593.590213] [12033] 1333 12033 167934 60524 2 0 0 apache2 > > Jul 14 01:11:39 server01 kernel: [ 593.590401] [12034] 1333 12034 170747 63496 4 0 0 apache2 > > Jul 14 01:11:39 server01 kernel: [ 593.590588] [12035] 1333 12035 169659 62451 1 0 0 apache2 > > Jul 14 01:11:39 server01 kernel: [ 593.590776] [12036] 1333 12036 167614 60384 3 0 0 apache2 > > Jul 14 01:11:39 server01 kernel: [ 593.590984] [12037] 1333 12037 166342 58964 3 0 0 apache2 > > Jul 14 01:11:39 server01 kernel: [ 593.591178] Memory cgroup out of memory: Kill process 12021 (apache2) score 847 or sacrifice child > > Jul 14 01:11:39 server01 kernel: [ 593.591370] Killed process 12021 (apache2) total-vm:688108kB, anon-rss:255472kB, file-rss:3420kB > > Jul 14 01:11:41 server01 kernel: [ 595.392920] ------------[ cut here ]------------ > > Jul 14 01:11:41 server01 kernel: [ 595.393096] WARNING: at kernel/exit.c:888 do_exit+0x7d0/0x870() > > Jul 14 01:11:41 server01 kernel: [ 595.393256] Hardware name: S5000VSA > > Jul 14 01:11:41 server01 kernel: [ 595.393415] Pid: 12037, comm: apache2 Not tainted 3.2.48-grsec #1 > > Jul 14 01:11:41 server01 kernel: [ 595.393577] Call Trace: > > Jul 14 01:11:41 server01 kernel: [ 595.393737] [] warn_slowpath_common+0x7a/0xb0 > > Jul 14 01:11:41 server01 kernel: [ 595.393903] [] warn_slowpath_null+0x1a/0x20 > > Jul 14 01:11:41 server01 kernel: [ 595.394068] [] 
do_exit+0x7d0/0x870 > > Jul 14 01:11:41 server01 kernel: [ 595.394231] [] ? thread_group_times+0x44/0xb0 > > Jul 14 01:11:41 server01 kernel: [ 595.394392] [] do_group_exit+0x51/0xc0 > > Jul 14 01:11:41 server01 kernel: [ 595.394551] [] sys_exit_group+0x17/0x20 > > Jul 14 01:11:41 server01 kernel: [ 595.394714] [] system_call_fastpath+0x18/0x1d > > Jul 14 01:11:41 server01 kernel: [ 595.394921] ---[ end trace 738570e688acf099 ]--- > > > > OK, so you had an OOM which has been handled by in-kernel oom handler > > (it killed 12021) and 12037 was in the same group. The warning tells us > > that it went through mem_cgroup_oom as well (otherwise it wouldn't have > > memcg_oom.wait_on_memcg set and the warning wouldn't trigger) and then > > it exited on the userspace request (by exit syscall). > > > > I do not see any way how, this could happen though. If mem_cgroup_oom > > is called then we always return CHARGE_NOMEM which turns into ENOMEM > > returned by __mem_cgroup_try_charge (invoke_oom must have been set to > > true). So if nobody screwed the return value on the way up to page > > fault handler then there is no way to escape. > > > > I will check the code. > > OK, I guess I found it: > __do_fault > fault = filemap_fault > do_async_mmap_readahead > page_cache_async_readahead > ondemand_readahead > __do_page_cache_readahead > read_pages > readpages = ext3_readpages > mpage_readpages # Doesn't propagate ENOMEM > add_to_page_cache_lru > add_to_page_cache > add_to_page_cache_locked > mem_cgroup_cache_charge > > So the read ahead most probably. Again! Duhhh. I will try to think > about a fix for this. One obvious place is mpage_readpages but > __do_page_cache_readahead ignores read_pages return value as well and > page_cache_async_readahead, even worse, is just void and exported as > such. > > So this smells like a hard to fix bugger. 
One possible, and really ugly > way would be calling mem_cgroup_oom_synchronize even if handle_mm_fault > doesn't return VM_FAULT_ERROR, but that is a crude hack. Ouch, good spot. I don't think we need to handle an OOM from the readahead code. If readahead does not produce the desired page, we retry synchronously in page_cache_read() and handle the OOM properly. We should not signal an OOM for optional pages anyway. So either we pass a flag from the readahead code down to add_to_page_cache and mem_cgroup_cache_charge that tells the charge code to ignore OOM conditions and not to set up an OOM context. Or we DO call mem_cgroup_oom_synchronize() from read_cache_pages(), with an argument that makes it only clean up the context and not wait. It would not be completely outlandish to place it there, since it's right next to where an error from add_to_page_cache() is not further propagated back through the fault stack. I'm travelling right now, I'll send a patch when I get back (Thursday). Unless you beat me to it :) From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933519Ab3GPQJJ (ORCPT ); Tue, 16 Jul 2013 12:09:09 -0400 Received: from cantor2.suse.de ([195.135.220.15]:42166 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932807Ab3GPQJH (ORCPT ); Tue, 16 Jul 2013 12:09:07 -0400 Date: Tue, 16 Jul 2013 18:09:05 +0200 From: Michal Hocko To: Johannes Weiner Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , righi.andrea@gmail.com Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM Message-ID: <20130716160905.GA20018@dhcp22.suse.cz> References: <20130709131029.GH20281@dhcp22.suse.cz> <20130709151921.5160C199@pobox.sk> <20130709135450.GI20281@dhcp22.suse.cz> <20130710182506.F25DF461@pobox.sk> <20130711072507.GA21667@dhcp22.suse.cz> <20130714012641.C2DA4E05@pobox.sk>
<20130714015112.FFCB7AF7@pobox.sk> <20130715154119.GA32435@dhcp22.suse.cz> <20130715160006.GB32435@dhcp22.suse.cz> <20130716153544.GX17812@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130716153544.GX17812@cmpxchg.org> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue 16-07-13 11:35:44, Johannes Weiner wrote: > On Mon, Jul 15, 2013 at 06:00:06PM +0200, Michal Hocko wrote: > > On Mon 15-07-13 17:41:19, Michal Hocko wrote: > > > On Sun 14-07-13 01:51:12, azurIt wrote: > > > > > CC: "Johannes Weiner" , linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , righi.andrea@gmail.com > > > > >> CC: "Johannes Weiner" , linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , righi.andrea@gmail.com > > > > >>On Wed 10-07-13 18:25:06, azurIt wrote: > > > > >>> >> Now i realized that i forgot to remove UID from that cgroup before > > > > >>> >> trying to remove it, so cgroup cannot be removed anyway (we are using > > > > >>> >> third party cgroup called cgroup-uid from Andrea Righi, which is able > > > > >>> >> to associate all user's processes with target cgroup). Look here for > > > > >>> >> cgroup-uid patch: > > > > >>> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch > > > > >>> >> > > > > >>> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was > > > > >>> >> permanently '1'. > > > > >>> > > > > > >>> >This is really strange. Could you post the whole diff against stable > > > > >>> >tree you are using (except for grsecurity stuff and the above cgroup-uid > > > > >>> >patch)? 
> > > > >>> > > > > >>> > > > > >>> Here are all patches which i applied to kernel 3.2.48 in my last test: > > > > >>> http://watchdog.sk/lkml/patches3/ > > > > >> > > > > >>The two patches from Johannes seem correct. > > > > >> > > > > >>From a quick look even grsecurity patchset shouldn't interfere as it > > > > >>doesn't seem to put any code between handle_mm_fault and mm_fault_error > > > > >>and there also doesn't seem to be any new handle_mm_fault call sites. > > > > >> > > > > >>But I cannot tell there aren't other code paths which would lead to a > > > > >>memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling. > > > > > > > > > > > > > > >Michal, > > > > > > > > > >now i can definitely confirm that problem with unremovable cgroups > > > > >persists. What info do you need from me? I applied also your little > > > > >'WARN_ON' patch. > > > > > > > > Ok, i think you want this: > > > > http://watchdog.sk/lkml/kern4.log > > > > > > Jul 14 01:11:39 server01 kernel: [ 593.589087] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name > > > Jul 14 01:11:39 server01 kernel: [ 593.589451] [12021] 1333 12021 172027 64723 4 0 0 apache2 > > > Jul 14 01:11:39 server01 kernel: [ 593.589647] [12030] 1333 12030 172030 64748 2 0 0 apache2 > > > Jul 14 01:11:39 server01 kernel: [ 593.589836] [12031] 1333 12031 172030 64749 3 0 0 apache2 > > > Jul 14 01:11:39 server01 kernel: [ 593.590025] [12032] 1333 12032 170619 63428 3 0 0 apache2 > > > Jul 14 01:11:39 server01 kernel: [ 593.590213] [12033] 1333 12033 167934 60524 2 0 0 apache2 > > > Jul 14 01:11:39 server01 kernel: [ 593.590401] [12034] 1333 12034 170747 63496 4 0 0 apache2 > > > Jul 14 01:11:39 server01 kernel: [ 593.590588] [12035] 1333 12035 169659 62451 1 0 0 apache2 > > > Jul 14 01:11:39 server01 kernel: [ 593.590776] [12036] 1333 12036 167614 60384 3 0 0 apache2 > > > Jul 14 01:11:39 server01 kernel: [ 593.590984] [12037] 1333 12037 166342 58964 3 0 0 apache2 > > > Jul 14 01:11:39 server01 kernel: 
[ 593.591178] Memory cgroup out of memory: Kill process 12021 (apache2) score 847 or sacrifice child > > > Jul 14 01:11:39 server01 kernel: [ 593.591370] Killed process 12021 (apache2) total-vm:688108kB, anon-rss:255472kB, file-rss:3420kB > > > Jul 14 01:11:41 server01 kernel: [ 595.392920] ------------[ cut here ]------------ > > > Jul 14 01:11:41 server01 kernel: [ 595.393096] WARNING: at kernel/exit.c:888 do_exit+0x7d0/0x870() > > > Jul 14 01:11:41 server01 kernel: [ 595.393256] Hardware name: S5000VSA > > > Jul 14 01:11:41 server01 kernel: [ 595.393415] Pid: 12037, comm: apache2 Not tainted 3.2.48-grsec #1 > > > Jul 14 01:11:41 server01 kernel: [ 595.393577] Call Trace: > > > Jul 14 01:11:41 server01 kernel: [ 595.393737] [] warn_slowpath_common+0x7a/0xb0 > > > Jul 14 01:11:41 server01 kernel: [ 595.393903] [] warn_slowpath_null+0x1a/0x20 > > > Jul 14 01:11:41 server01 kernel: [ 595.394068] [] do_exit+0x7d0/0x870 > > > Jul 14 01:11:41 server01 kernel: [ 595.394231] [] ? thread_group_times+0x44/0xb0 > > > Jul 14 01:11:41 server01 kernel: [ 595.394392] [] do_group_exit+0x51/0xc0 > > > Jul 14 01:11:41 server01 kernel: [ 595.394551] [] sys_exit_group+0x17/0x20 > > > Jul 14 01:11:41 server01 kernel: [ 595.394714] [] system_call_fastpath+0x18/0x1d > > > Jul 14 01:11:41 server01 kernel: [ 595.394921] ---[ end trace 738570e688acf099 ]--- > > > > > > OK, so you had an OOM which has been handled by in-kernel oom handler > > > (it killed 12021) and 12037 was in the same group. The warning tells us > > > that it went through mem_cgroup_oom as well (otherwise it wouldn't have > > > memcg_oom.wait_on_memcg set and the warning wouldn't trigger) and then > > > it exited on the userspace request (by exit syscall). > > > > > > I do not see any way how, this could happen though. If mem_cgroup_oom > > > is called then we always return CHARGE_NOMEM which turns into ENOMEM > > > returned by __mem_cgroup_try_charge (invoke_oom must have been set to > > > true). 
So if nobody screwed the return value on the way up to page > > > fault handler then there is no way to escape. > > > > > > I will check the code. > > > > OK, I guess I found it: > > __do_fault > > fault = filemap_fault > > do_async_mmap_readahead > > page_cache_async_readahead > > ondemand_readahead > > __do_page_cache_readahead > > read_pages > > readpages = ext3_readpages > > mpage_readpages # Doesn't propagate ENOMEM > > add_to_page_cache_lru > > add_to_page_cache > > add_to_page_cache_locked > > mem_cgroup_cache_charge > > > > So the read ahead most probably. Again! Duhhh. I will try to think > > about a fix for this. One obvious place is mpage_readpages but > > __do_page_cache_readahead ignores read_pages return value as well and > > page_cache_async_readahead, even worse, is just void and exported as > > such. > > > > So this smells like a hard to fix bugger. One possible, and really ugly > > way would be calling mem_cgroup_oom_synchronize even if handle_mm_fault > > doesn't return VM_FAULT_ERROR, but that is a crude hack. > > Ouch, good spot. > > I don't think we need to handle an OOM from the readahead code. If > readahead does not produce the desired page, we retry synchronously > in page_cache_read() and handle the OOM properly. We should not > signal an OOM for optional pages anyway. > > So either we pass a flag from the readahead code down to > add_to_page_cache and mem_cgroup_cache_charge that tells the charge > code to ignore OOM conditions and not to set up an OOM context. That was my previous attempt and it was sooo painful. > Or we DO call mem_cgroup_oom_synchronize() from read_cache_pages(), > with an argument that makes it only clean up the context and not wait. Yes, I was playing with this idea as well. I just do not like how fragile this is. We need some way to catch all possible places which might leak it.
> It would not be completely outlandish to place it there, since it's > right next to where an error from add_to_page_cache() is not further > propagated back through the fault stack. > > I'm travelling right now, I'll send a patch when I get back > (Thursday). Unless you beat me to it :) I can cook something up but there is quite a big pile on my desk currently (as always :/). -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933239Ab3GPQsv (ORCPT ); Tue, 16 Jul 2013 12:48:51 -0400 Received: from zene.cmpxchg.org ([85.214.230.12]:50319 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932889Ab3GPQsu (ORCPT ); Tue, 16 Jul 2013 12:48:50 -0400 Date: Tue, 16 Jul 2013 12:48:30 -0400 From: Johannes Weiner To: Michal Hocko Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , righi.andrea@gmail.com Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM Message-ID: <20130716164830.GZ17812@cmpxchg.org> References: <20130709151921.5160C199@pobox.sk> <20130709135450.GI20281@dhcp22.suse.cz> <20130710182506.F25DF461@pobox.sk> <20130711072507.GA21667@dhcp22.suse.cz> <20130714012641.C2DA4E05@pobox.sk> <20130714015112.FFCB7AF7@pobox.sk> <20130715154119.GA32435@dhcp22.suse.cz> <20130715160006.GB32435@dhcp22.suse.cz> <20130716153544.GX17812@cmpxchg.org> <20130716160905.GA20018@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130716160905.GA20018@dhcp22.suse.cz> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Jul 16, 2013 at 06:09:05PM +0200, Michal Hocko wrote: > On Tue 16-07-13 11:35:44, Johannes Weiner wrote: > > On Mon, Jul 15, 2013 at 06:00:06PM +0200, Michal Hocko wrote: > > > On Mon 15-07-13 17:41:19, Michal Hocko wrote: > > > > On Sun 
14-07-13 01:51:12, azurIt wrote: > > > > > > CC: "Johannes Weiner" , linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , righi.andrea@gmail.com > > > > > >> CC: "Johannes Weiner" , linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , righi.andrea@gmail.com > > > > > >>On Wed 10-07-13 18:25:06, azurIt wrote: > > > > > >>> >> Now i realized that i forgot to remove UID from that cgroup before > > > > > >>> >> trying to remove it, so cgroup cannot be removed anyway (we are using > > > > > >>> >> third party cgroup called cgroup-uid from Andrea Righi, which is able > > > > > >>> >> to associate all user's processes with target cgroup). Look here for > > > > > >>> >> cgroup-uid patch: > > > > > >>> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch > > > > > >>> >> > > > > > >>> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was > > > > > >>> >> permanently '1'. > > > > > >>> > > > > > > >>> >This is really strange. Could you post the whole diff against stable > > > > > >>> >tree you are using (except for grsecurity stuff and the above cgroup-uid > > > > > >>> >patch)? > > > > > >>> > > > > > >>> > > > > > >>> Here are all patches which i applied to kernel 3.2.48 in my last test: > > > > > >>> http://watchdog.sk/lkml/patches3/ > > > > > >> > > > > > >>The two patches from Johannes seem correct. > > > > > >> > > > > > >>From a quick look even grsecurity patchset shouldn't interfere as it > > > > > >>doesn't seem to put any code between handle_mm_fault and mm_fault_error > > > > > >>and there also doesn't seem to be any new handle_mm_fault call sites. > > > > > >> > > > > > >>But I cannot tell there aren't other code paths which would lead to a > > > > > >>memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling. 
> > > > > > > > > > > > > > > > > >Michal, > > > > > > > > > > > >now i can definitely confirm that problem with unremovable cgroups > > > > > >persists. What info do you need from me? I applied also your little > > > > > >'WARN_ON' patch. > > > > > > > > > > Ok, i think you want this: > > > > > http://watchdog.sk/lkml/kern4.log > > > > > > > > Jul 14 01:11:39 server01 kernel: [ 593.589087] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name > > > > Jul 14 01:11:39 server01 kernel: [ 593.589451] [12021] 1333 12021 172027 64723 4 0 0 apache2 > > > > Jul 14 01:11:39 server01 kernel: [ 593.589647] [12030] 1333 12030 172030 64748 2 0 0 apache2 > > > > Jul 14 01:11:39 server01 kernel: [ 593.589836] [12031] 1333 12031 172030 64749 3 0 0 apache2 > > > > Jul 14 01:11:39 server01 kernel: [ 593.590025] [12032] 1333 12032 170619 63428 3 0 0 apache2 > > > > Jul 14 01:11:39 server01 kernel: [ 593.590213] [12033] 1333 12033 167934 60524 2 0 0 apache2 > > > > Jul 14 01:11:39 server01 kernel: [ 593.590401] [12034] 1333 12034 170747 63496 4 0 0 apache2 > > > > Jul 14 01:11:39 server01 kernel: [ 593.590588] [12035] 1333 12035 169659 62451 1 0 0 apache2 > > > > Jul 14 01:11:39 server01 kernel: [ 593.590776] [12036] 1333 12036 167614 60384 3 0 0 apache2 > > > > Jul 14 01:11:39 server01 kernel: [ 593.590984] [12037] 1333 12037 166342 58964 3 0 0 apache2 > > > > Jul 14 01:11:39 server01 kernel: [ 593.591178] Memory cgroup out of memory: Kill process 12021 (apache2) score 847 or sacrifice child > > > > Jul 14 01:11:39 server01 kernel: [ 593.591370] Killed process 12021 (apache2) total-vm:688108kB, anon-rss:255472kB, file-rss:3420kB > > > > Jul 14 01:11:41 server01 kernel: [ 595.392920] ------------[ cut here ]------------ > > > > Jul 14 01:11:41 server01 kernel: [ 595.393096] WARNING: at kernel/exit.c:888 do_exit+0x7d0/0x870() > > > > Jul 14 01:11:41 server01 kernel: [ 595.393256] Hardware name: S5000VSA > > > > Jul 14 01:11:41 server01 kernel: [ 595.393415] Pid: 12037, comm: 
apache2 Not tainted 3.2.48-grsec #1 > > > > Jul 14 01:11:41 server01 kernel: [ 595.393577] Call Trace: > > > > Jul 14 01:11:41 server01 kernel: [ 595.393737] [] warn_slowpath_common+0x7a/0xb0 > > > > Jul 14 01:11:41 server01 kernel: [ 595.393903] [] warn_slowpath_null+0x1a/0x20 > > > > Jul 14 01:11:41 server01 kernel: [ 595.394068] [] do_exit+0x7d0/0x870 > > > > Jul 14 01:11:41 server01 kernel: [ 595.394231] [] ? thread_group_times+0x44/0xb0 > > > > Jul 14 01:11:41 server01 kernel: [ 595.394392] [] do_group_exit+0x51/0xc0 > > > > Jul 14 01:11:41 server01 kernel: [ 595.394551] [] sys_exit_group+0x17/0x20 > > > > Jul 14 01:11:41 server01 kernel: [ 595.394714] [] system_call_fastpath+0x18/0x1d > > > > Jul 14 01:11:41 server01 kernel: [ 595.394921] ---[ end trace 738570e688acf099 ]--- > > > > > > > > OK, so you had an OOM which has been handled by in-kernel oom handler > > > > (it killed 12021) and 12037 was in the same group. The warning tells us > > > > that it went through mem_cgroup_oom as well (otherwise it wouldn't have > > > > memcg_oom.wait_on_memcg set and the warning wouldn't trigger) and then > > > > it exited on the userspace request (by exit syscall). > > > > > > > > I do not see any way how, this could happen though. If mem_cgroup_oom > > > > is called then we always return CHARGE_NOMEM which turns into ENOMEM > > > > returned by __mem_cgroup_try_charge (invoke_oom must have been set to > > > > true). So if nobody screwed the return value on the way up to page > > > > fault handler then there is no way to escape. > > > > > > > > I will check the code. 
> > > > > > OK, I guess I found it: > > > __do_fault > > > fault = filemap_fault > > > do_async_mmap_readahead > > > page_cache_async_readahead > > > ondemand_readahead > > > __do_page_cache_readahead > > > read_pages > > > readpages = ext3_readpages > > > mpage_readpages # Doesn't propagate ENOMEM > > > add_to_page_cache_lru > > > add_to_page_cache > > > add_to_page_cache_locked > > > mem_cgroup_cache_charge > > > > > > So the read ahead most probably. Again! Duhhh. I will try to think > > > about a fix for this. One obvious place is mpage_readpages but > > > __do_page_cache_readahead ignores read_pages return value as well and > > > page_cache_async_readahead, even worse, is just void and exported as > > > such. > > > > > > So this smells like a hard to fix bugger. One possible, and really ugly > > > way would be calling mem_cgroup_oom_synchronize even if handle_mm_fault > > > doesn't return VM_FAULT_ERROR, but that is a crude hack. > > > > Ouch, good spot. > > > > I don't think we need to handle an OOM from the readahead code. If > > readahead does not produce the desired page, we retry synchronously > > in page_cache_read() and handle the OOM properly. We should not > > signal an OOM for optional pages anyway. > > > > So either we pass a flag from the readahead code down to > > add_to_page_cache and mem_cgroup_cache_charge that tells the charge > > code to ignore OOM conditions and not to set up an OOM context. > > That was my previous attempt and it was sooo painful. > > > Or we DO call mem_cgroup_oom_synchronize() from read_cache_pages(), > > with an argument that makes it only clean up the context and not wait. > > Yes, I was playing with this idea as well. I just do not like how > fragile this is. We need some way to catch all possible places which > might leak it. I don't think this is necessary, but we could add a sanity check in/near mem_cgroup_clear_userfault() that makes sure the OOM context is only set up when an error is returned.
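As a rough illustration of what such a sanity check could look like, here is a small user-space sketch. It is only a model under stated assumptions: task_model and clear_userfault_model are hypothetical names, the real mem_cgroup_clear_userfault() is not shown anywhere in this thread, and the kernel would presumably use WARN_ON rather than printing to stderr.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the per-task OOM state (not the kernel struct). */
struct task_model {
    bool oom_context_armed;
};

/*
 * Models a cleanup hook run when the fault handler finishes.
 * Invariant checked: an armed OOM context implies the fault returned -ENOMEM.
 * Returns true if the invariant was violated (a leaked context was found).
 * The context is cleared in either case so it cannot outlive the fault.
 */
bool clear_userfault_model(struct task_model *t, int fault_ret)
{
    bool leaked;

    assert(t != NULL);
    leaked = t->oom_context_armed && fault_ret != -ENOMEM;
    if (leaked)
        fprintf(stderr, "warning: OOM context armed but fault returned %d\n",
                fault_ret);
    t->oom_context_armed = false;
    return leaked;
}
```

With this shape, a path that swallows the charge error (such as the readahead chain quoted above) is reported at the point where the fault returns, rather than much later at exit time.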
> > It would not be completely outlandish to place it there, since it's > > right next to where an error from add_to_page_cache() is not further > > propagated back through the fault stack. > > > > I'm travelling right now, I'll send a patch when I get back > > (Thursday). Unless you beat me to it :) > > I can cook something up but there is quite a big pile on my desk > currently (as always :/). No worries, I'll send an update. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752884Ab3GSEVh (ORCPT ); Fri, 19 Jul 2013 00:21:37 -0400 Received: from zene.cmpxchg.org ([85.214.230.12]:50453 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751429Ab3GSEVe (ORCPT ); Fri, 19 Jul 2013 00:21:34 -0400 Date: Fri, 19 Jul 2013 00:21:24 -0400 From: Johannes Weiner To: Michal Hocko Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , righi.andrea@gmail.com Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM Message-ID: <20130719042124.GC17812@cmpxchg.org> References: <20130709135450.GI20281@dhcp22.suse.cz> <20130710182506.F25DF461@pobox.sk> <20130711072507.GA21667@dhcp22.suse.cz> <20130714012641.C2DA4E05@pobox.sk> <20130714015112.FFCB7AF7@pobox.sk> <20130715154119.GA32435@dhcp22.suse.cz> <20130715160006.GB32435@dhcp22.suse.cz> <20130716153544.GX17812@cmpxchg.org> <20130716160905.GA20018@dhcp22.suse.cz> <20130716164830.GZ17812@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130716164830.GZ17812@cmpxchg.org> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Jul 16, 2013 at 12:48:30PM -0400, Johannes Weiner wrote: > On Tue, Jul 16, 2013 at 06:09:05PM +0200, Michal Hocko wrote: > > On Tue 16-07-13 11:35:44, Johannes Weiner wrote: > > > On Mon, Jul 15, 2013 at 
06:00:06PM +0200, Michal Hocko wrote: > > > > On Mon 15-07-13 17:41:19, Michal Hocko wrote: > > > > > On Sun 14-07-13 01:51:12, azurIt wrote: > > > > > > > CC: "Johannes Weiner" , linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , righi.andrea@gmail.com > > > > > > >> CC: "Johannes Weiner" , linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , righi.andrea@gmail.com > > > > > > >>On Wed 10-07-13 18:25:06, azurIt wrote: > > > > > > >>> >> Now i realized that i forgot to remove UID from that cgroup before > > > > > > >>> >> trying to remove it, so cgroup cannot be removed anyway (we are using > > > > > > >>> >> third party cgroup called cgroup-uid from Andrea Righi, which is able > > > > > > >>> >> to associate all user's processes with target cgroup). Look here for > > > > > > >>> >> cgroup-uid patch: > > > > > > >>> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch > > > > > > >>> >> > > > > > > >>> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was > > > > > > >>> >> permanently '1'. > > > > > > >>> > > > > > > > >>> >This is really strange. Could you post the whole diff against stable > > > > > > >>> >tree you are using (except for grsecurity stuff and the above cgroup-uid > > > > > > >>> >patch)? > > > > > > >>> > > > > > > >>> > > > > > > >>> Here are all patches which i applied to kernel 3.2.48 in my last test: > > > > > > >>> http://watchdog.sk/lkml/patches3/ > > > > > > >> > > > > > > >>The two patches from Johannes seem correct. > > > > > > >> > > > > > > >>From a quick look even grsecurity patchset shouldn't interfere as it > > > > > > >>doesn't seem to put any code between handle_mm_fault and mm_fault_error > > > > > > >>and there also doesn't seem to be any new handle_mm_fault call sites. 
> > > > > > >> > > > > > > >>But I cannot tell there aren't other code paths which would lead to a > > > > > > >>memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling. > > > > > > > > > > > > > > > > > > > > >Michal, > > > > > > > > > > > > > >now i can definitely confirm that problem with unremovable cgroups > > > > > > >persists. What info do you need from me? I applied also your little > > > > > > >'WARN_ON' patch. > > > > > > > > > > > > Ok, i think you want this: > > > > > > http://watchdog.sk/lkml/kern4.log > > > > > > > > > > Jul 14 01:11:39 server01 kernel: [ 593.589087] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name > > > > > Jul 14 01:11:39 server01 kernel: [ 593.589451] [12021] 1333 12021 172027 64723 4 0 0 apache2 > > > > > Jul 14 01:11:39 server01 kernel: [ 593.589647] [12030] 1333 12030 172030 64748 2 0 0 apache2 > > > > > Jul 14 01:11:39 server01 kernel: [ 593.589836] [12031] 1333 12031 172030 64749 3 0 0 apache2 > > > > > Jul 14 01:11:39 server01 kernel: [ 593.590025] [12032] 1333 12032 170619 63428 3 0 0 apache2 > > > > > Jul 14 01:11:39 server01 kernel: [ 593.590213] [12033] 1333 12033 167934 60524 2 0 0 apache2 > > > > > Jul 14 01:11:39 server01 kernel: [ 593.590401] [12034] 1333 12034 170747 63496 4 0 0 apache2 > > > > > Jul 14 01:11:39 server01 kernel: [ 593.590588] [12035] 1333 12035 169659 62451 1 0 0 apache2 > > > > > Jul 14 01:11:39 server01 kernel: [ 593.590776] [12036] 1333 12036 167614 60384 3 0 0 apache2 > > > > > Jul 14 01:11:39 server01 kernel: [ 593.590984] [12037] 1333 12037 166342 58964 3 0 0 apache2 > > > > > Jul 14 01:11:39 server01 kernel: [ 593.591178] Memory cgroup out of memory: Kill process 12021 (apache2) score 847 or sacrifice child > > > > > Jul 14 01:11:39 server01 kernel: [ 593.591370] Killed process 12021 (apache2) total-vm:688108kB, anon-rss:255472kB, file-rss:3420kB > > > > > Jul 14 01:11:41 server01 kernel: [ 595.392920] ------------[ cut here ]------------ > > > > > Jul 14 01:11:41 
server01 kernel: [ 595.393096] WARNING: at kernel/exit.c:888 do_exit+0x7d0/0x870() > > > > > Jul 14 01:11:41 server01 kernel: [ 595.393256] Hardware name: S5000VSA > > > > > Jul 14 01:11:41 server01 kernel: [ 595.393415] Pid: 12037, comm: apache2 Not tainted 3.2.48-grsec #1 > > > > > Jul 14 01:11:41 server01 kernel: [ 595.393577] Call Trace: > > > > > Jul 14 01:11:41 server01 kernel: [ 595.393737] [] warn_slowpath_common+0x7a/0xb0 > > > > > Jul 14 01:11:41 server01 kernel: [ 595.393903] [] warn_slowpath_null+0x1a/0x20 > > > > > Jul 14 01:11:41 server01 kernel: [ 595.394068] [] do_exit+0x7d0/0x870 > > > > > Jul 14 01:11:41 server01 kernel: [ 595.394231] [] ? thread_group_times+0x44/0xb0 > > > > > Jul 14 01:11:41 server01 kernel: [ 595.394392] [] do_group_exit+0x51/0xc0 > > > > > Jul 14 01:11:41 server01 kernel: [ 595.394551] [] sys_exit_group+0x17/0x20 > > > > > Jul 14 01:11:41 server01 kernel: [ 595.394714] [] system_call_fastpath+0x18/0x1d > > > > > Jul 14 01:11:41 server01 kernel: [ 595.394921] ---[ end trace 738570e688acf099 ]--- > > > > > > > > > > OK, so you had an OOM which has been handled by in-kernel oom handler > > > > > (it killed 12021) and 12037 was in the same group. The warning tells us > > > > > that it went through mem_cgroup_oom as well (otherwise it wouldn't have > > > > > memcg_oom.wait_on_memcg set and the warning wouldn't trigger) and then > > > > > it exited on the userspace request (by exit syscall). > > > > > > > > > > I do not see any way how, this could happen though. If mem_cgroup_oom > > > > > is called then we always return CHARGE_NOMEM which turns into ENOMEM > > > > > returned by __mem_cgroup_try_charge (invoke_oom must have been set to > > > > > true). So if nobody screwed the return value on the way up to page > > > > > fault handler then there is no way to escape. > > > > > > > > > > I will check the code. 
> > > > > > > > OK, I guess I found it: > > > > __do_fault > > > > fault = filemap_fault > > > > do_async_mmap_readahead > > > > page_cache_async_readahead > > > > ondemand_readahead > > > > __do_page_cache_readahead > > > > read_pages > > > > readpages = ext3_readpages > > > > mpage_readpages # Doesn't propagate ENOMEM > > > > add_to_page_cache_lru > > > > add_to_page_cache > > > > add_to_page_cache_locked > > > > mem_cgroup_cache_charge > > > > > > > > So the read ahead most probably. Again! Duhhh. I will try to think > > > > about a fix for this. One obvious place is mpage_readpages but > > > > __do_page_cache_readahead ignores read_pages return value as well and > > > > page_cache_async_readahead, even worse, is just void and exported as > > > > such. > > > > > > > > So this smells like a hard to fix bugger. One possible, and really ugly > > > > way would be calling mem_cgroup_oom_synchronize even if handle_mm_fault > > > > doesn't return VM_FAULT_ERROR, but that is a crude hack. I fixed it by disabling the OOM killer altogether for readahead code. We don't do it globally, so we should not do it in the memcg either; these are optional allocations/charges. I also disabled it for kernel faults triggered from within a syscall (copy_*user, get_user_pages), which should just return -ENOMEM as usual (unless it's nested inside a userspace fault). The only downside is that we can't get around annotating userspace faults anymore, so every architecture fault handler now passes FAULT_FLAG_USER to handle_mm_fault(). This makes the series a little less self-contained, but it's not unreasonable. It's easy to detect leaks now by checking if the memcg OOM context is set up and we are not returning VM_FAULT_OOM. Here is a combined diff based on 3.2. azurIt, any chance you could give this a shot? I tested it on my local machines, but you have a known reproducer of fairly unlikely scenarios... Thanks! 
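The save/disable/restore idea described above (turn the memcg OOM killer off around optional readahead, then put the previous setting back) can be illustrated with a small userspace sketch. This is only a model under stated assumptions, not the kernel code: `struct task_oom_state`, `charge_over_limit()` and `optional_readahead()` are hypothetical names invented for illustration, and only `xchg_may_oom()` mirrors the shape of `mem_cgroup_xchg_may_oom()` from the diff.

```c
#include <assert.h>

/* Hypothetical stand-in for the per-task memcg_oom state added by the patch. */
struct task_oom_state {
	unsigned int may_oom;     /* 1: a failed charge may invoke the OOM killer */
	unsigned int oom_invoked; /* illustration only: counts OOM invocations */
};

/* Same shape as mem_cgroup_xchg_may_oom(): store the new value, return the old. */
unsigned int xchg_may_oom(struct task_oom_state *t, unsigned int new_value)
{
	unsigned int old;

	assert(new_value <= 1); /* the flag is a single bit in the patch */
	old = t->may_oom;
	t->may_oom = new_value;
	return old;
}

/*
 * Hypothetical charge that always hits the limit. It "invokes the OOM
 * killer" only when the task allows it, and fails either way
 * (-1 stands in for -ENOMEM).
 */
int charge_over_limit(struct task_oom_state *t)
{
	if (t->may_oom)
		t->oom_invoked++;
	return -1;
}

/*
 * Optional work (readahead in the patch): the error is dropped by the
 * caller, so the OOM killer is disabled for the duration and the
 * previous setting is restored afterwards.
 */
void optional_readahead(struct task_oom_state *t)
{
	unsigned int saved = xchg_may_oom(t, 0);

	(void)charge_over_limit(t); /* failure is ignored, as in readahead */
	xchg_may_oom(t, saved);
}
```

Returning the old value rather than unconditionally writing 1 on the way out keeps nested callers correct: if the flag was already 0 on entry, it stays 0 afterwards.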
Johannes diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c index fadd5f8..fa6b4e4 100644 --- a/arch/alpha/mm/fault.c +++ b/arch/alpha/mm/fault.c @@ -89,6 +89,7 @@ do_page_fault(unsigned long address, unsigned long mmcsr, struct mm_struct *mm = current->mm; const struct exception_table_entry *fixup; int fault, si_code = SEGV_MAPERR; + unsigned long flags = 0; siginfo_t info; /* As of EV6, a load into $31/$f31 is a prefetch, and never faults @@ -142,10 +143,15 @@ do_page_fault(unsigned long address, unsigned long mmcsr, goto bad_area; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (cause > 0) + flags |= FAULT_FLAG_WRITE; + /* If for any reason at all we couldn't handle the fault, make sure we exit gracefully rather than endlessly redo the fault. */ - fault = handle_mm_fault(mm, vma, address, cause > 0 ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); up_read(&mm->mmap_sem); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c index aa33949..31b1e69 100644 --- a/arch/arm/mm/fault.c +++ b/arch/arm/mm/fault.c @@ -231,9 +231,10 @@ static inline bool access_error(unsigned int fsr, struct vm_area_struct *vma) static int __kprobes __do_page_fault(struct mm_struct *mm, unsigned long addr, unsigned int fsr, - struct task_struct *tsk) + struct task_struct *tsk, struct pt_regs *regs) { struct vm_area_struct *vma; + unsigned long flags = 0; int fault; vma = find_vma(mm, addr); @@ -253,11 +254,16 @@ good_area: goto out; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (fsr & FSR_WRITE) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, make * sure we exit gracefully rather than endlessly redo the fault. */ - fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, (fsr & FSR_WRITE) ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, flags); if (unlikely(fault & VM_FAULT_ERROR)) return fault; if (fault & VM_FAULT_MAJOR) @@ -320,7 +326,7 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs) #endif } - fault = __do_page_fault(mm, addr, fsr, tsk); + fault = __do_page_fault(mm, addr, fsr, tsk, regs); up_read(&mm->mmap_sem); perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr); diff --git a/arch/avr32/mm/fault.c b/arch/avr32/mm/fault.c index f7040a1..ada6237 100644 --- a/arch/avr32/mm/fault.c +++ b/arch/avr32/mm/fault.c @@ -59,6 +59,7 @@ asmlinkage void do_page_fault(unsigned long ecr, struct pt_regs *regs) struct mm_struct *mm; struct vm_area_struct *vma; const struct exception_table_entry *fixup; + unsigned long flags = 0; unsigned long address; unsigned long page; int writeaccess; @@ -127,12 +128,17 @@ good_area: panic("Unhandled case %lu in do_page_fault!", ecr); } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (writeaccess) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, make * sure we exit gracefully rather than endlessly redo the * fault. */ - fault = handle_mm_fault(mm, vma, address, writeaccess ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/cris/mm/fault.c b/arch/cris/mm/fault.c index 9dcac8e..35d096a 100644 --- a/arch/cris/mm/fault.c +++ b/arch/cris/mm/fault.c @@ -55,6 +55,7 @@ do_page_fault(unsigned long address, struct pt_regs *regs, struct task_struct *tsk; struct mm_struct *mm; struct vm_area_struct * vma; + unsigned long flags = 0; siginfo_t info; int fault; @@ -156,13 +157,18 @@ do_page_fault(unsigned long address, struct pt_regs *regs, goto bad_area; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (writeaccess & 1) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, (writeaccess & 1) ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/frv/mm/fault.c b/arch/frv/mm/fault.c index a325d57..2dbf219 100644 --- a/arch/frv/mm/fault.c +++ b/arch/frv/mm/fault.c @@ -35,6 +35,7 @@ asmlinkage void do_page_fault(int datammu, unsigned long esr0, unsigned long ear struct vm_area_struct *vma; struct mm_struct *mm; unsigned long _pme, lrai, lrad, fixup; + unsigned long flags = 0; siginfo_t info; pgd_t *pge; pud_t *pue; @@ -158,12 +159,17 @@ asmlinkage void do_page_fault(int datammu, unsigned long esr0, unsigned long ear break; } + if (user_mode(__frame)) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, ear0, write ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, ear0, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/hexagon/mm/vm_fault.c b/arch/hexagon/mm/vm_fault.c index c10b76f..e56baf3 100644 --- a/arch/hexagon/mm/vm_fault.c +++ b/arch/hexagon/mm/vm_fault.c @@ -52,6 +52,7 @@ void do_page_fault(unsigned long address, long cause, struct pt_regs *regs) siginfo_t info; int si_code = SEGV_MAPERR; int fault; + unsigned long flags = 0; const struct exception_table_entry *fixup; /* @@ -96,7 +97,12 @@ good_area: break; } - fault = handle_mm_fault(mm, vma, address, (cause > 0)); + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (cause > 0) + flags |= FAULT_FLAG_WRITE; + + fault = handle_mm_fault(mm, vma, address, flags); /* The most common case -- we are done. */ if (likely(!(fault & VM_FAULT_ERROR))) { diff --git a/arch/ia64/mm/fault.c b/arch/ia64/mm/fault.c index 20b3593..ad9ef9d 100644 --- a/arch/ia64/mm/fault.c +++ b/arch/ia64/mm/fault.c @@ -79,6 +79,7 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re int signal = SIGSEGV, code = SEGV_MAPERR; struct vm_area_struct *vma, *prev_vma; struct mm_struct *mm = current->mm; + unsigned long flags = 0; struct siginfo si; unsigned long mask; int fault; @@ -149,12 +150,17 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re if ((vma->vm_flags & mask) != mask) goto bad_area; + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (mask & VM_WRITE) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, make * sure we exit gracefully rather than endlessly redo the * fault. */ - fault = handle_mm_fault(mm, vma, address, (mask & VM_WRITE) ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { /* * We ran out of memory, or some other thing happened diff --git a/arch/m32r/mm/fault.c b/arch/m32r/mm/fault.c index 2c9aeb4..e74f6fa 100644 --- a/arch/m32r/mm/fault.c +++ b/arch/m32r/mm/fault.c @@ -79,6 +79,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long error_code, struct mm_struct *mm; struct vm_area_struct * vma; unsigned long page, addr; + unsigned long flags = 0; int write; int fault; siginfo_t info; @@ -188,6 +189,11 @@ good_area: if ((error_code & ACE_INSTRUCTION) && !(vma->vm_flags & VM_EXEC)) goto bad_area; + if (error_code & ACE_USERMODE) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo @@ -195,7 +201,7 @@ good_area: */ addr = (address & PAGE_MASK); set_thread_fault_code(error_code); - fault = handle_mm_fault(mm, vma, addr, write ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, addr, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c index 2db6099..ab88a91 100644 --- a/arch/m68k/mm/fault.c +++ b/arch/m68k/mm/fault.c @@ -73,6 +73,7 @@ int do_page_fault(struct pt_regs *regs, unsigned long address, { struct mm_struct *mm = current->mm; struct vm_area_struct * vma; + unsigned long flags = 0; int write, fault; #ifdef DEBUG @@ -134,13 +135,18 @@ good_area: goto acc_err; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, write ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); #ifdef DEBUG printk("handle_mm_fault returns %d\n",fault); #endif diff --git a/arch/microblaze/mm/fault.c b/arch/microblaze/mm/fault.c index ae97d2c..b002612 100644 --- a/arch/microblaze/mm/fault.c +++ b/arch/microblaze/mm/fault.c @@ -89,6 +89,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long address, { struct vm_area_struct *vma; struct mm_struct *mm = current->mm; + unsigned long flags = 0; siginfo_t info; int code = SEGV_MAPERR; int is_write = error_code & ESR_S; @@ -206,12 +207,17 @@ good_area: goto bad_area; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (is_write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, is_write ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c index 937cf33..e5b9fed 100644 --- a/arch/mips/mm/fault.c +++ b/arch/mips/mm/fault.c @@ -40,6 +40,7 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs, unsigned long writ struct task_struct *tsk = current; struct mm_struct *mm = tsk->mm; const int field = sizeof(unsigned long) * 2; + unsigned long flags = 0; siginfo_t info; int fault; @@ -139,12 +140,17 @@ good_area: } } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, write ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) diff --git a/arch/mn10300/mm/fault.c b/arch/mn10300/mm/fault.c index 0945409..031be56 100644 --- a/arch/mn10300/mm/fault.c +++ b/arch/mn10300/mm/fault.c @@ -121,6 +121,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long fault_code, { struct vm_area_struct *vma; struct task_struct *tsk; + unsigned long flags = 0; struct mm_struct *mm; unsigned long page; siginfo_t info; @@ -247,12 +248,17 @@ good_area: break; } + if ((fault_code & MMUFCR_xFC_ACCESS) == MMUFCR_xFC_ACCESS_USR) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, write ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; @@ -329,9 +335,10 @@ no_context: */ out_of_memory: up_read(&mm->mmap_sem); - printk(KERN_ALERT "VM: killing process %s\n", tsk->comm); - if ((fault_code & MMUFCR_xFC_ACCESS) == MMUFCR_xFC_ACCESS_USR) - do_exit(SIGKILL); + if ((fault_code & MMUFCR_xFC_ACCESS) == MMUFCR_xFC_ACCESS_USR) { + pagefault_out_of_memory(); + return; + } goto no_context; do_sigbus: diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c index a5dce82..d586119 100644 --- a/arch/openrisc/mm/fault.c +++ b/arch/openrisc/mm/fault.c @@ -52,6 +52,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address, struct task_struct *tsk; struct mm_struct *mm; struct vm_area_struct *vma; + unsigned long flags = 0; siginfo_t info; int fault; @@ -153,13 +154,18 @@ good_area: if ((vector == 0x400) && !(vma->vm_page_prot.pgprot & _PAGE_EXEC)) goto bad_area; + if (user_mode(regs)) + 
flags |= FAULT_FLAG_USER; + if (write_acc) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, write_acc); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; @@ -246,10 +252,10 @@ out_of_memory: __asm__ __volatile__("l.nop 1"); up_read(&mm->mmap_sem); - printk("VM: killing process %s\n", tsk->comm); - if (user_mode(regs)) - do_exit(SIGKILL); - goto no_context; + if (!user_mode(regs)) + goto no_context; + pagefault_out_of_memory(); + return; do_sigbus: up_read(&mm->mmap_sem); diff --git a/arch/parisc/mm/fault.c b/arch/parisc/mm/fault.c index 18162ce..a151e87 100644 --- a/arch/parisc/mm/fault.c +++ b/arch/parisc/mm/fault.c @@ -173,6 +173,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long code, struct vm_area_struct *vma, *prev_vma; struct task_struct *tsk = current; struct mm_struct *mm = tsk->mm; + unsigned long flags = 0; unsigned long acc_type; int fault; @@ -195,13 +196,18 @@ good_area: if ((vma->vm_flags & acc_type) != acc_type) goto bad_area; + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (acc_type & VM_WRITE) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, make * sure we exit gracefully rather than endlessly redo the * fault. */ - fault = handle_mm_fault(mm, vma, address, (acc_type & VM_WRITE) ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { /* * We hit a shared mapping outside of the file, or some diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c index 5efe8c9..2bf339c 100644 --- a/arch/powerpc/mm/fault.c +++ b/arch/powerpc/mm/fault.c @@ -122,6 +122,7 @@ int __kprobes do_page_fault(struct pt_regs *regs, unsigned long address, { struct vm_area_struct * vma; struct mm_struct *mm = current->mm; + unsigned long flags = 0; siginfo_t info; int code = SEGV_MAPERR; int is_write = 0, ret; @@ -305,12 +306,17 @@ good_area: goto bad_area; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (is_write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - ret = handle_mm_fault(mm, vma, address, is_write ? FAULT_FLAG_WRITE : 0); + ret = handle_mm_fault(mm, vma, address, flags); if (unlikely(ret & VM_FAULT_ERROR)) { if (ret & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c index a9a3018..fe6109c 100644 --- a/arch/s390/mm/fault.c +++ b/arch/s390/mm/fault.c @@ -301,6 +301,8 @@ static inline int do_exception(struct pt_regs *regs, int access, address = trans_exc_code & __FAIL_ADDR_MASK; perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); flags = FAULT_FLAG_ALLOW_RETRY; + if (regs->psw.mask & PSW_MASK_PSTATE) + flags |= FAULT_FLAG_USER; if (access == VM_WRITE || (trans_exc_code & store_indication) == 0x400) flags |= FAULT_FLAG_WRITE; down_read(&mm->mmap_sem); diff --git a/arch/score/mm/fault.c b/arch/score/mm/fault.c index 47b600e..2ca5ae5 100644 --- a/arch/score/mm/fault.c +++ b/arch/score/mm/fault.c @@ -47,6 +47,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long write, struct task_struct *tsk = current; struct mm_struct *mm = tsk->mm; const int field = sizeof(unsigned long) * 2; + unsigned long flags = 0; 
siginfo_t info; int fault; @@ -101,12 +102,16 @@ good_area: } survive: + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, write); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; @@ -172,10 +177,10 @@ out_of_memory: down_read(&mm->mmap_sem); goto survive; } - printk("VM: killing process %s\n", tsk->comm); - if (user_mode(regs)) - do_group_exit(SIGKILL); - goto no_context; + if (!user_mode(regs)) + goto no_context; + pagefault_out_of_memory(); + return; do_sigbus: up_read(&mm->mmap_sem); diff --git a/arch/sh/mm/fault_32.c b/arch/sh/mm/fault_32.c index 7bebd04..a61b803 100644 --- a/arch/sh/mm/fault_32.c +++ b/arch/sh/mm/fault_32.c @@ -126,6 +126,7 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs, struct task_struct *tsk; struct mm_struct *mm; struct vm_area_struct * vma; + unsigned long flags = 0; int si_code; int fault; siginfo_t info; @@ -195,12 +196,17 @@ good_area: goto bad_area; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (writeaccess) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, writeaccess ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/sh/mm/tlbflush_64.c b/arch/sh/mm/tlbflush_64.c index e3430e0..0a9d645 100644 --- a/arch/sh/mm/tlbflush_64.c +++ b/arch/sh/mm/tlbflush_64.c @@ -96,6 +96,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long writeaccess, struct mm_struct *mm; struct vm_area_struct * vma; const struct exception_table_entry *fixup; + unsigned long flags = 0; pte_t *pte; int fault; @@ -184,12 +185,17 @@ good_area: } } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (writeaccess) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, writeaccess ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/sparc/mm/fault_32.c b/arch/sparc/mm/fault_32.c index 8023fd7..efa3d48 100644 --- a/arch/sparc/mm/fault_32.c +++ b/arch/sparc/mm/fault_32.c @@ -222,6 +222,7 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write, struct vm_area_struct *vma; struct task_struct *tsk = current; struct mm_struct *mm = tsk->mm; + unsigned long flags = 0; unsigned int fixup; unsigned long g2; int from_user = !(regs->psr & PSR_PS); @@ -285,12 +286,17 @@ good_area: goto bad_area; } + if (from_user) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, write ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c index 504c062..bc536ea 100644 --- a/arch/sparc/mm/fault_64.c +++ b/arch/sparc/mm/fault_64.c @@ -276,6 +276,7 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs) { struct mm_struct *mm = current->mm; struct vm_area_struct *vma; + unsigned long flags = 0; unsigned int insn = 0; int si_code, fault_code, fault; unsigned long address, mm_rss; @@ -423,7 +424,12 @@ good_area: goto bad_area; } - fault = handle_mm_fault(mm, vma, address, (fault_code & FAULT_CODE_WRITE) ? FAULT_FLAG_WRITE : 0); + if (!(regs->tstate & TSTATE_PRIV)) + flags |= FAULT_FLAG_USER; + if (fault_code & FAULT_CODE_WRITE) + flags |= FAULT_FLAG_WRITE; + + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/tile/mm/fault.c b/arch/tile/mm/fault.c index 25b7b90..b2a7fd5 100644 --- a/arch/tile/mm/fault.c +++ b/arch/tile/mm/fault.c @@ -263,6 +263,7 @@ static int handle_page_fault(struct pt_regs *regs, struct mm_struct *mm; struct vm_area_struct *vma; unsigned long stack_offset; + unsigned long flags = 0; int fault; int si_code; int is_kernel_mode; @@ -415,12 +416,16 @@ good_area: } survive: + if (!is_kernel_mode) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. 
*/ - fault = handle_mm_fault(mm, vma, address, write); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; @@ -540,10 +545,10 @@ out_of_memory: down_read(&mm->mmap_sem); goto survive; } - pr_alert("VM: killing process %s\n", tsk->comm); - if (!is_kernel_mode) - do_group_exit(SIGKILL); - goto no_context; + if (is_kernel_mode) + goto no_context; + pagefault_out_of_memory(); + return 0; do_sigbus: up_read(&mm->mmap_sem); diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c index dafc947..626a85e 100644 --- a/arch/um/kernel/trap.c +++ b/arch/um/kernel/trap.c @@ -25,6 +25,7 @@ int handle_page_fault(unsigned long address, unsigned long ip, { struct mm_struct *mm = current->mm; struct vm_area_struct *vma; + unsigned long flags = 0; pgd_t *pgd; pud_t *pud; pmd_t *pmd; @@ -62,10 +63,15 @@ good_area: if (!is_write && !(vma->vm_flags & (VM_READ | VM_EXEC))) goto out; + if (is_user) + flags |= FAULT_FLAG_USER; + if (is_write) + flags |= FAULT_FLAG_WRITE; + do { int fault; - fault = handle_mm_fault(mm, vma, address, is_write ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) { goto out_of_memory; diff --git a/arch/unicore32/mm/fault.c b/arch/unicore32/mm/fault.c index 283aa4b..3026943 100644 --- a/arch/unicore32/mm/fault.c +++ b/arch/unicore32/mm/fault.c @@ -169,9 +169,10 @@ static inline bool access_error(unsigned int fsr, struct vm_area_struct *vma) } static int __do_pf(struct mm_struct *mm, unsigned long addr, unsigned int fsr, - struct task_struct *tsk) + struct task_struct *tsk, struct pt_regs *regs) { struct vm_area_struct *vma; + unsigned long flags = 0; int fault; vma = find_vma(mm, addr); @@ -191,12 +192,16 @@ good_area: goto out; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (!(fsr ^ 0x12)) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, make * sure we exit gracefully rather than endlessly redo the fault. */ - fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, - (!(fsr ^ 0x12)) ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, flags); if (unlikely(fault & VM_FAULT_ERROR)) return fault; if (fault & VM_FAULT_MAJOR) @@ -252,7 +257,7 @@ static int do_pf(unsigned long addr, unsigned int fsr, struct pt_regs *regs) #endif } - fault = __do_pf(mm, addr, fsr, tsk); + fault = __do_pf(mm, addr, fsr, tsk, regs); up_read(&mm->mmap_sem); /* diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 5db0490..90248c9 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -846,17 +846,6 @@ static noinline int mm_fault_error(struct pt_regs *regs, unsigned long error_code, unsigned long address, unsigned int fault) { - /* - * Pagefault was interrupted by SIGKILL. We have no reason to - * continue pagefault. 
- */ - if (fatal_signal_pending(current)) { - if (!(fault & VM_FAULT_RETRY)) - up_read(&current->mm->mmap_sem); - if (!(error_code & PF_USER)) - no_context(regs, error_code, address); - return 1; - } if (!(fault & VM_FAULT_ERROR)) return 0; @@ -999,8 +988,7 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code) struct mm_struct *mm; int fault; int write = error_code & PF_WRITE; - unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE | - (write ? FAULT_FLAG_WRITE : 0); + unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE; tsk = current; mm = tsk->mm; @@ -1160,6 +1148,11 @@ good_area: return; } + if (error_code & PF_USER) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo diff --git a/arch/xtensa/mm/fault.c b/arch/xtensa/mm/fault.c index e367e30..7db9fbe 100644 --- a/arch/xtensa/mm/fault.c +++ b/arch/xtensa/mm/fault.c @@ -41,6 +41,7 @@ void do_page_fault(struct pt_regs *regs) struct mm_struct *mm = current->mm; unsigned int exccause = regs->exccause; unsigned int address = regs->excvaddr; + unsigned long flags = 0; siginfo_t info; int is_write, is_exec; @@ -101,11 +102,16 @@ good_area: if (!(vma->vm_flags & (VM_READ | VM_WRITE))) goto bad_area; + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (is_write) + flags |= FAULT_FLAG_WRITE; + /* If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, is_write ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index b87068a..b92e5e7 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -120,6 +120,17 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page); extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p); +static inline unsigned int mem_cgroup_xchg_may_oom(struct task_struct *p, + unsigned int new) +{ + unsigned int old; + + old = p->memcg_oom.may_oom; + p->memcg_oom.may_oom = new; + + return old; +} +bool mem_cgroup_oom_synchronize(void); #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP extern int do_swap_account; #endif @@ -333,6 +344,17 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p) { } +static inline unsigned int mem_cgroup_xchg_may_oom(struct task_struct *p, + unsigned int new) +{ + return 0; +} + +static inline bool mem_cgroup_oom_synchronize(void) +{ + return false; +} + static inline void mem_cgroup_inc_page_stat(struct page *page, enum mem_cgroup_page_stat_item idx) { diff --git a/include/linux/mm.h b/include/linux/mm.h index 4baadd1..846b82b 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -156,6 +156,7 @@ extern pgprot_t protection_map[16]; #define FAULT_FLAG_ALLOW_RETRY 0x08 /* Retry fault if blocking */ #define FAULT_FLAG_RETRY_NOWAIT 0x10 /* Don't drop mmap_sem and wait when retrying */ #define FAULT_FLAG_KILLABLE 0x20 /* The fault task is in SIGKILL killable region */ +#define FAULT_FLAG_USER 0x40 /* The fault originated in userspace */ /* * This interface is used by x86 PAT code to identify a pfn mapping that is diff --git a/include/linux/sched.h b/include/linux/sched.h index 1c4f3e9..a77d198 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -91,6 +91,7 @@ struct sched_param { #include #include #include +#include 
#include @@ -1568,6 +1569,14 @@ struct task_struct { unsigned long nr_pages; /* uncharged usage */ unsigned long memsw_nr_pages; /* uncharged mem+swap usage */ } memcg_batch; + struct memcg_oom_info { + unsigned int may_oom:1; + unsigned int in_memcg_oom:1; + struct stack_trace trace; + unsigned long trace_entries[16]; + int wakeups; + struct mem_cgroup *wait_on_memcg; + } memcg_oom; #endif #ifdef CONFIG_HAVE_HW_BREAKPOINT atomic_t ptrace_bp_refcnt; diff --git a/mm/filemap.c b/mm/filemap.c index 5f0a3c9..d18bd47 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1660,6 +1660,7 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf) struct file_ra_state *ra = &file->f_ra; struct inode *inode = mapping->host; pgoff_t offset = vmf->pgoff; + unsigned int may_oom; struct page *page; pgoff_t size; int ret = 0; @@ -1669,7 +1670,12 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf) return VM_FAULT_SIGBUS; /* - * Do we have something in the page cache already? + * Do we have something in the page cache already? Either + * way, try readahead, but disable the memcg OOM killer for it + * as readahead is optional and no errors are propagated up + * the fault stack, which does not allow proper unwinding of a + * memcg OOM state. The OOM killer is enabled while trying to + * instantiate the faulting page individually below. */ page = find_get_page(mapping, offset); if (likely(page)) { @@ -1677,10 +1683,14 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf) * We found the page, so try async readahead before * waiting for the lock. */ + may_oom = mem_cgroup_xchg_may_oom(current, 0); do_async_mmap_readahead(vma, ra, file, page, offset); + mem_cgroup_xchg_may_oom(current, may_oom); } else { - /* No page in the page cache at all */ + /* No page in the page cache at all. 
*/ + may_oom = mem_cgroup_xchg_may_oom(current, 0); do_sync_mmap_readahead(vma, ra, file, offset); + mem_cgroup_xchg_may_oom(current, may_oom); count_vm_event(PGMAJFAULT); mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT); ret = VM_FAULT_MAJOR; diff --git a/mm/ksm.c b/mm/ksm.c index 310544a..ae7e4ae 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -338,7 +338,7 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr) break; if (PageKsm(page)) ret = handle_mm_fault(vma->vm_mm, vma, addr, - FAULT_FLAG_WRITE); + FAULT_FLAG_WRITE); else ret = VM_FAULT_WRITE; put_page(page); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index b63f5f7..c47c77e 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -49,6 +49,7 @@ #include #include #include +#include #include "internal.h" #include @@ -249,6 +250,7 @@ struct mem_cgroup { bool oom_lock; atomic_t under_oom; + atomic_t oom_wakeups; atomic_t refcnt; @@ -1846,6 +1848,7 @@ static int memcg_oom_wake_function(wait_queue_t *wait, static void memcg_wakeup_oom(struct mem_cgroup *memcg) { + atomic_inc(&memcg->oom_wakeups); /* for filtering, pass "memcg" as argument. */ __wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg); } @@ -1857,30 +1860,26 @@ static void memcg_oom_recover(struct mem_cgroup *memcg) } /* - * try to call OOM killer. returns false if we should exit memory-reclaim loop. 
+ * try to call OOM killer */ -bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask) +static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask) { - struct oom_wait_info owait; - bool locked, need_to_kill; + bool locked, need_to_kill = true; - owait.mem = memcg; - owait.wait.flags = 0; - owait.wait.func = memcg_oom_wake_function; - owait.wait.private = current; - INIT_LIST_HEAD(&owait.wait.task_list); - need_to_kill = true; - mem_cgroup_mark_under_oom(memcg); + if (!current->memcg_oom.may_oom) + return; + + current->memcg_oom.in_memcg_oom = 1; + + current->memcg_oom.trace.nr_entries = 0; + current->memcg_oom.trace.max_entries = 16; + current->memcg_oom.trace.entries = current->memcg_oom.trace_entries; + current->memcg_oom.trace.skip = 1; + save_stack_trace(&current->memcg_oom.trace); /* At first, try to OOM lock hierarchy under memcg.*/ spin_lock(&memcg_oom_lock); locked = mem_cgroup_oom_lock(memcg); - /* - * Even if signal_pending(), we can't quit charge() loop without - * accounting. So, UNINTERRUPTIBLE is appropriate. But SIGKILL - * under OOM is always welcomed, use TASK_KILLABLE here. - */ - prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE); if (!locked || memcg->oom_kill_disable) need_to_kill = false; if (locked) @@ -1888,24 +1887,86 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask) spin_unlock(&memcg_oom_lock); if (need_to_kill) { - finish_wait(&memcg_oom_waitq, &owait.wait); mem_cgroup_out_of_memory(memcg, mask); } else { - schedule(); - finish_wait(&memcg_oom_waitq, &owait.wait); + /* + * A system call can just return -ENOMEM, but if this + * is a page fault and somebody else is handling the + * OOM already, we need to sleep on the OOM waitqueue + * for this memcg until the situation is resolved. + * Which can take some time because it might be + * handled by a userspace task.
+ * + * However, this is the charge context, which means + * that we may sit on a large call stack and hold + * various filesystem locks, the mmap_sem etc. and we + * don't want the OOM handler to deadlock on them + * while we sit here and wait. Store the current OOM + * context in the task_struct, then return -ENOMEM. + * At the end of the page fault handler, with the + * stack unwound, pagefault_out_of_memory() will check + * back with us by calling + * mem_cgroup_oom_synchronize(), possibly putting the + * task to sleep. + */ + mem_cgroup_mark_under_oom(memcg); + current->memcg_oom.wakeups = atomic_read(&memcg->oom_wakeups); + css_get(&memcg->css); + current->memcg_oom.wait_on_memcg = memcg; } - spin_lock(&memcg_oom_lock); - if (locked) + + if (locked) { + spin_lock(&memcg_oom_lock); mem_cgroup_oom_unlock(memcg); - memcg_wakeup_oom(memcg); - spin_unlock(&memcg_oom_lock); + /* + * Sleeping tasks might have been killed, make sure + * they get scheduled so they can exit. + */ + if (need_to_kill) + memcg_oom_recover(memcg); + spin_unlock(&memcg_oom_lock); + } +} - mem_cgroup_unmark_under_oom(memcg); +bool mem_cgroup_oom_synchronize(void) +{ + struct oom_wait_info owait; + struct mem_cgroup *memcg; - if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current)) + /* OOM is global, do not handle */ + if (!current->memcg_oom.in_memcg_oom) return false; - /* Give chance to dying process */ - schedule_timeout_uninterruptible(1); + + /* + * We invoked the OOM killer but there is a chance that a kill + * did not free up any charges. Everybody else might already + * be sleeping, so restart the fault and keep the rampage + * going until some charges are released. 
+ */ + memcg = current->memcg_oom.wait_on_memcg; + if (!memcg) + goto out; + + if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current)) + goto out_put; + + owait.mem = memcg; + owait.wait.flags = 0; + owait.wait.func = memcg_oom_wake_function; + owait.wait.private = current; + INIT_LIST_HEAD(&owait.wait.task_list); + + prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE); + /* Only sleep if we didn't miss any wakeups since OOM */ + if (atomic_read(&memcg->oom_wakeups) == current->memcg_oom.wakeups) + schedule(); + finish_wait(&memcg_oom_waitq, &owait.wait); +out_put: + mem_cgroup_unmark_under_oom(memcg); + css_put(&memcg->css); + current->memcg_oom.wait_on_memcg = NULL; +out: + current->memcg_oom.in_memcg_oom = 0; return true; } @@ -2195,11 +2256,10 @@ enum { CHARGE_RETRY, /* need to retry but retry is not bad */ CHARGE_NOMEM, /* we can't do more. return -ENOMEM */ CHARGE_WOULDBLOCK, /* GFP_WAIT wasn't set and no enough res. */ - CHARGE_OOM_DIE, /* the current is killed because of OOM */ }; static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, - unsigned int nr_pages, bool oom_check) + unsigned int nr_pages, bool invoke_oom) { unsigned long csize = nr_pages * PAGE_SIZE; struct mem_cgroup *mem_over_limit; @@ -2257,14 +2317,10 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, if (mem_cgroup_wait_acct_move(mem_over_limit)) return CHARGE_RETRY; - /* If we don't need to call oom-killer at el, return immediately */ - if (!oom_check) - return CHARGE_NOMEM; - /* check OOM */ - if (!mem_cgroup_handle_oom(mem_over_limit, gfp_mask)) - return CHARGE_OOM_DIE; + if (invoke_oom) + mem_cgroup_oom(mem_over_limit, gfp_mask); - return CHARGE_RETRY; + return CHARGE_NOMEM; } /* @@ -2349,7 +2405,7 @@ again: } do { - bool oom_check; + bool invoke_oom = oom && !nr_oom_retries; /* If killed, bypass charge */ if (fatal_signal_pending(current)) { @@ -2357,13 +2413,7 @@ again: goto bypass; } - oom_check = false; - if (oom && 
!nr_oom_retries) { - oom_check = true; - nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES; - } - - ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, oom_check); + ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, invoke_oom); switch (ret) { case CHARGE_OK: break; @@ -2376,16 +2426,12 @@ again: css_put(&memcg->css); goto nomem; case CHARGE_NOMEM: /* OOM routine works */ - if (!oom) { + if (!oom || invoke_oom) { css_put(&memcg->css); goto nomem; } - /* If oom, we never return -ENOMEM */ nr_oom_retries--; break; - case CHARGE_OOM_DIE: /* Killed by OOM Killer */ - css_put(&memcg->css); - goto bypass; } } while (ret != CHARGE_OK); diff --git a/mm/memory.c b/mm/memory.c index 829d437..fc6d741 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -57,6 +57,7 @@ #include #include #include +#include #include #include @@ -3439,22 +3440,14 @@ unlock: /* * By the time we get here, we already hold the mm semaphore */ -int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, - unsigned long address, unsigned int flags) +static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, unsigned int flags) { pgd_t *pgd; pud_t *pud; pmd_t *pmd; pte_t *pte; - __set_current_state(TASK_RUNNING); - - count_vm_event(PGFAULT); - mem_cgroup_count_vm_event(mm, PGFAULT); - - /* do counter updates before entering really critical section. 
*/ - check_sync_rss_stat(current); - if (unlikely(is_vm_hugetlb_page(vma))) return hugetlb_fault(mm, vma, address, flags); @@ -3503,6 +3496,39 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, return handle_pte_fault(mm, vma, address, pte, pmd, flags); } +int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, unsigned int flags) +{ + int userfault = flags & FAULT_FLAG_USER; + int ret; + + __set_current_state(TASK_RUNNING); + + count_vm_event(PGFAULT); + mem_cgroup_count_vm_event(mm, PGFAULT); + + /* do counter updates before entering really critical section. */ + check_sync_rss_stat(current); + + if (userfault) + WARN_ON(mem_cgroup_xchg_may_oom(current, 1) == 1); + + ret = __handle_mm_fault(mm, vma, address, flags); + + if (userfault) + WARN_ON(mem_cgroup_xchg_may_oom(current, 0) == 0); + +#ifdef CONFIG_CGROUP_MEM_RES_CTLR + if (WARN(!(ret & VM_FAULT_OOM) && current->memcg_oom.in_memcg_oom, + "Fixing unhandled memcg OOM context, set up from:\n")) { + print_stack_trace(&current->memcg_oom.trace, 0); + mem_cgroup_oom_synchronize(); + } +#endif + + return ret; +} + #ifndef __PAGETABLE_PUD_FOLDED /* * Allocate page upper directory.
diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 069b64e..aa60863 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -785,6 +785,8 @@ out: */ void pagefault_out_of_memory(void) { + if (mem_cgroup_oom_synchronize()) + return; if (try_set_system_oom()) { out_of_memory(NULL, 0, 0, NULL); clear_system_oom(); From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753855Ab3GSEZI (ORCPT ); Fri, 19 Jul 2013 00:25:08 -0400 Received: from zene.cmpxchg.org ([85.214.230.12]:50471 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750938Ab3GSEZF (ORCPT ); Fri, 19 Jul 2013 00:25:05 -0400 Date: Fri, 19 Jul 2013 00:25:02 -0400 From: Johannes Weiner To: Michal Hocko Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , righi.andrea@gmail.com Subject: [patch 3/5] x86: finish fault error path with fatal signal Message-ID: <20130719042502.GF17812@cmpxchg.org> References: <20130710182506.F25DF461@pobox.sk> <20130711072507.GA21667@dhcp22.suse.cz> <20130714012641.C2DA4E05@pobox.sk> <20130714015112.FFCB7AF7@pobox.sk> <20130715154119.GA32435@dhcp22.suse.cz> <20130715160006.GB32435@dhcp22.suse.cz> <20130716153544.GX17812@cmpxchg.org> <20130716160905.GA20018@dhcp22.suse.cz> <20130716164830.GZ17812@cmpxchg.org> <20130719042124.GC17812@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130719042124.GC17812@cmpxchg.org> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org The x86 fault handler bails in the middle of error handling when the task has been killed. For the next patch this is a problem, because it relies on pagefault_out_of_memory() being called even when the task has been killed, to perform proper OOM state unwinding. This is a rather minor optimization, just remove it. 
Signed-off-by: Johannes Weiner --- arch/x86/mm/fault.c | 11 ----------- 1 file changed, 11 deletions(-) diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 1cebabe..90248c9 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -846,17 +846,6 @@ static noinline int mm_fault_error(struct pt_regs *regs, unsigned long error_code, unsigned long address, unsigned int fault) { - /* - * Pagefault was interrupted by SIGKILL. We have no reason to - * continue pagefault. - */ - if (fatal_signal_pending(current)) { - if (!(fault & VM_FAULT_RETRY)) - up_read(&current->mm->mmap_sem); - if (!(error_code & PF_USER)) - no_context(regs, error_code, address); - return 1; - } if (!(fault & VM_FAULT_ERROR)) return 0; -- 1.8.3.2 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753124Ab3GSEWn (ORCPT ); Fri, 19 Jul 2013 00:22:43 -0400 Received: from zene.cmpxchg.org ([85.214.230.12]:50459 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751706Ab3GSEWl (ORCPT ); Fri, 19 Jul 2013 00:22:41 -0400 Date: Fri, 19 Jul 2013 00:22:38 -0400 From: Johannes Weiner To: Michal Hocko Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , righi.andrea@gmail.com Subject: [patch 1/5] mm: invoke oom-killer from remaining unconverted page fault handlers Message-ID: <20130719042238.GD17812@cmpxchg.org> References: <20130710182506.F25DF461@pobox.sk> <20130711072507.GA21667@dhcp22.suse.cz> <20130714012641.C2DA4E05@pobox.sk> <20130714015112.FFCB7AF7@pobox.sk> <20130715154119.GA32435@dhcp22.suse.cz> <20130715160006.GB32435@dhcp22.suse.cz> <20130716153544.GX17812@cmpxchg.org> <20130716160905.GA20018@dhcp22.suse.cz> <20130716164830.GZ17812@cmpxchg.org> <20130719042124.GC17812@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130719042124.GC17812@cmpxchg.org> Sender:
linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org [already upstream, included for 3.2 reference] A few remaining architectures directly kill the page faulting task in an out of memory situation. This is usually not a good idea since that task might not even use a significant amount of memory and so may not be the optimal victim to resolve the situation. Since 2.6.29's 1c0fe6e ("mm: invoke oom-killer from page fault") there is a hook that architecture page fault handlers are supposed to call to invoke the OOM killer and let it pick the right task to kill. Convert the remaining architectures over to this hook. To have the previous behavior of simply taking out the faulting task the vm.oom_kill_allocating_task sysctl can be set to 1. Signed-off-by: Johannes Weiner Reviewed-by: Michal Hocko Cc: KAMEZAWA Hiroyuki Acked-by: David Rientjes Acked-by: Vineet Gupta [arch/arc bits] Cc: James Hogan Cc: David Howells Cc: Jonas Bonn Cc: Chen Liqin Cc: Lennox Wu Cc: Chris Metcalf Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/mn10300/mm/fault.c | 7 ++++--- arch/openrisc/mm/fault.c | 8 ++++---- arch/score/mm/fault.c | 8 ++++---- arch/tile/mm/fault.c | 8 ++++---- 4 files changed, 16 insertions(+), 15 deletions(-) diff --git a/arch/mn10300/mm/fault.c b/arch/mn10300/mm/fault.c index 0945409..5ac4df5 100644 --- a/arch/mn10300/mm/fault.c +++ b/arch/mn10300/mm/fault.c @@ -329,9 +329,10 @@ no_context: */ out_of_memory: up_read(&mm->mmap_sem); - printk(KERN_ALERT "VM: killing process %s\n", tsk->comm); - if ((fault_code & MMUFCR_xFC_ACCESS) == MMUFCR_xFC_ACCESS_USR) - do_exit(SIGKILL); + if ((fault_code & MMUFCR_xFC_ACCESS) == MMUFCR_xFC_ACCESS_USR) { + pagefault_out_of_memory(); + return; + } goto no_context; do_sigbus: diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c index a5dce82..d78881c 100644 --- a/arch/openrisc/mm/fault.c +++ b/arch/openrisc/mm/fault.c @@ -246,10 +246,10 @@ out_of_memory: __asm__ 
__volatile__("l.nop 1"); up_read(&mm->mmap_sem); - printk("VM: killing process %s\n", tsk->comm); - if (user_mode(regs)) - do_exit(SIGKILL); - goto no_context; + if (!user_mode(regs)) + goto no_context; + pagefault_out_of_memory(); + return; do_sigbus: up_read(&mm->mmap_sem); diff --git a/arch/score/mm/fault.c b/arch/score/mm/fault.c index 47b600e..6b18fb0 100644 --- a/arch/score/mm/fault.c +++ b/arch/score/mm/fault.c @@ -172,10 +172,10 @@ out_of_memory: down_read(&mm->mmap_sem); goto survive; } - printk("VM: killing process %s\n", tsk->comm); - if (user_mode(regs)) - do_group_exit(SIGKILL); - goto no_context; + if (!user_mode(regs)) + goto no_context; + pagefault_out_of_memory(); + return; do_sigbus: up_read(&mm->mmap_sem); diff --git a/arch/tile/mm/fault.c b/arch/tile/mm/fault.c index 25b7b90..3312531 100644 --- a/arch/tile/mm/fault.c +++ b/arch/tile/mm/fault.c @@ -540,10 +540,10 @@ out_of_memory: down_read(&mm->mmap_sem); goto survive; } - pr_alert("VM: killing process %s\n", tsk->comm); - if (!is_kernel_mode) - do_group_exit(SIGKILL); - goto no_context; + if (is_kernel_mode) + goto no_context; + pagefault_out_of_memory(); + return 0; do_sigbus: up_read(&mm->mmap_sem); -- 1.8.3.2 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753188Ab3GSEYa (ORCPT ); Fri, 19 Jul 2013 00:24:30 -0400 Received: from zene.cmpxchg.org ([85.214.230.12]:50465 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750938Ab3GSEY2 (ORCPT ); Fri, 19 Jul 2013 00:24:28 -0400 Date: Fri, 19 Jul 2013 00:24:24 -0400 From: Johannes Weiner To: Michal Hocko Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , righi.andrea@gmail.com Subject: [patch 2/5] mm: pass userspace fault flag to generic fault handler Message-ID: <20130719042424.GE17812@cmpxchg.org> References: <20130710182506.F25DF461@pobox.sk> <20130711072507.GA21667@dhcp22.suse.cz> 
<20130714012641.C2DA4E05@pobox.sk> <20130714015112.FFCB7AF7@pobox.sk> <20130715154119.GA32435@dhcp22.suse.cz> <20130715160006.GB32435@dhcp22.suse.cz> <20130716153544.GX17812@cmpxchg.org> <20130716160905.GA20018@dhcp22.suse.cz> <20130716164830.GZ17812@cmpxchg.org> <20130719042124.GC17812@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130719042124.GC17812@cmpxchg.org> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org The global OOM killer is (XXX: for most architectures) only invoked for userspace faults, not for faults from kernelspace (uaccess, gup). Memcg OOM handling is currently invoked for all faults. Allow it to behave like the global case by having the architectures pass a flag to the generic fault handler code that identifies userspace faults. Signed-off-by: Johannes Weiner --- arch/alpha/mm/fault.c | 8 +++++++- arch/arm/mm/fault.c | 12 +++++++++--- arch/avr32/mm/fault.c | 8 +++++++- arch/cris/mm/fault.c | 8 +++++++- arch/frv/mm/fault.c | 8 +++++++- arch/hexagon/mm/vm_fault.c | 8 +++++++- arch/ia64/mm/fault.c | 8 +++++++- arch/m32r/mm/fault.c | 8 +++++++- arch/m68k/mm/fault.c | 8 +++++++- arch/microblaze/mm/fault.c | 8 +++++++- arch/mips/mm/fault.c | 8 +++++++- arch/mn10300/mm/fault.c | 8 +++++++- arch/openrisc/mm/fault.c | 8 +++++++- arch/parisc/mm/fault.c | 8 +++++++- arch/powerpc/mm/fault.c | 8 +++++++- arch/s390/mm/fault.c | 2 ++ arch/score/mm/fault.c | 7 ++++++- arch/sh/mm/fault_32.c | 8 +++++++- arch/sh/mm/tlbflush_64.c | 8 +++++++- arch/sparc/mm/fault_32.c | 8 +++++++- arch/sparc/mm/fault_64.c | 8 +++++++- arch/tile/mm/fault.c | 7 ++++++- arch/um/kernel/trap.c | 8 +++++++- arch/unicore32/mm/fault.c | 13 +++++++++---- arch/x86/mm/fault.c | 8 ++++++-- arch/xtensa/mm/fault.c | 8 +++++++- include/linux/mm.h | 1 + 27 files changed, 179 insertions(+), 31 deletions(-) diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c index 
fadd5f8..fa6b4e4 100644 --- a/arch/alpha/mm/fault.c +++ b/arch/alpha/mm/fault.c @@ -89,6 +89,7 @@ do_page_fault(unsigned long address, unsigned long mmcsr, struct mm_struct *mm = current->mm; const struct exception_table_entry *fixup; int fault, si_code = SEGV_MAPERR; + unsigned long flags = 0; siginfo_t info; /* As of EV6, a load into $31/$f31 is a prefetch, and never faults @@ -142,10 +143,15 @@ do_page_fault(unsigned long address, unsigned long mmcsr, goto bad_area; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (cause > 0) + flags |= FAULT_FLAG_WRITE; + /* If for any reason at all we couldn't handle the fault, make sure we exit gracefully rather than endlessly redo the fault. */ - fault = handle_mm_fault(mm, vma, address, cause > 0 ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); up_read(&mm->mmap_sem); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c index aa33949..31b1e69 100644 --- a/arch/arm/mm/fault.c +++ b/arch/arm/mm/fault.c @@ -231,9 +231,10 @@ static inline bool access_error(unsigned int fsr, struct vm_area_struct *vma) static int __kprobes __do_page_fault(struct mm_struct *mm, unsigned long addr, unsigned int fsr, - struct task_struct *tsk) + struct task_struct *tsk, struct pt_regs *regs) { struct vm_area_struct *vma; + unsigned long flags = 0; int fault; vma = find_vma(mm, addr); @@ -253,11 +254,16 @@ good_area: goto out; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (fsr & FSR_WRITE) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, make * sure we exit gracefully rather than endlessly redo the fault. */ - fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, (fsr & FSR_WRITE) ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, flags); if (unlikely(fault & VM_FAULT_ERROR)) return fault; if (fault & VM_FAULT_MAJOR) @@ -320,7 +326,7 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs) #endif } - fault = __do_page_fault(mm, addr, fsr, tsk); + fault = __do_page_fault(mm, addr, fsr, tsk, regs); up_read(&mm->mmap_sem); perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr); diff --git a/arch/avr32/mm/fault.c b/arch/avr32/mm/fault.c index f7040a1..ada6237 100644 --- a/arch/avr32/mm/fault.c +++ b/arch/avr32/mm/fault.c @@ -59,6 +59,7 @@ asmlinkage void do_page_fault(unsigned long ecr, struct pt_regs *regs) struct mm_struct *mm; struct vm_area_struct *vma; const struct exception_table_entry *fixup; + unsigned long flags = 0; unsigned long address; unsigned long page; int writeaccess; @@ -127,12 +128,17 @@ good_area: panic("Unhandled case %lu in do_page_fault!", ecr); } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (writeaccess) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, make * sure we exit gracefully rather than endlessly redo the * fault. */ - fault = handle_mm_fault(mm, vma, address, writeaccess ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/cris/mm/fault.c b/arch/cris/mm/fault.c index 9dcac8e..35d096a 100644 --- a/arch/cris/mm/fault.c +++ b/arch/cris/mm/fault.c @@ -55,6 +55,7 @@ do_page_fault(unsigned long address, struct pt_regs *regs, struct task_struct *tsk; struct mm_struct *mm; struct vm_area_struct * vma; + unsigned long flags = 0; siginfo_t info; int fault; @@ -156,13 +157,18 @@ do_page_fault(unsigned long address, struct pt_regs *regs, goto bad_area; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (writeaccess & 1) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, (writeaccess & 1) ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/frv/mm/fault.c b/arch/frv/mm/fault.c index a325d57..2dbf219 100644 --- a/arch/frv/mm/fault.c +++ b/arch/frv/mm/fault.c @@ -35,6 +35,7 @@ asmlinkage void do_page_fault(int datammu, unsigned long esr0, unsigned long ear struct vm_area_struct *vma; struct mm_struct *mm; unsigned long _pme, lrai, lrad, fixup; + unsigned long flags = 0; siginfo_t info; pgd_t *pge; pud_t *pue; @@ -158,12 +159,17 @@ asmlinkage void do_page_fault(int datammu, unsigned long esr0, unsigned long ear break; } + if (user_mode(__frame)) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, ear0, write ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, ear0, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/hexagon/mm/vm_fault.c b/arch/hexagon/mm/vm_fault.c index c10b76f..e56baf3 100644 --- a/arch/hexagon/mm/vm_fault.c +++ b/arch/hexagon/mm/vm_fault.c @@ -52,6 +52,7 @@ void do_page_fault(unsigned long address, long cause, struct pt_regs *regs) siginfo_t info; int si_code = SEGV_MAPERR; int fault; + unsigned long flags = 0; const struct exception_table_entry *fixup; /* @@ -96,7 +97,12 @@ good_area: break; } - fault = handle_mm_fault(mm, vma, address, (cause > 0)); + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (cause > 0) + flags |= FAULT_FLAG_WRITE; + + fault = handle_mm_fault(mm, vma, address, flags); /* The most common case -- we are done. */ if (likely(!(fault & VM_FAULT_ERROR))) { diff --git a/arch/ia64/mm/fault.c b/arch/ia64/mm/fault.c index 20b3593..ad9ef9d 100644 --- a/arch/ia64/mm/fault.c +++ b/arch/ia64/mm/fault.c @@ -79,6 +79,7 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re int signal = SIGSEGV, code = SEGV_MAPERR; struct vm_area_struct *vma, *prev_vma; struct mm_struct *mm = current->mm; + unsigned long flags = 0; struct siginfo si; unsigned long mask; int fault; @@ -149,12 +150,17 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re if ((vma->vm_flags & mask) != mask) goto bad_area; + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (mask & VM_WRITE) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, make * sure we exit gracefully rather than endlessly redo the * fault. */ - fault = handle_mm_fault(mm, vma, address, (mask & VM_WRITE) ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { /* * We ran out of memory, or some other thing happened diff --git a/arch/m32r/mm/fault.c b/arch/m32r/mm/fault.c index 2c9aeb4..e74f6fa 100644 --- a/arch/m32r/mm/fault.c +++ b/arch/m32r/mm/fault.c @@ -79,6 +79,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long error_code, struct mm_struct *mm; struct vm_area_struct * vma; unsigned long page, addr; + unsigned long flags = 0; int write; int fault; siginfo_t info; @@ -188,6 +189,11 @@ good_area: if ((error_code & ACE_INSTRUCTION) && !(vma->vm_flags & VM_EXEC)) goto bad_area; + if (error_code & ACE_USERMODE) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo @@ -195,7 +201,7 @@ good_area: */ addr = (address & PAGE_MASK); set_thread_fault_code(error_code); - fault = handle_mm_fault(mm, vma, addr, write ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, addr, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c index 2db6099..ab88a91 100644 --- a/arch/m68k/mm/fault.c +++ b/arch/m68k/mm/fault.c @@ -73,6 +73,7 @@ int do_page_fault(struct pt_regs *regs, unsigned long address, { struct mm_struct *mm = current->mm; struct vm_area_struct * vma; + unsigned long flags = 0; int write, fault; #ifdef DEBUG @@ -134,13 +135,18 @@ good_area: goto acc_err; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, write ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); #ifdef DEBUG printk("handle_mm_fault returns %d\n",fault); #endif diff --git a/arch/microblaze/mm/fault.c b/arch/microblaze/mm/fault.c index ae97d2c..b002612 100644 --- a/arch/microblaze/mm/fault.c +++ b/arch/microblaze/mm/fault.c @@ -89,6 +89,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long address, { struct vm_area_struct *vma; struct mm_struct *mm = current->mm; + unsigned long flags = 0; siginfo_t info; int code = SEGV_MAPERR; int is_write = error_code & ESR_S; @@ -206,12 +207,17 @@ good_area: goto bad_area; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (is_write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, is_write ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c index 937cf33..e5b9fed 100644 --- a/arch/mips/mm/fault.c +++ b/arch/mips/mm/fault.c @@ -40,6 +40,7 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs, unsigned long writ struct task_struct *tsk = current; struct mm_struct *mm = tsk->mm; const int field = sizeof(unsigned long) * 2; + unsigned long flags = 0; siginfo_t info; int fault; @@ -139,12 +140,17 @@ good_area: } } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, write ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) diff --git a/arch/mn10300/mm/fault.c b/arch/mn10300/mm/fault.c index 5ac4df5..031be56 100644 --- a/arch/mn10300/mm/fault.c +++ b/arch/mn10300/mm/fault.c @@ -121,6 +121,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long fault_code, { struct vm_area_struct *vma; struct task_struct *tsk; + unsigned long flags = 0; struct mm_struct *mm; unsigned long page; siginfo_t info; @@ -247,12 +248,17 @@ good_area: break; } + if ((fault_code & MMUFCR_xFC_ACCESS) == MMUFCR_xFC_ACCESS_USR) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, write ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c index d78881c..d586119 100644 --- a/arch/openrisc/mm/fault.c +++ b/arch/openrisc/mm/fault.c @@ -52,6 +52,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address, struct task_struct *tsk; struct mm_struct *mm; struct vm_area_struct *vma; + unsigned long flags = 0; siginfo_t info; int fault; @@ -153,13 +154,18 @@ good_area: if ((vector == 0x400) && !(vma->vm_page_prot.pgprot & _PAGE_EXEC)) goto bad_area; + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (write_acc) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. 
*/ - fault = handle_mm_fault(mm, vma, address, write_acc); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/parisc/mm/fault.c b/arch/parisc/mm/fault.c index 18162ce..a151e87 100644 --- a/arch/parisc/mm/fault.c +++ b/arch/parisc/mm/fault.c @@ -173,6 +173,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long code, struct vm_area_struct *vma, *prev_vma; struct task_struct *tsk = current; struct mm_struct *mm = tsk->mm; + unsigned long flags = 0; unsigned long acc_type; int fault; @@ -195,13 +196,18 @@ good_area: if ((vma->vm_flags & acc_type) != acc_type) goto bad_area; + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (acc_type & VM_WRITE) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, make * sure we exit gracefully rather than endlessly redo the * fault. */ - fault = handle_mm_fault(mm, vma, address, (acc_type & VM_WRITE) ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { /* * We hit a shared mapping outside of the file, or some diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c index 5efe8c9..2bf339c 100644 --- a/arch/powerpc/mm/fault.c +++ b/arch/powerpc/mm/fault.c @@ -122,6 +122,7 @@ int __kprobes do_page_fault(struct pt_regs *regs, unsigned long address, { struct vm_area_struct * vma; struct mm_struct *mm = current->mm; + unsigned long flags = 0; siginfo_t info; int code = SEGV_MAPERR; int is_write = 0, ret; @@ -305,12 +306,17 @@ good_area: goto bad_area; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (is_write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - ret = handle_mm_fault(mm, vma, address, is_write ? 
FAULT_FLAG_WRITE : 0); + ret = handle_mm_fault(mm, vma, address, flags); if (unlikely(ret & VM_FAULT_ERROR)) { if (ret & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c index a9a3018..fe6109c 100644 --- a/arch/s390/mm/fault.c +++ b/arch/s390/mm/fault.c @@ -301,6 +301,8 @@ static inline int do_exception(struct pt_regs *regs, int access, address = trans_exc_code & __FAIL_ADDR_MASK; perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); flags = FAULT_FLAG_ALLOW_RETRY; + if (regs->psw.mask & PSW_MASK_PSTATE) + flags |= FAULT_FLAG_USER; if (access == VM_WRITE || (trans_exc_code & store_indication) == 0x400) flags |= FAULT_FLAG_WRITE; down_read(&mm->mmap_sem); diff --git a/arch/score/mm/fault.c b/arch/score/mm/fault.c index 6b18fb0..2ca5ae5 100644 --- a/arch/score/mm/fault.c +++ b/arch/score/mm/fault.c @@ -47,6 +47,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long write, struct task_struct *tsk = current; struct mm_struct *mm = tsk->mm; const int field = sizeof(unsigned long) * 2; + unsigned long flags = 0; siginfo_t info; int fault; @@ -101,12 +102,16 @@ good_area: } survive: + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. 
*/ - fault = handle_mm_fault(mm, vma, address, write); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/sh/mm/fault_32.c b/arch/sh/mm/fault_32.c index 7bebd04..a61b803 100644 --- a/arch/sh/mm/fault_32.c +++ b/arch/sh/mm/fault_32.c @@ -126,6 +126,7 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs, struct task_struct *tsk; struct mm_struct *mm; struct vm_area_struct * vma; + unsigned long flags = 0; int si_code; int fault; siginfo_t info; @@ -195,12 +196,17 @@ good_area: goto bad_area; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (writeaccess) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, writeaccess ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/sh/mm/tlbflush_64.c b/arch/sh/mm/tlbflush_64.c index e3430e0..0a9d645 100644 --- a/arch/sh/mm/tlbflush_64.c +++ b/arch/sh/mm/tlbflush_64.c @@ -96,6 +96,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long writeaccess, struct mm_struct *mm; struct vm_area_struct * vma; const struct exception_table_entry *fixup; + unsigned long flags = 0; pte_t *pte; int fault; @@ -184,12 +185,17 @@ good_area: } } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (writeaccess) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, writeaccess ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/sparc/mm/fault_32.c b/arch/sparc/mm/fault_32.c index 8023fd7..efa3d48 100644 --- a/arch/sparc/mm/fault_32.c +++ b/arch/sparc/mm/fault_32.c @@ -222,6 +222,7 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write, struct vm_area_struct *vma; struct task_struct *tsk = current; struct mm_struct *mm = tsk->mm; + unsigned long flags = 0; unsigned int fixup; unsigned long g2; int from_user = !(regs->psr & PSR_PS); @@ -285,12 +286,17 @@ good_area: goto bad_area; } + if (from_user) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, write ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c index 504c062..bc536ea 100644 --- a/arch/sparc/mm/fault_64.c +++ b/arch/sparc/mm/fault_64.c @@ -276,6 +276,7 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs) { struct mm_struct *mm = current->mm; struct vm_area_struct *vma; + unsigned long flags = 0; unsigned int insn = 0; int si_code, fault_code, fault; unsigned long address, mm_rss; @@ -423,7 +424,12 @@ good_area: goto bad_area; } - fault = handle_mm_fault(mm, vma, address, (fault_code & FAULT_CODE_WRITE) ? 
FAULT_FLAG_WRITE : 0); + if (!(regs->tstate & TSTATE_PRIV)) + flags |= FAULT_FLAG_USER; + if (fault_code & FAULT_CODE_WRITE) + flags |= FAULT_FLAG_WRITE; + + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/tile/mm/fault.c b/arch/tile/mm/fault.c index 3312531..b2a7fd5 100644 --- a/arch/tile/mm/fault.c +++ b/arch/tile/mm/fault.c @@ -263,6 +263,7 @@ static int handle_page_fault(struct pt_regs *regs, struct mm_struct *mm; struct vm_area_struct *vma; unsigned long stack_offset; + unsigned long flags = 0; int fault; int si_code; int is_kernel_mode; @@ -415,12 +416,16 @@ good_area: } survive: + if (!is_kernel_mode) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, write); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c index dafc947..626a85e 100644 --- a/arch/um/kernel/trap.c +++ b/arch/um/kernel/trap.c @@ -25,6 +25,7 @@ int handle_page_fault(unsigned long address, unsigned long ip, { struct mm_struct *mm = current->mm; struct vm_area_struct *vma; + unsigned long flags = 0; pgd_t *pgd; pud_t *pud; pmd_t *pmd; @@ -62,10 +63,15 @@ good_area: if (!is_write && !(vma->vm_flags & (VM_READ | VM_EXEC))) goto out; + if (is_user) + flags |= FAULT_FLAG_USER; + if (is_write) + flags |= FAULT_FLAG_WRITE; + do { int fault; - fault = handle_mm_fault(mm, vma, address, is_write ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) { goto out_of_memory; diff --git a/arch/unicore32/mm/fault.c b/arch/unicore32/mm/fault.c index 283aa4b..3026943 100644 --- a/arch/unicore32/mm/fault.c +++ b/arch/unicore32/mm/fault.c @@ -169,9 +169,10 @@ static inline bool access_error(unsigned int fsr, struct vm_area_struct *vma) } static int __do_pf(struct mm_struct *mm, unsigned long addr, unsigned int fsr, - struct task_struct *tsk) + struct task_struct *tsk, struct pt_regs *regs) { struct vm_area_struct *vma; + unsigned long flags = 0; int fault; vma = find_vma(mm, addr); @@ -191,12 +192,16 @@ good_area: goto out; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (!(fsr ^ 0x12)) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, make * sure we exit gracefully rather than endlessly redo the fault. */ - fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, - (!(fsr ^ 0x12)) ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, flags); if (unlikely(fault & VM_FAULT_ERROR)) return fault; if (fault & VM_FAULT_MAJOR) @@ -252,7 +257,7 @@ static int do_pf(unsigned long addr, unsigned int fsr, struct pt_regs *regs) #endif } - fault = __do_pf(mm, addr, fsr, tsk); + fault = __do_pf(mm, addr, fsr, tsk, regs); up_read(&mm->mmap_sem); /* diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 5db0490..1cebabe 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -999,8 +999,7 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code) struct mm_struct *mm; int fault; int write = error_code & PF_WRITE; - unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE | - (write ? 
FAULT_FLAG_WRITE : 0); + unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE; tsk = current; mm = tsk->mm; @@ -1160,6 +1159,11 @@ good_area: return; } + if (error_code & PF_USER) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo diff --git a/arch/xtensa/mm/fault.c b/arch/xtensa/mm/fault.c index e367e30..7db9fbe 100644 --- a/arch/xtensa/mm/fault.c +++ b/arch/xtensa/mm/fault.c @@ -41,6 +41,7 @@ void do_page_fault(struct pt_regs *regs) struct mm_struct *mm = current->mm; unsigned int exccause = regs->exccause; unsigned int address = regs->excvaddr; + unsigned long flags = 0; siginfo_t info; int is_write, is_exec; @@ -101,11 +102,16 @@ good_area: if (!(vma->vm_flags & (VM_READ | VM_WRITE))) goto bad_area; + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (is_write) + flags |= FAULT_FLAG_WRITE; + /* If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, is_write ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/include/linux/mm.h b/include/linux/mm.h index 4baadd1..846b82b 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -156,6 +156,7 @@ extern pgprot_t protection_map[16]; #define FAULT_FLAG_ALLOW_RETRY 0x08 /* Retry fault if blocking */ #define FAULT_FLAG_RETRY_NOWAIT 0x10 /* Don't drop mmap_sem and wait when retrying */ #define FAULT_FLAG_KILLABLE 0x20 /* The fault task is in SIGKILL killable region */ +#define FAULT_FLAG_USER 0x40 /* The fault originated in userspace */ /* * This interface is used by x86 PAT code to identify a pfn mapping that is -- 1.8.3.2 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754678Ab3GSEZx (ORCPT ); Fri, 19 Jul 2013 00:25:53 -0400 Received: from zene.cmpxchg.org ([85.214.230.12]:50477 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752472Ab3GSEZv (ORCPT ); Fri, 19 Jul 2013 00:25:51 -0400 Date: Fri, 19 Jul 2013 00:25:47 -0400 From: Johannes Weiner To: Michal Hocko Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , righi.andrea@gmail.com Subject: [patch 4/5] memcg: do not trap chargers with full callstack on OOM Message-ID: <20130719042547.GG17812@cmpxchg.org> References: <20130710182506.F25DF461@pobox.sk> <20130711072507.GA21667@dhcp22.suse.cz> <20130714012641.C2DA4E05@pobox.sk> <20130714015112.FFCB7AF7@pobox.sk> <20130715154119.GA32435@dhcp22.suse.cz> <20130715160006.GB32435@dhcp22.suse.cz> <20130716153544.GX17812@cmpxchg.org> <20130716160905.GA20018@dhcp22.suse.cz> <20130716164830.GZ17812@cmpxchg.org> <20130719042124.GC17812@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130719042124.GC17812@cmpxchg.org> Sender: 
linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org The memcg OOM handling is incredibly fragile and can deadlock. When a task fails to charge memory, it invokes the OOM killer and loops right there in the charge code until it succeeds. Comparably, any other task that enters the charge path at this point will go to a waitqueue right then and there and sleep until the OOM situation is resolved. The problem is that these tasks may hold filesystem locks and the mmap_sem; locks that the selected OOM victim may need to exit. For example, in one reported case, the task invoking the OOM killer was about to charge a page cache page during a write(), which holds the i_mutex. The OOM killer selected a task that was just entering truncate() and trying to acquire the i_mutex: OOM invoking task: [] mem_cgroup_handle_oom+0x241/0x3b0 [] T.1146+0x5ab/0x5c0 [] mem_cgroup_cache_charge+0xbe/0xe0 [] add_to_page_cache_locked+0x4c/0x140 [] add_to_page_cache_lru+0x22/0x50 [] grab_cache_page_write_begin+0x8b/0xe0 [] ext3_write_begin+0x88/0x270 [] generic_file_buffered_write+0x116/0x290 [] __generic_file_aio_write+0x27c/0x480 [] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex [] do_sync_write+0xea/0x130 [] vfs_write+0xf3/0x1f0 [] sys_write+0x51/0x90 [] system_call_fastpath+0x18/0x1d [] 0xffffffffffffffff OOM kill victim: [] do_truncate+0x58/0xa0 # takes i_mutex [] do_last+0x250/0xa30 [] path_openat+0xd7/0x440 [] do_filp_open+0x49/0xa0 [] do_sys_open+0x106/0x240 [] sys_open+0x20/0x30 [] system_call_fastpath+0x18/0x1d [] 0xffffffffffffffff The OOM handling task will retry the charge indefinitely while the OOM killed task is not releasing any resources. A similar scenario can happen when the kernel OOM killer for a memcg is disabled and a userspace task is in charge of resolving OOM situations. 
In this case, ALL tasks that enter the OOM path will be made to sleep on the OOM waitqueue and wait for userspace to free resources or increase the group's limit. But a userspace OOM handler is prone to deadlock itself on the locks held by the waiting tasks. For example, one of the sleeping tasks may be stuck in a brk() call with the mmap_sem held for writing but the userspace handler, in order to pick an optimal victim, may need to read files from /proc/, which tries to acquire the same mmap_sem for reading and deadlocks.

This patch changes the way tasks behave after detecting a memcg OOM and makes sure nobody loops or sleeps with locks held:

0. When OOMing in a system call (buffered IO and friends), do not invoke the OOM killer, do not sleep on an OOM waitqueue, just return -ENOMEM. Userspace should be able to handle this and it prevents anybody from looping or waiting with locks held.

1. When OOMing in a kernel fault, do not invoke the OOM killer, do not sleep on the OOM waitqueue, just return -ENOMEM. The kernel fault stack knows how to handle this. If a kernel fault is nested inside a user fault, however, user fault handling applies:

2. When OOMing in a user fault, invoke the OOM killer and restart the fault instead of looping on the charge attempt. This way, the OOM victim cannot get stuck on locks the looping task may hold.

3. When OOMing in a user fault but somebody else is handling it (either the kernel OOM killer or a userspace handler), don't go to sleep in the charge context. Instead, remember the OOMing memcg in the task struct and then fully unwind the page fault stack with -ENOMEM. pagefault_out_of_memory() will then call back into the memcg code to check if the -ENOMEM came from the memcg, and then either put the task to sleep on the memcg's OOM waitqueue or just restart the fault. The OOM victim can no longer get stuck on any lock a sleeping task may hold.
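The four cases above can be read as a small decision table. The following is only a userspace sketch of that table, not the kernel code from this patch: the enum names and the function memcg_oom_action() are hypothetical and exist purely for illustration.

```c
#include <stdbool.h>

/* Hypothetical model of cases 0-3; none of these names exist in the kernel. */
enum oom_context {
	CTX_SYSCALL,       /* case 0: charge failed inside a system call */
	CTX_KERNEL_FAULT,  /* case 1: charge failed inside a kernel fault */
	CTX_USER_FAULT     /* cases 2 and 3: charge failed inside a user fault */
};

enum oom_action {
	ACT_RETURN_ENOMEM,          /* no kill, no sleep, just fail the charge */
	ACT_KILL_AND_RESTART_FAULT, /* invoke the OOM killer, unwind, restart */
	ACT_UNWIND_THEN_WAIT        /* unwind first, sleep only with no locks held */
};

/*
 * nested_in_user_fault: a kernel fault taken while a user fault is being
 * handled, in which case user fault handling applies.
 * someone_else_handling: another task already runs the kernel OOM killer,
 * or the kernel killer is disabled and a userspace handler is in charge.
 */
enum oom_action memcg_oom_action(enum oom_context ctx,
				 bool nested_in_user_fault,
				 bool someone_else_handling)
{
	bool user_fault = ctx == CTX_USER_FAULT ||
			  (ctx == CTX_KERNEL_FAULT && nested_in_user_fault);

	if (!user_fault)
		return ACT_RETURN_ENOMEM;
	if (someone_else_handling)
		return ACT_UNWIND_THEN_WAIT;
	return ACT_KILL_AND_RESTART_FAULT;
}
```

In every branch the charge attempt itself fails instead of looping or sleeping; the three outcomes differ only in what happens after the fault stack has been unwound.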
While reworking the OOM routine, also remove a needless OOM waitqueue wakeup when invoking the killer. In addition to the wakeup implied in the kill signal delivery, only uncharges and limit increases, things that actually change the memory situation, should poke the waitqueue. Reported-by: azurIt Debugged-by: Michal Hocko Signed-off-by: Johannes Weiner --- include/linux/memcontrol.h | 22 +++++++ include/linux/sched.h | 6 ++ mm/filemap.c | 14 ++++- mm/ksm.c | 2 +- mm/memcontrol.c | 139 +++++++++++++++++++++++++++++---------------- mm/memory.c | 37 ++++++++---- mm/oom_kill.c | 2 + 7 files changed, 159 insertions(+), 63 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index b87068a..b92e5e7 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -120,6 +120,17 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page); extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p); +static inline unsigned int mem_cgroup_xchg_may_oom(struct task_struct *p, + unsigned int new) +{ + unsigned int old; + + old = p->memcg_oom.may_oom; + p->memcg_oom.may_oom = new; + + return old; +} +bool mem_cgroup_oom_synchronize(void); #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP extern int do_swap_account; #endif @@ -333,6 +344,17 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p) { } +static inline unsigned int mem_cgroup_xchg_may_oom(struct task_struct *p, + unsigned int new) +{ + return 0; +} + +static inline bool mem_cgroup_oom_synchronize(void) +{ + return false; +} + static inline void mem_cgroup_inc_page_stat(struct page *page, enum mem_cgroup_page_stat_item idx) { diff --git a/include/linux/sched.h b/include/linux/sched.h index 1c4f3e9..7e6c9e9 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1568,6 +1568,12 @@ struct task_struct { unsigned long nr_pages; /* uncharged usage */ unsigned long memsw_nr_pages; /* uncharged mem+swap usage */ } 
memcg_batch; + struct memcg_oom_info { + unsigned int may_oom:1; + unsigned int in_memcg_oom:1; + int wakeups; + struct mem_cgroup *wait_on_memcg; + } memcg_oom; #endif #ifdef CONFIG_HAVE_HW_BREAKPOINT atomic_t ptrace_bp_refcnt; diff --git a/mm/filemap.c b/mm/filemap.c index 5f0a3c9..d18bd47 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1660,6 +1660,7 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf) struct file_ra_state *ra = &file->f_ra; struct inode *inode = mapping->host; pgoff_t offset = vmf->pgoff; + unsigned int may_oom; struct page *page; pgoff_t size; int ret = 0; @@ -1669,7 +1670,12 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf) return VM_FAULT_SIGBUS; /* - * Do we have something in the page cache already? + * Do we have something in the page cache already? Either + * way, try readahead, but disable the memcg OOM killer for it + * as readahead is optional and no errors are propagated up + * the fault stack, which does not allow proper unwinding of a + * memcg OOM state. The OOM killer is enabled while trying to + * instantiate the faulting page individually below. */ page = find_get_page(mapping, offset); if (likely(page)) { @@ -1677,10 +1683,14 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf) * We found the page, so try async readahead before * waiting for the lock. */ + may_oom = mem_cgroup_xchg_may_oom(current, 0); do_async_mmap_readahead(vma, ra, file, page, offset); + mem_cgroup_xchg_may_oom(current, may_oom); } else { - /* No page in the page cache at all */ + /* No page in the page cache at all. 
*/ + may_oom = mem_cgroup_xchg_may_oom(current, 0); do_sync_mmap_readahead(vma, ra, file, offset); + mem_cgroup_xchg_may_oom(current, may_oom); count_vm_event(PGMAJFAULT); mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT); ret = VM_FAULT_MAJOR; diff --git a/mm/ksm.c b/mm/ksm.c index 310544a..ae7e4ae 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -338,7 +338,7 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr) break; if (PageKsm(page)) ret = handle_mm_fault(vma->vm_mm, vma, addr, - FAULT_FLAG_WRITE); + FAULT_FLAG_WRITE); else ret = VM_FAULT_WRITE; put_page(page); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index b63f5f7..99b0101 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -249,6 +249,7 @@ struct mem_cgroup { bool oom_lock; atomic_t under_oom; + atomic_t oom_wakeups; atomic_t refcnt; @@ -1846,6 +1847,7 @@ static int memcg_oom_wake_function(wait_queue_t *wait, static void memcg_wakeup_oom(struct mem_cgroup *memcg) { + atomic_inc(&memcg->oom_wakeups); /* for filtering, pass "memcg" as argument. */ __wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg); } @@ -1857,30 +1859,20 @@ static void memcg_oom_recover(struct mem_cgroup *memcg) } /* - * try to call OOM killer. returns false if we should exit memory-reclaim loop. 
+ * try to call OOM killer */ -bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask) +static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask) { - struct oom_wait_info owait; - bool locked, need_to_kill; + bool locked, need_to_kill = true; - owait.mem = memcg; - owait.wait.flags = 0; - owait.wait.func = memcg_oom_wake_function; - owait.wait.private = current; - INIT_LIST_HEAD(&owait.wait.task_list); - need_to_kill = true; - mem_cgroup_mark_under_oom(memcg); + if (!current->memcg_oom.may_oom) + return; + + current->memcg_oom.in_memcg_oom = 1; /* At first, try to OOM lock hierarchy under memcg.*/ spin_lock(&memcg_oom_lock); locked = mem_cgroup_oom_lock(memcg); - /* - * Even if signal_pending(), we can't quit charge() loop without - * accounting. So, UNINTERRUPTIBLE is appropriate. But SIGKILL - * under OOM is always welcomed, use TASK_KILLABLE here. - */ - prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE); if (!locked || memcg->oom_kill_disable) need_to_kill = false; if (locked) @@ -1888,24 +1880,86 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask) spin_unlock(&memcg_oom_lock); if (need_to_kill) { - finish_wait(&memcg_oom_waitq, &owait.wait); mem_cgroup_out_of_memory(memcg, mask); } else { - schedule(); - finish_wait(&memcg_oom_waitq, &owait.wait); + /* + * A system call can just return -ENOMEM, but if this + * is a page fault and somebody else is handling the + * OOM already, we need to sleep on the OOM waitqueue + * for this memcg until the situation is resolved. + * Which can take some time because it might be + * handled by a userspace task. + * + * However, this is the charge context, which means + * that we may sit on a large call stack and hold + * various filesystem locks, the mmap_sem etc. and we + * don't want the OOM handler to deadlock on them + * while we sit here and wait. Store the current OOM + * context in the task_struct, then return -ENOMEM. 
+ * At the end of the page fault handler, with the + * stack unwound, pagefault_out_of_memory() will check + * back with us by calling + * mem_cgroup_oom_synchronize(), possibly putting the + * task to sleep. + */ + mem_cgroup_mark_under_oom(memcg); + current->memcg_oom.wakeups = atomic_read(&memcg->oom_wakeups); + css_get(&memcg->css); + current->memcg_oom.wait_on_memcg = memcg; } - spin_lock(&memcg_oom_lock); - if (locked) + + if (locked) { + spin_lock(&memcg_oom_lock); mem_cgroup_oom_unlock(memcg); - memcg_wakeup_oom(memcg); - spin_unlock(&memcg_oom_lock); + /* + * Sleeping tasks might have been killed, make sure + * they get scheduled so they can exit. + */ + if (need_to_kill) + memcg_oom_recover(memcg); + spin_unlock(&memcg_oom_lock); + } +} - mem_cgroup_unmark_under_oom(memcg); +bool mem_cgroup_oom_synchronize(void) +{ + struct oom_wait_info owait; + struct mem_cgroup *memcg; - if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current)) + /* OOM is global, do not handle */ + if (!current->memcg_oom.in_memcg_oom) return false; - /* Give chance to dying process */ - schedule_timeout_uninterruptible(1); + + /* + * We invoked the OOM killer but there is a chance that a kill + * did not free up any charges. Everybody else might already + * be sleeping, so restart the fault and keep the rampage + * going until some charges are released. 
+ */ + memcg = current->memcg_oom.wait_on_memcg; + if (!memcg) + goto out; + + if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current)) + goto out_put; + + owait.mem = memcg; + owait.wait.flags = 0; + owait.wait.func = memcg_oom_wake_function; + owait.wait.private = current; + INIT_LIST_HEAD(&owait.wait.task_list); + + prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE); + /* Only sleep if we didn't miss any wakeups since OOM */ + if (atomic_read(&memcg->oom_wakeups) == current->memcg_oom.wakeups) + schedule(); + finish_wait(&memcg_oom_waitq, &owait.wait); +out_put: + mem_cgroup_unmark_under_oom(memcg); + css_put(&memcg->css); + current->memcg_oom.wait_on_memcg = NULL; +out: + current->memcg_oom.in_memcg_oom = 0; return true; } @@ -2195,11 +2249,10 @@ enum { CHARGE_RETRY, /* need to retry but retry is not bad */ CHARGE_NOMEM, /* we can't do more. return -ENOMEM */ CHARGE_WOULDBLOCK, /* GFP_WAIT wasn't set and no enough res. */ - CHARGE_OOM_DIE, /* the current is killed because of OOM */ }; static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, - unsigned int nr_pages, bool oom_check) + unsigned int nr_pages, bool invoke_oom) { unsigned long csize = nr_pages * PAGE_SIZE; struct mem_cgroup *mem_over_limit; @@ -2257,14 +2310,10 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, if (mem_cgroup_wait_acct_move(mem_over_limit)) return CHARGE_RETRY; - /* If we don't need to call oom-killer at el, return immediately */ - if (!oom_check) - return CHARGE_NOMEM; - /* check OOM */ - if (!mem_cgroup_handle_oom(mem_over_limit, gfp_mask)) - return CHARGE_OOM_DIE; + if (invoke_oom) + mem_cgroup_oom(mem_over_limit, gfp_mask); - return CHARGE_RETRY; + return CHARGE_NOMEM; } /* @@ -2349,7 +2398,7 @@ again: } do { - bool oom_check; + bool invoke_oom = oom && !nr_oom_retries; /* If killed, bypass charge */ if (fatal_signal_pending(current)) { @@ -2357,13 +2406,7 @@ again: goto bypass; } - oom_check = false; - if (oom && 
!nr_oom_retries) { - oom_check = true; - nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES; - } - - ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, oom_check); + ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, invoke_oom); switch (ret) { case CHARGE_OK: break; @@ -2376,16 +2419,12 @@ again: css_put(&memcg->css); goto nomem; case CHARGE_NOMEM: /* OOM routine works */ - if (!oom) { + if (!oom || invoke_oom) { css_put(&memcg->css); goto nomem; } - /* If oom, we never return -ENOMEM */ nr_oom_retries--; break; - case CHARGE_OOM_DIE: /* Killed by OOM Killer */ - css_put(&memcg->css); - goto bypass; } } while (ret != CHARGE_OK); diff --git a/mm/memory.c b/mm/memory.c index 829d437..2be02b7 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3439,22 +3439,14 @@ unlock: /* * By the time we get here, we already hold the mm semaphore */ -int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, - unsigned long address, unsigned int flags) +static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, unsigned int flags) { pgd_t *pgd; pud_t *pud; pmd_t *pmd; pte_t *pte; - __set_current_state(TASK_RUNNING); - - count_vm_event(PGFAULT); - mem_cgroup_count_vm_event(mm, PGFAULT); - - /* do counter updates before entering really critical section. */ - check_sync_rss_stat(current); - if (unlikely(is_vm_hugetlb_page(vma))) return hugetlb_fault(mm, vma, address, flags); @@ -3503,6 +3495,31 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, return handle_pte_fault(mm, vma, address, pte, pmd, flags); } +int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, unsigned int flags) +{ + int userfault = flags & FAULT_FLAG_USER; + int ret; + + __set_current_state(TASK_RUNNING); + + count_vm_event(PGFAULT); + mem_cgroup_count_vm_event(mm, PGFAULT); + + /* do counter updates before entering really critical section. 
*/ + check_sync_rss_stat(current); + + if (userfault) + WARN_ON(mem_cgroup_xchg_may_oom(current, 1) == 1); + + ret = __handle_mm_fault(mm, vma, address, flags); + + if (userfault) + WARN_ON(mem_cgroup_xchg_may_oom(current, 0) == 0); + + return ret; +} + #ifndef __PAGETABLE_PUD_FOLDED /* * Allocate page upper directory. diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 069b64e..aa60863 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -785,6 +785,8 @@ out: */ void pagefault_out_of_memory(void) { + if (mem_cgroup_oom_synchronize()) + return; if (try_set_system_oom()) { out_of_memory(NULL, 0, 0, NULL); clear_system_oom(); -- 1.8.3.2 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933516Ab3GSE02 (ORCPT ); Fri, 19 Jul 2013 00:26:28 -0400 Received: from zene.cmpxchg.org ([85.214.230.12]:50483 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752123Ab3GSE00 (ORCPT ); Fri, 19 Jul 2013 00:26:26 -0400 Date: Fri, 19 Jul 2013 00:26:23 -0400 From: Johannes Weiner To: Michal Hocko Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , righi.andrea@gmail.com Subject: [patch 5/5] mm: memcontrol: sanity check memcg OOM context unwind Message-ID: <20130719042623.GH17812@cmpxchg.org> References: <20130710182506.F25DF461@pobox.sk> <20130711072507.GA21667@dhcp22.suse.cz> <20130714012641.C2DA4E05@pobox.sk> <20130714015112.FFCB7AF7@pobox.sk> <20130715154119.GA32435@dhcp22.suse.cz> <20130715160006.GB32435@dhcp22.suse.cz> <20130716153544.GX17812@cmpxchg.org> <20130716160905.GA20018@dhcp22.suse.cz> <20130716164830.GZ17812@cmpxchg.org> <20130719042124.GC17812@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130719042124.GC17812@cmpxchg.org> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Catch the cases where a memcg 
OOM context is set up in the failed charge path but the fault handler is not actually returning VM_FAULT_ERROR, which would be required to properly finalize the OOM. Example output: the first trace shows the stack at the end of handle_mm_fault() where an unexpected memcg OOM context is detected. The subsequent trace is of whoever set up that OOM context. In this case it was the charging of readahead pages in a file fault, which does not propagate VM_FAULT_OOM on failure and should disable OOM: [ 27.805359] WARNING: at /home/hannes/src/linux/linux/mm/memory.c:3523 handle_mm_fault+0x1fb/0x3f0() [ 27.805360] Hardware name: PowerEdge 1950 [ 27.805361] Fixing unhandled memcg OOM context, set up from: [ 27.805362] Pid: 1599, comm: file Tainted: G W 3.2.0-00005-g6d10010 #97 [ 27.805363] Call Trace: [ 27.805365] [] warn_slowpath_common+0x6a/0xa0 [ 27.805367] [] warn_slowpath_fmt+0x41/0x50 [ 27.805369] [] handle_mm_fault+0x1fb/0x3f0 [ 27.805371] [] do_page_fault+0x140/0x4a0 [ 27.805373] [] ? do_mmap_pgoff+0x34b/0x360 [ 27.805376] [] page_fault+0x1f/0x30 [ 27.805377] ---[ end trace 305ec584fba81649 ]--- [ 27.805378] [] __mem_cgroup_try_charge+0x5c8/0x7e0 [ 27.805380] [] mem_cgroup_cache_charge+0xac/0x110 [ 27.805381] [] add_to_page_cache_locked+0x3e/0x120 [ 27.805383] [] add_to_page_cache_lru+0x15/0x40 [ 27.805385] [] mpage_readpages+0xc3/0x150 [ 27.805387] [] ext4_readpages+0x18/0x20 [ 27.805388] [] __do_page_cache_readahead+0x1c1/0x270 [ 27.805390] [] ra_submit+0x1c/0x20 [ 27.805392] [] filemap_fault+0x3f4/0x450 [ 27.805394] [] __do_fault+0x6d/0x510 [ 27.805395] [] handle_pte_fault+0x8a/0x920 [ 27.805397] [] handle_mm_fault+0x19c/0x3f0 [ 27.805398] [] do_page_fault+0x140/0x4a0 [ 27.805400] [] page_fault+0x1f/0x30 [ 27.805401] [] 0xffffffffffffffff Debug patch only. 
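The condition this debug patch warns about can be stated as a one-line predicate. The following is a minimal userspace sketch, assuming a stand-in struct (memcg_oom_model) and a locally defined VM_FAULT_OOM bit; both are illustrative and not the kernel's definitions.

```c
#include <stdbool.h>

/* Illustrative stand-in for the kernel's fault return bit. */
#define VM_FAULT_OOM 0x0001u

/* Hypothetical model of the per-task state the check looks at. */
struct memcg_oom_model {
	bool in_memcg_oom; /* set when a failed charge set up an OOM context */
};

/*
 * True when an OOM context was set up during the fault but the fault
 * handler's return value does not report OOM, so nothing would finalize
 * that context later.
 */
bool unhandled_memcg_oom(unsigned int fault_ret,
			 const struct memcg_oom_model *task)
{
	return !(fault_ret & VM_FAULT_OOM) && task->in_memcg_oom;
}
```

In the readahead case from the example output, the charge fails and sets up the context while the fault still returns success, so a predicate of this shape would fire.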
Not-signed-off-by: Johannes Weiner --- include/linux/sched.h | 3 +++ mm/memcontrol.c | 7 +++++++ mm/memory.c | 9 +++++++++ 3 files changed, 19 insertions(+) diff --git a/include/linux/sched.h b/include/linux/sched.h index 7e6c9e9..a77d198 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -91,6 +91,7 @@ struct sched_param { #include #include #include +#include #include @@ -1571,6 +1572,8 @@ struct task_struct { struct memcg_oom_info { unsigned int may_oom:1; unsigned int in_memcg_oom:1; + struct stack_trace trace; + unsigned long trace_entries[16]; int wakeups; struct mem_cgroup *wait_on_memcg; } memcg_oom; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 99b0101..c47c77e 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -49,6 +49,7 @@ #include #include #include +#include #include "internal.h" #include @@ -1870,6 +1871,12 @@ static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask) current->memcg_oom.in_memcg_oom = 1; + current->memcg_oom.trace.nr_entries = 0; + current->memcg_oom.trace.max_entries = 16; + current->memcg_oom.trace.entries = current->memcg_oom.trace_entries; + current->memcg_oom.trace.skip = 1; + save_stack_trace(&current->memcg_oom.trace); + /* At first, try to OOM lock hierarchy under memcg.*/ spin_lock(&memcg_oom_lock); locked = mem_cgroup_oom_lock(memcg); diff --git a/mm/memory.c b/mm/memory.c index 2be02b7..fc6d741 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -57,6 +57,7 @@ #include #include #include +#include #include #include @@ -3517,6 +3518,14 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, if (userfault) WARN_ON(mem_cgroup_xchg_may_oom(current, 0) == 0); +#ifdef CONFIG_CGROUP_MEM_RES_CTLR + if (WARN(!(ret & VM_FAULT_OOM) && current->memcg_oom.in_memcg_oom, + "Fixing unhandled memcg OOM context, set up from:\n")) { + print_stack_trace(&current->memcg_oom.trace, 0); + mem_cgroup_oom_synchronize(); + } +#endif + return ret; } -- 1.8.3.2 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received:
(majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933940Ab3GSIXs (ORCPT ); Fri, 19 Jul 2013 04:23:48 -0400 Received: from gmmr3.centrum.cz ([46.255.225.251]:40984 "EHLO gmmr3.centrum.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751751Ab3GSIXm (ORCPT ); Fri, 19 Jul 2013 04:23:42 -0400 To: =?utf-8?q?Johannes_Weiner?= , =?utf-8?q?Michal_Hocko?= Subject: =?utf-8?q?Re=3A_=5BPATCH_for_3=2E2=5D_memcg=3A_do_not_trap_chargers_with_full_callstack_on_OOM?= Date: Fri, 19 Jul 2013 10:23:39 +0200 From: "azurIt" Cc: , , =?utf-8?q?cgroups_mailinglist?= , =?utf-8?q?KAMEZAWA_Hiroyuki?= , References: <20130709135450.GI20281@dhcp22.suse.cz>, <20130710182506.F25DF461@pobox.sk>, <20130711072507.GA21667@dhcp22.suse.cz>, <20130714012641.C2DA4E05@pobox.sk>, <20130714015112.FFCB7AF7@pobox.sk>, <20130715154119.GA32435@dhcp22.suse.cz>, <20130715160006.GB32435@dhcp22.suse.cz>, <20130716153544.GX17812@cmpxchg.org>, <20130716160905.GA20018@dhcp22.suse.cz>, <20130716164830.GZ17812@cmpxchg.org> <20130719042124.GC17812@cmpxchg.org> In-Reply-To: <20130719042124.GC17812@cmpxchg.org> X-Mailer: Centrum Email 5.3 X-Priority: 3 X-Original-From: azurit@pobox.sk MIME-Version: 1.0 Message-Id: <20130719102339.34DF73E5@pobox.sk> X-Maser: brud Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org > CC: linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , righi.andrea@gmail.com >On Tue, Jul 16, 2013 at 12:48:30PM -0400, Johannes Weiner wrote: >> On Tue, Jul 16, 2013 at 06:09:05PM +0200, Michal Hocko wrote: >> > On Tue 16-07-13 11:35:44, Johannes Weiner wrote: >> > > On Mon, Jul 15, 2013 at 06:00:06PM +0200, Michal Hocko wrote: >> > > > On Mon 15-07-13 17:41:19, Michal Hocko wrote: >> > > > > On Sun 14-07-13 01:51:12, azurIt wrote: >> > > > > > > CC: "Johannes Weiner" , linux-kernel@vger.kernel.org, linux-mm@kvack.org, 
"cgroups mailinglist" , "KAMEZAWA Hiroyuki" , righi.andrea@gmail.com >> > > > > > >> CC: "Johannes Weiner" , linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" , "KAMEZAWA Hiroyuki" , righi.andrea@gmail.com >> > > > > > >>On Wed 10-07-13 18:25:06, azurIt wrote: >> > > > > > >>> >> Now i realized that i forgot to remove UID from that cgroup before >> > > > > > >>> >> trying to remove it, so cgroup cannot be removed anyway (we are using >> > > > > > >>> >> third party cgroup called cgroup-uid from Andrea Righi, which is able >> > > > > > >>> >> to associate all user's processes with target cgroup). Look here for >> > > > > > >>> >> cgroup-uid patch: >> > > > > > >>> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch >> > > > > > >>> >> >> > > > > > >>> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was >> > > > > > >>> >> permanently '1'. >> > > > > > >>> > >> > > > > > >>> >This is really strange. Could you post the whole diff against stable >> > > > > > >>> >tree you are using (except for grsecurity stuff and the above cgroup-uid >> > > > > > >>> >patch)? >> > > > > > >>> >> > > > > > >>> >> > > > > > >>> Here are all patches which i applied to kernel 3.2.48 in my last test: >> > > > > > >>> http://watchdog.sk/lkml/patches3/ >> > > > > > >> >> > > > > > >>The two patches from Johannes seem correct. >> > > > > > >> >> > > > > > >>From a quick look even grsecurity patchset shouldn't interfere as it >> > > > > > >>doesn't seem to put any code between handle_mm_fault and mm_fault_error >> > > > > > >>and there also doesn't seem to be any new handle_mm_fault call sites. >> > > > > > >> >> > > > > > >>But I cannot tell there aren't other code paths which would lead to a >> > > > > > >>memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling. 
>> > > > > > > >> > > > > > > >> > > > > > >Michal, >> > > > > > > >> > > > > > >now i can definitely confirm that problem with unremovable cgroups >> > > > > > >persists. What info do you need from me? I applied also your little >> > > > > > >'WARN_ON' patch. >> > > > > > >> > > > > > Ok, i think you want this: >> > > > > > http://watchdog.sk/lkml/kern4.log >> > > > > >> > > > > Jul 14 01:11:39 server01 kernel: [ 593.589087] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name >> > > > > Jul 14 01:11:39 server01 kernel: [ 593.589451] [12021] 1333 12021 172027 64723 4 0 0 apache2 >> > > > > Jul 14 01:11:39 server01 kernel: [ 593.589647] [12030] 1333 12030 172030 64748 2 0 0 apache2 >> > > > > Jul 14 01:11:39 server01 kernel: [ 593.589836] [12031] 1333 12031 172030 64749 3 0 0 apache2 >> > > > > Jul 14 01:11:39 server01 kernel: [ 593.590025] [12032] 1333 12032 170619 63428 3 0 0 apache2 >> > > > > Jul 14 01:11:39 server01 kernel: [ 593.590213] [12033] 1333 12033 167934 60524 2 0 0 apache2 >> > > > > Jul 14 01:11:39 server01 kernel: [ 593.590401] [12034] 1333 12034 170747 63496 4 0 0 apache2 >> > > > > Jul 14 01:11:39 server01 kernel: [ 593.590588] [12035] 1333 12035 169659 62451 1 0 0 apache2 >> > > > > Jul 14 01:11:39 server01 kernel: [ 593.590776] [12036] 1333 12036 167614 60384 3 0 0 apache2 >> > > > > Jul 14 01:11:39 server01 kernel: [ 593.590984] [12037] 1333 12037 166342 58964 3 0 0 apache2 >> > > > > Jul 14 01:11:39 server01 kernel: [ 593.591178] Memory cgroup out of memory: Kill process 12021 (apache2) score 847 or sacrifice child >> > > > > Jul 14 01:11:39 server01 kernel: [ 593.591370] Killed process 12021 (apache2) total-vm:688108kB, anon-rss:255472kB, file-rss:3420kB >> > > > > Jul 14 01:11:41 server01 kernel: [ 595.392920] ------------[ cut here ]------------ >> > > > > Jul 14 01:11:41 server01 kernel: [ 595.393096] WARNING: at kernel/exit.c:888 do_exit+0x7d0/0x870() >> > > > > Jul 14 01:11:41 server01 kernel: [ 595.393256] Hardware name: 
S5000VSA >> > > > > Jul 14 01:11:41 server01 kernel: [ 595.393415] Pid: 12037, comm: apache2 Not tainted 3.2.48-grsec #1 >> > > > > Jul 14 01:11:41 server01 kernel: [ 595.393577] Call Trace: >> > > > > Jul 14 01:11:41 server01 kernel: [ 595.393737] [] warn_slowpath_common+0x7a/0xb0 >> > > > > Jul 14 01:11:41 server01 kernel: [ 595.393903] [] warn_slowpath_null+0x1a/0x20 >> > > > > Jul 14 01:11:41 server01 kernel: [ 595.394068] [] do_exit+0x7d0/0x870 >> > > > > Jul 14 01:11:41 server01 kernel: [ 595.394231] [] ? thread_group_times+0x44/0xb0 >> > > > > Jul 14 01:11:41 server01 kernel: [ 595.394392] [] do_group_exit+0x51/0xc0 >> > > > > Jul 14 01:11:41 server01 kernel: [ 595.394551] [] sys_exit_group+0x17/0x20 >> > > > > Jul 14 01:11:41 server01 kernel: [ 595.394714] [] system_call_fastpath+0x18/0x1d >> > > > > Jul 14 01:11:41 server01 kernel: [ 595.394921] ---[ end trace 738570e688acf099 ]--- >> > > > > >> > > > > OK, so you had an OOM which has been handled by in-kernel oom handler >> > > > > (it killed 12021) and 12037 was in the same group. The warning tells us >> > > > > that it went through mem_cgroup_oom as well (otherwise it wouldn't have >> > > > > memcg_oom.wait_on_memcg set and the warning wouldn't trigger) and then >> > > > > it exited on the userspace request (by exit syscall). >> > > > > >> > > > > I do not see any way how, this could happen though. If mem_cgroup_oom >> > > > > is called then we always return CHARGE_NOMEM which turns into ENOMEM >> > > > > returned by __mem_cgroup_try_charge (invoke_oom must have been set to >> > > > > true). So if nobody screwed the return value on the way up to page >> > > > > fault handler then there is no way to escape. >> > > > > >> > > > > I will check the code. 
>> > > > >> > > > OK, I guess I found it: >> > > > __do_fault >> > > > fault = filemap_fault >> > > > do_async_mmap_readahead >> > > > page_cache_async_readahead >> > > > ondemand_readahead >> > > > __do_page_cache_readahead >> > > > read_pages >> > > > readpages = ext3_readpages >> > > > mpage_readpages # Doesn't propagate ENOMEM >> > > > add_to_page_cache_lru >> > > > add_to_page_cache >> > > > add_to_page_cache_locked >> > > > mem_cgroup_cache_charge >> > > > >> > > > So the read ahead most probably. Again! Duhhh. I will try to think >> > > > about a fix for this. One obvious place is mpage_readpages but >> > > > __do_page_cache_readahead ignores read_pages return value as well and >> > > > page_cache_async_readahead, even worse, is just void and exported as >> > > > such. >> > > > >> > > > So this smells like a hard to fix bugger. One possible, and really ugly >> > > > way would be calling mem_cgroup_oom_synchronize even if handle_mm_fault >> > > > doesn't return VM_FAULT_ERROR, but that is a crude hack. > >I fixed it by disabling the OOM killer altogether for readahead code. >We don't do it globally, we should not do it in the memcg, these are >optional allocations/charges. > >I also disabled it for kernel faults triggered from within a syscall >(copy_*user, get_user_pages), which should just return -ENOMEM as >usual (unless it's nested inside a userspace fault). The only >downside is that we can't get around annotating userspace faults >anymore, so every architecture fault handler now passes >FAULT_FLAG_USER to handle_mm_fault(). Makes the series a little less >self-contained, but it's not unreasonable. > >It's easy to detect leaks now by checking if the memcg OOM context is >setup and we are not returning VM_FAULT_OOM. > >Here is a combined diff based on 3.2. azurIt, any chance you could >give this a shot? I tested it on my local machines, but you have a >known reproducer of fairly unlikely scenarios... I will be out of office between 25.7. and 1.8. 
and I don't want to run anything which can potentially cause an outage of our services. I will test this patch after 2.8. Should I also use the previous patches, or is this one enough? Thank you very much Johannes. azur From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753167Ab3GXUcR (ORCPT ); Wed, 24 Jul 2013 16:32:17 -0400 Received: from zene.cmpxchg.org ([85.214.230.12]:50862 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752173Ab3GXUcO (ORCPT ); Wed, 24 Jul 2013 16:32:14 -0400 Date: Wed, 24 Jul 2013 16:32:05 -0400 From: Johannes Weiner To: Michal Hocko Cc: azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , righi.andrea@gmail.com Subject: Re: [patch 3/5] x86: finish fault error path with fatal signal Message-ID: <20130724203205.GL715@cmpxchg.org> References: <20130711072507.GA21667@dhcp22.suse.cz> <20130714012641.C2DA4E05@pobox.sk> <20130714015112.FFCB7AF7@pobox.sk> <20130715154119.GA32435@dhcp22.suse.cz> <20130715160006.GB32435@dhcp22.suse.cz> <20130716153544.GX17812@cmpxchg.org> <20130716160905.GA20018@dhcp22.suse.cz> <20130716164830.GZ17812@cmpxchg.org> <20130719042124.GC17812@cmpxchg.org> <20130719042502.GF17812@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130719042502.GF17812@cmpxchg.org> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Jul 19, 2013 at 12:25:02AM -0400, Johannes Weiner wrote: > The x86 fault handler bails in the middle of error handling when the > task has been killed. For the next patch this is a problem, because > it relies on pagefault_out_of_memory() being called even when the task > has been killed, to perform proper OOM state unwinding. > > This is a rather minor optimization, just remove it.
> > Signed-off-by: Johannes Weiner > --- > arch/x86/mm/fault.c | 11 ----------- > 1 file changed, 11 deletions(-) > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c > index 1cebabe..90248c9 100644 > --- a/arch/x86/mm/fault.c > +++ b/arch/x86/mm/fault.c > @@ -846,17 +846,6 @@ static noinline int > mm_fault_error(struct pt_regs *regs, unsigned long error_code, > unsigned long address, unsigned int fault) > { > - /* > - * Pagefault was interrupted by SIGKILL. We have no reason to > - * continue pagefault. > - */ > - if (fatal_signal_pending(current)) { > - if (!(fault & VM_FAULT_RETRY)) > - up_read(&current->mm->mmap_sem); > - if (!(error_code & PF_USER)) > - no_context(regs, error_code, address); > - return 1; This is broken but I only hit it now after testing for a while. The patch has the right idea: in case of an OOM kill, we should continue the fault and not abort. What I missed is that in case of a kill during lock_page, i.e. VM_FAULT_RETRY && fatal_signal, we have to exit the fault and not do up_read() etc. This introduced a locking imbalance that would get everybody hung on mmap_sem. I moved the retry handling outside of mm_fault_error() (come on...) and stole some documentation from arm. It's now a little bit more explicit and comparable to other architectures. I'll send an updated series, patch for reference: --- From: Johannes Weiner Subject: [patch] x86: finish fault error path with fatal signal The x86 fault handler bails in the middle of error handling when the task has been killed. For the next patch this is a problem, because it relies on pagefault_out_of_memory() being called even when the task has been killed, to perform proper OOM state unwinding. This is a rather minor optimization that cuts short the fault handling by a few instructions in rare cases. Just remove it.
Signed-off-by: Johannes Weiner --- arch/x86/mm/fault.c | 33 +++++++++++++-------------------- 1 file changed, 13 insertions(+), 20 deletions(-) diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 6d77c38..0c18beb 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -842,31 +842,17 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address, force_sig_info_fault(SIGBUS, code, address, tsk, fault); } -static noinline int +static noinline void mm_fault_error(struct pt_regs *regs, unsigned long error_code, unsigned long address, unsigned int fault) { - /* - * Pagefault was interrupted by SIGKILL. We have no reason to - * continue pagefault. - */ - if (fatal_signal_pending(current)) { - if (!(fault & VM_FAULT_RETRY)) - up_read(&current->mm->mmap_sem); - if (!(error_code & PF_USER)) - no_context(regs, error_code, address, 0, 0); - return 1; - } - if (!(fault & VM_FAULT_ERROR)) - return 0; - if (fault & VM_FAULT_OOM) { /* Kernel mode? Handle exceptions or die: */ if (!(error_code & PF_USER)) { up_read(&current->mm->mmap_sem); no_context(regs, error_code, address, SIGSEGV, SEGV_MAPERR); - return 1; + return; } up_read(&current->mm->mmap_sem); @@ -884,7 +870,6 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code, else BUG(); } - return 1; } static int spurious_fault_check(unsigned long error_code, pte_t *pte) @@ -1189,9 +1174,17 @@ good_area: */ fault = handle_mm_fault(mm, vma, address, flags); - if (unlikely(fault & (VM_FAULT_RETRY|VM_FAULT_ERROR))) { - if (mm_fault_error(regs, error_code, address, fault)) - return; + /* + * If we need to retry but a fatal signal is pending, handle the + * signal first. We do not need to release the mmap_sem because it + * would already be released in __lock_page_or_retry in mm/filemap.c.
+ */ + if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) + return; + + if (unlikely(fault & VM_FAULT_ERROR)) { + mm_fault_error(regs, error_code, address, fault); + return; } /* -- 1.8.3.2 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753642Ab3GYU3A (ORCPT ); Thu, 25 Jul 2013 16:29:00 -0400 Received: from mail-gh0-f182.google.com ([209.85.160.182]:45791 "EHLO mail-gh0-f182.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752637Ab3GYU2z (ORCPT ); Thu, 25 Jul 2013 16:28:55 -0400 Message-ID: <51F18A99.7000306@gmail.com> Date: Thu, 25 Jul 2013 16:29:13 -0400 From: KOSAKI Motohiro User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:17.0) Gecko/20130620 Thunderbird/17.0.7 MIME-Version: 1.0 To: Johannes Weiner CC: Michal Hocko , azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , righi.andrea@gmail.com, kosaki.motohiro@gmail.com Subject: Re: [patch 3/5] x86: finish fault error path with fatal signal References: <20130711072507.GA21667@dhcp22.suse.cz> <20130714012641.C2DA4E05@pobox.sk> <20130714015112.FFCB7AF7@pobox.sk> <20130715154119.GA32435@dhcp22.suse.cz> <20130715160006.GB32435@dhcp22.suse.cz> <20130716153544.GX17812@cmpxchg.org> <20130716160905.GA20018@dhcp22.suse.cz> <20130716164830.GZ17812@cmpxchg.org> <20130719042124.GC17812@cmpxchg.org> <20130719042502.GF17812@cmpxchg.org> <20130724203205.GL715@cmpxchg.org> In-Reply-To: <20130724203205.GL715@cmpxchg.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org (7/24/13 4:32 PM), Johannes Weiner wrote: > On Fri, Jul 19, 2013 at 12:25:02AM -0400, Johannes Weiner wrote: >> The x86 fault handler bails in the middle of error handling when the >> task has been killed. 
For the next patch this is a problem, because >> it relies on pagefault_out_of_memory() being called even when the task >> has been killed, to perform proper OOM state unwinding. >> >> This is a rather minor optimization, just remove it. >> >> Signed-off-by: Johannes Weiner >> --- >> arch/x86/mm/fault.c | 11 ----------- >> 1 file changed, 11 deletions(-) >> >> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c >> index 1cebabe..90248c9 100644 >> --- a/arch/x86/mm/fault.c >> +++ b/arch/x86/mm/fault.c >> @@ -846,17 +846,6 @@ static noinline int >> mm_fault_error(struct pt_regs *regs, unsigned long error_code, >> unsigned long address, unsigned int fault) >> { >> - /* >> - * Pagefault was interrupted by SIGKILL. We have no reason to >> - * continue pagefault. >> - */ >> - if (fatal_signal_pending(current)) { >> - if (!(fault & VM_FAULT_RETRY)) >> - up_read(&current->mm->mmap_sem); >> - if (!(error_code & PF_USER)) >> - no_context(regs, error_code, address); >> - return 1; > > This is broken but I only hit it now after testing for a while. > > The patch has the right idea: in case of an OOM kill, we should > continue the fault and not abort. What I missed is that in case of a > kill during lock_page, i.e. VM_FAULT_RETRY && fatal_signal, we have to > exit the fault and not do up_read() etc. This introduced a locking > imbalance that would get everybody hung on mmap_sem. > > I moved the retry handling outside of mm_fault_error() (come on...) > and stole some documentation from arm. It's now a little bit more > explicit and comparable to other architectures. > > I'll send an updated series, patch for reference: > > --- > From: Johannes Weiner > Subject: [patch] x86: finish fault error path with fatal signal > > The x86 fault handler bails in the middle of error handling when the > task has been killed.
For the next patch this is a problem, because > it relies on pagefault_out_of_memory() being called even when the task > has been killed, to perform proper OOM state unwinding. > > This is a rather minor optimization that cuts short the fault handling > by a few instructions in rare cases. Just remove it. > > Signed-off-by: Johannes Weiner > --- > arch/x86/mm/fault.c | 33 +++++++++++++-------------------- > 1 file changed, 13 insertions(+), 20 deletions(-) > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c > index 6d77c38..0c18beb 100644 > --- a/arch/x86/mm/fault.c > +++ b/arch/x86/mm/fault.c > @@ -842,31 +842,17 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address, > force_sig_info_fault(SIGBUS, code, address, tsk, fault); > } > > -static noinline int > +static noinline void > mm_fault_error(struct pt_regs *regs, unsigned long error_code, > unsigned long address, unsigned int fault) > { > - /* > - * Pagefault was interrupted by SIGKILL. We have no reason to > - * continue pagefault. > - */ > - if (fatal_signal_pending(current)) { > - if (!(fault & VM_FAULT_RETRY)) > - up_read(&current->mm->mmap_sem); > - if (!(error_code & PF_USER)) > - no_context(regs, error_code, address, 0, 0); > - return 1; > - } > - if (!(fault & VM_FAULT_ERROR)) > - return 0; > - > if (fault & VM_FAULT_OOM) { > /* Kernel mode?
Handle exceptions or die: */ > if (!(error_code & PF_USER)) { > up_read(&current->mm->mmap_sem); > no_context(regs, error_code, address, > SIGSEGV, SEGV_MAPERR); > - return 1; > + return; > } > > up_read(&current->mm->mmap_sem); > @@ -884,7 +870,6 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code, > else > BUG(); > } > - return 1; > } > > static int spurious_fault_check(unsigned long error_code, pte_t *pte) > @@ -1189,9 +1174,17 @@ good_area: > */ > fault = handle_mm_fault(mm, vma, address, flags); > > - if (unlikely(fault & (VM_FAULT_RETRY|VM_FAULT_ERROR))) { > - if (mm_fault_error(regs, error_code, address, fault)) > - return; > + /* > + * If we need to retry but a fatal signal is pending, handle the > + * signal first. We do not need to release the mmap_sem because it > + * would already be released in __lock_page_or_retry in mm/filemap.c. > + */ > + if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) > + return; > + > + if (unlikely(fault & VM_FAULT_ERROR)) { > + mm_fault_error(regs, error_code, address, fault); > + return; > } When I made the patch you removed code, Ingo suggested we need put all rare case code into if(unlikely()) block. Yes, this is purely micro optimization. But it is not costly to maintain.
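[Editorial note] The suggestion above, keeping every rare case behind a single unlikely() test, could look roughly like the userspace sketch below. It is only an illustration under stated assumptions: the toy_ names, the flag values, and the toy_classify() helper are hypothetical and not part of any posted patch, and the hint macros are stand-ins built on the GCC/Clang __builtin_expect builtin rather than the kernel's own definitions.

```c
/* Userspace stand-ins for the kernel's branch hints (GCC/Clang builtin). */
#define toy_likely(x)   __builtin_expect(!!(x), 1)
#define toy_unlikely(x) __builtin_expect(!!(x), 0)

/* Toy flag values; the real VM_FAULT_* constants live in kernel headers. */
#define TOY_FAULT_RETRY 0x1u
#define TOY_FAULT_ERROR 0x2u

/* What the caller should do after the (hypothetical) fault attempt. */
enum toy_action { TOY_DONE, TOY_BAIL_SIGNAL, TOY_HANDLE_ERROR };

enum toy_action toy_classify(unsigned int fault, int fatal_signal)
{
	/* One cheap test on the hot path; every rare case lives inside it. */
	if (toy_unlikely(fault & (TOY_FAULT_RETRY | TOY_FAULT_ERROR))) {
		/* Retry requested while a fatal signal is pending: leave. */
		if ((fault & TOY_FAULT_RETRY) && fatal_signal)
			return TOY_BAIL_SIGNAL;
		/* Genuine error: hand over to the error path. */
		if (fault & TOY_FAULT_ERROR)
			return TOY_HANDLE_ERROR;
	}
	/* Common case, or an ordinary retry handled further down. */
	return TOY_DONE;
}
```

The hint does not change what the function returns; it only tells the compiler which branch to lay out as the fall-through path, which is why it is described above as a micro-optimization.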
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756538Ab3GYVuo (ORCPT ); Thu, 25 Jul 2013 17:50:44 -0400 Received: from zene.cmpxchg.org ([85.214.230.12]:50932 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754601Ab3GYVum (ORCPT ); Thu, 25 Jul 2013 17:50:42 -0400 Date: Thu, 25 Jul 2013 17:50:33 -0400 From: Johannes Weiner To: KOSAKI Motohiro Cc: Michal Hocko , azurIt , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups mailinglist , KAMEZAWA Hiroyuki , righi.andrea@gmail.com Subject: Re: [patch 3/5] x86: finish fault error path with fatal signal Message-ID: <20130725215033.GP715@cmpxchg.org> References: <20130714015112.FFCB7AF7@pobox.sk> <20130715154119.GA32435@dhcp22.suse.cz> <20130715160006.GB32435@dhcp22.suse.cz> <20130716153544.GX17812@cmpxchg.org> <20130716160905.GA20018@dhcp22.suse.cz> <20130716164830.GZ17812@cmpxchg.org> <20130719042124.GC17812@cmpxchg.org> <20130719042502.GF17812@cmpxchg.org> <20130724203205.GL715@cmpxchg.org> <51F18A99.7000306@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <51F18A99.7000306@gmail.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Jul 25, 2013 at 04:29:13PM -0400, KOSAKI Motohiro wrote: > (7/24/13 4:32 PM), Johannes Weiner wrote: > >@@ -1189,9 +1174,17 @@ good_area: > > */ > > fault = handle_mm_fault(mm, vma, address, flags); > > > >- if (unlikely(fault & (VM_FAULT_RETRY|VM_FAULT_ERROR))) { > >- if (mm_fault_error(regs, error_code, address, fault)) > >- return; > >+ /* > >+ * If we need to retry but a fatal signal is pending, handle the > >+ * signal first. We do not need to release the mmap_sem because it > >+ * would already be released in __lock_page_or_retry in mm/filemap.c. 
> >+ */ > >+ if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) > >+ return; > >+ > >+ if (unlikely(fault & VM_FAULT_ERROR)) { > >+ mm_fault_error(regs, error_code, address, fault); > >+ return; > > } > > When I made the patch you removed code, Ingo suggested we need put all rare case code > into if(unlikely()) block. Yes, this is purely micro optimization. But it is not costly > to maintain. Fair enough, thanks for the heads up!
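[Editorial note] The locking concern raised earlier in this thread (VM_FAULT_RETRY with a fatal signal pending, where mmap_sem has already been dropped) can be sketched with a toy userspace model. Everything here is hypothetical: the toy_ names, the flag values, and the hold counter are illustrations, and the variant without the early check is a deliberately wrong caller included only to show the imbalance, not a reconstruction of the first version of the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy flag values; the real VM_FAULT_* constants live in kernel headers. */
#define TOY_FAULT_RETRY 0x1u
#define TOY_FAULT_OOM   0x2u

/* Toy stand-in for mmap_sem: read holds owned by the faulting task. */
struct toy_mm {
	int read_holds;
};

static void toy_down_read(struct toy_mm *mm)
{
	assert(mm->read_holds >= 0);	/* never take a lock in a broken state */
	mm->read_holds++;
}

static void toy_up_read(struct toy_mm *mm)
{
	mm->read_holds--;
}

/* Toy fault attempt: reporting RETRY means the lock was already dropped. */
static unsigned int toy_handle_mm_fault(struct toy_mm *mm, unsigned int outcome)
{
	if (outcome & TOY_FAULT_RETRY)
		toy_up_read(mm);
	return outcome;
}

/* Returns the read holds left over at exit; 0 means balanced. */
int toy_do_page_fault(unsigned int outcome, bool fatal_signal,
		      bool check_retry_first)
{
	struct toy_mm mm = { 0 };
	unsigned int fault;

	toy_down_read(&mm);
	fault = toy_handle_mm_fault(&mm, outcome);

	if (fault & TOY_FAULT_RETRY) {
		if (fatal_signal) {
			if (check_retry_first)
				return mm.read_holds;	/* already unlocked: just leave */
			toy_up_read(&mm);	/* wrong: releases a hold we no longer own */
			return mm.read_holds;
		}
		/* Ordinary retry: relock and assume the second attempt succeeds. */
		toy_down_read(&mm);
		fault = toy_handle_mm_fault(&mm, 0);
	}

	/* Success and error paths both release the lock exactly once. */
	toy_up_read(&mm);
	return mm.read_holds;
}
```

In this model the early return leaves the counter at zero, while the caller that unlocks again ends at minus one, which is the kind of imbalance the thread describes as getting everybody hung on mmap_sem.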