From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 3 Oct 2017 15:08:41 +0100
From: Roman Gushchin
To: Michal Hocko
CC: Vladimir Davydov, Johannes Weiner, Tetsuo Handa, David Rientjes,
	Andrew Morton, Tejun Heo
Subject: Re: [v9 3/5] mm, oom: cgroup-aware OOM killer
Message-ID: <20171003140841.GA29624@castle.DHCP.thefacebook.com>
References: <20170927130936.8601-1-guro@fb.com>
 <20170927130936.8601-4-guro@fb.com>
 <20171003114848.gstdawonla2gmfio@dhcp22.suse.cz>
 <20171003123721.GA27919@castle.dhcp.TheFacebook.com>
 <20171003133623.hoskmd3fsh4t2phf@dhcp22.suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Disposition: inline
In-Reply-To: <20171003133623.hoskmd3fsh4t2phf@dhcp22.suse.cz>
User-Agent: Mutt/1.9.0 (2017-09-02)
Sender: linux-kernel-owner@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Oct 03, 2017 at 03:36:23PM +0200, Michal Hocko wrote:
> On Tue 03-10-17 13:37:21, Roman Gushchin wrote:
> > On Tue, Oct 03, 2017 at 01:48:48PM +0200, Michal Hocko wrote:
> [...]
> > > Wrt. the implicit inheritance you brought up in a separate email
> > > thread [1]. Let me quote:
> > > : after some additional thinking I don't think anymore that implicit
> > > : propagation of oom_group is a good idea. Let me explain: assume we
> > > : have memcg A with memory.max and memory.oom_group set, and nested
> > > : memcg A/B with memory.max set. Let's imagine we have an OOM event in
> > > : A/B. What is the expected system behavior?
> > > : We have an OOM scoped to A/B, and any action should also be scoped to A/B.
> > > : We really shouldn't touch processes which do not belong to A/B.
> > > : That means we should either kill the biggest process in A/B, or all
> > > : processes in A/B. It's natural to make A/B/memory.oom_group responsible
> > > : for this decision. It's strange to make it depend on A/memory.oom_group, IMO.
> > > : It really makes no sense, and makes the oom_group knob really hard to describe.
> > > :
> > > : Also, after some off-list discussion, we've realized that memory.oom_group
> > > : should be delegatable. The workload should have control over it to express
> > > : dependencies between processes.
> > >
> > > OK, I have asked about this already but I am not sure the answer was
> > > very explicit. So let me ask again. When exactly would a subtree
> > > disagree with the parent on oom_group? In other words, when do we want a
> > > different cleanup based on the OOM root? I am not saying this is wrong,
> > > I am just curious about a practical example.
> > Well, I do not have a practical example right now, but it's against the logic.
> > Any OOM event has a scope, and the oom_group knob is applied to OOM events
> > scoped to the cgroup or any of its ancestors (including the system as a whole).
> > So, applying it implicitly to OOMs scoped to descendant cgroups makes no sense.
> > It's a strange configuration limitation, and I do not see any benefits:
> > it doesn't provide any new functionality or guarantees.
> 
> Well, I guess I agree. I was merely interested in the consequences when
> the OOM behavior differs depending on which layer it happens at. Does
> it make sense to clean up the whole hierarchy while any subtree would
> kill a single task if the OOM happened there?

By setting or not setting the oom_group knob a user is expressing the
readiness to handle the OOM by itself, e.g. by looking at cgroup events,
restarting killed tasks, etc. If a workload is complex and has some
sub-parts with their own memory constraints, it's quite possible that
it's ready to restart these parts, but not to handle a random process
being killed by the global OOM.

This is actually a proper replacement for setting oom_score_adj:
let's say there is a memcg A, which contains some control stuff in A/C,
and several sub-workloads A/W1, A/W2, etc.
In case of a global OOM, caused by system misconfiguration or, say,
a memory leak in the control stuff, it makes perfect sense to kill A
as a whole, so we can set A/memory.oom_group to 1.
But if there is a memory shortage in one of the workers (A/W1, for instance),
it's quite possible that killing everything is excessive.
So, a user has the freedom to decide what the proper way to handle an OOM is.
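To make this concrete, here is a minimal userspace sketch. It is not part
of the patchset: the write_knob() helper and the cgroup paths are made up
for illustration, and it assumes cgroup v2 is mounted at /sys/fs/cgroup
with A, A/C, A/W1 and A/W2 already created. Only A has oom_group set, so a
global OOM takes the whole job down, while an OOM scoped to a single worker
keeps the default largest-task behavior:

#include <stdio.h>

/* write a single value into a cgroup control file */
static int write_knob(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fprintf(f, "%s\n", val);
	return fclose(f);
}

int main(void)
{
	/* kill all of A together on an OOM scoped to A or above */
	write_knob("/sys/fs/cgroup/A/memory.oom_group", "1");

	/* the workers keep the default: only the biggest task is killed
	 * (0 is the default, written here just to make it explicit)
	 */
	write_knob("/sys/fs/cgroup/A/W1/memory.oom_group", "0");
	write_knob("/sys/fs/cgroup/A/W2/memory.oom_group", "0");
	return 0;
}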
> 
> > Even if we don't have practical examples, we should build something less
> > surprising for a user, and I don't understand why oom_group should be inherited.
> 
> I guess we want to inherit the value on memcg creation, but I agree
> that enforcing the parent setting is weird. I will think about it some more,
> but I agree that it is saner to only enforce the per-memcg value.

I'm not against it, but we should come up with a good explanation of why
we're inheriting it, or not inherit it at all.

> 
> > > > Tasks with oom_score_adj set to -1000 are considered as unkillable.
> > > >
> > > > The root cgroup is treated as a leaf memory cgroup, so its score
> > > > is compared with other leaf and oom_group memory cgroups.
> > > > The oom_group option is not supported for the root cgroup.
> > > > Due to the memcg statistics implementation, a special algorithm
> > > > is used for estimating the root cgroup oom_score: we define it
> > > > as the maximum oom_score of the belonging tasks.
> > > [1] http://lkml.kernel.org/r/20171002124712.GA17638@castle.DHCP.thefacebook.com
> > > 
> > > [...]
> > > > +static long memcg_oom_badness(struct mem_cgroup *memcg,
> > > > +			      const nodemask_t *nodemask,
> > > > +			      unsigned long totalpages)
> > > > +{
> > > > +	long points = 0;
> > > > +	int nid;
> > > > +	pg_data_t *pgdat;
> > > > +
> > > > +	/*
> > > > +	 * We don't have necessary stats for the root memcg,
> > > > +	 * so we define its oom_score as the maximum oom_score
> > > > +	 * of the belonging tasks.
> > > > +	 */
> > > Why not a sum of all tasks which would more resemble what we do for
> > > other memcgs? Sure, this would require ignoring oom_score_adj so
> > > oom_badness would have to be tweaked a bit (basically split it into
> > > __oom_badness which calculates the value without the bias and
> > > oom_badness on top adding the bias on top of the scaled value).
> > We've discussed it already: calculating the sum is tricky, as tasks
> > are sharing memory (and the mm struct). As I remember, you suggested
> > using the maximum to solve exactly this problem, and I think it's a good
> > approximation. Assuming that tasks in the root cgroup likely have
> > nothing in common, and we don't support oom_group for it, looking
> > at the biggest task makes perfect sense: we're exactly comparing
> > killable entities.
> Please add a comment explaining that. I hope we can make the root memcg less
> special eventually. It shouldn't be all that hard. We already have per-LRU
> numbers and we only use a few counters which could be accounted to the
> root memcg as well. Counters should be quite cheap.

Sure, this is my hope too.
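Just to illustrate the idea (this is only a sketch, not the patch code, and
it assumes the oom_badness() signature currently in the tree): the root
memcg score would be the largest per-task badness rather than a sum, because
root tasks may share nothing, and the biggest task is the killable entity we
are actually comparing. A real version would also restrict the walk to tasks
belonging to the root cgroup:

#include <linux/oom.h>
#include <linux/sched/signal.h>
#include <linux/rcupdate.h>

/* sketch only: score the root memcg as the maximum badness of its tasks */
static long root_memcg_oom_badness(const nodemask_t *nodemask,
				   unsigned long totalpages)
{
	struct task_struct *p;
	unsigned long points, max_points = 0;

	rcu_read_lock();
	for_each_process(p) {
		/* simplified: a real version would only look at tasks
		 * belonging to the root memory cgroup
		 */
		points = oom_badness(p, NULL, nodemask, totalpages);
		if (points > max_points)
			max_points = points;
	}
	rcu_read_unlock();

	return max_points;
}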
> 
> [...]
> > > > @@ -962,6 +968,48 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> > > >  	__oom_kill_process(victim);
> > > >  }
> > > >  
> > > > +static int oom_kill_memcg_member(struct task_struct *task, void *unused)
> > > > +{
> > > > +	if (!tsk_is_oom_victim(task)) {
> > > 
> > > How can this happen?
> > We do start with killing the largest process, and then iterate over all tasks
> > in the cgroup. So, this check is required to avoid killing tasks which are
> > already in the termination process.
> 
> Do you mean we have tsk_is_oom_victim && MMF_OOM_SKIP == T?

No, just tsk_is_oom_victim. We're killing the biggest task, and then
_all_ tasks. This is a way to skip the biggest task and not kill it again.

> 
> > > 
> > > > +		get_task_struct(task);
> > > > +		__oom_kill_process(task);
> > > > +	}
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +static bool oom_kill_memcg_victim(struct oom_control *oc)
> > > > +{
> > > > +	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
> > > > +				      DEFAULT_RATELIMIT_BURST);
> > > > +
> > > > +	if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM)
> > > > +		return oc->chosen_memcg;
> > > > +
> > > > +	/* Always begin with the task with the biggest memory footprint */
> > > > +	oc->chosen_points = 0;
> > > > +	oc->chosen_task = NULL;
> > > > +	mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
> > > > +
> > > > +	if (oc->chosen_task == NULL || oc->chosen_task == INFLIGHT_VICTIM)
> > > > +		goto out;
> > > > +
> > > > +	if (__ratelimit(&oom_rs))
> > > > +		dump_header(oc, oc->chosen_task);
> > > Hmm, does the full dump_header really apply for the new heuristic? E.g.
> > > does it make sense to dump_tasks()? Would it make sense to print stats
> > > of all eligible memcgs instead?
> > Hm, this is a tricky part: the dmesg output is at some point a part of the ABI,
> 
> People are parsing oom reports but I disagree this is an ABI of any
> sort. The report is closely tied to the particular implementation and
> as such it has changed several times over time.
> 
> > but is also closely connected with the implementation. So I would suggest
> > postponing this until we get more usage examples and better understand
> > what information we need.
> 
> I would drop the tasks list at least because that is clearly misleading in
> this context, because we are not selecting from all tasks. We are
> selecting between memcgs. The memcg information can be added in a
> separate patch of course.

Let's postpone it until we land the rest of the patchset.

Thank you!
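P.S. For anyone following along, here is a very rough sketch of the intended
kill sequence when memory.oom_group is set on the victim memcg. This is not
the actual patch code: the function name is made up, the NULL/INFLIGHT and
oom_group checks are omitted, and only helpers already quoted above are used.
It just spells out the ordering discussed earlier, biggest task first, then
everyone else, with the already-killed victim skipped via tsk_is_oom_victim():

static void oom_kill_memcg_group(struct oom_control *oc)
{
	/* pick the task with the biggest memory footprint in the memcg */
	oc->chosen_points = 0;
	oc->chosen_task = NULL;
	mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);

	/* kill it first; it is marked as an OOM victim from now on */
	get_task_struct(oc->chosen_task);
	__oom_kill_process(oc->chosen_task);

	/*
	 * Then kill everything else in the memcg.  oom_kill_memcg_member()
	 * skips tasks that already are OOM victims, which is how the task
	 * killed above avoids being killed a second time.
	 */
	mem_cgroup_scan_tasks(oc->chosen_memcg, oom_kill_memcg_member, NULL);
}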