From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Wed, 11 Oct 2017 22:49:27 +0100
From: Roman Gushchin
To: David Rientjes
Cc: Michal Hocko, Vladimir Davydov, Johannes Weiner, Tetsuo Handa, Andrew Morton, Tejun Heo
Subject: Re: [v11 3/6] mm, oom: cgroup-aware OOM killer
Message-ID: <20171011214927.GA28741@castle>
References: <20171005130454.5590-1-guro@fb.com> <20171005130454.5590-4-guro@fb.com> <20171010122306.GA11653@castle.DHCP.thefacebook.com> <20171010220417.GA8667@castle>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
User-Agent: Mutt/1.9.1 (2017-09-22)
Sender: linux-kernel-owner@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Oct 11, 2017 at 01:21:47PM -0700, David Rientjes wrote:
> On Tue, 10 Oct 2017, Roman Gushchin wrote:
>
> > > We don't need a better approximation, we need a fair comparison. The
> > > heuristic that this patchset is implementing is based on the usage of
> > > individual mem cgroups.
> > > For the root mem cgroup to be considered eligible, we need to
> > > understand its usage. That usage is _not_ what is implemented by this
> > > patchset, which is the largest rss of a single attached process. This,
> > > in fact, is not an "approximation" at all. In the example of 10000
> > > processes attached with 80MB rss each, the usage of the root mem
> > > cgroup is _not_ 80MB.
> >
> > It's hard to imagine a "healthy" setup with 10000 processes in the root
> > memory cgroup, and even if we kill 1 process we will still have 9999
> > remaining processes. I agree with you to some extent, but it's not
> > a real-world example.
>
> It's an example that illustrates the problem with the unfair comparison
> between the root mem cgroup and leaf mem cgroups. It's unfair to compare
> [largest rss of a single process attached to a cgroup] to
> [anon + unevictable + unreclaimable slab usage of a cgroup]. It's not an
> approximation, as previously stated: the usage of the root mem cgroup is
> not 100MB if there are 10 such processes attached to the root mem cgroup,
> it's off by orders of magnitude.
>
> For the root mem cgroup to be treated the same as a leaf mem cgroup, as
> this patchset proposes, it must have a fair comparison. That can be done
> by accounting memory to the root mem cgroup in the same way it is to
> leaf mem cgroups.
>
> But let's move the discussion forward to fix it. To avoid necessarily
> accounting memory to the root mem cgroup, have we considered whether it
> is even necessary to address the root mem cgroup? For the users who
> opt in to this heuristic, would it be possible to discount the root mem
> cgroup from the heuristic entirely, so that oom kills originate from
> leaf mem cgroups? Or, perhaps better, oom kill from
> non-memory.oom_group cgroups only if the victim rss is greater than an
> eligible victim rss attached to the root mem cgroup?
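[Editor's note: to make the arithmetic of the objection above concrete, here is a hedged userspace sketch, plain Python rather than kernel code, using only the numbers from the 10000-process example in the quoted text.]

```python
# Illustration only, not kernel code: the 10000-process example above.
procs_rss_mb = [80] * 10000            # 10000 processes, 80MB rss each

patchset_metric = max(procs_rss_mb)    # largest rss of a single attached process
actual_usage = sum(procs_rss_mb)       # what the root mem cgroup really uses

print(patchset_metric)   # 80 (MB)
print(actual_usage)      # 800000 (MB) -- off by four orders of magnitude
```

The metric the patchset uses stays flat at 80MB no matter how many processes are attached, while the real usage grows linearly, which is the unfairness being argued.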
David,

I'm not claiming to have implemented the best possible accounting for the
root memory cgroup, and I'm sure there is room for further enhancement.
But unless it leads to some obviously stupid victim selection (like
ignoring a leaking task that consumes most of the memory), I don't see
why it should be treated as a blocker for the whole patchset. I also
doubt that any of us has such examples, and the best way to get them is
to get some real usage feedback. Ignoring oom_score_adj, subtracting the
sum of leaf usages from system usage, etc.: these are all good ideas
which can be implemented on top of this patchset.

> > > For these reasons: unfair comparison of root mem cgroup usage to bias
> > > against that mem cgroup from oom kill in system oom conditions, the
> > > ability of users to completely evade the oom killer by attaching all
> > > processes to child cgroups either purposefully or unpurposefully, and
> > > the inability of userspace to effectively control oom victim
> > > selection:
> > >
> > > Nacked-by: David Rientjes
> >
> > So, if we sum the oom_score of tasks belonging to the root memory
> > cgroup, will it fix the problem?
> >
> > It might have some drawbacks as well (especially around oom_score_adj),
> > but it's doable, if we ignore tasks which are not the owners of their
> > mm_struct.
>
> You would be required to discount oom_score_adj because the heuristic
> doesn't account for oom_score_adj when comparing the anon + unevictable +
> unreclaimable slab of leaf mem cgroups. This wouldn't result in the
> correct victim selection in real-world scenarios where processes attached
> to the root mem cgroup are vital to the system and not part of any user
> job, i.e. they are important system daemons and the "activity manager"
> responsible for orchestrating the cgroup hierarchy.
>
> It's also still unfair because it now compares
> [sum of rss of processes attached to a cgroup] to
> [anon + unevictable + unreclaimable slab usage of a cgroup].
> RSS isn't going to be a solution, regardless of whether it's one process
> or all processes, if it's being compared to more types of memory in leaf
> cgroups.
>
> If we really don't want root mem cgroup accounting so this is a fair
> comparison, I think the heuristic needs to special-case the root mem
> cgroup, either by discounting root oom kills if there are eligible oom
> kills from leaf cgroups (the user would be opting in to this behavior),
> or by comparing the badness of a victim from a leaf cgroup to the
> badness of a victim from the root cgroup when deciding which to kill,
> and allowing the user to protect root mem cgroup processes with
> oom_score_adj.
>
> That aside, all of this has only addressed one of the three concerns
> with the patchset.
>
> I believe the solution to avoid allowing users to circumvent oom kill is
> to account usage up the hierarchy as you have done in the past. Cgroup
> hierarchies can be managed by the user so they can create their own
> subcontainers; this is nothing new, and I would hope that you wouldn't
> limit your feature to only a very specific set of use cases. That may be
> your solution for the root mem cgroup itself: if the hierarchical usage
> of all top-level mem cgroups is known, it's possible to find the root
> mem cgroup usage by subtraction, since you are using stats that are
> global vmstats in your heuristic.
>
> Accounting usage up the hierarchy avoids the first two concerns with the
> patchset. It allows you to implicitly understand the usage of the root
> mem cgroup itself, and does not allow users to circumvent oom kill by
> creating subcontainers, either purposefully or not. The third concern,
> userspace influence, can allow users to attack leaf mem cgroups deeper
> in the tree if one is using more memory than expected but the
> hierarchical usage is lower at the top level.
> That is the only objection that I have seen to using hierarchical usage:
> there may be a single cgroup deeper in the tree that avoids oom kill
> because another hierarchy has a higher usage. This can trivially be
> addressed either by oom priorities or an adjustment, just like
> oom_score_adj, on cgroup usage.

As I've said, I don't see how the exact implementation of root memory
cgroup accounting can be considered a blocker for the whole feature.
The same is true for oom priorities: it's something that can and should
be implemented on top of the basic semantics introduced by this patchset.

So, the only real question is how we find a victim memcg in the subtree:
by performing an independent election on each level, or by searching
tree-wide. We have all had many discussions around this, and as you
remember, I initially supported the first option. But then Michal
provided a very strong argument: if you have 3 similar workloads in A, B
and C, but for non-memory-related reasons (e.g. cpu time sharing) you
have to join A and B into a group D:

   / \
  D   C
 / \
A   B

it's strange to penalize A and B for it. It looks to me that you're
talking about a similar case, but you consider this hierarchy useful.
So, overall, it seems to depend on the exact configuration.

I have to add that if you can enable memory.oom_group, your problem
doesn't exist.

The selected approach is easily extendable in the hierarchical direction:
as I've said before, we can introduce a new value of memory.oom_group
which will enable cumulative accounting without mass killing. And, tbh,
I don't see how oom priorities would resolve the opposite problem if we
took the hierarchical approach.

Thanks!
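[Editor's note: to illustrate the difference between the two victim-selection strategies debated above, here is a hedged userspace Python sketch over the D(A, B) / C hierarchy; the usage numbers are invented for illustration and this is not kernel code.]

```python
# Invented numbers: three similar workloads, with A and B joined under D
# for non-memory-related reasons (e.g. cpu time sharing).
usage_mb = {"A": 100, "B": 100, "C": 150}
children = {"root": ["D", "C"], "D": ["A", "B"]}

def cumulative(cg):
    """Hierarchical usage: a cgroup's own leaves summed up."""
    if cg in usage_mb:
        return usage_mb[cg]
    return sum(cumulative(c) for c in children[cg])

# Tree-wide search (the selected approach): compare leaf cgroups
# directly, so A and B are not penalized for sharing the group D.
tree_wide_victim = max(usage_mb, key=usage_mb.get)

# Per-level election (the hierarchical approach): D's cumulative usage
# (200MB) beats C (150MB), so the search descends into D and kills A or
# B, even though each uses less memory than C.
level1_winner = max(children["root"], key=cumulative)

print(tree_wide_victim)  # C
print(level1_winner)     # D -> then recurse into D, kill A or B
```

This is exactly Michal's argument: under per-level election, joining A and B into D penalizes them, while the tree-wide search picks C, the largest individual consumer.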