From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 2F732C433EF for ; Fri, 13 May 2022 20:53:23 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 5CA9C8D0006; Fri, 13 May 2022 16:53:22 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 578788D0002; Fri, 13 May 2022 16:53:22 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 467D78D0006; Fri, 13 May 2022 16:53:22 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0012.hostedemail.com [216.40.44.12]) by kanga.kvack.org (Postfix) with ESMTP id 391598D0002 for ; Fri, 13 May 2022 16:53:22 -0400 (EDT) Received: from smtpin19.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay11.hostedemail.com (Postfix) with ESMTP id 14019803E4 for ; Fri, 13 May 2022 20:53:22 +0000 (UTC) X-FDA: 79461920244.19.1CF2820 Received: from vulcan.natalenko.name (vulcan.natalenko.name [104.207.131.136]) by imf07.hostedemail.com (Postfix) with ESMTP id 6E30F400AA for ; Fri, 13 May 2022 20:53:14 +0000 (UTC) Received: from spock.localnet (unknown [83.148.33.151]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by vulcan.natalenko.name (Postfix) with ESMTPSA id 794FDEDEB30; Fri, 13 May 2022 22:53:18 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=natalenko.name; s=dkim-20170712; t=1652475198; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=or0NKeUsbE01vgrU5fNAJITH1vwKDIErppkCvQpSrPA=; b=LbulMFpXvocMvhpK93KvRsv+Q2NABXhQtu/RLQ284R+Mg5gUH5urrWVmpOXOV18v+LsEhG HX60DMFprIuin+muTeoCrCXkip2NpUA7jWQUjTUURTCD8dx4doEh/ef4nQRh0+YNYBqoQZ 7NYEnNneDB9MURIT/lbaQOCvBMbFzPk= From: Oleksandr Natalenko To: Andrew Morton Cc: cgel.zte@gmail.com, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, corbet@lwn.net, xu xin , Yang Yang , Ran Xiaokai , wangyong , Yunkai Zhang , Matthew Wilcox Subject: Re: [PATCH v6] mm/ksm: introduce ksm_force for each process Date: Fri, 13 May 2022 22:53:16 +0200 Message-ID: <1817008.tdWV9SEqCh@natalenko.name> In-Reply-To: <20220513133210.9dd0a4216bd8baaa1047562c@linux-foundation.org> References: <20220510122242.1380536-1-xu.xin16@zte.com.cn> <1835064.A2aMcgg3dW@natalenko.name> <20220513133210.9dd0a4216bd8baaa1047562c@linux-foundation.org> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="UTF-8" X-Rspamd-Queue-Id: 6E30F400AA X-Stat-Signature: rq98nab7biaauzqbmiqpzwnudix399io Authentication-Results: imf07.hostedemail.com; dkim=pass header.d=natalenko.name header.s=dkim-20170712 header.b=LbulMFpX; dmarc=pass (policy=reject) header.from=natalenko.name; spf=pass (imf07.hostedemail.com: domain of oleksandr@natalenko.name designates 104.207.131.136 as permitted sender) smtp.mailfrom=oleksandr@natalenko.name X-Rspam-User: X-Rspamd-Server: rspam08 X-HE-Tag: 1652475194-699842 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Hello. On p=C3=A1tek 13. kv=C4=9Btna 2022 22:32:10 CEST Andrew Morton wrote: > On Fri, 13 May 2022 11:51:53 +0200 Oleksandr Natalenko wrote: > > On p=C3=A1tek 13. kv=C4=9Btna 2022 0:37:53 CEST Andrew Morton wrote: > > > On Tue, 10 May 2022 15:30:36 +0200 Oleksandr Natalenko wrote: > > >=20 > > > > > If ksm_force is set to 1, force all anonymous and 'qualified' VMAs > > > > > of this mm to be involved in KSM scanning without explicitly call= ing > > > > > madvise to mark VMA as MADV_MERGEABLE. But It is effective only w= hen > > > > > the klob of /sys/kernel/mm/ksm/run is set as 1. > > > > >=20 > > > > > If ksm_force is set to 0, cancel the feature of ksm_force of this > > > > > process (fallback to the default state) and unmerge those merged = pages > > > > > belonging to VMAs which is not madvised as MADV_MERGEABLE of this= process, > > > > > but still leave MADV_MERGEABLE areas merged. > > > >=20 > > > > To my best knowledge, last time a forcible KSM was discussed (see t= hreads [1], [2], [3] and probably others) it was concluded that a) procfs w= as a horrible interface for things like this one; and b) process_madvise() = syscall was among the best suggested places to implement this (which would = require a more tricky handling from userspace, but still). > > > >=20 > > > > So, what changed since that discussion? > > > >=20 > > > > P.S. For now I do it via dedicated syscall, but I'm not trying to u= pstream this approach. > > >=20 > > > Why are you patching the kernel with a new syscall rather than using > > > process_madvise()? > >=20 > > Because I'm not sure how to use `process_madvise()` to achieve $subj pr= operly. > >=20 > > The objective is to mark all the eligible VMAs of the target task for K= SM to consider them for merging. > >=20 > > For that, all the eligible VMAs have to be traversed. > >=20 > > Given `process_madvise()` has got an iovec API, this means the process = that will call `process_madvise()` has to know the list of VMAs of the targ= et process. In order to traverse them in a race-free manner the target task= has to be SIGSTOP'ped or frozen, then the list of VMAs has to be obtained,= then `process_madvise()` has to be called, and the the target task can con= tinue. This is: > >=20 > > a) superfluous (the kernel already knows the list of VMAs of the target= tasks, why proxy it through the userspace then?); and > > b) may induce more latency than needed because the target task has to b= e stopped to avoid races. >=20 > OK. And what happens to new vmas that the target process creates after > the process_madvise()? Call `process_madvise()` on them again. And do that again. Regularly, with = some intervals. Use some daemon for that (like [1]). [1] https://gitlab.com/post-factum/uksmd > > OTOH, IIUC, even if `MADV_MERGEABLE` is allowed for `process_madvise()`, >=20 > Is it not? It is not: ``` 1158 static bool 1159 process_madvise_behavior_valid(int behavior) 1160 { 1161 switch (behavior) { 1162 case MADV_COLD: 1163 case MADV_PAGEOUT: 1164 case MADV_WILLNEED: 1165 return true; 1166 default: 1167 return false; 1168 } 1169 } ``` Initially, when `process_madvise()` stuff was being prepared for merging, I= tried to enabled it but failed [2], and it was decided [3] to move forward= without it. [2] https://lore.kernel.org/linux-api/34f812b8-df54-eaad-5cf0-335f07da55c6@= suse.cz/ [3] https://lore.kernel.org/lkml/20200623085944.cvob63vrv54fo7cs@butterfly.= localdomain/ > > I cannot just call it like this: > >=20 > > ``` > > iovec.iov_base =3D 0; > > iovec.iov_len =3D ~0ULL; > > process_madvise(pidfd, &iovec, 1, MADV_MERGEABLE, 0); > > ``` > >=20 > > to cover the whole address space because iovec expects total size to be= under ssize_t. > >=20 > > Or maybe there's no need to cover the whole address space, only the low= er half of it? >=20 > Call process_madvise() twice, once for each half? And still get `-ENOMEM`? ``` 1191 /* 1192 * If the interval [start,end) covers some unmapped address 1193 * ranges, just ignore them, but return -ENOMEM at the end. 1194 * - different from the way of handling in mlock etc. 1195 */ ``` I mean, it probably will work, and probably having the error returned is fi= ne, but generally speaking an error value should hint that something is not= being done right. > > Or maybe there's another way of doing things, and I just look stupid an= d do not understand how this is supposed to work?.. > >=20 > > I'm more than happy to read your comments on this. > >=20 >=20 > I see the problem. I do like the simplicity of the ksm_force concept.=20 > Are there alternative ideas? I do like it too. I wonder what to do with older concerns [4] [5] regarding= presenting such an API. [4] https://lore.kernel.org/lkml/20190516172452.GA2106@avx2/ [5] https://lore.kernel.org/lkml/20190515145151.GG16651@dhcp22.suse.cz/ =2D-=20 Oleksandr Natalenko (post-factum)