From: Zach O'Keefe <zokeefe@google.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Alex Shi <alex.shi@linux.alibaba.com>,
	David Hildenbrand <david@redhat.com>,
	David Rientjes <rientjes@google.com>,
	Matthew Wilcox <willy@infradead.org>,
	Michal Hocko <mhocko@suse.com>,
	Pasha Tatashin <pasha.tatashin@soleen.com>,
	Peter Xu <peterx@redhat.com>,
	Rongwei Wang <rongwei.wang@linux.alibaba.com>,
	SeongJae Park <sj@kernel.org>, Song Liu <songliubraving@fb.com>,
	Vlastimil Babka <vbabka@suse.cz>, Yang Shi <shy828301@gmail.com>,
	Zi Yan <ziy@nvidia.com>,
	linux-mm@kvack.org, Andrea Arcangeli <aarcange@redhat.com>,
	Arnd Bergmann <arnd@arndb.de>,
	Axel Rasmussen <axelrasmussen@google.com>,
	Chris Kennelly <ckennelly@google.com>,
	Chris Zankel <chris@zankel.net>, Helge Deller <deller@gmx.de>,
	Hugh Dickins <hughd@google.com>,
	Ivan Kokshaysky <ink@jurassic.park.msu.ru>,
	"James E.J. Bottomley" <James.Bottomley@hansenpartnership.com>,
	Jens Axboe <axboe@kernel.dk>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Matt Turner <mattst88@gmail.com>,
	Max Filippov <jcmvbkbc@gmail.com>,
	Miaohe Lin <linmiaohe@huawei.com>,
	Minchan Kim <minchan@kernel.org>,
	Patrick Xia <patrickx@google.com>,
	Pavel Begunkov <asml.silence@gmail.com>,
	Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Subject: Re: [mm-unstable v7 12/18] mm/madvise: add MADV_COLLAPSE to process_madvise()
Date: Tue, 12 Jul 2022 18:05:17 -0700
Message-ID: <Ys4aTRqWIbjNs1mI@google.com>
In-Reply-To: <20220708134732.fd9cc80739a3b9781a1ecf9e@linux-foundation.org>

On Jul 08 13:47, Andrew Morton wrote:
> On Wed,  6 Jul 2022 16:59:30 -0700 "Zach O'Keefe" <zokeefe@google.com> wrote:
> 
> > Allow MADV_COLLAPSE behavior for process_madvise(2) if caller has
> > CAP_SYS_ADMIN or is requesting collapse of its own memory.
> 
> This is maximally restrictive.  I didn't see any discussion of why this
> was chosen either here or in the [0/N].  I expect that people will be
> coming after us to relax this.
> 
> So please do add (a lot of) words explaining this decision, and
> describing what might be done in the future to relax it.

Hey Andrew,

Thanks for taking the time to look at this series. After taking a look through
capabilities(7) I think you're absolutely right to call this out - thanks for
that.

I think move_pages(2) is the best comparison here. There, we use
CAP_SYS_NICE + PTRACE_MODE_READ_REALCREDS to ensure the caller is allowed to
copy and move the memory of an external process between nodes. This is also the
current default for process_madvise(2). However, MADV_COLLAPSE is additionally
able to:

1) Influence the RSS of a process / memory charged to a cgroup (by
  collapsing a hugepage-sized/aligned region with nonresident pages). Note that
  for file/shmem, this might cause an increase in file/shmem RSS for non-target
  mm's.
2) Bypass sysfs THP settings

For (1), process_madvise(MADV_WILLNEED) could presumably be used to increase RSS
/ memcg usage, and we don't require any additional capabilities there.

For (2), I don't think there is an easy precedent. I think it makes sense to
require that the caller have write permission to
/sys/kernel/mm/transparent_hugepage/*. AFAICT, this means an effective user ID
of 0, which is about as restrictive as CAP_SYS_ADMIN. One idea would be to use
CAP_SETUID, since these threads could always assume a real/effective user ID
of 0.
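
For illustration only, here is a rough userspace sketch of what "has write
permission" looks like from the caller's side. It assumes the standard
`enabled` file under the THP sysfs directory. `can_write_path()` and
`THP_ENABLED_PATH` are made-up names. Note that access(2) checks against the
real UID/GID, so it is not the same as an effective-UID check done in the
kernel.

```c
#include <stdbool.h>
#include <unistd.h>

/* Assumed example file under the sysfs THP directory discussed above. */
#define THP_ENABLED_PATH "/sys/kernel/mm/transparent_hugepage/enabled"

/*
 * Hypothetical helper: returns true if the calling process would be
 * allowed to write to `path`. access(2) evaluates permissions using the
 * real UID/GID rather than the effective ones.
 */
static bool can_write_path(const char *path)
{
	return access(path, W_OK) == 0;
}
```

A caller could test `can_write_path(THP_ENABLED_PATH)` before attempting a
collapse on another process. On a typical system this only succeeds for root.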

That said, I'm not sure CAP_SETUID is needed, and perhaps the existing
process_madvise(2) restrictions are enough, given that CAP_SYS_NICE confers the
ability to copy around all the same memory. We'll just be doing some additional
page table manipulations after some of that copying, which should (mostly) be
transparent to the users. I.e. I don't think it expands CAP_SYS_NICE's
"security silo" that much. Could be wrong though.
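
To make the difference between the two options concrete, here is a rough
userspace model (not kernel code) of the policy in this patch versus simply
relying on the existing process_madvise(2) check as described above. The
struct, field, and function names are all made up for illustration, and the
model only encodes the conditions stated in this thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of a process_madvise(MADV_COLLAPSE) caller. */
struct caller {
	bool cap_sys_admin;  /* holds CAP_SYS_ADMIN */
	bool cap_sys_nice;   /* holds CAP_SYS_NICE */
	bool ptrace_read_ok; /* PTRACE_MODE_READ_REALCREDS check on target passes */
	bool targets_self;   /* pidfd refers to the caller's own mm */
};

/* Policy in this patch: CAP_SYS_ADMIN, or collapsing one's own memory. */
static bool allowed_by_patch(const struct caller *c)
{
	return c->cap_sys_admin || c->targets_self;
}

/* Alternative: rely only on the existing check, as described above. */
static bool allowed_by_existing_check(const struct caller *c)
{
	return c->cap_sys_nice && c->ptrace_read_ok;
}

/* A case where the two disagree: a manager process with CAP_SYS_NICE only. */
static void compare_policies(void)
{
	struct caller manager = {
		.cap_sys_nice = true,
		.ptrace_read_ok = true,
	};

	assert(!allowed_by_patch(&manager));
	assert(allowed_by_existing_check(&manager));
}
```

The manager case in compare_policies() is the one a relaxation would be
aimed at: a process that can already act on the target's memory under
CAP_SYS_NICE, but which the patch as posted would reject.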

Again, thanks for your time,
Zach


