linux-hardening.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jeff Xu <jeffxu@chromium.org>
To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	"Liam R . Howlett" <Liam.Howlett@oracle.com>,
	 David Hildenbrand <david@redhat.com>,
	Vlastimil Babka <vbabka@suse.cz>, Jann Horn <jannh@google.com>,
	 Pedro Falcato <pfalcato@suse.de>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	 Kees Cook <kees@kernel.org>,
	linux-hardening@vger.kernel.org
Subject: Re: [PATCH v3 2/5] mm/mseal: update madvise() logic
Date: Thu, 24 Jul 2025 11:39:05 -0700	[thread overview]
Message-ID: <CABi2SkU75e7tXcpgnLbDhqSJBWVPOvmfMh2uPsU3mUjMUhhYqw@mail.gmail.com> (raw)
In-Reply-To: <ec480dc1fd4ce04bb11c0acac6c6da78dc6f4156.1752687069.git.lorenzo.stoakes@oracle.com>

Hi Lorenzo,

This change has two parts: a non-functional refactoring work of moving
function from mseal.c to madvise.c, and a functional change to the
behavior of madvise under mseal. Would you consider splitting the
change into two parts? which seems to be a common practice at
supplying patches in lkml.

On Wed, Jul 16, 2025 at 10:38 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> The madvise() logic is inexplicably performed in mm/mseal.c - this ought
> to be located in mm/madvise.c.
>
This is part one of a non-functional refactoring work to move a
function from mseal.c to madvise.c.

There are two main reasons why I initially wanted to keep all
mseal-related functions in mseal.c:

1 Keeping all mseal related logic in mseal.c makes it easier for
developers to find all the impacted areas of mseal.
2 mseal is not supported in 32 bits, and mseal.c is excluded from 32
bits build (see makefile).This would prevent accidentally using mseal
in code paths shared between 32-bit and 64-bit architectures. It also
avoids adding #Ifdef in the .c file, which is recommended by the mm
coding standard (I received comments about avoiding  #ifdef in .c  in
the past).

Point 2 can go aways if 32 bits mseal support is coming soon, same as
my other comments for patch 1/5.

Point 1 remains. However, I want to focus on the functional change
part of this patch, rather than the other aspects.

> Additionally can_modify_vma_madv() is inconsistently named and, in
> combination with is_ro_anon(), is very confusing logic.
>
> Put a static function in mm/madvise.c instead - can_madvise_modify() -
> that spells out exactly what's happening.  Also explicitly check for an
> anon VMA.
>
> Also add commentary to explain what's going on.
>
> Essentially - we disallow discarding of data in mseal()'d mappings in
> instances where the user couldn't otherwise write to that data.
>
> Shared mappings are always backed, so no discard will actually truly
> discard the data.  Read-only anonymous and MAP_PRIVATE file-backed
> mappings are the ones we are interested in.
>
> We make a change to the logic here to correct a mistake - we must disallow
> discard of read-only MAP_PRIVATE file-backed mappings, which previously we
> were not.
>
> The justification for this change is to account for the case where:
>
> 1. A MAP_PRIVATE R/W file-backed mapping is established.
> 2. The mapping is written to, which backs it with anonymous memory.
> 3. The mapping is mprotect()'d read-only.
> 4. The mapping is mseal()'d.
>
> If we were to now allow discard of this data, it would mean mseal() would
> not prevent the unrecoverable discarding of data and it was thus violate
> the semantics of sealed VMAs.
>
This is the functional change and the most important area that I would
like to discuss in this patch series.

Since Jann Horn first highlighted [1] the problematic behavior of
destructive madvise for anonymous mapping during initial discussions
of mseal(), the proper solution has been open to discussion and
exploration. Linus Torvalds has expressed interest [2] in fixing this
within madvise itself, without requiring mseal, and I copied it here
for ease of reference:

“Hmm. I actually would be happier if we just made that change in
general. Maybe even without sealing, but I agree that it *definitely*
makes sense in general as a sealing thing.”

After mseal was merged, Pedro Falcato raised a valid concern regarding
file-backed private mappings. The issue is that these mappings can be
written to, and subsequently changed to read-only, which is precisely
the problem this patch aims to fix. I attempted to address this in
[3]. However, that patch was rejected due to the introduction of a new
vm_flags, which was a valid rejection as it wasn't the ideal solution.
Nevertheless, it sparked an interesting discussion, with me raising a
point that userspace might use this feature to free up RAM for
file-backed private mapping that is never written to,  and input about
this topic from Vlastimil Babka [4] is about MADV_COLD/MADV_PAGEOUT.

A detail about userspace calling those madvise for file-backed private
mapping. Previously, I added extra logging in the kernel, and observed
many instances of those calls  during Android phone startup, although
I haven’t delved into the reason behind why user space calls those,
e.g. if it is from an individual app or Android platform.

Incidentally, recently while I was studying selinux code, particularly
exemod permission [5] , I learned that selinux utilizes vma->anon_vma
to identify file-backed mappings that have been written to.  Jann
pointed out to me that the kernel creates a COW mapping when a private
file-backed mapping is written. Now I'm wondering if this could be the
answer to our problem. We could try having destructive madvise check
vma->anon_vma and reject the call if it's true. I haven't had a chance
to test this theory yet, though.

To summarize all the discussion points so far:
1. It's questionable behavior for madvise to allow destructive
behavior for read-only anonymous mappings, regardless of mseal state.
2. We could potentially fix point 1 within madvise itself, without
involving mseal, as Linus desires.
3. Android userspace uses destructive madvise to free up RAM, but I
need to take a closer look at the patterns and usage to understand why
they do that.
4. We could ask applications to switch to non-destructive madvise,
like MADV_COLD or MADV_PAGEOUT. Or, another option is that we could
switch the kernel to use non-destructive madvise implicitly for
destructive madvise in suitable situations.
5. We could investigate more based on vma->anon_vma

Link: https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfUGLvA@mail.gmail.com/
[1]
Link: https://lore.kernel.org/lkml/CAHk-=wiVhHmnXviy1xqStLRozC4ziSugTk=1JOc8ORWd2_0h7g@mail.gmail.com/
[2]
Link: https://lore.kernel.org/all/20241017005105.3047458-2-jeffxu@chromium.org/
[3]
Link: https://lore.kernel.org/all/8f68ad82-2f60-49f8-b150-0cf183c9cc71@suse.cz/
[4]
Link: https://elixir.bootlin.com/linux/v6.15.7/source/security/selinux/hooks.c#L3878
[5]

> Finally, update the mseal tests, which were asserting previously that a
> read-only MAP_PRIVATE file-backed mapping could be discarded.
>
The test you are updating is not intended for the scenario this patch
is aimed to fix: i.e. the scenario:
1. A MAP_PRIVATE R/W file-backed mapping is established.
2. The mapping is written to, which backs it with anonymous memory.
3. The mapping is mprotect()'d read-only.
4. The mapping is mseal()'d.

The test case doesn't include steps 1, 2, and 3, is it possible to
keep the existing one and create a new test case? But I don't think
that's the main discussion point. We can revisit test cases once we've
fully discussed the design.

Thanks and regards,
-Jeff
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> Reviewed-by: Pedro Falcato <pfalcato@suse.de>
> Acked-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/madvise.c                            | 63 ++++++++++++++++++++++++-
>  mm/mseal.c                              | 49 -------------------
>  mm/vma.h                                |  7 ---
>  tools/testing/selftests/mm/mseal_test.c |  3 +-
>  4 files changed, 63 insertions(+), 59 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 2bf80989d5b6..dc3d8497b0f4 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -19,6 +19,7 @@
>  #include <linux/sched.h>
>  #include <linux/sched/mm.h>
>  #include <linux/mm_inline.h>
> +#include <linux/mmu_context.h>
>  #include <linux/string.h>
>  #include <linux/uio.h>
>  #include <linux/ksm.h>
> @@ -1255,6 +1256,66 @@ static long madvise_guard_remove(struct madvise_behavior *madv_behavior)
>                                &guard_remove_walk_ops, NULL);
>  }
>
> +#ifdef CONFIG_64BIT
> +/* Does the madvise operation result in discarding of mapped data? */
> +static bool is_discard(int behavior)
> +{
> +       switch (behavior) {
> +       case MADV_FREE:
> +       case MADV_DONTNEED:
> +       case MADV_DONTNEED_LOCKED:
> +       case MADV_REMOVE:
> +       case MADV_DONTFORK:
> +       case MADV_WIPEONFORK:
> +       case MADV_GUARD_INSTALL:
> +               return true;
> +       }
> +
> +       return false;
> +}
> +
> +/*
> + * We are restricted from madvise()'ing mseal()'d VMAs only in very particular
> + * circumstances - discarding of data from read-only anonymous SEALED mappings.
> + *
> + * This is because users cannot trivally discard data from these VMAs, and may
> + * only do so via an appropriate madvise() call.
> + */
> +static bool can_madvise_modify(struct madvise_behavior *madv_behavior)
> +{
> +       struct vm_area_struct *vma = madv_behavior->vma;
> +
> +       /* If the VMA isn't sealed we're good. */
> +       if (can_modify_vma(vma))
> +               return true;
> +
> +       /* For a sealed VMA, we only care about discard operations. */
> +       if (!is_discard(madv_behavior->behavior))
> +               return true;
> +
> +       /*
> +        * But shared mappings are fine, as dirty pages will be written to a
> +        * backing store regardless of discard.
> +        */
> +       if (vma->vm_flags & VM_SHARED)
> +               return true;
> +
> +       /* If the user could write to the mapping anyway, then this is fine. */
> +       if ((vma->vm_flags & VM_WRITE) &&
> +           arch_vma_access_permitted(vma, /* write= */ true,
> +                       /* execute= */ false, /* foreign= */ false))
> +               return true;
> +
> +       /* Otherwise, we are not permitted to perform this operation. */
> +       return false;
> +}
> +#else
> +static bool can_madvise_modify(struct madvise_behavior *madv_behavior)
> +{
> +       return true;
> +}
> +#endif
> +
>  /*
>   * Apply an madvise behavior to a region of a vma.  madvise_update_vma
>   * will handle splitting a vm area into separate areas, each area with its own
> @@ -1268,7 +1329,7 @@ static int madvise_vma_behavior(struct madvise_behavior *madv_behavior)
>         struct madvise_behavior_range *range = &madv_behavior->range;
>         int error;
>
> -       if (unlikely(!can_modify_vma_madv(madv_behavior->vma, behavior)))
> +       if (unlikely(!can_madvise_modify(madv_behavior)))
>                 return -EPERM;
>
>         switch (behavior) {
> diff --git a/mm/mseal.c b/mm/mseal.c
> index c27197ac04e8..1308e88ab184 100644
> --- a/mm/mseal.c
> +++ b/mm/mseal.c
> @@ -11,7 +11,6 @@
>  #include <linux/mman.h>
>  #include <linux/mm.h>
>  #include <linux/mm_inline.h>
> -#include <linux/mmu_context.h>
>  #include <linux/syscalls.h>
>  #include <linux/sched.h>
>  #include "internal.h"
> @@ -21,54 +20,6 @@ static inline void set_vma_sealed(struct vm_area_struct *vma)
>         vm_flags_set(vma, VM_SEALED);
>  }
>
> -static bool is_madv_discard(int behavior)
> -{
> -       switch (behavior) {
> -       case MADV_FREE:
> -       case MADV_DONTNEED:
> -       case MADV_DONTNEED_LOCKED:
> -       case MADV_REMOVE:
> -       case MADV_DONTFORK:
> -       case MADV_WIPEONFORK:
> -       case MADV_GUARD_INSTALL:
> -               return true;
> -       }
> -
> -       return false;
> -}
> -
> -static bool is_ro_anon(struct vm_area_struct *vma)
> -{
> -       /* check anonymous mapping. */
> -       if (vma->vm_file || vma->vm_flags & VM_SHARED)
> -               return false;
> -
> -       /*
> -        * check for non-writable:
> -        * PROT=RO or PKRU is not writeable.
> -        */
> -       if (!(vma->vm_flags & VM_WRITE) ||
> -               !arch_vma_access_permitted(vma, true, false, false))
> -               return true;
> -
> -       return false;
> -}
> -
> -/*
> - * Check if a vma is allowed to be modified by madvise.
> - */
> -bool can_modify_vma_madv(struct vm_area_struct *vma, int behavior)
> -{
> -       if (!is_madv_discard(behavior))
> -               return true;
> -
> -       if (unlikely(!can_modify_vma(vma) && is_ro_anon(vma)))
> -               return false;
> -
> -       /* Allow by default. */
> -       return true;
> -}
> -
>  static int mseal_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma,
>                 struct vm_area_struct **prev, unsigned long start,
>                 unsigned long end, vm_flags_t newflags)
> diff --git a/mm/vma.h b/mm/vma.h
> index acdcc515c459..85db5e880fcc 100644
> --- a/mm/vma.h
> +++ b/mm/vma.h
> @@ -577,8 +577,6 @@ static inline bool can_modify_vma(struct vm_area_struct *vma)
>         return true;
>  }
>
> -bool can_modify_vma_madv(struct vm_area_struct *vma, int behavior);
> -
>  #else
>
>  static inline bool can_modify_vma(struct vm_area_struct *vma)
> @@ -586,11 +584,6 @@ static inline bool can_modify_vma(struct vm_area_struct *vma)
>         return true;
>  }
>
> -static inline bool can_modify_vma_madv(struct vm_area_struct *vma, int behavior)
> -{
> -       return true;
> -}
> -
>  #endif
>
>  #if defined(CONFIG_STACK_GROWSUP)
> diff --git a/tools/testing/selftests/mm/mseal_test.c b/tools/testing/selftests/mm/mseal_test.c
> index 005f29c86484..34c042da4de2 100644
> --- a/tools/testing/selftests/mm/mseal_test.c
> +++ b/tools/testing/selftests/mm/mseal_test.c
> @@ -1712,7 +1712,6 @@ static void test_seal_discard_ro_anon_on_filebacked(bool seal)
>         unsigned long size = 4 * page_size;
>         int ret;
>         int fd;
> -       unsigned long mapflags = MAP_PRIVATE;
>
>         fd = memfd_create("test", 0);
>         FAIL_TEST_IF_FALSE(fd > 0);
> @@ -1720,7 +1719,7 @@ static void test_seal_discard_ro_anon_on_filebacked(bool seal)
>         ret = fallocate(fd, 0, 0, size);
>         FAIL_TEST_IF_FALSE(!ret);
>
> -       ptr = mmap(NULL, size, PROT_READ, mapflags, fd, 0);
> +       ptr = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);
>         FAIL_TEST_IF_FALSE(ptr != MAP_FAILED);
>
>         if (seal) {
> --
> 2.50.1
>

  parent reply	other threads:[~2025-07-24 18:39 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <cover.1752687069.git.lorenzo.stoakes@oracle.com>
2025-07-24 18:32 ` [PATCH v3 0/5] mseal cleanups, fixup MAP_PRIVATE file-backed case Jeff Xu
2025-07-24 19:10   ` Lorenzo Stoakes
     [not found] ` <d0e9b39c9d1abceb0ffac341c6fae96186c5d843.1752687069.git.lorenzo.stoakes@oracle.com>
2025-07-24 18:34   ` [PATCH v3 1/5] mm/mseal: always define VM_SEALED Jeff Xu
2025-07-24 18:44     ` Lorenzo Stoakes
     [not found] ` <ec480dc1fd4ce04bb11c0acac6c6da78dc6f4156.1752687069.git.lorenzo.stoakes@oracle.com>
2025-07-24 18:39   ` Jeff Xu [this message]
2025-07-24 18:56     ` [PATCH v3 2/5] mm/mseal: update madvise() logic David Hildenbrand
2025-07-24 22:18       ` David Hildenbrand
2025-07-24 19:07     ` Lorenzo Stoakes
2025-07-24 21:53       ` David Hildenbrand
2025-07-25  6:17         ` Lorenzo Stoakes
2025-07-25 16:22         ` Jeff Xu
     [not found] ` <ac51c2a3c68a2475149b54180ff012fffab72c02.1752687069.git.lorenzo.stoakes@oracle.com>
2025-07-24 18:40   ` [PATCH v3 3/5] mm/mseal: small cleanups Jeff Xu
     [not found] ` <f4dda4c4840c69b2ea735bcb6d54ae70d527a48a.1752687069.git.lorenzo.stoakes@oracle.com>
2025-07-24 18:40   ` [PATCH v3 4/5] mm/mseal: Simplify and rename VMA gap check Jeff Xu
2025-07-25  5:33     ` Lorenzo Stoakes
     [not found] ` <9ae70e1c509d790cf174f16e491975efd9be50b6.1752687069.git.lorenzo.stoakes@oracle.com>
2025-07-24 18:41   ` [PATCH v3 5/5] mm/mseal: rework mseal apply logic Jeff Xu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CABi2SkU75e7tXcpgnLbDhqSJBWVPOvmfMh2uPsU3mUjMUhhYqw@mail.gmail.com \
    --to=jeffxu@chromium.org \
    --cc=Liam.Howlett@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=david@redhat.com \
    --cc=jannh@google.com \
    --cc=kees@kernel.org \
    --cc=linux-hardening@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lorenzo.stoakes@oracle.com \
    --cc=pfalcato@suse.de \
    --cc=vbabka@suse.cz \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).