Alpha arch development list
 help / color / mirror / Atom feed
* Re: [RFC PATCH 1/4] mm: pagewalk: add the ability to install PTEs
From: Lorenzo Stoakes @ 2024-10-14 11:10 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andrew Morton, Suren Baghdasaryan, Liam R . Howlett,
	Matthew Wilcox, Vlastimil Babka, Paul E . McKenney,
	David Hildenbrand, linux-mm, linux-kernel, Muchun Song,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	Thomas Bogendoerfer, James E . J . Bottomley, Helge Deller,
	Chris Zankel, Max Filippov, Arnd Bergmann, linux-alpha,
	linux-mips, linux-parisc, linux-arch, Shuah Khan,
	Christian Brauner, linux-kselftest, Sidhartha Kumar
In-Reply-To: <CAG48ez3XwT3aeXRDrW2e_46aAzyQ8iVGr6G7PrMz0KATaJNtew@mail.gmail.com>

On Fri, Oct 11, 2024 at 08:11:28PM +0200, Jann Horn wrote:
> On Fri, Sep 27, 2024 at 2:51 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> > Rather than add yet another implementation, we extend the generic pagewalk
> > logic to enable the installation of page table entries by adding a new
> > install_pte() callback in mm_walk_ops. If this is specified, then upon
> > encountering a missing page table entry, we allocate and install a new one
> > and continue the traversal.
> [...]
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> Reviewed-by: Jann Horn <jannh@google.com>

Thanks!

^ permalink raw reply

* Re: [RFC PATCH 3/4] mm: madvise: implement lightweight guard page mechanism
From: Lorenzo Stoakes @ 2024-10-14 11:08 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andrew Morton, Suren Baghdasaryan, Liam R . Howlett,
	Matthew Wilcox, Vlastimil Babka, Paul E . McKenney,
	David Hildenbrand, linux-mm, linux-kernel, Muchun Song,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	Thomas Bogendoerfer, James E . J . Bottomley, Helge Deller,
	Chris Zankel, Max Filippov, Arnd Bergmann, linux-alpha,
	linux-mips, linux-parisc, linux-arch, Shuah Khan,
	Christian Brauner, linux-kselftest, Sidhartha Kumar,
	Vlastimil Babka
In-Reply-To: <CAG48ez3ursoL-f=mYpV79Do18XPPt+MPPHNUBv6uFE1GhpOwSA@mail.gmail.com>

On Fri, Oct 11, 2024 at 08:11:36PM +0200, Jann Horn wrote:
> On Fri, Sep 27, 2024 at 2:51 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> > Implement a new lightweight guard page feature, that is regions of userland
> > virtual memory that, when accessed, cause a fatal signal to arise.
> [...]
> > ---
> >  arch/alpha/include/uapi/asm/mman.h     |   3 +
> >  arch/mips/include/uapi/asm/mman.h      |   3 +
> >  arch/parisc/include/uapi/asm/mman.h    |   3 +
> >  arch/xtensa/include/uapi/asm/mman.h    |   3 +
> >  include/uapi/asm-generic/mman-common.h |   3 +
>
> I kinda wonder if we could start moving the parts of those headers
> that are the same for all architectures to include/uapi/linux/mman.h
> instead... but that's maybe out of scope for this series.

Arnd already had a look at this in a recent series. I had the same feeling doing
this...

>
> [...]
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index e871a72a6c32..7216e10723ae 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -60,6 +60,7 @@ static int madvise_need_mmap_write(int behavior)
> >         case MADV_POPULATE_READ:
> >         case MADV_POPULATE_WRITE:
> >         case MADV_COLLAPSE:
> > +       case MADV_GUARD_UNPOISON: /* Only poisoning needs a write lock. */
>
> What does poisoning need a write lock for? anon_vma_prepare() doesn't
> need it (it only needs mmap_lock held for reading),
> zap_page_range_single() doesn't need it, and pagewalk also doesn't
> need it as long as the range being walked is covered by a VMA, which
> it is...
>
> I see you set PGWALK_WRLOCK in guard_poison_walk_ops with a comment
> saying "We might need to install an anon_vma" - is that referring to
> an older version of the patch where the anon_vma_prepare() call was
> inside the pagewalk callback or something like that? Either way,
> anon_vma_prepare() doesn't need write locks (it can't, it has to work
> from the page fault handling path).

OK this was a misunderstanding. Actually there have been more than one, at first
I thought a write lock would protect us against racing faults (nope, due to RCU
vma locking now :) and then I had assumed literally changing a vma field
_surely_ must require a write lock, also it appears no as __anon_vma_prepare(),
amusingly, uses the mm->page_table_lock to protect against accesses to
vma->anon_vma.

And yes you're right it is triggered on the fault path so has to work that way.

TL;DR will change to read lock.

>
> >                 return 0;
> >         default:
> >                 /* be safe, default to 1. list exceptions explicitly */
> [...]
> > +static long madvise_guard_poison(struct vm_area_struct *vma,
> > +                                struct vm_area_struct **prev,
> > +                                unsigned long start, unsigned long end)
> > +{
> > +       long err;
> > +       bool retried = false;
> > +
> > +       *prev = vma;
> > +       if (!is_valid_guard_vma(vma, /* allow_locked = */false))
> > +               return -EINVAL;
> > +
> > +       /*
> > +        * Optimistically try to install the guard poison pages first. If any
> > +        * non-guard pages are encountered, give up and zap the range before
> > +        * trying again.
> > +        */
> > +       while (true) {
> > +               unsigned long num_installed = 0;
> > +
> > +               /* Returns < 0 on error, == 0 if success, > 0 if zap needed. */
> > +               err = walk_page_range_mm(vma->vm_mm, start, end,
> > +                                        &guard_poison_walk_ops,
> > +                                        &num_installed);
> > +               /*
> > +                * If we install poison markers, then the range is no longer
> > +                * empty from a page table perspective and therefore it's
> > +                * appropriate to have an anon_vma.
> > +                *
> > +                * This ensures that on fork, we copy page tables correctly.
> > +                */
> > +               if (err >= 0 && num_installed > 0) {
> > +                       int err_anon = anon_vma_prepare(vma);
>
> I'd move this up, to before we create poison PTEs. There's no harm in
> attaching an anon_vma to the VMA even if the rest of the operation
> fails; and I think it would be weird to have error paths that don't
> attach an anon_vma even though they .

I think you didn't finish this sentence :)

I disagree, we might have absolutely no need to do it, and I'd rather only
do so _if_ we have to.

It feels like the logical spot to do it and, while the cases where it
wouldn't happen are ones where pages are already poisoned (the
vma->anon_vma == NULL test will fail so basically a no-op) or error on page
walk.

>
> > +                       if (err_anon)
> > +                               err = err_anon;
> > +               }
> > +
> > +               if (err <= 0)
> > +                       return err;
> > +
> > +               if (!retried)
> > +                       /*
> > +                        * OK some of the range have non-guard pages mapped, zap
> > +                        * them. This leaves existing guard pages in place.
> > +                        */
> > +                       zap_page_range_single(vma, start, end - start, NULL);
> > +               else
> > +                       /*
> > +                        * If we reach here, then there is a racing fault that
> > +                        * has populated the PTE after we zapped. Give up and
> > +                        * let the user know to try again.
> > +                        */
> > +                       return -EAGAIN;
>
> Hmm, yeah, it would be nice if we could avoid telling userspace to
> loop on -EAGAIN but I guess we don't have any particularly good
> options here? Well, we could bail out with -EINTR if a (fatal?) signal
> is pending and otherwise keep looping... if we'd tell userspace "try
> again on -EAGAIN", we might as well do that in the kernel...

The problem is you could conceivably go on for quite some time, while
holding and contending a HIGHLY contended lock (mm->mmap_lock) so I'd
really rather let userspace take care of it.

You could avoid this by having the walker be a _replace_ operation, that is
- if we encounter an existing mapping, replace in-place with a poison
marker rather than install marker/zap.

However doing that would involve either completely abstracting such logic
from scratch (a significant task in itself) to avoid duplication which be
hugely off-topic for the patch set or worse, duplicating a whole bunch of
page walking logic once again.

By being optimistic and simply having the user having to handle looping
which seems reasonable (again, it's weird if you're installing poison
markers and another thread could be racing you) we avoid all of that.


>
> (Personally I would put curly braces around these branches because
> they occupy multiple lines, though the coding style doesn't explicitly
> say that, so I guess maybe it's a matter of personal preference...
> adding curly braces here would match what is done, for example, in
> relocate_vma_down().)

Hey I wrote that too! ;) Sure I can change that.

>
> > +               retried = true;
> > +       }
> > +}
> > +
> > +static int guard_unpoison_pte_entry(pte_t *pte, unsigned long addr,
> > +                                   unsigned long next, struct mm_walk *walk)
> > +{
> > +       pte_t ptent = ptep_get(pte);
> > +
> > +       if (is_guard_pte_marker(ptent)) {
> > +               /* Simply clear the PTE marker. */
> > +               pte_clear_not_present_full(walk->mm, addr, pte, true);
>
> I think that last parameter probably should be "false"? The sparc code
> calls it "fullmm", which is a term the MM code uses when talking about
> operations that remove all mappings in the entire mm_struct because
> the process has died, which allows using some faster special-case
> version of TLB shootdown or something along those lines.

Yeah I think you're right. Will change.

>
> > +               update_mmu_cache(walk->vma, addr, pte);
> > +       }
> > +
> > +       return 0;
> > +}
> > +
> > +static const struct mm_walk_ops guard_unpoison_walk_ops = {
> > +       .pte_entry              = guard_unpoison_pte_entry,
> > +       .walk_lock              = PGWALK_RDLOCK,
> > +};
>
> It is a _little_ weird that unpoisoning creates page tables when they
> don't already exist, which will also prevent creating THP entries on
> fault in such areas afterwards... but I guess it doesn't really matter
> given that poisoning has that effect, too, and you probably usually
> won't call MADV_GUARD_UNPOISON on an area that hasn't been poisoned
> before... so I guess this is not an actionable comment.

It doesn't? There's no .install_pte so if an entries are non-present we
ignore.

HOWEVER, we do split THP. I don't think there's any way around it unless we
extended the page walker to handle this more gracefully (pmd level being
able to hint that we shouldn't do that or something), but that's really out
of scope here.

The idea is that a caller can lazily call MADV_GUARD_UNPOISON on a range
knowing things stay as they were, I guess we can add to the manpage a note
that this will split THP?

^ permalink raw reply

* [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection
From: Ryan Roberts @ 2024-10-14 10:58 UTC (permalink / raw)
  To: David S. Miller, James E.J. Bottomley, Andreas Larsson,
	Andrew Morton, Anshuman Khandual, Anton Ivanov, Ard Biesheuvel,
	Arnd Bergmann, Borislav Petkov, Catalin Marinas, Chris Zankel,
	Dave Hansen, David Hildenbrand, Dinh Nguyen, Geert Uytterhoeven,
	Greg Marsden, Helge Deller, Huacai Chen, Ingo Molnar, Ivan Ivanov,
	Johannes Berg, John Paul Adrian Glaubitz, Jonas Bonn,
	Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
	Max Filippov, Miroslav Benes, Rich Felker, Richard Weinberger,
	Stafford Horne, Stefan Kristiansson, Thomas Bogendoerfer,
	Thomas Gleixner, Will Deacon, Yoshinori Sato, x86
  Cc: Ryan Roberts, linux-alpha, linux-arch, linux-arm-kernel,
	linux-csky, linux-hexagon, linux-kernel, linux-m68k, linux-mips,
	linux-mm, linux-openrisc, linux-parisc, linux-riscv, linux-s390,
	linux-sh, linux-snps-arc, linux-um, linuxppc-dev, loongarch,
	sparclinux
In-Reply-To: <20241014105514.3206191-1-ryan.roberts@arm.com>

arm64 can support multiple base page sizes. Instead of selecting a page
size at compile time, as is done today, we will make it possible to
select the desired page size on the command line.

In this case PAGE_SHIFT and it's derivatives, PAGE_SIZE and PAGE_MASK
(as well as a number of other macros related to or derived from
PAGE_SHIFT, but I'm not worrying about those yet), are no longer
compile-time constants. So the code base needs to cope with that.

As a first step, introduce MIN and MAX variants of these macros, which
express the range of possible page sizes. These are always compile-time
constants and can be used in many places where PAGE_[SHIFT|SIZE|MASK]
were previously used where a compile-time constant is required.
(Subsequent patches will do that conversion work). When the arch/build
doesn't support boot-time page size selection, the MIN and MAX variants
are equal and everything resolves as it did previously.

Additionally, introduce DEFINE_GLOBAL_PAGE_SIZE_VAR[_CONST]() which wrap
global variable defintions so that for boot-time page size selection
builds, the variable being wrapped is initialized at boot-time, instead
of compile-time. This is done by defining a function to do the
assignment, which has the "constructor" attribute. Constructor is
preferred over initcall, because when compiling a module, the module is
limited to a single initcall but constructors are unlimited. For
built-in code, constructors are now called earlier to guarrantee that
the variables are initialized by the time they are used. Any arch that
wants to enable boot-time page size selection will need to select
CONFIG_CONSTRUCTORS.

These new macros need to be available anywhere PAGE_SHIFT and friends
are available. Those are defined via asm/page.h (although some arches
have a sub-include that defines them). Unfortunately there is no
reliable asm-generic header we can easily piggy-back on, so let's define
a new one, pgtable-geometry.h, which we include near where each arch
defines PAGE_SHIFT. Ugh.

-------

Most of the problems that need to be solved over the next few patches
fall into these broad categories, which are all solved with the help of
these new macros:

1. Assignment of values derived from PAGE_SIZE in global variables

  For boot-time page size builds, we must defer the initialization of
  these variables until boot-time, when the page size is known. See
  DEFINE_GLOBAL_PAGE_SIZE_VAR[_CONST]() as described above.

2. Define static storage in units related to PAGE_SIZE

  This static storage will be defined according to PAGE_SIZE_MAX.

3. Define size of struct so that it is related to PAGE_SIZE

  The struct often contains an array that is sized to fill the page. In
  this case, use a flexible array with dynamic allocation. In other
  cases, the struct fits exactly over a page, which is a header (e.g.
  swap file header). In this case, remove the padding, and manually
  determine the struct pointer within the page.

4. BUILD_BUG_ON() with values derived from PAGE_SIZE

  In most cases, we can change these to compare againt the appropriate
  limit (either MIN or MAX). In other cases, we must change these to
  run-time BUG_ON().

5. Ensure page alignment of static data structures

  Align instead to PAGE_SIZE_MAX.

6. #ifdeffery based on PAGE_SIZE

  Often these can be changed to c code constructs. e.g. a macro that
  returns a different value depending on page size can be changed to use
  the ternary operator and the compiler will dead code strip it for the
  compile-time constant case and runtime evaluate it for the non-const
  case. Or #if/#else/#endif within a function can be converted to c
  if/else blocks, which are also dead code stripped for the const case.
  Sometimes we can change the c-preprocessor logic to use the
  appropriate MIN/MAX limit.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---

***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/

 arch/alpha/include/asm/page.h          |  1 +
 arch/arc/include/asm/page.h            |  1 +
 arch/arm/include/asm/page.h            |  1 +
 arch/arm64/include/asm/page-def.h      |  2 +
 arch/csky/include/asm/page.h           |  3 ++
 arch/hexagon/include/asm/page.h        |  2 +
 arch/loongarch/include/asm/page.h      |  2 +
 arch/m68k/include/asm/page.h           |  1 +
 arch/microblaze/include/asm/page.h     |  1 +
 arch/mips/include/asm/page.h           |  1 +
 arch/nios2/include/asm/page.h          |  2 +
 arch/openrisc/include/asm/page.h       |  1 +
 arch/parisc/include/asm/page.h         |  1 +
 arch/powerpc/include/asm/page.h        |  2 +
 arch/riscv/include/asm/page.h          |  1 +
 arch/s390/include/asm/page.h           |  1 +
 arch/sh/include/asm/page.h             |  1 +
 arch/sparc/include/asm/page.h          |  3 ++
 arch/um/include/asm/page.h             |  2 +
 arch/x86/include/asm/page_types.h      |  2 +
 arch/xtensa/include/asm/page.h         |  1 +
 include/asm-generic/pgtable-geometry.h | 71 ++++++++++++++++++++++++++
 init/main.c                            |  5 +-
 23 files changed, 107 insertions(+), 1 deletion(-)
 create mode 100644 include/asm-generic/pgtable-geometry.h

diff --git a/arch/alpha/include/asm/page.h b/arch/alpha/include/asm/page.h
index 70419e6be1a35..d0096fb5521b8 100644
--- a/arch/alpha/include/asm/page.h
+++ b/arch/alpha/include/asm/page.h
@@ -88,5 +88,6 @@ typedef struct page *pgtable_t;
 
 #include <asm-generic/memory_model.h>
 #include <asm-generic/getorder.h>
+#include <asm-generic/pgtable-geometry.h>
 
 #endif /* _ALPHA_PAGE_H */
diff --git a/arch/arc/include/asm/page.h b/arch/arc/include/asm/page.h
index def0dfb95b436..8d56549db7a33 100644
--- a/arch/arc/include/asm/page.h
+++ b/arch/arc/include/asm/page.h
@@ -6,6 +6,7 @@
 #define __ASM_ARC_PAGE_H
 
 #include <uapi/asm/page.h>
+#include <asm-generic/pgtable-geometry.h>
 
 #ifdef CONFIG_ARC_HAS_PAE40
 
diff --git a/arch/arm/include/asm/page.h b/arch/arm/include/asm/page.h
index 62af9f7f9e963..417aa8533c718 100644
--- a/arch/arm/include/asm/page.h
+++ b/arch/arm/include/asm/page.h
@@ -191,5 +191,6 @@ extern int pfn_valid(unsigned long);
 
 #include <asm-generic/getorder.h>
 #include <asm-generic/memory_model.h>
+#include <asm-generic/pgtable-geometry.h>
 
 #endif
diff --git a/arch/arm64/include/asm/page-def.h b/arch/arm64/include/asm/page-def.h
index 792e9fe881dcf..d69971cf49cd2 100644
--- a/arch/arm64/include/asm/page-def.h
+++ b/arch/arm64/include/asm/page-def.h
@@ -15,4 +15,6 @@
 #define PAGE_SIZE		(_AC(1, UL) << PAGE_SHIFT)
 #define PAGE_MASK		(~(PAGE_SIZE-1))
 
+#include <asm-generic/pgtable-geometry.h>
+
 #endif /* __ASM_PAGE_DEF_H */
diff --git a/arch/csky/include/asm/page.h b/arch/csky/include/asm/page.h
index 0ca6c408c07f2..95173d57adc8b 100644
--- a/arch/csky/include/asm/page.h
+++ b/arch/csky/include/asm/page.h
@@ -92,4 +92,7 @@ static inline unsigned long virt_to_pfn(const void *kaddr)
 #include <asm-generic/getorder.h>
 
 #endif /* !__ASSEMBLY__ */
+
+#include <asm-generic/pgtable-geometry.h>
+
 #endif /* __ASM_CSKY_PAGE_H */
diff --git a/arch/hexagon/include/asm/page.h b/arch/hexagon/include/asm/page.h
index 8a6af57274c2d..ba7ad5231695f 100644
--- a/arch/hexagon/include/asm/page.h
+++ b/arch/hexagon/include/asm/page.h
@@ -139,4 +139,6 @@ static inline unsigned long virt_to_pfn(const void *kaddr)
 #endif /* ifdef __ASSEMBLY__ */
 #endif /* ifdef __KERNEL__ */
 
+#include <asm-generic/pgtable-geometry.h>
+
 #endif
diff --git a/arch/loongarch/include/asm/page.h b/arch/loongarch/include/asm/page.h
index e85df33f11c77..9862e8fb047a6 100644
--- a/arch/loongarch/include/asm/page.h
+++ b/arch/loongarch/include/asm/page.h
@@ -123,4 +123,6 @@ extern int __virt_addr_valid(volatile void *kaddr);
 
 #endif /* !__ASSEMBLY__ */
 
+#include <asm-generic/pgtable-geometry.h>
+
 #endif /* _ASM_PAGE_H */
diff --git a/arch/m68k/include/asm/page.h b/arch/m68k/include/asm/page.h
index 8cfb84b499751..4df4681b02194 100644
--- a/arch/m68k/include/asm/page.h
+++ b/arch/m68k/include/asm/page.h
@@ -60,5 +60,6 @@ extern unsigned long _ramend;
 
 #include <asm-generic/getorder.h>
 #include <asm-generic/memory_model.h>
+#include <asm-generic/pgtable-geometry.h>
 
 #endif /* _M68K_PAGE_H */
diff --git a/arch/microblaze/include/asm/page.h b/arch/microblaze/include/asm/page.h
index 8810f4f1c3b02..abc23c3d743bd 100644
--- a/arch/microblaze/include/asm/page.h
+++ b/arch/microblaze/include/asm/page.h
@@ -142,5 +142,6 @@ static inline const void *pfn_to_virt(unsigned long pfn)
 
 #include <asm-generic/memory_model.h>
 #include <asm-generic/getorder.h>
+#include <asm-generic/pgtable-geometry.h>
 
 #endif /* _ASM_MICROBLAZE_PAGE_H */
diff --git a/arch/mips/include/asm/page.h b/arch/mips/include/asm/page.h
index 4609cb0326cf3..3d91021538f02 100644
--- a/arch/mips/include/asm/page.h
+++ b/arch/mips/include/asm/page.h
@@ -227,5 +227,6 @@ static inline unsigned long kaslr_offset(void)
 
 #include <asm-generic/memory_model.h>
 #include <asm-generic/getorder.h>
+#include <asm-generic/pgtable-geometry.h>
 
 #endif /* _ASM_PAGE_H */
diff --git a/arch/nios2/include/asm/page.h b/arch/nios2/include/asm/page.h
index 0722f88e63cc7..2e5f93beb42b7 100644
--- a/arch/nios2/include/asm/page.h
+++ b/arch/nios2/include/asm/page.h
@@ -97,4 +97,6 @@ extern struct page *mem_map;
 
 #endif /* !__ASSEMBLY__ */
 
+#include <asm-generic/pgtable-geometry.h>
+
 #endif /* _ASM_NIOS2_PAGE_H */
diff --git a/arch/openrisc/include/asm/page.h b/arch/openrisc/include/asm/page.h
index 1d5913f67c312..a0da2a9842241 100644
--- a/arch/openrisc/include/asm/page.h
+++ b/arch/openrisc/include/asm/page.h
@@ -88,5 +88,6 @@ static inline unsigned long virt_to_pfn(const void *kaddr)
 
 #include <asm-generic/memory_model.h>
 #include <asm-generic/getorder.h>
+#include <asm-generic/pgtable-geometry.h>
 
 #endif /* __ASM_OPENRISC_PAGE_H */
diff --git a/arch/parisc/include/asm/page.h b/arch/parisc/include/asm/page.h
index 4bea2e95798f0..2a75496237c09 100644
--- a/arch/parisc/include/asm/page.h
+++ b/arch/parisc/include/asm/page.h
@@ -173,6 +173,7 @@ extern int npmem_ranges;
 
 #include <asm-generic/memory_model.h>
 #include <asm-generic/getorder.h>
+#include <asm-generic/pgtable-geometry.h>
 #include <asm/pdc.h>
 
 #define PAGE0   ((struct zeropage *)absolute_pointer(__PAGE_OFFSET))
diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h
index 83d0a4fc5f755..4601c115b6485 100644
--- a/arch/powerpc/include/asm/page.h
+++ b/arch/powerpc/include/asm/page.h
@@ -300,4 +300,6 @@ static inline unsigned long kaslr_offset(void)
 #include <asm-generic/memory_model.h>
 #endif /* __ASSEMBLY__ */
 
+#include <asm-generic/pgtable-geometry.h>
+
 #endif /* _ASM_POWERPC_PAGE_H */
diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
index 7ede2111c5917..e5af7579e45bf 100644
--- a/arch/riscv/include/asm/page.h
+++ b/arch/riscv/include/asm/page.h
@@ -204,5 +204,6 @@ static __always_inline void *pfn_to_kaddr(unsigned long pfn)
 
 #include <asm-generic/memory_model.h>
 #include <asm-generic/getorder.h>
+#include <asm-generic/pgtable-geometry.h>
 
 #endif /* _ASM_RISCV_PAGE_H */
diff --git a/arch/s390/include/asm/page.h b/arch/s390/include/asm/page.h
index 16e4caa931f1f..42157e7690a77 100644
--- a/arch/s390/include/asm/page.h
+++ b/arch/s390/include/asm/page.h
@@ -275,6 +275,7 @@ static inline unsigned long virt_to_pfn(const void *kaddr)
 
 #include <asm-generic/memory_model.h>
 #include <asm-generic/getorder.h>
+#include <asm-generic/pgtable-geometry.h>
 
 #define AMODE31_SIZE		(3 * PAGE_SIZE)
 
diff --git a/arch/sh/include/asm/page.h b/arch/sh/include/asm/page.h
index f780b467e75d7..09533d46ef033 100644
--- a/arch/sh/include/asm/page.h
+++ b/arch/sh/include/asm/page.h
@@ -162,5 +162,6 @@ typedef struct page *pgtable_t;
 
 #include <asm-generic/memory_model.h>
 #include <asm-generic/getorder.h>
+#include <asm-generic/pgtable-geometry.h>
 
 #endif /* __ASM_SH_PAGE_H */
diff --git a/arch/sparc/include/asm/page.h b/arch/sparc/include/asm/page.h
index 5e44cdf2a8f2b..4327fe2bfa010 100644
--- a/arch/sparc/include/asm/page.h
+++ b/arch/sparc/include/asm/page.h
@@ -9,4 +9,7 @@
 #else
 #include <asm/page_32.h>
 #endif
+
+#include <asm-generic/pgtable-geometry.h>
+
 #endif
diff --git a/arch/um/include/asm/page.h b/arch/um/include/asm/page.h
index 9ef9a8aedfa66..f26011808f514 100644
--- a/arch/um/include/asm/page.h
+++ b/arch/um/include/asm/page.h
@@ -119,4 +119,6 @@ extern unsigned long uml_physmem;
 #define __HAVE_ARCH_GATE_AREA 1
 #endif
 
+#include <asm-generic/pgtable-geometry.h>
+
 #endif	/* __UM_PAGE_H */
diff --git a/arch/x86/include/asm/page_types.h b/arch/x86/include/asm/page_types.h
index 52f1b4ff0cc16..6d2381342047f 100644
--- a/arch/x86/include/asm/page_types.h
+++ b/arch/x86/include/asm/page_types.h
@@ -71,4 +71,6 @@ extern void initmem_init(void);
 
 #endif	/* !__ASSEMBLY__ */
 
+#include <asm-generic/pgtable-geometry.h>
+
 #endif	/* _ASM_X86_PAGE_DEFS_H */
diff --git a/arch/xtensa/include/asm/page.h b/arch/xtensa/include/asm/page.h
index 4db56ef052d22..86952cb32af23 100644
--- a/arch/xtensa/include/asm/page.h
+++ b/arch/xtensa/include/asm/page.h
@@ -200,4 +200,5 @@ static inline unsigned long ___pa(unsigned long va)
 #endif /* __ASSEMBLY__ */
 
 #include <asm-generic/memory_model.h>
+#include <asm-generic/pgtable-geometry.h>
 #endif /* _XTENSA_PAGE_H */
diff --git a/include/asm-generic/pgtable-geometry.h b/include/asm-generic/pgtable-geometry.h
new file mode 100644
index 0000000000000..358e729a6ac37
--- /dev/null
+++ b/include/asm-generic/pgtable-geometry.h
@@ -0,0 +1,71 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef ASM_GENERIC_PGTABLE_GEOMETRY_H
+#define ASM_GENERIC_PGTABLE_GEOMETRY_H
+
+#if   defined(PAGE_SHIFT_MAX) && defined(PAGE_SIZE_MAX) && defined(PAGE_MASK_MAX) && \
+      defined(PAGE_SHIFT_MIN) && defined(PAGE_SIZE_MIN) && defined(PAGE_MASK_MIN)
+/* Arch supports boot-time page size selection. */
+#elif defined(PAGE_SHIFT_MAX) || defined(PAGE_SIZE_MAX) || defined(PAGE_MASK_MAX) || \
+      defined(PAGE_SHIFT_MIN) || defined(PAGE_SIZE_MIN) || defined(PAGE_MASK_MIN)
+#error Arch must define all or none of the boot-time page size macros
+#else
+/* Arch does not support boot-time page size selection. */
+#define PAGE_SHIFT_MIN	PAGE_SHIFT
+#define PAGE_SIZE_MIN	PAGE_SIZE
+#define PAGE_MASK_MIN	PAGE_MASK
+#define PAGE_SHIFT_MAX	PAGE_SHIFT
+#define PAGE_SIZE_MAX	PAGE_SIZE
+#define PAGE_MASK_MAX	PAGE_MASK
+#endif
+
+/*
+ * Define a global variable (scalar or struct), whose value is derived from
+ * PAGE_SIZE and friends. When PAGE_SIZE is a compile-time constant, the global
+ * variable is simply defined with the static value. When PAGE_SIZE is
+ * determined at boot-time, a pure initcall is registered and run during boot to
+ * initialize the variable.
+ *
+ * @type: Unqualified type. Do not include "const"; implied by macro variant.
+ * @name: Variable name.
+ * @...:  Initialization value. May be scalar or initializer.
+ *
+ * "static" is declared by placing "static" before the macro.
+ *
+ * Example:
+ *
+ * struct my_struct {
+ *         int a;
+ *         char b;
+ * };
+ *
+ * static DEFINE_GLOBAL_PAGE_SIZE_VAR(struct my_struct, my_variable, {
+ *         .a = 10,
+ *         .b = 'e',
+ * });
+ */
+#if PAGE_SIZE_MIN != PAGE_SIZE_MAX
+#define __DEFINE_GLOBAL_PAGE_SIZE_VAR(type, name, attrib, ...)		\
+	type name attrib;						\
+	static int __init __attribute__((constructor)) __##name##_init(void)	\
+	{								\
+		name = (type)__VA_ARGS__;				\
+		return 0;						\
+	}
+
+#define DEFINE_GLOBAL_PAGE_SIZE_VAR(type, name, ...)			\
+	__DEFINE_GLOBAL_PAGE_SIZE_VAR(type, name, , __VA_ARGS__)
+
+#define DEFINE_GLOBAL_PAGE_SIZE_VAR_CONST(type, name, ...)		\
+	__DEFINE_GLOBAL_PAGE_SIZE_VAR(type, name, __ro_after_init, __VA_ARGS__)
+#else /* PAGE_SIZE_MIN == PAGE_SIZE_MAX */
+#define __DEFINE_GLOBAL_PAGE_SIZE_VAR(type, name, attrib, ...)		\
+	type name attrib = __VA_ARGS__;					\
+
+#define DEFINE_GLOBAL_PAGE_SIZE_VAR(type, name, ...)			\
+	__DEFINE_GLOBAL_PAGE_SIZE_VAR(type, name, , __VA_ARGS__)
+
+#define DEFINE_GLOBAL_PAGE_SIZE_VAR_CONST(type, name, ...)		\
+	__DEFINE_GLOBAL_PAGE_SIZE_VAR(const type, name, , __VA_ARGS__)
+#endif
+
+#endif /* ASM_GENERIC_PGTABLE_GEOMETRY_H */
diff --git a/init/main.c b/init/main.c
index 206acdde51f5a..ba1515eb20b9d 100644
--- a/init/main.c
+++ b/init/main.c
@@ -899,6 +899,8 @@ static void __init early_numa_node_init(void)
 #endif
 }
 
+static __init void do_ctors(void);
+
 asmlinkage __visible __init __no_sanitize_address __noreturn __no_stack_protector
 void start_kernel(void)
 {
@@ -910,6 +912,8 @@ void start_kernel(void)
 	debug_objects_early_init();
 	init_vmlinux_build_id();
 
+	do_ctors();
+
 	cgroup_init_early();
 
 	local_irq_disable();
@@ -1360,7 +1364,6 @@ static void __init do_basic_setup(void)
 	cpuset_init_smp();
 	driver_init();
 	init_irq_proc();
-	do_ctors();
 	do_initcalls();
 }
 
-- 
2.43.0


^ permalink raw reply related

* Re: [RFC PATCH 2/4] mm: add PTE_MARKER_GUARD PTE marker
From: Lorenzo Stoakes @ 2024-10-14 10:23 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andrew Morton, Suren Baghdasaryan, Liam R . Howlett,
	Matthew Wilcox, Vlastimil Babka, Paul E . McKenney,
	David Hildenbrand, linux-mm, linux-kernel, Muchun Song,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	Thomas Bogendoerfer, James E . J . Bottomley, Helge Deller,
	Chris Zankel, Max Filippov, Arnd Bergmann, linux-alpha,
	linux-mips, linux-parisc, linux-arch, Shuah Khan,
	Christian Brauner, linux-kselftest, Sidhartha Kumar
In-Reply-To: <CAG48ez0rLrTrNiT93T2fG86w_n+ARRqNxOS6OXGS-Q_V54GjoQ@mail.gmail.com>

On Fri, Oct 11, 2024 at 08:11:32PM +0200, Jann Horn wrote:
> On Fri, Sep 27, 2024 at 2:51 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> > Add a new PTE marker that results in any access causing the accessing
> > process to segfault.
> [...]
> >  static inline int is_poisoned_swp_entry(swp_entry_t entry)
> > +{
> > +       /*
> > +        * We treat guard pages as poisoned too as these have the same semantics
> > +        * as poisoned ranges, only with different fault handling.
> > +        */
> > +       return is_pte_marker_entry(entry) &&
> > +               (pte_marker_get(entry) &
> > +                (PTE_MARKER_POISONED | PTE_MARKER_GUARD));
> > +}
>
> This means MADV_FREE will also clear guard PTEs, right?

Yes, this is expected, it acts like unmap in effect (with a delayed
effect), so we give it the same semantics. The same thing happens with
hardware poisoning.

You can see in the tests what expectations we have with different
operations, we assert there this specific behaviour:

	/* Lazyfree range. */
	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_FREE), 0);

	/* This should simply clear the poison markers. */
	for (i = 0; i < 10; i++) {
		ASSERT_TRUE(try_read_write_buf(&ptr[i * page_size]));
	}

The tests somewhat self-document expected behaviour.

>
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 5c6486e33e63..6c413c3d72fd 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -1457,7 +1457,7 @@ static inline bool should_zap_folio(struct zap_details *details,
> >         return !folio_test_anon(folio);
> >  }
> >
> > -static inline bool zap_drop_file_uffd_wp(struct zap_details *details)
> > +static inline bool zap_drop_markers(struct zap_details *details)
> >  {
> >         if (!details)
> >                 return false;
> > @@ -1478,7 +1478,7 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
> >         if (vma_is_anonymous(vma))
> >                 return;
> >
> > -       if (zap_drop_file_uffd_wp(details))
> > +       if (zap_drop_markers(details))
> >                 return;
> >
> >         for (;;) {
> > @@ -1673,7 +1673,15 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> >                          * drop the marker if explicitly requested.
> >                          */
> >                         if (!vma_is_anonymous(vma) &&
> > -                           !zap_drop_file_uffd_wp(details))
> > +                           !zap_drop_markers(details))
> > +                               continue;
> > +               } else if (is_guard_swp_entry(entry)) {
> > +                       /*
> > +                        * Ordinary zapping should not remove guard PTE
> > +                        * markers. Only do so if we should remove PTE markers
> > +                        * in general.
> > +                        */
> > +                       if (!zap_drop_markers(details))
> >                                 continue;
>
> Just a comment: It's nice that the feature is restricted to anonymous
> VMAs, otherwise we'd have to figure out here what to do about
> unmap_mapping_folio() (which sets ZAP_FLAG_DROP_MARKER together with
> details.single_folio)...

Yes this is not the only issue with file-backed mappings. Readahead being
another, and plenty more.

We will probably look at how we might do this once this patch set lands,
and tackle all of these fun things then...

>
>
> >                 } else if (is_hwpoison_entry(entry) ||
> >                            is_poisoned_swp_entry(entry)) {
> > @@ -4005,6 +4013,10 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
> >         if (marker & PTE_MARKER_POISONED)
> >                 return VM_FAULT_HWPOISON;
> >
> > +       /* Hitting a guard page is always a fatal condition. */
> > +       if (marker & PTE_MARKER_GUARD)
> > +               return VM_FAULT_SIGSEGV;
> > +
> >         if (pte_marker_entry_uffd_wp(entry))
> >                 return pte_marker_handle_uffd_wp(vmf);
> >
> > --
> > 2.46.2
> >

^ permalink raw reply

* Re: [PATCH v5 7/8] execmem: add support for cache of large ROX pages
From: Christoph Hellwig @ 2024-10-14  5:55 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Christoph Hellwig, Andrew Morton, Andreas Larsson,
	Andy Lutomirski, Ard Biesheuvel, Arnd Bergmann, Borislav Petkov,
	Brian Cain, Catalin Marinas, Christophe Leroy, Dave Hansen,
	Dinh Nguyen, Geert Uytterhoeven, Guo Ren, Helge Deller,
	Huacai Chen, Ingo Molnar, Johannes Berg,
	John Paul Adrian Glaubitz, Kent Overstreet, Liam R. Howlett,
	Luis Chamberlain, Mark Rutland, Masami Hiramatsu, Matt Turner,
	Max Filippov, Michael Ellerman, Michal Simek, Oleg Nesterov,
	Palmer Dabbelt, Peter Zijlstra, Richard Weinberger, Russell King,
	Song Liu, Stafford Horne, Steven Rostedt, Thomas Bogendoerfer,
	Thomas Gleixner, Uladzislau Rezki, Vineet Gupta, Will Deacon, bpf,
	linux-alpha, linux-arch, linux-arm-kernel, linux-csky,
	linux-hexagon, linux-kernel, linux-m68k, linux-mips, linux-mm,
	linux-modules, linux-openrisc, linux-parisc, linux-riscv,
	linux-sh, linux-snps-arc, linux-trace-kernel, linux-um,
	linuxppc-dev, loongarch, sparclinux, x86
In-Reply-To: <ZwuIPZkjX0CfzhjS@kernel.org>

On Sun, Oct 13, 2024 at 11:43:41AM +0300, Mike Rapoport wrote:
> > But why?  That's pretty different from our normal style of arch hooks,
> > and introduces an indirect call in a security sensitive area.
> 
> Will change to __weak hook. 

Isn't the callback required when using the large ROX page?  I.e.
shouldn't it be an unconditional callback and not a weak override?


^ permalink raw reply

* Re: [PATCH v5 7/8] execmem: add support for cache of large ROX pages
From: Andrew Morton @ 2024-10-14  3:26 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Christoph Hellwig, Andreas Larsson, Andy Lutomirski,
	Ard Biesheuvel, Arnd Bergmann, Borislav Petkov, Brian Cain,
	Catalin Marinas, Christophe Leroy, Dave Hansen, Dinh Nguyen,
	Geert Uytterhoeven, Guo Ren, Helge Deller, Huacai Chen,
	Ingo Molnar, Johannes Berg, John Paul Adrian Glaubitz,
	Kent Overstreet, Liam R. Howlett, Luis Chamberlain, Mark Rutland,
	Masami Hiramatsu, Matt Turner, Max Filippov, Michael Ellerman,
	Michal Simek, Oleg Nesterov, Palmer Dabbelt, Peter Zijlstra,
	Richard Weinberger, Russell King, Song Liu, Stafford Horne,
	Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner,
	Uladzislau Rezki, Vineet Gupta, Will Deacon, bpf, linux-alpha,
	linux-arch, linux-arm-kernel, linux-csky, linux-hexagon,
	linux-kernel, linux-m68k, linux-mips, linux-mm, linux-modules,
	linux-openrisc, linux-parisc, linux-riscv, linux-sh,
	linux-snps-arc, linux-trace-kernel, linux-um, linuxppc-dev,
	loongarch, sparclinux, x86
In-Reply-To: <ZwuIPZkjX0CfzhjS@kernel.org>

On Sun, 13 Oct 2024 11:43:41 +0300 Mike Rapoport <rppt@kernel.org> wrote:

> > > The idea is to keep everything together and have execmem_info describe all
> > > that architecture needs. 
> > 
> > But why?  That's pretty different from our normal style of arch hooks,
> > and introduces an indirect call in a security sensitive area.
> 
> Will change to __weak hook. 
> 

Thanks, I'll drop the v1 series;

The todos which I collected are:

https://lkml.kernel.org/r/CAPhsuW66etfdU3Fvk0KsELXcgWD6_TkBFjJ-BTHQu5OejDsP2w@mail.gmail.com
https://lkml.kernel.org/r/Zwd6vH0rz0PVedLI@infradead.org
https://lkml.kernel.org/r/ZwjXz0dz-RldVNx0@infradead.org
https://lkml.kernel.org/r/202410111408.8fe6f604-lkp@intel.com

^ permalink raw reply

* Re: [PATCH v5 7/8] execmem: add support for cache of large ROX pages
From: Mike Rapoport @ 2024-10-13  8:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andrew Morton, Andreas Larsson, Andy Lutomirski, Ard Biesheuvel,
	Arnd Bergmann, Borislav Petkov, Brian Cain, Catalin Marinas,
	Christophe Leroy, Dave Hansen, Dinh Nguyen, Geert Uytterhoeven,
	Guo Ren, Helge Deller, Huacai Chen, Ingo Molnar, Johannes Berg,
	John Paul Adrian Glaubitz, Kent Overstreet, Liam R. Howlett,
	Luis Chamberlain, Mark Rutland, Masami Hiramatsu, Matt Turner,
	Max Filippov, Michael Ellerman, Michal Simek, Oleg Nesterov,
	Palmer Dabbelt, Peter Zijlstra, Richard Weinberger, Russell King,
	Song Liu, Stafford Horne, Steven Rostedt, Thomas Bogendoerfer,
	Thomas Gleixner, Uladzislau Rezki, Vineet Gupta, Will Deacon, bpf,
	linux-alpha, linux-arch, linux-arm-kernel, linux-csky,
	linux-hexagon, linux-kernel, linux-m68k, linux-mips, linux-mm,
	linux-modules, linux-openrisc, linux-parisc, linux-riscv,
	linux-sh, linux-snps-arc, linux-trace-kernel, linux-um,
	linuxppc-dev, loongarch, sparclinux, x86
In-Reply-To: <ZwjXz0dz-RldVNx0@infradead.org>

On Fri, Oct 11, 2024 at 12:46:23AM -0700, Christoph Hellwig wrote:
> On Thu, Oct 10, 2024 at 03:57:33PM +0300, Mike Rapoport wrote:
> > On Wed, Oct 09, 2024 at 11:58:33PM -0700, Christoph Hellwig wrote:
> > > On Wed, Oct 09, 2024 at 09:08:15PM +0300, Mike Rapoport wrote:
> > > >  /**
> > > >   * struct execmem_info - architecture parameters for code allocations
> > > > + * @fill_trapping_insns: set memory to contain instructions that will trap
> > > >   * @ranges: array of parameter sets defining architecture specific
> > > >   * parameters for executable memory allocations. The ranges that are not
> > > >   * explicitly initialized by an architecture use parameters defined for
> > > >   * @EXECMEM_DEFAULT.
> > > >   */
> > > >  struct execmem_info {
> > > > +	void (*fill_trapping_insns)(void *ptr, size_t size, bool writable);
> > > >  	struct execmem_range	ranges[EXECMEM_TYPE_MAX];
> > > 
> > > Why is the filler an indirect function call and not an architecture
> > > hook?
> > 
> > The idea is to keep everything together and have execmem_info describe all
> > that architecture needs. 
> 
> But why?  That's pretty different from our normal style of arch hooks,
> and introduces an indirect call in a security sensitive area.

Will change to __weak hook. 

-- 
Sincerely yours,
Mike.

^ permalink raw reply

* Re: [RFC PATCH 3/4] mm: madvise: implement lightweight guard page mechanism
From: Suren Baghdasaryan @ 2024-10-11 20:55 UTC (permalink / raw)
  To: Jann Horn
  Cc: Lorenzo Stoakes, Andrew Morton, Liam R . Howlett, Matthew Wilcox,
	Vlastimil Babka, Paul E . McKenney, David Hildenbrand, linux-mm,
	linux-kernel, Muchun Song, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, Thomas Bogendoerfer, James E . J . Bottomley,
	Helge Deller, Chris Zankel, Max Filippov, Arnd Bergmann,
	linux-alpha, linux-mips, linux-parisc, linux-arch, Shuah Khan,
	Christian Brauner, linux-kselftest, Sidhartha Kumar,
	Vlastimil Babka
In-Reply-To: <CAG48ez3ursoL-f=mYpV79Do18XPPt+MPPHNUBv6uFE1GhpOwSA@mail.gmail.com>

On Fri, Oct 11, 2024 at 11:12 AM Jann Horn <jannh@google.com> wrote:
>
> On Fri, Sep 27, 2024 at 2:51 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> > Implement a new lightweight guard page feature, that is regions of userland
> > virtual memory that, when accessed, cause a fatal signal to arise.
> [...]
> > ---
> >  arch/alpha/include/uapi/asm/mman.h     |   3 +
> >  arch/mips/include/uapi/asm/mman.h      |   3 +
> >  arch/parisc/include/uapi/asm/mman.h    |   3 +
> >  arch/xtensa/include/uapi/asm/mman.h    |   3 +
> >  include/uapi/asm-generic/mman-common.h |   3 +
>
> I kinda wonder if we could start moving the parts of those headers
> that are the same for all architectures to include/uapi/linux/mman.h
> instead... but that's maybe out of scope for this series.
>
> [...]
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index e871a72a6c32..7216e10723ae 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -60,6 +60,7 @@ static int madvise_need_mmap_write(int behavior)
> >         case MADV_POPULATE_READ:
> >         case MADV_POPULATE_WRITE:
> >         case MADV_COLLAPSE:
> > +       case MADV_GUARD_UNPOISON: /* Only poisoning needs a write lock. */
>
> What does poisoning need a write lock for? anon_vma_prepare() doesn't
> need it (it only needs mmap_lock held for reading),
> zap_page_range_single() doesn't need it, and pagewalk also doesn't
> need it as long as the range being walked is covered by a VMA, which
> it is...
>
> I see you set PGWALK_WRLOCK in guard_poison_walk_ops with a comment
> saying "We might need to install an anon_vma" - is that referring to
> an older version of the patch where the anon_vma_prepare() call was
> inside the pagewalk callback or something like that? Either way,
> anon_vma_prepare() doesn't need write locks (it can't, it has to work
> from the page fault handling path).

I was wondering about that too and I can't find any reason for
write-locking the mm for this operation. PGWALK_WRLOCK should also be
changed to PGWALK_RDLOCK as we are not modifying the VMA.

BTW, I'm testing your patchset on Android and so far it is stable!

>
> >                 return 0;
> >         default:
> >                 /* be safe, default to 1. list exceptions explicitly */
> [...]
> > +static long madvise_guard_poison(struct vm_area_struct *vma,
> > +                                struct vm_area_struct **prev,
> > +                                unsigned long start, unsigned long end)
> > +{
> > +       long err;
> > +       bool retried = false;
> > +
> > +       *prev = vma;
> > +       if (!is_valid_guard_vma(vma, /* allow_locked = */false))
> > +               return -EINVAL;
> > +
> > +       /*
> > +        * Optimistically try to install the guard poison pages first. If any
> > +        * non-guard pages are encountered, give up and zap the range before
> > +        * trying again.
> > +        */
> > +       while (true) {
> > +               unsigned long num_installed = 0;
> > +
> > +               /* Returns < 0 on error, == 0 if success, > 0 if zap needed. */
> > +               err = walk_page_range_mm(vma->vm_mm, start, end,
> > +                                        &guard_poison_walk_ops,
> > +                                        &num_installed);
> > +               /*
> > +                * If we install poison markers, then the range is no longer
> > +                * empty from a page table perspective and therefore it's
> > +                * appropriate to have an anon_vma.
> > +                *
> > +                * This ensures that on fork, we copy page tables correctly.
> > +                */
> > +               if (err >= 0 && num_installed > 0) {
> > +                       int err_anon = anon_vma_prepare(vma);
>
> I'd move this up, to before we create poison PTEs. There's no harm in
> attaching an anon_vma to the VMA even if the rest of the operation
> fails; and I think it would be weird to have error paths that don't
> attach an anon_vma even though they .
>
> > +                       if (err_anon)
> > +                               err = err_anon;
> > +               }
> > +
> > +               if (err <= 0)
> > +                       return err;
> > +
> > +               if (!retried)
> > +                       /*
> > +                        * OK some of the range have non-guard pages mapped, zap
> > +                        * them. This leaves existing guard pages in place.
> > +                        */
> > +                       zap_page_range_single(vma, start, end - start, NULL);
> > +               else
> > +                       /*
> > +                        * If we reach here, then there is a racing fault that
> > +                        * has populated the PTE after we zapped. Give up and
> > +                        * let the user know to try again.
> > +                        */
> > +                       return -EAGAIN;
>
> Hmm, yeah, it would be nice if we could avoid telling userspace to
> loop on -EAGAIN but I guess we don't have any particularly good
> options here? Well, we could bail out with -EINTR if a (fatal?) signal
> is pending and otherwise keep looping... if we'd tell userspace "try
> again on -EAGAIN", we might as well do that in the kernel...
>
> (Personally I would put curly braces around these branches because
> they occupy multiple lines, though the coding style doesn't explicitly
> say that, so I guess maybe it's a matter of personal preference...
> adding curly braces here would match what is done, for example, in
> relocate_vma_down().)
>
> > +               retried = true;
> > +       }
> > +}
> > +
> > +static int guard_unpoison_pte_entry(pte_t *pte, unsigned long addr,
> > +                                   unsigned long next, struct mm_walk *walk)
> > +{
> > +       pte_t ptent = ptep_get(pte);
> > +
> > +       if (is_guard_pte_marker(ptent)) {
> > +               /* Simply clear the PTE marker. */
> > +               pte_clear_not_present_full(walk->mm, addr, pte, true);
>
> I think that last parameter probably should be "false"? The sparc code
> calls it "fullmm", which is a term the MM code uses when talking about
> operations that remove all mappings in the entire mm_struct because
> the process has died, which allows using some faster special-case
> version of TLB shootdown or something along those lines.
>
> > +               update_mmu_cache(walk->vma, addr, pte);
> > +       }
> > +
> > +       return 0;
> > +}
> > +
> > +static const struct mm_walk_ops guard_unpoison_walk_ops = {
> > +       .pte_entry              = guard_unpoison_pte_entry,
> > +       .walk_lock              = PGWALK_RDLOCK,
> > +};
>
> It is a _little_ weird that unpoisoning creates page tables when they
> don't already exist, which will also prevent creating THP entries on
> fault in such areas afterwards... but I guess it doesn't really matter
> given that poisoning has that effect, too, and you probably usually
> won't call MADV_GUARD_UNPOISON on an area that hasn't been poisoned
> before... so I guess this is not an actionable comment.

^ permalink raw reply

* Re: [PATCH net-next v25 10/13] net: add SO_DEVMEM_DONTNEED setsockopt to release RX frags
From: Mina Almasry @ 2024-10-11 19:38 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Lai, Yi, netdev, linux-kernel, linux-doc, linux-alpha, linux-mips,
	linux-parisc, sparclinux, linux-trace-kernel, linux-arch, bpf,
	linux-kselftest, linux-media, dri-devel, Donald Hunter,
	David S. Miller, Eric Dumazet, Paolo Abeni, Jonathan Corbet,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	Thomas Bogendoerfer, James E.J. Bottomley, Helge Deller,
	Andreas Larsson, Jesper Dangaard Brouer, Ilias Apalodimas,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Arnd Bergmann, Steffen Klassert, Herbert Xu, David Ahern,
	Willem de Bruijn, Björn Töpel, Magnus Karlsson,
	Maciej Fijalkowski, Jonathan Lemon, Shuah Khan,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend, Sumit Semwal,
	Christian König, Pavel Begunkov, David Wei, Jason Gunthorpe,
	Yunsheng Lin, Shailend Chand, Harshitha Ramamurthy, Shakeel Butt,
	Jeroen de Borst, Praveen Kaligineedi, Bagas Sanjaya,
	Christoph Hellwig, Nikolay Aleksandrov, Taehee Yoo,
	Willem de Bruijn, Kaiyuan Zhang, yi1.lai
In-Reply-To: <20241011082707.5de66f15@kernel.org>

On Fri, Oct 11, 2024 at 8:27 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 10 Oct 2024 12:05:38 -0700 Mina Almasry wrote:
> > diff --git a/net/core/sock.c b/net/core/sock.c
> > index 083d438d8b6f..cb3d8b19de14 100644
> > --- a/net/core/sock.c
> > +++ b/net/core/sock.c
> > @@ -1071,11 +1071,11 @@ sock_devmem_dontneed(struct sock *sk,
> > sockptr_t optval, unsigned int optlen)
> >             optlen > sizeof(*tokens) * MAX_DONTNEED_TOKENS)
> >                 return -EINVAL;
> >
> > -       tokens = kvmalloc_array(optlen, sizeof(*tokens), GFP_KERNEL);
> > +       num_tokens = optlen / sizeof(struct dmabuf_token);
> > +       tokens = kvmalloc_array(num_tokens, sizeof(*tokens), GFP_KERNEL);
> >         if (!tokens)
> >                 return -ENOMEM;
> >
> > -       num_tokens = optlen / sizeof(struct dmabuf_token);
> >         if (copy_from_sockptr(tokens, optval, optlen)) {
> >                 kvfree(tokens);
> >                 return -EFAULT;
> > @@ -1083,6 +1083,10 @@ sock_devmem_dontneed(struct sock *sk, sockptr_t
> > optval, unsigned int optlen)
> >
> >         xa_lock_bh(&sk->sk_user_frags);
> >         for (i = 0; i < num_tokens; i++) {
> > +
> > +               if (tokens[i].token_count > MAX_DONTNEED_TOKENS)
> > +                       continue;
>
> For the real fix let's scan the tokens before we take the xa lock
> and return an error rather than silently skipping?
>
> >                 for (j = 0; j < tokens[i].token_count; j++) {
>

Yes, sorry, I called the diff above an 'untested fix' but it was more
of a hack to see if I got the root cause right. For a proper fix, we
should do exactly that. Scan and see how many tokens the user is
asking us to free ahead of time, then exit early if it's too much
before we acquire locks and loop. Will do!

-- 
Thanks,
Mina

^ permalink raw reply

* Re: [RFC PATCH 3/4] mm: madvise: implement lightweight guard page mechanism
From: Jann Horn @ 2024-10-11 18:11 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Suren Baghdasaryan, Liam R . Howlett,
	Matthew Wilcox, Vlastimil Babka, Paul E . McKenney,
	David Hildenbrand, linux-mm, linux-kernel, Muchun Song,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	Thomas Bogendoerfer, James E . J . Bottomley, Helge Deller,
	Chris Zankel, Max Filippov, Arnd Bergmann, linux-alpha,
	linux-mips, linux-parisc, linux-arch, Shuah Khan,
	Christian Brauner, linux-kselftest, Sidhartha Kumar,
	Vlastimil Babka
In-Reply-To: <a578ee9bb656234d3a19bf9e97c3012378d31a19.1727440966.git.lorenzo.stoakes@oracle.com>

On Fri, Sep 27, 2024 at 2:51 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
> Implement a new lightweight guard page feature, that is regions of userland
> virtual memory that, when accessed, cause a fatal signal to arise.
[...]
> ---
>  arch/alpha/include/uapi/asm/mman.h     |   3 +
>  arch/mips/include/uapi/asm/mman.h      |   3 +
>  arch/parisc/include/uapi/asm/mman.h    |   3 +
>  arch/xtensa/include/uapi/asm/mman.h    |   3 +
>  include/uapi/asm-generic/mman-common.h |   3 +

I kinda wonder if we could start moving the parts of those headers
that are the same for all architectures to include/uapi/linux/mman.h
instead... but that's maybe out of scope for this series.

[...]
> diff --git a/mm/madvise.c b/mm/madvise.c
> index e871a72a6c32..7216e10723ae 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -60,6 +60,7 @@ static int madvise_need_mmap_write(int behavior)
>         case MADV_POPULATE_READ:
>         case MADV_POPULATE_WRITE:
>         case MADV_COLLAPSE:
> +       case MADV_GUARD_UNPOISON: /* Only poisoning needs a write lock. */

What does poisoning need a write lock for? anon_vma_prepare() doesn't
need it (it only needs mmap_lock held for reading),
zap_page_range_single() doesn't need it, and pagewalk also doesn't
need it as long as the range being walked is covered by a VMA, which
it is...

I see you set PGWALK_WRLOCK in guard_poison_walk_ops with a comment
saying "We might need to install an anon_vma" - is that referring to
an older version of the patch where the anon_vma_prepare() call was
inside the pagewalk callback or something like that? Either way,
anon_vma_prepare() doesn't need write locks (it can't, it has to work
from the page fault handling path).

>                 return 0;
>         default:
>                 /* be safe, default to 1. list exceptions explicitly */
[...]
> +static long madvise_guard_poison(struct vm_area_struct *vma,
> +                                struct vm_area_struct **prev,
> +                                unsigned long start, unsigned long end)
> +{
> +       long err;
> +       bool retried = false;
> +
> +       *prev = vma;
> +       if (!is_valid_guard_vma(vma, /* allow_locked = */false))
> +               return -EINVAL;
> +
> +       /*
> +        * Optimistically try to install the guard poison pages first. If any
> +        * non-guard pages are encountered, give up and zap the range before
> +        * trying again.
> +        */
> +       while (true) {
> +               unsigned long num_installed = 0;
> +
> +               /* Returns < 0 on error, == 0 if success, > 0 if zap needed. */
> +               err = walk_page_range_mm(vma->vm_mm, start, end,
> +                                        &guard_poison_walk_ops,
> +                                        &num_installed);
> +               /*
> +                * If we install poison markers, then the range is no longer
> +                * empty from a page table perspective and therefore it's
> +                * appropriate to have an anon_vma.
> +                *
> +                * This ensures that on fork, we copy page tables correctly.
> +                */
> +               if (err >= 0 && num_installed > 0) {
> +                       int err_anon = anon_vma_prepare(vma);

I'd move this up, to before we create poison PTEs. There's no harm in
attaching an anon_vma to the VMA even if the rest of the operation
fails; and I think it would be weird to have error paths that don't
attach an anon_vma even though they .

> +                       if (err_anon)
> +                               err = err_anon;
> +               }
> +
> +               if (err <= 0)
> +                       return err;
> +
> +               if (!retried)
> +                       /*
> +                        * OK some of the range have non-guard pages mapped, zap
> +                        * them. This leaves existing guard pages in place.
> +                        */
> +                       zap_page_range_single(vma, start, end - start, NULL);
> +               else
> +                       /*
> +                        * If we reach here, then there is a racing fault that
> +                        * has populated the PTE after we zapped. Give up and
> +                        * let the user know to try again.
> +                        */
> +                       return -EAGAIN;

Hmm, yeah, it would be nice if we could avoid telling userspace to
loop on -EAGAIN but I guess we don't have any particularly good
options here? Well, we could bail out with -EINTR if a (fatal?) signal
is pending and otherwise keep looping... if we'd tell userspace "try
again on -EAGAIN", we might as well do that in the kernel...

(Personally I would put curly braces around these branches because
they occupy multiple lines, though the coding style doesn't explicitly
say that, so I guess maybe it's a matter of personal preference...
adding curly braces here would match what is done, for example, in
relocate_vma_down().)

> +               retried = true;
> +       }
> +}
> +
> +static int guard_unpoison_pte_entry(pte_t *pte, unsigned long addr,
> +                                   unsigned long next, struct mm_walk *walk)
> +{
> +       pte_t ptent = ptep_get(pte);
> +
> +       if (is_guard_pte_marker(ptent)) {
> +               /* Simply clear the PTE marker. */
> +               pte_clear_not_present_full(walk->mm, addr, pte, true);

I think that last parameter probably should be "false"? The sparc code
calls it "fullmm", which is a term the MM code uses when talking about
operations that remove all mappings in the entire mm_struct because
the process has died, which allows using some faster special-case
version of TLB shootdown or something along those lines.

> +               update_mmu_cache(walk->vma, addr, pte);
> +       }
> +
> +       return 0;
> +}
> +
> +static const struct mm_walk_ops guard_unpoison_walk_ops = {
> +       .pte_entry              = guard_unpoison_pte_entry,
> +       .walk_lock              = PGWALK_RDLOCK,
> +};

It is a _little_ weird that unpoisoning creates page tables when they
don't already exist, which will also prevent creating THP entries on
fault in such areas afterwards... but I guess it doesn't really matter
given that poisoning has that effect, too, and you probably usually
won't call MADV_GUARD_UNPOISON on an area that hasn't been poisoned
before... so I guess this is not an actionable comment.

^ permalink raw reply

* Re: [RFC PATCH 2/4] mm: add PTE_MARKER_GUARD PTE marker
From: Jann Horn @ 2024-10-11 18:11 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Suren Baghdasaryan, Liam R . Howlett,
	Matthew Wilcox, Vlastimil Babka, Paul E . McKenney,
	David Hildenbrand, linux-mm, linux-kernel, Muchun Song,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	Thomas Bogendoerfer, James E . J . Bottomley, Helge Deller,
	Chris Zankel, Max Filippov, Arnd Bergmann, linux-alpha,
	linux-mips, linux-parisc, linux-arch, Shuah Khan,
	Christian Brauner, linux-kselftest, Sidhartha Kumar
In-Reply-To: <03570f8a0ad2a9c0a92cc0c594e375c4185eccdc.1727440966.git.lorenzo.stoakes@oracle.com>

On Fri, Sep 27, 2024 at 2:51 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
> Add a new PTE marker that results in any access causing the accessing
> process to segfault.
[...]
>  static inline int is_poisoned_swp_entry(swp_entry_t entry)
> +{
> +       /*
> +        * We treat guard pages as poisoned too as these have the same semantics
> +        * as poisoned ranges, only with different fault handling.
> +        */
> +       return is_pte_marker_entry(entry) &&
> +               (pte_marker_get(entry) &
> +                (PTE_MARKER_POISONED | PTE_MARKER_GUARD));
> +}

This means MADV_FREE will also clear guard PTEs, right?

> diff --git a/mm/memory.c b/mm/memory.c
> index 5c6486e33e63..6c413c3d72fd 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1457,7 +1457,7 @@ static inline bool should_zap_folio(struct zap_details *details,
>         return !folio_test_anon(folio);
>  }
>
> -static inline bool zap_drop_file_uffd_wp(struct zap_details *details)
> +static inline bool zap_drop_markers(struct zap_details *details)
>  {
>         if (!details)
>                 return false;
> @@ -1478,7 +1478,7 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
>         if (vma_is_anonymous(vma))
>                 return;
>
> -       if (zap_drop_file_uffd_wp(details))
> +       if (zap_drop_markers(details))
>                 return;
>
>         for (;;) {
> @@ -1673,7 +1673,15 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>                          * drop the marker if explicitly requested.
>                          */
>                         if (!vma_is_anonymous(vma) &&
> -                           !zap_drop_file_uffd_wp(details))
> +                           !zap_drop_markers(details))
> +                               continue;
> +               } else if (is_guard_swp_entry(entry)) {
> +                       /*
> +                        * Ordinary zapping should not remove guard PTE
> +                        * markers. Only do so if we should remove PTE markers
> +                        * in general.
> +                        */
> +                       if (!zap_drop_markers(details))
>                                 continue;

Just a comment: It's nice that the feature is restricted to anonymous
VMAs, otherwise we'd have to figure out here what to do about
unmap_mapping_folio() (which sets ZAP_FLAG_DROP_MARKER together with
details.single_folio)...


>                 } else if (is_hwpoison_entry(entry) ||
>                            is_poisoned_swp_entry(entry)) {
> @@ -4005,6 +4013,10 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
>         if (marker & PTE_MARKER_POISONED)
>                 return VM_FAULT_HWPOISON;
>
> +       /* Hitting a guard page is always a fatal condition. */
> +       if (marker & PTE_MARKER_GUARD)
> +               return VM_FAULT_SIGSEGV;
> +
>         if (pte_marker_entry_uffd_wp(entry))
>                 return pte_marker_handle_uffd_wp(vmf);
>
> --
> 2.46.2
>

^ permalink raw reply

* Re: [RFC PATCH 1/4] mm: pagewalk: add the ability to install PTEs
From: Jann Horn @ 2024-10-11 18:11 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Suren Baghdasaryan, Liam R . Howlett,
	Matthew Wilcox, Vlastimil Babka, Paul E . McKenney,
	David Hildenbrand, linux-mm, linux-kernel, Muchun Song,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	Thomas Bogendoerfer, James E . J . Bottomley, Helge Deller,
	Chris Zankel, Max Filippov, Arnd Bergmann, linux-alpha,
	linux-mips, linux-parisc, linux-arch, Shuah Khan,
	Christian Brauner, linux-kselftest, Sidhartha Kumar
In-Reply-To: <59e218670565accf978aeb8cf4745de4c0738773.1727440966.git.lorenzo.stoakes@oracle.com>

On Fri, Sep 27, 2024 at 2:51 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
> Rather than add yet another implementation, we extend the generic pagewalk
> logic to enable the installation of page table entries by adding a new
> install_pte() callback in mm_walk_ops. If this is specified, then upon
> encountering a missing page table entry, we allocate and install a new one
> and continue the traversal.
[...]
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Reviewed-by: Jann Horn <jannh@google.com>

^ permalink raw reply

* Re: [PATCH net-next v25 10/13] net: add SO_DEVMEM_DONTNEED setsockopt to release RX frags
From: Jakub Kicinski @ 2024-10-11 15:27 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Lai, Yi, netdev, linux-kernel, linux-doc, linux-alpha, linux-mips,
	linux-parisc, sparclinux, linux-trace-kernel, linux-arch, bpf,
	linux-kselftest, linux-media, dri-devel, Donald Hunter,
	David S. Miller, Eric Dumazet, Paolo Abeni, Jonathan Corbet,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	Thomas Bogendoerfer, James E.J. Bottomley, Helge Deller,
	Andreas Larsson, Jesper Dangaard Brouer, Ilias Apalodimas,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Arnd Bergmann, Steffen Klassert, Herbert Xu, David Ahern,
	Willem de Bruijn, Björn Töpel, Magnus Karlsson,
	Maciej Fijalkowski, Jonathan Lemon, Shuah Khan,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend, Sumit Semwal,
	Christian König, Pavel Begunkov, David Wei, Jason Gunthorpe,
	Yunsheng Lin, Shailend Chand, Harshitha Ramamurthy, Shakeel Butt,
	Jeroen de Borst, Praveen Kaligineedi, Bagas Sanjaya,
	Christoph Hellwig, Nikolay Aleksandrov, Taehee Yoo,
	Willem de Bruijn, Kaiyuan Zhang, yi1.lai
In-Reply-To: <CAHS8izPuEUA20BDXvwq2vW-24ez36YFJFMQok-oBDbgk6bajSA@mail.gmail.com>

On Thu, 10 Oct 2024 12:05:38 -0700 Mina Almasry wrote:
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 083d438d8b6f..cb3d8b19de14 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -1071,11 +1071,11 @@ sock_devmem_dontneed(struct sock *sk,
> sockptr_t optval, unsigned int optlen)
>             optlen > sizeof(*tokens) * MAX_DONTNEED_TOKENS)
>                 return -EINVAL;
> 
> -       tokens = kvmalloc_array(optlen, sizeof(*tokens), GFP_KERNEL);
> +       num_tokens = optlen / sizeof(struct dmabuf_token);
> +       tokens = kvmalloc_array(num_tokens, sizeof(*tokens), GFP_KERNEL);
>         if (!tokens)
>                 return -ENOMEM;
> 
> -       num_tokens = optlen / sizeof(struct dmabuf_token);
>         if (copy_from_sockptr(tokens, optval, optlen)) {
>                 kvfree(tokens);
>                 return -EFAULT;
> @@ -1083,6 +1083,10 @@ sock_devmem_dontneed(struct sock *sk, sockptr_t
> optval, unsigned int optlen)
> 
>         xa_lock_bh(&sk->sk_user_frags);
>         for (i = 0; i < num_tokens; i++) {
> +
> +               if (tokens[i].token_count > MAX_DONTNEED_TOKENS)
> +                       continue;

For the real fix let's scan the tokens before we take the xa lock
and return an error rather than silently skipping?

>                 for (j = 0; j < tokens[i].token_count; j++) {


^ permalink raw reply

* Re: [PATCH v5 6/8] x86/module: perpare module loading for ROX allocations of text
From: Mike Rapoport @ 2024-10-11 12:58 UTC (permalink / raw)
  To: Nathan Chancellor
  Cc: Andrew Morton, Andreas Larsson, Andy Lutomirski, Ard Biesheuvel,
	Arnd Bergmann, Borislav Petkov, Brian Cain, Catalin Marinas,
	Christoph Hellwig, Christophe Leroy, Dave Hansen, Dinh Nguyen,
	Geert Uytterhoeven, Guo Ren, Helge Deller, Huacai Chen,
	Ingo Molnar, Johannes Berg, John Paul Adrian Glaubitz,
	Kent Overstreet, Liam R. Howlett, Luis Chamberlain, Mark Rutland,
	Masami Hiramatsu, Matt Turner, Max Filippov, Michael Ellerman,
	Michal Simek, Oleg Nesterov, Palmer Dabbelt, Peter Zijlstra,
	Richard Weinberger, Russell King, Song Liu, Stafford Horne,
	Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner,
	Uladzislau Rezki, Vineet Gupta, Will Deacon, bpf, linux-alpha,
	linux-arch, linux-arm-kernel, linux-csky, linux-hexagon,
	linux-kernel, linux-m68k, linux-mips, linux-mm, linux-modules,
	linux-openrisc, linux-parisc, linux-riscv, linux-sh,
	linux-snps-arc, linux-trace-kernel, linux-um, linuxppc-dev,
	loongarch, sparclinux, x86
In-Reply-To: <20241010225411.GA922684@thelio-3990X>

On Thu, Oct 10, 2024 at 03:54:11PM -0700, Nathan Chancellor wrote:
> Hi Mike,
> 
> On Wed, Oct 09, 2024 at 09:08:14PM +0300, Mike Rapoport wrote:
> > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> > 
> > When module text memory will be allocated with ROX permissions, the
> > memory at the actual address where the module will live will contain
> > invalid instructions and there will be a writable copy that contains the
> > actual module code.
> > 
> > Update relocations and alternatives patching to deal with it.
> > 
> > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> 
> I bisected a boot failure that I see with CONFIG_CFI_CLANG enabled to
> this change as commit be712757cabd ("x86/module: perpare module loading
> for ROX allocations of text") in -next.
 
>   [    0.000000] Linux version 6.12.0-rc2-00140-gbe712757cabd (nathan@n3-xlarge-x86) (ClangBuiltLinux clang version 19.1.0 (https://github.com/llvm/llvm-project.git a4bf6cd7cfb1a1421ba92bca9d017b49936c55e4), ClangBuiltLinux LLD 19.1.0 (https://github.com/llvm/llvm-project.git a4bf6cd7cfb1a1421ba92bca9d017b49936c55e4)) #1 SMP PREEMPT_DYNAMIC Thu Oct 10 22:42:57 UTC 2024
>   ...
>   [    0.092204] Speculative Store Bypass: Mitigation: Speculative Store Bypass disabled via prctl
>   [    0.093207] TAA: Mitigation: TSX disabled
>   [    0.093711] MMIO Stale Data: Mitigation: Clear CPU buffers
>   [    0.094228] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
>   [    0.095203] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
>   [    0.096203] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
>   [    0.097203] x86/fpu: Supporting XSAVE feature 0x020: 'AVX-512 opmask'
>   [    0.098003] x86/fpu: Supporting XSAVE feature 0x040: 'AVX-512 Hi256'
>   [    0.098203] x86/fpu: Supporting XSAVE feature 0x080: 'AVX-512 ZMM_Hi256'
>   [    0.099203] x86/fpu: Supporting XSAVE feature 0x200: 'Protection Keys User registers'
>   [    0.100204] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
>   [    0.101204] x86/fpu: xstate_offset[5]:  832, xstate_sizes[5]:   64
>   [    0.102203] x86/fpu: xstate_offset[6]:  896, xstate_sizes[6]:  512
>   [    0.103204] x86/fpu: xstate_offset[7]: 1408, xstate_sizes[7]: 1024
>   [    0.104051] x86/fpu: xstate_offset[9]: 2432, xstate_sizes[9]:    8
>   [    0.104204] x86/fpu: Enabled xstate features 0x2e7, context size is 2440 bytes, using 'compacted' format.
> 
> then nothing after that. Boot is successful if CFI is not enabled (the
> initrd will just shutdown the machine after printing the version string).
> 
> If there is any further information I can provide or patches I can test,
> I am more than happy to do so.

I overlooked how cfi_*_callers routines update addr.
This patch should fix it:

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 3b3fa93af3b1..cf782f431110 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -1148,11 +1148,13 @@ static int cfi_disable_callers(s32 *start, s32 *end, struct module *mod)
 
 	for (s = start; s < end; s++) {
 		void *addr = (void *)s + *s;
-		void *wr_addr = module_writable_address(mod, addr);
+		void *wr_addr;
 		u32 hash;
 
 		addr -= fineibt_caller_size;
-		hash = decode_caller_hash(addr);
+		wr_addr = module_writable_address(mod, addr);
+		hash = decode_caller_hash(wr_addr);
+
 		if (!hash) /* nocfi callers */
 			continue;
 
@@ -1172,11 +1174,12 @@ static int cfi_enable_callers(s32 *start, s32 *end, struct module *mod)
 
 	for (s = start; s < end; s++) {
 		void *addr = (void *)s + *s;
-		void *wr_addr = module_writable_address(mod, addr);
+		void *wr_addr;
 		u32 hash;
 
 		addr -= fineibt_caller_size;
-		hash = decode_caller_hash(addr);
+		wr_addr = module_writable_address(mod, addr);
+		hash = decode_caller_hash(wr_addr);
 		if (!hash) /* nocfi callers */
 			continue;
 
@@ -1249,11 +1252,12 @@ static int cfi_rand_callers(s32 *start, s32 *end, struct module *mod)
 
 	for (s = start; s < end; s++) {
 		void *addr = (void *)s + *s;
-		void *wr_addr = module_writable_address(mod, addr);
+		void *wr_addr;
 		u32 hash;
 
 		addr -= fineibt_caller_size;
-		hash = decode_caller_hash(addr);
+		wr_addr = module_writable_address(mod, addr);
+		hash = decode_caller_hash(wr_addr);
 		if (hash) {
 			hash = -cfi_rehash(hash);
 			text_poke_early(wr_addr + 2, &hash, 4);
@@ -1269,14 +1273,15 @@ static int cfi_rewrite_callers(s32 *start, s32 *end, struct module *mod)
 
 	for (s = start; s < end; s++) {
 		void *addr = (void *)s + *s;
-		void *wr_addr = module_writable_address(mod, addr);
+		void *wr_addr;
 		u32 hash;
 
 		addr -= fineibt_caller_size;
-		hash = decode_caller_hash(addr);
+		wr_addr = module_writable_address(mod, addr);
+		hash = decode_caller_hash(wr_addr);
 		if (hash) {
 			text_poke_early(wr_addr, fineibt_caller_start, fineibt_caller_size);
-			WARN_ON(*(u32 *)(addr + fineibt_caller_hash) != 0x12345678);
+			WARN_ON(*(u32 *)(wr_addr + fineibt_caller_hash) != 0x12345678);
 			text_poke_early(wr_addr + fineibt_caller_hash, &hash, 4);
 		}
 		/* rely on apply_retpolines() */
 
> Cheers,
> Nathan

-- 
Sincerely yours,
Mike.

^ permalink raw reply related

* Re: [PATCH v5 7/8] execmem: add support for cache of large ROX pages
From: Christoph Hellwig @ 2024-10-11  7:46 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Christoph Hellwig, Andrew Morton, Andreas Larsson,
	Andy Lutomirski, Ard Biesheuvel, Arnd Bergmann, Borislav Petkov,
	Brian Cain, Catalin Marinas, Christophe Leroy, Dave Hansen,
	Dinh Nguyen, Geert Uytterhoeven, Guo Ren, Helge Deller,
	Huacai Chen, Ingo Molnar, Johannes Berg,
	John Paul Adrian Glaubitz, Kent Overstreet, Liam R. Howlett,
	Luis Chamberlain, Mark Rutland, Masami Hiramatsu, Matt Turner,
	Max Filippov, Michael Ellerman, Michal Simek, Oleg Nesterov,
	Palmer Dabbelt, Peter Zijlstra, Richard Weinberger, Russell King,
	Song Liu, Stafford Horne, Steven Rostedt, Thomas Bogendoerfer,
	Thomas Gleixner, Uladzislau Rezki, Vineet Gupta, Will Deacon, bpf,
	linux-alpha, linux-arch, linux-arm-kernel, linux-csky,
	linux-hexagon, linux-kernel, linux-m68k, linux-mips, linux-mm,
	linux-modules, linux-openrisc, linux-parisc, linux-riscv,
	linux-sh, linux-snps-arc, linux-trace-kernel, linux-um,
	linuxppc-dev, loongarch, sparclinux, x86
In-Reply-To: <ZwfPPZrxHzQgYfx7@kernel.org>

On Thu, Oct 10, 2024 at 03:57:33PM +0300, Mike Rapoport wrote:
> On Wed, Oct 09, 2024 at 11:58:33PM -0700, Christoph Hellwig wrote:
> > On Wed, Oct 09, 2024 at 09:08:15PM +0300, Mike Rapoport wrote:
> > >  /**
> > >   * struct execmem_info - architecture parameters for code allocations
> > > + * @fill_trapping_insns: set memory to contain instructions that will trap
> > >   * @ranges: array of parameter sets defining architecture specific
> > >   * parameters for executable memory allocations. The ranges that are not
> > >   * explicitly initialized by an architecture use parameters defined for
> > >   * @EXECMEM_DEFAULT.
> > >   */
> > >  struct execmem_info {
> > > +	void (*fill_trapping_insns)(void *ptr, size_t size, bool writable);
> > >  	struct execmem_range	ranges[EXECMEM_TYPE_MAX];
> > 
> > Why is the filler an indirect function call and not an architecture
> > hook?
> 
> The idea is to keep everything together and have execmem_info describe all
> that architecture needs. 

But why?  That's pretty different from our normal style of arch hooks,
and introduces an indirect call in a security sensitive area.


^ permalink raw reply

* Re: [PATCH net-next v25 10/13] net: add SO_DEVMEM_DONTNEED setsockopt to release RX frags
From: Lai, Yi @ 2024-10-11  2:53 UTC (permalink / raw)
  To: Mina Almasry
  Cc: netdev, linux-kernel, linux-doc, linux-alpha, linux-mips,
	linux-parisc, sparclinux, linux-trace-kernel, linux-arch, bpf,
	linux-kselftest, linux-media, dri-devel, Donald Hunter,
	Jakub Kicinski, David S. Miller, Eric Dumazet, Paolo Abeni,
	Jonathan Corbet, Richard Henderson, Ivan Kokshaysky, Matt Turner,
	Thomas Bogendoerfer, James E.J. Bottomley, Helge Deller,
	Andreas Larsson, Jesper Dangaard Brouer, Ilias Apalodimas,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Arnd Bergmann, Steffen Klassert, Herbert Xu, David Ahern,
	Willem de Bruijn, Björn Töpel, Magnus Karlsson,
	Maciej Fijalkowski, Jonathan Lemon, Shuah Khan,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend, Sumit Semwal,
	Christian König, Pavel Begunkov, David Wei, Jason Gunthorpe,
	Yunsheng Lin, Shailend Chand, Harshitha Ramamurthy, Shakeel Butt,
	Jeroen de Borst, Praveen Kaligineedi, Bagas Sanjaya,
	Christoph Hellwig, Nikolay Aleksandrov, Taehee Yoo,
	Willem de Bruijn, Kaiyuan Zhang, yi1.lai
In-Reply-To: <CAHS8izPuEUA20BDXvwq2vW-24ez36YFJFMQok-oBDbgk6bajSA@mail.gmail.com>

On Thu, Oct 10, 2024 at 12:05:38PM -0700, Mina Almasry wrote:
> On Thu, Oct 10, 2024 at 4:17 AM Lai, Yi <yi1.lai@linux.intel.com> wrote:
> >
> > Hi Mina Almasry,
> >
> > Greetings!
> >
> > I used Syzkaller and found that there is BUG: soft lockup inqt in linux-next tree next-20241008
> >
> > After bisection and the first bad commit is:
> > "
> > 678f6e28b5f6 net: add SO_DEVMEM_DONTNEED setsockopt to release RX frags
> > "
> >
> > All detailed into can be found at:
> > https://github.com/laifryiee/syzkaller_logs/tree/main/241009_103423_do_sock_setsockopt
> > Syzkaller repro code:
> > https://github.com/laifryiee/syzkaller_logs/tree/main/241009_103423_do_sock_setsockopt/repro.c
> > Syzkaller repro syscall steps:
> > https://github.com/laifryiee/syzkaller_logs/tree/main/241009_103423_do_sock_setsockopt/repro.prog
> > Syzkaller report:
> > https://github.com/laifryiee/syzkaller_logs/tree/main/241009_103423_do_sock_setsockopt/repro.report
> > Kconfig(make olddefconfig):
> > https://github.com/laifryiee/syzkaller_logs/tree/main/241009_103423_do_sock_setsockopt/kconfig_origin
> > Bisect info:
> > https://github.com/laifryiee/syzkaller_logs/tree/main/241009_103423_do_sock_setsockopt/bisect_info.log
> > bzImage:
> > https://github.com/laifryiee/syzkaller_logs/raw/refs/heads/main/241009_103423_do_sock_setsockopt/bzImage_8cf0b93919e13d1e8d4466eb4080a4c4d9d66d7b
> > Issue dmesg:
> > https://github.com/laifryiee/syzkaller_logs/blob/main/241009_103423_do_sock_setsockopt/8cf0b93919e13d1e8d4466eb4080a4c4d9d66d7b_dmesg.log
> >
> > "
> > [   48.825073]  ? __lock_acquire+0x1b0f/0x5c90
> > [   48.825419]  ? __pfx___lock_acquire+0x10/0x10
> > [   48.825774]  sock_setsockopt+0x68/0x90
> > [   48.826117]  do_sock_setsockopt+0x3fb/0x480
> > [   48.826455]  ? __pfx_do_sock_setsockopt+0x10/0x10
> > [   48.826829]  ? lock_release+0x441/0x870
> > [   48.827140]  ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
> > [   48.827558]  ? fdget+0x188/0x230
> > [   48.827846]  __sys_setsockopt+0x131/0x200
> > [   48.828184]  ? __pfx___sys_setsockopt+0x10/0x10
> > [   48.828551]  ? seqcount_lockdep_reader_access.constprop.0+0xc0/0xd0
> > [   48.829042]  ? __sanitizer_cov_trace_cmp4+0x1a/0x20
> > [   48.829425]  ? ktime_get_coarse_real_ts64+0xbf/0xf0
> > [   48.829817]  __x64_sys_setsockopt+0xc6/0x160
> > [   48.830160]  ? syscall_trace_enter+0x14a/0x230
> > [   48.830520]  x64_sys_call+0x6cf/0x20d0
> > [   48.830825]  do_syscall_64+0x6d/0x140
> > [   48.831124]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > [   48.831517] RIP: 0033:0x7f26cdc3ee5d
> > [   48.831804] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 93 af 1b 00 f7 d8 64 89 01 48
> > [   48.833180] RSP: 002b:00007fff33f36278 EFLAGS: 00000213 ORIG_RAX: 0000000000000036
> > [   48.833756] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f26cdc3ee5d
> > [   48.834294] RDX: 0000000000000050 RSI: 0000000000000001 RDI: 0000000000000003
> > [   48.834830] RBP: 00007fff33f36290 R08: 0000000000000010 R09: 00007fff33f36290
> > [   48.835368] R10: 0000000020000080 R11: 0000000000000213 R12: 00007fff33f363e8
> > [   48.835906] R13: 000000000040178f R14: 0000000000403e08 R15: 00007f26cde51000
> > [   48.836466]  </TASK>
> > [   48.836648] Kernel panic - not syncing: softlockup: hung tasks
> > [   48.837096] CPU: 1 UID: 0 PID: 729 Comm: repro Tainted: G             L     6.12.0-rc2-8cf0b93919e1 #1
> > [   48.837796] Tainted: [L]=SOFTLOCKUP
> > [   48.838071] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
> > [   48.838916] Call Trace:
> > [   48.839113]  <IRQ>
> > [   48.839282]  dump_stack_lvl+0x42/0x150
> > [   48.839584]  dump_stack+0x19/0x20
> > [   48.839846]  panic+0x703/0x790
> > [   48.840100]  ? __pfx_panic+0x10/0x10
> > [   48.840394]  ? watchdog_timer_fn+0x599/0x6b0
> > [   48.840727]  ? watchdog_timer_fn+0x58c/0x6b0
> > [   48.841065]  watchdog_timer_fn+0x5aa/0x6b0
> > [   48.841382]  ? __pfx_watchdog_timer_fn+0x10/0x10
> > [   48.841743]  __hrtimer_run_queues+0x5d6/0xc30
> > [   48.842091]  ? __pfx___hrtimer_run_queues+0x10/0x10
> > [   48.842473]  hrtimer_interrupt+0x324/0x7a0
> > [   48.842802]  __sysvec_apic_timer_interrupt+0x10b/0x410
> > [   48.843198]  ? debug_smp_processor_id+0x20/0x30
> > [   48.843551]  sysvec_apic_timer_interrupt+0xaf/0xd0
> > [   48.843922]  </IRQ>
> > [   48.844101]  <TASK>
> > [   48.844275]  asm_sysvec_apic_timer_interrupt+0x1f/0x30
> > [   48.844711] RIP: 0010:__sanitizer_cov_trace_pc+0x45/0x70
> > [   48.845130] Code: a9 00 01 ff 00 74 1d f6 c4 01 74 43 a9 00 00 0f 00 75 3c a9 00 00 f0 00 75 35 8b 82 04 1e 00 00 85 c0 74 2b 8b 82 e0 1d 00 00 <83> f8 02 75 20 48 8b 8a e8 1d 00 00 8b 92 e4 1d 00 00 48 8b 01 48
> > [   48.846480] RSP: 0018:ffff8880239cf790 EFLAGS: 00000246
> > [   48.846876] RAX: 0000000000000000 RBX: ffff8880239cf900 RCX: ffffffff8581c19f
> > [   48.847407] RDX: ffff88801a818000 RSI: ffffffff8581c1d5 RDI: 0000000000000007
> > [   48.847933] RBP: ffff8880239cf790 R08: 0000000000000001 R09: ffffed1004739f23
> > [   48.848472] R10: 0000000077cc006e R11: 0000000000000001 R12: 0000000000000000
> > [   48.849002] R13: 0000000077cc006e R14: ffff8880239cf918 R15: 0000000000000000
> > [   48.849536]  ? xas_start+0x11f/0x730
> > [   48.849818]  ? xas_start+0x155/0x730
> > [   48.850101]  xas_start+0x155/0x730
> > [   48.850372]  xas_load+0x2f/0x520
> > [   48.850629]  ? irqentry_exit+0x3e/0xa0
> > [   48.850922]  ? sysvec_apic_timer_interrupt+0x6a/0xd0
> > [   48.851304]  xas_store+0x1165/0x1ad0
> > [   48.851588]  ? __this_cpu_preempt_check+0x21/0x30
> > [   48.851950]  ? irqentry_exit+0x3e/0xa0
> > [   48.852254]  __xa_erase+0xc6/0x180
> > [   48.852524]  ? __pfx___xa_erase+0x10/0x10
> > [   48.852842]  ? __xa_erase+0xf1/0x180
> > [   48.853123]  ? sock_devmem_dontneed+0x42c/0x6d0
> > [   48.853480]  sock_devmem_dontneed+0x3a8/0x6d0
> > [   48.853829]  ? __pfx_sock_devmem_dontneed+0x10/0x10
> > [   48.854205]  ? trace_lock_acquire+0x139/0x1b0
> > [   48.854548]  ? lock_acquire+0x80/0xb0
> > [   48.854833]  ? __might_fault+0xf1/0x1b0
> > [   48.855133]  ? __might_fault+0xf1/0x1b0
> > [   48.855437]  ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
> > [   48.855849]  sk_setsockopt+0x480/0x3c60
> > [   48.856158]  ? __pfx_sk_setsockopt+0x10/0x10
> > [   48.856491]  ? __kasan_check_read+0x15/0x20
> > [   48.856814]  ? __lock_acquire+0x1b0f/0x5c90
> > [   48.857144]  ? __pfx___lock_acquire+0x10/0x10
> > [   48.857488]  sock_setsockopt+0x68/0x90
> > [   48.857785]  do_sock_setsockopt+0x3fb/0x480
> > [   48.858110]  ? __pfx_do_sock_setsockopt+0x10/0x10
> > [   48.858474]  ? lock_release+0x441/0x870
> > [   48.858776]  ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
> > [   48.859184]  ? fdget+0x188/0x230
> > [   48.859448]  __sys_setsockopt+0x131/0x200
> > [   48.859764]  ? __pfx___sys_setsockopt+0x10/0x10
> > [   48.860123]  ? seqcount_lockdep_reader_access.constprop.0+0xc0/0xd0
> > [   48.860598]  ? __sanitizer_cov_trace_cmp4+0x1a/0x20
> > [   48.860982]  ? ktime_get_coarse_real_ts64+0xbf/0xf0
> > [   48.861370]  __x64_sys_setsockopt+0xc6/0x160
> > [   48.861710]  ? syscall_trace_enter+0x14a/0x230
> > [   48.862057]  x64_sys_call+0x6cf/0x20d0
> > [   48.862350]  do_syscall_64+0x6d/0x140
> > [   48.862639]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > [   48.863023] RIP: 0033:0x7f26cdc3ee5d
> > [   48.863301] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 93 af 1b 00 f7 d8 64 89 01 48
> > [   48.864659] RSP: 002b:00007fff33f36278 EFLAGS: 00000213 ORIG_RAX: 0000000000000036
> > [   48.865223] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f26cdc3ee5d
> > "
> >
> > I hope you find it useful.
> 
> Thank you for the report. I think I see the issue and I commented on
> the fix in the code below.
> 
> Only issue is that this is unlucky timing for me. I have a flight
> tomorrow for a vacation where I think I may have internet access and
> may not. I will try to follow up here, but in case I can't, what's the
> urgency for this issue? Can this wait 2 weeks when I get back?
> 
> > > +     if (optlen % sizeof(struct dmabuf_token) ||
> > > +         optlen > sizeof(*tokens) * MAX_DONTNEED_TOKENS)
> > > +             return -EINVAL;
> > > +
> > > +     tokens = kvmalloc_array(optlen, sizeof(*tokens), GFP_KERNEL);
> > > +     if (!tokens)
> > > +             return -ENOMEM;
> > > +
> 
> There is an unrelated bug here. The first argument for kvmalloc_array
> is the number of elements, I think, not the number of bytes. So this
> should be:
> 
> num_tokens = optlen / sizeof(struct dmabuf_token);
> tokens = kvmalloc_array(num_tokens, sizeof(*tokens), GFP_KERNEL);
> if (!tokens)
>    return -ENOMEM;
> 
> > > +
> > > +     if (copy_from_sockptr(tokens, optval, optlen)) {
> > > +             kvfree(tokens);
> > > +             return -EFAULT;
> > > +     }
> > > +
> > > +     xa_lock_bh(&sk->sk_user_frags);
> > > +     for (i = 0; i < num_tokens; i++) {
> > > +             for (j = 0; j < tokens[i].token_count; j++) {
> 
> The bug should be here. tokens[i].token_count is a u32 provided by the
> user. The user can specify U32_MAX here, which will make the loop
> below spin for a very long time with the lock held, which should be
> the cause of the soft lockup.
> 
> We should add a check that token_count is < MAX_DONTNEED_TOKENS or
> something like that, above this line.
> 
> Please let me know of urgency. If this can't wait I'll try very hard
> to repro the issue/fix while I'm out. Untested fix I'm going to try
> out:
>

Hi Mina

This bug can wait as you mentioned two weeks. Just let you know,
after applied your proposed fix, the issue cannot be reproduced using
the same repro binary.

If you have a formal fix patch later, I would be happy to test it again.
Just let me know.

Regards,
Yi Lai

> diff --git a/net/core/sock.c b/net/core/sock.c
> index 083d438d8b6f..cb3d8b19de14 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -1071,11 +1071,11 @@ sock_devmem_dontneed(struct sock *sk,
> sockptr_t optval, unsigned int optlen)
>             optlen > sizeof(*tokens) * MAX_DONTNEED_TOKENS)
>                 return -EINVAL;
> 
> -       tokens = kvmalloc_array(optlen, sizeof(*tokens), GFP_KERNEL);
> +       num_tokens = optlen / sizeof(struct dmabuf_token);
> +       tokens = kvmalloc_array(num_tokens, sizeof(*tokens), GFP_KERNEL);
>         if (!tokens)
>                 return -ENOMEM;
> 
> -       num_tokens = optlen / sizeof(struct dmabuf_token);
>         if (copy_from_sockptr(tokens, optval, optlen)) {
>                 kvfree(tokens);
>                 return -EFAULT;
> @@ -1083,6 +1083,10 @@ sock_devmem_dontneed(struct sock *sk, sockptr_t
> optval, unsigned int optlen)
> 
>         xa_lock_bh(&sk->sk_user_frags);
>         for (i = 0; i < num_tokens; i++) {
> +
> +               if (tokens[i].token_count > MAX_DONTNEED_TOKENS)
> +                       continue;
> +
>                 for (j = 0; j < tokens[i].token_count; j++) {
>                         netmem_ref netmem = (__force netmem_ref)__xa_erase(
>                                 &sk->sk_user_frags, tokens[i].token_start + j);
> 
> -- 
> Thanks,
> Mina

^ permalink raw reply

* Re: [PATCH v5 6/8] x86/module: perpare module loading for ROX allocations of text
From: Nathan Chancellor @ 2024-10-10 22:54 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Andrew Morton, Andreas Larsson, Andy Lutomirski, Ard Biesheuvel,
	Arnd Bergmann, Borislav Petkov, Brian Cain, Catalin Marinas,
	Christoph Hellwig, Christophe Leroy, Dave Hansen, Dinh Nguyen,
	Geert Uytterhoeven, Guo Ren, Helge Deller, Huacai Chen,
	Ingo Molnar, Johannes Berg, John Paul Adrian Glaubitz,
	Kent Overstreet, Liam R. Howlett, Luis Chamberlain, Mark Rutland,
	Masami Hiramatsu, Matt Turner, Max Filippov, Michael Ellerman,
	Michal Simek, Oleg Nesterov, Palmer Dabbelt, Peter Zijlstra,
	Richard Weinberger, Russell King, Song Liu, Stafford Horne,
	Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner,
	Uladzislau Rezki, Vineet Gupta, Will Deacon, bpf, linux-alpha,
	linux-arch, linux-arm-kernel, linux-csky, linux-hexagon,
	linux-kernel, linux-m68k, linux-mips, linux-mm, linux-modules,
	linux-openrisc, linux-parisc, linux-riscv, linux-sh,
	linux-snps-arc, linux-trace-kernel, linux-um, linuxppc-dev,
	loongarch, sparclinux, x86
In-Reply-To: <20241009180816.83591-7-rppt@kernel.org>

Hi Mike,

On Wed, Oct 09, 2024 at 09:08:14PM +0300, Mike Rapoport wrote:
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> 
> When module text memory will be allocated with ROX permissions, the
> memory at the actual address where the module will live will contain
> invalid instructions and there will be a writable copy that contains the
> actual module code.
> 
> Update relocations and alternatives patching to deal with it.
> 
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>

I bisected a boot failure that I see with CONFIG_CFI_CLANG enabled to
this change as commit be712757cabd ("x86/module: perpare module loading
for ROX allocations of text") in -next.

  $ echo CONFIG_CFI_CLANG=y >arch/x86/configs/cfi.config

  $ make -skj"$(nproc)" ARCH=x86_64 LLVM=1 mrproper defconfig cfi.config bzImage

  $ curl -LSs https://github.com/ClangBuiltLinux/boot-utils/releases/download/20230707-182910/x86_64-rootfs.cpio.zst | zstd -d >rootfs.cpio

  $ qemu-system-x86_64 \
      -display none \
      -nodefaults \
      -M q35 \
      -d unimp,guest_errors \
      -append 'console=ttyS0 earlycon=uart8250,io,0x3f8' \
      -kernel arch/x86/boot/bzImage \
      -initrd rootfs.cpio \
      -cpu host \
      -enable-kvm \
      -m 512m \
      -smp 8 \
      -serial mon:stdio
  [    0.000000] Linux version 6.12.0-rc2-00140-gbe712757cabd (nathan@n3-xlarge-x86) (ClangBuiltLinux clang version 19.1.0 (https://github.com/llvm/llvm-project.git a4bf6cd7cfb1a1421ba92bca9d017b49936c55e4), ClangBuiltLinux LLD 19.1.0 (https://github.com/llvm/llvm-project.git a4bf6cd7cfb1a1421ba92bca9d017b49936c55e4)) #1 SMP PREEMPT_DYNAMIC Thu Oct 10 22:42:57 UTC 2024
  ...
  [    0.092204] Speculative Store Bypass: Mitigation: Speculative Store Bypass disabled via prctl
  [    0.093207] TAA: Mitigation: TSX disabled
  [    0.093711] MMIO Stale Data: Mitigation: Clear CPU buffers
  [    0.094228] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
  [    0.095203] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
  [    0.096203] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
  [    0.097203] x86/fpu: Supporting XSAVE feature 0x020: 'AVX-512 opmask'
  [    0.098003] x86/fpu: Supporting XSAVE feature 0x040: 'AVX-512 Hi256'
  [    0.098203] x86/fpu: Supporting XSAVE feature 0x080: 'AVX-512 ZMM_Hi256'
  [    0.099203] x86/fpu: Supporting XSAVE feature 0x200: 'Protection Keys User registers'
  [    0.100204] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
  [    0.101204] x86/fpu: xstate_offset[5]:  832, xstate_sizes[5]:   64
  [    0.102203] x86/fpu: xstate_offset[6]:  896, xstate_sizes[6]:  512
  [    0.103204] x86/fpu: xstate_offset[7]: 1408, xstate_sizes[7]: 1024
  [    0.104051] x86/fpu: xstate_offset[9]: 2432, xstate_sizes[9]:    8
  [    0.104204] x86/fpu: Enabled xstate features 0x2e7, context size is 2440 bytes, using 'compacted' format.

then nothing after that. Boot is successful if CFI is not enabled (the
initrd will just shutdown the machine after printing the version string).

If there is any further information I can provide or patches I can test,
I am more than happy to do so.

Cheers,
Nathan

^ permalink raw reply

* Re: [PATCH net-next v25 10/13] net: add SO_DEVMEM_DONTNEED setsockopt to release RX frags
From: Mina Almasry @ 2024-10-10 19:05 UTC (permalink / raw)
  To: Lai, Yi
  Cc: netdev, linux-kernel, linux-doc, linux-alpha, linux-mips,
	linux-parisc, sparclinux, linux-trace-kernel, linux-arch, bpf,
	linux-kselftest, linux-media, dri-devel, Donald Hunter,
	Jakub Kicinski, David S. Miller, Eric Dumazet, Paolo Abeni,
	Jonathan Corbet, Richard Henderson, Ivan Kokshaysky, Matt Turner,
	Thomas Bogendoerfer, James E.J. Bottomley, Helge Deller,
	Andreas Larsson, Jesper Dangaard Brouer, Ilias Apalodimas,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Arnd Bergmann, Steffen Klassert, Herbert Xu, David Ahern,
	Willem de Bruijn, Björn Töpel, Magnus Karlsson,
	Maciej Fijalkowski, Jonathan Lemon, Shuah Khan,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend, Sumit Semwal,
	Christian König, Pavel Begunkov, David Wei, Jason Gunthorpe,
	Yunsheng Lin, Shailend Chand, Harshitha Ramamurthy, Shakeel Butt,
	Jeroen de Borst, Praveen Kaligineedi, Bagas Sanjaya,
	Christoph Hellwig, Nikolay Aleksandrov, Taehee Yoo,
	Willem de Bruijn, Kaiyuan Zhang, yi1.lai
In-Reply-To: <Zwe3lWTN36IUaIdd@ly-workstation>

On Thu, Oct 10, 2024 at 4:17 AM Lai, Yi <yi1.lai@linux.intel.com> wrote:
>
> Hi Mina Almasry,
>
> Greetings!
>
> I used Syzkaller and found that there is BUG: soft lockup inqt in linux-next tree next-20241008
>
> After bisection and the first bad commit is:
> "
> 678f6e28b5f6 net: add SO_DEVMEM_DONTNEED setsockopt to release RX frags
> "
>
> All detailed into can be found at:
> https://github.com/laifryiee/syzkaller_logs/tree/main/241009_103423_do_sock_setsockopt
> Syzkaller repro code:
> https://github.com/laifryiee/syzkaller_logs/tree/main/241009_103423_do_sock_setsockopt/repro.c
> Syzkaller repro syscall steps:
> https://github.com/laifryiee/syzkaller_logs/tree/main/241009_103423_do_sock_setsockopt/repro.prog
> Syzkaller report:
> https://github.com/laifryiee/syzkaller_logs/tree/main/241009_103423_do_sock_setsockopt/repro.report
> Kconfig(make olddefconfig):
> https://github.com/laifryiee/syzkaller_logs/tree/main/241009_103423_do_sock_setsockopt/kconfig_origin
> Bisect info:
> https://github.com/laifryiee/syzkaller_logs/tree/main/241009_103423_do_sock_setsockopt/bisect_info.log
> bzImage:
> https://github.com/laifryiee/syzkaller_logs/raw/refs/heads/main/241009_103423_do_sock_setsockopt/bzImage_8cf0b93919e13d1e8d4466eb4080a4c4d9d66d7b
> Issue dmesg:
> https://github.com/laifryiee/syzkaller_logs/blob/main/241009_103423_do_sock_setsockopt/8cf0b93919e13d1e8d4466eb4080a4c4d9d66d7b_dmesg.log
>
> "
> [   48.825073]  ? __lock_acquire+0x1b0f/0x5c90
> [   48.825419]  ? __pfx___lock_acquire+0x10/0x10
> [   48.825774]  sock_setsockopt+0x68/0x90
> [   48.826117]  do_sock_setsockopt+0x3fb/0x480
> [   48.826455]  ? __pfx_do_sock_setsockopt+0x10/0x10
> [   48.826829]  ? lock_release+0x441/0x870
> [   48.827140]  ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
> [   48.827558]  ? fdget+0x188/0x230
> [   48.827846]  __sys_setsockopt+0x131/0x200
> [   48.828184]  ? __pfx___sys_setsockopt+0x10/0x10
> [   48.828551]  ? seqcount_lockdep_reader_access.constprop.0+0xc0/0xd0
> [   48.829042]  ? __sanitizer_cov_trace_cmp4+0x1a/0x20
> [   48.829425]  ? ktime_get_coarse_real_ts64+0xbf/0xf0
> [   48.829817]  __x64_sys_setsockopt+0xc6/0x160
> [   48.830160]  ? syscall_trace_enter+0x14a/0x230
> [   48.830520]  x64_sys_call+0x6cf/0x20d0
> [   48.830825]  do_syscall_64+0x6d/0x140
> [   48.831124]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [   48.831517] RIP: 0033:0x7f26cdc3ee5d
> [   48.831804] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 93 af 1b 00 f7 d8 64 89 01 48
> [   48.833180] RSP: 002b:00007fff33f36278 EFLAGS: 00000213 ORIG_RAX: 0000000000000036
> [   48.833756] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f26cdc3ee5d
> [   48.834294] RDX: 0000000000000050 RSI: 0000000000000001 RDI: 0000000000000003
> [   48.834830] RBP: 00007fff33f36290 R08: 0000000000000010 R09: 00007fff33f36290
> [   48.835368] R10: 0000000020000080 R11: 0000000000000213 R12: 00007fff33f363e8
> [   48.835906] R13: 000000000040178f R14: 0000000000403e08 R15: 00007f26cde51000
> [   48.836466]  </TASK>
> [   48.836648] Kernel panic - not syncing: softlockup: hung tasks
> [   48.837096] CPU: 1 UID: 0 PID: 729 Comm: repro Tainted: G             L     6.12.0-rc2-8cf0b93919e1 #1
> [   48.837796] Tainted: [L]=SOFTLOCKUP
> [   48.838071] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
> [   48.838916] Call Trace:
> [   48.839113]  <IRQ>
> [   48.839282]  dump_stack_lvl+0x42/0x150
> [   48.839584]  dump_stack+0x19/0x20
> [   48.839846]  panic+0x703/0x790
> [   48.840100]  ? __pfx_panic+0x10/0x10
> [   48.840394]  ? watchdog_timer_fn+0x599/0x6b0
> [   48.840727]  ? watchdog_timer_fn+0x58c/0x6b0
> [   48.841065]  watchdog_timer_fn+0x5aa/0x6b0
> [   48.841382]  ? __pfx_watchdog_timer_fn+0x10/0x10
> [   48.841743]  __hrtimer_run_queues+0x5d6/0xc30
> [   48.842091]  ? __pfx___hrtimer_run_queues+0x10/0x10
> [   48.842473]  hrtimer_interrupt+0x324/0x7a0
> [   48.842802]  __sysvec_apic_timer_interrupt+0x10b/0x410
> [   48.843198]  ? debug_smp_processor_id+0x20/0x30
> [   48.843551]  sysvec_apic_timer_interrupt+0xaf/0xd0
> [   48.843922]  </IRQ>
> [   48.844101]  <TASK>
> [   48.844275]  asm_sysvec_apic_timer_interrupt+0x1f/0x30
> [   48.844711] RIP: 0010:__sanitizer_cov_trace_pc+0x45/0x70
> [   48.845130] Code: a9 00 01 ff 00 74 1d f6 c4 01 74 43 a9 00 00 0f 00 75 3c a9 00 00 f0 00 75 35 8b 82 04 1e 00 00 85 c0 74 2b 8b 82 e0 1d 00 00 <83> f8 02 75 20 48 8b 8a e8 1d 00 00 8b 92 e4 1d 00 00 48 8b 01 48
> [   48.846480] RSP: 0018:ffff8880239cf790 EFLAGS: 00000246
> [   48.846876] RAX: 0000000000000000 RBX: ffff8880239cf900 RCX: ffffffff8581c19f
> [   48.847407] RDX: ffff88801a818000 RSI: ffffffff8581c1d5 RDI: 0000000000000007
> [   48.847933] RBP: ffff8880239cf790 R08: 0000000000000001 R09: ffffed1004739f23
> [   48.848472] R10: 0000000077cc006e R11: 0000000000000001 R12: 0000000000000000
> [   48.849002] R13: 0000000077cc006e R14: ffff8880239cf918 R15: 0000000000000000
> [   48.849536]  ? xas_start+0x11f/0x730
> [   48.849818]  ? xas_start+0x155/0x730
> [   48.850101]  xas_start+0x155/0x730
> [   48.850372]  xas_load+0x2f/0x520
> [   48.850629]  ? irqentry_exit+0x3e/0xa0
> [   48.850922]  ? sysvec_apic_timer_interrupt+0x6a/0xd0
> [   48.851304]  xas_store+0x1165/0x1ad0
> [   48.851588]  ? __this_cpu_preempt_check+0x21/0x30
> [   48.851950]  ? irqentry_exit+0x3e/0xa0
> [   48.852254]  __xa_erase+0xc6/0x180
> [   48.852524]  ? __pfx___xa_erase+0x10/0x10
> [   48.852842]  ? __xa_erase+0xf1/0x180
> [   48.853123]  ? sock_devmem_dontneed+0x42c/0x6d0
> [   48.853480]  sock_devmem_dontneed+0x3a8/0x6d0
> [   48.853829]  ? __pfx_sock_devmem_dontneed+0x10/0x10
> [   48.854205]  ? trace_lock_acquire+0x139/0x1b0
> [   48.854548]  ? lock_acquire+0x80/0xb0
> [   48.854833]  ? __might_fault+0xf1/0x1b0
> [   48.855133]  ? __might_fault+0xf1/0x1b0
> [   48.855437]  ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
> [   48.855849]  sk_setsockopt+0x480/0x3c60
> [   48.856158]  ? __pfx_sk_setsockopt+0x10/0x10
> [   48.856491]  ? __kasan_check_read+0x15/0x20
> [   48.856814]  ? __lock_acquire+0x1b0f/0x5c90
> [   48.857144]  ? __pfx___lock_acquire+0x10/0x10
> [   48.857488]  sock_setsockopt+0x68/0x90
> [   48.857785]  do_sock_setsockopt+0x3fb/0x480
> [   48.858110]  ? __pfx_do_sock_setsockopt+0x10/0x10
> [   48.858474]  ? lock_release+0x441/0x870
> [   48.858776]  ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
> [   48.859184]  ? fdget+0x188/0x230
> [   48.859448]  __sys_setsockopt+0x131/0x200
> [   48.859764]  ? __pfx___sys_setsockopt+0x10/0x10
> [   48.860123]  ? seqcount_lockdep_reader_access.constprop.0+0xc0/0xd0
> [   48.860598]  ? __sanitizer_cov_trace_cmp4+0x1a/0x20
> [   48.860982]  ? ktime_get_coarse_real_ts64+0xbf/0xf0
> [   48.861370]  __x64_sys_setsockopt+0xc6/0x160
> [   48.861710]  ? syscall_trace_enter+0x14a/0x230
> [   48.862057]  x64_sys_call+0x6cf/0x20d0
> [   48.862350]  do_syscall_64+0x6d/0x140
> [   48.862639]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [   48.863023] RIP: 0033:0x7f26cdc3ee5d
> [   48.863301] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 93 af 1b 00 f7 d8 64 89 01 48
> [   48.864659] RSP: 002b:00007fff33f36278 EFLAGS: 00000213 ORIG_RAX: 0000000000000036
> [   48.865223] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f26cdc3ee5d
> "
>
> I hope you find it useful.

Thank you for the report. I think I see the issue and I commented on
the fix in the code below.

Only issue is that this is unlucky timing for me. I have a flight
tomorrow for a vacation where I think I may have internet access and
may not. I will try to follow up here, but in case I can't, what's the
urgency for this issue? Can this wait 2 weeks when I get back?

> > +     if (optlen % sizeof(struct dmabuf_token) ||
> > +         optlen > sizeof(*tokens) * MAX_DONTNEED_TOKENS)
> > +             return -EINVAL;
> > +
> > +     tokens = kvmalloc_array(optlen, sizeof(*tokens), GFP_KERNEL);
> > +     if (!tokens)
> > +             return -ENOMEM;
> > +

There is an unrelated bug here. The first argument for kvmalloc_array
is the number of elements, I think, not the number of bytes. So this
should be:

num_tokens = optlen / sizeof(struct dmabuf_token);
tokens = kvmalloc_array(num_tokens, sizeof(*tokens), GFP_KERNEL);
if (!tokens)
   return -ENOMEM;

> > +
> > +     if (copy_from_sockptr(tokens, optval, optlen)) {
> > +             kvfree(tokens);
> > +             return -EFAULT;
> > +     }
> > +
> > +     xa_lock_bh(&sk->sk_user_frags);
> > +     for (i = 0; i < num_tokens; i++) {
> > +             for (j = 0; j < tokens[i].token_count; j++) {

The bug should be here. tokens[i].token_count is a u32 provided by the
user. The user can specify U32_MAX here, which will make the loop
below spin for a very long time with the lock held, which should be
the cause of the soft lockup.

We should add a check that token_count is < MAX_DONTNEED_TOKENS or
something like that, above this line.

Please let me know of urgency. If this can't wait I'll try very hard
to repro the issue/fix while I'm out. Untested fix I'm going to try
out:

diff --git a/net/core/sock.c b/net/core/sock.c
index 083d438d8b6f..cb3d8b19de14 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1071,11 +1071,11 @@ sock_devmem_dontneed(struct sock *sk,
sockptr_t optval, unsigned int optlen)
            optlen > sizeof(*tokens) * MAX_DONTNEED_TOKENS)
                return -EINVAL;

-       tokens = kvmalloc_array(optlen, sizeof(*tokens), GFP_KERNEL);
+       num_tokens = optlen / sizeof(struct dmabuf_token);
+       tokens = kvmalloc_array(num_tokens, sizeof(*tokens), GFP_KERNEL);
        if (!tokens)
                return -ENOMEM;

-       num_tokens = optlen / sizeof(struct dmabuf_token);
        if (copy_from_sockptr(tokens, optval, optlen)) {
                kvfree(tokens);
                return -EFAULT;
@@ -1083,6 +1083,10 @@ sock_devmem_dontneed(struct sock *sk, sockptr_t
optval, unsigned int optlen)

        xa_lock_bh(&sk->sk_user_frags);
        for (i = 0; i < num_tokens; i++) {
+
+               if (tokens[i].token_count > MAX_DONTNEED_TOKENS)
+                       continue;
+
                for (j = 0; j < tokens[i].token_count; j++) {
                        netmem_ref netmem = (__force netmem_ref)__xa_erase(
                                &sk->sk_user_frags, tokens[i].token_start + j);

-- 
Thanks,
Mina

^ permalink raw reply related

* Re: [PATCH v5 7/8] execmem: add support for cache of large ROX pages
From: Kees Bakker @ 2024-10-10 18:35 UTC (permalink / raw)
  To: Mike Rapoport, Andrew Morton
  Cc: Andreas Larsson, Andy Lutomirski, Ard Biesheuvel, Arnd Bergmann,
	Borislav Petkov, Brian Cain, Catalin Marinas, Christoph Hellwig,
	Christophe Leroy, Dave Hansen, Dinh Nguyen, Geert Uytterhoeven,
	Guo Ren, Helge Deller, Huacai Chen, Ingo Molnar, Johannes Berg,
	John Paul Adrian Glaubitz, Kent Overstreet, Liam R. Howlett,
	Luis Chamberlain, Mark Rutland, Masami Hiramatsu, Matt Turner,
	Max Filippov, Michael Ellerman, Michal Simek, Oleg Nesterov,
	Palmer Dabbelt, Peter Zijlstra, Richard Weinberger, Russell King,
	Song Liu, Stafford Horne, Steven Rostedt, Thomas Bogendoerfer,
	Thomas Gleixner, Uladzislau Rezki, Vineet Gupta, Will Deacon, bpf,
	linux-alpha, linux-arch, linux-arm-kernel, linux-csky,
	linux-hexagon, linux-kernel, linux-m68k, linux-mips, linux-mm,
	linux-modules, linux-openrisc, linux-parisc, linux-riscv,
	linux-sh, linux-snps-arc, linux-trace-kernel, linux-um,
	linuxppc-dev, loongarch, sparclinux, x86
In-Reply-To: <20241009180816.83591-8-rppt@kernel.org>

Op 09-10-2024 om 20:08 schreef Mike Rapoport:
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>
> Using large pages to map text areas reduces iTLB pressure and improves
> performance.
>
> Extend execmem_alloc() with an ability to use huge pages with ROX
> permissions as a cache for smaller allocations.
>
> To populate the cache, a writable large page is allocated from vmalloc with
> VM_ALLOW_HUGE_VMAP, filled with invalid instructions and then remapped as
> ROX.
>
> Portions of that large page are handed out to execmem_alloc() callers
> without any changes to the permissions.
>
> When the memory is freed with execmem_free() it is invalidated again so
> that it won't contain stale instructions.
>
> The cache is enabled when an architecture sets EXECMEM_ROX_CACHE flag in
> definition of an execmem_range.
>
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
>   include/linux/execmem.h |   2 +
>   mm/execmem.c            | 317 +++++++++++++++++++++++++++++++++++++++-
>   mm/internal.h           |   1 +
>   mm/vmalloc.c            |   5 +
>   4 files changed, 320 insertions(+), 5 deletions(-)
> [...]
> +static void execmem_cache_clean(struct work_struct *work)
> +{
> +	struct maple_tree *free_areas = &execmem_cache.free_areas;
> +	struct mutex *mutex = &execmem_cache.mutex;
> +	MA_STATE(mas, free_areas, 0, ULONG_MAX);
> +	void *area;
> +
> +	mutex_lock(mutex);
> +	mas_for_each(&mas, area, ULONG_MAX) {
> +		size_t size;
> +
No need to check for !area, because it is already guaranteed by the 
while loop condition (mas_for_each)
> +		if (!area)
> +			continue;
> +
> +		size = mas_range_len(&mas);
> +
> +		if (IS_ALIGNED(size, PMD_SIZE) &&
> +		    IS_ALIGNED(mas.index, PMD_SIZE)) {
> +			struct vm_struct *vm = find_vm_area(area);
> +
> +			execmem_set_direct_map_valid(vm, true);
> +			mas_store_gfp(&mas, NULL, GFP_KERNEL);
> +			vfree(area);
> +		}
> +	}
> +	mutex_unlock(mutex);
> +}
>

^ permalink raw reply

* Re: [PATCH v5 4/8] module: prepare to handle ROX allocations for text
From: Song Liu @ 2024-10-10 17:18 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Andrew Morton, Andreas Larsson, Andy Lutomirski, Ard Biesheuvel,
	Arnd Bergmann, Borislav Petkov, Brian Cain, Catalin Marinas,
	Christoph Hellwig, Christophe Leroy, Dave Hansen, Dinh Nguyen,
	Geert Uytterhoeven, Guo Ren, Helge Deller, Huacai Chen,
	Ingo Molnar, Johannes Berg, John Paul Adrian Glaubitz,
	Kent Overstreet, Liam R. Howlett, Luis Chamberlain, Mark Rutland,
	Masami Hiramatsu, Matt Turner, Max Filippov, Michael Ellerman,
	Michal Simek, Oleg Nesterov, Palmer Dabbelt, Peter Zijlstra,
	Richard Weinberger, Russell King, Stafford Horne, Steven Rostedt,
	Thomas Bogendoerfer, Thomas Gleixner, Uladzislau Rezki,
	Vineet Gupta, Will Deacon, bpf, linux-alpha, linux-arch,
	linux-arm-kernel, linux-csky, linux-hexagon, linux-kernel,
	linux-m68k, linux-mips, linux-mm, linux-modules, linux-openrisc,
	linux-parisc, linux-riscv, linux-sh, linux-snps-arc,
	linux-trace-kernel, linux-um, linuxppc-dev, loongarch, sparclinux,
	x86
In-Reply-To: <ZwdpnPKKQGF5DtSv@kernel.org>

On Wed, Oct 9, 2024 at 10:47 PM Mike Rapoport <rppt@kernel.org> wrote:
>
> On Wed, Oct 09, 2024 at 03:23:40PM -0700, Song Liu wrote:
> > On Wed, Oct 9, 2024 at 11:10 AM Mike Rapoport <rppt@kernel.org> wrote:
> > [...]
> > > diff --git a/include/linux/module.h b/include/linux/module.h
> > > index 88ecc5e9f523..7039f609c6ef 100644
> > > --- a/include/linux/module.h
> > > +++ b/include/linux/module.h
> > > @@ -367,6 +367,8 @@ enum mod_mem_type {
> > >
> > >  struct module_memory {
> > >         void *base;
> > > +       void *rw_copy;
> > > +       bool is_rox;
> > >         unsigned int size;
> >
> > Do we really need to hold the rw_copy all the time?
>
> We hold it only during module initialization, it's freed in
> post_relocation.

Ah, I missed this part. Sorry for the noise.

Song

^ permalink raw reply

* Re: Bisected: [PATCH v5 8/8] x86/module: enable ROX caches for module text
From: Mike Rapoport @ 2024-10-10 15:20 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Andrew Morton, Andreas Larsson, Andy Lutomirski, Ard Biesheuvel,
	Arnd Bergmann, Borislav Petkov, Brian Cain, Catalin Marinas,
	Christoph Hellwig, Christophe Leroy, Dave Hansen, Dinh Nguyen,
	Geert Uytterhoeven, Guo Ren, Helge Deller, Huacai Chen,
	Ingo Molnar, Johannes Berg, John Paul Adrian Glaubitz,
	Kent Overstreet, Liam R. Howlett, Luis Chamberlain, Mark Rutland,
	Masami Hiramatsu, Matt Turner, Max Filippov, Michael Ellerman,
	Michal Simek, Oleg Nesterov, Palmer Dabbelt, Peter Zijlstra,
	Richard Weinberger, Russell King, Song Liu, Stafford Horne,
	Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner,
	Uladzislau Rezki, Vineet Gupta, Will Deacon, bpf, linux-alpha,
	linux-arch, linux-arm-kernel, linux-csky, linux-hexagon,
	linux-kernel, linux-m68k, linux-mips, linux-mm, linux-modules,
	linux-openrisc, linux-parisc, linux-riscv, linux-sh,
	linux-snps-arc, linux-trace-kernel, linux-um, linuxppc-dev,
	loongarch, sparclinux, x86
In-Reply-To: <20241010083033.GA1279924@google.com>

On Thu, Oct 10, 2024 at 05:30:33PM +0900, Sergey Senozhatsky wrote:
> On (24/10/09 21:08), Mike Rapoport wrote:
> > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> > 
> > Enable execmem's cache of PMD_SIZE'ed pages mapped as ROX for module
> > text allocations.
> > 
> 
> With this modprobe disappoints kmemleak
> 
> [   12.700128] kmemleak: Found object by alias at 0xffffffffa000a000
> [   12.702179] CPU: 5 UID: 0 PID: 410 Comm: modprobe Tainted: G                 N 6.12.0-rc2+ #760
> [   12.704656] Tainted: [N]=TEST
> [   12.705526] Call Trace:
> [   12.706250]  <TASK>
> [   12.706888]  dump_stack_lvl+0x3e/0xdb
> [   12.707961]  __find_and_get_object+0x100/0x110
> [   12.709256]  kmemleak_no_scan+0x2e/0xb0
> [   12.710354]  kmemleak_load_module+0xad/0xe0
> [   12.711557]  load_module+0x2391/0x45a0
> [   12.712507]  __se_sys_finit_module+0x4e0/0x7a0
> [   12.713599]  do_syscall_64+0x54/0xf0
> [   12.714477]  ? irqentry_exit_to_user_mode+0x33/0x100
> [   12.715696]  entry_SYSCALL_64_after_hwframe+0x4b/0x53
> [   12.716931] RIP: 0033:0x7fc7af51f059
> [   12.717816] Code: 08 89 e8 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 8f 1d 0d 00 f7 d8 64 89 01 48
> [   12.722324] RSP: 002b:00007ffc1d0b0c18 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
> [   12.724173] RAX: ffffffffffffffda RBX: 00005618a9439b20 RCX: 00007fc7af51f059
> [   12.725884] RDX: 0000000000000000 RSI: 000056187aea098b RDI: 0000000000000003
> [   12.727617] RBP: 0000000000000000 R08: 0000000000000060 R09: 00005618a943af60
> [   12.729361] R10: 0000000000000038 R11: 0000000000000246 R12: 000056187aea098b
> [   12.731101] R13: 0000000000040000 R14: 00005618a9439ac0 R15: 0000000000000000
> [   12.732814]  </TASK>

Below is a quick fix, I'll revisit module - kmemleak interaction in v6


diff --git a/kernel/module/debug_kmemleak.c b/kernel/module/debug_kmemleak.c
index b4cc03842d70..df873dad049d 100644
--- a/kernel/module/debug_kmemleak.c
+++ b/kernel/module/debug_kmemleak.c
@@ -14,7 +14,8 @@ void kmemleak_load_module(const struct module *mod,
 {
 	/* only scan writable, non-executable sections */
 	for_each_mod_mem_type(type) {
-		if (type != MOD_DATA && type != MOD_INIT_DATA)
+		if (type != MOD_DATA && type != MOD_INIT_DATA &&
+		    !mod->mem[type].is_rox)
 			kmemleak_no_scan(mod->mem[type].base);
 	}
 }

-- 
Sincerely yours,
Mike.

^ permalink raw reply related

* Re: [PATCH v5 7/8] execmem: add support for cache of large ROX pages
From: Mike Rapoport @ 2024-10-10 12:57 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andrew Morton, Andreas Larsson, Andy Lutomirski, Ard Biesheuvel,
	Arnd Bergmann, Borislav Petkov, Brian Cain, Catalin Marinas,
	Christophe Leroy, Dave Hansen, Dinh Nguyen, Geert Uytterhoeven,
	Guo Ren, Helge Deller, Huacai Chen, Ingo Molnar, Johannes Berg,
	John Paul Adrian Glaubitz, Kent Overstreet, Liam R. Howlett,
	Luis Chamberlain, Mark Rutland, Masami Hiramatsu, Matt Turner,
	Max Filippov, Michael Ellerman, Michal Simek, Oleg Nesterov,
	Palmer Dabbelt, Peter Zijlstra, Richard Weinberger, Russell King,
	Song Liu, Stafford Horne, Steven Rostedt, Thomas Bogendoerfer,
	Thomas Gleixner, Uladzislau Rezki, Vineet Gupta, Will Deacon, bpf,
	linux-alpha, linux-arch, linux-arm-kernel, linux-csky,
	linux-hexagon, linux-kernel, linux-m68k, linux-mips, linux-mm,
	linux-modules, linux-openrisc, linux-parisc, linux-riscv,
	linux-sh, linux-snps-arc, linux-trace-kernel, linux-um,
	linuxppc-dev, loongarch, sparclinux, x86
In-Reply-To: <Zwd7GRyBtCwiAv1v@infradead.org>

On Wed, Oct 09, 2024 at 11:58:33PM -0700, Christoph Hellwig wrote:
> On Wed, Oct 09, 2024 at 09:08:15PM +0300, Mike Rapoport wrote:
> >  /**
> >   * struct execmem_info - architecture parameters for code allocations
> > + * @fill_trapping_insns: set memory to contain instructions that will trap
> >   * @ranges: array of parameter sets defining architecture specific
> >   * parameters for executable memory allocations. The ranges that are not
> >   * explicitly initialized by an architecture use parameters defined for
> >   * @EXECMEM_DEFAULT.
> >   */
> >  struct execmem_info {
> > +	void (*fill_trapping_insns)(void *ptr, size_t size, bool writable);
> >  	struct execmem_range	ranges[EXECMEM_TYPE_MAX];
> 
> Why is the filler an indirect function call and not an architecture
> hook?

The idea is to keep everything together and have execmem_info describe all
that architecture needs. 

-- 
Sincerely yours,
Mike.

^ permalink raw reply

* Re: [PATCH net-next v25 10/13] net: add SO_DEVMEM_DONTNEED setsockopt to release RX frags
From: Lai, Yi @ 2024-10-10 11:16 UTC (permalink / raw)
  To: Mina Almasry
  Cc: netdev, linux-kernel, linux-doc, linux-alpha, linux-mips,
	linux-parisc, sparclinux, linux-trace-kernel, linux-arch, bpf,
	linux-kselftest, linux-media, dri-devel, Donald Hunter,
	Jakub Kicinski, David S. Miller, Eric Dumazet, Paolo Abeni,
	Jonathan Corbet, Richard Henderson, Ivan Kokshaysky, Matt Turner,
	Thomas Bogendoerfer, James E.J. Bottomley, Helge Deller,
	Andreas Larsson, Jesper Dangaard Brouer, Ilias Apalodimas,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Arnd Bergmann, Steffen Klassert, Herbert Xu, David Ahern,
	Willem de Bruijn, Björn Töpel, Magnus Karlsson,
	Maciej Fijalkowski, Jonathan Lemon, Shuah Khan,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend, Sumit Semwal,
	Christian König, Pavel Begunkov, David Wei, Jason Gunthorpe,
	Yunsheng Lin, Shailend Chand, Harshitha Ramamurthy, Shakeel Butt,
	Jeroen de Borst, Praveen Kaligineedi, Bagas Sanjaya,
	Christoph Hellwig, Nikolay Aleksandrov, Taehee Yoo,
	Willem de Bruijn, Kaiyuan Zhang, yi1.lai
In-Reply-To: <20240909054318.1809580-11-almasrymina@google.com>

Hi Mina Almasry,

Greetings!

I used Syzkaller and found that there is BUG: soft lockup inqt in linux-next tree next-20241008

After bisection and the first bad commit is:
"
678f6e28b5f6 net: add SO_DEVMEM_DONTNEED setsockopt to release RX frags
"

All detailed into can be found at:
https://github.com/laifryiee/syzkaller_logs/tree/main/241009_103423_do_sock_setsockopt
Syzkaller repro code:
https://github.com/laifryiee/syzkaller_logs/tree/main/241009_103423_do_sock_setsockopt/repro.c
Syzkaller repro syscall steps:
https://github.com/laifryiee/syzkaller_logs/tree/main/241009_103423_do_sock_setsockopt/repro.prog
Syzkaller report:
https://github.com/laifryiee/syzkaller_logs/tree/main/241009_103423_do_sock_setsockopt/repro.report
Kconfig(make olddefconfig):
https://github.com/laifryiee/syzkaller_logs/tree/main/241009_103423_do_sock_setsockopt/kconfig_origin
Bisect info:
https://github.com/laifryiee/syzkaller_logs/tree/main/241009_103423_do_sock_setsockopt/bisect_info.log
bzImage:
https://github.com/laifryiee/syzkaller_logs/raw/refs/heads/main/241009_103423_do_sock_setsockopt/bzImage_8cf0b93919e13d1e8d4466eb4080a4c4d9d66d7b
Issue dmesg:
https://github.com/laifryiee/syzkaller_logs/blob/main/241009_103423_do_sock_setsockopt/8cf0b93919e13d1e8d4466eb4080a4c4d9d66d7b_dmesg.log

"
[   48.825073]  ? __lock_acquire+0x1b0f/0x5c90
[   48.825419]  ? __pfx___lock_acquire+0x10/0x10
[   48.825774]  sock_setsockopt+0x68/0x90
[   48.826117]  do_sock_setsockopt+0x3fb/0x480
[   48.826455]  ? __pfx_do_sock_setsockopt+0x10/0x10
[   48.826829]  ? lock_release+0x441/0x870
[   48.827140]  ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
[   48.827558]  ? fdget+0x188/0x230
[   48.827846]  __sys_setsockopt+0x131/0x200
[   48.828184]  ? __pfx___sys_setsockopt+0x10/0x10
[   48.828551]  ? seqcount_lockdep_reader_access.constprop.0+0xc0/0xd0
[   48.829042]  ? __sanitizer_cov_trace_cmp4+0x1a/0x20
[   48.829425]  ? ktime_get_coarse_real_ts64+0xbf/0xf0
[   48.829817]  __x64_sys_setsockopt+0xc6/0x160
[   48.830160]  ? syscall_trace_enter+0x14a/0x230
[   48.830520]  x64_sys_call+0x6cf/0x20d0
[   48.830825]  do_syscall_64+0x6d/0x140
[   48.831124]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   48.831517] RIP: 0033:0x7f26cdc3ee5d
[   48.831804] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 93 af 1b 00 f7 d8 64 89 01 48
[   48.833180] RSP: 002b:00007fff33f36278 EFLAGS: 00000213 ORIG_RAX: 0000000000000036
[   48.833756] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f26cdc3ee5d
[   48.834294] RDX: 0000000000000050 RSI: 0000000000000001 RDI: 0000000000000003
[   48.834830] RBP: 00007fff33f36290 R08: 0000000000000010 R09: 00007fff33f36290
[   48.835368] R10: 0000000020000080 R11: 0000000000000213 R12: 00007fff33f363e8
[   48.835906] R13: 000000000040178f R14: 0000000000403e08 R15: 00007f26cde51000
[   48.836466]  </TASK>
[   48.836648] Kernel panic - not syncing: softlockup: hung tasks
[   48.837096] CPU: 1 UID: 0 PID: 729 Comm: repro Tainted: G             L     6.12.0-rc2-8cf0b93919e1 #1
[   48.837796] Tainted: [L]=SOFTLOCKUP
[   48.838071] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[   48.838916] Call Trace:
[   48.839113]  <IRQ>
[   48.839282]  dump_stack_lvl+0x42/0x150
[   48.839584]  dump_stack+0x19/0x20
[   48.839846]  panic+0x703/0x790
[   48.840100]  ? __pfx_panic+0x10/0x10
[   48.840394]  ? watchdog_timer_fn+0x599/0x6b0
[   48.840727]  ? watchdog_timer_fn+0x58c/0x6b0
[   48.841065]  watchdog_timer_fn+0x5aa/0x6b0
[   48.841382]  ? __pfx_watchdog_timer_fn+0x10/0x10
[   48.841743]  __hrtimer_run_queues+0x5d6/0xc30
[   48.842091]  ? __pfx___hrtimer_run_queues+0x10/0x10
[   48.842473]  hrtimer_interrupt+0x324/0x7a0
[   48.842802]  __sysvec_apic_timer_interrupt+0x10b/0x410
[   48.843198]  ? debug_smp_processor_id+0x20/0x30
[   48.843551]  sysvec_apic_timer_interrupt+0xaf/0xd0
[   48.843922]  </IRQ>
[   48.844101]  <TASK>
[   48.844275]  asm_sysvec_apic_timer_interrupt+0x1f/0x30
[   48.844711] RIP: 0010:__sanitizer_cov_trace_pc+0x45/0x70
[   48.845130] Code: a9 00 01 ff 00 74 1d f6 c4 01 74 43 a9 00 00 0f 00 75 3c a9 00 00 f0 00 75 35 8b 82 04 1e 00 00 85 c0 74 2b 8b 82 e0 1d 00 00 <83> f8 02 75 20 48 8b 8a e8 1d 00 00 8b 92 e4 1d 00 00 48 8b 01 48
[   48.846480] RSP: 0018:ffff8880239cf790 EFLAGS: 00000246
[   48.846876] RAX: 0000000000000000 RBX: ffff8880239cf900 RCX: ffffffff8581c19f
[   48.847407] RDX: ffff88801a818000 RSI: ffffffff8581c1d5 RDI: 0000000000000007
[   48.847933] RBP: ffff8880239cf790 R08: 0000000000000001 R09: ffffed1004739f23
[   48.848472] R10: 0000000077cc006e R11: 0000000000000001 R12: 0000000000000000
[   48.849002] R13: 0000000077cc006e R14: ffff8880239cf918 R15: 0000000000000000
[   48.849536]  ? xas_start+0x11f/0x730
[   48.849818]  ? xas_start+0x155/0x730
[   48.850101]  xas_start+0x155/0x730
[   48.850372]  xas_load+0x2f/0x520
[   48.850629]  ? irqentry_exit+0x3e/0xa0
[   48.850922]  ? sysvec_apic_timer_interrupt+0x6a/0xd0
[   48.851304]  xas_store+0x1165/0x1ad0
[   48.851588]  ? __this_cpu_preempt_check+0x21/0x30
[   48.851950]  ? irqentry_exit+0x3e/0xa0
[   48.852254]  __xa_erase+0xc6/0x180
[   48.852524]  ? __pfx___xa_erase+0x10/0x10
[   48.852842]  ? __xa_erase+0xf1/0x180
[   48.853123]  ? sock_devmem_dontneed+0x42c/0x6d0
[   48.853480]  sock_devmem_dontneed+0x3a8/0x6d0
[   48.853829]  ? __pfx_sock_devmem_dontneed+0x10/0x10
[   48.854205]  ? trace_lock_acquire+0x139/0x1b0
[   48.854548]  ? lock_acquire+0x80/0xb0
[   48.854833]  ? __might_fault+0xf1/0x1b0
[   48.855133]  ? __might_fault+0xf1/0x1b0
[   48.855437]  ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
[   48.855849]  sk_setsockopt+0x480/0x3c60
[   48.856158]  ? __pfx_sk_setsockopt+0x10/0x10
[   48.856491]  ? __kasan_check_read+0x15/0x20
[   48.856814]  ? __lock_acquire+0x1b0f/0x5c90
[   48.857144]  ? __pfx___lock_acquire+0x10/0x10
[   48.857488]  sock_setsockopt+0x68/0x90
[   48.857785]  do_sock_setsockopt+0x3fb/0x480
[   48.858110]  ? __pfx_do_sock_setsockopt+0x10/0x10
[   48.858474]  ? lock_release+0x441/0x870
[   48.858776]  ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
[   48.859184]  ? fdget+0x188/0x230
[   48.859448]  __sys_setsockopt+0x131/0x200
[   48.859764]  ? __pfx___sys_setsockopt+0x10/0x10
[   48.860123]  ? seqcount_lockdep_reader_access.constprop.0+0xc0/0xd0
[   48.860598]  ? __sanitizer_cov_trace_cmp4+0x1a/0x20
[   48.860982]  ? ktime_get_coarse_real_ts64+0xbf/0xf0
[   48.861370]  __x64_sys_setsockopt+0xc6/0x160
[   48.861710]  ? syscall_trace_enter+0x14a/0x230
[   48.862057]  x64_sys_call+0x6cf/0x20d0
[   48.862350]  do_syscall_64+0x6d/0x140
[   48.862639]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   48.863023] RIP: 0033:0x7f26cdc3ee5d
[   48.863301] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 93 af 1b 00 f7 d8 64 89 01 48
[   48.864659] RSP: 002b:00007fff33f36278 EFLAGS: 00000213 ORIG_RAX: 0000000000000036
[   48.865223] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f26cdc3ee5d
"

I hope you find it useful.

Regards,
Yi Lai

---

If you don't need the following environment to reproduce the problem or if you
already have one reproduced environment, please ignore the following information.

How to reproduce:
git clone https://gitlab.com/xupengfe/repro_vm_env.git
cd repro_vm_env
tar -xvf repro_vm_env.tar.gz
cd repro_vm_env; ./start3.sh  // it needs qemu-system-x86_64 and I used v7.1.0
  // start3.sh will load bzImage_2241ab53cbb5cdb08a6b2d4688feb13971058f65 v6.2-rc5 kernel
  // You could change the bzImage_xxx as you want
  // Maybe you need to remove line "-drive if=pflash,format=raw,readonly=on,file=./OVMF_CODE.fd \" for different qemu version
You could use below command to log in, there is no password for root.
ssh -p 10023 root@localhost

After login vm(virtual machine) successfully, you could transfer reproduced
binary to the vm by below way, and reproduce the problem in vm:
gcc -pthread -o repro repro.c
scp -P 10023 repro root@localhost:/root/

Get the bzImage for target kernel:
Please use target kconfig and copy it to kernel_src/.config
make olddefconfig
make -jx bzImage           //x should equal or less than cpu num your pc has

Fill the bzImage file into above start3.sh to load the target kernel in vm.


Tips:
If you already have qemu-system-x86_64, please ignore below info.
If you want to install qemu v7.1.0 version:
git clone https://github.com/qemu/qemu.git
cd qemu
git checkout -f v7.1.0
mkdir build
cd build
yum install -y ninja-build.x86_64
yum -y install libslirp-devel.x86_64
../configure --target-list=x86_64-softmmu --enable-kvm --enable-vnc --enable-gtk --enable-sdl --enable-usb-redir --enable-slirp
make
make install 

On Mon, Sep 09, 2024 at 05:43:15AM +0000, Mina Almasry wrote:
> Add an interface for the user to notify the kernel that it is done
> reading the devmem dmabuf frags returned as cmsg. The kernel will
> drop the reference on the frags to make them available for reuse.
> 
> Signed-off-by: Willem de Bruijn <willemb@google.com>
> Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
> Signed-off-by: Mina Almasry <almasrymina@google.com>
> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
> Reviewed-by: Eric Dumazet <edumazet@google.com>
> 
> ---
> 
> v16:
> - Use sk_is_tcp().
> - Fix unnamed 128 DONTNEED limit (David).
> - Fix kernel allocating for 128 tokens even if the user didn't ask for
>   that much (Eric).
> - Fix number assignement (Arnd).
> 
> v10:
> - Fix leak of tokens (Nikolay).
> 
> v7:
> - Updated SO_DEVMEM_* uapi to use the next available entry (Arnd).
> 
> v6:
> - Squash in locking optimizations from edumazet@google.com. With his
>   changes we lock the xarray once per sock_devmem_dontneed operation
>   rather than once per frag.
> 
> Changes in v1:
> - devmemtoken -> dmabuf_token (David).
> - Use napi_pp_put_page() for refcounting (Yunsheng).
> - Fix build error with missing socket options on other asms.
> 
> ---
>  arch/alpha/include/uapi/asm/socket.h  |  1 +
>  arch/mips/include/uapi/asm/socket.h   |  1 +
>  arch/parisc/include/uapi/asm/socket.h |  1 +
>  arch/sparc/include/uapi/asm/socket.h  |  1 +
>  include/uapi/asm-generic/socket.h     |  1 +
>  include/uapi/linux/uio.h              |  4 ++
>  net/core/sock.c                       | 68 +++++++++++++++++++++++++++
>  7 files changed, 77 insertions(+)
> 
> diff --git a/arch/alpha/include/uapi/asm/socket.h b/arch/alpha/include/uapi/asm/socket.h
> index ef4656a41058..251b73c5481e 100644
> --- a/arch/alpha/include/uapi/asm/socket.h
> +++ b/arch/alpha/include/uapi/asm/socket.h
> @@ -144,6 +144,7 @@
>  #define SCM_DEVMEM_LINEAR	SO_DEVMEM_LINEAR
>  #define SO_DEVMEM_DMABUF	79
>  #define SCM_DEVMEM_DMABUF	SO_DEVMEM_DMABUF
> +#define SO_DEVMEM_DONTNEED	80
>  
>  #if !defined(__KERNEL__)
>  
> diff --git a/arch/mips/include/uapi/asm/socket.h b/arch/mips/include/uapi/asm/socket.h
> index 414807d55e33..8ab7582291ab 100644
> --- a/arch/mips/include/uapi/asm/socket.h
> +++ b/arch/mips/include/uapi/asm/socket.h
> @@ -155,6 +155,7 @@
>  #define SCM_DEVMEM_LINEAR	SO_DEVMEM_LINEAR
>  #define SO_DEVMEM_DMABUF	79
>  #define SCM_DEVMEM_DMABUF	SO_DEVMEM_DMABUF
> +#define SO_DEVMEM_DONTNEED	80
>  
>  #if !defined(__KERNEL__)
>  
> diff --git a/arch/parisc/include/uapi/asm/socket.h b/arch/parisc/include/uapi/asm/socket.h
> index 2b817efd4544..38fc0b188e08 100644
> --- a/arch/parisc/include/uapi/asm/socket.h
> +++ b/arch/parisc/include/uapi/asm/socket.h
> @@ -136,6 +136,7 @@
>  #define SCM_DEVMEM_LINEAR	SO_DEVMEM_LINEAR
>  #define SO_DEVMEM_DMABUF	79
>  #define SCM_DEVMEM_DMABUF	SO_DEVMEM_DMABUF
> +#define SO_DEVMEM_DONTNEED	80
>  
>  #if !defined(__KERNEL__)
>  
> diff --git a/arch/sparc/include/uapi/asm/socket.h b/arch/sparc/include/uapi/asm/socket.h
> index 00248fc68977..57084ed2f3c4 100644
> --- a/arch/sparc/include/uapi/asm/socket.h
> +++ b/arch/sparc/include/uapi/asm/socket.h
> @@ -137,6 +137,7 @@
>  #define SCM_DEVMEM_LINEAR        SO_DEVMEM_LINEAR
>  #define SO_DEVMEM_DMABUF         0x0058
>  #define SCM_DEVMEM_DMABUF        SO_DEVMEM_DMABUF
> +#define SO_DEVMEM_DONTNEED       0x0059
>  
>  #if !defined(__KERNEL__)
>  
> diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h
> index e993edc9c0ee..3b4e3e815602 100644
> --- a/include/uapi/asm-generic/socket.h
> +++ b/include/uapi/asm-generic/socket.h
> @@ -139,6 +139,7 @@
>  #define SCM_DEVMEM_LINEAR	SO_DEVMEM_LINEAR
>  #define SO_DEVMEM_DMABUF	79
>  #define SCM_DEVMEM_DMABUF	SO_DEVMEM_DMABUF
> +#define SO_DEVMEM_DONTNEED	80
>  
>  #if !defined(__KERNEL__)
>  
> diff --git a/include/uapi/linux/uio.h b/include/uapi/linux/uio.h
> index 3a22ddae376a..d17f8fcd93ec 100644
> --- a/include/uapi/linux/uio.h
> +++ b/include/uapi/linux/uio.h
> @@ -33,6 +33,10 @@ struct dmabuf_cmsg {
>  				 */
>  };
>  
> +struct dmabuf_token {
> +	__u32 token_start;
> +	__u32 token_count;
> +};
>  /*
>   *	UIO_MAXIOV shall be at least 16 1003.1g (5.4.1.1)
>   */
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 468b1239606c..bbb57b5af0b1 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -124,6 +124,7 @@
>  #include <linux/netdevice.h>
>  #include <net/protocol.h>
>  #include <linux/skbuff.h>
> +#include <linux/skbuff_ref.h>
>  #include <net/net_namespace.h>
>  #include <net/request_sock.h>
>  #include <net/sock.h>
> @@ -1049,6 +1050,69 @@ static int sock_reserve_memory(struct sock *sk, int bytes)
>  	return 0;
>  }
>  
> +#ifdef CONFIG_PAGE_POOL
> +
> +/* This is the number of tokens that the user can SO_DEVMEM_DONTNEED in
> + * 1 syscall. The limit exists to limit the amount of memory the kernel
> + * allocates to copy these tokens.
> + */
> +#define MAX_DONTNEED_TOKENS 128
> +
> +static noinline_for_stack int
> +sock_devmem_dontneed(struct sock *sk, sockptr_t optval, unsigned int optlen)
> +{
> +	unsigned int num_tokens, i, j, k, netmem_num = 0;
> +	struct dmabuf_token *tokens;
> +	netmem_ref netmems[16];
> +	int ret = 0;
> +
> +	if (!sk_is_tcp(sk))
> +		return -EBADF;
> +
> +	if (optlen % sizeof(struct dmabuf_token) ||
> +	    optlen > sizeof(*tokens) * MAX_DONTNEED_TOKENS)
> +		return -EINVAL;
> +
> +	tokens = kvmalloc_array(optlen, sizeof(*tokens), GFP_KERNEL);
> +	if (!tokens)
> +		return -ENOMEM;
> +
> +	num_tokens = optlen / sizeof(struct dmabuf_token);
> +	if (copy_from_sockptr(tokens, optval, optlen)) {
> +		kvfree(tokens);
> +		return -EFAULT;
> +	}
> +
> +	xa_lock_bh(&sk->sk_user_frags);
> +	for (i = 0; i < num_tokens; i++) {
> +		for (j = 0; j < tokens[i].token_count; j++) {
> +			netmem_ref netmem = (__force netmem_ref)__xa_erase(
> +				&sk->sk_user_frags, tokens[i].token_start + j);
> +
> +			if (netmem &&
> +			    !WARN_ON_ONCE(!netmem_is_net_iov(netmem))) {
> +				netmems[netmem_num++] = netmem;
> +				if (netmem_num == ARRAY_SIZE(netmems)) {
> +					xa_unlock_bh(&sk->sk_user_frags);
> +					for (k = 0; k < netmem_num; k++)
> +						WARN_ON_ONCE(!napi_pp_put_page(netmems[k]));
> +					netmem_num = 0;
> +					xa_lock_bh(&sk->sk_user_frags);
> +				}
> +				ret++;
> +			}
> +		}
> +	}
> +
> +	xa_unlock_bh(&sk->sk_user_frags);
> +	for (k = 0; k < netmem_num; k++)
> +		WARN_ON_ONCE(!napi_pp_put_page(netmems[k]));
> +
> +	kvfree(tokens);
> +	return ret;
> +}
> +#endif
> +
>  void sockopt_lock_sock(struct sock *sk)
>  {
>  	/* When current->bpf_ctx is set, the setsockopt is called from
> @@ -1211,6 +1275,10 @@ int sk_setsockopt(struct sock *sk, int level, int optname,
>  			ret = -EOPNOTSUPP;
>  		return ret;
>  		}
> +#ifdef CONFIG_PAGE_POOL
> +	case SO_DEVMEM_DONTNEED:
> +		return sock_devmem_dontneed(sk, optval, optlen);
> +#endif
>  	}
>  
>  	sockopt_lock_sock(sk);
> -- 
> 2.46.0.469.g59c65b2a67-goog
> 

^ permalink raw reply

* Re: [PATCH v5 7/8] execmem: add support for cache of large ROX pages
From: Mike Rapoport @ 2024-10-10  9:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andreas Larsson, Andy Lutomirski, Ard Biesheuvel, Arnd Bergmann,
	Borislav Petkov, Brian Cain, Catalin Marinas, Christoph Hellwig,
	Christophe Leroy, Dave Hansen, Dinh Nguyen, Geert Uytterhoeven,
	Guo Ren, Helge Deller, Huacai Chen, Ingo Molnar, Johannes Berg,
	John Paul Adrian Glaubitz, Kent Overstreet, Liam R. Howlett,
	Luis Chamberlain, Mark Rutland, Masami Hiramatsu, Matt Turner,
	Max Filippov, Michael Ellerman, Michal Simek, Oleg Nesterov,
	Palmer Dabbelt, Peter Zijlstra, Richard Weinberger, Russell King,
	Song Liu, Stafford Horne, Steven Rostedt, Thomas Bogendoerfer,
	Thomas Gleixner, Uladzislau Rezki, Vineet Gupta, Will Deacon, bpf,
	linux-alpha, linux-arch, linux-arm-kernel, linux-csky,
	linux-hexagon, linux-kernel, linux-m68k, linux-mips, linux-mm,
	linux-modules, linux-openrisc, linux-parisc, linux-riscv,
	linux-sh, linux-snps-arc, linux-trace-kernel, linux-um,
	linuxppc-dev, loongarch, sparclinux, x86
In-Reply-To: <20241009132427.5c94fb5942bae3832446bca5@linux-foundation.org>

On Wed, Oct 09, 2024 at 01:24:27PM -0700, Andrew Morton wrote:
> On Wed,  9 Oct 2024 21:08:15 +0300 Mike Rapoport <rppt@kernel.org> wrote:
> 
> > Using large pages to map text areas reduces iTLB pressure and improves
> > performance.
> 
> Are there any measurable performance improvements?

I don't have any numbers, I just followed the common sense of "less TLB
entries is better" and relied on Thomas comments from previous discussions.
 
> What are the effects of this series upon overall memory consumption?
 
There will be some execmem cache fragmentation and an increase in memory
consumption. It depends on the actual modules loaded and how large it the
fragmentation.

For a set of pretty randomly chosen modules where most come from
net/netfilter I see an increase from 19M to 25M.

> The lack of acks is a bit surprising for a v5 patch, but I'll add all
> this to mm.git for some testing, thanks.
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply

* Re: [PATCH] asm-generic: provide generic page_to_phys and phys_to_page implementations
From: Christoph Hellwig @ 2024-10-10  8:37 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Christoph Hellwig, linux-alpha, linux-kernel, linux-snps-arc,
	linux-arm-kernel, linux-csky@vger.kernel.org, linux-hexagon,
	loongarch, linux-m68k, linux-mips, linux-openrisc@vger.kernel.org,
	linux-parisc, linuxppc-dev, linux-riscv, linux-s390, linux-sh,
	sparclinux, linux-um, Linux-Arch, Christophe Leroy
In-Reply-To: <20241010070342.GB6674@lst.de>

On Thu, Oct 10, 2024 at 09:03:42AM +0200, Christoph Hellwig wrote:
> > I think we should try to have a little fewer nested macros
> > to evaluate here, right now this ends up expanding
> > __pfn_to_phys, PFN_PHYS, PAGE_SHIFT, CONFIG_PAGE_SHIFT,
> > page_to_pfn and __page_to_pfn. While the behavior is fine,
> > modern gcc versions list all of those in an warning message
> > if someone passes the wrong arguments.
> > 
> > Changing the two macros above into inline functions
> > would help as well, but may cause other problems.
> 
> Doing them as inlines seems useful to me, let me throw that at
> the buildbot and see if anything explodes.

The inline version instantly blows up, so I'll try just open coding
the phys to/from pfn translation instead.


^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox