* [PATCH v21 0/4] implement getrandom() in vDSO
@ 2024-07-07  0:26 Jason A. Donenfeld
  2024-07-07  0:26 ` [PATCH v21 1/4] mm: add VM_DROPPABLE for designating always lazily freeable mappings Jason A. Donenfeld
                   ` (3 more replies)
  0 siblings, 4 replies; 28+ messages in thread
From: Jason A. Donenfeld @ 2024-07-07  0:26 UTC (permalink / raw)
  To: linux-kernel, patches, tglx
  Cc: Jason A. Donenfeld, linux-crypto, linux-api, x86, Linus Torvalds,
	Greg Kroah-Hartman, Adhemerval Zanella Netto, Carlos O'Donell,
	Florian Weimer, Arnd Bergmann, Jann Horn, Christian Brauner,
	David Hildenbrand

The plan for this series is to take it through my random.git tree for 6.11.
It's cooking in linux-next now.

Changes v20->v21:

- After extensive conversation with Linus, we're nixing the entire
  vgetrandom_alloc() syscall, in favor of just exposing the functionality
  needed through mmap() and having the kernel communicate to the caller what
  arguments/sizes it should pass to mmap(). This simplifies the series
  considerably. It also means that the first commit adds some new MAP_*
  constants for mmap().

- Split the vDSO selftests out into a separate commit.

--------------

Useful links:

- This series:
  - https://git.kernel.org/pub/scm/linux/kernel/git/crng/random.git/log/

- In case you're actually interested in the v≤14 design where faults were
  non-fatal and instructions were skipped (which I think is more coherent, even
  if the implementation is controversial), this lives in my branch here:
  - https://git.kernel.org/pub/scm/linux/kernel/git/crng/random.git/log/?h=jd/vdso-skip-insn
  Note that I'm *not* actually proposing this for upstream at this time. But it
  may be of conversational interest.

-------------

Two statements:

  1) Userspace wants faster cryptographically secure random numbers of
     arbitrary size, big or small.

  2) Userspace is currently unable to safely roll its own RNG with the
     same security profile as getrandom().

Statement (1) has been debated for years, with arguments ranging from
"we need faster cryptographically secure card shuffling!" to "the only
things that actually need good randomness are keys, which are few and
far between" to "actually, TLS CBC nonces are frequent" and so on. I
don't intend to wade into that debate substantially, except to note that
recently glibc added arc4random(), whose goal is to return a
cryptographically secure uint32_t, and there are real user reports of it
being too slow. So here we are.

Statement (2) is more interesting. The kernel is the nexus of all
entropic inputs that influence the RNG. It is in the best position, and
probably the only position, to decide anything at all about the current
state of the RNG and of its entropy. One of the things it uniquely knows
about is when reseeding is necessary.

For example, when a virtual machine is forked, restored, or duplicated,
it's imperative that the RNG doesn't generate the same outputs. For this
reason, there's a small protocol between hypervisors and the kernel that
indicates this has happened, alongside some ID, which the RNG uses to
immediately reseed, so as not to return the same numbers. Were userspace
to expand a getrandom() seed from time T1 for the next hour, and at some
point T2 within that hour the virtual machine forked, userspace would continue to
provide the same numbers to two (or more) different virtual machines,
resulting in potential cryptographic catastrophe. Something similar
happens on resuming from hibernation (or even suspend), with various
compromise scenarios there in mind.

There's a more general reason why userspace rolling its own RNG from a
getrandom() seed is fraught. There's a lot of attention paid to this
particular Linuxism we have of the RNG being initialized and thus
non-blocking or uninitialized and thus blocking until it is initialized.
These are our Two Big States that many hold to be the holy
differentiating factor between safe and not safe, between
cryptographically secure and garbage. The fact is, however, that the
distinction between these two states is a hand-wavy wishy-washy inexact
approximation. Outside of a few exceptional cases (e.g. a HW RNG is
available), we actually don't really ever know with any rigor at all
when the RNG is safe and ready (nor when it's compromised). We do the
best we can to "estimate" it, but entropy estimation is fundamentally
impossible in the general case. So really, we're just doing guesswork,
and hoping it's good and conservative enough. Let's then assume that
there's always some potential error involved in this differentiator.

In fact, under the surface, the RNG is engineered around a different
principle, and that is trying to *use* new entropic inputs regularly and
at the right specific moments in time. For example, close to boot time,
the RNG reseeds itself more often than later. At certain events, like VM
fork, the RNG reseeds itself immediately. The various heuristics for
when the RNG will use new entropy, and how often, are really a core aspect
of what the RNG has some potential to do decently enough (and something
that will probably continue to improve in the future from random.c's
present set of algorithms). So in your mind, put away the mental
attachment to the Two Big States, which represent an approximation with
a potential margin of error. Instead keep in mind that the RNG's primary
operating heuristic is how often and exactly when it's going to reseed.

So, if userspace takes a seed from getrandom() at point T1, and uses it
for the next hour (or N megabytes or some other meaningless metric),
during that time, potential errors in the Two Big States approximation
are amplified. During that time potential reseeds are being lost,
forgotten, not reflected in the output stream. That's not good.

The simplest statement you could make is that userspace RNGs that expand
a getrandom() seed at some point T1 are nearly always *worse*, in some
way, than just calling getrandom() every time a random number is
desired.

For those reasons, after some discussion on libc-alpha, glibc's
arc4random() now just calls getrandom() on each invocation. That's
trivially safe, and gives us latitude to then make the safe thing faster
without becoming unsafe, at our leisure. Card shuffling isn't
particularly fast, however.

How do we rectify this? By putting a safe implementation of getrandom()
in the vDSO, which has access to whatever information a
particular iteration of random.c is using to make its decisions. I use
that careful language of "particular iteration of random.c", because the
set of things that a vDSO getrandom() implementation might need for making
decisions as good as the kernel's will likely change over time. This
isn't just a matter of exporting certain *data* to userspace. We're not
going to commit to a "data API" where the various heuristics used are
exposed, locking in how the kernel works for decades to come, and then
leave it to various userspaces to roll something on top and shoot
themselves in the foot and have all sorts of complexity disasters.
Rather, vDSO getrandom() is supposed to be the *same exact algorithm*
that runs in the kernel, except it's been hoisted into userspace as
much as possible. And so vDSO getrandom() and kernel getrandom() will
always mirror each other hermetically.

API-wise, the vDSO gains this function:

  ssize_t vgetrandom(void *buffer, size_t len, unsigned int flags,
                     void *opaque_state, size_t opaque_len);

The return value and the first three arguments are the same as ordinary
getrandom(), while the last two arguments are a pointer to some state
allocated with the right flags passed to mmap(2), explained below, and the
size of that state. Were all five arguments passed to the getrandom()
syscall, nothing different would happen, and the functions would have the
exact same behavior.

If vgetrandom(NULL, 0, 0, &params, ~0UL) is called, then params gets populated
with information about what flags and prot fields to pass to mmap(2), as well
as how big each state should be, so that the caller can slice up returned
memory from mmap(2) into chunks for passing to vgetrandom().
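
As a purely illustrative sketch (NUM_STATES and thread_index are
hypothetical, and error handling and the no-page-straddling rule are
elided), the flow might look like:

  struct vgetrandom_opaque_params params;
  unsigned char buf[256];
  void *pool, *state;

  /* Ask the vDSO function how the states must be allocated. */
  if (vgetrandom(NULL, 0, 0, &params, ~0UL))
          abort();

  /* Allocate a pool of states with the prescribed prot and flags. */
  pool = mmap(NULL, NUM_STATES * params.size_of_opaque_state,
              params.mmap_prot, params.mmap_flags, -1, 0);

  /* Give each thread its own slice, and draw bytes through it. */
  state = pool + thread_index * params.size_of_opaque_state;
  vgetrandom(buf, sizeof(buf), 0, state, params.size_of_opaque_state);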

Libc is expected to allocate a chunk of these on first use, and then
dole them out to threads as they're created, allocating more when
needed.

The interesting meat of the implementation is in lib/vdso/getrandom.c,
as generic C code, and it aims to mainly follow random.c's buffered fast
key erasure logic. Before the RNG is initialized, it falls back to the
syscall. Right now it uses a simple generation counter to make its decisions
on reseeding (though this could be made more extensive over time).

The actual place that has the most work to do is in all of the other
files. Most of the vDSO shared page infrastructure is centered around
gettimeofday, and so the main structs are all in arrays for different
timestamp types, and attached to time namespaces, and so forth. I've
done the best I could to add onto this in an unintrusive way.

In my test results, performance is pretty stellar (around 15x faster than
the syscall for uint32_t generation), and it seems to be working. There's
an extended example in the
last commit of this series, showing how the syscall and the vDSO function
are meant to be used together.

Cc: linux-crypto@vger.kernel.org
Cc: linux-api@vger.kernel.org
Cc: x86@kernel.org
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Adhemerval Zanella Netto <adhemerval.zanella@linaro.org>
Cc: Carlos O'Donell <carlos@redhat.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Jann Horn <jannh@google.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <dhildenb@redhat.com>

Jason A. Donenfeld (4):
  mm: add VM_DROPPABLE for designating always lazily freeable mappings
  random: introduce generic vDSO getrandom() implementation
  x86: vdso: Wire up getrandom() vDSO implementation
  selftests/vDSO: add tests for vgetrandom

 MAINTAINERS                                   |   4 +
 arch/alpha/include/uapi/asm/mman.h            |   3 +
 arch/mips/include/uapi/asm/mman.h             |   3 +
 arch/parisc/include/uapi/asm/mman.h           |   3 +
 arch/x86/Kconfig                              |   1 +
 arch/x86/entry/vdso/Makefile                  |   3 +-
 arch/x86/entry/vdso/vdso.lds.S                |   2 +
 arch/x86/entry/vdso/vgetrandom-chacha.S       | 178 +++++++++++
 arch/x86/entry/vdso/vgetrandom.c              |  17 ++
 arch/x86/include/asm/vdso/getrandom.h         |  55 ++++
 arch/x86/include/asm/vdso/vsyscall.h          |   2 +
 arch/x86/include/asm/vvar.h                   |  16 +
 arch/xtensa/include/uapi/asm/mman.h           |   3 +
 drivers/char/random.c                         |  18 +-
 fs/proc/task_mmu.c                            |   3 +
 include/linux/mm.h                            |   8 +
 include/trace/events/mmflags.h                |   7 +
 include/uapi/asm-generic/mman-common.h        |   4 +
 include/uapi/linux/random.h                   |  15 +
 include/vdso/datapage.h                       |  11 +
 include/vdso/getrandom.h                      |  46 +++
 lib/vdso/Kconfig                              |   6 +
 lib/vdso/getrandom.c                          | 252 +++++++++++++++
 mm/Kconfig                                    |   3 +
 mm/mmap.c                                     |  15 +
 mm/mprotect.c                                 |   2 +-
 mm/rmap.c                                     |  16 +-
 tools/include/asm/rwonce.h                    |   0
 tools/include/uapi/asm-generic/mman-common.h  |   4 +
 tools/testing/selftests/mm/.gitignore         |   1 +
 tools/testing/selftests/mm/Makefile           |   1 +
 tools/testing/selftests/mm/droppable.c        |  54 ++++
 tools/testing/selftests/vDSO/.gitignore       |   2 +
 tools/testing/selftests/vDSO/Makefile         |  15 +
 .../testing/selftests/vDSO/vdso_test_chacha.c |  43 +++
 .../selftests/vDSO/vdso_test_getrandom.c      | 288 ++++++++++++++++++
 36 files changed, 1098 insertions(+), 6 deletions(-)
 create mode 100644 arch/x86/entry/vdso/vgetrandom-chacha.S
 create mode 100644 arch/x86/entry/vdso/vgetrandom.c
 create mode 100644 arch/x86/include/asm/vdso/getrandom.h
 create mode 100644 include/vdso/getrandom.h
 create mode 100644 lib/vdso/getrandom.c
 create mode 100644 tools/include/asm/rwonce.h
 create mode 100644 tools/testing/selftests/mm/droppable.c
 create mode 100644 tools/testing/selftests/vDSO/vdso_test_chacha.c
 create mode 100644 tools/testing/selftests/vDSO/vdso_test_getrandom.c


base-commit: 22a40d14b572deb80c0648557f4bd502d7e83826
-- 
2.45.2



* [PATCH v21 1/4] mm: add VM_DROPPABLE for designating always lazily freeable mappings
  2024-07-07  0:26 [PATCH v21 0/4] implement getrandom() in vDSO Jason A. Donenfeld
@ 2024-07-07  0:26 ` Jason A. Donenfeld
  2024-07-07  7:42   ` David Hildenbrand
  2024-07-07  0:26 ` [PATCH v21 2/4] random: introduce generic vDSO getrandom() implementation Jason A. Donenfeld
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 28+ messages in thread
From: Jason A. Donenfeld @ 2024-07-07  0:26 UTC (permalink / raw)
  To: linux-kernel, patches, tglx
  Cc: Jason A. Donenfeld, linux-crypto, linux-api, x86, Linus Torvalds,
	Greg Kroah-Hartman, Adhemerval Zanella Netto, Carlos O'Donell,
	Florian Weimer, Arnd Bergmann, Jann Horn, Christian Brauner,
	David Hildenbrand, linux-mm

The vDSO getrandom() implementation works with a buffer allocated by
userspace (now via mmap(2), rather than a dedicated system call) that has
certain requirements:

- It shouldn't be written to core dumps.
  * Easy: VM_DONTDUMP.
- It should be zeroed on fork.
  * Easy: VM_WIPEONFORK.

- It shouldn't be written to swap.
  * Uh-oh: mlock is rlimited.
  * Uh-oh: mlock isn't inherited by forks.

It turns out that the vDSO getrandom() function has three really nice
characteristics that we can exploit to solve this problem:

1) Due to being wiped during fork(), the vDSO code is already robust to
   having the contents of the pages it reads zeroed out midway through
   the function's execution.

2) In the absolute worst case of whatever contingency we're coding for,
   we have the option to fall back to the getrandom() syscall, and
   everything is fine.

3) The buffers the function uses are only ever useful for a maximum of
   60 seconds -- a sort of cache, rather than a long-term allocation.

These characteristics mean that we can introduce VM_DROPPABLE, which
has the following semantics:

a) It is never written out to swap.
b) Under memory pressure, mm can just drop the pages (so that they're
   zero when read back again).
c) It is inherited by fork.
d) It doesn't count against the mlock budget, since nothing is locked.

This is fairly simple to implement, with the one snag that we have to
use 64-bit VM_* flags, but this shouldn't be a problem, since the only
consumers will probably be 64-bit anyway.

This way, allocations used by vDSO getrandom() can use:

    VM_DROPPABLE | VM_DONTDUMP | VM_WIPEONFORK | VM_NORESERVE

And there will be no problems with memory lingering when not in use, with
contents surviving fork(), with coredumps, or with writing out to swap.

In order to let vDSO getrandom() use this, expose these via mmap(2) as
well, giving MAP_WIPEONFORK, MAP_DONTDUMP, and MAP_DROPPABLE.
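
For instance, mirroring what the selftest below does, a droppable
anonymous mapping can then be created like so:

    alloc = mmap(NULL, alloc_size, PROT_READ | PROT_WRITE,
                 MAP_ANONYMOUS | MAP_PRIVATE | MAP_DROPPABLE, -1, 0);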

Finally, the provided self test ensures that this is working as desired.

Cc: linux-mm@kvack.org
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
 arch/alpha/include/uapi/asm/mman.h           |  3 ++
 arch/mips/include/uapi/asm/mman.h            |  3 ++
 arch/parisc/include/uapi/asm/mman.h          |  3 ++
 arch/xtensa/include/uapi/asm/mman.h          |  3 ++
 fs/proc/task_mmu.c                           |  3 ++
 include/linux/mm.h                           |  8 +++
 include/trace/events/mmflags.h               |  7 +++
 include/uapi/asm-generic/mman-common.h       |  4 ++
 mm/Kconfig                                   |  3 ++
 mm/mmap.c                                    | 15 ++++++
 mm/mprotect.c                                |  2 +-
 mm/rmap.c                                    | 16 ++++--
 tools/include/uapi/asm-generic/mman-common.h |  4 ++
 tools/testing/selftests/mm/.gitignore        |  1 +
 tools/testing/selftests/mm/Makefile          |  1 +
 tools/testing/selftests/mm/droppable.c       | 54 ++++++++++++++++++++
 16 files changed, 126 insertions(+), 4 deletions(-)
 create mode 100644 tools/testing/selftests/mm/droppable.c

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 763929e814e9..951c54a45676 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -31,6 +31,9 @@
 #define MAP_STACK	0x80000		/* give out an address that is best suited for process/thread stacks */
 #define MAP_HUGETLB	0x100000	/* create a huge page mapping */
 #define MAP_FIXED_NOREPLACE	0x200000/* MAP_FIXED which doesn't unmap underlying mapping */
+#define MAP_WIPEONFORK	0x08000000	/* Zero memory in child forks. */
+#define MAP_DONTDUMP	0x10000000	/* Do not write to coredumps. */
+#define MAP_DROPPABLE	0x20000000	/* Zero memory under memory pressure. */
 
 #define MS_ASYNC	1		/* sync memory asynchronously */
 #define MS_SYNC		2		/* synchronous memory sync */
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index 9c48d9a21aa0..7490a28ec960 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -49,6 +49,9 @@
 #define MAP_STACK	0x40000		/* give out an address that is best suited for process/thread stacks */
 #define MAP_HUGETLB	0x80000		/* create a huge page mapping */
 #define MAP_FIXED_NOREPLACE 0x100000	/* MAP_FIXED which doesn't unmap underlying mapping */
+#define MAP_WIPEONFORK	0x08000000	/* Zero memory in child forks. */
+#define MAP_DONTDUMP	0x10000000	/* Do not write to coredumps. */
+#define MAP_DROPPABLE	0x20000000	/* Zero memory under memory pressure. */
 
 /*
  * Flags for msync
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index 68c44f99bc93..ed03f1d7d06c 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -26,6 +26,9 @@
 #define MAP_HUGETLB	0x80000		/* create a huge page mapping */
 #define MAP_FIXED_NOREPLACE 0x100000	/* MAP_FIXED which doesn't unmap underlying mapping */
 #define MAP_UNINITIALIZED 0		/* uninitialized anonymous mmap */
+#define MAP_WIPEONFORK	0x08000000	/* Zero memory in child forks. */
+#define MAP_DONTDUMP	0x10000000	/* Do not write to coredumps. */
+#define MAP_DROPPABLE	0x20000000	/* Zero memory under memory pressure. */
 
 #define MS_SYNC		1		/* synchronous memory sync */
 #define MS_ASYNC	2		/* sync memory asynchronously */
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 1ff0c858544f..2e777670b7fa 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -58,6 +58,9 @@
 #define MAP_FIXED_NOREPLACE 0x100000	/* MAP_FIXED which doesn't unmap underlying mapping */
 #define MAP_UNINITIALIZED 0x4000000	/* For anonymous mmap, memory could be
 					 * uninitialized */
+#define MAP_WIPEONFORK	0x08000000	/* Zero memory in child forks. */
+#define MAP_DONTDUMP	0x10000000	/* Do not write to coredumps. */
+#define MAP_DROPPABLE	0x20000000	/* Zero memory under memory pressure. */
 
 /*
  * Flags for msync
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 71e5039d940d..b3bd8432f869 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -709,6 +709,9 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 #endif
 #ifdef CONFIG_64BIT
 		[ilog2(VM_SEALED)] = "sl",
+#endif
+#ifdef CONFIG_NEED_VM_DROPPABLE
+		[ilog2(VM_DROPPABLE)]	= "dp",
 #endif
 	};
 	size_t i;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index eb7c96d24ac0..92454a0272ce 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -321,12 +321,14 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_HIGH_ARCH_BIT_3	35	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_4	36	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_5	37	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_6	38	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_0	BIT(VM_HIGH_ARCH_BIT_0)
 #define VM_HIGH_ARCH_1	BIT(VM_HIGH_ARCH_BIT_1)
 #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
 #define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
 #define VM_HIGH_ARCH_5	BIT(VM_HIGH_ARCH_BIT_5)
+#define VM_HIGH_ARCH_6	BIT(VM_HIGH_ARCH_BIT_6)
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
 #ifdef CONFIG_ARCH_HAS_PKEYS
@@ -357,6 +359,12 @@ extern unsigned int kobjsize(const void *objp);
 # define VM_SHADOW_STACK	VM_NONE
 #endif
 
+#ifdef CONFIG_NEED_VM_DROPPABLE
+# define VM_DROPPABLE		VM_HIGH_ARCH_6
+#else
+# define VM_DROPPABLE		VM_NONE
+#endif
+
 #if defined(CONFIG_X86)
 # define VM_PAT		VM_ARCH_1	/* PAT reserves whole VMA at once (x86) */
 #elif defined(CONFIG_PPC)
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index e46d6e82765e..fab7848df50a 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -165,6 +165,12 @@ IF_HAVE_PG_ARCH_X(arch_3)
 # define IF_HAVE_UFFD_MINOR(flag, name)
 #endif
 
+#ifdef CONFIG_NEED_VM_DROPPABLE
+# define IF_HAVE_VM_DROPPABLE(flag, name) {flag, name},
+#else
+# define IF_HAVE_VM_DROPPABLE(flag, name)
+#endif
+
 #define __def_vmaflag_names						\
 	{VM_READ,			"read"		},		\
 	{VM_WRITE,			"write"		},		\
@@ -197,6 +203,7 @@ IF_HAVE_VM_SOFTDIRTY(VM_SOFTDIRTY,	"softdirty"	)		\
 	{VM_MIXEDMAP,			"mixedmap"	},		\
 	{VM_HUGEPAGE,			"hugepage"	},		\
 	{VM_NOHUGEPAGE,			"nohugepage"	},		\
+IF_HAVE_VM_DROPPABLE(VM_DROPPABLE,	"droppable"	)		\
 	{VM_MERGEABLE,			"mergeable"	}		\
 
 #define show_vma_flags(flags)						\
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 6ce1f1ceb432..65a3069462a8 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -33,6 +33,10 @@
 #define MAP_UNINITIALIZED 0x4000000	/* For anonymous mmap, memory could be
 					 * uninitialized */
 
+#define MAP_WIPEONFORK		0x08000000	/* Zero memory in child forks. */
+#define MAP_DONTDUMP		0x10000000	/* Do not write to coredumps. */
+#define MAP_DROPPABLE		0x20000000	/* Zero memory under memory pressure. */
+
 /*
  * Flags for mlock
  */
diff --git a/mm/Kconfig b/mm/Kconfig
index b4cb45255a54..6cd65ea4b3ad 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1056,6 +1056,9 @@ config ARCH_USES_HIGH_VMA_FLAGS
 	bool
 config ARCH_HAS_PKEYS
 	bool
+config NEED_VM_DROPPABLE
+	select ARCH_USES_HIGH_VMA_FLAGS
+	bool
 
 config ARCH_USES_PG_ARCH_X
 	bool
diff --git a/mm/mmap.c b/mm/mmap.c
index 83b4682ec85c..e361f6750201 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1278,6 +1278,21 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 	vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
 			mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
 
+	if (flags & MAP_WIPEONFORK) {
+		/* MAP_WIPEONFORK is only supported on anonymous memory. */
+		if (file || !(flags & MAP_PRIVATE))
+			return -EINVAL;
+		vm_flags |= VM_WIPEONFORK;
+	}
+	if (flags & MAP_DONTDUMP)
+		vm_flags |= VM_DONTDUMP;
+	if (flags & MAP_DROPPABLE) {
+		/* MAP_DROPPABLE is only supported on anonymous memory. */
+		if (file || !(flags & MAP_PRIVATE))
+			return -EINVAL;
+		vm_flags |= VM_DROPPABLE;
+	}
+
 	/* Obtain the address to map to. we verify (or select) it and ensure
 	 * that it represents a valid section of the address space.
 	 */
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 8c6cd8825273..57b8dad9adcc 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -623,7 +623,7 @@ mprotect_fixup(struct vma_iterator *vmi, struct mmu_gather *tlb,
 				may_expand_vm(mm, oldflags, nrpages))
 			return -ENOMEM;
 		if (!(oldflags & (VM_ACCOUNT|VM_WRITE|VM_HUGETLB|
-						VM_SHARED|VM_NORESERVE))) {
+				  VM_SHARED|VM_NORESERVE|VM_DROPPABLE))) {
 			charged = nrpages;
 			if (security_vm_enough_memory_mm(mm, charged))
 				return -ENOMEM;
diff --git a/mm/rmap.c b/mm/rmap.c
index e8fc5ecb59b2..56d7535d5cf6 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1397,7 +1397,10 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
 	VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);
 	VM_BUG_ON_VMA(address < vma->vm_start ||
 			address + (nr << PAGE_SHIFT) > vma->vm_end, vma);
-	__folio_set_swapbacked(folio);
+	/* VM_DROPPABLE mappings don't swap; instead they're just dropped when
+	 * under memory pressure. */
+	if (!(vma->vm_flags & VM_DROPPABLE))
+		__folio_set_swapbacked(folio);
 	__folio_set_anon(folio, vma, address, true);
 
 	if (likely(!folio_test_large(folio))) {
@@ -1841,7 +1844,11 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				 * plus the rmap(s) (dropped by discard:).
 				 */
 				if (ref_count == 1 + map_count &&
-				    !folio_test_dirty(folio)) {
+				    (!folio_test_dirty(folio) ||
+				     /* Unlike MADV_FREE mappings, VM_DROPPABLE
+				      * ones can be dropped even if they've
+				      * been dirtied. */
+				     (vma->vm_flags & VM_DROPPABLE))) {
 					dec_mm_counter(mm, MM_ANONPAGES);
 					goto discard;
 				}
@@ -1851,7 +1858,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				 * discarded. Remap the page to page table.
 				 */
 				set_pte_at(mm, address, pvmw.pte, pteval);
-				folio_set_swapbacked(folio);
+				/* Unlike MADV_FREE mappings, VM_DROPPABLE ones
+				 * never get swap backed on failure to drop. */
+				if (!(vma->vm_flags & VM_DROPPABLE))
+					folio_set_swapbacked(folio);
 				ret = false;
 				page_vma_mapped_walk_done(&pvmw);
 				break;
diff --git a/tools/include/uapi/asm-generic/mman-common.h b/tools/include/uapi/asm-generic/mman-common.h
index 6ce1f1ceb432..65a3069462a8 100644
--- a/tools/include/uapi/asm-generic/mman-common.h
+++ b/tools/include/uapi/asm-generic/mman-common.h
@@ -33,6 +33,10 @@
 #define MAP_UNINITIALIZED 0x4000000	/* For anonymous mmap, memory could be
 					 * uninitialized */
 
+#define MAP_WIPEONFORK		0x08000000	/* Zero memory in child forks. */
+#define MAP_DONTDUMP		0x10000000	/* Do not write to coredumps. */
+#define MAP_DROPPABLE		0x20000000	/* Zero memory under memory pressure. */
+
 /*
  * Flags for mlock
  */
diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore
index 0b9ab987601c..a8beeb43c2b5 100644
--- a/tools/testing/selftests/mm/.gitignore
+++ b/tools/testing/selftests/mm/.gitignore
@@ -49,3 +49,4 @@ hugetlb_fault_after_madv
 hugetlb_madv_vs_map
 mseal_test
 seal_elf
+droppable
diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index 3b49bc3d0a3b..e3e5740e13e1 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -73,6 +73,7 @@ TEST_GEN_FILES += ksm_functional_tests
 TEST_GEN_FILES += mdwe_test
 TEST_GEN_FILES += hugetlb_fault_after_madv
 TEST_GEN_FILES += hugetlb_madv_vs_map
+TEST_GEN_FILES += droppable
 
 ifneq ($(ARCH),arm64)
 TEST_GEN_FILES += soft-dirty
diff --git a/tools/testing/selftests/mm/droppable.c b/tools/testing/selftests/mm/droppable.c
new file mode 100644
index 000000000000..846fb9aea4d1
--- /dev/null
+++ b/tools/testing/selftests/mm/droppable.c
@@ -0,0 +1,54 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2024 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ */
+
+#include <assert.h>
+#include <stdbool.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <signal.h>
+#include <sys/mman.h>
+#include <linux/mman.h>
+
+#include "../kselftest.h"
+
+int main(int argc, char *argv[])
+{
+	size_t alloc_size = 134217728;
+	size_t page_size = getpagesize();
+	void *alloc;
+	pid_t child;
+
+	ksft_print_header();
+	ksft_set_plan(1);
+
+	alloc = mmap(0, alloc_size, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE | MAP_DROPPABLE, -1, 0);
+	assert(alloc != MAP_FAILED);
+	memset(alloc, 'A', alloc_size);
+	for (size_t i = 0; i < alloc_size; i += page_size)
+		assert(*(uint8_t *)(alloc + i));
+
+	child = fork();
+	assert(child >= 0);
+	if (!child) {
+		for (;;)
+			memset(malloc(page_size), 'A', page_size);
+	}
+
+	for (bool done = false; !done;) {
+		for (size_t i = 0; i < alloc_size; i += page_size) {
+			if (!*(uint8_t *)(alloc + i)) {
+				done = true;
+				break;
+			}
+		}
+	}
+	kill(child, SIGTERM);
+
+	ksft_test_result_pass("VM_DROPPABLE: PASS\n");
+	exit(KSFT_PASS);
+}
-- 
2.45.2



* [PATCH v21 2/4] random: introduce generic vDSO getrandom() implementation
  2024-07-07  0:26 [PATCH v21 0/4] implement getrandom() in vDSO Jason A. Donenfeld
  2024-07-07  0:26 ` [PATCH v21 1/4] mm: add VM_DROPPABLE for designating always lazily freeable mappings Jason A. Donenfeld
@ 2024-07-07  0:26 ` Jason A. Donenfeld
  2024-07-07  0:26 ` [PATCH v21 3/4] x86: vdso: Wire up getrandom() vDSO implementation Jason A. Donenfeld
  2024-07-07  0:26 ` [PATCH v21 4/4] selftests/vDSO: add tests for vgetrandom Jason A. Donenfeld
  3 siblings, 0 replies; 28+ messages in thread
From: Jason A. Donenfeld @ 2024-07-07  0:26 UTC (permalink / raw)
  To: linux-kernel, patches, tglx
  Cc: Jason A. Donenfeld, linux-crypto, linux-api, x86, Linus Torvalds,
	Greg Kroah-Hartman, Adhemerval Zanella Netto, Carlos O'Donell,
	Florian Weimer, Arnd Bergmann, Jann Horn, Christian Brauner,
	David Hildenbrand

Provide a generic C vDSO getrandom() implementation, which operates on
an opaque state allocated by the caller via mmap(2) and produces random bytes
the same way as getrandom(). This has the following API signature:

  ssize_t vgetrandom(void *buffer, size_t len, unsigned int flags,
                     void *opaque_state, size_t opaque_len);

The return value and the first three arguments are the same as ordinary
getrandom(), while the last two arguments are a pointer to the opaque
allocated state and its size. Were all five arguments passed to the
getrandom() syscall, nothing different would happen, and the functions
would have the exact same behavior.

The actual vDSO RNG algorithm implemented is the same one implemented by
drivers/char/random.c, using the same fast-erasure techniques as that.
Should the in-kernel implementation change, so too will the vDSO one.

It requires an implementation of ChaCha20 that does not use any stack,
in order to maintain forward secrecy if a multi-threaded program forks
(though this does not account for a similar issue with SA_SIGINFO
copying registers to the stack), so this is left as an
architecture-specific fill-in. Stack-less ChaCha20 is an easy algorithm
to implement on a variety of architectures, so this shouldn't be too
onerous.

Initially, the state is keyless, and so the first call makes a
getrandom() syscall to generate that key, and then uses it for
subsequent calls. By keeping track of a generation counter, it knows
when its key is invalidated and it should fetch a new one using the
syscall. Later, more than just a generation counter might be used.
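
Condensed to its essence (the real code in lib/vdso/getrandom.c adds error
handling and memory ordering on top), that per-call check looks like:

  current_generation = READ_ONCE(rng_info->generation);
  if (state->generation != current_generation) {
          /* Write the generation before fetching the key, so that a fork
           * landing between the two operations is still detectable. */
          WRITE_ONCE(state->generation, current_generation);
          getrandom_syscall(state->key, sizeof(state->key), 0);
  }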

Since VM_WIPEONFORK is set on the opaque state, the key and related
state is wiped during a fork(), so secrets don't roll over into new
processes, and the same state doesn't accidentally generate the same
random stream. The generation counter, as well, is always >0, so that
the 0 counter is a useful indication of a fork() or otherwise
uninitialized state.

If the kernel RNG is not yet initialized, then the vDSO always calls the
syscall, because that behavior cannot be emulated in userspace, but
fortunately that state is short lived and only during early boot. If it
has been initialized, then there is no need to inspect the `flags`
argument, because the behavior does not change post-initialization
regardless of the `flags` value.

Since the opaque state passed to it is mutated, vDSO getrandom() is not
reentrant when used with the same opaque state, which libc should be
mindful of.

The function works over an opaque per-thread state of a particular size,
which must be marked VM_WIPEONFORK, VM_DONTDUMP, VM_NORESERVE, and
VM_DROPPABLE for proper operation. Over time, the nuances of these
allocations may change or grow or even differ based on architectural
features.

The opaque state passed to vDSO getrandom() must be allocated using the
mmap_flags and mmap_prot parameters provided by the vgetrandom_opaque_params
struct, which also contains the size of each state. Then, libc can call
mmap(2) and slice up the returned array into a state per each thread,
while ensuring that no single state straddles a page boundary. Libc is
expected to allocate a chunk of these on first use, and then dole them
out to threads as they're created, allocating more when needed.
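
As a hypothetical sketch of that slicing (pool and i are illustrative
names), the i-th state of a pool spanning whole pages can be located so
that no state crosses a page:

  size_t states_per_page = PAGE_SIZE / params.size_of_opaque_state;
  void *state = pool + (i / states_per_page) * PAGE_SIZE +
                       (i % states_per_page) * params.size_of_opaque_state;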

vDSO getrandom() provides the ability for userspace to generate random
bytes quickly and safely, and is intended to be integrated into libc's
thread management. As an illustrative example, the introduced code in
the vdso_test_getrandom self test later in this series might be used to
do the same outside of libc. In a libc the various pthread-isms are
expected to be elided into libc internals.

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
 MAINTAINERS                 |   2 +
 drivers/char/random.c       |  18 ++-
 include/uapi/linux/random.h |  15 +++
 include/vdso/datapage.h     |  11 ++
 include/vdso/getrandom.h    |  46 +++++++
 lib/vdso/Kconfig            |   6 +
 lib/vdso/getrandom.c        | 252 ++++++++++++++++++++++++++++++++++++
 7 files changed, 349 insertions(+), 1 deletion(-)
 create mode 100644 include/vdso/getrandom.h
 create mode 100644 lib/vdso/getrandom.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 3c4fdf74a3f9..798158329ad8 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -18747,6 +18747,8 @@ T:	git https://git.kernel.org/pub/scm/linux/kernel/git/crng/random.git
 F:	Documentation/devicetree/bindings/rng/microsoft,vmgenid.yaml
 F:	drivers/char/random.c
 F:	drivers/virt/vmgenid.c
+F:	include/vdso/getrandom.h
+F:	lib/vdso/getrandom.c
 
 RAPIDIO SUBSYSTEM
 M:	Matt Porter <mporter@kernel.crashing.org>
diff --git a/drivers/char/random.c b/drivers/char/random.c
index 2597cb43f438..b02a12436750 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1,6 +1,6 @@
 // SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
 /*
- * Copyright (C) 2017-2022 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ * Copyright (C) 2017-2024 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
  * Copyright Matt Mackall <mpm@selenic.com>, 2003, 2004, 2005
  * Copyright Theodore Ts'o, 1994, 1995, 1996, 1997, 1998, 1999. All rights reserved.
  *
@@ -56,6 +56,10 @@
 #include <linux/sched/isolation.h>
 #include <crypto/chacha.h>
 #include <crypto/blake2s.h>
+#ifdef CONFIG_VDSO_GETRANDOM
+#include <vdso/getrandom.h>
+#include <vdso/datapage.h>
+#endif
 #include <asm/archrandom.h>
 #include <asm/processor.h>
 #include <asm/irq.h>
@@ -271,6 +275,15 @@ static void crng_reseed(struct work_struct *work)
 	if (next_gen == ULONG_MAX)
 		++next_gen;
 	WRITE_ONCE(base_crng.generation, next_gen);
+#ifdef CONFIG_VDSO_GETRANDOM
+	/* base_crng.generation's invalid value is ULONG_MAX, while
+	 * _vdso_rng_data.generation's invalid value is 0, so add one to the
+	 * former to arrive at the latter. Use smp_store_release so that this
+	 * is ordered with the write above to base_crng.generation. Pairs with
+	 * the smp_rmb() before the syscall in the vDSO code.
+	 */
+	smp_store_release(&_vdso_rng_data.generation, next_gen + 1);
+#endif
 	if (!static_branch_likely(&crng_is_ready))
 		crng_init = CRNG_READY;
 	spin_unlock_irqrestore(&base_crng.lock, flags);
@@ -721,6 +734,9 @@ static void __cold _credit_init_bits(size_t bits)
 		if (static_key_initialized && system_unbound_wq)
 			queue_work(system_unbound_wq, &set_ready);
 		atomic_notifier_call_chain(&random_ready_notifier, 0, NULL);
+#ifdef CONFIG_VDSO_GETRANDOM
+		WRITE_ONCE(_vdso_rng_data.is_ready, true);
+#endif
 		wake_up_interruptible(&crng_init_wait);
 		kill_fasync(&fasync, SIGIO, POLL_IN);
 		pr_notice("crng init done\n");
diff --git a/include/uapi/linux/random.h b/include/uapi/linux/random.h
index e744c23582eb..2a3fe4c2cdc9 100644
--- a/include/uapi/linux/random.h
+++ b/include/uapi/linux/random.h
@@ -55,4 +55,19 @@ struct rand_pool_info {
 #define GRND_RANDOM	0x0002
 #define GRND_INSECURE	0x0004
 
+/**
+ * struct vgetrandom_opaque_params - arguments for allocating memory for vgetrandom
+ *
+ * @size_of_opaque_state:	Size of each state that is to be passed to vgetrandom().
+ * @mmap_prot:			Value of the prot argument in mmap(2).
+ * @mmap_flags:			Value of the flags argument in mmap(2).
+ * @reserved:			Reserved for future use.
+ */
+struct vgetrandom_opaque_params {
+	__u32 size_of_opaque_state;
+	__u32 mmap_prot;
+	__u32 mmap_flags;
+	__u32 reserved[13];
+};
+
 #endif /* _UAPI_LINUX_RANDOM_H */
diff --git a/include/vdso/datapage.h b/include/vdso/datapage.h
index d04d394db064..05e5787beb73 100644
--- a/include/vdso/datapage.h
+++ b/include/vdso/datapage.h
@@ -113,6 +113,16 @@ struct vdso_data {
 	struct arch_vdso_data	arch_data;
 };
 
+/**
+ * struct vdso_rng_data - vdso RNG state information
+ * @generation:	counter representing the number of RNG reseeds
+ * @is_ready:	boolean signaling whether the RNG is initialized
+ */
+struct vdso_rng_data {
+	u64	generation;
+	u8	is_ready;
+};
+
 /*
  * We use the hidden visibility to prevent the compiler from generating a GOT
  * relocation. Not only is going through a GOT useless (the entry couldn't and
@@ -124,6 +134,7 @@ struct vdso_data {
  */
 extern struct vdso_data _vdso_data[CS_BASES] __attribute__((visibility("hidden")));
 extern struct vdso_data _timens_data[CS_BASES] __attribute__((visibility("hidden")));
+extern struct vdso_rng_data _vdso_rng_data __attribute__((visibility("hidden")));
 
 /**
  * union vdso_data_store - Generic vDSO data page
diff --git a/include/vdso/getrandom.h b/include/vdso/getrandom.h
new file mode 100644
index 000000000000..a8b7c14b0ae0
--- /dev/null
+++ b/include/vdso/getrandom.h
@@ -0,0 +1,46 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2022-2024 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ */
+
+#ifndef _VDSO_GETRANDOM_H
+#define _VDSO_GETRANDOM_H
+
+#include <linux/types.h>
+
+#define CHACHA_KEY_SIZE         32
+#define CHACHA_BLOCK_SIZE       64
+
+/**
+ * struct vgetrandom_state - State used by vDSO getrandom().
+ *
+ * @batch:	One and a half ChaCha20 blocks of buffered RNG output.
+ *
+ * @key:	Key to be used for generating next batch.
+ *
+ * @batch_key:	Union of the prior two members, which is exactly two full
+ * 		ChaCha20 blocks in size, so that @batch and @key can be filled
+ * 		together.
+ *
+ * @generation:	Snapshot of @rng_info->generation in the vDSO data page at
+ *		the time @key was generated.
+ *
+ * @pos:	Offset into @batch of the next available random byte.
+ *
+ * @in_use:	Reentrancy guard for reusing a state within the same thread
+ *		due to signal handlers.
+ */
+struct vgetrandom_state {
+	union {
+		struct {
+			u8	batch[CHACHA_BLOCK_SIZE * 3 / 2];
+			u32	key[CHACHA_KEY_SIZE / sizeof(u32)];
+		};
+		u8		batch_key[CHACHA_BLOCK_SIZE * 2];
+	};
+	u64			generation;
+	u8			pos;
+	bool 			in_use;
+};
+
+#endif /* _VDSO_GETRANDOM_H */
diff --git a/lib/vdso/Kconfig b/lib/vdso/Kconfig
index c46c2300517c..99661b731834 100644
--- a/lib/vdso/Kconfig
+++ b/lib/vdso/Kconfig
@@ -38,3 +38,9 @@ config GENERIC_VDSO_OVERFLOW_PROTECT
 	  in the hotpath.
 
 endif
+
+config VDSO_GETRANDOM
+	bool
+	select NEED_VM_DROPPABLE
+	help
+	  Selected by architectures that support vDSO getrandom().
diff --git a/lib/vdso/getrandom.c b/lib/vdso/getrandom.c
new file mode 100644
index 000000000000..663610831969
--- /dev/null
+++ b/lib/vdso/getrandom.c
@@ -0,0 +1,252 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2022-2024 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ */
+
+#include <linux/cache.h>
+#include <linux/kernel.h>
+#include <linux/time64.h>
+#include <vdso/datapage.h>
+#include <vdso/getrandom.h>
+#include <asm/vdso/getrandom.h>
+#include <asm/vdso/vsyscall.h>
+#include <asm/unaligned.h>
+#include <uapi/linux/mman.h>
+
+#define MEMCPY_AND_ZERO_SRC(type, dst, src, len) do {				\
+	while (len >= sizeof(type)) {						\
+		__put_unaligned_t(type, __get_unaligned_t(type, src), dst);	\
+		__put_unaligned_t(type, 0, src);				\
+		dst += sizeof(type);						\
+		src += sizeof(type);						\
+		len -= sizeof(type);						\
+	}									\
+} while (0)
+
+static void memcpy_and_zero_src(void *dst, void *src, size_t len)
+{
+	if (IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)) {
+		if (IS_ENABLED(CONFIG_64BIT))
+			MEMCPY_AND_ZERO_SRC(u64, dst, src, len);
+		MEMCPY_AND_ZERO_SRC(u32, dst, src, len);
+		MEMCPY_AND_ZERO_SRC(u16, dst, src, len);
+	}
+	MEMCPY_AND_ZERO_SRC(u8, dst, src, len);
+}
+
+/**
+ * __cvdso_getrandom_data - Generic vDSO implementation of getrandom() syscall.
+ * @rng_info:		Describes state of kernel RNG, memory shared with kernel.
+ * @buffer:		Destination buffer to fill with random bytes.
+ * @len:		Size of @buffer in bytes.
+ * @flags:		Zero or more GRND_* flags.
+ * @opaque_state:	Pointer to an opaque state area.
+ * @opaque_len:		Length of opaque state area.
+ *
+ * This implements a "fast key erasure" RNG using ChaCha20, in the same way that the kernel's
+ * getrandom() syscall does. It periodically reseeds its key from the kernel's RNG, at the same
+ * schedule that the kernel's RNG is reseeded. If the kernel's RNG is not ready, then this always
+ * calls into the syscall.
+ *
+ * If @buffer, @len, and @flags are 0, and @opaque_len is -1, then @opaque_state is populated
+ * with a struct vgetrandom_opaque_params and the function returns 0.
+ *
+ * @opaque_state *must* be allocated by calling mmap(2) using the mmap_prot and mmap_flags fields
+ * from the struct vgetrandom_opaque_params, and states must not straddle pages. Unless external
+ * locking is used, one state must be allocated per thread, as it is not safe to call this function
+ * concurrently with the same @opaque_state. However, it is safe to call this using the same
+ * @opaque_state that is shared between main code and signal handling code, within the same thread.
+ *
+ * Returns:	The number of random bytes written to @buffer, or a negative value indicating an error.
+ */
+static __always_inline ssize_t
+__cvdso_getrandom_data(const struct vdso_rng_data *rng_info, void *buffer, size_t len,
+		       unsigned int flags, void *opaque_state, size_t opaque_len)
+{
+	ssize_t ret = min_t(size_t, INT_MAX & PAGE_MASK /* = MAX_RW_COUNT */, len);
+	struct vgetrandom_state *state = opaque_state;
+	size_t batch_len, nblocks, orig_len = len;
+	bool in_use, have_retried = false;
+	unsigned long current_generation;
+	void *orig_buffer = buffer;
+	u32 counter[2] = { 0 };
+
+	if (unlikely(opaque_len == ~0UL && !buffer && !len && !flags)) {
+		*(struct vgetrandom_opaque_params *)opaque_state = (struct vgetrandom_opaque_params) {
+			.size_of_opaque_state = sizeof(*state),
+			.mmap_prot = PROT_READ | PROT_WRITE,
+			.mmap_flags = MAP_PRIVATE | MAP_ANONYMOUS |
+				      MAP_DROPPABLE | MAP_NORESERVE |
+				      MAP_WIPEONFORK | MAP_DONTDUMP
+		};
+		return 0;
+	}
+
+	/* The state must not straddle a page, since pages can be zeroed at any time. */
+	if (unlikely(((unsigned long)opaque_state & ~PAGE_MASK) + sizeof(*state) > PAGE_SIZE))
+		return -EFAULT;
+
+	/* If the caller passes the wrong size, which might happen due to CRIU, fallback. */
+	if (unlikely(opaque_len != sizeof(*state)))
+		goto fallback_syscall;
+
+	/*
+	 * If the kernel's RNG is not yet ready, then it's not possible to provide random bytes from
+	 * userspace, because A) the various @flags require this to block, or not, depending on
+	 * various factors unavailable to userspace, and B) the kernel's behavior before the RNG is
+	 * ready is to reseed from the entropy pool at every invocation.
+	 */
+	if (unlikely(!READ_ONCE(rng_info->is_ready)))
+		goto fallback_syscall;
+
+	/*
+	 * This condition is checked after @rng_info->is_ready, because before the kernel's RNG is
+	 * initialized, the @flags parameter may require this to block or return an error, even when
+	 * len is zero.
+	 */
+	if (unlikely(!len))
+		return 0;
+
+	/*
+	 * @state->in_use is basic reentrancy protection against this running in a signal handler
+	 * with the same @opaque_state, but obviously not atomic wrt multiple CPUs or more than one
+	 * level of reentrancy. If a signal interrupts this after reading @state->in_use, but before
+	 * writing @state->in_use, there is still no race, because the signal handler will run to
+	 * its completion before returning execution.
+	 */
+	in_use = READ_ONCE(state->in_use);
+	if (unlikely(in_use))
+		/* The syscall simply fills the buffer and does not touch @state, so fallback. */
+		goto fallback_syscall;
+	WRITE_ONCE(state->in_use, true);
+
+retry_generation:
+	/*
+	 * @rng_info->generation must always be read here, as it serializes @state->key with the
+	 * kernel's RNG reseeding schedule.
+	 */
+	current_generation = READ_ONCE(rng_info->generation);
+
+	/*
+	 * If @state->generation doesn't match the kernel RNG's generation, then it means the
+	 * kernel's RNG has reseeded, and so @state->key is reseeded as well.
+	 */
+	if (unlikely(state->generation != current_generation)) {
+		/*
+		 * Write the generation before filling the key, in case of fork. If there is a fork
+		 * just after this line, the parent and child will get different random bytes from
+		 * the syscall, which is good. However, were this line to occur after the getrandom
+		 * syscall, then both child and parent could have the same bytes and the same
+		 * generation counter, so the fork would not be detected. Therefore, write
+		 * @state->generation before the call to the getrandom syscall.
+		 */
+		WRITE_ONCE(state->generation, current_generation);
+
+		/*
+		 * Prevent the syscall from being reordered wrt current_generation. Pairs with the
+		 * smp_store_release(&_vdso_rng_data.generation) in random.c.
+		 */
+		smp_rmb();
+
+		/* Reseed @state->key using fresh bytes from the kernel. */
+		if (getrandom_syscall(state->key, sizeof(state->key), 0) != sizeof(state->key)) {
+			/*
+			 * If the syscall failed to refresh the key, then @state->key is now
+			 * invalid, so invalidate the generation so that it is not used again, and
+			 * fallback to using the syscall entirely.
+			 */
+			WRITE_ONCE(state->generation, 0);
+
+			/*
+			 * Set @state->in_use to false only after the last write to @state in the
+			 * line above.
+			 */
+			WRITE_ONCE(state->in_use, false);
+
+			goto fallback_syscall;
+		}
+
+		/*
+		 * Set @state->pos to beyond the end of the batch, so that the batch is refilled
+		 * using the new key.
+		 */
+		state->pos = sizeof(state->batch);
+	}
+
+	/* Set len to the total amount of bytes that this function is allowed to read, ret. */
+	len = ret;
+more_batch:
+	/*
+	 * First use bytes out of @state->batch, which may have been filled by the last call to this
+	 * function.
+	 */
+	batch_len = min_t(size_t, sizeof(state->batch) - state->pos, len);
+	if (batch_len) {
+		/* Zeroing at the same time as memcpying helps preserve forward secrecy. */
+		memcpy_and_zero_src(buffer, state->batch + state->pos, batch_len);
+		state->pos += batch_len;
+		buffer += batch_len;
+		len -= batch_len;
+	}
+
+	if (!len) {
+		/* Prevent the loop from being reordered wrt ->generation. */
+		barrier();
+
+		/*
+		 * Since @rng_info->generation will never be 0, re-read @state->generation, rather
+		 * than using the local current_generation variable, to learn whether a fork
+		 * occurred or if @state was zeroed due to memory pressure. Primarily, though, this
+		 * indicates whether the kernel's RNG has reseeded, in which case generate a new key
+		 * and start over.
+		 */
+		if (unlikely(READ_ONCE(state->generation) != READ_ONCE(rng_info->generation))) {
+			/*
+			 * Prevent this from looping forever in case of low memory or racing with a
+			 * user force-reseeding the kernel's RNG using the ioctl.
+			 */
+			if (have_retried) {
+				WRITE_ONCE(state->in_use, false);
+				goto fallback_syscall;
+			}
+
+			have_retried = true;
+			buffer = orig_buffer;
+			goto retry_generation;
+		}
+
+		/*
+		 * Set @state->in_use to false only when there will be no more reads or writes of
+		 * @state.
+		 */
+		WRITE_ONCE(state->in_use, false);
+		return ret;
+	}
+
+	/* Generate blocks of RNG output directly into @buffer while there's enough room left. */
+	nblocks = len / CHACHA_BLOCK_SIZE;
+	if (nblocks) {
+		__arch_chacha20_blocks_nostack(buffer, state->key, counter, nblocks);
+		buffer += nblocks * CHACHA_BLOCK_SIZE;
+		len -= nblocks * CHACHA_BLOCK_SIZE;
+	}
+
+	BUILD_BUG_ON(sizeof(state->batch_key) % CHACHA_BLOCK_SIZE != 0);
+
+	/* Refill the batch and overwrite the key, in order to preserve forward secrecy. */
+	__arch_chacha20_blocks_nostack(state->batch_key, state->key, counter,
+				       sizeof(state->batch_key) / CHACHA_BLOCK_SIZE);
+
+	/* Since the batch was just refilled, set the position back to 0 to indicate a full batch. */
+	state->pos = 0;
+	goto more_batch;
+
+fallback_syscall:
+	return getrandom_syscall(orig_buffer, orig_len, flags);
+}
+
+static __always_inline ssize_t
+__cvdso_getrandom(void *buffer, size_t len, unsigned int flags, void *opaque_state, size_t opaque_len)
+{
+	return __cvdso_getrandom_data(__arch_get_vdso_rng_data(), buffer, len, flags, opaque_state, opaque_len);
+}
-- 
2.45.2



* [PATCH v21 3/4] x86: vdso: Wire up getrandom() vDSO implementation
  2024-07-07  0:26 [PATCH v21 0/4] implement getrandom() in vDSO Jason A. Donenfeld
  2024-07-07  0:26 ` [PATCH v21 1/4] mm: add VM_DROPPABLE for designating always lazily freeable mappings Jason A. Donenfeld
  2024-07-07  0:26 ` [PATCH v21 2/4] random: introduce generic vDSO getrandom() implementation Jason A. Donenfeld
@ 2024-07-07  0:26 ` Jason A. Donenfeld
  2024-07-07  0:26 ` [PATCH v21 4/4] selftests/vDSO: add tests for vgetrandom Jason A. Donenfeld
  3 siblings, 0 replies; 28+ messages in thread
From: Jason A. Donenfeld @ 2024-07-07  0:26 UTC (permalink / raw)
  To: linux-kernel, patches, tglx
  Cc: Jason A. Donenfeld, linux-crypto, linux-api, x86, Linus Torvalds,
	Greg Kroah-Hartman, Adhemerval Zanella Netto, Carlos O'Donell,
	Florian Weimer, Arnd Bergmann, Jann Horn, Christian Brauner,
	David Hildenbrand, Samuel Neves

Hook up the generic vDSO implementation to the x86 vDSO data page. Since
the existing vDSO infrastructure is heavily based on the timekeeping
functionality, which works over arrays of bases, a new macro is
introduced for vvars that are not arrays.

The vDSO function requires a ChaCha20 implementation that does not write
to the stack, yet can still do an entire ChaCha20 permutation, so
provide this using SSE2, since this is userland code that must work on
all x86-64 processors.
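
The generic code reaches the assembly through a prototype along these
lines, matching the register convention documented in the assembly
(output, key, counter, block count):

  void __arch_chacha20_blocks_nostack(u8 *dst_bytes, const u32 *key,
                                      u32 *counter, size_t nblocks);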

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Samuel Neves <sneves@dei.uc.pt> # for vgetrandom-chacha.S
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
 MAINTAINERS                             |   2 +
 arch/x86/Kconfig                        |   1 +
 arch/x86/entry/vdso/Makefile            |   3 +-
 arch/x86/entry/vdso/vdso.lds.S          |   2 +
 arch/x86/entry/vdso/vgetrandom-chacha.S | 178 ++++++++++++++++++++++++
 arch/x86/entry/vdso/vgetrandom.c        |  17 +++
 arch/x86/include/asm/vdso/getrandom.h   |  55 ++++++++
 arch/x86/include/asm/vdso/vsyscall.h    |   2 +
 arch/x86/include/asm/vvar.h             |  16 +++
 9 files changed, 275 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/entry/vdso/vgetrandom-chacha.S
 create mode 100644 arch/x86/entry/vdso/vgetrandom.c
 create mode 100644 arch/x86/include/asm/vdso/getrandom.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 798158329ad8..00cf0362482b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -18749,6 +18749,8 @@ F:	drivers/char/random.c
 F:	drivers/virt/vmgenid.c
 F:	include/vdso/getrandom.h
 F:	lib/vdso/getrandom.c
+F:	arch/x86/entry/vdso/vgetrandom*
+F:	arch/x86/include/asm/vdso/getrandom*
 
 RAPIDIO SUBSYSTEM
 M:	Matt Porter <mporter@kernel.crashing.org>
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 1d7122a1883e..9c98b7a88cc2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -287,6 +287,7 @@ config X86
 	select HAVE_UNSTABLE_SCHED_CLOCK
 	select HAVE_USER_RETURN_NOTIFIER
 	select HAVE_GENERIC_VDSO
+	select VDSO_GETRANDOM			if X86_64
 	select HOTPLUG_PARALLEL			if SMP && X86_64
 	select HOTPLUG_SMT			if SMP
 	select HOTPLUG_SPLIT_STARTUP		if SMP && X86_32
diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile
index 215a1b202a91..c9216ac4fb1e 100644
--- a/arch/x86/entry/vdso/Makefile
+++ b/arch/x86/entry/vdso/Makefile
@@ -7,7 +7,7 @@
 include $(srctree)/lib/vdso/Makefile
 
 # Files to link into the vDSO:
-vobjs-y := vdso-note.o vclock_gettime.o vgetcpu.o
+vobjs-y := vdso-note.o vclock_gettime.o vgetcpu.o vgetrandom.o vgetrandom-chacha.o
 vobjs32-y := vdso32/note.o vdso32/system_call.o vdso32/sigreturn.o
 vobjs32-y += vdso32/vclock_gettime.o vdso32/vgetcpu.o
 vobjs-$(CONFIG_X86_SGX)	+= vsgx.o
@@ -73,6 +73,7 @@ CFLAGS_REMOVE_vdso32/vclock_gettime.o = -pg
 CFLAGS_REMOVE_vgetcpu.o = -pg
 CFLAGS_REMOVE_vdso32/vgetcpu.o = -pg
 CFLAGS_REMOVE_vsgx.o = -pg
+CFLAGS_REMOVE_vgetrandom.o = -pg
 
 #
 # X32 processes use x32 vDSO to access 64bit kernel data.
diff --git a/arch/x86/entry/vdso/vdso.lds.S b/arch/x86/entry/vdso/vdso.lds.S
index e8c60ae7a7c8..0bab5f4af6d1 100644
--- a/arch/x86/entry/vdso/vdso.lds.S
+++ b/arch/x86/entry/vdso/vdso.lds.S
@@ -30,6 +30,8 @@ VERSION {
 #ifdef CONFIG_X86_SGX
 		__vdso_sgx_enter_enclave;
 #endif
+		getrandom;
+		__vdso_getrandom;
 	local: *;
 	};
 }
diff --git a/arch/x86/entry/vdso/vgetrandom-chacha.S b/arch/x86/entry/vdso/vgetrandom-chacha.S
new file mode 100644
index 000000000000..bcba5639b8ee
--- /dev/null
+++ b/arch/x86/entry/vdso/vgetrandom-chacha.S
@@ -0,0 +1,178 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2022-2024 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ */
+
+#include <linux/linkage.h>
+#include <asm/frame.h>
+
+.section	.rodata, "a"
+.align 16
+CONSTANTS:	.octa 0x6b20657479622d323320646e61707865
+.text
+
+/*
+ * Very basic SSE2 implementation of ChaCha20. Produces a given positive number
+ * of blocks of output with a nonce of 0, taking an input key and 8-byte
+ * counter. Importantly does not spill to the stack. Its arguments are:
+ *
+ *	rdi: output bytes
+ *	rsi: 32-byte key input
+ *	rdx: 8-byte counter input/output
+ *	rcx: number of 64-byte blocks to write to output
+ */
+SYM_FUNC_START(__arch_chacha20_blocks_nostack)
+
+.set	output,		%rdi
+.set	key,		%rsi
+.set	counter,	%rdx
+.set	nblocks,	%rcx
+.set	i,		%al
+/* xmm registers are *not* callee-save. */
+.set	temp,		%xmm0
+.set	state0,		%xmm1
+.set	state1,		%xmm2
+.set	state2,		%xmm3
+.set	state3,		%xmm4
+.set	copy0,		%xmm5
+.set	copy1,		%xmm6
+.set	copy2,		%xmm7
+.set	copy3,		%xmm8
+.set	one,		%xmm9
+
+	/* copy0 = "expand 32-byte k" */
+	movaps		CONSTANTS(%rip),copy0
+	/* copy1,copy2 = key */
+	movups		0x00(key),copy1
+	movups		0x10(key),copy2
+	/* copy3 = counter || zero nonce */
+	movq		0x00(counter),copy3
+	/* one = 1 || 0 */
+	movq		$1,%rax
+	movq		%rax,one
+
+.Lblock:
+	/* state0,state1,state2,state3 = copy0,copy1,copy2,copy3 */
+	movdqa		copy0,state0
+	movdqa		copy1,state1
+	movdqa		copy2,state2
+	movdqa		copy3,state3
+
+	movb		$10,i
+.Lpermute:
+	/* state0 += state1, state3 = rotl32(state3 ^ state0, 16) */
+	paddd		state1,state0
+	pxor		state0,state3
+	movdqa		state3,temp
+	pslld		$16,temp
+	psrld		$16,state3
+	por		temp,state3
+
+	/* state2 += state3, state1 = rotl32(state1 ^ state2, 12) */
+	paddd		state3,state2
+	pxor		state2,state1
+	movdqa		state1,temp
+	pslld		$12,temp
+	psrld		$20,state1
+	por		temp,state1
+
+	/* state0 += state1, state3 = rotl32(state3 ^ state0, 8) */
+	paddd		state1,state0
+	pxor		state0,state3
+	movdqa		state3,temp
+	pslld		$8,temp
+	psrld		$24,state3
+	por		temp,state3
+
+	/* state2 += state3, state1 = rotl32(state1 ^ state2, 7) */
+	paddd		state3,state2
+	pxor		state2,state1
+	movdqa		state1,temp
+	pslld		$7,temp
+	psrld		$25,state1
+	por		temp,state1
+
+	/* state1[0,1,2,3] = state1[1,2,3,0] */
+	pshufd		$0x39,state1,state1
+	/* state2[0,1,2,3] = state2[2,3,0,1] */
+	pshufd		$0x4e,state2,state2
+	/* state3[0,1,2,3] = state3[3,0,1,2] */
+	pshufd		$0x93,state3,state3
+
+	/* state0 += state1, state3 = rotl32(state3 ^ state0, 16) */
+	paddd		state1,state0
+	pxor		state0,state3
+	movdqa		state3,temp
+	pslld		$16,temp
+	psrld		$16,state3
+	por		temp,state3
+
+	/* state2 += state3, state1 = rotl32(state1 ^ state2, 12) */
+	paddd		state3,state2
+	pxor		state2,state1
+	movdqa		state1,temp
+	pslld		$12,temp
+	psrld		$20,state1
+	por		temp,state1
+
+	/* state0 += state1, state3 = rotl32(state3 ^ state0, 8) */
+	paddd		state1,state0
+	pxor		state0,state3
+	movdqa		state3,temp
+	pslld		$8,temp
+	psrld		$24,state3
+	por		temp,state3
+
+	/* state2 += state3, state1 = rotl32(state1 ^ state2, 7) */
+	paddd		state3,state2
+	pxor		state2,state1
+	movdqa		state1,temp
+	pslld		$7,temp
+	psrld		$25,state1
+	por		temp,state1
+
+	/* state1[0,1,2,3] = state1[3,0,1,2] */
+	pshufd		$0x93,state1,state1
+	/* state2[0,1,2,3] = state2[2,3,0,1] */
+	pshufd		$0x4e,state2,state2
+	/* state3[0,1,2,3] = state3[1,2,3,0] */
+	pshufd		$0x39,state3,state3
+
+	decb		i
+	jnz		.Lpermute
+
+	/* output0 = state0 + copy0 */
+	paddd		copy0,state0
+	movups		state0,0x00(output)
+	/* output1 = state1 + copy1 */
+	paddd		copy1,state1
+	movups		state1,0x10(output)
+	/* output2 = state2 + copy2 */
+	paddd		copy2,state2
+	movups		state2,0x20(output)
+	/* output3 = state3 + copy3 */
+	paddd		copy3,state3
+	movups		state3,0x30(output)
+
+	/* ++copy3.counter */
+	paddq		one,copy3
+
+	/* output += 64, --nblocks */
+	addq		$64,output
+	decq		nblocks
+	jnz		.Lblock
+
+	/* counter = copy3.counter */
+	movq		copy3,0x00(counter)
+
+	/* Zero out the potentially sensitive regs, in case nothing uses these again. */
+	pxor		state0,state0
+	pxor		state1,state1
+	pxor		state2,state2
+	pxor		state3,state3
+	pxor		copy1,copy1
+	pxor		copy2,copy2
+	pxor		temp,temp
+
+	ret
+SYM_FUNC_END(__arch_chacha20_blocks_nostack)
diff --git a/arch/x86/entry/vdso/vgetrandom.c b/arch/x86/entry/vdso/vgetrandom.c
new file mode 100644
index 000000000000..52d3c7faae2e
--- /dev/null
+++ b/arch/x86/entry/vdso/vgetrandom.c
@@ -0,0 +1,17 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2022-2024 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ */
+#include <linux/types.h>
+
+#include "../../../../lib/vdso/getrandom.c"
+
+ssize_t __vdso_getrandom(void *buffer, size_t len, unsigned int flags, void *opaque_state, size_t opaque_len);
+
+ssize_t __vdso_getrandom(void *buffer, size_t len, unsigned int flags, void *opaque_state, size_t opaque_len)
+{
+	return __cvdso_getrandom(buffer, len, flags, opaque_state, opaque_len);
+}
+
+ssize_t getrandom(void *, size_t, unsigned int, void *, size_t)
+	__attribute__((weak, alias("__vdso_getrandom")));
diff --git a/arch/x86/include/asm/vdso/getrandom.h b/arch/x86/include/asm/vdso/getrandom.h
new file mode 100644
index 000000000000..b96e674cafde
--- /dev/null
+++ b/arch/x86/include/asm/vdso/getrandom.h
@@ -0,0 +1,55 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2022-2024 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ */
+#ifndef __ASM_VDSO_GETRANDOM_H
+#define __ASM_VDSO_GETRANDOM_H
+
+#ifndef __ASSEMBLY__
+
+#include <asm/unistd.h>
+#include <asm/vvar.h>
+
+/**
+ * getrandom_syscall - Invoke the getrandom() syscall.
+ * @buffer:	Destination buffer to fill with random bytes.
+ * @len:	Size of @buffer in bytes.
+ * @flags:	Zero or more GRND_* flags.
+ * Returns:	The number of random bytes written to @buffer, or a negative value indicating an error.
+ */
+static __always_inline ssize_t getrandom_syscall(void *buffer, size_t len, unsigned int flags)
+{
+	long ret;
+
+	asm ("syscall" : "=a" (ret) :
+	     "0" (__NR_getrandom), "D" (buffer), "S" (len), "d" (flags) :
+	     "rcx", "r11", "memory");
+
+	return ret;
+}
+
+#define __vdso_rng_data (VVAR(_vdso_rng_data))
+
+static __always_inline const struct vdso_rng_data *__arch_get_vdso_rng_data(void)
+{
+	if (IS_ENABLED(CONFIG_TIME_NS) && __vdso_data->clock_mode == VDSO_CLOCKMODE_TIMENS)
+		return (void *)&__vdso_rng_data + ((void *)&__timens_vdso_data - (void *)&__vdso_data);
+	return &__vdso_rng_data;
+}
+
+/**
+ * __arch_chacha20_blocks_nostack - Generate ChaCha20 stream without using the stack.
+ * @dst_bytes:	Destination buffer to hold @nblocks * 64 bytes of output.
+ * @key:	32-byte input key.
+ * @counter:	8-byte counter, read on input and updated on return.
+ * @nblocks:	Number of blocks to generate.
+ *
+ * Generates a given positive number of blocks of ChaCha20 output with nonce=0, and does not write
+ * to any stack or memory outside of the parameters passed to it, in order to mitigate stack data
+ * leaking into forked child processes.
+ */
+extern void __arch_chacha20_blocks_nostack(u8 *dst_bytes, const u32 *key, u32 *counter, size_t nblocks);
+
+#endif /* !__ASSEMBLY__ */
+
+#endif /* __ASM_VDSO_GETRANDOM_H */
diff --git a/arch/x86/include/asm/vdso/vsyscall.h b/arch/x86/include/asm/vdso/vsyscall.h
index be199a9b2676..71c56586a22f 100644
--- a/arch/x86/include/asm/vdso/vsyscall.h
+++ b/arch/x86/include/asm/vdso/vsyscall.h
@@ -11,6 +11,8 @@
 #include <asm/vvar.h>
 
 DEFINE_VVAR(struct vdso_data, _vdso_data);
+DEFINE_VVAR_SINGLE(struct vdso_rng_data, _vdso_rng_data);
+
 /*
  * Update the vDSO data page to keep in sync with kernel timekeeping.
  */
diff --git a/arch/x86/include/asm/vvar.h b/arch/x86/include/asm/vvar.h
index 183e98e49ab9..9d9af37f7cab 100644
--- a/arch/x86/include/asm/vvar.h
+++ b/arch/x86/include/asm/vvar.h
@@ -26,6 +26,8 @@
  */
 #define DECLARE_VVAR(offset, type, name) \
 	EMIT_VVAR(name, offset)
+#define DECLARE_VVAR_SINGLE(offset, type, name) \
+	EMIT_VVAR(name, offset)
 
 #else
 
@@ -37,6 +39,10 @@ extern char __vvar_page;
 	extern type timens_ ## name[CS_BASES]				\
 	__attribute__((visibility("hidden")));				\
 
+#define DECLARE_VVAR_SINGLE(offset, type, name)				\
+	extern type vvar_ ## name					\
+	__attribute__((visibility("hidden")));				\
+
 #define VVAR(name) (vvar_ ## name)
 #define TIMENS(name) (timens_ ## name)
 
@@ -44,12 +50,22 @@ extern char __vvar_page;
 	type name[CS_BASES]						\
 	__attribute__((section(".vvar_" #name), aligned(16))) __visible
 
+#define DEFINE_VVAR_SINGLE(type, name)					\
+	type name							\
+	__attribute__((section(".vvar_" #name), aligned(16))) __visible
+
 #endif
 
 /* DECLARE_VVAR(offset, type, name) */
 
 DECLARE_VVAR(128, struct vdso_data, _vdso_data)
 
+#if !defined(_SINGLE_DATA)
+#define _SINGLE_DATA
+DECLARE_VVAR_SINGLE(640, struct vdso_rng_data, _vdso_rng_data)
+#endif
+
 #undef DECLARE_VVAR
+#undef DECLARE_VVAR_SINGLE
 
 #endif
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v21 4/4] selftests/vDSO: add tests for vgetrandom
  2024-07-07  0:26 [PATCH v21 0/4] implement getrandom() in vDSO Jason A. Donenfeld
                   ` (2 preceding siblings ...)
  2024-07-07  0:26 ` [PATCH v21 3/4] x86: vdso: Wire up getrandom() vDSO implementation Jason A. Donenfeld
@ 2024-07-07  0:26 ` Jason A. Donenfeld
  3 siblings, 0 replies; 28+ messages in thread
From: Jason A. Donenfeld @ 2024-07-07  0:26 UTC (permalink / raw)
  To: linux-kernel, patches, tglx
  Cc: Jason A. Donenfeld, linux-crypto, linux-api, x86, Linus Torvalds,
	Greg Kroah-Hartman, Adhemerval Zanella Netto, Carlos O'Donell,
	Florian Weimer, Arnd Bergmann, Jann Horn, Christian Brauner,
	David Hildenbrand, linux-kselftest

This adds two tests for vgetrandom. The first one, vdso_test_chacha,
simply checks that the assembly implementation of chacha20 matches that
of libsodium, a basic sanity check that should catch most errors. The
second, vdso_test_getrandom, is a full "libc-like" implementation of the
userspace side of vgetrandom() support. It's also meant to serve as
example code for libcs that might integrate this.
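
Condensed, the calling protocol that vdso_test_getrandom.c implements in
full goes roughly like this (a sketch with all error handling omitted;
see the code below for the real thing):

	ssize_t (*fn)(void *, size_t, unsigned int, void *, size_t);
	struct vgetrandom_opaque_params params;
	unsigned char buf[256];
	void *state;

	/* 1. Resolve the function from the vDSO. */
	fn = (__typeof__(fn))vdso_sym("LINUX_2.6", "__vdso_getrandom");

	/* 2. Ask the kernel what to pass to mmap() and how large each
	 *    thread's opaque state is, by passing opaque_len == ~0UL. */
	fn(NULL, 0, 0, &params, ~0UL);

	/* 3. Allocate an opaque state for this thread... */
	state = mmap(0, params.size_of_opaque_state, params.mmap_prot,
		     params.mmap_flags, -1, 0);

	/* 4. ...and then draw random bytes through the vDSO. */
	fn(buf, sizeof(buf), 0, state, params.size_of_opaque_state);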

Cc: linux-kselftest@vger.kernel.org
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
 tools/include/asm/rwonce.h                    |   0
 tools/testing/selftests/vDSO/.gitignore       |   2 +
 tools/testing/selftests/vDSO/Makefile         |  15 +
 .../testing/selftests/vDSO/vdso_test_chacha.c |  43 +++
 .../selftests/vDSO/vdso_test_getrandom.c      | 288 ++++++++++++++++++
 5 files changed, 348 insertions(+)
 create mode 100644 tools/include/asm/rwonce.h
 create mode 100644 tools/testing/selftests/vDSO/vdso_test_chacha.c
 create mode 100644 tools/testing/selftests/vDSO/vdso_test_getrandom.c

diff --git a/tools/include/asm/rwonce.h b/tools/include/asm/rwonce.h
new file mode 100644
index 000000000000..e69de29bb2d1
diff --git a/tools/testing/selftests/vDSO/.gitignore b/tools/testing/selftests/vDSO/.gitignore
index a8dc51af5a9c..30d5c8f0e5c7 100644
--- a/tools/testing/selftests/vDSO/.gitignore
+++ b/tools/testing/selftests/vDSO/.gitignore
@@ -6,3 +6,5 @@ vdso_test_correctness
 vdso_test_gettimeofday
 vdso_test_getcpu
 vdso_standalone_test_x86
+vdso_test_getrandom
+vdso_test_chacha
diff --git a/tools/testing/selftests/vDSO/Makefile b/tools/testing/selftests/vDSO/Makefile
index d53a4d8008f9..12fdae3b3201 100644
--- a/tools/testing/selftests/vDSO/Makefile
+++ b/tools/testing/selftests/vDSO/Makefile
@@ -3,6 +3,7 @@ include ../lib.mk
 
 uname_M := $(shell uname -m 2>/dev/null || echo not)
 ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/x86/ -e s/x86_64/x86/)
+SODIUM := $(shell pkg-config --libs libsodium 2>/dev/null)
 
 TEST_GEN_PROGS := $(OUTPUT)/vdso_test_gettimeofday $(OUTPUT)/vdso_test_getcpu
 TEST_GEN_PROGS += $(OUTPUT)/vdso_test_abi
@@ -11,9 +12,19 @@ ifeq ($(ARCH),$(filter $(ARCH),x86 x86_64))
 TEST_GEN_PROGS += $(OUTPUT)/vdso_standalone_test_x86
 endif
 TEST_GEN_PROGS += $(OUTPUT)/vdso_test_correctness
+ifeq ($(uname_M),x86_64)
+TEST_GEN_PROGS += $(OUTPUT)/vdso_test_getrandom
+ifneq ($(SODIUM),)
+TEST_GEN_PROGS += $(OUTPUT)/vdso_test_chacha
+endif
+endif
 
 CFLAGS := -std=gnu99
 CFLAGS_vdso_standalone_test_x86 := -nostdlib -fno-asynchronous-unwind-tables -fno-stack-protector
+CFLAGS_vdso_test_getrandom := -isystem $(top_srcdir)/tools/include -isystem $(top_srcdir)/include/uapi
+CFLAGS_vdso_test_chacha := $(SODIUM) -idirafter $(top_srcdir)/tools/include -isystem $(top_srcdir)/include \
+			   -isystem $(top_srcdir)/arch/$(ARCH)/include \
+			   -D__ASSEMBLY__ -DBUILD_VDSO -DCONFIG_FUNCTION_ALIGNMENT=0 -Wa,--noexecstack
 LDFLAGS_vdso_test_correctness := -ldl
 ifeq ($(CONFIG_X86_32),y)
 LDLIBS += -lgcc_s
@@ -33,3 +44,7 @@ $(OUTPUT)/vdso_test_correctness: vdso_test_correctness.c
 		vdso_test_correctness.c \
 		-o $@ \
 		$(LDFLAGS_vdso_test_correctness)
+$(OUTPUT)/vdso_test_getrandom: CFLAGS += $(CFLAGS_vdso_test_getrandom)
+$(OUTPUT)/vdso_test_getrandom: parse_vdso.c
+$(OUTPUT)/vdso_test_chacha: CFLAGS += $(CFLAGS_vdso_test_chacha)
+$(OUTPUT)/vdso_test_chacha: $(top_srcdir)/arch/$(ARCH)/entry/vdso/vgetrandom-chacha.S
diff --git a/tools/testing/selftests/vDSO/vdso_test_chacha.c b/tools/testing/selftests/vDSO/vdso_test_chacha.c
new file mode 100644
index 000000000000..e38f44e5f803
--- /dev/null
+++ b/tools/testing/selftests/vDSO/vdso_test_chacha.c
@@ -0,0 +1,43 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2022-2024 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ */
+
+#include <sodium/crypto_stream_chacha20.h>
+#include <sys/random.h>
+#include <string.h>
+#include <stdint.h>
+#include "../kselftest.h"
+
+extern void __arch_chacha20_blocks_nostack(uint8_t *dst_bytes, const uint8_t *key, uint32_t *counter, size_t nblocks);
+
+int main(int argc, char *argv[])
+{
+	enum { TRIALS = 1000, BLOCKS = 128, BLOCK_SIZE = 64 };
+	static const uint8_t nonce[8] = { 0 };
+	uint32_t counter[2];
+	uint8_t key[32];
+	uint8_t output1[BLOCK_SIZE * BLOCKS], output2[BLOCK_SIZE * BLOCKS];
+
+	ksft_print_header();
+	ksft_set_plan(1);
+
+	for (unsigned int trial = 0; trial < TRIALS; ++trial) {
+		if (getrandom(key, sizeof(key), 0) != sizeof(key)) {
+			printf("getrandom() failed!\n");
+			return KSFT_SKIP;
+		}
+		crypto_stream_chacha20(output1, sizeof(output1), nonce, key);
+		for (unsigned int split = 0; split < BLOCKS; ++split) {
+			memset(output2, 'X', sizeof(output2));
+			memset(counter, 0, sizeof(counter));
+			if (split)
+				__arch_chacha20_blocks_nostack(output2, key, counter, split);
+			__arch_chacha20_blocks_nostack(output2 + split * BLOCK_SIZE, key, counter, BLOCKS - split);
+			if (memcmp(output1, output2, sizeof(output1)))
+				return KSFT_FAIL;
+		}
+	}
+	ksft_test_result_pass("chacha: PASS\n");
+	return KSFT_PASS;
+}
diff --git a/tools/testing/selftests/vDSO/vdso_test_getrandom.c b/tools/testing/selftests/vDSO/vdso_test_getrandom.c
new file mode 100644
index 000000000000..69f5833590d2
--- /dev/null
+++ b/tools/testing/selftests/vDSO/vdso_test_getrandom.c
@@ -0,0 +1,288 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2022-2024 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ */
+
+#include <assert.h>
+#include <pthread.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <time.h>
+#include <unistd.h>
+#include <signal.h>
+#include <sys/auxv.h>
+#include <sys/mman.h>
+#include <sys/random.h>
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <linux/random.h>
+
+#include "../kselftest.h"
+#include "parse_vdso.h"
+
+#ifndef timespecsub
+#define	timespecsub(tsp, usp, vsp)					\
+	do {								\
+		(vsp)->tv_sec = (tsp)->tv_sec - (usp)->tv_sec;		\
+		(vsp)->tv_nsec = (tsp)->tv_nsec - (usp)->tv_nsec;	\
+		if ((vsp)->tv_nsec < 0) {				\
+			(vsp)->tv_sec--;				\
+			(vsp)->tv_nsec += 1000000000L;			\
+		}							\
+	} while (0)
+#endif
+
+static struct {
+	pthread_mutex_t lock;
+	void **states;
+	size_t len, cap;
+} grnd_allocator = {
+	.lock = PTHREAD_MUTEX_INITIALIZER
+};
+
+static struct {
+	ssize_t(*fn)(void *, size_t, unsigned long, void *, size_t);
+	pthread_key_t key;
+	pthread_once_t initialized;
+	struct vgetrandom_opaque_params params;
+} grnd_ctx = {
+	.initialized = PTHREAD_ONCE_INIT
+};
+
+static void *vgetrandom_get_state(void)
+{
+	void *state = NULL;
+
+	pthread_mutex_lock(&grnd_allocator.lock);
+	if (!grnd_allocator.len) {
+		size_t page_size = getpagesize();
+		size_t new_cap;
+		size_t alloc_size, num = sysconf(_SC_NPROCESSORS_ONLN); /* Just a decent heuristic. */
+		void *new_block, *new_states;
+
+		alloc_size = (num * grnd_ctx.params.size_of_opaque_state + page_size - 1) & (~(page_size - 1));
+		num = (page_size / grnd_ctx.params.size_of_opaque_state) * (alloc_size / page_size);
+		new_block = mmap(0, alloc_size, grnd_ctx.params.mmap_prot, grnd_ctx.params.mmap_flags, -1, 0);
+		if (new_block == MAP_FAILED)
+			return NULL;
+
+		new_cap = grnd_allocator.cap + num;
+		new_states = reallocarray(grnd_allocator.states, new_cap, sizeof(*grnd_allocator.states));
+		if (!new_states)
+			goto unmap;
+		grnd_allocator.cap = new_cap;
+		grnd_allocator.states = new_states;
+
+		for (size_t i = 0; i < num; ++i) {
+			if (((uintptr_t)new_block & (page_size - 1)) + grnd_ctx.params.size_of_opaque_state > page_size)
+				new_block = (void *)(((uintptr_t)new_block + page_size - 1) & (~(page_size - 1)));
+			grnd_allocator.states[i] = new_block;
+			new_block += grnd_ctx.params.size_of_opaque_state;
+		}
+		grnd_allocator.len = num;
+		goto success;
+
+	unmap:
+		munmap(new_block, alloc_size);
+		goto out;
+	}
+success:
+	state = grnd_allocator.states[--grnd_allocator.len];
+
+out:
+	pthread_mutex_unlock(&grnd_allocator.lock);
+	return state;
+}
+
+static void vgetrandom_put_state(void *state)
+{
+	if (!state)
+		return;
+	pthread_mutex_lock(&grnd_allocator.lock);
+	grnd_allocator.states[grnd_allocator.len++] = state;
+	pthread_mutex_unlock(&grnd_allocator.lock);
+}
+
+static void vgetrandom_init(void)
+{
+	if (pthread_key_create(&grnd_ctx.key, vgetrandom_put_state) != 0)
+		return;
+	unsigned long sysinfo_ehdr = getauxval(AT_SYSINFO_EHDR);
+	if (!sysinfo_ehdr) {
+		printf("AT_SYSINFO_EHDR is not present!\n");
+		exit(KSFT_SKIP);
+	}
+	vdso_init_from_sysinfo_ehdr(sysinfo_ehdr);
+	grnd_ctx.fn = (__typeof__(grnd_ctx.fn))vdso_sym("LINUX_2.6", "__vdso_getrandom");
+	if (!grnd_ctx.fn) {
+		printf("__vdso_getrandom is missing!\n");
+		exit(KSFT_FAIL);
+	}
+	if (grnd_ctx.fn(NULL, 0, 0, &grnd_ctx.params, ~0UL) != 0) {
+		printf("failed to fetch vgetrandom params!\n");
+		exit(KSFT_FAIL);
+	}
+}
+
+static ssize_t vgetrandom(void *buf, size_t len, unsigned long flags)
+{
+	void *state;
+
+	pthread_once(&grnd_ctx.initialized, vgetrandom_init);
+	state = pthread_getspecific(grnd_ctx.key);
+	if (!state) {
+		state = vgetrandom_get_state();
+		if (pthread_setspecific(grnd_ctx.key, state) != 0) {
+			vgetrandom_put_state(state);
+			state = NULL;
+		}
+		if (!state) {
+			printf("vgetrandom_get_state failed!\n");
+			exit(KSFT_FAIL);
+		}
+	}
+	return grnd_ctx.fn(buf, len, flags, state, grnd_ctx.params.size_of_opaque_state);
+}
+
+enum { TRIALS = 25000000, THREADS = 256 };
+
+static void *test_vdso_getrandom(void *unused)
+{
+	for (size_t i = 0; i < TRIALS; ++i) {
+		unsigned int val;
+		ssize_t ret = vgetrandom(&val, sizeof(val), 0);
+		assert(ret == sizeof(val));
+	}
+	return NULL;
+}
+
+static void *test_libc_getrandom(void *unused)
+{
+	for (size_t i = 0; i < TRIALS; ++i) {
+		unsigned int val;
+		ssize_t ret = getrandom(&val, sizeof(val), 0);
+		assert(ret == sizeof(val));
+	}
+	return NULL;
+}
+
+static void *test_syscall_getrandom(void *unused)
+{
+	for (size_t i = 0; i < TRIALS; ++i) {
+		unsigned int val;
+		ssize_t ret = syscall(__NR_getrandom, &val, sizeof(val), 0);
+		assert(ret == sizeof(val));
+	}
+	return NULL;
+}
+
+static void bench_single(void)
+{
+	struct timespec start, end, diff;
+
+	clock_gettime(CLOCK_MONOTONIC, &start);
+	test_vdso_getrandom(NULL);
+	clock_gettime(CLOCK_MONOTONIC, &end);
+	timespecsub(&end, &start, &diff);
+	printf("   vdso: %u times in %lu.%09lu seconds\n", TRIALS, diff.tv_sec, diff.tv_nsec);
+
+	clock_gettime(CLOCK_MONOTONIC, &start);
+	test_libc_getrandom(NULL);
+	clock_gettime(CLOCK_MONOTONIC, &end);
+	timespecsub(&end, &start, &diff);
+	printf("   libc: %u times in %lu.%09lu seconds\n", TRIALS, diff.tv_sec, diff.tv_nsec);
+
+	clock_gettime(CLOCK_MONOTONIC, &start);
+	test_syscall_getrandom(NULL);
+	clock_gettime(CLOCK_MONOTONIC, &end);
+	timespecsub(&end, &start, &diff);
+	printf("syscall: %u times in %lu.%09lu seconds\n", TRIALS, diff.tv_sec, diff.tv_nsec);
+}
+
+static void bench_multi(void)
+{
+	struct timespec start, end, diff;
+	pthread_t threads[THREADS];
+
+	clock_gettime(CLOCK_MONOTONIC, &start);
+	for (size_t i = 0; i < THREADS; ++i)
+		assert(pthread_create(&threads[i], NULL, test_vdso_getrandom, NULL) == 0);
+	for (size_t i = 0; i < THREADS; ++i)
+		pthread_join(threads[i], NULL);
+	clock_gettime(CLOCK_MONOTONIC, &end);
+	timespecsub(&end, &start, &diff);
+	printf("   vdso: %u x %u times in %lu.%09lu seconds\n", TRIALS, THREADS, diff.tv_sec, diff.tv_nsec);
+
+	clock_gettime(CLOCK_MONOTONIC, &start);
+	for (size_t i = 0; i < THREADS; ++i)
+		assert(pthread_create(&threads[i], NULL, test_libc_getrandom, NULL) == 0);
+	for (size_t i = 0; i < THREADS; ++i)
+		pthread_join(threads[i], NULL);
+	clock_gettime(CLOCK_MONOTONIC, &end);
+	timespecsub(&end, &start, &diff);
+	printf("   libc: %u x %u times in %lu.%09lu seconds\n", TRIALS, THREADS, diff.tv_sec, diff.tv_nsec);
+
+	clock_gettime(CLOCK_MONOTONIC, &start);
+	for (size_t i = 0; i < THREADS; ++i)
+		assert(pthread_create(&threads[i], NULL, test_syscall_getrandom, NULL) == 0);
+	for (size_t i = 0; i < THREADS; ++i)
+		pthread_join(threads[i], NULL);
+	clock_gettime(CLOCK_MONOTONIC, &end);
+	timespecsub(&end, &start, &diff);
+	printf("syscall: %u x %u times in %lu.%09lu seconds\n", TRIALS, THREADS, diff.tv_sec, diff.tv_nsec);
+}
+
+static void fill(void)
+{
+	uint8_t weird_size[323929];
+	for (;;)
+		vgetrandom(weird_size, sizeof(weird_size), 0);
+}
+
+static void kselftest(void)
+{
+	uint8_t weird_size[1263];
+
+	ksft_print_header();
+	ksft_set_plan(1);
+
+	for (size_t i = 0; i < 1000; ++i) {
+		ssize_t ret = vgetrandom(weird_size, sizeof(weird_size), 0);
+		if (ret != sizeof(weird_size))
+			exit(KSFT_FAIL);
+	}
+
+	ksft_test_result_pass("getrandom: PASS\n");
+	exit(KSFT_PASS);
+}
+
+static void usage(const char *argv0)
+{
+	fprintf(stderr, "Usage: %s [bench-single|bench-multi|fill]\n", argv0);
+}
+
+int main(int argc, char *argv[])
+{
+	if (argc == 1) {
+		kselftest();
+		return 0;
+	}
+
+	if (argc != 2) {
+		usage(argv[0]);
+		return 1;
+	}
+	if (!strcmp(argv[1], "bench-single"))
+		bench_single();
+	else if (!strcmp(argv[1], "bench-multi"))
+		bench_multi();
+	else if (!strcmp(argv[1], "fill"))
+		fill();
+	else {
+		usage(argv[0]);
+		return 1;
+	}
+	return 0;
+}
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v21 1/4] mm: add VM_DROPPABLE for designating always lazily freeable mappings
  2024-07-07  0:26 ` [PATCH v21 1/4] mm: add VM_DROPPABLE for designating always lazily freeable mappings Jason A. Donenfeld
@ 2024-07-07  7:42   ` David Hildenbrand
  2024-07-07 18:19     ` Linus Torvalds
  2024-07-08  1:46     ` Jason A. Donenfeld
  0 siblings, 2 replies; 28+ messages in thread
From: David Hildenbrand @ 2024-07-07  7:42 UTC (permalink / raw)
  To: Jason A. Donenfeld, linux-kernel, patches, tglx
  Cc: linux-crypto, linux-api, x86, Linus Torvalds, Greg Kroah-Hartman,
	Adhemerval Zanella Netto, Carlos O'Donell, Florian Weimer,
	Arnd Bergmann, Jann Horn, Christian Brauner, David Hildenbrand,
	linux-mm

On 07.07.24 02:26, Jason A. Donenfeld wrote:

Hi,

having more generic support for VM_DROPPABLE sounds great; I was myself
at some point looking for something like that.

> The vDSO getrandom() implementation works with a buffer allocated with a
> new system call that has certain requirements:
> 
> - It shouldn't be written to core dumps.
>    * Easy: VM_DONTDUMP.
> - It should be zeroed on fork.
>    * Easy: VM_WIPEONFORK.
> 
> - It shouldn't be written to swap.
>    * Uh-oh: mlock is rlimited.
>    * Uh-oh: mlock isn't inherited by forks.
> 
> It turns out that the vDSO getrandom() function has three really nice
> characteristics that we can exploit to solve this problem:
> 
> 1) Due to being wiped during fork(), the vDSO code is already robust to
>     having the contents of the pages it reads zeroed out midway through
>     the function's execution.
> 
> 2) In the absolute worst case of whatever contingency we're coding for,
>     we have the option to fallback to the getrandom() syscall, and
>     everything is fine.
> 
> 3) The buffers the function uses are only ever useful for a maximum of
>     60 seconds -- a sort of cache, rather than a long term allocation.
> 
> These characteristics mean that we can introduce VM_DROPPABLE, which
> has the following semantics:
> 
> a) It never is written out to swap.
> b) Under memory pressure, mm can just drop the pages (so that they're
>     zero when read back again).
> c) It is inherited by fork.
> d) It doesn't count against the mlock budget, since nothing is locked.
> 
> This is fairly simple to implement, with the one snag that we have to
> use 64-bit VM_* flags, but this shouldn't be a problem, since the only
> consumers will probably be 64-bit anyway.
> 
> This way, allocations used by vDSO getrandom() can use:
> 
>      VM_DROPPABLE | VM_DONTDUMP | VM_WIPEONFORK | VM_NORESERVE
> 
> And there will be no problem with using memory when not in use, not
> wiping on fork(), coredumps, or writing out to swap.
> 
> In order to let vDSO getrandom() use this, expose these via mmap(2) as
> well, giving MAP_WIPEONFORK, MAP_DONTDUMP, and MAP_DROPPABLE.


The patch subject should mention MAP_DROPPABLE now.

But I don't immediately see why MAP_WIPEONFORK and MAP_DONTDUMP have to 
be mmap() flags. Using mmap(MAP_NORESERVE|MAP_DROPPABLE) with madvise() 
to configure these (for users that require that) should be good enough, 
just like they are for existing users.
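
For the vDSO use case that would look something like this (a sketch, not
tested; MADV_WIPEONFORK and MADV_DONTDUMP exist today, MAP_DROPPABLE
being the one new flag from this series):

	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_ANONYMOUS | MAP_PRIVATE | MAP_NORESERVE |
		       MAP_DROPPABLE, -1, 0);
	/* Only callers that actually need these opt in afterwards: */
	madvise(p, len, MADV_WIPEONFORK);
	madvise(p, len, MADV_DONTDUMP);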

Thinking out loud: MAP_DROPPABLE also only sets a VMA flag (and does
not affect memory committing like MAP_NORESERVE), right? So
MAP_DROPPABLE could easily become a madvise() option as well?

(as you know, we only have limited mmap bits but plenty of madvise 
numbers available)


Interestingly, when looking into something comparable in the past I 
stumbled over "vrange" [1], which would have had a slightly different 
semantic (signal on reaccess). And that did turn out to be more suitable
for madvise() flags [2], whereby vrange evolved into
MADV_VOLATILE/MADV_NONVOLATILE.

A sticky MADV_VOLATILE vs. MADV_NONVOLATILE would actually sound pretty 
handy. (again, with your semantics, not the signal-on-reaccess kind of 
thing)

([2] is in general a good read; hey, it's been 10 years since that was 
brought up the last time!)


There needs to be better reasoning why we have to consume three mmap 
bits for something that can likely be achieved without any.

Maybe that was discussed with Linus and there is a pretty good reason 
for that.

I'll also mention that I am unsure how MAP_DROPPABLE is supposed to 
interact with mlock. Maybe just like MADV_FREE currently does (no idea 
if that will work as intended ;) ).


[1] https://lwn.net/Articles/590991/
[2] https://lwn.net/Articles/602650/

> 
> Finally, the provided self test ensures that this is working as desired.
> 
> Cc: linux-mm@kvack.org
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> ---


[...]

> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 8c6cd8825273..57b8dad9adcc 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -623,7 +623,7 @@ mprotect_fixup(struct vma_iterator *vmi, struct mmu_gather *tlb,
>   				may_expand_vm(mm, oldflags, nrpages))
>   			return -ENOMEM;
>   		if (!(oldflags & (VM_ACCOUNT|VM_WRITE|VM_HUGETLB|
> -						VM_SHARED|VM_NORESERVE))) {
> +				  VM_SHARED|VM_NORESERVE|VM_DROPPABLE))) {
>   			charged = nrpages;
>   			if (security_vm_enough_memory_mm(mm, charged))
>   				return -ENOMEM;

I don't quite understand this change here. If MAP_DROPPABLE does not 
affect memory accounting during mmap(), it should not affect the same 
during mprotect(). VM_NORESERVE / MAP_NORESERVE is responsible for that.

Did I miss something where MAP_DROPPABLE changes the memory
accounting during mmap()?

> diff --git a/mm/rmap.c b/mm/rmap.c
> index e8fc5ecb59b2..56d7535d5cf6 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1397,7 +1397,10 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
>   	VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);
>   	VM_BUG_ON_VMA(address < vma->vm_start ||
>   			address + (nr << PAGE_SHIFT) > vma->vm_end, vma);
> -	__folio_set_swapbacked(folio);
> +	/* VM_DROPPABLE mappings don't swap; instead they're just dropped when
> +	 * under memory pressure. */
> +	if (!(vma->vm_flags & VM_DROPPABLE))
> +		__folio_set_swapbacked(folio);
>   	__folio_set_anon(folio, vma, address, true);
>   
>   	if (likely(!folio_test_large(folio))) {
> @@ -1841,7 +1844,11 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>   				 * plus the rmap(s) (dropped by discard:).
>   				 */
>   				if (ref_count == 1 + map_count &&
> -				    !folio_test_dirty(folio)) {
> +				    (!folio_test_dirty(folio) ||
> +				     /* Unlike MADV_FREE mappings, VM_DROPPABLE
> +				      * ones can be dropped even if they've
> +				      * been dirtied. */

We use

/*
  * Comment start
  * Comment end
  */

styled comments in MM.

> +				     (vma->vm_flags & VM_DROPPABLE))) {
>   					dec_mm_counter(mm, MM_ANONPAGES);
>   					goto discard;
>   				}
> @@ -1851,7 +1858,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>   				 * discarded. Remap the page to page table.
>   				 */
>   				set_pte_at(mm, address, pvmw.pte, pteval);
> -				folio_set_swapbacked(folio);
> +				/* Unlike MADV_FREE mappings, VM_DROPPABLE ones
> +				 * never get swap backed on failure to drop. */
> +				if (!(vma->vm_flags & VM_DROPPABLE))
> +					folio_set_swapbacked(folio);
>   				ret = false;
>   				page_vma_mapped_walk_done(&pvmw);
>   				break;

A note that in mm/mm-stable, "madvise_free_huge_pmd" exists to optimize 
MADV_FREE on PMDs. I suspect we'd want to extend that one as well for 
dropping support, but likely it would also only be a performance 
improvement and not affect functionality if not handled.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v21 1/4] mm: add VM_DROPPABLE for designating always lazily freeable mappings
  2024-07-07  7:42   ` David Hildenbrand
@ 2024-07-07 18:19     ` Linus Torvalds
  2024-07-07 18:52       ` David Hildenbrand
  2024-07-08  1:59       ` Jason A. Donenfeld
  2024-07-08  1:46     ` Jason A. Donenfeld
  1 sibling, 2 replies; 28+ messages in thread
From: Linus Torvalds @ 2024-07-07 18:19 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Jason A. Donenfeld, linux-kernel, patches, tglx, linux-crypto,
	linux-api, x86, Greg Kroah-Hartman, Adhemerval Zanella Netto,
	Carlos O'Donell, Florian Weimer, Arnd Bergmann, Jann Horn,
	Christian Brauner, David Hildenbrand, linux-mm

On Sun, 7 Jul 2024 at 00:42, David Hildenbrand <david@redhat.com> wrote:
>
> But I don't immediately see why MAP_WIPEONFORK and MAP_DONTDUMP have to
> be mmap() flags.

I don't think they have to be mmap() flags, but that said, I think
it's technically a better alternative to saying "you have to
madvise things later".

I very much understand the "we don't have a lot of MAP_xyz flags and
we don't want to waste them" argument, but at the same time

 (a) we _do_ have those flags

 (b) picking a worse interface seems bad

 (c) we could actually use the PROT_xyz bits, which we have a ton of

And yes, (c) is ugly, but is it uglier than "use two system calls to
do one thing"? I mean, "flags" and "prot" are just two sides of the
same coin in the end, the split is kind of arbitrary, and "prot" only
has four bits right now, and one of them is historical and useless,
and actually happens to be *exactly* this kind of MAP_xyz bit.

(In case it's not clear, I'm talking about PROT_SEM, which is very
much a behavioral bit for broken architectures that we've actually
never implemented).

We also have PROT_GROWSDOWN and PROT_GROWSUP, which is basically a
"match MAP_GROWSxyz and change the mprotect() limits appropriately"

So I actually think we could use the PROT_xyz bits, and anybody who
says "those are for PROT_READ and PROT_WRITE" is already very very
wrong.

Again - not pretty, but happens to match reality.

> Interestingly, when looking into something comparable in the past I
> stumbled over "vrange" [1], which would have had a slightly different
> semantic (signal on reaccess).

We literally talked about exactly this with Jason, except unlike you I
couldn't find the historical archive (I tried in vain to find
something from lore).

  https://lore.kernel.org/lkml/CAHk-=whRpLyY+U9mkKo8O=2_BXNk=7sjYeObzFr3fGi0KLjLJw@mail.gmail.com/

I do think that a "explicit populate and get a signal on access" is a
very valid model, but I think the "zero on access" is a more
immediately real model.

And we actually have had the "get signal on access" before: that's
what VM_DONTCOPY is.

And it was the *less* useful model, which is why we added
VM_WIPEONFORK, because that's the semantics people actually wanted.

So I think the "signal on thrown out data access" is interesting, but
not necessarily the *more* interesting case.

And I think if we do want that case, I think having MAP_DROPPABLE have
those semantics for MAP_SHARED would be the way to go. IOW, starting
off with the "zero on next access after drop" case doesn't make it any
harder to then later add a "fault on next access after drop" version.

> There needs to be better reasoning why we have to consume three mmap
> bits for something that can likely be achieved without any.

I think it goes the other way: why are MAP_xyz bits so precious to
make this harder to actually use?

Together with that whole "maybe use PROT_xyz bits instead" discussion?

               Linus

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v21 1/4] mm: add VM_DROPPABLE for designating always lazily freeable mappings
  2024-07-07 18:19     ` Linus Torvalds
@ 2024-07-07 18:52       ` David Hildenbrand
  2024-07-07 19:22         ` Linus Torvalds
  2024-07-08  1:59       ` Jason A. Donenfeld
  1 sibling, 1 reply; 28+ messages in thread
From: David Hildenbrand @ 2024-07-07 18:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jason A. Donenfeld, linux-kernel, patches, tglx, linux-crypto,
	linux-api, x86, Greg Kroah-Hartman, Adhemerval Zanella Netto,
	Carlos O'Donell, Florian Weimer, Arnd Bergmann, Jann Horn,
	Christian Brauner, David Hildenbrand, linux-mm

On 07.07.24 20:19, Linus Torvalds wrote:
> On Sun, 7 Jul 2024 at 00:42, David Hildenbrand <david@redhat.com> wrote:
>>
>> But I don't immediately see why MAP_WIPEONFORK and MAP_DONTDUMP have to
>> be mmap() flags.

Hi Linus,

> 
> I don't think they have to be mmap() flags, but that said, I think
> it's technically the better alternative than saying "you have to
> madvise things later".

Having the option to madvise usually means that you can toggle it on/off 
(e.g., MADV_DONTFORK vs. MADV_DOFORK). Not sure if having that option 
could be valuable here (droppable) as well; maybe not.

> 
> I very much understand the "we don't have a lot of MAP_xyz flags and
> we don't want to waste them" argument, but at the same time
> 
>   (a) we _do_ have those flags
> 
>   (b) picking a worse interface seems bad
> 
>   (c) we could actually use the PROT_xyz bits, which we have a ton of
> 

I recall that introducing things like MAP_SHARED_VALIDATE received a lot 
of pushback in the past. But that was before my MM days, and I only had 
people tell me stories about it.

(and at LSF/MM it's been a recurring theme that if you want to propose 
a new MMAP flag, you're going to have a hard time)

> And yes, (c) is ugly, but is it uglier than "use two system calls to
> do one thing"? I mean, "flags" and "prot" are just two sides of the
> same coin in the end, the split is kind of arbitrary, and "prot" only
> has four bits right now, and one of them is historical and useless,
> and actually happens to be *exactly* this kind of MAP_xyz bit.

Yeah, I always had the same feeling about prot vs. flags.

My understanding so far was that we should have madvise() ways to toggle 
stuff and only add mmap bits when unavoidable; at least that's what I learned
from the community.

Good to hear that this is changing. (or it's just been an urban myth)


I'll use your mail as a reference in the future when that topic pops up ;)


Maybe historically we used madvise() options because it's easier to sense
which options the current kernel actually supports (e.g., let mmap()
succeed but let a separate madvise(MADV_HUGEPAGE) etc. fail if not
supported by the kernel; no need to fail the whole operation).

> 
> (In case it's not clear, I'm talking about PROT_SEM, which is very
> much a behavioral bit for broken architectures that we've actually
> never implemented).

Yeah.

> 
> We also have PROT_GROWSDOWN and PROT_GROWSUP, which is basically a
> "match MAP_GROWSxyz and change the mprotect() limits appropriately"

It's the first time I hear about these two mprotect() options, thanks 
for mentioning that :)

> 
> So I actually think we could use the PROT_xyz bits, and anybody who
> says "those are for PROT_READ and PROT_WRITE" is already very very
> wrong.
> 
> Again - not pretty, but happens to match reality.
> 
>> Interestingly, when looking into something comparable in the past I
>> stumbled over "vrange" [1], which would have had a slightly different
>> semantic (signal on reaccess).
> 
> We literally talked about exactly this with Jason, except unlike you I
> couldn't find the historical archive (I tried in vain to find
> something from lore).

Good that you discussed it, I primarily scanned this patch set here only.

I took notes back when I was looking for something like VM_DROPPABLE 
(also, being more interested in the non-signal version for a VM cache 
use case).

> 
>    https://lore.kernel.org/lkml/CAHk-=whRpLyY+U9mkKo8O=2_BXNk=7sjYeObzFr3fGi0KLjLJw@mail.gmail.com/
> 
> I do think that a "explicit populate and get a signal on access" is a
> very valid model, but I think the "zero on access" is a more
> immediately real model.
> 
> And we actually have had the "get signal on access" before: that's
> what VM_DONTCOPY is.
> 
> And it was the *less* useful model, which is why we added
> VM_WIPEONFORK, because that's the semantics people actually wanted.
> 
> So I think the "signal on thrown out data access" is interesting, but
> not necessarily the *more* interesting case.

Absolutely agreed.

> 
> And I think if we do want that case, I think having MAP_DROPPABLE have
> those semantics for MAP_SHARED would be the way to go. IOW, starting
> off with the "zero on next access after drop" case doesn't make it any
> harder to then later add a "fault on next access after drop" version.
> 
>> There needs to be better reasoning why we have to consume three mmap
>> bits for something that can likely be achieved without any.
> 
> I think it goes the other way: why are MAP_xyz bits so precious to
> make this harder to actually use?

If things changed and we can have as many as we want, good!

Things like MADV_HUGEPAGE/MADV_MERGEABLE might benefit from a mmap flag 
as well.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v21 1/4] mm: add VM_DROPPABLE for designating always lazily freeable mappings
  2024-07-07 18:52       ` David Hildenbrand
@ 2024-07-07 19:22         ` Linus Torvalds
  2024-07-07 21:01           ` David Hildenbrand
  0 siblings, 1 reply; 28+ messages in thread
From: Linus Torvalds @ 2024-07-07 19:22 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Jason A. Donenfeld, linux-kernel, patches, tglx, linux-crypto,
	linux-api, x86, Greg Kroah-Hartman, Adhemerval Zanella Netto,
	Carlos O'Donell, Florian Weimer, Arnd Bergmann, Jann Horn,
	Christian Brauner, David Hildenbrand, linux-mm

On Sun, 7 Jul 2024 at 11:52, David Hildenbrand <david@redhat.com> wrote:
>
> I recall that introducing things like MAP_SHARED_VALIDATE received a lot
> of pushback in the past. But that was before my MM days, and I only had
> people tell me stories about it.

I think MAP_SHARED_VALIDATE was mostly about worrying about the API impact.

And I think it worked out so well that this is probably the first time
it has been brought up ever since ;)

That said, the *reason* for MAP_SHARED_VALIDATE is actually very
valid: we have historically just ignored any random flags in the
mmap() interfaces, and with shared mappings, that can be dangerous.

IOW, the real issue wasn't MAP_SHARED_VALIDATE itself, but introducing
*other* flags that affected maps that old kernels would ignore, and
then the worry was "now old kernels and new kernels work very
differently for this binary".

That's technically obviously true of any MAP_DROPPABLE thing too - old
kernels would happily just ignore it. I suspect that's more of a
feature than a mis-feature, but..

> My understanding so far was that we should have madvise() ways to toggle
> stuff and add mmap bits if not avoidable; at least that's what I learned
> from the community.

It doesn't sound like a bad model in general. I'm not entirely sure it
makes sense for something like "droppable", since that is a fairly
fundamental behavioral thing. Does it make sense to make something
undroppable when it can drop pages concurrently with that operation?

I mean, you can't switch MAP_SHARED around either.

The other bits already _do_ have madvise() things, and Jason added a
way to just do it all in one go.

> Good to hear that this is changing. (or it's just been an urban myth)

I don't know if that's an urban myth.  Some people are a *lot* more
risk-averse than I personally am. I want things to make sense, but I
also consider "this is fixable if it causes issues" to be a valid
argument.

So for example, who knows *what* garbage people pass off to mmap() as
an argument. That worry was why MAP_SHARED_VALIDATE happened.

But at the same time, does it make sense to complicate things because
of some theoretical worry? Giving random bits to mmap() sounds
unlikely to be a real issue to me, but maybe I'm being naive.

I do generally think that user mode programs can pretty much be
expected to do random things, but how do you even *create* a mmap
MAP_xyz flags field that has random high bits set?

> > We also have PROT_GROWSDOWN and PROT_GROWSUP, which is basically a
> > "match MAP_GROWSxyz and change the mprotect() limits appropriately"
>
> It's the first time I hear about these two mprotect() options, thanks
> for mentioning that :)

Don't thank me.

They actually do make sense in a "what if I want to mprotect() the
stack, but I don't know what the stack range is since it's dynamic"
kind of way, so I certainly don't hate them.

So they are not bad bits, but at the same time they are examples of
how there is a fuzzy line between MAP_xyz and PROT_xyz.

And sometimes the line is literally just "mprotect() only gets one of
them, but we want to pass in the other one, so we duplicate them as a
very very special case".

                     Linus

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v21 1/4] mm: add VM_DROPPABLE for designating always lazily freeable mappings
  2024-07-07 19:22         ` Linus Torvalds
@ 2024-07-07 21:01           ` David Hildenbrand
  2024-07-08  0:08             ` Linus Torvalds
  0 siblings, 1 reply; 28+ messages in thread
From: David Hildenbrand @ 2024-07-07 21:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jason A. Donenfeld, linux-kernel, patches, tglx, linux-crypto,
	linux-api, x86, Greg Kroah-Hartman, Adhemerval Zanella Netto,
	Carlos O'Donell, Florian Weimer, Arnd Bergmann, Jann Horn,
	Christian Brauner, David Hildenbrand, linux-mm

On 07.07.24 21:22, Linus Torvalds wrote:
> On Sun, 7 Jul 2024 at 11:52, David Hildenbrand <david@redhat.com> wrote:
>>
>> I recall that introducing things like MAP_SHARED_VALIDATE received a lot
>> of pushback in the past. But that was before my MM days, and I only had
>> people tell me stories about it.
> 
> I think MAP_SHARED_VALIDATE was mostly about worrying about the API impact.
> 
> And I think it worked out so well that this is probably the first time
> it has been brought up ever since ;)
> 
> That said, the *reason* for MAP_SHARED_VALIDATE is actually very
> valid: we have historically just ignored any random flags in the
> mmap() interfaces, and with shared mappings, that can be dangerous.
> 
> IOW, the real issue wasn't MAP_SHARED_VALIDATE itself, but introducing
> *other* flags that affected maps that old kernels would ignore, and
> then the worry was "now old kernels and new kernels work very
> differently for this binary".
> 
> That's technically obviously true of any MAP_DROPPABLE thing too - old
> kernels would happily just ignore it. I suspect that's more of a
> feature than a mis-feature, but..
> 
>> My understanding so far was that we should have madvise() ways to toggle
>> stuff and add mmap bits if not avoidable; at least that's what I learned
>> from the community.
> 
> It doesn't sound like a bad model in general. I'm not entirely sure it
> makes sense for something like "droppable", since that is a fairly
> fundamental behavioral thing. Does it make sense to make something
> undroppable when it can drop pages concurrently with that operation?
> 
> I mean, you can't switch MAP_SHARED around either.
> 
> The other bits already _do_ have madvise() things, and Jason added a
> way to just do it all in one go.

I just recalled that with MAP_HUGETLB, bits [26:31] encode a hugetlb
size (see include/uapi/asm-generic/hugetlb_encode.h). hugetlb, the gift
that keeps on giving.

We're using:

+#define MAP_WIPEONFORK		0x08000000	/* Zero memory in child forks. */
+#define MAP_DONTDUMP		0x10000000	/* Do not write to coredumps. */
+#define MAP_DROPPABLE		0x20000000	/* Zero memory under memory pressure. */

Which should be bit 27-29.

So using these flags with MAP_HUGETLB will result in surprises.

At least MAP_DROPPABLE doesn't quite make sense with hugetlb, but at least
the other ones do have semantics with hugetlb?
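
For example, the 1 GiB size encoding alone overlaps all three (a sketch
with the constants from include/uapi/linux/mman.h, assuming I read them
correctly):

	#define MAP_HUGE_SHIFT	26
	#define MAP_HUGE_1GB	(30U << MAP_HUGE_SHIFT)	/* 0x78000000 */

	/*
	 * 0x78000000 & (MAP_WIPEONFORK | MAP_DONTDUMP | MAP_DROPPABLE)
	 * == 0x38000000, so mmap(..., MAP_HUGETLB | MAP_HUGE_1GB, ...)
	 * would look like it also requested all three new flags.
	 */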

It's late Sunday here in Germany, so I might just have messed something up.

Just raising that there might be a "bit" conflict.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v21 1/4] mm: add VM_DROPPABLE for designating always lazily freeable mappings
  2024-07-07 21:01           ` David Hildenbrand
@ 2024-07-08  0:08             ` Linus Torvalds
  2024-07-08  8:11               ` David Hildenbrand
  2024-07-08 13:50               ` Jason A. Donenfeld
  0 siblings, 2 replies; 28+ messages in thread
From: Linus Torvalds @ 2024-07-08  0:08 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Jason A. Donenfeld, linux-kernel, patches, tglx, linux-crypto,
	linux-api, x86, Greg Kroah-Hartman, Adhemerval Zanella Netto,
	Carlos O'Donell, Florian Weimer, Arnd Bergmann, Jann Horn,
	Christian Brauner, David Hildenbrand, linux-mm

On Sun, 7 Jul 2024 at 14:01, David Hildenbrand <david@redhat.com> wrote:
>
> At least MAP_DROPPABLE doesn't quite make sense with hugetlb, but at least
> the other ones do have semantics with hugetlb?

Hmm.

How about we just say that VM_DROPPABLE really is something separate
from MAP_PRIVATE or MAP_SHARED..

And then we make the rule be that VM_DROPPABLE is never dumped and
always dropped on fork, just to make things simpler.

It not only avoids a flag, but it actually makes sense: the pages
aren't stable for dumping anyway, and not copying them on fork() not
only avoids some overhead, but makes it much more reliable and
testable.

IOW, how about taking this approach:

   --- a/include/uapi/linux/mman.h
   +++ b/include/uapi/linux/mman.h
   @@ -17,5 +17,6 @@
    #define MAP_SHARED  0x01            /* Share changes */
    #define MAP_PRIVATE 0x02            /* Changes are private */
    #define MAP_SHARED_VALIDATE 0x03    /* share + validate extension flags */
   +#define MAP_DROPPABLE       0x08    /* 4 is not in MAP_TYPE on parisc? */

    /*

with do_mmap() doing:

   --- a/mm/mmap.c
   +++ b/mm/mmap.c
   @@ -1369,6 +1369,23 @@ unsigned long do_mmap(struct file *file,
                        pgoff = 0;
                        vm_flags |= VM_SHARED | VM_MAYSHARE;
                        break;
   +            case MAP_DROPPABLE:
   +                    /*
   +                     * A locked or stack area makes no sense to
   +                     * be droppable.
   +                     *
   +                     * Also, since droppable pages can just go
   +                     * away at any time, it makes no sense to
   +                     * copy them on fork or dump them.
   +                     */
   +                    if (flags & MAP_LOCKED)
   +                            return -EINVAL;
   +                    if (vm_flags & (VM_GROWSDOWN|VM_GROWSUP))
   +                            return -EINVAL;
   +
   +                    vm_flags |= VM_DROPPABLE;
   +                    vm_flags |= VM_WIPEONFORK | VM_DONTDUMP;
   +                    fallthrough;
                case MAP_PRIVATE:
                        /*
                         * Set pgoff according to addr for anon_vma.

which looks rather simple.

The only oddity is that parisc thing - every other architecture has the
MAP_TYPE bits being 0xf, but parisc uses 0x2b (also four bits, but
instead of the low four bits it's 00101011 - strange).

So using 8 as a MAP_TYPE bit for MAP_DROPPABLE works everywhere, and
if we eventually want to do a "signaling" MAP_DROPPABLE we could use
9.

This has the added advantage that if somebody does this on an old
kernel, they *will* get an error. Because unlike the 'flag' bits in
general, the MAP_TYPE bit space has always been tested.
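
(That error comes for free from the existing MAP_TYPE validation in
do_mmap(), which condenses to roughly:

	switch (flags & MAP_TYPE) {
	case MAP_SHARED:
	case MAP_SHARED_VALIDATE:
	case MAP_PRIVATE:
		break;
	default:
		return -EINVAL;	/* what an old kernel says to 0x08 */
	}

so the failure case needs no new code at all.)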

Hmm?

              Linus

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v21 1/4] mm: add VM_DROPPABLE for designating always lazily freeable mappings
  2024-07-07  7:42   ` David Hildenbrand
  2024-07-07 18:19     ` Linus Torvalds
@ 2024-07-08  1:46     ` Jason A. Donenfeld
  2024-07-08 20:24       ` David Hildenbrand
  1 sibling, 1 reply; 28+ messages in thread
From: Jason A. Donenfeld @ 2024-07-08  1:46 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, patches, tglx, linux-crypto, linux-api, x86,
	Linus Torvalds, Greg Kroah-Hartman, Adhemerval Zanella Netto,
	Carlos O'Donell, Florian Weimer, Arnd Bergmann, Jann Horn,
	Christian Brauner, David Hildenbrand, linux-mm

Hi David,

Thanks a lot for the review of the code. Am very glad to have somebody
who knows this code take a careful look at it.

On Sun, Jul 07, 2024 at 09:42:38AM +0200, David Hildenbrand wrote:
> The patch subject should mention MAP_DROPPABLE now.

Will do. Or, well, in light of the conversation downthread, MAP_DROPPABLE.

> But I don't immediately see why MAP_WIPEONFORK and MAP_DONTDUMP have to 
> be mmap() flags. Using mmap(MAP_NORESERVE|MAP_DROPPABLE) with madvise() 
> to configure these (for users that require that) should be good enough, 
> just like they are for existing users.

I looked into that too, coming up with some clunky mechanism for
automating several calls to madvise() for each thing. I could make it
work if need be, but it's really not nice. And it sort of then leads in the
direction, "this interface isn't great; why don't you just make a
dedicated syscall that does everything you need in one fell swoop,"
which is explicitly what Linus doesn't want. Making it accessible to
mmap() instead makes it more of a direct thing that isn't a whole new
syscall.

Anyway, it indeed looks like there are more PROT_ bits available, and
also that PROT_ has been used this way before. In addition to PROT_SEM,
there are a few arch-specific PROT_ bits that seem similar enough. The
distinction is pretty blurry between MAP_ and PROT_.

So I'll just move this to PROT_ for v+1.

> > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > index 8c6cd8825273..57b8dad9adcc 100644
> > --- a/mm/mprotect.c
> > +++ b/mm/mprotect.c
> > @@ -623,7 +623,7 @@ mprotect_fixup(struct vma_iterator *vmi, struct mmu_gather *tlb,
> >   				may_expand_vm(mm, oldflags, nrpages))
> >   			return -ENOMEM;
> >   		if (!(oldflags & (VM_ACCOUNT|VM_WRITE|VM_HUGETLB|
> > -						VM_SHARED|VM_NORESERVE))) {
> > +				  VM_SHARED|VM_NORESERVE|VM_DROPPABLE))) {
> >   			charged = nrpages;
> >   			if (security_vm_enough_memory_mm(mm, charged))
> >   				return -ENOMEM;
> 
> I don't quite understand this change here. If MAP_DROPPABLE does not 
> affect memory accounting during mmap(), it should not affect the same 
> during mprotect(). VM_NORESERVE / MAP_NORESERVE is responsible for that.
> 
> Did I miss something where MAP_DROPPABLE changes the memory
> accounting during mmap()?

Actually, I think I erred by not adding it to mmap() (via the check in
accountable_mapping(), I believe), and I should add it there.  That also
might be another reason why this is better as a MAP_ (or, rather PROT_)
bit, rather than an madvise call.

Tell me if you disagree, as I might be way off here. But I was thinking
that because the system can just "drop" this memory, it's not sensible
to account for it, because it can be taken right back.
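
Concretely, I was imagining something along these lines in mm/mmap.c (a
sketch, untested -- the only change being VM_DROPPABLE in the mask):

	static inline bool accountable_mapping(struct file *file, vm_flags_t vm_flags)
	{
		/* hugetlb has its own accounting */
		if (file && is_file_hugepages(file))
			return false;

		/*
		 * Droppable pages can be reclaimed at any time, so don't
		 * charge them against the commit limit:
		 */
		return (vm_flags & (VM_NORESERVE | VM_SHARED | VM_WRITE |
				    VM_DROPPABLE)) == VM_WRITE;
	}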

> > diff --git a/mm/rmap.c b/mm/rmap.c
> We use
> 
> /*
>   * Comment start
>   * Comment end
>   */
> 
> styled comments in MM.

Fixed.

> 
> > +				     (vma->vm_flags & VM_DROPPABLE))) {
> >   					dec_mm_counter(mm, MM_ANONPAGES);
> >   					goto discard;
> >   				}
> > @@ -1851,7 +1858,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> >   				 * discarded. Remap the page to page table.
> >   				 */
> >   				set_pte_at(mm, address, pvmw.pte, pteval);
> > -				folio_set_swapbacked(folio);
> > +				/* Unlike MADV_FREE mappings, VM_DROPPABLE ones
> > +				 * never get swap backed on failure to drop. */
> > +				if (!(vma->vm_flags & VM_DROPPABLE))
> > +					folio_set_swapbacked(folio);
> >   				ret = false;
> >   				page_vma_mapped_walk_done(&pvmw);
> >   				break;
> 
> A note that in mm/mm-stable, "madvise_free_huge_pmd" exists to optimize 
> MADV_FREE on PMDs. I suspect we'd want to extend that one as well for 
> dropping support, but likely it would also only be a performance 
> improvement and not affect functionality if not handled.

That's for doing the freeing of PTEs after the fact, right? If the
mapping was created, got filled with some data, and then sometime later
it got MADV_FREE'd, which is the pattern people typically follow with
MADV_FREE. If we do this as PROT_/MAP_, then that's not a case we need
to worry about, if I understand this code correctly.

Jason

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v21 1/4] mm: add VM_DROPPABLE for designating always lazily freeable mappings
  2024-07-07 18:19     ` Linus Torvalds
  2024-07-07 18:52       ` David Hildenbrand
@ 2024-07-08  1:59       ` Jason A. Donenfeld
  1 sibling, 0 replies; 28+ messages in thread
From: Jason A. Donenfeld @ 2024-07-08  1:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Hildenbrand, linux-kernel, patches, tglx, linux-crypto,
	linux-api, x86, Greg Kroah-Hartman, Adhemerval Zanella Netto,
	Carlos O'Donell, Florian Weimer, Arnd Bergmann, Jann Horn,
	Christian Brauner, David Hildenbrand, linux-mm

Hi Linus,

On Sun, Jul 07, 2024 at 11:19:46AM -0700, Linus Torvalds wrote:
>  (c) we could actually use the PROT_xyz bits, which we have a ton of

As I just wrote to David, I'll move this to PROT_xyz.

By the way, in addition to the PROT_SEM historical artifact, there are
also architecture-specific ones like PROT_ADI on SPARC, PROT_SAO on
PowerPC, and PROT_BTI and PROT_MTE on arm64. So the MAP_ vs PROT_
distinction seems kind of blurred anyway.

> And we actually have had the "get signal on access" before: that's
> what VM_DONTCOPY is.
> 
> And it was the *less* useful model, which is why we added
> VM_WIPEONCOPY, because that's the semantics people actually wanted.
> 
> So I think the "signal on thrown out data access" is interesting, but
> not necessarily the *more* interesting case.

FYI, I looked into using VM_DONTCOPY/MADV_DONTFORK for my purposes,
because it could possibly make another problem easier, but I couldn't
figure out how to make it smoothly work.

Specifically, a program has a bunch of threads, and some of them have a
vgetrandom state in use, carved out of the same page. One of the threads
forks. In the VM_WIPEONFORK case, the fork child has to reclaim the
states that were in use by other threads at the time of the fork and
return them to the pool of available slices. In the VM_DONTCOPY case,
that's not necessary, which is kind of nice. But if the program forked
in the signal handler and then returned to an in progress vgetrandom
operation, now there's a signal that needs to be handled internally,
identified as belonging to the internal state areas, and not bubbled up
to other code. This seems difficult and fraught. It's far easier to just
have the memory be zeroed and have the code unconditionally check for
that at the same time it's doing other consistency checks.

So yea it just seems a lot more desirable to have the behavior be
zeroing rather than an asynchronous signal, because code can
straightforwardly deal with that inline.
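
In code, that inline check is basically free, since it folds into the
reseed test that has to happen anyway -- a hypothetical sketch (field
names made up here; the real logic lives in lib/vdso/getrandom.c):

	if (READ_ONCE(state->generation) != READ_ONCE(rng_data->generation)) {
		/*
		 * A state that was wiped by fork, dropped under memory
		 * pressure, or simply never seeded reads back as zeros,
		 * so all of those cases land here and get reseeded,
		 * with no signal handling anywhere.
		 */
		reseed_from_kernel(state);
	}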

Jason

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v21 1/4] mm: add VM_DROPPABLE for designating always lazily freeable mappings
  2024-07-08  0:08             ` Linus Torvalds
@ 2024-07-08  8:11               ` David Hildenbrand
  2024-07-08  8:23                 ` David Hildenbrand
  2024-07-08 13:55                 ` Jason A. Donenfeld
  2024-07-08 13:50               ` Jason A. Donenfeld
  1 sibling, 2 replies; 28+ messages in thread
From: David Hildenbrand @ 2024-07-08  8:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jason A. Donenfeld, linux-kernel, patches, tglx, linux-crypto,
	linux-api, x86, Greg Kroah-Hartman, Adhemerval Zanella Netto,
	Carlos O'Donell, Florian Weimer, Arnd Bergmann, Jann Horn,
	Christian Brauner, David Hildenbrand, linux-mm

On 08.07.24 02:08, Linus Torvalds wrote:
> On Sun, 7 Jul 2024 at 14:01, David Hildenbrand <david@redhat.com> wrote:
>>
>> At least MAP_DROPPABLE doesn't quite make sense with hugetlb, but at least
>> the other ones do have semantics with hugetlb?
> 
> Hmm.
> 
> How about we just say that VM_DROPPABLE really is something separate
> from MAP_PRIVATE or MAP_SHARED..

So it would essentially imply MAP_ANON|MAP_PRIVATE, without
COW (not shared with a child process).

Then, we should ignore any fd+offset that is passed (or bail out); I
assume (without having dived into the code) that's what your proposal
below does automatically.

> 
> And then we make the rule be that VM_DROPPABLE is never dumped and
> always dropped on fork, just to make things simpler.

The semantics are much more intuitive. No need for separate mmap flags.

> 
> It not only avoids a flag, but it actually makes sense: the pages
> aren't stable for dumping anyway, and not copying them on fork() not
> only avoids some overhead, but makes it much more reliable and
> testable.
> 
> IOW, how about taking this approach:
> 
>     --- a/include/uapi/linux/mman.h
>     +++ b/include/uapi/linux/mman.h
>     @@ -17,5 +17,6 @@
>      #define MAP_SHARED  0x01            /* Share changes */
>      #define MAP_PRIVATE 0x02            /* Changes are private */
>      #define MAP_SHARED_VALIDATE 0x03    /* share + validate extension flags */
>     +#define MAP_DROPPABLE       0x08    /* 4 is not in MAP_TYPE on parisc? */
> 
>      /*
> 
> with do_mmap() doing:
> 
>     --- a/mm/mmap.c
>     +++ b/mm/mmap.c
>     @@ -1369,6 +1369,23 @@ unsigned long do_mmap(struct file *file,
>                          pgoff = 0;
>                          vm_flags |= VM_SHARED | VM_MAYSHARE;
>                          break;
>     +            case MAP_DROPPABLE:
>     +                    /*
>     +                     * A locked or stack area makes no sense to
>     +                     * be droppable.
>     +                     *
>     +                     * Also, since droppable pages can just go
>     +                     * away at any time, it makes no sense to
>     +                     * copy them on fork or dump them.
>     +                     */
>     +                    if (flags & MAP_LOCKED)
>     +                            return -EINVAL;

Likely we'll have to adjust mlock() as well. Also, I think we should 
just bail out with hugetlb as well.

>     +                    if (vm_flags & (VM_GROWSDOWN|VM_GROWSUP))
>     +                            return -EINVAL;
>     +
>     +                    vm_flags |= VM_DROPPABLE;
>     +                    vm_flags |= VM_WIPEONFORK | VM_DONTDUMP;

Further, maybe we want to disallow madvise() clearing these flags here, 
just to be consistent.

>     +                    fallthrough;
>                  case MAP_PRIVATE:
>                          /*
>                           * Set pgoff according to addr for anon_vma.
> 
> which looks rather simple.
> 
> The only oddity is that parisc thing - every other architecture has the
> MAP_TYPE bits being 0xf, but parisc uses 0x2b (also four bits, but
> instead of the low four bits it's 00101011 - strange).

I assume changing that would risk breaking stupid user space, right? 
(i.e., user space that sets a bit without any semantics)

> 
> So using 8 as a MAP_TYPE bit for MAP_DROPPABLE works everywhere, and
> if we eventually want to do a "signaling" MAP_DROPPABLE we could use
> 9.

Sounds good enough.

> 
> This has the added advantage that if somebody does this on an old
> kernel, they *will* get an error. Because unlike the 'flag' bits in
> general, the MAP_TYPE bit space has always been tested.
> 
> Hmm?

As a side note, I'll raise that I am not a particular fan of the 
"droppable" terminology, at least with the "read 0s" approach.

 From a user perspective, the memory might suddenly lose its state and 
read as 0s just like volatile memory when it loses power. "dropping 
pages" sounds more like an implementation detail.

Something like MAP_VOLATILE might be more intuitive (similar to the 
proposed MADV_VOLATILE).

But naming is hard, just mentioning to share my thought :)

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v21 1/4] mm: add VM_DROPPABLE for designating always lazily freeable mappings
  2024-07-08  8:11               ` David Hildenbrand
@ 2024-07-08  8:23                 ` David Hildenbrand
  2024-07-08 13:57                   ` Jason A. Donenfeld
  2024-07-08 13:55                 ` Jason A. Donenfeld
  1 sibling, 1 reply; 28+ messages in thread
From: David Hildenbrand @ 2024-07-08  8:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jason A. Donenfeld, linux-kernel, patches, tglx, linux-crypto,
	linux-api, x86, Greg Kroah-Hartman, Adhemerval Zanella Netto,
	Carlos O'Donell, Florian Weimer, Arnd Bergmann, Jann Horn,
	Christian Brauner, David Hildenbrand, linux-mm

> As a side note, I'll raise that I am not a particular fan of the
> "droppable" terminology, at least with the "read 0s" approach.
> 
>   From a user perspective, the memory might suddenly lose its state and
> read as 0s just like volatile memory when it loses power. "dropping
> pages" sounds more like an implementation detail.

Just to raise why I consider "dropping" an implementation detail: in 
combination with a previous idea I had of exposing "nonvolatile" memory 
to VMs, the following might be interesting:

A hypervisor could expose special "nonvolatile memory" as a separate 
guest-physical memory region to a VM.

We could use that special memory to back these MAP_XXX regions in our 
guest, in addition to trying to make use of them in the guest kernel, 
for example for something similar to cleancache.

Long story short: it's the hypervisor that could be effectively 
dropping/zeroing out that memory, not the guest VM. "NONVOLATILE" might 
be clearer than "DROPPABLE".

But again, naming is hard ... :)

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v21 1/4] mm: add VM_DROPPABLE for designating always lazily freeable mappings
  2024-07-08  0:08             ` Linus Torvalds
  2024-07-08  8:11               ` David Hildenbrand
@ 2024-07-08 13:50               ` Jason A. Donenfeld
  1 sibling, 0 replies; 28+ messages in thread
From: Jason A. Donenfeld @ 2024-07-08 13:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Hildenbrand, linux-kernel, patches, tglx, linux-crypto,
	linux-api, x86, Greg Kroah-Hartman, Adhemerval Zanella Netto,
	Carlos O'Donell, Florian Weimer, Arnd Bergmann, Jann Horn,
	Christian Brauner, David Hildenbrand, linux-mm

Hi Linus,

On Sun, Jul 07, 2024 at 05:08:29PM -0700, Linus Torvalds wrote:
>    +                    vm_flags |= VM_DROPPABLE;
>    +                    vm_flags |= VM_WIPEONFORK | VM_DONTDUMP;
> which looks rather simple.

That is nice, though I would add that if we're implying things that are
sensible to imply, it really needs to add VM_NORESERVE too.
DROPPABLE doesn't make sense semantically without it.

Anyway, rather than adding PROT_xyz for v+1, I'll try adding this
MAP_DROPPABLE (or a different name for David) with the implications as
you've suggested.
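
For callers, that would then look roughly like the following; a sketch
only, since the flag's value and the exact fd/MAP_ANONYMOUS semantics
are still being settled in this thread:

#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_DROPPABLE
#define MAP_DROPPABLE 0x08	/* value proposed above */
#endif

/*
 * Returns memory whose contents the kernel may discard under memory
 * pressure, afterwards reading back as zeros. Old kernels reject the
 * unknown MAP_TYPE value with EINVAL, so failure doubles as a feature
 * probe.
 */
static void *alloc_droppable(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_DROPPABLE | MAP_ANONYMOUS, -1, 0);
	return p == MAP_FAILED ? NULL : p;
}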

Jason

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v21 1/4] mm: add VM_DROPPABLE for designating always lazily freeable mappings
  2024-07-08  8:11               ` David Hildenbrand
  2024-07-08  8:23                 ` David Hildenbrand
@ 2024-07-08 13:55                 ` Jason A. Donenfeld
  2024-07-08 14:40                   ` Jason A. Donenfeld
  2024-07-08 20:06                   ` David Hildenbrand
  1 sibling, 2 replies; 28+ messages in thread
From: Jason A. Donenfeld @ 2024-07-08 13:55 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Linus Torvalds, linux-kernel, patches, tglx, linux-crypto,
	linux-api, x86, Greg Kroah-Hartman, Adhemerval Zanella Netto,
	Carlos O'Donell, Florian Weimer, Arnd Bergmann, Jann Horn,
	Christian Brauner, David Hildenbrand, linux-mm

Hi David,

On Mon, Jul 08, 2024 at 10:11:24AM +0200, David Hildenbrand wrote:
> The semantics are much more intuitive. No need for separate mmap flags.

Agreed.
 
> Likely we'll have to adjust mlock() as well. Also, I think we should 
> just bail out with hugetlb as well.

Ack.

> Further, maybe we want to disallow madvise() clearing these flags here, 
> just to be consistent.

Good thinking.

> As a side note, I'll raise that I am not a particular fan of the 
> "droppable" terminology, at least with the "read 0s" approach.
> 
>  From a user perspective, the memory might suddenly lose its state and 
> read as 0s just like volatile memory when it loses power. "dropping 
> pages" sounds more like an implementation detail.
> 
> Something like MAP_VOLATILE might be more intuitive (similar to the 
> proposed MADV_VOLATILE).
> 
> But naming is hard, just mentioning to share my thought :)

Naming is hard, but *renaming* is annoying. I like droppable simply
because that's what I've been calling it in my head. MAP_VOLATILE is
fine with me though, and seems reasonable enough. So I'll name it that,
and then please don't change your mind about it later so I won't have to
rename everything again. :)

Jason

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v21 1/4] mm: add VM_DROPPABLE for designating always lazily freeable mappings
  2024-07-08  8:23                 ` David Hildenbrand
@ 2024-07-08 13:57                   ` Jason A. Donenfeld
  2024-07-08 20:05                     ` David Hildenbrand
  0 siblings, 1 reply; 28+ messages in thread
From: Jason A. Donenfeld @ 2024-07-08 13:57 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Linus Torvalds, linux-kernel, patches, tglx, linux-crypto,
	linux-api, x86, Greg Kroah-Hartman, Adhemerval Zanella Netto,
	Carlos O'Donell, Florian Weimer, Arnd Bergmann, Jann Horn,
	Christian Brauner, David Hildenbrand, linux-mm

On Mon, Jul 08, 2024 at 10:23:10AM +0200, David Hildenbrand wrote:
> > As a side note, I'll raise that I am not a particular fan of the
> > "droppable" terminology, at least with the "read 0s" approach.
> > 
> >   From a user perspective, the memory might suddenly lose its state and
> > read as 0s just like volatile memory when it loses power. "dropping
> > pages" sounds more like an implementation detail.
>
> Long story short: it's the hypervisor that could be effectively 
> dropping/zeroing out that memory, not the guest VM. "NONVOLATILE" might 
> be clearer than "DROPPABLE".

Surely you mean "VOLATILE", not "NONVOLATILE", right?

Jason

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v21 1/4] mm: add VM_DROPPABLE for designating always lazily freeable mappings
  2024-07-08 13:55                 ` Jason A. Donenfeld
@ 2024-07-08 14:40                   ` Jason A. Donenfeld
  2024-07-08 20:21                     ` David Hildenbrand
  2024-07-08 20:06                   ` David Hildenbrand
  1 sibling, 1 reply; 28+ messages in thread
From: Jason A. Donenfeld @ 2024-07-08 14:40 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Linus Torvalds, linux-kernel, patches, tglx, linux-crypto,
	linux-api, x86, Greg Kroah-Hartman, Adhemerval Zanella Netto,
	Carlos O'Donell, Florian Weimer, Arnd Bergmann, Jann Horn,
	Christian Brauner, David Hildenbrand, linux-mm

Hi David, Linus,

Below is what I understand the suggestions about the UX to be. The full
commit is in https://git.zx2c4.com/linux-rng/log/ but here's the part
we've been discussing. I've held off on David's suggestion of changing
"DROPPABLE" to "VOLATILE" to give Linus some time to wake up on the west
coast and voice his preference for "DROPPABLE". But the rest is in
place.

Jason

diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
index a246e11988d5..e89d00528f2f 100644
--- a/include/uapi/linux/mman.h
+++ b/include/uapi/linux/mman.h
@@ -17,6 +17,7 @@
 #define MAP_SHARED	0x01		/* Share changes */
 #define MAP_PRIVATE	0x02		/* Changes are private */
 #define MAP_SHARED_VALIDATE 0x03	/* share + validate extension flags */
+#define MAP_DROPPABLE	0x08		/* Zero memory under memory pressure. */
 
 /*
  * Huge page size encoding when MAP_HUGETLB is specified, and a huge page
diff --git a/mm/madvise.c b/mm/madvise.c
index a77893462b92..cba5bc652fc4 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1068,13 +1068,16 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
 		new_flags |= VM_WIPEONFORK;
 		break;
 	case MADV_KEEPONFORK:
+		if (vma->vm_flags & VM_DROPPABLE)
+			return -EINVAL;
 		new_flags &= ~VM_WIPEONFORK;
 		break;
 	case MADV_DONTDUMP:
 		new_flags |= VM_DONTDUMP;
 		break;
 	case MADV_DODUMP:
-		if (!is_vm_hugetlb_page(vma) && new_flags & VM_SPECIAL)
+		if ((!is_vm_hugetlb_page(vma) && new_flags & VM_SPECIAL) ||
+		    (vma->vm_flags & VM_DROPPABLE))
 			return -EINVAL;
 		new_flags &= ~VM_DONTDUMP;
 		break;
diff --git a/mm/mlock.c b/mm/mlock.c
index 30b51cdea89d..b87b3d8cc9cc 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -485,7 +485,7 @@ static int mlock_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma,
 
 	if (newflags == oldflags || (oldflags & VM_SPECIAL) ||
 	    is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm) ||
-	    vma_is_dax(vma) || vma_is_secretmem(vma))
+	    vma_is_dax(vma) || vma_is_secretmem(vma) || (oldflags & VM_DROPPABLE))
 		/* don't set VM_LOCKED or VM_LOCKONFAULT and don't count */
 		goto out;
 
diff --git a/mm/mmap.c b/mm/mmap.c
index 83b4682ec85c..b3d38179dd42 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1369,6 +1369,34 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 			pgoff = 0;
 			vm_flags |= VM_SHARED | VM_MAYSHARE;
 			break;
+		case MAP_DROPPABLE:
+			/*
+			 * A locked or stack area makes no sense to be droppable.
+			 *
+			 * Also, since droppable pages can just go away at any time
+			 * it makes no sense to copy them on fork or dump them.
+			 *
+			 * And don't attempt to combine with hugetlb for now.
+			 */
+			if (flags & (MAP_LOCKED | MAP_HUGETLB))
+			        return -EINVAL;
+			if (vm_flags & (VM_GROWSDOWN | VM_GROWSUP))
+			        return -EINVAL;
+
+			vm_flags |= VM_DROPPABLE;
+
+			/*
+			 * If the pages can be dropped, then it doesn't make
+			 * sense to reserve them.
+			 */
+			vm_flags |= VM_NORESERVE;
+
+			/*
+			 * Likewise, they're volatile enough that they
+			 * shouldn't survive forks or coredumps.
+			 */
+			vm_flags |= VM_WIPEONFORK | VM_DONTDUMP;
+			fallthrough;
 		case MAP_PRIVATE:
 			/*
 			 * Set pgoff according to addr for anon_vma.


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v21 1/4] mm: add VM_DROPPABLE for designating always lazily freeable mappings
  2024-07-08 13:57                   ` Jason A. Donenfeld
@ 2024-07-08 20:05                     ` David Hildenbrand
  0 siblings, 0 replies; 28+ messages in thread
From: David Hildenbrand @ 2024-07-08 20:05 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Linus Torvalds, linux-kernel, patches, tglx, linux-crypto,
	linux-api, x86, Greg Kroah-Hartman, Adhemerval Zanella Netto,
	Carlos O'Donell, Florian Weimer, Arnd Bergmann, Jann Horn,
	Christian Brauner, David Hildenbrand, linux-mm

On 08.07.24 15:57, Jason A. Donenfeld wrote:
> On Mon, Jul 08, 2024 at 10:23:10AM +0200, David Hildenbrand wrote:
>>> As a side note, I'll raise that I am not a particular fan of the
>>> "droppable" terminology, at least with the "read 0s" approach.
>>>
>>>    From a user perspective, the memory might suddenly lose its state and
>>> read as 0s just like volatile memory when it loses power. "dropping
>>> pages" sounds more like an implementation detail.
>>
>> Long story short: it's the hypervisor that could be effectively
>> dropping/zeroing out that memory, not the guest VM. "NONVOLATILE" might
>> be clearer than "DROPPABLE".
> 
> Surely you mean "VOLATILE", not "NONVOLATILE", right?

Yes, typo :)

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v21 1/4] mm: add VM_DROPPABLE for designating always lazily freeable mappings
  2024-07-08 13:55                 ` Jason A. Donenfeld
  2024-07-08 14:40                   ` Jason A. Donenfeld
@ 2024-07-08 20:06                   ` David Hildenbrand
  1 sibling, 0 replies; 28+ messages in thread
From: David Hildenbrand @ 2024-07-08 20:06 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Linus Torvalds, linux-kernel, patches, tglx, linux-crypto,
	linux-api, x86, Greg Kroah-Hartman, Adhemerval Zanella Netto,
	Carlos O'Donell, Florian Weimer, Arnd Bergmann, Jann Horn,
	Christian Brauner, David Hildenbrand, linux-mm

On 08.07.24 15:55, Jason A. Donenfeld wrote:
> Hi David,
> 
> On Mon, Jul 08, 2024 at 10:11:24AM +0200, David Hildenbrand wrote:
>> The semantics are much more intuitive. No need for separate mmap flags.
> 
> Agreed.
>   
>> Likely we'll have to adjust mlock() as well. Also, I think we should
>> just bail out with hugetlb as well.
> 
> Ack.
> 
>> Further, maybe we want to disallow madvise() clearing these flags here,
>> just to be consistent.
> 
> Good thinking.
> 
>> As a side note, I'll raise that I am not a particular fan of the
>> "droppable" terminology, at least with the "read 0s" approach.
>>
>>   From a user perspective, the memory might suddenly lose its state and
>> read as 0s just like volatile memory when it loses power. "dropping
>> pages" sounds more like an implementation detail.
>>
>> Something like MAP_VOLATILE might be more intuitive (similar to the
>> proposed MADV_VOLATILE).
>>
>> But naming is hard, just mentioning to share my thought :)
> 
> Naming is hard, but *renaming* is annoying. I like droppable simply
> because that's what I've been calling it in my head. MAP_VOLATILE is
> fine with me though, and seems reasonable enough. So I'll name it that,
> and then please don't change your mind about it later so I won't have to
> rename everything again. :)

:) Nah. But let's hold off on any renaming until more than one person 
thinks it's a good idea.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v21 1/4] mm: add VM_DROPPABLE for designating always lazily freeable mappings
  2024-07-08 14:40                   ` Jason A. Donenfeld
@ 2024-07-08 20:21                     ` David Hildenbrand
  2024-07-08 20:26                       ` David Hildenbrand
  2024-07-09  2:17                       ` Jason A. Donenfeld
  0 siblings, 2 replies; 28+ messages in thread
From: David Hildenbrand @ 2024-07-08 20:21 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Linus Torvalds, linux-kernel, patches, tglx, linux-crypto,
	linux-api, x86, Greg Kroah-Hartman, Adhemerval Zanella Netto,
	Carlos O'Donell, Florian Weimer, Arnd Bergmann, Jann Horn,
	Christian Brauner, David Hildenbrand, linux-mm

On 08.07.24 16:40, Jason A. Donenfeld wrote:
> Hi David, Linus,
> 
> Below is what I understand the suggestions about the UX to be. The full
> commit is in https://git.zx2c4.com/linux-rng/log/ but here's the part
> we've been discussing. I've held off on David's suggestion of changing
> "DROPPABLE" to "VOLATILE" to give Linus some time to wake up on the west
> coast and voice his preference for "DROPPABLE". But the rest is in
> place.
> 
> Jason
> 
> diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
> index a246e11988d5..e89d00528f2f 100644
> --- a/include/uapi/linux/mman.h
> +++ b/include/uapi/linux/mman.h
> @@ -17,6 +17,7 @@
>   #define MAP_SHARED	0x01		/* Share changes */
>   #define MAP_PRIVATE	0x02		/* Changes are private */
>   #define MAP_SHARED_VALIDATE 0x03	/* share + validate extension flags */
> +#define MAP_DROPPABLE	0x08		/* Zero memory under memory pressure. */
>   
>   /*
>    * Huge page size encoding when MAP_HUGETLB is specified, and a huge page
> diff --git a/mm/madvise.c b/mm/madvise.c
> index a77893462b92..cba5bc652fc4 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1068,13 +1068,16 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
>   		new_flags |= VM_WIPEONFORK;
>   		break;
>   	case MADV_KEEPONFORK:
> +		if (vma->vm_flags & VM_DROPPABLE)
> +			return -EINVAL;
>   		new_flags &= ~VM_WIPEONFORK;
>   		break;
>   	case MADV_DONTDUMP:
>   		new_flags |= VM_DONTDUMP;
>   		break;
>   	case MADV_DODUMP:
> -		if (!is_vm_hugetlb_page(vma) && new_flags & VM_SPECIAL)
> +		if ((!is_vm_hugetlb_page(vma) && new_flags & VM_SPECIAL) ||
> +		    (vma->vm_flags & VM_DROPPABLE))
>   			return -EINVAL;
>   		new_flags &= ~VM_DONTDUMP;
>   		break;
> diff --git a/mm/mlock.c b/mm/mlock.c
> index 30b51cdea89d..b87b3d8cc9cc 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -485,7 +485,7 @@ static int mlock_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma,
>   
>   	if (newflags == oldflags || (oldflags & VM_SPECIAL) ||
>   	    is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm) ||
> -	    vma_is_dax(vma) || vma_is_secretmem(vma))
> +	    vma_is_dax(vma) || vma_is_secretmem(vma) || (oldflags & VM_DROPPABLE))
>   		/* don't set VM_LOCKED or VM_LOCKONFAULT and don't count */
>   		goto out;
>   
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 83b4682ec85c..b3d38179dd42 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1369,6 +1369,34 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
>   			pgoff = 0;
>   			vm_flags |= VM_SHARED | VM_MAYSHARE;
>   			break;
> +		case MAP_DROPPABLE:
> +			/*
> +			 * A locked or stack area makes no sense to be droppable.
> +			 *
> +			 * Also, since droppable pages can just go away at any time
> +			 * it makes no sense to copy them on fork or dump them.
> +			 *
> +			 * And don't attempt to combine with hugetlb for now.
> +			 */
> +			if (flags & (MAP_LOCKED | MAP_HUGETLB))
> +			        return -EINVAL;
> +			if (vm_flags & (VM_GROWSDOWN | VM_GROWSUP))
> +			        return -EINVAL;
> +
> +			vm_flags |= VM_DROPPABLE;
> +
> +			/*
> +			 * If the pages can be dropped, then it doesn't make
> +			 * sense to reserve them.
> +			 */
> +			vm_flags |= VM_NORESERVE;

That is certainly interesting. Nothing that we might not be able to 
reclaim these pages reliably in all cases: for example when long-term 
pinning them.

In some environments (OVERCOMMIT_NEVER) MAP_NORESERVE would never be 
effective. I wonder if we want to stick to the same behavior here ... 
but in theory I agree that we can set this here unconditionally, it's 
just the corner case of "there are ways to prohibit reclaim" that makes 
me wonder.

BTW, I was just trying to understand how MADV_FREE + MAP_DROPPABLE would 
behave without any swap space around.

Did you experiment with that?

I'm reading can_reclaim_anon_pages(), and I'm wondering how 
well/reliably that works when there is no swap configured.

Also, the comment in get_scan_count(): "If we have no swap space, do not 
bother scanning anon folios." makes me wonder if some work in that area 
is needed.
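
For reference, the part of get_scan_count() I'm looking at is roughly
this (a paraphrased excerpt from mm/vmscan.c, not exact code):

	/* If we have no swap space, do not bother scanning anon folios. */
	if (!sc->may_swap || !can_reclaim_anon_pages(memcg, pgdat->node_id, sc)) {
		scan_balance = SCAN_FILE;
		goto out;
	}

which is why I'd have naively expected anon folios to be skipped
entirely on a swapless system.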

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v21 1/4] mm: add VM_DROPPABLE for designating always lazily freeable mappings
  2024-07-08  1:46     ` Jason A. Donenfeld
@ 2024-07-08 20:24       ` David Hildenbrand
  0 siblings, 0 replies; 28+ messages in thread
From: David Hildenbrand @ 2024-07-08 20:24 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: linux-kernel, patches, tglx, linux-crypto, linux-api, x86,
	Linus Torvalds, Greg Kroah-Hartman, Adhemerval Zanella Netto,
	Carlos O'Donell, Florian Weimer, Arnd Bergmann, Jann Horn,
	Christian Brauner, David Hildenbrand, linux-mm

On 08.07.24 03:46, Jason A. Donenfeld wrote:
> Hi David,

Hi Jason,

just catching up on mails here. Most of the stuff is now clear from the 
other subthread.

[...]

>>> @@ -1851,7 +1858,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>>    				 * discarded. Remap the page to page table.
>>>    				 */
>>>    				set_pte_at(mm, address, pvmw.pte, pteval);
>>> -				folio_set_swapbacked(folio);
>>> +				/* Unlike MADV_FREE mappings, VM_DROPPABLE ones
>>> +				 * never get swap backed on failure to drop. */
>>> +				if (!(vma->vm_flags & VM_DROPPABLE))
>>> +					folio_set_swapbacked(folio);
>>>    				ret = false;
>>>    				page_vma_mapped_walk_done(&pvmw);
>>>    				break;
>>
>> A note that in mm/mm-stable, "madvise_free_huge_pmd" exists to optimize
>> MADV_FREE on PMDs. I suspect we'd want to extend that one as well for
>> dropping support, but likely it would also only be a performance
>> improvement and not affect functionality if not handled.
> 
> That's for doing the freeing of PTEs after the fact, right? If the
> mapping was created, got filled with some data, and then sometime later
> it got MADV_FREE'd, which is the pattern people typically follow with
> MADV_FREE. If we do this as PROT_/MAP_, then that's not a case we need
> to worry about, if I understand this code correctly.

We essentially now have code to handle PMD-mapped THP: instead of first 
remapping them using PTEs to then unmap+discard via 512 PTEs (due to 
MADV_FREE being set on the folio), we can now simply unmap+discard a 
single PMD. So performance-wise, this might be interesting for this 
mechanism as well (when used in combination with THP).

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v21 1/4] mm: add VM_DROPPABLE for designating always lazily freeable mappings
  2024-07-08 20:21                     ` David Hildenbrand
@ 2024-07-08 20:26                       ` David Hildenbrand
  2024-07-09  2:17                       ` Jason A. Donenfeld
  1 sibling, 0 replies; 28+ messages in thread
From: David Hildenbrand @ 2024-07-08 20:26 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Linus Torvalds, linux-kernel, patches, tglx, linux-crypto,
	linux-api, x86, Greg Kroah-Hartman, Adhemerval Zanella Netto,
	Carlos O'Donell, Florian Weimer, Arnd Bergmann, Jann Horn,
	Christian Brauner, David Hildenbrand, linux-mm

On 08.07.24 22:21, David Hildenbrand wrote:
> On 08.07.24 16:40, Jason A. Donenfeld wrote:
>> Hi David, Linus,
>>
>> Below is what I understand the suggestions about the UX to be. The full
>> commit is in https://git.zx2c4.com/linux-rng/log/ but here's the part
>> we've been discussing. I've held off on David's suggestion of changing
>> "DROPPABLE" to "VOLATILE" to give Linus some time to wake up on the west
>> coast and voice his preference for "DROPPABLE". But the rest is in
>> place.
>>
>> Jason
>>
>> diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
>> index a246e11988d5..e89d00528f2f 100644
>> --- a/include/uapi/linux/mman.h
>> +++ b/include/uapi/linux/mman.h
>> @@ -17,6 +17,7 @@
>>    #define MAP_SHARED	0x01		/* Share changes */
>>    #define MAP_PRIVATE	0x02		/* Changes are private */
>>    #define MAP_SHARED_VALIDATE 0x03	/* share + validate extension flags */
>> +#define MAP_DROPPABLE	0x08		/* Zero memory under memory pressure. */
>>    
>>    /*
>>     * Huge page size encoding when MAP_HUGETLB is specified, and a huge page
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index a77893462b92..cba5bc652fc4 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -1068,13 +1068,16 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
>>    		new_flags |= VM_WIPEONFORK;
>>    		break;
>>    	case MADV_KEEPONFORK:
>> +		if (vma->vm_flags & VM_DROPPABLE)
>> +			return -EINVAL;
>>    		new_flags &= ~VM_WIPEONFORK;
>>    		break;
>>    	case MADV_DONTDUMP:
>>    		new_flags |= VM_DONTDUMP;
>>    		break;
>>    	case MADV_DODUMP:
>> -		if (!is_vm_hugetlb_page(vma) && new_flags & VM_SPECIAL)
>> +		if ((!is_vm_hugetlb_page(vma) && new_flags & VM_SPECIAL) ||
>> +		    (vma->vm_flags & VM_DROPPABLE))
>>    			return -EINVAL;
>>    		new_flags &= ~VM_DONTDUMP;
>>    		break;
>> diff --git a/mm/mlock.c b/mm/mlock.c
>> index 30b51cdea89d..b87b3d8cc9cc 100644
>> --- a/mm/mlock.c
>> +++ b/mm/mlock.c
>> @@ -485,7 +485,7 @@ static int mlock_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma,
>>    
>>    	if (newflags == oldflags || (oldflags & VM_SPECIAL) ||
>>    	    is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm) ||
>> -	    vma_is_dax(vma) || vma_is_secretmem(vma))
>> +	    vma_is_dax(vma) || vma_is_secretmem(vma) || (oldflags & VM_DROPPABLE))
>>    		/* don't set VM_LOCKED or VM_LOCKONFAULT and don't count */
>>    		goto out;
>>    
>> diff --git a/mm/mmap.c b/mm/mmap.c
>> index 83b4682ec85c..b3d38179dd42 100644
>> --- a/mm/mmap.c
>> +++ b/mm/mmap.c
>> @@ -1369,6 +1369,34 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
>>    			pgoff = 0;
>>    			vm_flags |= VM_SHARED | VM_MAYSHARE;
>>    			break;
>> +		case MAP_DROPPABLE:
>> +			/*
>> +			 * A locked or stack area makes no sense to be droppable.
>> +			 *
>> +			 * Also, since droppable pages can just go away at any time
>> +			 * it makes no sense to copy them on fork or dump them.
>> +			 *
>> +			 * And don't attempt to combine with hugetlb for now.
>> +			 */
>> +			if (flags & (MAP_LOCKED | MAP_HUGETLB))
>> +			        return -EINVAL;
>> +			if (vm_flags & (VM_GROWSDOWN | VM_GROWSUP))
>> +			        return -EINVAL;
>> +
>> +			vm_flags |= VM_DROPPABLE;
>> +
>> +			/*
>> +			 * If the pages can be dropped, then it doesn't make
>> +			 * sense to reserve them.
>> +			 */
>> +			vm_flags |= VM_NORESERVE;
> 
> That is certainly interesting. Nothing that we might not be able to

"Nothing" -> "I'll note that" :)

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v21 1/4] mm: add VM_DROPPABLE for designating always lazily freeable mappings
  2024-07-08 20:21                     ` David Hildenbrand
  2024-07-08 20:26                       ` David Hildenbrand
@ 2024-07-09  2:17                       ` Jason A. Donenfeld
  2024-07-10  3:05                         ` David Hildenbrand
  1 sibling, 1 reply; 28+ messages in thread
From: Jason A. Donenfeld @ 2024-07-09  2:17 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Linus Torvalds, linux-kernel, patches, tglx, linux-crypto,
	linux-api, x86, Greg Kroah-Hartman, Adhemerval Zanella Netto,
	Carlos O'Donell, Florian Weimer, Arnd Bergmann, Jann Horn,
	Christian Brauner, David Hildenbrand, linux-mm

Hi David,

On Mon, Jul 08, 2024 at 10:21:09PM +0200, David Hildenbrand wrote:
> BTW, I was just trying to understand how MADV_FREE + MAP_DROPPABLE would 
> behave without any swap space around.
> 
> Did you experiment with that?

You mean on a system without any swap configured? That's actually my
primary test environment for this. It behaves as expected: when RAM
fills up and the scanner is trying to reclaim what it can,
folio_test_swapbacked(folio) is false, and the memory gets freed.
Afterwards, reads fault in a zero page. So it's working as expected.
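
If it helps, the shape of the test is basically the following; a sketch
of the idea only, not the actual test code, with the memory-pressure
part elided as a comment:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MAP_DROPPABLE
#define MAP_DROPPABLE 0x08
#endif

int main(void)
{
	size_t len = 1UL << 20;
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_DROPPABLE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	memset(p, 'A', len);

	/* ... allocate and touch enough other memory to force reclaim ... */

	/*
	 * Reclaim frees these folios rather than swapping them (they are
	 * !folio_test_swapbacked()), so reads now fault in zeros.
	 */
	for (size_t i = 0; i < len; i++) {
		if (p[i] != 'A') {
			puts("dropped: reads back as zeros");
			return 0;
		}
	}
	puts("not dropped (insufficient memory pressure?)");
	return 0;
}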

Jason

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v21 1/4] mm: add VM_DROPPABLE for designating always lazily freeable mappings
  2024-07-09  2:17                       ` Jason A. Donenfeld
@ 2024-07-10  3:05                         ` David Hildenbrand
  2024-07-10  3:34                           ` Jason A. Donenfeld
  0 siblings, 1 reply; 28+ messages in thread
From: David Hildenbrand @ 2024-07-10  3:05 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Linus Torvalds, linux-kernel, patches, tglx, linux-crypto,
	linux-api, x86, Greg Kroah-Hartman, Adhemerval Zanella Netto,
	Carlos O'Donell, Florian Weimer, Arnd Bergmann, Jann Horn,
	Christian Brauner, David Hildenbrand, linux-mm

On 09.07.24 04:17, Jason A. Donenfeld wrote:
> Hi David,
> 
> On Mon, Jul 08, 2024 at 10:21:09PM +0200, David Hildenbrand wrote:
>> BTW, I was just trying to understand how MADV_FREE + MAP_DROPPABLE would
>> behave without any swap space around.
>>
>> Did you experiment with that?
> 
> You mean on a system without any swap configured? That's actually my
> primary test environment for this. It behaves as expected: when RAM
> fills up and the scanner is trying to reclaim what it can,
> folio_test_swapbacked(folio) is false, and the memory gets freed.
> Afterwards, reads fault in a zero page. So it's working as expected.

Okay, just to be clear: no swap/zram/zswap. The reclaim code regarding 
not scanning anonymous memory without swap was a bit confusing.

thanks!

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v21 1/4] mm: add VM_DROPPABLE for designating always lazily freeable mappings
  2024-07-10  3:05                         ` David Hildenbrand
@ 2024-07-10  3:34                           ` Jason A. Donenfeld
  2024-07-10  3:53                             ` David Hildenbrand
  0 siblings, 1 reply; 28+ messages in thread
From: Jason A. Donenfeld @ 2024-07-10  3:34 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Linus Torvalds, linux-kernel, patches, tglx, linux-crypto,
	linux-api, x86, Greg Kroah-Hartman, Adhemerval Zanella Netto,
	Carlos O'Donell, Florian Weimer, Arnd Bergmann, Jann Horn,
	Christian Brauner, David Hildenbrand, linux-mm

On Wed, Jul 10, 2024 at 05:05:06AM +0200, David Hildenbrand wrote:
> On 09.07.24 04:17, Jason A. Donenfeld wrote:
> > Hi David,
> > 
> > On Mon, Jul 08, 2024 at 10:21:09PM +0200, David Hildenbrand wrote:
> >> BTW, I was just trying to understand how MADV_FREE + MAP_DROPPABLE would
> >> behave without any swap space around.
> >>
> >> Did you experiment with that?
> > 
> > You mean on a system without any swap configured? That's actually my
> > primary test environment for this. It behaves as expected: when RAM
> > fills up and the scanner is trying to reclaim what it can,
> > folio_test_swapbacked(folio) is false, and the memory gets freed.
> > Afterwards, reads fault in a zero page. So it's working as expected.
> 
> Okay, just to be clear: no swap/zram/zswap. The reclaim code regarding 
> not scanning anonymous memory without swap was a bit confusing.

Right, no swap, as boring a system as can be. I've experimented with
that behavior on my swap-less 64GB ThinkPad, as well as on small
special-purpose VMs, where I hacked the VM_DROPPABLE test code into the
wireguard test suite.

Jason

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v21 1/4] mm: add VM_DROPPABLE for designating always lazily freeable mappings
  2024-07-10  3:34                           ` Jason A. Donenfeld
@ 2024-07-10  3:53                             ` David Hildenbrand
  0 siblings, 0 replies; 28+ messages in thread
From: David Hildenbrand @ 2024-07-10  3:53 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Linus Torvalds, linux-kernel, patches, tglx, linux-crypto,
	linux-api, x86, Greg Kroah-Hartman, Adhemerval Zanella Netto,
	Carlos O'Donell, Florian Weimer, Arnd Bergmann, Jann Horn,
	Christian Brauner, David Hildenbrand, linux-mm

On 10.07.24 05:34, Jason A. Donenfeld wrote:
> On Wed, Jul 10, 2024 at 05:05:06AM +0200, David Hildenbrand wrote:
>> On 09.07.24 04:17, Jason A. Donenfeld wrote:
>>> Hi David,
>>>
>>> On Mon, Jul 08, 2024 at 10:21:09PM +0200, David Hildenbrand wrote:
>>>> BTW, I was just trying to understand how MADV_FREE + MAP_DROPPABLE would
>>>> behave without any swap space around.
>>>>
>>>> Did you experiment with that?
>>>
>>> You mean on a system without any swap configured? That's actually my
>>> primary test environment for this. It behaves as expected: when RAM
>>> fills up and the scanner is trying to reclaim what it can,
>>> folio_test_swapbacked(folio) is false, and the memory gets freed.
>>> Afterwards, reads fault in a zero page. So it's working as expected.
>>
>> Okay, just to be clear: no swap/zram/zswap. The reclaim code regarding
>> not scanning anonymous memory without swap was a bit confusing.
> 
> Right, no swap, as boring a system as can be. I've experimented with
> that behavior on my swap-less 64GB thinkpad, as well as on little
> special purpose VMs, where I hacked the VM_DROPPABLE test code into the
> wireguard test suite.

Great, thanks!

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread

Thread overview: 28+ messages
2024-07-07  0:26 [PATCH v21 0/4] implement getrandom() in vDSO Jason A. Donenfeld
2024-07-07  0:26 ` [PATCH v21 1/4] mm: add VM_DROPPABLE for designating always lazily freeable mappings Jason A. Donenfeld
2024-07-07  7:42   ` David Hildenbrand
2024-07-07 18:19     ` Linus Torvalds
2024-07-07 18:52       ` David Hildenbrand
2024-07-07 19:22         ` Linus Torvalds
2024-07-07 21:01           ` David Hildenbrand
2024-07-08  0:08             ` Linus Torvalds
2024-07-08  8:11               ` David Hildenbrand
2024-07-08  8:23                 ` David Hildenbrand
2024-07-08 13:57                   ` Jason A. Donenfeld
2024-07-08 20:05                     ` David Hildenbrand
2024-07-08 13:55                 ` Jason A. Donenfeld
2024-07-08 14:40                   ` Jason A. Donenfeld
2024-07-08 20:21                     ` David Hildenbrand
2024-07-08 20:26                       ` David Hildenbrand
2024-07-09  2:17                       ` Jason A. Donenfeld
2024-07-10  3:05                         ` David Hildenbrand
2024-07-10  3:34                           ` Jason A. Donenfeld
2024-07-10  3:53                             ` David Hildenbrand
2024-07-08 20:06                   ` David Hildenbrand
2024-07-08 13:50               ` Jason A. Donenfeld
2024-07-08  1:59       ` Jason A. Donenfeld
2024-07-08  1:46     ` Jason A. Donenfeld
2024-07-08 20:24       ` David Hildenbrand
2024-07-07  0:26 ` [PATCH v21 2/4] random: introduce generic vDSO getrandom() implementation Jason A. Donenfeld
2024-07-07  0:26 ` [PATCH v21 3/4] x86: vdso: Wire up getrandom() vDSO implementation Jason A. Donenfeld
2024-07-07  0:26 ` [PATCH v21 4/4] selftests/vDSO: add tests for vgetrandom Jason A. Donenfeld
