* [PATCH 0/4] mm: permit guard regions for file-backed/shmem mappings
@ 2025-02-13 18:16 Lorenzo Stoakes
2025-02-13 18:17 ` [PATCH 1/4] mm: allow guard regions in file-backed and read-only mappings Lorenzo Stoakes
` (5 more replies)
0 siblings, 6 replies; 64+ messages in thread
From: Lorenzo Stoakes @ 2025-02-13 18:16 UTC (permalink / raw)
To: Andrew Morton
Cc: Suren Baghdasaryan, Liam R . Howlett, Matthew Wilcox,
Vlastimil Babka, Paul E . McKenney, Jann Horn, David Hildenbrand,
linux-mm, linux-kernel, Shuah Khan, linux-kselftest, linux-api,
John Hubbard, Juan Yescas, Kalesh Singh
The guard regions feature was initially implemented to support anonymous
mappings only, excluding shmem.
This was done so as to introduce the feature carefully and incrementally
and to be conservative when considering the various caveats and corner
cases that apply to file-backed mappings but not to anonymous ones.
Now that this feature has landed in 6.13, it is time to revisit this and to
extend this functionality to file-backed and shmem mappings.
In order to make this maximally useful, and since file-backed mappings may
be mapped read-only (for instance ELF images), we also remove the
restriction on read-only mappings and permit the establishment of guard
regions in any non-hugetlb, non-mlock()'d mapping.
It is safe to permit the establishment of guard regions in read-only
mappings because guard regions only reduce access to the mapping, and when
removed simply reinstate the existing attributes of the underlying VMA,
meaning no access violations can occur.
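For illustration only (this is not taken from the selftests), a minimal
sketch of the now-permitted read-only case might look as follows. It
assumes a kernel with this series applied, an ordinary file of at least 10
pages at the hypothetical path ./some_file, and falls back to the raw
madvise values in case the installed UAPI headers predate 6.13:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_GUARD_INSTALL
#define MADV_GUARD_INSTALL 102
#define MADV_GUARD_REMOVE  103
#endif

int main(void)
{
	const size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
	/* Hypothetical path - any ordinary file of at least 10 pages will do. */
	int fd = open("./some_file", O_RDONLY);
	char *ptr;

	if (fd < 0) {
		perror("open");
		return EXIT_FAILURE;
	}

	/* Read-only, file-backed mapping - now a valid guard region target. */
	ptr = mmap(NULL, 10 * page_size, PROT_READ, MAP_SHARED, fd, 0);
	if (ptr == MAP_FAILED) {
		perror("mmap");
		return EXIT_FAILURE;
	}

	/* Guard the first page - reads of ptr[0] would now raise SIGSEGV. */
	if (madvise(ptr, page_size, MADV_GUARD_INSTALL))
		perror("MADV_GUARD_INSTALL");

	/* The remainder of the mapping is readable as usual. */
	printf("page 1 begins with '%c'\n", ptr[page_size]);

	/* Removal simply reinstates the VMA's existing read-only access. */
	if (madvise(ptr, page_size, MADV_GUARD_REMOVE))
		perror("MADV_GUARD_REMOVE");

	munmap(ptr, 10 * page_size);
	close(fd);
	return EXIT_SUCCESS;
}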
While the change in kernel code introduced in this series is small, the
majority of the effort here is spent in extending the testing to assert
that the feature works correctly across numerous file-backed mapping
scenarios.
Every guard region self-test previously performed against anonymous memory
(where relevant, i.e. not anon-only) has been updated to also be performed
against shmem and a mapping of a file in the working directory.
This confirms that all cases also function correctly for file-backed guard
regions.
In addition a number of other tests are added for specific file-backed
mapping scenarios.
There are a number of other concerns that one might have with regard to
guard regions, addressed below:
Readahead
~~~~~~~~~
Readahead is a process through which the page cache is populated on the
assumption that sequential reads will occur, thus amortising I/O. Through
clever use of the PG_readahead folio flag, established during major fault
and checked upon minor fault, it also provides for asynchronous I/O to
occur as data is processed, reducing I/O stalls as data is faulted in.
Guard regions do not alter this mechanism, which operates at the folio and
fault level, but they do of course prevent the faulting in of folios that
would otherwise be mapped.
In the instance of a major fault prior to a guard region, synchronous
readahead will occur, including the population of page cache folios to
which the guard regions will, in the mapping in question, prevent access.
In addition, if the PG_readahead flag is set on a folio that is now
inaccessible, this will prevent asynchronous readahead from being triggered
as it otherwise would be. However, readahead has mechanisms for
heuristically resetting this state regardless, which will 'recover' correct
readahead behaviour.
Readahead presumes sequential data access; the presence of a guard region
clearly indicates that, at least within the guard region, no such
sequential access will occur, as it cannot occur there.
So this should have very little impact on any real workload. The far more
important question is whether readahead causes incorrect or inappropriate
mapping of ranges disallowed by the presence of guard regions - it does
not, as readahead does not 'pre-fault' memory in this fashion.
At any rate, any mechanism which would attempt to do so would hit the usual
page fault paths, which correctly handle PTE markers as with anonymous
mappings.
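To make the 'no pre-faulting' point concrete, a rough sketch (illustrative
only, reusing the headers and MADV_GUARD_* fallback definitions from the
sketch above):

/*
 * ptr is assumed to be a MAP_SHARED mapping of at least 10 pages of fd at
 * offset 0, set up as in the earlier sketch.
 */
static void readahead_vs_guards(int fd, char *ptr, size_t page_size)
{
	/* Guard pages 3..6 of the mapping. */
	madvise(&ptr[3 * page_size], 4 * page_size, MADV_GUARD_INSTALL);

	/*
	 * Explicitly kick off readahead across the whole range. This may
	 * populate page cache folios backing the guarded pages, but it does
	 * not 'pre-fault' them into the mapping - touching them would still
	 * raise SIGSEGV.
	 */
	posix_fadvise(fd, 0, 10 * page_size, POSIX_FADV_WILLNEED);

	/* Pages outside the guard region fault in (with readahead) as usual. */
	volatile char c = ptr[0];
	(void)c;

	/*
	 * Once the guards are removed, any folios readahead brought into the
	 * page cache are mapped on the next minor fault like any other file
	 * page.
	 */
	madvise(&ptr[3 * page_size], 4 * page_size, MADV_GUARD_REMOVE);
}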
Fault-Around
~~~~~~~~~~~~
The fault-around logic, in a similar vein to readahead, attempts to
improve efficiency with regard to file-backed memory mappings; however, it
differs in that it does not try to fetch folios into the page cache that
are about to be accessed, but rather pre-maps a range of folios around the
faulting address.
The fact that guard regions make use of PTE markers makes this relatively
trivial, as this case is already handled - see filemap_map_folio_range()
and filemap_map_order0_folio(). In both instances, the solution is simply
to keep the established page table mappings and let the fault handler take
care of the PTE markers, as per the comment:
/*
* NOTE: If there're PTE markers, we'll leave them to be
* handled in the specific fault path, and it'll prohibit
* the fault-around logic.
*/
This works, as establishing guard regions results in page table mappings
containing PTE markers, and removing the guard regions clears those
markers.
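From userspace this looks roughly as follows (an illustrative sketch only,
mirroring the fault_around selftest added later in this series, with the
same header assumptions as the earlier sketches):

static void fault_around_vs_guards(char *ptr, size_t page_size)
{
	size_t i;

	/* Guard every even page of a 10-page file-backed mapping. */
	for (i = 0; i < 10; i += 2)
		madvise(&ptr[i * page_size], page_size, MADV_GUARD_INSTALL);

	/*
	 * Fault in an odd page. The fault handler may map neighbouring pages
	 * that are already in the page cache, but per the comment above it
	 * skips PTE-marker entries, so the even pages remain guarded.
	 */
	volatile char c = ptr[5 * page_size];
	(void)c;
}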
Truncation
~~~~~~~~~~
File truncation will not eliminate existing guard regions, as the
truncation operation will ultimately zap the range via
unmap_mapping_range(), which leaves PTE markers in place.
Zapping
~~~~~~~
Zapping is, as with anonymous mappings, handled by zap_nonpresent_ptes(),
which specifically deals with guard entries, leaving them intact except in
instances such as process teardown or munmap() where they need to be
removed.
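For instance (an illustrative sketch only, with the same assumptions as the
earlier sketches - the dontneed and munmap selftests cover this properly):

static void zap_vs_guards(char *ptr, size_t page_size)
{
	/* Guard the first page of a 10-page file-backed mapping. */
	madvise(ptr, page_size, MADV_GUARD_INSTALL);

	/*
	 * MADV_DONTNEED zaps via zap_nonpresent_ptes(), which leaves guard
	 * entries intact - the first page is still guarded afterwards.
	 */
	madvise(ptr, 10 * page_size, MADV_DONTNEED);

	/*
	 * munmap() tears the range down entirely, guard entries included, so
	 * a fresh mapping of the same file range starts unguarded.
	 */
	munmap(ptr, 10 * page_size);
}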
Reclaim
~~~~~~~
When reclaim is performed on file-backed folios, it ultimately invokes
try_to_unmap_one() via the rmap. If the folio is non-large, then map_pte()
will ultimately abort the operation for the guard region mapping. If it is
large, then check_pte() will determine that this is a 'swap' PTE which is
neither a device-private nor a device-exclusive entry, and thus abort the
operation in that instance.
Therefore, nothing unexpected happens when reclaim is attempted upon a
file-backed guard region.
Hole Punching
~~~~~~~~~~~~~
This updates the page cache and ultimately invokes unmap_mapping_range(),
which explicitly leaves PTE markers in place.
Because the establishment of guard regions zapped any existing mappings of
file-backed folios, once the guard regions are removed the hole-punched
region will be faulted in as usual and everything will behave as expected.
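The selftests exercise this via MADV_REMOVE; the equivalent operation on
the backing fd is sketched below for illustration (assumes _GNU_SOURCE and
<fcntl.h> in addition to the headers from the earlier sketches):

static void hole_punch_vs_guards(int fd, char *ptr, size_t page_size)
{
	/* Guard the middle four pages of a 10-page MAP_SHARED mapping of fd. */
	madvise(&ptr[3 * page_size], 4 * page_size, MADV_GUARD_INSTALL);

	/*
	 * Punch a hole over the same file range. unmap_mapping_range() leaves
	 * the PTE markers in place, so the range stays guarded.
	 */
	fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		  3 * page_size, 4 * page_size);

	/*
	 * Once the guards are removed, faults repopulate the punched range
	 * with fresh, zero-filled folios, as for any hole in the file.
	 */
	madvise(&ptr[3 * page_size], 4 * page_size, MADV_GUARD_REMOVE);
}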
Lorenzo Stoakes (4):
mm: allow guard regions in file-backed and read-only mappings
selftests/mm: rename guard-pages to guard-regions
tools/selftests: expand all guard region tests to file-backed
tools/selftests: add file/shmem-backed mapping guard region tests
mm/madvise.c | 8 +-
tools/testing/selftests/mm/.gitignore | 2 +-
tools/testing/selftests/mm/Makefile | 2 +-
.../mm/{guard-pages.c => guard-regions.c} | 921 ++++++++++++++++--
4 files changed, 821 insertions(+), 112 deletions(-)
rename tools/testing/selftests/mm/{guard-pages.c => guard-regions.c} (58%)
--
2.48.1
* [PATCH 1/4] mm: allow guard regions in file-backed and read-only mappings
2025-02-13 18:16 [PATCH 0/4] mm: permit guard regions for file-backed/shmem mappings Lorenzo Stoakes
@ 2025-02-13 18:17 ` Lorenzo Stoakes
2025-02-18 14:15 ` Vlastimil Babka
` (2 more replies)
2025-02-13 18:17 ` [PATCH 2/4] selftests/mm: rename guard-pages to guard-regions Lorenzo Stoakes
` (4 subsequent siblings)
5 siblings, 3 replies; 64+ messages in thread
From: Lorenzo Stoakes @ 2025-02-13 18:17 UTC (permalink / raw)
To: Andrew Morton
Cc: Suren Baghdasaryan, Liam R . Howlett, Matthew Wilcox,
Vlastimil Babka, Paul E . McKenney, Jann Horn, David Hildenbrand,
linux-mm, linux-kernel, Shuah Khan, linux-kselftest, linux-api,
John Hubbard, Juan Yescas, Kalesh Singh
There is no reason to disallow guard regions in file-backed mappings -
readahead and fault-around both function correctly in the presence of PTE
markers, as do the other operations relating to memory-mapped files.
Additionally, introducing guard regions into read-only mappings only
restricts those mappings further, which means no access rights are violated
by permitting this.
Removing this restriction allows guard regions to be used with read-only
mapped files (such as executable files), which would otherwise not be
permitted.
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
mm/madvise.c | 8 +-------
1 file changed, 1 insertion(+), 7 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 6ecead476a80..e01e93e179a8 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1051,13 +1051,7 @@ static bool is_valid_guard_vma(struct vm_area_struct *vma, bool allow_locked)
if (!allow_locked)
disallowed |= VM_LOCKED;
- if (!vma_is_anonymous(vma))
- return false;
-
- if ((vma->vm_flags & (VM_MAYWRITE | disallowed)) != VM_MAYWRITE)
- return false;
-
- return true;
+ return !(vma->vm_flags & disallowed);
}
static bool is_guard_pte_marker(pte_t ptent)
--
2.48.1
* [PATCH 2/4] selftests/mm: rename guard-pages to guard-regions
2025-02-13 18:16 [PATCH 0/4] mm: permit guard regions for file-backed/shmem mappings Lorenzo Stoakes
2025-02-13 18:17 ` [PATCH 1/4] mm: allow guard regions in file-backed and read-only mappings Lorenzo Stoakes
@ 2025-02-13 18:17 ` Lorenzo Stoakes
2025-02-18 14:15 ` Vlastimil Babka
2025-03-02 8:35 ` Lorenzo Stoakes
2025-02-13 18:17 ` [PATCH 3/4] tools/selftests: expand all guard region tests to file-backed Lorenzo Stoakes
` (3 subsequent siblings)
5 siblings, 2 replies; 64+ messages in thread
From: Lorenzo Stoakes @ 2025-02-13 18:17 UTC (permalink / raw)
To: Andrew Morton
Cc: Suren Baghdasaryan, Liam R . Howlett, Matthew Wilcox,
Vlastimil Babka, Paul E . McKenney, Jann Horn, David Hildenbrand,
linux-mm, linux-kernel, Shuah Khan, linux-kselftest, linux-api,
John Hubbard, Juan Yescas, Kalesh Singh
The feature formerly referred to as guard pages is more correctly referred
to as 'guard regions', as in fact no pages are ever allocated in the
process of installing the regions.
To avoid confusion, rename the tests accordingly.
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
tools/testing/selftests/mm/.gitignore | 2 +-
tools/testing/selftests/mm/Makefile | 2 +-
.../mm/{guard-pages.c => guard-regions.c} | 42 +++++++++----------
3 files changed, 23 insertions(+), 23 deletions(-)
rename tools/testing/selftests/mm/{guard-pages.c => guard-regions.c} (98%)
diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore
index 121000c28c10..c5241b193db8 100644
--- a/tools/testing/selftests/mm/.gitignore
+++ b/tools/testing/selftests/mm/.gitignore
@@ -57,4 +57,4 @@ droppable
hugetlb_dio
pkey_sighandler_tests_32
pkey_sighandler_tests_64
-guard-pages
+guard-regions
diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index 63ce39d024bb..8270895039d1 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -97,7 +97,7 @@ TEST_GEN_FILES += hugetlb_fault_after_madv
TEST_GEN_FILES += hugetlb_madv_vs_map
TEST_GEN_FILES += hugetlb_dio
TEST_GEN_FILES += droppable
-TEST_GEN_FILES += guard-pages
+TEST_GEN_FILES += guard-regions
ifneq ($(ARCH),arm64)
TEST_GEN_FILES += soft-dirty
diff --git a/tools/testing/selftests/mm/guard-pages.c b/tools/testing/selftests/mm/guard-regions.c
similarity index 98%
rename from tools/testing/selftests/mm/guard-pages.c
rename to tools/testing/selftests/mm/guard-regions.c
index ece37212a8a2..7a41cf9ffbdf 100644
--- a/tools/testing/selftests/mm/guard-pages.c
+++ b/tools/testing/selftests/mm/guard-regions.c
@@ -107,12 +107,12 @@ static bool try_read_write_buf(char *ptr)
return try_read_buf(ptr) && try_write_buf(ptr);
}
-FIXTURE(guard_pages)
+FIXTURE(guard_regions)
{
unsigned long page_size;
};
-FIXTURE_SETUP(guard_pages)
+FIXTURE_SETUP(guard_regions)
{
struct sigaction act = {
.sa_handler = &handle_fatal,
@@ -126,7 +126,7 @@ FIXTURE_SETUP(guard_pages)
self->page_size = (unsigned long)sysconf(_SC_PAGESIZE);
};
-FIXTURE_TEARDOWN(guard_pages)
+FIXTURE_TEARDOWN(guard_regions)
{
struct sigaction act = {
.sa_handler = SIG_DFL,
@@ -137,7 +137,7 @@ FIXTURE_TEARDOWN(guard_pages)
sigaction(SIGSEGV, &act, NULL);
}
-TEST_F(guard_pages, basic)
+TEST_F(guard_regions, basic)
{
const unsigned long NUM_PAGES = 10;
const unsigned long page_size = self->page_size;
@@ -231,7 +231,7 @@ TEST_F(guard_pages, basic)
}
/* Assert that operations applied across multiple VMAs work as expected. */
-TEST_F(guard_pages, multi_vma)
+TEST_F(guard_regions, multi_vma)
{
const unsigned long page_size = self->page_size;
char *ptr_region, *ptr, *ptr1, *ptr2, *ptr3;
@@ -367,7 +367,7 @@ TEST_F(guard_pages, multi_vma)
* Assert that batched operations performed using process_madvise() work as
* expected.
*/
-TEST_F(guard_pages, process_madvise)
+TEST_F(guard_regions, process_madvise)
{
const unsigned long page_size = self->page_size;
pid_t pid = getpid();
@@ -467,7 +467,7 @@ TEST_F(guard_pages, process_madvise)
}
/* Assert that unmapping ranges does not leave guard markers behind. */
-TEST_F(guard_pages, munmap)
+TEST_F(guard_regions, munmap)
{
const unsigned long page_size = self->page_size;
char *ptr, *ptr_new1, *ptr_new2;
@@ -505,7 +505,7 @@ TEST_F(guard_pages, munmap)
}
/* Assert that mprotect() operations have no bearing on guard markers. */
-TEST_F(guard_pages, mprotect)
+TEST_F(guard_regions, mprotect)
{
const unsigned long page_size = self->page_size;
char *ptr;
@@ -553,7 +553,7 @@ TEST_F(guard_pages, mprotect)
}
/* Split and merge VMAs and make sure guard pages still behave. */
-TEST_F(guard_pages, split_merge)
+TEST_F(guard_regions, split_merge)
{
const unsigned long page_size = self->page_size;
char *ptr, *ptr_new;
@@ -684,7 +684,7 @@ TEST_F(guard_pages, split_merge)
}
/* Assert that MADV_DONTNEED does not remove guard markers. */
-TEST_F(guard_pages, dontneed)
+TEST_F(guard_regions, dontneed)
{
const unsigned long page_size = self->page_size;
char *ptr;
@@ -737,7 +737,7 @@ TEST_F(guard_pages, dontneed)
}
/* Assert that mlock()'ed pages work correctly with guard markers. */
-TEST_F(guard_pages, mlock)
+TEST_F(guard_regions, mlock)
{
const unsigned long page_size = self->page_size;
char *ptr;
@@ -810,7 +810,7 @@ TEST_F(guard_pages, mlock)
*
* - Moving a mapping alone should retain markers as they are.
*/
-TEST_F(guard_pages, mremap_move)
+TEST_F(guard_regions, mremap_move)
{
const unsigned long page_size = self->page_size;
char *ptr, *ptr_new;
@@ -857,7 +857,7 @@ TEST_F(guard_pages, mremap_move)
* will have to remove guard pages manually to fix up (they'd have to do the
* same if it were a PROT_NONE mapping).
*/
-TEST_F(guard_pages, mremap_expand)
+TEST_F(guard_regions, mremap_expand)
{
const unsigned long page_size = self->page_size;
char *ptr, *ptr_new;
@@ -920,7 +920,7 @@ TEST_F(guard_pages, mremap_expand)
* if the user were using a PROT_NONE mapping they'd have to manually fix this
* up also so this is OK.
*/
-TEST_F(guard_pages, mremap_shrink)
+TEST_F(guard_regions, mremap_shrink)
{
const unsigned long page_size = self->page_size;
char *ptr;
@@ -984,7 +984,7 @@ TEST_F(guard_pages, mremap_shrink)
* Assert that forking a process with VMAs that do not have VM_WIPEONFORK set
* retain guard pages.
*/
-TEST_F(guard_pages, fork)
+TEST_F(guard_regions, fork)
{
const unsigned long page_size = self->page_size;
char *ptr;
@@ -1039,7 +1039,7 @@ TEST_F(guard_pages, fork)
* Assert expected behaviour after we fork populated ranges of anonymous memory
* and then guard and unguard the range.
*/
-TEST_F(guard_pages, fork_cow)
+TEST_F(guard_regions, fork_cow)
{
const unsigned long page_size = self->page_size;
char *ptr;
@@ -1110,7 +1110,7 @@ TEST_F(guard_pages, fork_cow)
* Assert that forking a process with VMAs that do have VM_WIPEONFORK set
* behave as expected.
*/
-TEST_F(guard_pages, fork_wipeonfork)
+TEST_F(guard_regions, fork_wipeonfork)
{
const unsigned long page_size = self->page_size;
char *ptr;
@@ -1160,7 +1160,7 @@ TEST_F(guard_pages, fork_wipeonfork)
}
/* Ensure that MADV_FREE retains guard entries as expected. */
-TEST_F(guard_pages, lazyfree)
+TEST_F(guard_regions, lazyfree)
{
const unsigned long page_size = self->page_size;
char *ptr;
@@ -1196,7 +1196,7 @@ TEST_F(guard_pages, lazyfree)
}
/* Ensure that MADV_POPULATE_READ, MADV_POPULATE_WRITE behave as expected. */
-TEST_F(guard_pages, populate)
+TEST_F(guard_regions, populate)
{
const unsigned long page_size = self->page_size;
char *ptr;
@@ -1222,7 +1222,7 @@ TEST_F(guard_pages, populate)
}
/* Ensure that MADV_COLD, MADV_PAGEOUT do not remove guard markers. */
-TEST_F(guard_pages, cold_pageout)
+TEST_F(guard_regions, cold_pageout)
{
const unsigned long page_size = self->page_size;
char *ptr;
@@ -1268,7 +1268,7 @@ TEST_F(guard_pages, cold_pageout)
}
/* Ensure that guard pages do not break userfaultd. */
-TEST_F(guard_pages, uffd)
+TEST_F(guard_regions, uffd)
{
const unsigned long page_size = self->page_size;
int uffd;
--
2.48.1
* [PATCH 3/4] tools/selftests: expand all guard region tests to file-backed
2025-02-13 18:16 [PATCH 0/4] mm: permit guard regions for file-backed/shmem mappings Lorenzo Stoakes
2025-02-13 18:17 ` [PATCH 1/4] mm: allow guard regions in file-backed and read-only mappings Lorenzo Stoakes
2025-02-13 18:17 ` [PATCH 2/4] selftests/mm: rename guard-pages to guard-regions Lorenzo Stoakes
@ 2025-02-13 18:17 ` Lorenzo Stoakes
2025-02-18 14:17 ` Vlastimil Babka
2025-04-22 10:37 ` Ryan Roberts
2025-02-13 18:17 ` [PATCH 4/4] tools/selftests: add file/shmem-backed mapping guard region tests Lorenzo Stoakes
` (2 subsequent siblings)
5 siblings, 2 replies; 64+ messages in thread
From: Lorenzo Stoakes @ 2025-02-13 18:17 UTC (permalink / raw)
To: Andrew Morton
Cc: Suren Baghdasaryan, Liam R . Howlett, Matthew Wilcox,
Vlastimil Babka, Paul E . McKenney, Jann Horn, David Hildenbrand,
linux-mm, linux-kernel, Shuah Khan, linux-kselftest, linux-api,
John Hubbard, Juan Yescas, Kalesh Singh
Extend the guard region tests to allow for test fixture variants for anon,
shmem, and locally created files.
This allows us to assert that each of the expected behaviours of anonymous
memory also applies correctly to file-backed memory (both shmem and a file
created locally in the current working directory), and thus asserts the
same correctness guarantees as all the remaining tests do.
The fixture teardown is now performed in the parent process rather than in
forked child processes, meaning cleanup is always performed, including the
unlinking of any generated temporary files.
Additionally the variant fixture data type now contains an enum value
indicating the type of backing store and the mmap() invocation is
abstracted to allow for the mapping of whichever backing store the variant
is testing.
We adjust tests as necessary to account for the fact they may now reference
files rather than anonymous memory.
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
tools/testing/selftests/mm/guard-regions.c | 290 +++++++++++++++------
1 file changed, 205 insertions(+), 85 deletions(-)
diff --git a/tools/testing/selftests/mm/guard-regions.c b/tools/testing/selftests/mm/guard-regions.c
index 7a41cf9ffbdf..0469c783f4fa 100644
--- a/tools/testing/selftests/mm/guard-regions.c
+++ b/tools/testing/selftests/mm/guard-regions.c
@@ -6,6 +6,7 @@
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
+#include <linux/limits.h>
#include <linux/userfaultfd.h>
#include <setjmp.h>
#include <signal.h>
@@ -37,6 +38,79 @@ static sigjmp_buf signal_jmp_buf;
*/
#define FORCE_READ(x) (*(volatile typeof(x) *)x)
+/*
+ * How is the test backing the mapping being tested?
+ */
+enum backing_type {
+ ANON_BACKED,
+ SHMEM_BACKED,
+ LOCAL_FILE_BACKED,
+};
+
+FIXTURE(guard_regions)
+{
+ unsigned long page_size;
+ char path[PATH_MAX];
+ int fd;
+};
+
+FIXTURE_VARIANT(guard_regions)
+{
+ enum backing_type backing;
+};
+
+FIXTURE_VARIANT_ADD(guard_regions, anon)
+{
+ .backing = ANON_BACKED,
+};
+
+FIXTURE_VARIANT_ADD(guard_regions, shmem)
+{
+ .backing = SHMEM_BACKED,
+};
+
+FIXTURE_VARIANT_ADD(guard_regions, file)
+{
+ .backing = LOCAL_FILE_BACKED,
+};
+
+static bool is_anon_backed(const FIXTURE_VARIANT(guard_regions) * variant)
+{
+ switch (variant->backing) {
+ case ANON_BACKED:
+ case SHMEM_BACKED:
+ return true;
+ default:
+ return false;
+ }
+}
+
+static void *mmap_(FIXTURE_DATA(guard_regions) * self,
+ const FIXTURE_VARIANT(guard_regions) * variant,
+ void *addr, size_t length, int prot, int extra_flags,
+ off_t offset)
+{
+ int fd;
+ int flags = extra_flags;
+
+ switch (variant->backing) {
+ case ANON_BACKED:
+ flags |= MAP_PRIVATE | MAP_ANON;
+ fd = -1;
+ break;
+ case SHMEM_BACKED:
+ case LOCAL_FILE_BACKED:
+ flags |= MAP_SHARED;
+ fd = self->fd;
+ break;
+ default:
+ ksft_exit_fail();
+ break;
+ }
+
+ return mmap(addr, length, prot, flags, fd, offset);
+}
+
static int userfaultfd(int flags)
{
return syscall(SYS_userfaultfd, flags);
@@ -107,12 +181,7 @@ static bool try_read_write_buf(char *ptr)
return try_read_buf(ptr) && try_write_buf(ptr);
}
-FIXTURE(guard_regions)
-{
- unsigned long page_size;
-};
-
-FIXTURE_SETUP(guard_regions)
+static void setup_sighandler(void)
{
struct sigaction act = {
.sa_handler = &handle_fatal,
@@ -122,11 +191,9 @@ FIXTURE_SETUP(guard_regions)
sigemptyset(&act.sa_mask);
if (sigaction(SIGSEGV, &act, NULL))
ksft_exit_fail_perror("sigaction");
+}
- self->page_size = (unsigned long)sysconf(_SC_PAGESIZE);
-};
-
-FIXTURE_TEARDOWN(guard_regions)
+static void teardown_sighandler(void)
{
struct sigaction act = {
.sa_handler = SIG_DFL,
@@ -137,6 +204,48 @@ FIXTURE_TEARDOWN(guard_regions)
sigaction(SIGSEGV, &act, NULL);
}
+static int open_file(const char *prefix, char *path)
+{
+ int fd;
+
+ snprintf(path, PATH_MAX, "%sguard_regions_test_file_XXXXXX", prefix);
+ fd = mkstemp(path);
+ if (fd < 0)
+ ksft_exit_fail_perror("mkstemp");
+
+ return fd;
+}
+
+FIXTURE_SETUP(guard_regions)
+{
+ self->page_size = (unsigned long)sysconf(_SC_PAGESIZE);
+ setup_sighandler();
+
+ if (variant->backing == ANON_BACKED)
+ return;
+
+ self->fd = open_file(
+ variant->backing == SHMEM_BACKED ? "/tmp/" : "",
+ self->path);
+
+ /* We truncate file to at least 100 pages, tests can modify as needed. */
+ ASSERT_EQ(ftruncate(self->fd, 100 * self->page_size), 0);
+};
+
+FIXTURE_TEARDOWN_PARENT(guard_regions)
+{
+ teardown_sighandler();
+
+ if (variant->backing == ANON_BACKED)
+ return;
+
+ if (self->fd >= 0)
+ close(self->fd);
+
+ if (self->path[0] != '\0')
+ unlink(self->path);
+}
+
TEST_F(guard_regions, basic)
{
const unsigned long NUM_PAGES = 10;
@@ -144,8 +253,8 @@ TEST_F(guard_regions, basic)
char *ptr;
int i;
- ptr = mmap(NULL, NUM_PAGES * page_size, PROT_READ | PROT_WRITE,
- MAP_PRIVATE | MAP_ANON, -1, 0);
+ ptr = mmap_(self, variant, NULL, NUM_PAGES * page_size,
+ PROT_READ | PROT_WRITE, 0, 0);
ASSERT_NE(ptr, MAP_FAILED);
/* Trivially assert we can touch the first page. */
@@ -238,25 +347,23 @@ TEST_F(guard_regions, multi_vma)
int i;
/* Reserve a 100 page region over which we can install VMAs. */
- ptr_region = mmap(NULL, 100 * page_size, PROT_NONE,
- MAP_ANON | MAP_PRIVATE, -1, 0);
+ ptr_region = mmap_(self, variant, NULL, 100 * page_size,
+ PROT_NONE, 0, 0);
ASSERT_NE(ptr_region, MAP_FAILED);
/* Place a VMA of 10 pages size at the start of the region. */
- ptr1 = mmap(ptr_region, 10 * page_size, PROT_READ | PROT_WRITE,
- MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
+ ptr1 = mmap_(self, variant, ptr_region, 10 * page_size,
+ PROT_READ | PROT_WRITE, MAP_FIXED, 0);
ASSERT_NE(ptr1, MAP_FAILED);
/* Place a VMA of 5 pages size 50 pages into the region. */
- ptr2 = mmap(&ptr_region[50 * page_size], 5 * page_size,
- PROT_READ | PROT_WRITE,
- MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
+ ptr2 = mmap_(self, variant, &ptr_region[50 * page_size], 5 * page_size,
+ PROT_READ | PROT_WRITE, MAP_FIXED, 0);
ASSERT_NE(ptr2, MAP_FAILED);
/* Place a VMA of 20 pages size at the end of the region. */
- ptr3 = mmap(&ptr_region[80 * page_size], 20 * page_size,
- PROT_READ | PROT_WRITE,
- MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
+ ptr3 = mmap_(self, variant, &ptr_region[80 * page_size], 20 * page_size,
+ PROT_READ | PROT_WRITE, MAP_FIXED, 0);
ASSERT_NE(ptr3, MAP_FAILED);
/* Unmap gaps. */
@@ -326,13 +433,11 @@ TEST_F(guard_regions, multi_vma)
}
/* Now map incompatible VMAs in the gaps. */
- ptr = mmap(&ptr_region[10 * page_size], 40 * page_size,
- PROT_READ | PROT_WRITE | PROT_EXEC,
- MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
+ ptr = mmap_(self, variant, &ptr_region[10 * page_size], 40 * page_size,
+ PROT_READ | PROT_WRITE | PROT_EXEC, MAP_FIXED, 0);
ASSERT_NE(ptr, MAP_FAILED);
- ptr = mmap(&ptr_region[55 * page_size], 25 * page_size,
- PROT_READ | PROT_WRITE | PROT_EXEC,
- MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
+ ptr = mmap_(self, variant, &ptr_region[55 * page_size], 25 * page_size,
+ PROT_READ | PROT_WRITE | PROT_EXEC, MAP_FIXED, 0);
ASSERT_NE(ptr, MAP_FAILED);
/*
@@ -379,8 +484,8 @@ TEST_F(guard_regions, process_madvise)
ASSERT_NE(pidfd, -1);
/* Reserve region to map over. */
- ptr_region = mmap(NULL, 100 * page_size, PROT_NONE,
- MAP_ANON | MAP_PRIVATE, -1, 0);
+ ptr_region = mmap_(self, variant, NULL, 100 * page_size,
+ PROT_NONE, 0, 0);
ASSERT_NE(ptr_region, MAP_FAILED);
/*
@@ -388,9 +493,8 @@ TEST_F(guard_regions, process_madvise)
* overwrite existing entries and test this code path against
* overwriting existing entries.
*/
- ptr1 = mmap(&ptr_region[page_size], 10 * page_size,
- PROT_READ | PROT_WRITE,
- MAP_FIXED | MAP_ANON | MAP_PRIVATE | MAP_POPULATE, -1, 0);
+ ptr1 = mmap_(self, variant, &ptr_region[page_size], 10 * page_size,
+ PROT_READ | PROT_WRITE, MAP_FIXED | MAP_POPULATE, 0);
ASSERT_NE(ptr1, MAP_FAILED);
/* We want guard markers at start/end of each VMA. */
vec[0].iov_base = ptr1;
@@ -399,9 +503,8 @@ TEST_F(guard_regions, process_madvise)
vec[1].iov_len = page_size;
/* 5 pages offset 50 pages into reserve region. */
- ptr2 = mmap(&ptr_region[50 * page_size], 5 * page_size,
- PROT_READ | PROT_WRITE,
- MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
+ ptr2 = mmap_(self, variant, &ptr_region[50 * page_size], 5 * page_size,
+ PROT_READ | PROT_WRITE, MAP_FIXED, 0);
ASSERT_NE(ptr2, MAP_FAILED);
vec[2].iov_base = ptr2;
vec[2].iov_len = page_size;
@@ -409,9 +512,8 @@ TEST_F(guard_regions, process_madvise)
vec[3].iov_len = page_size;
/* 20 pages offset 79 pages into reserve region. */
- ptr3 = mmap(&ptr_region[79 * page_size], 20 * page_size,
- PROT_READ | PROT_WRITE,
- MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
+ ptr3 = mmap_(self, variant, &ptr_region[79 * page_size], 20 * page_size,
+ PROT_READ | PROT_WRITE, MAP_FIXED, 0);
ASSERT_NE(ptr3, MAP_FAILED);
vec[4].iov_base = ptr3;
vec[4].iov_len = page_size;
@@ -472,8 +574,8 @@ TEST_F(guard_regions, munmap)
const unsigned long page_size = self->page_size;
char *ptr, *ptr_new1, *ptr_new2;
- ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
- MAP_ANON | MAP_PRIVATE, -1, 0);
+ ptr = mmap_(self, variant, NULL, 10 * page_size,
+ PROT_READ | PROT_WRITE, 0, 0);
ASSERT_NE(ptr, MAP_FAILED);
/* Guard first and last pages. */
@@ -489,11 +591,11 @@ TEST_F(guard_regions, munmap)
ASSERT_EQ(munmap(&ptr[9 * page_size], page_size), 0);
/* Map over them.*/
- ptr_new1 = mmap(ptr, page_size, PROT_READ | PROT_WRITE,
- MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
+ ptr_new1 = mmap_(self, variant, ptr, page_size, PROT_READ | PROT_WRITE,
+ MAP_FIXED, 0);
ASSERT_NE(ptr_new1, MAP_FAILED);
- ptr_new2 = mmap(&ptr[9 * page_size], page_size, PROT_READ | PROT_WRITE,
- MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
+ ptr_new2 = mmap_(self, variant, &ptr[9 * page_size], page_size,
+ PROT_READ | PROT_WRITE, MAP_FIXED, 0);
ASSERT_NE(ptr_new2, MAP_FAILED);
/* Assert that they are now not guarded. */
@@ -511,8 +613,8 @@ TEST_F(guard_regions, mprotect)
char *ptr;
int i;
- ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
- MAP_ANON | MAP_PRIVATE, -1, 0);
+ ptr = mmap_(self, variant, NULL, 10 * page_size,
+ PROT_READ | PROT_WRITE, 0, 0);
ASSERT_NE(ptr, MAP_FAILED);
/* Guard the middle of the range. */
@@ -559,8 +661,8 @@ TEST_F(guard_regions, split_merge)
char *ptr, *ptr_new;
int i;
- ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
- MAP_ANON | MAP_PRIVATE, -1, 0);
+ ptr = mmap_(self, variant, NULL, 10 * page_size,
+ PROT_READ | PROT_WRITE, 0, 0);
ASSERT_NE(ptr, MAP_FAILED);
/* Guard the whole range. */
@@ -601,14 +703,14 @@ TEST_F(guard_regions, split_merge)
}
/* Now map them again - the unmap will have cleared the guards. */
- ptr_new = mmap(&ptr[2 * page_size], page_size, PROT_READ | PROT_WRITE,
- MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
+ ptr_new = mmap_(self, variant, &ptr[2 * page_size], page_size,
+ PROT_READ | PROT_WRITE, MAP_FIXED, 0);
ASSERT_NE(ptr_new, MAP_FAILED);
- ptr_new = mmap(&ptr[5 * page_size], page_size, PROT_READ | PROT_WRITE,
- MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
+ ptr_new = mmap_(self, variant, &ptr[5 * page_size], page_size,
+ PROT_READ | PROT_WRITE, MAP_FIXED, 0);
ASSERT_NE(ptr_new, MAP_FAILED);
- ptr_new = mmap(&ptr[8 * page_size], page_size, PROT_READ | PROT_WRITE,
- MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
+ ptr_new = mmap_(self, variant, &ptr[8 * page_size], page_size,
+ PROT_READ | PROT_WRITE, MAP_FIXED, 0);
ASSERT_NE(ptr_new, MAP_FAILED);
/* Now make sure guard pages are established. */
@@ -690,8 +792,8 @@ TEST_F(guard_regions, dontneed)
char *ptr;
int i;
- ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
- MAP_ANON | MAP_PRIVATE, -1, 0);
+ ptr = mmap_(self, variant, NULL, 10 * page_size,
+ PROT_READ | PROT_WRITE, 0, 0);
ASSERT_NE(ptr, MAP_FAILED);
/* Back the whole range. */
@@ -721,8 +823,16 @@ TEST_F(guard_regions, dontneed)
ASSERT_FALSE(result);
} else {
ASSERT_TRUE(result);
- /* Make sure we really did get reset to zero page. */
- ASSERT_EQ(*curr, '\0');
+ switch (variant->backing) {
+ case ANON_BACKED:
+ /* If anon, then we get a zero page. */
+ ASSERT_EQ(*curr, '\0');
+ break;
+ default:
+ /* Otherwise, we get the file data. */
+ ASSERT_EQ(*curr, 'y');
+ break;
+ }
}
/* Now write... */
@@ -743,8 +853,8 @@ TEST_F(guard_regions, mlock)
char *ptr;
int i;
- ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
- MAP_ANON | MAP_PRIVATE, -1, 0);
+ ptr = mmap_(self, variant, NULL, 10 * page_size,
+ PROT_READ | PROT_WRITE, 0, 0);
ASSERT_NE(ptr, MAP_FAILED);
/* Populate. */
@@ -816,8 +926,8 @@ TEST_F(guard_regions, mremap_move)
char *ptr, *ptr_new;
/* Map 5 pages. */
- ptr = mmap(NULL, 5 * page_size, PROT_READ | PROT_WRITE,
- MAP_ANON | MAP_PRIVATE, -1, 0);
+ ptr = mmap_(self, variant, NULL, 5 * page_size,
+ PROT_READ | PROT_WRITE, 0, 0);
ASSERT_NE(ptr, MAP_FAILED);
/* Place guard markers at both ends of the 5 page span. */
@@ -831,8 +941,7 @@ TEST_F(guard_regions, mremap_move)
/* Map a new region we will move this range into. Doing this ensures
* that we have reserved a range to map into.
*/
- ptr_new = mmap(NULL, 5 * page_size, PROT_NONE, MAP_ANON | MAP_PRIVATE,
- -1, 0);
+ ptr_new = mmap_(self, variant, NULL, 5 * page_size, PROT_NONE, 0, 0);
ASSERT_NE(ptr_new, MAP_FAILED);
ASSERT_EQ(mremap(ptr, 5 * page_size, 5 * page_size,
@@ -863,8 +972,8 @@ TEST_F(guard_regions, mremap_expand)
char *ptr, *ptr_new;
/* Map 10 pages... */
- ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
- MAP_ANON | MAP_PRIVATE, -1, 0);
+ ptr = mmap_(self, variant, NULL, 10 * page_size,
+ PROT_READ | PROT_WRITE, 0, 0);
ASSERT_NE(ptr, MAP_FAILED);
/* ...But unmap the last 5 so we can ensure we can expand into them. */
ASSERT_EQ(munmap(&ptr[5 * page_size], 5 * page_size), 0);
@@ -888,8 +997,7 @@ TEST_F(guard_regions, mremap_expand)
ASSERT_FALSE(try_read_write_buf(&ptr[4 * page_size]));
/* Reserve a region which we can move to and expand into. */
- ptr_new = mmap(NULL, 20 * page_size, PROT_NONE,
- MAP_ANON | MAP_PRIVATE, -1, 0);
+ ptr_new = mmap_(self, variant, NULL, 20 * page_size, PROT_NONE, 0, 0);
ASSERT_NE(ptr_new, MAP_FAILED);
/* Now move and expand into it. */
@@ -927,8 +1035,8 @@ TEST_F(guard_regions, mremap_shrink)
int i;
/* Map 5 pages. */
- ptr = mmap(NULL, 5 * page_size, PROT_READ | PROT_WRITE,
- MAP_ANON | MAP_PRIVATE, -1, 0);
+ ptr = mmap_(self, variant, NULL, 5 * page_size,
+ PROT_READ | PROT_WRITE, 0, 0);
ASSERT_NE(ptr, MAP_FAILED);
/* Place guard markers at both ends of the 5 page span. */
@@ -992,8 +1100,8 @@ TEST_F(guard_regions, fork)
int i;
/* Map 10 pages. */
- ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
- MAP_ANON | MAP_PRIVATE, -1, 0);
+ ptr = mmap_(self, variant, NULL, 10 * page_size,
+ PROT_READ | PROT_WRITE, 0, 0);
ASSERT_NE(ptr, MAP_FAILED);
/* Establish guard pages in the first 5 pages. */
@@ -1046,9 +1154,12 @@ TEST_F(guard_regions, fork_cow)
pid_t pid;
int i;
+ if (variant->backing != ANON_BACKED)
+ SKIP(return, "CoW only supported on anon mappings");
+
/* Map 10 pages. */
- ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
- MAP_ANON | MAP_PRIVATE, -1, 0);
+ ptr = mmap_(self, variant, NULL, 10 * page_size,
+ PROT_READ | PROT_WRITE, 0, 0);
ASSERT_NE(ptr, MAP_FAILED);
/* Populate range. */
@@ -1117,9 +1228,12 @@ TEST_F(guard_regions, fork_wipeonfork)
pid_t pid;
int i;
+ if (variant->backing != ANON_BACKED)
+ SKIP(return, "Wipe on fork only supported on anon mappings");
+
/* Map 10 pages. */
- ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
- MAP_ANON | MAP_PRIVATE, -1, 0);
+ ptr = mmap_(self, variant, NULL, 10 * page_size,
+ PROT_READ | PROT_WRITE, 0, 0);
ASSERT_NE(ptr, MAP_FAILED);
/* Mark wipe on fork. */
@@ -1166,9 +1280,12 @@ TEST_F(guard_regions, lazyfree)
char *ptr;
int i;
+ if (variant->backing != ANON_BACKED)
+ SKIP(return, "MADV_FREE only supported on anon mappings");
+
/* Map 10 pages. */
- ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
- MAP_ANON | MAP_PRIVATE, -1, 0);
+ ptr = mmap_(self, variant, NULL, 10 * page_size,
+ PROT_READ | PROT_WRITE, 0, 0);
ASSERT_NE(ptr, MAP_FAILED);
/* Guard range. */
@@ -1202,8 +1319,8 @@ TEST_F(guard_regions, populate)
char *ptr;
/* Map 10 pages. */
- ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
- MAP_ANON | MAP_PRIVATE, -1, 0);
+ ptr = mmap_(self, variant, NULL, 10 * page_size,
+ PROT_READ | PROT_WRITE, 0, 0);
ASSERT_NE(ptr, MAP_FAILED);
/* Guard range. */
@@ -1229,8 +1346,8 @@ TEST_F(guard_regions, cold_pageout)
int i;
/* Map 10 pages. */
- ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
- MAP_ANON | MAP_PRIVATE, -1, 0);
+ ptr = mmap_(self, variant, NULL, 10 * page_size,
+ PROT_READ | PROT_WRITE, 0, 0);
ASSERT_NE(ptr, MAP_FAILED);
/* Guard range. */
@@ -1281,6 +1398,9 @@ TEST_F(guard_regions, uffd)
struct uffdio_register reg;
struct uffdio_range range;
+ if (!is_anon_backed(variant))
+ SKIP(return, "uffd only works on anon backing");
+
/* Set up uffd. */
uffd = userfaultfd(0);
if (uffd == -1 && errno == EPERM)
@@ -1290,8 +1410,8 @@ TEST_F(guard_regions, uffd)
ASSERT_EQ(ioctl(uffd, UFFDIO_API, &api), 0);
/* Map 10 pages. */
- ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
- MAP_ANON | MAP_PRIVATE, -1, 0);
+ ptr = mmap_(self, variant, NULL, 10 * page_size,
+ PROT_READ | PROT_WRITE, 0, 0);
ASSERT_NE(ptr, MAP_FAILED);
/* Register the range with uffd. */
--
2.48.1
* [PATCH 4/4] tools/selftests: add file/shmem-backed mapping guard region tests
2025-02-13 18:16 [PATCH 0/4] mm: permit guard regions for file-backed/shmem mappings Lorenzo Stoakes
` (2 preceding siblings ...)
2025-02-13 18:17 ` [PATCH 3/4] tools/selftests: expand all guard region tests to file-backed Lorenzo Stoakes
@ 2025-02-13 18:17 ` Lorenzo Stoakes
2025-02-18 14:18 ` Vlastimil Babka
2025-02-18 12:12 ` [PATCH 0/4] mm: permit guard regions for file-backed/shmem mappings Vlastimil Babka
2025-02-19 8:25 ` Kalesh Singh
5 siblings, 1 reply; 64+ messages in thread
From: Lorenzo Stoakes @ 2025-02-13 18:17 UTC (permalink / raw)
To: Andrew Morton
Cc: Suren Baghdasaryan, Liam R . Howlett, Matthew Wilcox,
Vlastimil Babka, Paul E . McKenney, Jann Horn, David Hildenbrand,
linux-mm, linux-kernel, Shuah Khan, linux-kselftest, linux-api,
John Hubbard, Juan Yescas, Kalesh Singh
Extend the guard region self tests to explicitly assert that guard regions
work correctly for functionality specific to file-backed and shmem
mappings.
In addition to running all of the guard region functionality currently
tested against anonymous mappings against file-backed and shmem mappings as
well (except tests which are exclusive to anonymous mappings), we now also:
* Test that MADV_SEQUENTIAL does not cause unexpected readahead behaviour.
* Test that MAP_PRIVATE behaves as expected with guard regions installed in
both a shared and private mapping of an fd.
* Test that a read-only file can correctly establish guard regions.
* Test a probable fault-around case does not interfere with guard regions
(or vice-versa).
* Test that truncation does not eliminate guard regions.
* Test that hole punching functions as expected in the presence of guard
regions.
* Test that a read-only mapping of a write-sealed memfd can have guard
regions established within it, and that these function correctly without
violating the seal.
* Test that guard regions installed into a mapping of the anonymous zero
page function correctly.
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
tools/testing/selftests/mm/guard-regions.c | 595 +++++++++++++++++++++
1 file changed, 595 insertions(+)
diff --git a/tools/testing/selftests/mm/guard-regions.c b/tools/testing/selftests/mm/guard-regions.c
index 0469c783f4fa..ea9b5815e828 100644
--- a/tools/testing/selftests/mm/guard-regions.c
+++ b/tools/testing/selftests/mm/guard-regions.c
@@ -216,6 +216,58 @@ static int open_file(const char *prefix, char *path)
return fd;
}
+/* Establish a varying pattern in a buffer. */
+static void set_pattern(char *ptr, size_t num_pages, size_t page_size)
+{
+ size_t i;
+
+ for (i = 0; i < num_pages; i++) {
+ char *ptr2 = &ptr[i * page_size];
+
+ memset(ptr2, 'a' + (i % 26), page_size);
+ }
+}
+
+/*
+ * Check that a buffer contains the pattern set by set_pattern(), starting at a
+ * page offset of pgoff within the buffer.
+ */
+static bool check_pattern_offset(char *ptr, size_t num_pages, size_t page_size,
+ size_t pgoff)
+{
+ size_t i;
+
+ for (i = 0; i < num_pages * page_size; i++) {
+ size_t offset = pgoff * page_size + i;
+ char actual = ptr[offset];
+ char expected = 'a' + ((offset / page_size) % 26);
+
+ if (actual != expected)
+ return false;
+ }
+
+ return true;
+}
+
+/* Check that a buffer contains the pattern set by set_pattern(). */
+static bool check_pattern(char *ptr, size_t num_pages, size_t page_size)
+{
+ return check_pattern_offset(ptr, num_pages, page_size, 0);
+}
+
+/* Determine if a buffer contains only repetitions of a specified char. */
+static bool is_buf_eq(char *buf, size_t size, char chr)
+{
+ size_t i;
+
+ for (i = 0; i < size; i++) {
+ if (buf[i] != chr)
+ return false;
+ }
+
+ return true;
+}
+
FIXTURE_SETUP(guard_regions)
{
self->page_size = (unsigned long)sysconf(_SC_PAGESIZE);
@@ -1437,4 +1489,547 @@ TEST_F(guard_regions, uffd)
ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
}
+/*
+ * Mark a region within a file-backed mapping using MADV_SEQUENTIAL so we
+ * aggressively read-ahead, then install guard regions and assert that it
+ * behaves correctly.
+ *
+ * We page out using MADV_PAGEOUT before checking guard regions so we drop page
+ * cache folios, meaning we maximise the possibility of some broken readahead.
+ */
+TEST_F(guard_regions, madvise_sequential)
+{
+ char *ptr;
+ int i;
+ const unsigned long page_size = self->page_size;
+
+ if (variant->backing == ANON_BACKED)
+ SKIP(return, "MADV_SEQUENTIAL meaningful only for file-backed");
+
+ ptr = mmap_(self, variant, NULL, 10 * page_size,
+ PROT_READ | PROT_WRITE, 0, 0);
+ ASSERT_NE(ptr, MAP_FAILED);
+
+ /* Establish a pattern of data in the file. */
+ set_pattern(ptr, 10, page_size);
+ ASSERT_TRUE(check_pattern(ptr, 10, page_size));
+
+ /* Mark it as being accessed sequentially. */
+ ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_SEQUENTIAL), 0);
+
+ /* Mark every other page a guard page. */
+ for (i = 0; i < 10; i += 2) {
+ char *ptr2 = &ptr[i * page_size];
+
+ ASSERT_EQ(madvise(ptr2, page_size, MADV_GUARD_INSTALL), 0);
+ }
+
+ /* Now page it out. */
+ ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_PAGEOUT), 0);
+
+ /* Now make sure pages are as expected. */
+ for (i = 0; i < 10; i++) {
+ char *chrp = &ptr[i * page_size];
+
+ if (i % 2 == 0) {
+ bool result = try_read_write_buf(chrp);
+
+ ASSERT_FALSE(result);
+ } else {
+ ASSERT_EQ(*chrp, 'a' + i);
+ }
+ }
+
+ /* Now remove guard pages. */
+ ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_REMOVE), 0);
+
+ /* Now make sure all data is as expected. */
+ if (!check_pattern(ptr, 10, page_size))
+ ASSERT_TRUE(false);
+
+ ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
+}
+
+/*
+ * Check that file-backed mappings implement guard regions with MAP_PRIVATE
+ * correctly.
+ */
+TEST_F(guard_regions, map_private)
+{
+ const unsigned long page_size = self->page_size;
+ char *ptr_shared, *ptr_private;
+ int i;
+
+ if (variant->backing == ANON_BACKED)
+ SKIP(return, "MAP_PRIVATE test specific to file-backed");
+
+ ptr_shared = mmap_(self, variant, NULL, 10 * page_size, PROT_READ | PROT_WRITE, 0, 0);
+ ASSERT_NE(ptr_shared, MAP_FAILED);
+
+ /* Manually mmap(), do not use mmap_() wrapper so we can force MAP_PRIVATE. */
+ ptr_private = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE, MAP_PRIVATE, self->fd, 0);
+ ASSERT_NE(ptr_private, MAP_FAILED);
+
+ /* Set pattern in shared mapping. */
+ set_pattern(ptr_shared, 10, page_size);
+
+ /* Install guard regions in every other page in the shared mapping. */
+ for (i = 0; i < 10; i += 2) {
+ char *ptr = &ptr_shared[i * page_size];
+
+ ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_INSTALL), 0);
+ }
+
+ for (i = 0; i < 10; i++) {
+ /* Every even shared page should be guarded. */
+ ASSERT_EQ(try_read_buf(&ptr_shared[i * page_size]), i % 2 != 0);
+ /* Private mappings should always be readable. */
+ ASSERT_TRUE(try_read_buf(&ptr_private[i * page_size]));
+ }
+
+ /* Install guard regions in every other page in the private mapping. */
+ for (i = 0; i < 10; i += 2) {
+ char *ptr = &ptr_private[i * page_size];
+
+ ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_INSTALL), 0);
+ }
+
+ for (i = 0; i < 10; i++) {
+ /* Every even shared page should be guarded. */
+ ASSERT_EQ(try_read_buf(&ptr_shared[i * page_size]), i % 2 != 0);
+ /* Every odd private page should be guarded. */
+ ASSERT_EQ(try_read_buf(&ptr_private[i * page_size]), i % 2 != 0);
+ }
+
+ /* Remove guard regions from shared mapping. */
+ ASSERT_EQ(madvise(ptr_shared, 10 * page_size, MADV_GUARD_REMOVE), 0);
+
+ for (i = 0; i < 10; i++) {
+ /* Shared mappings should always be readable. */
+ ASSERT_TRUE(try_read_buf(&ptr_shared[i * page_size]));
+ /* Every even private page should be guarded. */
+ ASSERT_EQ(try_read_buf(&ptr_private[i * page_size]), i % 2 != 0);
+ }
+
+ /* Remove guard regions from private mapping. */
+ ASSERT_EQ(madvise(ptr_private, 10 * page_size, MADV_GUARD_REMOVE), 0);
+
+ for (i = 0; i < 10; i++) {
+ /* Shared mappings should always be readable. */
+ ASSERT_TRUE(try_read_buf(&ptr_shared[i * page_size]));
+ /* Private mappings should always be readable. */
+ ASSERT_TRUE(try_read_buf(&ptr_private[i * page_size]));
+ }
+
+ /* Ensure patterns are intact. */
+ ASSERT_TRUE(check_pattern(ptr_shared, 10, page_size));
+ ASSERT_TRUE(check_pattern(ptr_private, 10, page_size));
+
+ /* Now write out every other page to MAP_PRIVATE. */
+ for (i = 0; i < 10; i += 2) {
+ char *ptr = &ptr_private[i * page_size];
+
+ memset(ptr, 'a' + i, page_size);
+ }
+
+ /*
+ * At this point the mapping is:
+ *
+ * 0123456789
+ * SPSPSPSPSP
+ *
+ * Where S = shared, P = private mappings.
+ */
+
+ /* Now mark the beginning of the mapping guarded. */
+ ASSERT_EQ(madvise(ptr_private, 5 * page_size, MADV_GUARD_INSTALL), 0);
+
+ /*
+ * This renders the mapping:
+ *
+ * 0123456789
+ * xxxxxPSPSP
+ */
+
+ for (i = 0; i < 10; i++) {
+ char *ptr = &ptr_private[i * page_size];
+
+ /* Ensure guard regions as expected. */
+ ASSERT_EQ(try_read_buf(ptr), i >= 5);
+ /* The shared mapping should always succeed. */
+ ASSERT_TRUE(try_read_buf(&ptr_shared[i * page_size]));
+ }
+
+ /* Remove the guard regions altogether. */
+ ASSERT_EQ(madvise(ptr_private, 10 * page_size, MADV_GUARD_REMOVE), 0);
+
+ /*
+ *
+ * We now expect the mapping to be:
+ *
+ * 0123456789
+ * SSSSSPSPSP
+ *
+ * As we removed guard regions, the private pages from the first 5 will
+ * have been zapped, so on fault will reestablish the shared mapping.
+ */
+
+ for (i = 0; i < 10; i++) {
+ char *ptr = &ptr_private[i * page_size];
+
+ /*
+ * Assert that shared mappings in the MAP_PRIVATE mapping match
+ * the shared mapping.
+ */
+ if (i < 5 || i % 2 == 0) {
+ char *ptr_s = &ptr_shared[i * page_size];
+
+ ASSERT_EQ(memcmp(ptr, ptr_s, page_size), 0);
+ continue;
+ }
+
+ /* Everything else is a private mapping. */
+ ASSERT_TRUE(is_buf_eq(ptr, page_size, 'a' + i));
+ }
+
+ ASSERT_EQ(munmap(ptr_shared, 10 * page_size), 0);
+ ASSERT_EQ(munmap(ptr_private, 10 * page_size), 0);
+}
+
+/* Test that guard regions established over a read-only mapping function correctly. */
+TEST_F(guard_regions, readonly_file)
+{
+ const unsigned long page_size = self->page_size;
+ char *ptr;
+ int i;
+
+ if (variant->backing == ANON_BACKED)
+ SKIP(return, "Read-only test specific to file-backed");
+
+ /* Map shared so we can populate with pattern, populate it, unmap. */
+ ptr = mmap_(self, variant, NULL, 10 * page_size,
+ PROT_READ | PROT_WRITE, 0, 0);
+ ASSERT_NE(ptr, MAP_FAILED);
+ set_pattern(ptr, 10, page_size);
+ ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
+ /* Close the fd so we can re-open read-only. */
+ ASSERT_EQ(close(self->fd), 0);
+
+ /* Re-open read-only. */
+ self->fd = open(self->path, O_RDONLY);
+ ASSERT_NE(self->fd, -1);
+ /* Re-map read-only. */
+ ptr = mmap_(self, variant, NULL, 10 * page_size, PROT_READ, 0, 0);
+ ASSERT_NE(ptr, MAP_FAILED);
+
+ /* Mark every other page guarded. */
+ for (i = 0; i < 10; i += 2) {
+ char *ptr_pg = &ptr[i * page_size];
+
+ ASSERT_EQ(madvise(ptr_pg, page_size, MADV_GUARD_INSTALL), 0);
+ }
+
+ /* Assert that the guard regions are in place.*/
+ for (i = 0; i < 10; i++) {
+ char *ptr_pg = &ptr[i * page_size];
+
+ ASSERT_EQ(try_read_buf(ptr_pg), i % 2 != 0);
+ }
+
+ /* Remove guard regions. */
+ ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_REMOVE), 0);
+
+ /* Ensure the data is as expected. */
+ ASSERT_TRUE(check_pattern(ptr, 10, page_size));
+
+ ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
+}
+
+TEST_F(guard_regions, fault_around)
+{
+ const unsigned long page_size = self->page_size;
+ char *ptr;
+ int i;
+
+ if (variant->backing == ANON_BACKED)
+ SKIP(return, "Fault-around test specific to file-backed");
+
+ ptr = mmap_(self, variant, NULL, 10 * page_size,
+ PROT_READ | PROT_WRITE, 0, 0);
+ ASSERT_NE(ptr, MAP_FAILED);
+
+ /* Establish a pattern in the backing file. */
+ set_pattern(ptr, 10, page_size);
+
+ /*
+ * Now drop it from the page cache so we get major faults when next we
+ * map it.
+ */
+ ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_PAGEOUT), 0);
+
+ /* Unmap and remap 'to be sure'. */
+ ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
+ ptr = mmap_(self, variant, NULL, 10 * page_size,
+ PROT_READ | PROT_WRITE, 0, 0);
+ ASSERT_NE(ptr, MAP_FAILED);
+
+ /* Now make every even page guarded. */
+ for (i = 0; i < 10; i += 2) {
+ char *ptr_p = &ptr[i * page_size];
+
+ ASSERT_EQ(madvise(ptr_p, page_size, MADV_GUARD_INSTALL), 0);
+ }
+
+ /* Now fault in every odd page. This should trigger fault-around. */
+ for (i = 1; i < 10; i += 2) {
+ char *ptr_p = &ptr[i * page_size];
+
+ ASSERT_TRUE(try_read_buf(ptr_p));
+ }
+
+ /* Finally, ensure that guard regions are intact as expected. */
+ for (i = 0; i < 10; i++) {
+ char *ptr_p = &ptr[i * page_size];
+
+ ASSERT_EQ(try_read_buf(ptr_p), i % 2 != 0);
+ }
+
+ ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
+}
+
+TEST_F(guard_regions, truncation)
+{
+ const unsigned long page_size = self->page_size;
+ char *ptr;
+ int i;
+
+ if (variant->backing == ANON_BACKED)
+ SKIP(return, "Truncation test specific to file-backed");
+
+ ptr = mmap_(self, variant, NULL, 10 * page_size,
+ PROT_READ | PROT_WRITE, 0, 0);
+ ASSERT_NE(ptr, MAP_FAILED);
+
+ /*
+ * Establish a pattern in the backing file, just so there is data
+ * there.
+ */
+ set_pattern(ptr, 10, page_size);
+
+ /* Now make every even page guarded. */
+ for (i = 0; i < 10; i += 2) {
+ char *ptr_p = &ptr[i * page_size];
+
+ ASSERT_EQ(madvise(ptr_p, page_size, MADV_GUARD_INSTALL), 0);
+ }
+
+ /* Now assert things are as expected. */
+ for (i = 0; i < 10; i++) {
+ char *ptr_p = &ptr[i * page_size];
+
+ ASSERT_EQ(try_read_write_buf(ptr_p), i % 2 != 0);
+ }
+
+ /* Now truncate to actually used size (initialised to 100). */
+ ASSERT_EQ(ftruncate(self->fd, 10 * page_size), 0);
+
+ /* Here the guard regions will remain intact. */
+ for (i = 0; i < 10; i++) {
+ char *ptr_p = &ptr[i * page_size];
+
+ ASSERT_EQ(try_read_write_buf(ptr_p), i % 2 != 0);
+ }
+
+ /* Now truncate to half the size, then truncate again to the full size. */
+ ASSERT_EQ(ftruncate(self->fd, 5 * page_size), 0);
+ ASSERT_EQ(ftruncate(self->fd, 10 * page_size), 0);
+
+ /* Again, guard pages will remain intact. */
+ for (i = 0; i < 10; i++) {
+ char *ptr_p = &ptr[i * page_size];
+
+ ASSERT_EQ(try_read_write_buf(ptr_p), i % 2 != 0);
+ }
+
+ ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
+}
+
+TEST_F(guard_regions, hole_punch)
+{
+ const unsigned long page_size = self->page_size;
+ char *ptr;
+ int i;
+
+ if (variant->backing == ANON_BACKED)
+ SKIP(return, "Truncation test specific to file-backed");
+
+ /* Establish pattern in mapping. */
+ ptr = mmap_(self, variant, NULL, 10 * page_size,
+ PROT_READ | PROT_WRITE, 0, 0);
+ ASSERT_NE(ptr, MAP_FAILED);
+ set_pattern(ptr, 10, page_size);
+
+ /* Install a guard region in the middle of the mapping. */
+ ASSERT_EQ(madvise(&ptr[3 * page_size], 4 * page_size,
+ MADV_GUARD_INSTALL), 0);
+
+ /*
+ * The buffer will now be:
+ *
+ * 0123456789
+ * ***xxxx***
+ *
+ * Where * is data and x is the guard region.
+ */
+
+ /* Ensure established. */
+ for (i = 0; i < 10; i++) {
+ char *ptr_p = &ptr[i * page_size];
+
+ ASSERT_EQ(try_read_buf(ptr_p), i < 3 || i >= 7);
+ }
+
+ /* Now hole punch the guarded region. */
+ ASSERT_EQ(madvise(&ptr[3 * page_size], 4 * page_size,
+ MADV_REMOVE), 0);
+
+ /* Ensure guard regions remain. */
+ for (i = 0; i < 10; i++) {
+ char *ptr_p = &ptr[i * page_size];
+
+ ASSERT_EQ(try_read_buf(ptr_p), i < 3 || i >= 7);
+ }
+
+ /* Now remove guard region throughout. */
+ ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_REMOVE), 0);
+
+ /* Check that the pattern exists in non-hole punched region. */
+ ASSERT_TRUE(check_pattern(ptr, 3, page_size));
+ /* Check that hole punched region is zeroed. */
+ ASSERT_TRUE(is_buf_eq(&ptr[3 * page_size], 4 * page_size, '\0'));
+ /* Check that the pattern exists in the remainder of the file. */
+ ASSERT_TRUE(check_pattern_offset(ptr, 3, page_size, 7));
+
+ ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
+}
+
+/*
+ * Ensure that a memfd works correctly with guard regions, that we can write
+ * seal it then open the mapping read-only and still establish guard regions
+ * within, remove those guard regions and have everything work correctly.
+ */
+TEST_F(guard_regions, memfd_write_seal)
+{
+ const unsigned long page_size = self->page_size;
+ char *ptr;
+ int i;
+
+ if (variant->backing != SHMEM_BACKED)
+ SKIP(return, "memfd write seal test specific to shmem");
+
+ /* OK, we need a memfd, so close existing one. */
+ ASSERT_EQ(close(self->fd), 0);
+
+ /* Create and truncate memfd. */
+ self->fd = memfd_create("guard_regions_memfd_seals_test",
+ MFD_ALLOW_SEALING);
+ ASSERT_NE(self->fd, -1);
+ ASSERT_EQ(ftruncate(self->fd, 10 * page_size), 0);
+
+ /* Map, set pattern, unmap. */
+ ptr = mmap_(self, variant, NULL, 10 * page_size, PROT_READ | PROT_WRITE, 0, 0);
+ ASSERT_NE(ptr, MAP_FAILED);
+ set_pattern(ptr, 10, page_size);
+ ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
+
+ /* Write-seal the memfd. */
+ ASSERT_EQ(fcntl(self->fd, F_ADD_SEALS, F_SEAL_WRITE), 0);
+
+ /* Now map the memfd readonly. */
+ ptr = mmap_(self, variant, NULL, 10 * page_size, PROT_READ, 0, 0);
+ ASSERT_NE(ptr, MAP_FAILED);
+
+ /* Ensure pattern is as expected. */
+ ASSERT_TRUE(check_pattern(ptr, 10, page_size));
+
+ /* Now make every even page guarded. */
+ for (i = 0; i < 10; i += 2) {
+ char *ptr_p = &ptr[i * page_size];
+
+ ASSERT_EQ(madvise(ptr_p, page_size, MADV_GUARD_INSTALL), 0);
+ }
+
+ /* Now assert things are as expected. */
+ for (i = 0; i < 10; i++) {
+ char *ptr_p = &ptr[i * page_size];
+
+ ASSERT_EQ(try_read_buf(ptr_p), i % 2 != 0);
+ }
+
+ /* Now remove guard regions. */
+ ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_REMOVE), 0);
+
+ /* Ensure pattern is as expected. */
+ ASSERT_TRUE(check_pattern(ptr, 10, page_size));
+
+ /* Ensure write seal intact. */
+ for (i = 0; i < 10; i++) {
+ char *ptr_p = &ptr[i * page_size];
+
+ ASSERT_FALSE(try_write_buf(ptr_p));
+ }
+
+ ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
+}
+
+
+/*
+ * Since we are now permitted to establish guard regions in read-only anonymous
+ * mappings, for the sake of thoroughness, though it probably has no practical
+ * use, test that guard regions function with a mapping to the anonymous zero
+ * page.
+ */
+TEST_F(guard_regions, anon_zeropage)
+{
+ const unsigned long page_size = self->page_size;
+ char *ptr;
+ int i;
+
+ if (!is_anon_backed(variant))
+ SKIP(return, "anon zero page test specific to anon/shmem");
+
+ /* Obtain a read-only i.e. anon zero page mapping. */
+ ptr = mmap_(self, variant, NULL, 10 * page_size, PROT_READ, 0, 0);
+ ASSERT_NE(ptr, MAP_FAILED);
+
+ /* Now make every even page guarded. */
+ for (i = 0; i < 10; i += 2) {
+ char *ptr_p = &ptr[i * page_size];
+
+ ASSERT_EQ(madvise(ptr_p, page_size, MADV_GUARD_INSTALL), 0);
+ }
+
+ /* Now assert things are as expected. */
+ for (i = 0; i < 10; i++) {
+ char *ptr_p = &ptr[i * page_size];
+
+ ASSERT_EQ(try_read_buf(ptr_p), i % 2 != 0);
+ }
+
+ /* Now remove all guard regions. */
+ ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_REMOVE), 0);
+
+ /* Now assert things are as expected. */
+ for (i = 0; i < 10; i++) {
+ char *ptr_p = &ptr[i * page_size];
+
+ ASSERT_TRUE(try_read_buf(ptr_p));
+ }
+
+ /* Ensure zero page...*/
+ ASSERT_TRUE(is_buf_eq(ptr, 10 * page_size, '\0'));
+
+ ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
+}
+
TEST_HARNESS_MAIN
--
2.48.1
* Re: [PATCH 0/4] mm: permit guard regions for file-backed/shmem mappings
2025-02-13 18:16 [PATCH 0/4] mm: permit guard regions for file-backed/shmem mappings Lorenzo Stoakes
` (3 preceding siblings ...)
2025-02-13 18:17 ` [PATCH 4/4] tools/selftests: add file/shmem-backed mapping guard region tests Lorenzo Stoakes
@ 2025-02-18 12:12 ` Vlastimil Babka
2025-02-18 13:05 ` Lorenzo Stoakes
2025-02-19 8:25 ` Kalesh Singh
5 siblings, 1 reply; 64+ messages in thread
From: Vlastimil Babka @ 2025-02-18 12:12 UTC (permalink / raw)
To: Lorenzo Stoakes, Andrew Morton
Cc: Suren Baghdasaryan, Liam R . Howlett, Matthew Wilcox,
Paul E . McKenney, Jann Horn, David Hildenbrand, linux-mm,
linux-kernel, Shuah Khan, linux-kselftest, linux-api,
John Hubbard, Juan Yescas, Kalesh Singh
On 2/13/25 19:16, Lorenzo Stoakes wrote:
> The guard regions feature was initially implemented to support anonymous
> mappings only, excluding shmem.
>
> This was done such as to introduce the feature carefully and incrementally
> and to be conservative when considering the various caveats and corner
> cases that are applicable to file-backed mappings but not to anonymous
> ones.
>
> Now this feature has landed in 6.13, it is time to revisit this and to
> extend this functionality to file-backed and shmem mappings.
>
> In order to make this maximally useful, and since one may map file-backed
> mappings read-only (for instance ELF images), we also remove the
> restriction on read-only mappings and permit the establishment of guard
> regions in any non-hugetlb, non-mlock()'d mapping.
Do you plan to address mlock later too? I guess somebody might find use for
those. Is there some fundamental issue or just that we need to define some
good semantics for corner cases? (i.e. if pages are already populated in the
mlocked area, discarding them by replacing with guard pte's goes against
that, so do we allow it or not?).
Otherwise nice discussion of all the potential issues here!
* Re: [PATCH 0/4] mm: permit guard regions for file-backed/shmem mappings
2025-02-18 12:12 ` [PATCH 0/4] mm: permit guard regions for file-backed/shmem mappings Vlastimil Babka
@ 2025-02-18 13:05 ` Lorenzo Stoakes
2025-02-18 14:35 ` David Hildenbrand
0 siblings, 1 reply; 64+ messages in thread
From: Lorenzo Stoakes @ 2025-02-18 13:05 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Andrew Morton, Suren Baghdasaryan, Liam R . Howlett,
Matthew Wilcox, Paul E . McKenney, Jann Horn, David Hildenbrand,
linux-mm, linux-kernel, Shuah Khan, linux-kselftest, linux-api,
John Hubbard, Juan Yescas, Kalesh Singh
On Tue, Feb 18, 2025 at 01:12:05PM +0100, Vlastimil Babka wrote:
> On 2/13/25 19:16, Lorenzo Stoakes wrote:
> > The guard regions feature was initially implemented to support anonymous
> > mappings only, excluding shmem.
> >
> > This was done such as to introduce the feature carefully and incrementally
> > and to be conservative when considering the various caveats and corner
> > cases that are applicable to file-backed mappings but not to anonymous
> > ones.
> >
> > Now this feature has landed in 6.13, it is time to revisit this and to
> > extend this functionality to file-backed and shmem mappings.
> >
> > In order to make this maximally useful, and since one may map file-backed
> > mappings read-only (for instance ELF images), we also remove the
> > restriction on read-only mappings and permit the establishment of guard
> > regions in any non-hugetlb, non-mlock()'d mapping.
>
> Do you plan to address mlock later too? I guess somebody might find use for
> those. Is there some fundamental issue or just that we need to define some
> good semantics for corner cases? (i.e. if pages are already populated in the
> mlocked area, discarding them by replacing with guard pte's goes against
> that, so do we allow it or not?).
Yeah, that's the fundamental issue with mlock: it does not interact with the
zapping part of MADV_GUARD_INSTALL, and that is why we disallow it (but not so
for MADV_GUARD_REMOVE, since if a VMA that contains guard regions is locked
_afterwards_ there will be no zapping).
We could potentially expose an equivalent, as exists for other flags: a
_LOCKED variant of the madvise() advice, like MADV_GUARD_INSTALL_LOCKED, to
make this explicit.
That is probably the most sensible option, if there is a need for this!
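To sketch what I mean (purely hypothetical - no such advice value exists, and
the call site below is illustrative only, reusing the is_valid_guard_vma()
helper this series touches):

	/*
	 * Hypothetical sketch only: MADV_GUARD_INSTALL_LOCKED is not a real
	 * advice value, and this call site is illustrative. The idea is simply
	 * that the _LOCKED advice selects allow_locked, mirroring how
	 * MADV_DONTNEED_LOCKED relaxes the VM_LOCKED restriction.
	 */
	static long madvise_guard_install_gate(struct vm_area_struct *vma,
					       int behavior)
	{
		bool allow_locked = behavior == MADV_GUARD_INSTALL_LOCKED;

		if (!is_valid_guard_vma(vma, allow_locked))
			return -EINVAL;
		/* ... install the guard markers exactly as today ... */
		return 0;
	}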
>
> Otherwise nice discussion of all the potential issues here!
>
Thanks! :)
* Re: [PATCH 1/4] mm: allow guard regions in file-backed and read-only mappings
2025-02-13 18:17 ` [PATCH 1/4] mm: allow guard regions in file-backed and read-only mappings Lorenzo Stoakes
@ 2025-02-18 14:15 ` Vlastimil Babka
2025-02-18 16:01 ` David Hildenbrand
2025-02-24 14:02 ` Lorenzo Stoakes
2 siblings, 0 replies; 64+ messages in thread
From: Vlastimil Babka @ 2025-02-18 14:15 UTC (permalink / raw)
To: Lorenzo Stoakes, Andrew Morton
Cc: Suren Baghdasaryan, Liam R . Howlett, Matthew Wilcox,
Paul E . McKenney, Jann Horn, David Hildenbrand, linux-mm,
linux-kernel, Shuah Khan, linux-kselftest, linux-api,
John Hubbard, Juan Yescas, Kalesh Singh
On 2/13/25 19:17, Lorenzo Stoakes wrote:
> There is no reason to disallow guard regions in file-backed mappings -
> readahead and fault-around both function correctly in the presence of PTE
> markers, equally other operations relating to memory-mapped files function
> correctly.
>
> Additionally, read-only mappings if introducing guard-regions, only
> restrict the mapping further, which means there is no violation of any
> access rights by permitting this to be so.
>
> Removing this restriction allows for read-only mapped files (such as
> executable files) correctly which would otherwise not be permitted.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> ---
> mm/madvise.c | 8 +-------
> 1 file changed, 1 insertion(+), 7 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 6ecead476a80..e01e93e179a8 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1051,13 +1051,7 @@ static bool is_valid_guard_vma(struct vm_area_struct *vma, bool allow_locked)
> if (!allow_locked)
> disallowed |= VM_LOCKED;
>
> - if (!vma_is_anonymous(vma))
> - return false;
> -
> - if ((vma->vm_flags & (VM_MAYWRITE | disallowed)) != VM_MAYWRITE)
> - return false;
> -
> - return true;
> + return !(vma->vm_flags & disallowed);
> }
>
> static bool is_guard_pte_marker(pte_t ptent)
* Re: [PATCH 2/4] selftests/mm: rename guard-pages to guard-regions
2025-02-13 18:17 ` [PATCH 2/4] selftests/mm: rename guard-pages to guard-regions Lorenzo Stoakes
@ 2025-02-18 14:15 ` Vlastimil Babka
2025-03-02 8:35 ` Lorenzo Stoakes
1 sibling, 0 replies; 64+ messages in thread
From: Vlastimil Babka @ 2025-02-18 14:15 UTC (permalink / raw)
To: Lorenzo Stoakes, Andrew Morton
Cc: Suren Baghdasaryan, Liam R . Howlett, Matthew Wilcox,
Paul E . McKenney, Jann Horn, David Hildenbrand, linux-mm,
linux-kernel, Shuah Khan, linux-kselftest, linux-api,
John Hubbard, Juan Yescas, Kalesh Singh
On 2/13/25 19:17, Lorenzo Stoakes wrote:
> The feature formerly referred to as guard pages is more correctly referred
> to as 'guard regions', as in fact no pages are ever allocated in the
> process of installing the regions.
>
> To avoid confusion, rename the tests accordingly.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
> ---
> tools/testing/selftests/mm/.gitignore | 2 +-
> tools/testing/selftests/mm/Makefile | 2 +-
> .../mm/{guard-pages.c => guard-regions.c} | 42 +++++++++----------
> 3 files changed, 23 insertions(+), 23 deletions(-)
> rename tools/testing/selftests/mm/{guard-pages.c => guard-regions.c} (98%)
>
> diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore
> index 121000c28c10..c5241b193db8 100644
> --- a/tools/testing/selftests/mm/.gitignore
> +++ b/tools/testing/selftests/mm/.gitignore
> @@ -57,4 +57,4 @@ droppable
> hugetlb_dio
> pkey_sighandler_tests_32
> pkey_sighandler_tests_64
> -guard-pages
> +guard-regions
> diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
> index 63ce39d024bb..8270895039d1 100644
> --- a/tools/testing/selftests/mm/Makefile
> +++ b/tools/testing/selftests/mm/Makefile
> @@ -97,7 +97,7 @@ TEST_GEN_FILES += hugetlb_fault_after_madv
> TEST_GEN_FILES += hugetlb_madv_vs_map
> TEST_GEN_FILES += hugetlb_dio
> TEST_GEN_FILES += droppable
> -TEST_GEN_FILES += guard-pages
> +TEST_GEN_FILES += guard-regions
>
> ifneq ($(ARCH),arm64)
> TEST_GEN_FILES += soft-dirty
> diff --git a/tools/testing/selftests/mm/guard-pages.c b/tools/testing/selftests/mm/guard-regions.c
> similarity index 98%
> rename from tools/testing/selftests/mm/guard-pages.c
> rename to tools/testing/selftests/mm/guard-regions.c
> index ece37212a8a2..7a41cf9ffbdf 100644
> --- a/tools/testing/selftests/mm/guard-pages.c
> +++ b/tools/testing/selftests/mm/guard-regions.c
> @@ -107,12 +107,12 @@ static bool try_read_write_buf(char *ptr)
> return try_read_buf(ptr) && try_write_buf(ptr);
> }
>
> -FIXTURE(guard_pages)
> +FIXTURE(guard_regions)
> {
> unsigned long page_size;
> };
>
> -FIXTURE_SETUP(guard_pages)
> +FIXTURE_SETUP(guard_regions)
> {
> struct sigaction act = {
> .sa_handler = &handle_fatal,
> @@ -126,7 +126,7 @@ FIXTURE_SETUP(guard_pages)
> self->page_size = (unsigned long)sysconf(_SC_PAGESIZE);
> };
>
> -FIXTURE_TEARDOWN(guard_pages)
> +FIXTURE_TEARDOWN(guard_regions)
> {
> struct sigaction act = {
> .sa_handler = SIG_DFL,
> @@ -137,7 +137,7 @@ FIXTURE_TEARDOWN(guard_pages)
> sigaction(SIGSEGV, &act, NULL);
> }
>
> -TEST_F(guard_pages, basic)
> +TEST_F(guard_regions, basic)
> {
> const unsigned long NUM_PAGES = 10;
> const unsigned long page_size = self->page_size;
> @@ -231,7 +231,7 @@ TEST_F(guard_pages, basic)
> }
>
> /* Assert that operations applied across multiple VMAs work as expected. */
> -TEST_F(guard_pages, multi_vma)
> +TEST_F(guard_regions, multi_vma)
> {
> const unsigned long page_size = self->page_size;
> char *ptr_region, *ptr, *ptr1, *ptr2, *ptr3;
> @@ -367,7 +367,7 @@ TEST_F(guard_pages, multi_vma)
> * Assert that batched operations performed using process_madvise() work as
> * expected.
> */
> -TEST_F(guard_pages, process_madvise)
> +TEST_F(guard_regions, process_madvise)
> {
> const unsigned long page_size = self->page_size;
> pid_t pid = getpid();
> @@ -467,7 +467,7 @@ TEST_F(guard_pages, process_madvise)
> }
>
> /* Assert that unmapping ranges does not leave guard markers behind. */
> -TEST_F(guard_pages, munmap)
> +TEST_F(guard_regions, munmap)
> {
> const unsigned long page_size = self->page_size;
> char *ptr, *ptr_new1, *ptr_new2;
> @@ -505,7 +505,7 @@ TEST_F(guard_pages, munmap)
> }
>
> /* Assert that mprotect() operations have no bearing on guard markers. */
> -TEST_F(guard_pages, mprotect)
> +TEST_F(guard_regions, mprotect)
> {
> const unsigned long page_size = self->page_size;
> char *ptr;
> @@ -553,7 +553,7 @@ TEST_F(guard_pages, mprotect)
> }
>
> /* Split and merge VMAs and make sure guard pages still behave. */
> -TEST_F(guard_pages, split_merge)
> +TEST_F(guard_regions, split_merge)
> {
> const unsigned long page_size = self->page_size;
> char *ptr, *ptr_new;
> @@ -684,7 +684,7 @@ TEST_F(guard_pages, split_merge)
> }
>
> /* Assert that MADV_DONTNEED does not remove guard markers. */
> -TEST_F(guard_pages, dontneed)
> +TEST_F(guard_regions, dontneed)
> {
> const unsigned long page_size = self->page_size;
> char *ptr;
> @@ -737,7 +737,7 @@ TEST_F(guard_pages, dontneed)
> }
>
> /* Assert that mlock()'ed pages work correctly with guard markers. */
> -TEST_F(guard_pages, mlock)
> +TEST_F(guard_regions, mlock)
> {
> const unsigned long page_size = self->page_size;
> char *ptr;
> @@ -810,7 +810,7 @@ TEST_F(guard_pages, mlock)
> *
> * - Moving a mapping alone should retain markers as they are.
> */
> -TEST_F(guard_pages, mremap_move)
> +TEST_F(guard_regions, mremap_move)
> {
> const unsigned long page_size = self->page_size;
> char *ptr, *ptr_new;
> @@ -857,7 +857,7 @@ TEST_F(guard_pages, mremap_move)
> * will have to remove guard pages manually to fix up (they'd have to do the
> * same if it were a PROT_NONE mapping).
> */
> -TEST_F(guard_pages, mremap_expand)
> +TEST_F(guard_regions, mremap_expand)
> {
> const unsigned long page_size = self->page_size;
> char *ptr, *ptr_new;
> @@ -920,7 +920,7 @@ TEST_F(guard_pages, mremap_expand)
> * if the user were using a PROT_NONE mapping they'd have to manually fix this
> * up also so this is OK.
> */
> -TEST_F(guard_pages, mremap_shrink)
> +TEST_F(guard_regions, mremap_shrink)
> {
> const unsigned long page_size = self->page_size;
> char *ptr;
> @@ -984,7 +984,7 @@ TEST_F(guard_pages, mremap_shrink)
> * Assert that forking a process with VMAs that do not have VM_WIPEONFORK set
> * retain guard pages.
> */
> -TEST_F(guard_pages, fork)
> +TEST_F(guard_regions, fork)
> {
> const unsigned long page_size = self->page_size;
> char *ptr;
> @@ -1039,7 +1039,7 @@ TEST_F(guard_pages, fork)
> * Assert expected behaviour after we fork populated ranges of anonymous memory
> * and then guard and unguard the range.
> */
> -TEST_F(guard_pages, fork_cow)
> +TEST_F(guard_regions, fork_cow)
> {
> const unsigned long page_size = self->page_size;
> char *ptr;
> @@ -1110,7 +1110,7 @@ TEST_F(guard_pages, fork_cow)
> * Assert that forking a process with VMAs that do have VM_WIPEONFORK set
> * behave as expected.
> */
> -TEST_F(guard_pages, fork_wipeonfork)
> +TEST_F(guard_regions, fork_wipeonfork)
> {
> const unsigned long page_size = self->page_size;
> char *ptr;
> @@ -1160,7 +1160,7 @@ TEST_F(guard_pages, fork_wipeonfork)
> }
>
> /* Ensure that MADV_FREE retains guard entries as expected. */
> -TEST_F(guard_pages, lazyfree)
> +TEST_F(guard_regions, lazyfree)
> {
> const unsigned long page_size = self->page_size;
> char *ptr;
> @@ -1196,7 +1196,7 @@ TEST_F(guard_pages, lazyfree)
> }
>
> /* Ensure that MADV_POPULATE_READ, MADV_POPULATE_WRITE behave as expected. */
> -TEST_F(guard_pages, populate)
> +TEST_F(guard_regions, populate)
> {
> const unsigned long page_size = self->page_size;
> char *ptr;
> @@ -1222,7 +1222,7 @@ TEST_F(guard_pages, populate)
> }
>
> /* Ensure that MADV_COLD, MADV_PAGEOUT do not remove guard markers. */
> -TEST_F(guard_pages, cold_pageout)
> +TEST_F(guard_regions, cold_pageout)
> {
> const unsigned long page_size = self->page_size;
> char *ptr;
> @@ -1268,7 +1268,7 @@ TEST_F(guard_pages, cold_pageout)
> }
>
> /* Ensure that guard pages do not break userfaultd. */
> -TEST_F(guard_pages, uffd)
> +TEST_F(guard_regions, uffd)
> {
> const unsigned long page_size = self->page_size;
> int uffd;
* Re: [PATCH 3/4] tools/selftests: expand all guard region tests to file-backed
2025-02-13 18:17 ` [PATCH 3/4] tools/selftests: expand all guard region tests to file-backed Lorenzo Stoakes
@ 2025-02-18 14:17 ` Vlastimil Babka
2025-04-22 10:37 ` Ryan Roberts
1 sibling, 0 replies; 64+ messages in thread
From: Vlastimil Babka @ 2025-02-18 14:17 UTC (permalink / raw)
To: Lorenzo Stoakes, Andrew Morton
Cc: Suren Baghdasaryan, Liam R . Howlett, Matthew Wilcox,
Paul E . McKenney, Jann Horn, David Hildenbrand, linux-mm,
linux-kernel, Shuah Khan, linux-kselftest, linux-api,
John Hubbard, Juan Yescas, Kalesh Singh
On 2/13/25 19:17, Lorenzo Stoakes wrote:
> Extend the guard region tests to allow for test fixture variants for anon,
> shmem, and local file files.
>
> This allows us to assert that each of the expected behaviours of anonymous
> memory also applies correctly to file-backed (both shmem and a file
> created locally in the current working directory) and thus asserts the same
> correctness guarantees as all the remaining tests do.
>
> The fixture teardown is now performed in the parent process rather than
> child forked ones, meaning cleanup is always performed, including unlinking
> any generated temporary files.
>
> Additionally the variant fixture data type now contains an enum value
> indicating the type of backing store and the mmap() invocation is
> abstracted to allow for the mapping of whichever backing store the variant
> is testing.
>
> We adjust tests as necessary to account for the fact they may now reference
> files rather than anonymous memory.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
* Re: [PATCH 4/4] tools/selftests: add file/shmem-backed mapping guard region tests
2025-02-13 18:17 ` [PATCH 4/4] tools/selftests: add file/shmem-backed mapping guard region tests Lorenzo Stoakes
@ 2025-02-18 14:18 ` Vlastimil Babka
0 siblings, 0 replies; 64+ messages in thread
From: Vlastimil Babka @ 2025-02-18 14:18 UTC (permalink / raw)
To: Lorenzo Stoakes, Andrew Morton
Cc: Suren Baghdasaryan, Liam R . Howlett, Matthew Wilcox,
Paul E . McKenney, Jann Horn, David Hildenbrand, linux-mm,
linux-kernel, Shuah Khan, linux-kselftest, linux-api,
John Hubbard, Juan Yescas, Kalesh Singh
On 2/13/25 19:17, Lorenzo Stoakes wrote:
> Extend the guard region self tests to explicitly assert that guard regions
> work correctly for functionality specific to file-backed and shmem
> mappings.
>
> In addition to testing all of the existing guard region functionality that
> is currently tested against anonymous mappings against file-backed and
> shmem mappings (except those which are exclusive to anonymous mapping), we
> now also:
>
> * Test that MADV_SEQUENTIAL does not cause unexpected readahead behaviour.
> * Test that MAP_PRIVATE behaves as expected with guard regions installed in
> both a shared and private mapping of an fd.
> * Test that a read-only file can correctly establish guard regions.
> * Test a probable fault-around case does not interfere with guard regions
> (or vice-versa).
> * Test that truncation does not eliminate guard regions.
> * Test that hole punching functions as expected in the presence of guard
> regions.
> * Test that a read-only mapping of a memfd write sealed mapping can have
> guard regions established within it and function correctly without
> violation of the seal.
> * Test that guard regions installed into a mapping of the anonymous zero
> page function correctly.
Impressive.
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
* Re: [PATCH 0/4] mm: permit guard regions for file-backed/shmem mappings
2025-02-18 13:05 ` Lorenzo Stoakes
@ 2025-02-18 14:35 ` David Hildenbrand
2025-02-18 14:53 ` Lorenzo Stoakes
0 siblings, 1 reply; 64+ messages in thread
From: David Hildenbrand @ 2025-02-18 14:35 UTC (permalink / raw)
To: Lorenzo Stoakes, Vlastimil Babka
Cc: Andrew Morton, Suren Baghdasaryan, Liam R . Howlett,
Matthew Wilcox, Paul E . McKenney, Jann Horn, linux-mm,
linux-kernel, Shuah Khan, linux-kselftest, linux-api,
John Hubbard, Juan Yescas, Kalesh Singh
On 18.02.25 14:05, Lorenzo Stoakes wrote:
> On Tue, Feb 18, 2025 at 01:12:05PM +0100, Vlastimil Babka wrote:
>> On 2/13/25 19:16, Lorenzo Stoakes wrote:
>>> The guard regions feature was initially implemented to support anonymous
>>> mappings only, excluding shmem.
>>>
>>> This was done such as to introduce the feature carefully and incrementally
>>> and to be conservative when considering the various caveats and corner
>>> cases that are applicable to file-backed mappings but not to anonymous
>>> ones.
>>>
>>> Now this feature has landed in 6.13, it is time to revisit this and to
>>> extend this functionality to file-backed and shmem mappings.
>>>
>>> In order to make this maximally useful, and since one may map file-backed
>>> mappings read-only (for instance ELF images), we also remove the
>>> restriction on read-only mappings and permit the establishment of guard
>>> regions in any non-hugetlb, non-mlock()'d mapping.
>>
>> Do you plan to address mlock later too? I guess somebody might find use for
>> those. Is there some fundamental issue or just that we need to define some
>> good semantics for corner cases? (i.e. if pages are already populated in the
>> mlocked area, discarding them by replacing with guard pte's goes against
>> that, so do we allow it or not?).
>
> Yeah that's the fundamental issue with mlock, it does not interact with the
> zapping part of MADV_GUARD_INSTALL, and that is why we disallow it (but not so
> for MADV_GUARD_REMOVE, as if a VMA that contains guard regions is locked
> _afterwards_ there will be no zapping).
>
> We could potentially expose an equivalent, as there are for other flags, a
> _LOCKED variant of the madvise() flag, like MADV_GUARD_INSTALL_LOCKED to make
> this explicit.
>
> That is probably the most sensible option, if there is a need for this!
mlock is weird, because it assumes that memory will be faulted in across the whole VMA.
You'd likely have to populate + mlock the page when removing the guard.
Also not sure what happens if one does an mlock()/mlockall() after
already installing PTE markers.
__mm_populate() would skip whole VMAs in case populate_vma_page_range()
fails. And I would assume populate_vma_page_range() fails on the first
guard when it triggers a page fault.
OTOH, supporting the mlock-on-fault thingy should be easy. That's precisely where
MADV_DONTNEED_LOCKED originates from:
commit 9457056ac426e5ed0671356509c8dcce69f8dee0
Author: Johannes Weiner <hannes@cmpxchg.org>
Date: Thu Mar 24 18:14:12 2022 -0700
mm: madvise: MADV_DONTNEED_LOCKED
MADV_DONTNEED historically rejects mlocked ranges, but with MLOCK_ONFAULT
and MCL_ONFAULT allowing to mlock without populating, there are valid use
cases for depopulating locked ranges as well.
Adding support for that would be indeed nice.
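For reference, a minimal userspace sketch of what the mlock-on-fault mode means
(mlock2() and MLOCK_ONFAULT have been around since 4.4; error handling trimmed,
so this just shows the semantics):

	#define _GNU_SOURCE
	#include <stdio.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		long page_size = sysconf(_SC_PAGESIZE);
		size_t len = 10 * page_size;
		/* Anonymous mapping, nothing populated yet. */
		char *ptr = mmap(NULL, len, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (ptr == MAP_FAILED)
			return 1;
		/*
		 * MLOCK_ONFAULT locks pages only as they are faulted in,
		 * rather than populating the whole range up front - the mode
		 * MADV_DONTNEED_LOCKED was introduced to pair with.
		 */
		if (mlock2(ptr, len, MLOCK_ONFAULT))
			perror("mlock2");
		ptr[0] = 'x'; /* this page is now resident and locked */
		munmap(ptr, len);
		return 0;
	}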
--
Cheers,
David / dhildenb
* Re: [PATCH 0/4] mm: permit guard regions for file-backed/shmem mappings
2025-02-18 14:35 ` David Hildenbrand
@ 2025-02-18 14:53 ` Lorenzo Stoakes
2025-02-18 15:20 ` David Hildenbrand
0 siblings, 1 reply; 64+ messages in thread
From: Lorenzo Stoakes @ 2025-02-18 14:53 UTC (permalink / raw)
To: David Hildenbrand
Cc: Vlastimil Babka, Andrew Morton, Suren Baghdasaryan,
Liam R . Howlett, Matthew Wilcox, Paul E . McKenney, Jann Horn,
linux-mm, linux-kernel, Shuah Khan, linux-kselftest, linux-api,
John Hubbard, Juan Yescas, Kalesh Singh
On Tue, Feb 18, 2025 at 03:35:19PM +0100, David Hildenbrand wrote:
> On 18.02.25 14:05, Lorenzo Stoakes wrote:
> > On Tue, Feb 18, 2025 at 01:12:05PM +0100, Vlastimil Babka wrote:
> > > On 2/13/25 19:16, Lorenzo Stoakes wrote:
> > > > The guard regions feature was initially implemented to support anonymous
> > > > mappings only, excluding shmem.
> > > >
> > > > This was done such as to introduce the feature carefully and incrementally
> > > > and to be conservative when considering the various caveats and corner
> > > > cases that are applicable to file-backed mappings but not to anonymous
> > > > ones.
> > > >
> > > > Now this feature has landed in 6.13, it is time to revisit this and to
> > > > extend this functionality to file-backed and shmem mappings.
> > > >
> > > > In order to make this maximally useful, and since one may map file-backed
> > > > mappings read-only (for instance ELF images), we also remove the
> > > > restriction on read-only mappings and permit the establishment of guard
> > > > regions in any non-hugetlb, non-mlock()'d mapping.
> > >
> > > Do you plan to address mlock later too? I guess somebody might find use for
> > > those. Is there some fundamental issue or just that we need to define some
> > > good semantics for corner cases? (i.e. if pages are already populated in the
> > > mlocked area, discarding them by replacing with guard pte's goes against
> > > that, so do we allow it or not?).
> >
> > Yeah that's the fundamental issue with mlock, it does not interact with the
> > zapping part of MADV_GUARD_INSTALL, and that is why we disallow it (but not so
> > for MADV_GUARD_REMOVE, as if a VMA that contains guard regions is locked
> > _afterwards_ there will be no zapping).
> >
> > We could potentially expose an equivalent, as there are for other flags, a
> > _LOCKED variant of the madvise() flag, like MADV_GUARD_INSTALL_LOCKED to make
> > this explicit.
> >
> > That is probably the most sensible option, if there is a need for this!
>
> mlock is weird, because it assumes that memory will be faulted in in the whole VMA.
>
> You'd likely have to populate + mlock the page when removing the guard.
Right yeah that'd be super weird. And I don't want to add that logic.
> Also not sure what happens if one does an mlock()/mlockall() after
> already installing PTE markers.
The existing logic already handles non-present cases by skipping them, in
mlock_pte_range():
for (pte = start_pte; addr != end; pte++, addr += PAGE_SIZE) {
ptent = ptep_get(pte);
if (!pte_present(ptent))
continue;
...
}
Which covers off guard regions. Removing the guard regions after this will
leave you in a weird situation where these entries will be zapped... maybe
we need a patch to make MADV_GUARD_REMOVE check VM_LOCKED and in this case
also populate?
Actually I think the simpler option is to just disallow MADV_GUARD_REMOVE
if you have since locked the range? The code currently allows this on the
proviso that 'you aren't zapping locked mappings', but it leaves the VMA in a
state such that some entries would not be locked.
It'd be pretty weird to lock guard regions like this.
Having said all that, given what you say below, maybe it's not an issue
after all?...
>
> __mm_populate() would skip whole VMAs in case populate_vma_page_range()
> fails. And I would assume populate_vma_page_range() fails on the first
> guard when it triggers a page fault.
>
> OTOH, supporting the mlock-on-fault thingy should be easy. That's precisely where
> MADV_DONTNEED_LOCKED originates from:
>
> commit 9457056ac426e5ed0671356509c8dcce69f8dee0
> Author: Johannes Weiner <hannes@cmpxchg.org>
> Date: Thu Mar 24 18:14:12 2022 -0700
>
> mm: madvise: MADV_DONTNEED_LOCKED
> MADV_DONTNEED historically rejects mlocked ranges, but with MLOCK_ONFAULT
> and MCL_ONFAULT allowing to mlock without populating, there are valid use
> cases for depopulating locked ranges as well.
...Hm this seems to imply the current guard remove stuff isn't quite so
bad, so maybe the assumption that VM_LOCKED implies 'everything is
populated' isn't quite as stringent then.
The restriction is as simple as:
if (behavior != MADV_DONTNEED_LOCKED)
forbidden |= VM_LOCKED;
>
>
> Adding support for that would be indeed nice.
I mean it's sort of maybe understandable why you'd want to MADV_DONTNEED
locked ranges, but I really don't understand why you'd want to add guard
regions to mlock()'ed regions?
Then again we're currently asymmetric as you can add them _before_
mlock()'ing...
>
> --
> Cheers,
>
> David / dhildenb
>
* Re: [PATCH 0/4] mm: permit guard regions for file-backed/shmem mappings
2025-02-18 14:53 ` Lorenzo Stoakes
@ 2025-02-18 15:20 ` David Hildenbrand
2025-02-18 16:43 ` Lorenzo Stoakes
0 siblings, 1 reply; 64+ messages in thread
From: David Hildenbrand @ 2025-02-18 15:20 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Vlastimil Babka, Andrew Morton, Suren Baghdasaryan,
Liam R . Howlett, Matthew Wilcox, Paul E . McKenney, Jann Horn,
linux-mm, linux-kernel, Shuah Khan, linux-kselftest, linux-api,
John Hubbard, Juan Yescas, Kalesh Singh
> Right yeah that'd be super weird. And I don't want to add that logic.
>
>> Also not sure what happens if one does an mlock()/mlockall() after
>> already installing PTE markers.
>
> The existing logic already handles non-present cases by skipping them, in
> mlock_pte_range():
>
> for (pte = start_pte; addr != end; pte++, addr += PAGE_SIZE) {
> ptent = ptep_get(pte);
> if (!pte_present(ptent))
> continue;
>
> ...
> }
I *think* that code only updates already-mapped folios, to properly call
mlock_folio()/munlock_folio().
It is not the code that populates pages on mlock()/mlockall(). I think
all that goes via mm_populate()/__mm_populate(), where "ordinary GUP"
should apply.
See populate_vma_page_range(), especially also the VM_LOCKONFAULT handling.
>
> Which covers off guard regions. Removing the guard regions after this will
> leave you in a weird situation where these entries will be zapped... maybe
> we need a patch to make MADV_GUARD_REMOVE check VM_LOCKED and in this case
> also populate?
Maybe? Or we say that it behaves like MADV_DONTNEED_LOCKED.
>
> Actually I think the simpler option is to just disallow MADV_GUARD_REMOVE
> if you since locked the range? The code currently allows this on the
> proviso that 'you aren't zapping locked mappings' but leaves the VMA in a
> state such that some entries would not be locked.
>
> It'd be pretty weird to lock guard regions like this.
>
> Having said all that, given what you say below, maybe it's not an issue
> after all?...
>
>>
>> __mm_populate() would skip whole VMAs in case populate_vma_page_range()
>> fails. And I would assume populate_vma_page_range() fails on the first
>> guard when it triggers a page fault.
>>
>> OTOH, supporting the mlock-on-fault thingy should be easy. That's precisely where
>> MADV_DONTNEED_LOCKED originates from:
>>
>> commit 9457056ac426e5ed0671356509c8dcce69f8dee0
>> Author: Johannes Weiner <hannes@cmpxchg.org>
>> Date: Thu Mar 24 18:14:12 2022 -0700
>>
>> mm: madvise: MADV_DONTNEED_LOCKED
>> MADV_DONTNEED historically rejects mlocked ranges, but with MLOCK_ONFAULT
>> and MCL_ONFAULT allowing to mlock without populating, there are valid use
>> cases for depopulating locked ranges as well.
>
> ...Hm this seems to imply the current guard remove stuff isn't quite so
> bad, so maybe the assumption that VM_LOCKED implies 'everything is
> populated' isn't quite as stringent then.
Right, with MCL_ONFAULT at least. Without MCL_ONFAULT, the assumption is
that everything is populated (unless, apparently one uses
MADV_DONTNEED_LOCKED or population failed, maybe).
VM_LOCKONFAULT seems to be the sane case. I wonder why
MADV_DONTNEED_LOCKED didn't explicitly check for that one ... maybe
there is a history to that.
>
> The restriction is as simple as:
>
> if (behavior != MADV_DONTNEED_LOCKED)
> forbidden |= VM_LOCKED;
>
>>
>>
>> Adding support for that would be indeed nice.
>
> I mean it's sort of maybe understandable why you'd want to MADV_DONTNEED
> locked ranges, but I really don't understand why you'd want to add guard
> regions to mlock()'ed regions?
Some apps use mlockall(), and it might be nice to just be able to use
guard pages as if "nothing happened".
E.g., QEMU has the option to use mlockall().
>
> Then again we're currently asymmetric as you can add them _before_
> mlock()'ing...
Right.
--
Cheers,
David / dhildenb
* Re: [PATCH 1/4] mm: allow guard regions in file-backed and read-only mappings
2025-02-13 18:17 ` [PATCH 1/4] mm: allow guard regions in file-backed and read-only mappings Lorenzo Stoakes
2025-02-18 14:15 ` Vlastimil Babka
@ 2025-02-18 16:01 ` David Hildenbrand
2025-02-18 16:12 ` Lorenzo Stoakes
2025-02-24 14:02 ` Lorenzo Stoakes
2 siblings, 1 reply; 64+ messages in thread
From: David Hildenbrand @ 2025-02-18 16:01 UTC (permalink / raw)
To: Lorenzo Stoakes, Andrew Morton
Cc: Suren Baghdasaryan, Liam R . Howlett, Matthew Wilcox,
Vlastimil Babka, Paul E . McKenney, Jann Horn, linux-mm,
linux-kernel, Shuah Khan, linux-kselftest, linux-api,
John Hubbard, Juan Yescas, Kalesh Singh
On 13.02.25 19:17, Lorenzo Stoakes wrote:
> There is no reason to disallow guard regions in file-backed mappings -
> readahead and fault-around both function correctly in the presence of PTE
> markers, equally other operations relating to memory-mapped files function
> correctly.
>
> Additionally, read-only mappings if introducing guard-regions, only
> restrict the mapping further, which means there is no violation of any
> access rights by permitting this to be so.
>
> Removing this restriction allows for read-only mapped files (such as
> executable files) correctly which would otherwise not be permitted.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
> mm/madvise.c | 8 +-------
> 1 file changed, 1 insertion(+), 7 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 6ecead476a80..e01e93e179a8 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1051,13 +1051,7 @@ static bool is_valid_guard_vma(struct vm_area_struct *vma, bool allow_locked)
> if (!allow_locked)
> disallowed |= VM_LOCKED;
>
> - if (!vma_is_anonymous(vma))
> - return false;
> -
> - if ((vma->vm_flags & (VM_MAYWRITE | disallowed)) != VM_MAYWRITE)
> - return false;
> -
> - return true;
> + return !(vma->vm_flags & disallowed);
> }
>
> static bool is_guard_pte_marker(pte_t ptent)
Acked-by: David Hildenbrand <david@redhat.com>
I assume these markers cannot completely prevent us from allocating
pages/folios for these underlying file/pageache ranges of these markers
in case of shmem during page faults, right?
--
Cheers,
David / dhildenb
* Re: [PATCH 1/4] mm: allow guard regions in file-backed and read-only mappings
2025-02-18 16:01 ` David Hildenbrand
@ 2025-02-18 16:12 ` Lorenzo Stoakes
2025-02-18 16:17 ` David Hildenbrand
0 siblings, 1 reply; 64+ messages in thread
From: Lorenzo Stoakes @ 2025-02-18 16:12 UTC (permalink / raw)
To: David Hildenbrand
Cc: Andrew Morton, Suren Baghdasaryan, Liam R . Howlett,
Matthew Wilcox, Vlastimil Babka, Paul E . McKenney, Jann Horn,
linux-mm, linux-kernel, Shuah Khan, linux-kselftest, linux-api,
John Hubbard, Juan Yescas, Kalesh Singh
On Tue, Feb 18, 2025 at 05:01:16PM +0100, David Hildenbrand wrote:
> On 13.02.25 19:17, Lorenzo Stoakes wrote:
> > There is no reason to disallow guard regions in file-backed mappings -
> > readahead and fault-around both function correctly in the presence of PTE
> > markers, equally other operations relating to memory-mapped files function
> > correctly.
> >
> > Additionally, read-only mappings if introducing guard-regions, only
> > restrict the mapping further, which means there is no violation of any
> > access rights by permitting this to be so.
> >
> > Removing this restriction allows for read-only mapped files (such as
> > executable files) correctly which would otherwise not be permitted.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > ---
> > mm/madvise.c | 8 +-------
> > 1 file changed, 1 insertion(+), 7 deletions(-)
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 6ecead476a80..e01e93e179a8 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -1051,13 +1051,7 @@ static bool is_valid_guard_vma(struct vm_area_struct *vma, bool allow_locked)
> > if (!allow_locked)
> > disallowed |= VM_LOCKED;
> > - if (!vma_is_anonymous(vma))
> > - return false;
> > -
> > - if ((vma->vm_flags & (VM_MAYWRITE | disallowed)) != VM_MAYWRITE)
> > - return false;
> > -
> > - return true;
> > + return !(vma->vm_flags & disallowed);
> > }
> > static bool is_guard_pte_marker(pte_t ptent)
>
> Acked-by: David Hildenbrand <david@redhat.com>
Thanks!
>
> I assume these markers cannot completely prevent us from allocating
> pages/folios for these underlying file/pageache ranges of these markers in
> case of shmem during page faults, right?
If the markers are in place, then page faulting will result in a
segfault. If we faulted in a shmem page then installed markers (which would
zap the range), then the page cache will be populated, but obviously
subject to standard reclaim.
If we perform synchronous readahead prior to a guard region, where the
readahead window includes (partially or fully) the guard region, we might major
fault entries into the page cache that are then not accessible _from that
mapping_. This is rather unavoidable, as readahead doesn't account for page
table mappings, and should be largely trivial overhead (also these folios are
reclaimable).
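To make the access semantics concrete, a minimal userspace sketch (mirroring
the selftests; the #defines are only fallbacks for the advice values as merged
in 6.13, and the filename is arbitrary):

	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/mman.h>
	#include <unistd.h>

	/* Fallback definitions; advice values as merged in 6.13. */
	#ifndef MADV_GUARD_INSTALL
	#define MADV_GUARD_INSTALL 102
	#define MADV_GUARD_REMOVE  103
	#endif

	int main(void)
	{
		long page_size = sysconf(_SC_PAGESIZE);
		int fd = open("testfile", O_RDWR | O_CREAT, 0600);
		char *ptr;

		if (fd < 0 || ftruncate(fd, 10 * page_size))
			return 1;
		ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
			   MAP_SHARED, fd, 0);
		if (ptr == MAP_FAILED)
			return 1;
		/* Install a guard region over page 5 of the mapping. */
		if (madvise(&ptr[5 * page_size], page_size, MADV_GUARD_INSTALL))
			perror("madvise");
		/* Touching &ptr[5 * page_size] here would raise SIGSEGV... */
		/* ...until the guard region is removed again. */
		if (madvise(&ptr[5 * page_size], page_size, MADV_GUARD_REMOVE))
			perror("madvise");
		munmap(ptr, 10 * page_size);
		close(fd);
		unlink("testfile");
		return 0;
	}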
>
> --
> Cheers,
>
> David / dhildenb
>
* Re: [PATCH 1/4] mm: allow guard regions in file-backed and read-only mappings
2025-02-18 16:12 ` Lorenzo Stoakes
@ 2025-02-18 16:17 ` David Hildenbrand
2025-02-18 16:21 ` Lorenzo Stoakes
0 siblings, 1 reply; 64+ messages in thread
From: David Hildenbrand @ 2025-02-18 16:17 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, Suren Baghdasaryan, Liam R . Howlett,
Matthew Wilcox, Vlastimil Babka, Paul E . McKenney, Jann Horn,
linux-mm, linux-kernel, Shuah Khan, linux-kselftest, linux-api,
John Hubbard, Juan Yescas, Kalesh Singh
On 18.02.25 17:12, Lorenzo Stoakes wrote:
> On Tue, Feb 18, 2025 at 05:01:16PM +0100, David Hildenbrand wrote:
>> On 13.02.25 19:17, Lorenzo Stoakes wrote:
>>> There is no reason to disallow guard regions in file-backed mappings -
>>> readahead and fault-around both function correctly in the presence of PTE
>>> markers, equally other operations relating to memory-mapped files function
>>> correctly.
>>>
>>> Additionally, read-only mappings if introducing guard-regions, only
>>> restrict the mapping further, which means there is no violation of any
>>> access rights by permitting this to be so.
>>>
>>> Removing this restriction allows for read-only mapped files (such as
>>> executable files) correctly which would otherwise not be permitted.
>>>
>>> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>> ---
>>> mm/madvise.c | 8 +-------
>>> 1 file changed, 1 insertion(+), 7 deletions(-)
>>>
>>> diff --git a/mm/madvise.c b/mm/madvise.c
>>> index 6ecead476a80..e01e93e179a8 100644
>>> --- a/mm/madvise.c
>>> +++ b/mm/madvise.c
>>> @@ -1051,13 +1051,7 @@ static bool is_valid_guard_vma(struct vm_area_struct *vma, bool allow_locked)
>>> if (!allow_locked)
>>> disallowed |= VM_LOCKED;
>>> - if (!vma_is_anonymous(vma))
>>> - return false;
>>> -
>>> - if ((vma->vm_flags & (VM_MAYWRITE | disallowed)) != VM_MAYWRITE)
>>> - return false;
>>> -
>>> - return true;
>>> + return !(vma->vm_flags & disallowed);
>>> }
>>> static bool is_guard_pte_marker(pte_t ptent)
>>
>> Acked-by: David Hildenbrand <david@redhat.com>
>
> Thanks!
>
>>
>> I assume these markers cannot completely prevent us from allocating
>> pages/folios for these underlying file/pageache ranges of these markers in
>> case of shmem during page faults, right?
>
> If the markers are in place, then page faulting will result in a
> segfault. If we faulted in a shmem page then installed markers (which would
> zap the range), then the page cache will be populated, but obviously
> subject to standard reclaim.
Well, yes, (a) if there is swap and (b), if the noswap option was not
specified for tmpfs.
Okay, so installing a guard entry might require punching a hole to get rid
of any already-existing memory. But readahead (below) might mess it up.
>
> If we perform synchronous readahead prior to a guard region that includes
> (partially or fully) a guard region we might major fault entries into the
> page cache that are then not accessable _from that mapping_, this is rather
> unavoidable as this doesn't account for page table mappings and should be
> largely trivial overhead (also these folios are reclaimable).
Right, that's what I had in mind: assume I have a single marker in a
PMD, shmem might allocate a PMD THP to back that region, ignoring the
marker hint. (so I think)
Thanks!
--
Cheers,
David / dhildenb
* Re: [PATCH 1/4] mm: allow guard regions in file-backed and read-only mappings
2025-02-18 16:17 ` David Hildenbrand
@ 2025-02-18 16:21 ` Lorenzo Stoakes
2025-02-18 16:27 ` David Hildenbrand
0 siblings, 1 reply; 64+ messages in thread
From: Lorenzo Stoakes @ 2025-02-18 16:21 UTC (permalink / raw)
To: David Hildenbrand
Cc: Andrew Morton, Suren Baghdasaryan, Liam R . Howlett,
Matthew Wilcox, Vlastimil Babka, Paul E . McKenney, Jann Horn,
linux-mm, linux-kernel, Shuah Khan, linux-kselftest, linux-api,
John Hubbard, Juan Yescas, Kalesh Singh
On Tue, Feb 18, 2025 at 05:17:20PM +0100, David Hildenbrand wrote:
> On 18.02.25 17:12, Lorenzo Stoakes wrote:
> > On Tue, Feb 18, 2025 at 05:01:16PM +0100, David Hildenbrand wrote:
> > > On 13.02.25 19:17, Lorenzo Stoakes wrote:
> > > > There is no reason to disallow guard regions in file-backed mappings -
> > > > readahead and fault-around both function correctly in the presence of PTE
> > > > markers, equally other operations relating to memory-mapped files function
> > > > correctly.
> > > >
> > > > Additionally, read-only mappings if introducing guard-regions, only
> > > > restrict the mapping further, which means there is no violation of any
> > > > access rights by permitting this to be so.
> > > >
> > > > Removing this restriction allows for read-only mapped files (such as
> > > > executable files) correctly which would otherwise not be permitted.
> > > >
> > > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > > > ---
> > > > mm/madvise.c | 8 +-------
> > > > 1 file changed, 1 insertion(+), 7 deletions(-)
> > > >
> > > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > > index 6ecead476a80..e01e93e179a8 100644
> > > > --- a/mm/madvise.c
> > > > +++ b/mm/madvise.c
> > > > @@ -1051,13 +1051,7 @@ static bool is_valid_guard_vma(struct vm_area_struct *vma, bool allow_locked)
> > > > if (!allow_locked)
> > > > disallowed |= VM_LOCKED;
> > > > - if (!vma_is_anonymous(vma))
> > > > - return false;
> > > > -
> > > > - if ((vma->vm_flags & (VM_MAYWRITE | disallowed)) != VM_MAYWRITE)
> > > > - return false;
> > > > -
> > > > - return true;
> > > > + return !(vma->vm_flags & disallowed);
> > > > }
> > > > static bool is_guard_pte_marker(pte_t ptent)
> > >
> > > Acked-by: David Hildenbrand <david@redhat.com>
> >
> > Thanks!
> >
> > >
> > > I assume these markers cannot completely prevent us from allocating
> > > pages/folios for these underlying file/pageache ranges of these markers in
> > > case of shmem during page faults, right?
> >
> > If the markers are in place, then page faulting will result in a
> > segfault. If we faulted in a shmem page then installed markers (which would
> > zap the range), then the page cache will be populated, but obviously
> > subject to standard reclaim.
>
> Well, yes, (a) if there is swap and (b), if the noswap option was not
> specified for tmpfs.
>
Right, yeah - assuming you don't have it set up such that dropping a reference
to the folio drops the page altogether.
I think this matches expectation though in that you'd get the same results from
an MADV_DONTNEED followed by faulting the page again.
> Okay, so installing a guard entry might require punshing a hole to get rid
> of any already-existing memory. But readahead (below) might mess it up.
Only if you are so concerned about avoiding the page cache being populated there
that you want to do this :)
Readahead, I think, will not read ahead into a hole-punched region, as the hole
punching extends to the fs layer _I believe_ - I have not checked the code for
this, but I believe it actually changes the underlying file too, right, to say
'this part of the file is empty'?
(I did explicitly test hole punching with guard regions btw, by the by :)
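(For anyone following along, 'hole punching' here means the fallocate()
operation sketched below; note FALLOC_FL_PUNCH_HOLE must be combined with
FALLOC_FL_KEEP_SIZE:)

	#define _GNU_SOURCE
	#include <fcntl.h>

	/*
	 * Sketch: punch a hole over a page-sized range of the file backing the
	 * mapping. The pagecache for that range is dropped, reads of the hole
	 * return zeroes, and FALLOC_FL_KEEP_SIZE leaves the file size
	 * unchanged.
	 */
	static int punch_page(int fd, off_t offset, long page_size)
	{
		return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
				 offset, page_size);
	}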
>
> >
> > If we perform synchronous readahead prior to a guard region that includes
> > (partially or fully) a guard region we might major fault entries into the
> > page cache that are then not accessable _from that mapping_, this is rather
> > unavoidable as this doesn't account for page table mappings and should be
> > largely trivial overhead (also these folios are reclaimable).
>
> Right, that's what I had in mind: assume I have a single marker in a PMD,
> shmem might allocate a PMD THP to back that region, ignoring the marker
> hint. (so I think)
>
> Thanks!
>
> --
> Cheers,
>
> David / dhildenb
>
>
* Re: [PATCH 1/4] mm: allow guard regions in file-backed and read-only mappings
2025-02-18 16:21 ` Lorenzo Stoakes
@ 2025-02-18 16:27 ` David Hildenbrand
2025-02-18 16:49 ` Lorenzo Stoakes
0 siblings, 1 reply; 64+ messages in thread
From: David Hildenbrand @ 2025-02-18 16:27 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, Suren Baghdasaryan, Liam R . Howlett,
Matthew Wilcox, Vlastimil Babka, Paul E . McKenney, Jann Horn,
linux-mm, linux-kernel, Shuah Khan, linux-kselftest, linux-api,
John Hubbard, Juan Yescas, Kalesh Singh
On 18.02.25 17:21, Lorenzo Stoakes wrote:
> On Tue, Feb 18, 2025 at 05:17:20PM +0100, David Hildenbrand wrote:
>> On 18.02.25 17:12, Lorenzo Stoakes wrote:
>>> On Tue, Feb 18, 2025 at 05:01:16PM +0100, David Hildenbrand wrote:
>>>> On 13.02.25 19:17, Lorenzo Stoakes wrote:
>>>>> There is no reason to disallow guard regions in file-backed mappings -
>>>>> readahead and fault-around both function correctly in the presence of PTE
>>>>> markers, equally other operations relating to memory-mapped files function
>>>>> correctly.
>>>>>
>>>>> Additionally, read-only mappings if introducing guard-regions, only
>>>>> restrict the mapping further, which means there is no violation of any
>>>>> access rights by permitting this to be so.
>>>>>
>>>>> Removing this restriction allows for read-only mapped files (such as
>>>>> executable files) correctly which would otherwise not be permitted.
>>>>>
>>>>> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>>>> ---
>>>>> mm/madvise.c | 8 +-------
>>>>> 1 file changed, 1 insertion(+), 7 deletions(-)
>>>>>
>>>>> diff --git a/mm/madvise.c b/mm/madvise.c
>>>>> index 6ecead476a80..e01e93e179a8 100644
>>>>> --- a/mm/madvise.c
>>>>> +++ b/mm/madvise.c
>>>>> @@ -1051,13 +1051,7 @@ static bool is_valid_guard_vma(struct vm_area_struct *vma, bool allow_locked)
>>>>> if (!allow_locked)
>>>>> disallowed |= VM_LOCKED;
>>>>> - if (!vma_is_anonymous(vma))
>>>>> - return false;
>>>>> -
>>>>> - if ((vma->vm_flags & (VM_MAYWRITE | disallowed)) != VM_MAYWRITE)
>>>>> - return false;
>>>>> -
>>>>> - return true;
>>>>> + return !(vma->vm_flags & disallowed);
>>>>> }
>>>>> static bool is_guard_pte_marker(pte_t ptent)
>>>>
>>>> Acked-by: David Hildenbrand <david@redhat.com>
>>>
>>> Thanks!
>>>
>>>>
>>>> I assume these markers cannot completely prevent us from allocating
>>>> pages/folios for these underlying file/pageache ranges of these markers in
>>>> case of shmem during page faults, right?
>>>
>>> If the markers are in place, then page faulting will result in a
>>> segfault. If we faulted in a shmem page then installed markers (which would
>>> zap the range), then the page cache will be populated, but obviously
>>> subject to standard reclaim.
>>
>> Well, yes, (a) if there is swap and (b), if the noswap option was not
>> specified for tmpfs.
>>
>
> Right, yeah if you don't have it set up such that dropping a reference to the
> folio doesn't drop the page altogether.
>
> I think this matches expectation though in that you'd get the same results from
> an MADV_DONTNEED followed by faulting the page again.
It might make sense to document that: installing a guard behaves just
like MADV_DONTNEED; in case of a file, that means that the pagecache is
left untouched.
>
>> Okay, so installing a guard entry might require punshing a hole to get rid
>> of any already-existing memory. But readahead (below) might mess it up.
>
> Only if you are so concerned about avoiding the page cache being populated there
> that you want to do this :)
>
> Readahead I think will not readahead into a holepunched region as the hole
> punching extends to the fs layer _I believe_ I have not checked the code for
> this, but I believe it actually changes the underlying file too right to say
> 'this part of the file is empty'?
Well, we are talking about shmem here ... not your ordinary fs backed by
an actual file :)
--
Cheers,
David / dhildenb
* Re: [PATCH 0/4] mm: permit guard regions for file-backed/shmem mappings
2025-02-18 15:20 ` David Hildenbrand
@ 2025-02-18 16:43 ` Lorenzo Stoakes
2025-02-18 17:14 ` David Hildenbrand
0 siblings, 1 reply; 64+ messages in thread
From: Lorenzo Stoakes @ 2025-02-18 16:43 UTC (permalink / raw)
To: David Hildenbrand
Cc: Vlastimil Babka, Andrew Morton, Suren Baghdasaryan,
Liam R . Howlett, Matthew Wilcox, Paul E . McKenney, Jann Horn,
linux-mm, linux-kernel, Shuah Khan, linux-kselftest, linux-api,
John Hubbard, Juan Yescas, Kalesh Singh
On Tue, Feb 18, 2025 at 04:20:18PM +0100, David Hildenbrand wrote:
> > Right yeah that'd be super weird. And I don't want to add that logic.
> >
> > > Also not sure what happens if one does an mlock()/mlockall() after
> > > already installing PTE markers.
> >
> > The existing logic already handles non-present cases by skipping them, in
> > mlock_pte_range():
> >
> > for (pte = start_pte; addr != end; pte++, addr += PAGE_SIZE) {
> > ptent = ptep_get(pte);
> > if (!pte_present(ptent))
> > continue;
> >
> > ...
> > }
>
> I *think* that code only updates already-mapped folios, to properly call
> mlock_folio()/munlock_folio().
Guard regions _are_ 'already mapped' :) so it leaves them in place.
do_mlock() -> apply_vma_lock_flags() -> mlock_fixup() -> mlock_vma_pages_range()
implies this will be invoked.
>
> It is not the code that populates pages on mlock()/mlockall(). I think all
> that goes via mm_populate()/__mm_populate(), where "ordinary GUP" should
> apply.
OK I want to correct what I said earlier.
Installing a guard region then attempting mlock() will result in an error. The
populate will -EFAULT and stop at the guard region, which causes mlock() to
error out.
This is a partial failure, so the VMA is split and has VM_LOCKED applied, but
the populate halts at the guard region.
This is OK as per the previous discussion on aggregate operation failure: there
can be no expectation of 'unwinding' partially successful operations that form
part of a requested aggregate one.
However, given there's stuff to clean up, and on error a user _may_ wish to then
remove guard regions and try again, I guess there's no harm in keeping the code
as it is where we allow MADV_GUARD_REMOVE even if VM_LOCKED is in place.
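(Roughly this pattern from userspace, as a sketch - note the exact errno
visible at the mlock() boundary depends on the internal error mapping, so it
isn't asserted here:)

	#include <sys/mman.h>

	#ifndef MADV_GUARD_REMOVE
	#define MADV_GUARD_REMOVE 103 /* advice value as merged in 6.13 */
	#endif

	/*
	 * Sketch: mlock() over a range containing guard regions is expected to
	 * fail part-way, as populating the range stops when it hits a guard
	 * marker. A caller that really does want the range locked can drop the
	 * guard regions and retry.
	 */
	static int mlock_or_drop_guards(void *ptr, size_t len)
	{
		if (mlock(ptr, len) == 0)
			return 0;
		if (madvise(ptr, len, MADV_GUARD_REMOVE))
			return -1;
		return mlock(ptr, len);
	}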
>
> See populate_vma_page_range(), especially also the VM_LOCKONFAULT handling.
Yeah that code is horrible, you just reminded me of it... 'rightly or wrongly'
yeah wrongly, very wrongly...
>
> >
> > Which covers off guard regions. Removing the guard regions after this will
> > leave you in a weird situation where these entries will be zapped... maybe
> > we need a patch to make MADV_GUARD_REMOVE check VM_LOCKED and in this case
> > also populate?
>
> Maybe? Or we say that it behaves like MADV_DONTNEED_LOCKED.
See above - no, we should not :P This is only good for cleanup after mlock()
failure, although no sane program should really be trying to do this; a sane
program would give up here (and it's a _programmatic error_ to try to mlock() a
range with guard regions).
>
> >
> > Actually I think the simpler option is to just disallow MADV_GUARD_REMOVE
> > if you since locked the range? The code currently allows this on the
> > proviso that 'you aren't zapping locked mappings' but leaves the VMA in a
> > state such that some entries would not be locked.
> >
> > It'd be pretty weird to lock guard regions like this.
> >
> > Having said all that, given what you say below, maybe it's not an issue
> > after all?...
> >
> > >
> > > __mm_populate() would skip whole VMAs in case populate_vma_page_range()
> > > fails. And I would assume populate_vma_page_range() fails on the first
> > > guard when it triggers a page fault.
> > >
> > > OTOH, supporting the mlock-on-fault thingy should be easy. That's precisely where
> > > MADV_DONTNEED_LOCKED originates from:
> > >
> > > commit 9457056ac426e5ed0671356509c8dcce69f8dee0
> > > Author: Johannes Weiner <hannes@cmpxchg.org>
> > > Date: Thu Mar 24 18:14:12 2022 -0700
> > >
> > > mm: madvise: MADV_DONTNEED_LOCKED
> > > MADV_DONTNEED historically rejects mlocked ranges, but with MLOCK_ONFAULT
> > > and MCL_ONFAULT allowing to mlock without populating, there are valid use
> > > cases for depopulating locked ranges as well.
> >
> > ...Hm this seems to imply the current guard remove stuff isn't quite so
> > bad, so maybe the assumption that VM_LOCKED implies 'everything is
> > populated' isn't quite as stringent then.
>
> Right, with MCL_ONFAULT at least. Without MCL_ONFAULT, the assumption is
> that everything is populated (unless, apparently one uses
> MADV_DONTNEED_LOCKED or population failed, maybe).
>
> VM_LOCKONFAULT seems to be the sane case. I wonder why MADV_DONTNEED_LOCKED
> didn't explicitly check for that one ... maybe there is a history to that.
Yeah weird.
>
> >
> > The restriction is as simple as:
> >
> > if (behavior != MADV_DONTNEED_LOCKED)
> > forbidden |= VM_LOCKED;
> >
> > >
> > >
> > > Adding support for that would be indeed nice.
> >
> > I mean it's sort of maybe understandable why you'd want to MADV_DONTNEED
> > locked ranges, but I really don't understand why you'd want to add guard
> > regions to mlock()'ed regions?
>
> Somme apps use mlockall(), and it might be nice to just be able to use guard
> pages as if "Nothing happened".
Sadly I think not given above :P
>
> E.g., QEMU has the option to use mlockall().
>
> >
> > Then again we're currently asymmetric as you can add them _before_
> > mlock()'ing...
>
> Right.
>
> --
> Cheers,
>
> David / dhildenb
>
I think the _LOCKED idea is therefore kaput, because it just won't work
properly: populating guard regions fails.
It fails because populate tries to 'touch' the memory, but 'touching' guard
region memory causes a segfault. This rather breaks the idea of mlock()'ing
guard regions.
I think adding workarounds to make this possible in any way is not really
worth it (and would probably be pretty gross).
We already document that 'mlock()ing lightweight guard regions will fail' in
the man page, so this is all in line with that.
* Re: [PATCH 1/4] mm: allow guard regions in file-backed and read-only mappings
2025-02-18 16:27 ` David Hildenbrand
@ 2025-02-18 16:49 ` Lorenzo Stoakes
2025-02-18 17:00 ` David Hildenbrand
0 siblings, 1 reply; 64+ messages in thread
From: Lorenzo Stoakes @ 2025-02-18 16:49 UTC (permalink / raw)
To: David Hildenbrand
Cc: Andrew Morton, Suren Baghdasaryan, Liam R . Howlett,
Matthew Wilcox, Vlastimil Babka, Paul E . McKenney, Jann Horn,
linux-mm, linux-kernel, Shuah Khan, linux-kselftest, linux-api,
John Hubbard, Juan Yescas, Kalesh Singh
On Tue, Feb 18, 2025 at 05:27:24PM +0100, David Hildenbrand wrote:
> On 18.02.25 17:21, Lorenzo Stoakes wrote:
> > On Tue, Feb 18, 2025 at 05:17:20PM +0100, David Hildenbrand wrote:
> > > On 18.02.25 17:12, Lorenzo Stoakes wrote:
> > > > On Tue, Feb 18, 2025 at 05:01:16PM +0100, David Hildenbrand wrote:
> > > > > On 13.02.25 19:17, Lorenzo Stoakes wrote:
> > > > > > There is no reason to disallow guard regions in file-backed mappings -
> > > > > > readahead and fault-around both function correctly in the presence of PTE
> > > > > > markers, equally other operations relating to memory-mapped files function
> > > > > > correctly.
> > > > > >
> > > > > > Additionally, read-only mappings if introducing guard-regions, only
> > > > > > restrict the mapping further, which means there is no violation of any
> > > > > > access rights by permitting this to be so.
> > > > > >
> > > > > > Removing this restriction allows for read-only mapped files (such as
> > > > > > executable files) correctly which would otherwise not be permitted.
> > > > > >
> > > > > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > > > > > ---
> > > > > > mm/madvise.c | 8 +-------
> > > > > > 1 file changed, 1 insertion(+), 7 deletions(-)
> > > > > >
> > > > > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > > > > index 6ecead476a80..e01e93e179a8 100644
> > > > > > --- a/mm/madvise.c
> > > > > > +++ b/mm/madvise.c
> > > > > > @@ -1051,13 +1051,7 @@ static bool is_valid_guard_vma(struct vm_area_struct *vma, bool allow_locked)
> > > > > > if (!allow_locked)
> > > > > > disallowed |= VM_LOCKED;
> > > > > > - if (!vma_is_anonymous(vma))
> > > > > > - return false;
> > > > > > -
> > > > > > - if ((vma->vm_flags & (VM_MAYWRITE | disallowed)) != VM_MAYWRITE)
> > > > > > - return false;
> > > > > > -
> > > > > > - return true;
> > > > > > + return !(vma->vm_flags & disallowed);
> > > > > > }
> > > > > > static bool is_guard_pte_marker(pte_t ptent)
> > > > >
> > > > > Acked-by: David Hildenbrand <david@redhat.com>
> > > >
> > > > Thanks!
> > > >
> > > > >
> > > > > I assume these markers cannot completely prevent us from allocating
> > > > > pages/folios for these underlying file/pageache ranges of these markers in
> > > > > case of shmem during page faults, right?
> > > >
> > > > If the markers are in place, then page faulting will result in a
> > > > segfault. If we faulted in a shmem page then installed markers (which would
> > > > zap the range), then the page cache will be populated, but obviously
> > > > subject to standard reclaim.
> > >
> > > Well, yes, (a) if there is swap and (b), if the noswap option was not
> > > specified for tmpfs.
> > >
> >
> > Right, yeah if you don't have it set up such that dropping a reference to the
> > folio doesn't drop the page altogether.
> >
> > I think this matches expectation though in that you'd get the same results from
> > an MADV_DONTNEED followed by faulting the page again.
>
> It might make sense to document that: installing a guard behaves just like
> MADV_DONTNEED; in case of a file, that means that the pagecache is left
> untouched.
More docs noooo! :P I will update the man pages accordingly when this is more
obviously heading for landing in 6.15.
Current man page documentation on this is:
'If the region maps memory pages those mappings will be replaced as part of
the operation'
I think something like:
'If the region maps pages, those mappings will be replaced as part of the
operation. When guard regions are removed via MADV_GUARD_REMOVE, faulting in
the pages will behave as if MADV_DONTNEED had been applied to the region;
that is, anonymous ranges will be backed by newly allocated zeroed pages and
file-backed ranges will be backed by the underlying file pages.'
Probably something less wordy than this...
>
> >
> > > Okay, so installing a guard entry might require punshing a hole to get rid
> > > of any already-existing memory. But readahead (below) might mess it up.
> >
> > Only if you are so concerned about avoiding the page cache being populated there
> > that you want to do this :)
> >
> > Readahead I think will not readahead into a holepunched region as the hole
> > punching extends to the fs layer _I believe_ I have not checked the code for
> > this, but I believe it actually changes the underlying file too right to say
> > 'this part of the file is empty'?
>
> Well, we are talking about shmem here ... not your ordinary fs backed by an
> actual file :)
I am talking about both, I multitask ;)
>
> --
> Cheers,
>
> David / dhildenb
>
* Re: [PATCH 1/4] mm: allow guard regions in file-backed and read-only mappings
2025-02-18 16:49 ` Lorenzo Stoakes
@ 2025-02-18 17:00 ` David Hildenbrand
2025-02-18 17:04 ` Lorenzo Stoakes
0 siblings, 1 reply; 64+ messages in thread
From: David Hildenbrand @ 2025-02-18 17:00 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, Suren Baghdasaryan, Liam R . Howlett,
Matthew Wilcox, Vlastimil Babka, Paul E . McKenney, Jann Horn,
linux-mm, linux-kernel, Shuah Khan, linux-kselftest, linux-api,
John Hubbard, Juan Yescas, Kalesh Singh
On 18.02.25 17:49, Lorenzo Stoakes wrote:
> On Tue, Feb 18, 2025 at 05:27:24PM +0100, David Hildenbrand wrote:
>> On 18.02.25 17:21, Lorenzo Stoakes wrote:
>>> On Tue, Feb 18, 2025 at 05:17:20PM +0100, David Hildenbrand wrote:
>>>> On 18.02.25 17:12, Lorenzo Stoakes wrote:
>>>>> On Tue, Feb 18, 2025 at 05:01:16PM +0100, David Hildenbrand wrote:
>>>>>> On 13.02.25 19:17, Lorenzo Stoakes wrote:
>>>>>>> There is no reason to disallow guard regions in file-backed mappings -
>>>>>>> readahead and fault-around both function correctly in the presence of PTE
>>>>>>> markers, equally other operations relating to memory-mapped files function
>>>>>>> correctly.
>>>>>>>
>>>>>>> Additionally, read-only mappings if introducing guard-regions, only
>>>>>>> restrict the mapping further, which means there is no violation of any
>>>>>>> access rights by permitting this to be so.
>>>>>>>
>>>>>>> Removing this restriction allows guard regions to be used with read-only
>>>>>>> mapped files (such as executable files), which would otherwise not be
>>>>>>> permitted.
>>>>>>>
>>>>>>> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>>>>>> ---
>>>>>>> mm/madvise.c | 8 +-------
>>>>>>> 1 file changed, 1 insertion(+), 7 deletions(-)
>>>>>>>
>>>>>>> diff --git a/mm/madvise.c b/mm/madvise.c
>>>>>>> index 6ecead476a80..e01e93e179a8 100644
>>>>>>> --- a/mm/madvise.c
>>>>>>> +++ b/mm/madvise.c
>>>>>>> @@ -1051,13 +1051,7 @@ static bool is_valid_guard_vma(struct vm_area_struct *vma, bool allow_locked)
>>>>>>>          if (!allow_locked)
>>>>>>>                  disallowed |= VM_LOCKED;
>>>>>>> -        if (!vma_is_anonymous(vma))
>>>>>>> -                return false;
>>>>>>> -
>>>>>>> -        if ((vma->vm_flags & (VM_MAYWRITE | disallowed)) != VM_MAYWRITE)
>>>>>>> -                return false;
>>>>>>> -
>>>>>>> -        return true;
>>>>>>> +        return !(vma->vm_flags & disallowed);
>>>>>>>  }
>>>>>>>  static bool is_guard_pte_marker(pte_t ptent)
>>>>>>
>>>>>> Acked-by: David Hildenbrand <david@redhat.com>
>>>>>
>>>>> Thanks!
>>>>>
>>>>>>
>>>>>> I assume these markers cannot completely prevent us from allocating
>>>>>> pages/folios for these underlying file/pagecache ranges of these markers in
>>>>>> case of shmem during page faults, right?
>>>>>
>>>>> If the markers are in place, then page faulting will result in a
>>>>> segfault. If we faulted in a shmem page then installed markers (which would
>>>>> zap the range), then the page cache will be populated, but obviously
>>>>> subject to standard reclaim.
>>>>
>>>> Well, yes, (a) if there is swap and (b), if the noswap option was not
>>>> specified for tmpfs.
>>>>
>>>
>>> Right, yeah if you don't have it set up such that dropping a reference to the
>>> folio doesn't drop the page altogether.
>>>
>>> I think this matches expectation though in that you'd get the same results from
>>> an MADV_DONTNEED followed by faulting the page again.
>>
>> It might make sense to document that: installing a guard behaves just like
>> MADV_DONTNEED; in case of a file, that means that the pagecache is left
>> untouched.
>
> More docs noooo! :P I will update the man pages when this is more obviously
> heading for landing in 6.15 accordingly.
>
> Current man page documentation on this is:
>
> 'If the region maps memory pages those mappings will be replaced as part of
> the operation'
>
> I think something like:
>
> 'If the region maps pages, those mappings will be replaced as part of the
> operation. When guard regions are removed via MADV_GUARD_REMOVE, faulting
> in the page will behave as if that region had MADV_DONTNEED applied to it,
> that is, anonymous ranges will be backed by newly allocated zeroed pages and
> file-backed ranges will be backed by the underlying file pages.'
>
> Probably something less wordy than this...
Yeah, but sounds good to me.
>
>>
>>>
>>>> Okay, so installing a guard entry might require punching a hole to get rid
>>>> of any already-existing memory. But readahead (below) might mess it up.
>>>
>>> Only if you are so concerned about avoiding the page cache being populated there
>>> that you want to do this :)
>>>
>>> Readahead I think will not readahead into a holepunched region as the hole
>>> punching extends to the fs layer _I believe_ I have not checked the code for
>>> this, but I believe it actually changes the underlying file too right to say
>>> 'this part of the file is empty'?
>>
>> Well, we are talking about shmem here ... not your ordinary fs backed by an
>> actual file :)
>
> I am talking about both, I multitask ;)
For !shmem, we should indeed not be messing with a sparse file structure.
--
Cheers,
David / dhildenb
* Re: [PATCH 1/4] mm: allow guard regions in file-backed and read-only mappings
2025-02-18 17:00 ` David Hildenbrand
@ 2025-02-18 17:04 ` Lorenzo Stoakes
0 siblings, 0 replies; 64+ messages in thread
From: Lorenzo Stoakes @ 2025-02-18 17:04 UTC (permalink / raw)
To: David Hildenbrand
Cc: Andrew Morton, Suren Baghdasaryan, Liam R . Howlett,
Matthew Wilcox, Vlastimil Babka, Paul E . McKenney, Jann Horn,
linux-mm, linux-kernel, Shuah Khan, linux-kselftest, linux-api,
John Hubbard, Juan Yescas, Kalesh Singh
On Tue, Feb 18, 2025 at 06:00:36PM +0100, David Hildenbrand wrote:
> On 18.02.25 17:49, Lorenzo Stoakes wrote:
> > On Tue, Feb 18, 2025 at 05:27:24PM +0100, David Hildenbrand wrote:
> > > On 18.02.25 17:21, Lorenzo Stoakes wrote:
> > > > On Tue, Feb 18, 2025 at 05:17:20PM +0100, David Hildenbrand wrote:
> > > > > On 18.02.25 17:12, Lorenzo Stoakes wrote:
> > > > > > On Tue, Feb 18, 2025 at 05:01:16PM +0100, David Hildenbrand wrote:
> > > > > > > On 13.02.25 19:17, Lorenzo Stoakes wrote:
> > > > > > > > There is no reason to disallow guard regions in file-backed mappings -
> > > > > > > > readahead and fault-around both function correctly in the presence of PTE
> > > > > > > > markers, equally other operations relating to memory-mapped files function
> > > > > > > > correctly.
> > > > > > > >
> > > > > > > > Additionally, read-only mappings if introducing guard-regions, only
> > > > > > > > restrict the mapping further, which means there is no violation of any
> > > > > > > > access rights by permitting this to be so.
> > > > > > > >
> > > > > > > > Removing this restriction allows guard regions to be used with read-only
> > > > > > > > mapped files (such as executable files), which would otherwise not be
> > > > > > > > permitted.
> > > > > > > >
> > > > > > > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > > > > > > > ---
> > > > > > > > mm/madvise.c | 8 +-------
> > > > > > > > 1 file changed, 1 insertion(+), 7 deletions(-)
> > > > > > > >
> > > > > > > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > > > > > > index 6ecead476a80..e01e93e179a8 100644
> > > > > > > > --- a/mm/madvise.c
> > > > > > > > +++ b/mm/madvise.c
> > > > > > > > @@ -1051,13 +1051,7 @@ static bool is_valid_guard_vma(struct vm_area_struct *vma, bool allow_locked)
> > > > > > > >          if (!allow_locked)
> > > > > > > >                  disallowed |= VM_LOCKED;
> > > > > > > > -        if (!vma_is_anonymous(vma))
> > > > > > > > -                return false;
> > > > > > > > -
> > > > > > > > -        if ((vma->vm_flags & (VM_MAYWRITE | disallowed)) != VM_MAYWRITE)
> > > > > > > > -                return false;
> > > > > > > > -
> > > > > > > > -        return true;
> > > > > > > > +        return !(vma->vm_flags & disallowed);
> > > > > > > >  }
> > > > > > > >  static bool is_guard_pte_marker(pte_t ptent)
> > > > > > >
> > > > > > > Acked-by: David Hildenbrand <david@redhat.com>
> > > > > >
> > > > > > Thanks!
> > > > > >
> > > > > > >
> > > > > > > I assume these markers cannot completely prevent us from allocating
> > > > > > > pages/folios for these underlying file/pagecache ranges of these markers in
> > > > > > > case of shmem during page faults, right?
> > > > > >
> > > > > > If the markers are in place, then page faulting will result in a
> > > > > > segfault. If we faulted in a shmem page then installed markers (which would
> > > > > > zap the range), then the page cache will be populated, but obviously
> > > > > > subject to standard reclaim.
> > > > >
> > > > > Well, yes, (a) if there is swap and (b), if the noswap option was not
> > > > > specified for tmpfs.
> > > > >
> > > >
> > > > Right, yeah if you don't have it set up such that dropping a reference to the
> > > > folio doesn't drop the page altogether.
> > > >
> > > > I think this matches expectation though in that you'd get the same results from
> > > > an MADV_DONTNEED followed by faulting the page again.
> > >
> > > It might make sense to document that: installing a guard behaves just like
> > > MADV_DONTNEED; in case of a file, that means that the pagecache is left
> > > untouched.
> >
> > More docs noooo! :P I will update the man pages when this is more obviously
> > heading for landing in 6.15 accordingly.
> >
> > Current man page documentation on this is:
> >
> > 'If the region maps memory pages those mappings will be replaced as part of
> > the operation'
> >
> > I think something like:
> >
> > 'If the region maps pages, those mappings will be replaced as part of the
> > operation. When guard regions are removed via MADV_GUARD_REMOVE, faulting
> > in the page will behave as if that region had MADV_DONTNEED applied to it,
> > that is, anonymous ranges will be backed by newly allocated zeroed pages and
> > file-backed ranges will be backed by the underlying file pages.'
> >
> > Probably something less wordy than this...
>
> Yeah, but sounds good to me.
>
> >
> > >
> > > >
> > > > > Okay, so installing a guard entry might require punching a hole to get rid
> > > > > of any already-existing memory. But readahead (below) might mess it up.
> > > >
> > > > Only if you are so concerned about avoiding the page cache being populated there
> > > > that you want to do this :)
> > > >
> > > > Readahead I think will not readahead into a holepunched region as the hole
> > > > punching extends to the fs layer _I believe_ I have not checked the code for
> > > > this, but I believe it actually changes the underlying file too right to say
> > > > 'this part of the file is empty'?
> > >
> > > Well, we are talking about shmem here ... not your ordinary fs backed by an
> > > actual file :)
> >
> > I am talking about both, I multitask ;)
>
> For !shmem, we should indeed not be messing with a sparse file structure.
Ah right, I get you, sorry I missed your point here. Yeah, MADV_REMOVE for
shmem will just explicitly drop the pages; there is no underlying file to
impact, as there would be with actually file-backed memory.
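Something like this sketch is what I have in mind (illustrative only, not
from the series; the memfd stands in for shmem and error handling is
omitted):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        long psize = sysconf(_SC_PAGESIZE);
        int fd = memfd_create("demo", 0);
        char *p;

        ftruncate(fd, psize);
        p = mmap(NULL, psize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

        p[0] = 1;       /* populate the shmem page */

        /* Drop it again - same effect as an fallocate() hole punch; there is
         * no on-disk file to update, only the page cache page is freed. */
        madvise(p, psize, MADV_REMOVE);

        munmap(p, psize);
        close(fd);
        return 0;
}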
>
>
> --
> Cheers,
>
> David / dhildenb
>
* Re: [PATCH 0/4] mm: permit guard regions for file-backed/shmem mappings
2025-02-18 16:43 ` Lorenzo Stoakes
@ 2025-02-18 17:14 ` David Hildenbrand
2025-02-18 17:20 ` Lorenzo Stoakes
0 siblings, 1 reply; 64+ messages in thread
From: David Hildenbrand @ 2025-02-18 17:14 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Vlastimil Babka, Andrew Morton, Suren Baghdasaryan,
Liam R . Howlett, Matthew Wilcox, Paul E . McKenney, Jann Horn,
linux-mm, linux-kernel, Shuah Khan, linux-kselftest, linux-api,
John Hubbard, Juan Yescas, Kalesh Singh
On 18.02.25 17:43, Lorenzo Stoakes wrote:
> On Tue, Feb 18, 2025 at 04:20:18PM +0100, David Hildenbrand wrote:
>>> Right yeah that'd be super weird. And I don't want to add that logic.
>>>
>>>> Also not sure what happens if one does an mlock()/mlockall() after
>>>> already installing PTE markers.
>>>
>>> The existing logic already handles non-present cases by skipping them, in
>>> mlock_pte_range():
>>>
>>> for (pte = start_pte; addr != end; pte++, addr += PAGE_SIZE) {
>>>         ptent = ptep_get(pte);
>>>         if (!pte_present(ptent))
>>>                 continue;
>>>
>>>         ...
>>> }
>>
>> I *think* that code only updates already-mapped folios, to properly call
>> mlock_folio()/munlock_folio().
>
> Guard regions _are_ 'already mapped' :) so it leaves them in place.
"mapped folios" -- there is no folio mapped. Yes, the VMA is in place.
>
> do_mlock() -> apply_vma_lock_flags() -> mlock_fixup() -> mlock_vma_pages_range()
> implies this will be invoked.
Yes, to process any already mapped folios, to then continue population
later.
>
>>
>> It is not the code that populates pages on mlock()/mlockall(). I think all
>> that goes via mm_populate()/__mm_populate(), where "ordinary GUP" should
>> apply.
>
> OK I want to correct what I said earlier.
>
> Installing a guard region then attempting mlock() will result in an error. The
> populate will -EFAULT and stop at the guard region, which causes mlock() to
> error out.
Right, that's my expectation.
>
> This is a partial failure, so the VMA is split and has VM_LOCKED applied, but
> the populate halts at the guard region.
>
> This is ok as per previous discussion on aggregate operation failure, there can
> be no expectation of 'unwinding' of partially successful operations that form
> part of a requested aggregate one.
>
> However, given there's stuff to clean up, and on error a user _may_ wish to then
> remove guard regions and try again, I guess there's no harm in keeping the code
> as it is where we allow MADV_GUARD_REMOVE even if VM_LOCKED is in place.
Likely yes; it's all weird code.
>
>>
>> See populate_vma_page_range(), especially also the VM_LOCKONFAULT handling.
>
> Yeah that code is horrible, you just reminded me of it... 'rightly or wrongly'
> yeah wrongly, very wrongly...
>
>>
>>>
>>> Which covers off guard regions. Removing the guard regions after this will
>>> leave you in a weird situation where these entries will be zapped... maybe
>>> we need a patch to make MADV_GUARD_REMOVE check VM_LOCKED and in this case
>>> also populate?
>>
>> Maybe? Or we say that it behaves like MADV_DONTNEED_LOCKED.
>
> See above, no we should not :P this is only good for cleanup after mlock()
> failure, although no sane program should really be trying to do this, a sane
> program would give up here (and it's a _programmatic error_ to try to mlock() a
> range with guard regions).
>> Some apps use mlockall(), and it might be nice to just be able to use guard
>> pages as if "Nothing happened".
>
> Sadly I think not given above :P
QEMU, for example, will issue an mlockall(MCL_CURRENT | MCL_FUTURE) when
requested to, and then exit() if that fails.
Assuming glibc or any lib uses guard regions, QEMU would have no real way of
figuring that out or instructing offending libraries to disable that, at
least for now ...
... making RT VMs less usable if any library uses guard regions. :(
There is upcoming support for MCL_ONFAULT in QEMU [1] (see below).
[1] https://lkml.kernel.org/r/20250212173823.214429-3-peterx@redhat.com
>
>>
>> E.g., QEMU has the option to use mlockall().
>>
>>>
>>> Then again we're currently asymmetric as you can add them _before_
>>> mlock()'ing...
>>
>> Right.
>>
>> --
>> Cheers,
>>
>> David / dhildenb
>>
>
> I think the _LOCKED idea is therefore kaput, because it just won't work
> properly because populating guard regions fails.
Right, I think basic VM_LOCKED is out of the picture. VM_LOCKONFAULT
might be interesting, because we are skipping the population stage.
>
> It fails because it tries to 'touch' the memory, but 'touching' guard
> region memory causes a segfault. This kind of breaks the idea of
> mlock()'ing guard regions.
>
> I think adding workarounds to make this possible in any way is not really
> worth it (and would probably be pretty gross).
>
> We already document that 'mlock()ing lightweight guard regions will fail'
> as per man page so this is all in line with that.
Right, and I claim that supporting VM_LOCKONFAULT might likely be as
easy as allowing install/remove of guard regions when that flag is set.
--
Cheers,
David / dhildenb
* Re: [PATCH 0/4] mm: permit guard regions for file-backed/shmem mappings
2025-02-18 17:14 ` David Hildenbrand
@ 2025-02-18 17:20 ` Lorenzo Stoakes
2025-02-18 17:25 ` David Hildenbrand
0 siblings, 1 reply; 64+ messages in thread
From: Lorenzo Stoakes @ 2025-02-18 17:20 UTC (permalink / raw)
To: David Hildenbrand
Cc: Vlastimil Babka, Andrew Morton, Suren Baghdasaryan,
Liam R . Howlett, Matthew Wilcox, Paul E . McKenney, Jann Horn,
linux-mm, linux-kernel, Shuah Khan, linux-kselftest, linux-api,
John Hubbard, Juan Yescas, Kalesh Singh
On Tue, Feb 18, 2025 at 06:14:00PM +0100, David Hildenbrand wrote:
> On 18.02.25 17:43, Lorenzo Stoakes wrote:
> > On Tue, Feb 18, 2025 at 04:20:18PM +0100, David Hildenbrand wrote:
> > > > Right yeah that'd be super weird. And I don't want to add that logic.
> > > >
> > > > > Also not sure what happens if one does an mlock()/mlockall() after
> > > > > already installing PTE markers.
> > > >
> > > > The existing logic already handles non-present cases by skipping them, in
> > > > mlock_pte_range():
> > > >
> > > > for (pte = start_pte; addr != end; pte++, addr += PAGE_SIZE) {
> > > >         ptent = ptep_get(pte);
> > > >         if (!pte_present(ptent))
> > > >                 continue;
> > > >
> > > >         ...
> > > > }
> > >
> > > I *think* that code only updates already-mapped folios, to properly call
> > > mlock_folio()/munlock_folio().
> >
> > Guard regions _are_ 'already mapped' :) so it leaves them in place.
>
> "mapped folios" -- there is no folio mapped. Yes, the VMA is in place.
We're engaging in a moot discussion on this I think but I mean it appears
to operate by walking page tables if they are populated, which they will be
for guard regions, but when it finds it's non-present it will skip.
This amounts to the same thing as not doing anything, obviously.
>
> >
> > do_mlock() -> apply_vma_lock_flags() -> mlock_fixup() -> mlock_vma_pages_range()
> > implies this will be invoked.
>
> Yes, to process any already mapped folios, to then continue population
> later.
>
> >
> > >
> > > It is not the code that populates pages on mlock()/mlockall(). I think all
> > > that goes via mm_populate()/__mm_populate(), where "ordinary GUP" should
> > > apply.
> >
> > OK I want to correct what I said earlier.
> >
> > Installing a guard region then attempting mlock() will result in an error. The
> > populate will -EFAULT and stop at the guard region, which causes mlock() to
> > error out.
>
> Right, that's my expectation.
OK good!
>
> >
> > This is a partial failure, so the VMA is split and has VM_LOCKED applied, but
> > the populate halts at the guard region.
> >
> > This is ok as per previous discussion on aggregate operation failure, there can
> > be no expectation of 'unwinding' of partially successful operations that form
> > part of a requested aggregate one.
> >
> > However, given there's stuff to clean up, and on error a user _may_ wish to then
> > remove guard regions and try again, I guess there's no harm in keeping the code
> > as it is where we allow MADV_GUARD_REMOVE even if VM_LOCKED is in place.
>
> Likely yes; it's all weird code.
>
> >
> > >
> > > See populate_vma_page_range(), especially also the VM_LOCKONFAULT handling.
> >
> > Yeah that code is horrible, you just reminded me of it... 'rightly or wrongly'
> > yeah wrongly, very wrongly...
> >
> > >
> > > >
> > > > Which covers off guard regions. Removing the guard regions after this will
> > > > leave you in a weird situation where these entries will be zapped... maybe
> > > > we need a patch to make MADV_GUARD_REMOVE check VM_LOCKED and in this case
> > > > also populate?
> > >
> > > Maybe? Or we say that it behaves like MADV_DONTNEED_LOCKED.
> >
> > See above, no we should not :P this is only good for cleanup after mlock()
> > failure, although no sane program should really be trying to do this, a sane
> > program would give up here (and it's a _programmatic error_ to try to mlock() a
> > range with guard regions).
> > > Some apps use mlockall(), and it might be nice to just be able to use guard
> > > pages as if "Nothing happened".
> >
> > Sadly I think not given above :P
>
> QEMU, for example, will issue an mlockall(MCL_CURRENT | MCL_FUTURE) when
> requested to, and then exit() if that fails.
Hm under what circumstances? I use qemu extensively to test this stuff with
no issues. Unless you mean it's using it in the 'host' code somehow.
>
> Assuming glibc or any lib uses guard regions, QEMU would have no real way of
> figuring that out or instructing offending libraries to disable that, at
> least for now ...
>
> ... making RT VMs less usable if any library uses guard regions. :(
>
This seems really stupid, to be honest. Unfortunately there's no way around
this, if software does stupid things then they get stupid prizes. There are
other ways mlock() and faulting in can fail too.
> There is upcoming support for MCL_ONFAULT in QEMU [1] (see below).
Good.
>
> [1] https://lkml.kernel.org/r/20250212173823.214429-3-peterx@redhat.com
>
> >
> > >
> > > E.g., QEMU has the option to use mlockall().
> > >
> > > >
> > > > Then again we're currently asymmetric as you can add them _before_
> > > > mlock()'ing...
> > >
> > > Right.
> > >
> > > --
> > > Cheers,
> > >
> > > David / dhildenb
> > >
> >
> > I think the _LOCKED idea is therefore kaput, because it just won't work
> > properly because populating guard regions fails.
>
> Right, I think basic VM_LOCKED is out of the picture. VM_LOCKONFAULT might
> be interesting, because we are skipping the population stage.
>
> >
> > It fails because it tries to 'touch' the memory, but 'touching' guard
> > region memory causes a segfault. This kind of breaks the idea of
> > mlock()'ing guard regions.
> >
> > I think adding workarounds to make this possible in any way is not really
> > worth it (and would probably be pretty gross).
> >
> > We already document that 'mlock()ing lightweight guard regions will fail'
> > as per man page so this is all in line with that.
>
> Right, and I claim that supporting VM_LOCKONFAULT might likely be as easy as
> allowing install/remove of guard regions when that flag is set.
We already allow this flag! VM_LOCKED and VM_HUGETLB are the only flags we
disallow.
>
> --
> Cheers,
>
> David / dhildenb
>
* Re: [PATCH 0/4] mm: permit guard regions for file-backed/shmem mappings
2025-02-18 17:20 ` Lorenzo Stoakes
@ 2025-02-18 17:25 ` David Hildenbrand
2025-02-18 17:28 ` Lorenzo Stoakes
0 siblings, 1 reply; 64+ messages in thread
From: David Hildenbrand @ 2025-02-18 17:25 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Vlastimil Babka, Andrew Morton, Suren Baghdasaryan,
Liam R . Howlett, Matthew Wilcox, Paul E . McKenney, Jann Horn,
linux-mm, linux-kernel, Shuah Khan, linux-kselftest, linux-api,
John Hubbard, Juan Yescas, Kalesh Singh
>>
>> QEMU, for example, will issue an mlockall(MCL_CURRENT | MCL_FUTURE) when
>> requested to, and then exit() if that fails.
>
> Hm under what circumstances? I use qemu extensively to test this stuff with
> no issues. Unless you mean it's using it in the 'host' code somehow.
-overcommit mem-lock=on
or (legacy)
-realtime mlock=on
I think.
[...]
>>>
>>> It fails because it tries to 'touch' the memory, but 'touching' guard
>>> region memory causes a segfault. This kind of breaks the idea of
>>> mlock()'ing guard regions.
>>>
>>> I think adding workarounds to make this possible in any way is not really
>>> worth it (and would probably be pretty gross).
>>>
>>> We already document that 'mlock()ing lightweight guard regions will fail'
>>> as per man page so this is all in line with that.
>>
>> Right, and I claim that supporting VM_LOCKONFAULT might likely be as easy as
>> allowing install/remove of guard regions when that flag is set.
>
> We already allow this flag! VM_LOCKED and VM_HUGETLB are the only flags we
> disallow.
See mlock2();
SYSCALL_DEFINE3(mlock2, unsigned long, start, size_t, len, int, flags)
{
        vm_flags_t vm_flags = VM_LOCKED;

        if (flags & ~MLOCK_ONFAULT)
                return -EINVAL;

        if (flags & MLOCK_ONFAULT)
                vm_flags |= VM_LOCKONFAULT;

        return do_mlock(start, len, vm_flags);
}
VM_LOCKONFAULT always has VM_LOCKED set as well.
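I.e., from the userspace side (minimal sketch, assuming a glibc new enough to
expose mlock2(), >= 2.27):

#define _GNU_SOURCE
#include <sys/mman.h>

/* Both paths set VM_LOCKED; MLOCK_ONFAULT additionally sets VM_LOCKONFAULT,
 * so population is deferred to fault time. */
static int lock_range(void *addr, size_t len, int on_fault)
{
        return on_fault ? mlock2(addr, len, MLOCK_ONFAULT)
                        : mlock(addr, len);
}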
--
Cheers,
David / dhildenb
* Re: [PATCH 0/4] mm: permit guard regions for file-backed/shmem mappings
2025-02-18 17:25 ` David Hildenbrand
@ 2025-02-18 17:28 ` Lorenzo Stoakes
2025-02-18 17:31 ` David Hildenbrand
2025-02-25 15:54 ` Vlastimil Babka
0 siblings, 2 replies; 64+ messages in thread
From: Lorenzo Stoakes @ 2025-02-18 17:28 UTC (permalink / raw)
To: David Hildenbrand
Cc: Vlastimil Babka, Andrew Morton, Suren Baghdasaryan,
Liam R . Howlett, Matthew Wilcox, Paul E . McKenney, Jann Horn,
linux-mm, linux-kernel, Shuah Khan, linux-kselftest, linux-api,
John Hubbard, Juan Yescas, Kalesh Singh
On Tue, Feb 18, 2025 at 06:25:35PM +0100, David Hildenbrand wrote:
> > >
> > > QEMU, for example, will issue an mlockall(MCL_CURRENT | MCL_FUTURE) when
> > > requested to, and then exit() if that fails.
> >
> > Hm under what circumstances? I use qemu extensively to test this stuff with
> > no issues. Unless you mean it's using it in the 'host' code somehow.
>
>
> -overcommit mem-lock=on
>
> or (legacy)
>
> -realtime mlock=on
>
> I think.
>
> [...]
Thanks
>
> > > >
> > > > It fails because it tries to 'touch' the memory, but 'touching' guard
> > > > region memory causes a segfault. This kind of breaks the idea of
> > > > mlock()'ing guard regions.
> > > >
> > > > I think adding workarounds to make this possible in any way is not really
> > > > worth it (and would probably be pretty gross).
> > > >
> > > > We already document that 'mlock()ing lightweight guard regions will fail'
> > > > as per man page so this is all in line with that.
> > >
> > > Right, and I claim that supporting VM_LOCKONFAULT might likely be as easy as
> > > allowing install/remove of guard regions when that flag is set.
> >
> > We already allow this flag! VM_LOCKED and VM_HUGETLB are the only flags we
> > disallow.
>
>
> See mlock2();
>
> SYSCALL_DEFINE3(mlock2, unsigned long, start, size_t, len, int, flags)
> {
>         vm_flags_t vm_flags = VM_LOCKED;
>
>         if (flags & ~MLOCK_ONFAULT)
>                 return -EINVAL;
>
>         if (flags & MLOCK_ONFAULT)
>                 vm_flags |= VM_LOCKONFAULT;
>
>         return do_mlock(start, len, vm_flags);
> }
>
>
> VM_LOCKONFAULT always has VM_LOCKED set as well.
OK cool, that makes sense.
As with much kernel stuff, I knew this in the past. Then I forgot. Then I knew
again, then... :P if only somebody would write it down in a book...
Yeah then that makes sense to check explicitly for (VM_LOCKED | VM_LOCKONFAULT)
in any MADV_GUARD_INSTALL_LOCKED variant as obviously this would be passively
excluded right now.
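Something roughly along these lines, purely as a hypothetical sketch (an
MADV_GUARD_INSTALL_LOCKED variant doesn't exist, and the helper name and
simplified flag set here are invented for illustration):

/*
 * Hypothetical only: a LOCKED install variant could accept VM_LOCKONFAULT
 * mappings (population is deferred to fault time) while still refusing
 * plain VM_LOCKED ones, whose up-front population would fault on the guard
 * region and fail.
 */
static bool guard_install_allowed(struct vm_area_struct *vma, bool locked_variant)
{
        vm_flags_t disallowed = VM_HUGETLB;     /* simplified for illustration */

        if (!locked_variant)
                disallowed |= VM_LOCKED;        /* current behaviour */
        else if ((vma->vm_flags & (VM_LOCKED | VM_LOCKONFAULT)) == VM_LOCKED)
                return false;                   /* mlock() without ONFAULT */

        return !(vma->vm_flags & disallowed);
}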
>
> --
> Cheers,
>
> David / dhildenb
>
* Re: [PATCH 0/4] mm: permit guard regions for file-backed/shmem mappings
2025-02-18 17:28 ` Lorenzo Stoakes
@ 2025-02-18 17:31 ` David Hildenbrand
2025-02-25 15:54 ` Vlastimil Babka
1 sibling, 0 replies; 64+ messages in thread
From: David Hildenbrand @ 2025-02-18 17:31 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Vlastimil Babka, Andrew Morton, Suren Baghdasaryan,
Liam R . Howlett, Matthew Wilcox, Paul E . McKenney, Jann Horn,
linux-mm, linux-kernel, Shuah Khan, linux-kselftest, linux-api,
John Hubbard, Juan Yescas, Kalesh Singh
>> See mlock2();
>>
>> SYSCALL_DEFINE3(mlock2, unsigned long, start, size_t, len, int, flags)
>> {
>>         vm_flags_t vm_flags = VM_LOCKED;
>>
>>         if (flags & ~MLOCK_ONFAULT)
>>                 return -EINVAL;
>>
>>         if (flags & MLOCK_ONFAULT)
>>                 vm_flags |= VM_LOCKONFAULT;
>>
>>         return do_mlock(start, len, vm_flags);
>> }
>>
>>
>> VM_LOCKONFAULT always has VM_LOCKED set as well.
>
> OK cool, that makes sense.
>
> As with much kernel stuff, I knew this in the past. Then I forgot. Then I knew
> again, then... :P if only somebody would write it down in a book...
And if there were such a book in the making, I would consider hurrying up
and preordering it ;)
--
Cheers,
David / dhildenb
* Re: [PATCH 0/4] mm: permit guard regions for file-backed/shmem mappings
2025-02-13 18:16 [PATCH 0/4] mm: permit guard regions for file-backed/shmem mappings Lorenzo Stoakes
` (4 preceding siblings ...)
2025-02-18 12:12 ` [PATCH 0/4] mm: permit guard regions for file-backed/shmem mappings Vlastimil Babka
@ 2025-02-19 8:25 ` Kalesh Singh
2025-02-19 8:35 ` Kalesh Singh
2025-02-19 9:03 ` Lorenzo Stoakes
5 siblings, 2 replies; 64+ messages in thread
From: Kalesh Singh @ 2025-02-19 8:25 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, Suren Baghdasaryan, Liam R . Howlett,
Matthew Wilcox, Vlastimil Babka, Paul E . McKenney, Jann Horn,
David Hildenbrand, linux-mm, linux-kernel, Shuah Khan,
linux-kselftest, linux-api, John Hubbard, Juan Yescas
On Thu, Feb 13, 2025 at 10:18 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> The guard regions feature was initially implemented to support anonymous
> mappings only, excluding shmem.
>
> This was done such as to introduce the feature carefully and incrementally
> and to be conservative when considering the various caveats and corner
> cases that are applicable to file-backed mappings but not to anonymous
> ones.
>
> Now this feature has landed in 6.13, it is time to revisit this and to
> extend this functionality to file-backed and shmem mappings.
>
> In order to make this maximally useful, and since one may map file-backed
> mappings read-only (for instance ELF images), we also remove the
> restriction on read-only mappings and permit the establishment of guard
> regions in any non-hugetlb, non-mlock()'d mapping.
Hi Lorenzo,
Thank you for your work on this.
Have we thought about how guard regions are represented in /proc/*/[s]maps?
In the field, I've found that many applications read the ranges from
/proc/self/[s]maps to determine what they can access (usually related
to obfuscation techniques). If they don't know of the guard regions it
would cause them to crash; I think that we'll need similar entries to
PROT_NONE (---p) for these, and generally to maintain consistency
between the behavior and what is being said from /proc/*/[s]maps.
-- Kalesh
>
> It is permissible to permit the establishment of guard regions in read-only
> mappings because the guard regions only reduce access to the mapping, and
> when removed simply reinstate the existing attributes of the underlying
> VMA, meaning no access violations can occur.
>
> While the change in kernel code introduced in this series is small, the
> majority of the effort here is spent in extending the testing to assert
> that the feature works correctly across numerous file-backed mapping
> scenarios.
>
> Every single guard region self-test performed against anonymous memory
> (which is relevant and not anon-only) has now been updated to also be
> performed against shmem and a mapping of a file in the working directory.
>
> This confirms that all cases also function correctly for file-backed guard
> regions.
>
> In addition a number of other tests are added for specific file-backed
> mapping scenarios.
>
> There are a number of other concerns that one might have with regard to
> guard regions, addressed below:
>
> Readahead
> ~~~~~~~~~
>
> Readahead is a process through which the page cache is populated on the
> assumption that sequential reads will occur, thus amortising I/O and,
> through a clever use of the PG_readahead folio flag establishing during
> major fault and checked upon minor fault, provides for asynchronous I/O to
> occur as dat is processed, reducing I/O stalls as data is faulted in.
>
> Guard regions do not alter this mechanism which operations at the folio and
> fault level, but do of course prevent the faulting of folios that would
> otherwise be mapped.
>
> In the instance of a major fault prior to a guard region, synchronous
> readahead will occur including populating folios in the page cache which
> the guard regions will, in the case of the mapping in question, prevent
> access to.
>
> In addition, if PG_readahead is placed in a folio that is now inaccessible,
> this will prevent asynchronous readahead from occurring as it would
> otherwise do.
>
> However, there are mechanisms for heuristically resetting this within
> readahead regardless, which will 'recover' correct readahead behaviour.
>
> Readahead presumes sequential data access, the presence of a guard region
> clearly indicates that, at least in the guard region, no such sequential
> access will occur, as it cannot occur there.
>
> So this should have very little impact on any real workload. The far more
> important point is as to whether readahead causes incorrect or
> inappropriate mapping of ranges disallowed by the presence of guard
> regions - this is not the case, as readahead does not 'pre-fault' memory in
> this fashion.
>
> At any rate, any mechanism which would attempt to do so would hit the usual
> page fault paths, which correctly handle PTE markers as with anonymous
> mappings.
>
> Fault-Around
> ~~~~~~~~~~~~
>
> The fault-around logic, in a similar vein to readahead, attempts to improve
> efficiency with regard to file-backed memory mappings, however it differs
> in that it does not try to fetch folios into the page cache that are about
> to be accessed, but rather pre-maps a range of folios around the faulting
> address.
>
> Guard regions making use of PTE markers makes this relatively trivial, as
> this case is already handled - see filemap_map_folio_range() and
> filemap_map_order0_folio() - in both instances, the solution is to simply
> keep the established page table mappings and let the fault handler take
> care of PTE markers, as per the comment:
>
> /*
> * NOTE: If there're PTE markers, we'll leave them to be
> * handled in the specific fault path, and it'll prohibit
> * the fault-around logic.
> */
>
> This works, as establishing guard regions results in page table mappings
> with PTE markers, and clearing them removes them.
>
> Truncation
> ~~~~~~~~~~
>
> File truncation will not eliminate existing guard regions, as the
> truncation operation will ultimately zap the range via
> unmap_mapping_range(), which specifically excludes PTE markers.
>
> Zapping
> ~~~~~~~
>
> Zapping is, as with anonymous mappings, handled by zap_nonpresent_ptes(),
> which specifically deals with guard entries, leaving them intact except in
> instances such as process teardown or munmap() where they need to be
> removed.
>
> Reclaim
> ~~~~~~~
>
> When reclaim is performed on file-backed folios, it ultimately invokes
> try_to_unmap_one() via the rmap. If the folio is non-large, then map_pte()
> will ultimately abort the operation for the guard region mapping. If large,
> then check_pte() will determine that this is a non-device private
> entry/device-exclusive entry 'swap' PTE and thus abort the operation in
> that instance.
>
> Therefore, no odd things happen in the instance of reclaim being attempted
> upon a file-backed guard region.
>
> Hole Punching
> ~~~~~~~~~~~~~
>
> This updates the page cache and ultimately invokes unmap_mapping_range(),
> which explicitly leaves PTE markers in place.
>
> Because the establishment of guard regions zapped any existing mappings to
> file-backed folios, once the guard regions are removed then the
> hole-punched region will be faulted in as usual and everything will behave
> as expected.
>
> Lorenzo Stoakes (4):
> mm: allow guard regions in file-backed and read-only mappings
> selftests/mm: rename guard-pages to guard-regions
> tools/selftests: expand all guard region tests to file-backed
> tools/selftests: add file/shmem-backed mapping guard region tests
>
> mm/madvise.c | 8 +-
> tools/testing/selftests/mm/.gitignore | 2 +-
> tools/testing/selftests/mm/Makefile | 2 +-
> .../mm/{guard-pages.c => guard-regions.c} | 921 ++++++++++++++++--
> 4 files changed, 821 insertions(+), 112 deletions(-)
> rename tools/testing/selftests/mm/{guard-pages.c => guard-regions.c} (58%)
>
> --
> 2.48.1
* Re: [PATCH 0/4] mm: permit guard regions for file-backed/shmem mappings
2025-02-19 8:25 ` Kalesh Singh
@ 2025-02-19 8:35 ` Kalesh Singh
2025-02-19 9:15 ` Lorenzo Stoakes
2025-02-19 17:32 ` Liam R. Howlett
2025-02-19 9:03 ` Lorenzo Stoakes
1 sibling, 2 replies; 64+ messages in thread
From: Kalesh Singh @ 2025-02-19 8:35 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, Suren Baghdasaryan, Liam R . Howlett,
Matthew Wilcox, Vlastimil Babka, Paul E . McKenney, Jann Horn,
David Hildenbrand, linux-mm, linux-kernel, Shuah Khan,
linux-kselftest, linux-api, John Hubbard, Juan Yescas
On Wed, Feb 19, 2025 at 12:25 AM Kalesh Singh <kaleshsingh@google.com> wrote:
>
> On Thu, Feb 13, 2025 at 10:18 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > The guard regions feature was initially implemented to support anonymous
> > mappings only, excluding shmem.
> >
> > This was done such as to introduce the feature carefully and incrementally
> > and to be conservative when considering the various caveats and corner
> > cases that are applicable to file-backed mappings but not to anonymous
> > ones.
> >
> > Now this feature has landed in 6.13, it is time to revisit this and to
> > extend this functionality to file-backed and shmem mappings.
> >
> > In order to make this maximally useful, and since one may map file-backed
> > mappings read-only (for instance ELF images), we also remove the
> > restriction on read-only mappings and permit the establishment of guard
> > regions in any non-hugetlb, non-mlock()'d mapping.
>
> Hi Lorenzo,
>
> Thank you for your work on this.
>
> Have we thought about how guard regions are represented in /proc/*/[s]maps?
>
> In the field, I've found that many applications read the ranges from
> /proc/self/[s]maps to determine what they can access (usually related
> to obfuscation techniques). If they don't know of the guard regions it
> would cause them to crash; I think that we'll need similar entries to
> PROT_NONE (---p) for these, and generally to maintain consistency
> between the behavior and what is being said from /proc/*/[s]maps.
To clarify why the applications may not be aware of their guard
regions -- in the case of the ELF mappings these PROT_NONE (guard
regions) would be installed by the dynamic loader; or may be inherited
from the parent (zygote in Android's case).
>
> -- Kalesh
>
> >
> > It is permissible to permit the establishment of guard regions in read-only
> > mappings because the guard regions only reduce access to the mapping, and
> > when removed simply reinstate the existing attributes of the underlying
> > VMA, meaning no access violations can occur.
> >
> > While the change in kernel code introduced in this series is small, the
> > majority of the effort here is spent in extending the testing to assert
> > that the feature works correctly across numerous file-backed mapping
> > scenarios.
> >
> > Every single guard region self-test performed against anonymous memory
> > (which is relevant and not anon-only) has now been updated to also be
> > performed against shmem and a mapping of a file in the working directory.
> >
> > This confirms that all cases also function correctly for file-backed guard
> > regions.
> >
> > In addition a number of other tests are added for specific file-backed
> > mapping scenarios.
> >
> > There are a number of other concerns that one might have with regard to
> > guard regions, addressed below:
> >
> > Readahead
> > ~~~~~~~~~
> >
> > Readahead is a process through which the page cache is populated on the
> > assumption that sequential reads will occur, thus amortising I/O and,
> > through a clever use of the PG_readahead folio flag establishing during
> > major fault and checked upon minor fault, provides for asynchronous I/O to
> > occur as dat is processed, reducing I/O stalls as data is faulted in.
> >
> > Guard regions do not alter this mechanism which operations at the folio and
> > fault level, but do of course prevent the faulting of folios that would
> > otherwise be mapped.
> >
> > In the instance of a major fault prior to a guard region, synchronous
> > readahead will occur including populating folios in the page cache which
> > the guard regions will, in the case of the mapping in question, prevent
> > access to.
> >
> > In addition, if PG_readahead is placed in a folio that is now inaccessible,
> > this will prevent asynchronous readahead from occurring as it would
> > otherwise do.
> >
> > However, there are mechanisms for heuristically resetting this within
> > readahead regardless, which will 'recover' correct readahead behaviour.
> >
> > Readahead presumes sequential data access, the presence of a guard region
> > clearly indicates that, at least in the guard region, no such sequential
> > access will occur, as it cannot occur there.
> >
> > So this should have very little impact on any real workload. The far more
> > important point is as to whether readahead causes incorrect or
> > inappropriate mapping of ranges disallowed by the presence of guard
> > regions - this is not the case, as readahead does not 'pre-fault' memory in
> > this fashion.
> >
> > At any rate, any mechanism which would attempt to do so would hit the usual
> > page fault paths, which correctly handle PTE markers as with anonymous
> > mappings.
> >
> > Fault-Around
> > ~~~~~~~~~~~~
> >
> > The fault-around logic, in a similar vein to readahead, attempts to improve
> > efficiency with regard to file-backed memory mappings, however it differs
> > in that it does not try to fetch folios into the page cache that are about
> > to be accessed, but rather pre-maps a range of folios around the faulting
> > address.
> >
> > Guard regions making use of PTE markers makes this relatively trivial, as
> > this case is already handled - see filemap_map_folio_range() and
> > filemap_map_order0_folio() - in both instances, the solution is to simply
> > keep the established page table mappings and let the fault handler take
> > care of PTE markers, as per the comment:
> >
> > /*
> > * NOTE: If there're PTE markers, we'll leave them to be
> > * handled in the specific fault path, and it'll prohibit
> > * the fault-around logic.
> > */
> >
> > This works, as establishing guard regions results in page table mappings
> > with PTE markers, and clearing them removes them.
> >
> > Truncation
> > ~~~~~~~~~~
> >
> > File truncation will not eliminate existing guard regions, as the
> > truncation operation will ultimately zap the range via
> > unmap_mapping_range(), which specifically excludes PTE markers.
> >
> > Zapping
> > ~~~~~~~
> >
> > Zapping is, as with anonymous mappings, handled by zap_nonpresent_ptes(),
> > which specifically deals with guard entries, leaving them intact except in
> > instances such as process teardown or munmap() where they need to be
> > removed.
> >
> > Reclaim
> > ~~~~~~~
> >
> > When reclaim is performed on file-backed folios, it ultimately invokes
> > try_to_unmap_one() via the rmap. If the folio is non-large, then map_pte()
> > will ultimately abort the operation for the guard region mapping. If large,
> > then check_pte() will determine that this is a non-device private
> > entry/device-exclusive entry 'swap' PTE and thus abort the operation in
> > that instance.
> >
> > Therefore, no odd things happen in the instance of reclaim being attempted
> > upon a file-backed guard region.
> >
> > Hole Punching
> > ~~~~~~~~~~~~~
> >
> > This updates the page cache and ultimately invokes unmap_mapping_range(),
> > which explicitly leaves PTE markers in place.
> >
> > Because the establishment of guard regions zapped any existing mappings to
> > file-backed folios, once the guard regions are removed then the
> > hole-punched region will be faulted in as usual and everything will behave
> > as expected.
> >
> > Lorenzo Stoakes (4):
> > mm: allow guard regions in file-backed and read-only mappings
> > selftests/mm: rename guard-pages to guard-regions
> > tools/selftests: expand all guard region tests to file-backed
> > tools/selftests: add file/shmem-backed mapping guard region tests
> >
> > mm/madvise.c | 8 +-
> > tools/testing/selftests/mm/.gitignore | 2 +-
> > tools/testing/selftests/mm/Makefile | 2 +-
> > .../mm/{guard-pages.c => guard-regions.c} | 921 ++++++++++++++++--
> > 4 files changed, 821 insertions(+), 112 deletions(-)
> > rename tools/testing/selftests/mm/{guard-pages.c => guard-regions.c} (58%)
> >
> > --
> > 2.48.1
* Re: [PATCH 0/4] mm: permit guard regions for file-backed/shmem mappings
2025-02-19 8:25 ` Kalesh Singh
2025-02-19 8:35 ` Kalesh Singh
@ 2025-02-19 9:03 ` Lorenzo Stoakes
2025-02-19 9:15 ` David Hildenbrand
1 sibling, 1 reply; 64+ messages in thread
From: Lorenzo Stoakes @ 2025-02-19 9:03 UTC (permalink / raw)
To: Kalesh Singh
Cc: Andrew Morton, Suren Baghdasaryan, Liam R . Howlett,
Matthew Wilcox, Vlastimil Babka, Paul E . McKenney, Jann Horn,
David Hildenbrand, linux-mm, linux-kernel, Shuah Khan,
linux-kselftest, linux-api, John Hubbard, Juan Yescas
On Wed, Feb 19, 2025 at 12:25:51AM -0800, Kalesh Singh wrote:
> On Thu, Feb 13, 2025 at 10:18 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > The guard regions feature was initially implemented to support anonymous
> > mappings only, excluding shmem.
> >
> > This was done such as to introduce the feature carefully and incrementally
> > and to be conservative when considering the various caveats and corner
> > cases that are applicable to file-backed mappings but not to anonymous
> > ones.
> >
> > Now this feature has landed in 6.13, it is time to revisit this and to
> > extend this functionality to file-backed and shmem mappings.
> >
> > In order to make this maximally useful, and since one may map file-backed
> > mappings read-only (for instance ELF images), we also remove the
> > restriction on read-only mappings and permit the establishment of guard
> > regions in any non-hugetlb, non-mlock()'d mapping.
>
> Hi Lorenzo,
>
> Thank you for your work on this.
You're welcome.
>
> Have we thought about how guard regions are represented in /proc/*/[s]maps?
This is off-topic here but... Yes, extensively. No they do not appear
there.
I thought you had attended LPC and my talk where I mentioned this
purposefully as a drawback?
I went out of my way to advertise this limitation at the LPC talk, in the
original series, etc. so it's a little disappointing that this is being
brought up so late, but nobody else has raised objections to this issue so
I think in general it's not a limitation that matters in practice.
>
> In the field, I've found that many applications read the ranges from
> /proc/self/[s]maps to determine what they can access (usually related
> to obfuscation techniques). If they don't know of the guard regions it
> would cause them to crash; I think that we'll need similar entries to
> PROT_NONE (---p) for these, and generally to maintain consistency
> between the behavior and what is being said from /proc/*/[s]maps.
No, we cannot have these, sorry.
Firstly /proc/$pid/[s]maps describes VMAs. The entire purpose of this
feature is to avoid having to accumulate VMAs for regions which are not
intended to be accessible.
Secondly, there is no practical means for this to be accomplished in
/proc/$pid/maps in _any_ way - as no metadata relating to a VMA indicates
they have guard regions.
This is intentional, because setting such metadata is simply not practical
- why? Because when you try to split the VMA, how do you know which bit
gets the metadata and which doesn't? You can't without _reading page
tables_.
/proc/$pid/smaps _does_ read page tables, but we can't start pretending
VMAs exist when they don't, this would be completely inaccurate, would
break assumptions for things like mremap (which require a single VMA) and
would be unworkable.
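To make the trade-off concrete, a rough sketch (not from the series; the
MADV_GUARD_INSTALL value is defined locally in case libc headers lack it):
carving the same gap the traditional way splits the mapping and shows up as a
separate ---p entry in /proc/$pid/maps, whereas a guard region leaves the
single VMA intact and lives purely in the page tables.

#include <sys/mman.h>

#ifndef MADV_GUARD_INSTALL
#define MADV_GUARD_INSTALL 102  /* as per the 6.13 uapi */
#endif

/* 'addr' is page-aligned and lies inside a single existing mapping. */
static int make_gap(char *addr, size_t len, int use_guard_region)
{
        if (use_guard_region)
                /* No VMA split - the guard lives in the page table entries. */
                return madvise(addr, len, MADV_GUARD_INSTALL);

        /* Splits the mapping into three VMAs; the middle one is ---p. */
        return mprotect(addr, len, PROT_NONE);
}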
The best that _could_ be achieved is to have a marker in /proc/$pid/smaps
saying 'hey this region has guard regions somewhere'.
But I haven't seen any demand for this and presumably this wouldn't help
your imagined program?
I don't really understand your use case though, what programs would read
/proc/maps, then... try to use /proc/$pid/mem or whatnot to arbitrarily
read regions? Such applications would be in danger of SIGBUS in any case if
they were to read invalid portions of file-backed mappings, and have no way
of knowing this, so they seem fundamentally broken as it is?
>
> -- Kalesh
>
> >
> > It is permissible to permit the establishment of guard regions in read-only
> > mappings because the guard regions only reduce access to the mapping, and
> > when removed simply reinstate the existing attributes of the underlying
> > VMA, meaning no access violations can occur.
> >
> > While the change in kernel code introduced in this series is small, the
> > majority of the effort here is spent in extending the testing to assert
> > that the feature works correctly across numerous file-backed mapping
> > scenarios.
> >
> > Every single guard region self-test performed against anonymous memory
> > (which is relevant and not anon-only) has now been updated to also be
> > performed against shmem and a mapping of a file in the working directory.
> >
> > This confirms that all cases also function correctly for file-backed guard
> > regions.
> >
> > In addition a number of other tests are added for specific file-backed
> > mapping scenarios.
> >
> > There are a number of other concerns that one might have with regard to
> > guard regions, addressed below:
> >
> > Readahead
> > ~~~~~~~~~
> >
> > Readahead is a process through which the page cache is populated on the
> > assumption that sequential reads will occur, thus amortising I/O and,
> > through a clever use of the PG_readahead folio flag establishing during
> > major fault and checked upon minor fault, provides for asynchronous I/O to
> > occur as dat is processed, reducing I/O stalls as data is faulted in.
> >
> > Guard regions do not alter this mechanism which operations at the folio and
> > fault level, but do of course prevent the faulting of folios that would
> > otherwise be mapped.
> >
> > In the instance of a major fault prior to a guard region, synchronous
> > readahead will occur including populating folios in the page cache which
> > the guard regions will, in the case of the mapping in question, prevent
> > access to.
> >
> > In addition, if PG_readahead is placed in a folio that is now inaccessible,
> > this will prevent asynchronous readahead from occurring as it would
> > otherwise do.
> >
> > However, there are mechanisms for heuristically resetting this within
> > readahead regardless, which will 'recover' correct readahead behaviour.
> >
> > Readahead presumes sequential data access, the presence of a guard region
> > clearly indicates that, at least in the guard region, no such sequential
> > access will occur, as it cannot occur there.
> >
> > So this should have very little impact on any real workload. The far more
> > important point is as to whether readahead causes incorrect or
> > inappropriate mapping of ranges disallowed by the presence of guard
> > regions - this is not the case, as readahead does not 'pre-fault' memory in
> > this fashion.
> >
> > At any rate, any mechanism which would attempt to do so would hit the usual
> > page fault paths, which correctly handle PTE markers as with anonymous
> > mappings.
> >
> > Fault-Around
> > ~~~~~~~~~~~~
> >
> > The fault-around logic, in a similar vein to readahead, attempts to improve
> > efficiency with regard to file-backed memory mappings, however it differs
> > in that it does not try to fetch folios into the page cache that are about
> > to be accessed, but rather pre-maps a range of folios around the faulting
> > address.
> >
> > Guard regions making use of PTE markers makes this relatively trivial, as
> > this case is already handled - see filemap_map_folio_range() and
> > filemap_map_order0_folio() - in both instances, the solution is to simply
> > keep the established page table mappings and let the fault handler take
> > care of PTE markers, as per the comment:
> >
> > /*
> > * NOTE: If there're PTE markers, we'll leave them to be
> > * handled in the specific fault path, and it'll prohibit
> > * the fault-around logic.
> > */
> >
> > This works, as establishing guard regions results in page table mappings
> > with PTE markers, and clearing them removes them.
> >
> > Truncation
> > ~~~~~~~~~~
> >
> > File truncation will not eliminate existing guard regions, as the
> > truncation operation will ultimately zap the range via
> > unmap_mapping_range(), which specifically excludes PTE markers.
> >
> > Zapping
> > ~~~~~~~
> >
> > Zapping is, as with anonymous mappings, handled by zap_nonpresent_ptes(),
> > which specifically deals with guard entries, leaving them intact except in
> > instances such as process teardown or munmap() where they need to be
> > removed.
> >
> > Reclaim
> > ~~~~~~~
> >
> > When reclaim is performed on file-backed folios, it ultimately invokes
> > try_to_unmap_one() via the rmap. If the folio is non-large, then map_pte()
> > will ultimately abort the operation for the guard region mapping. If large,
> > then check_pte() will determine that this is a non-device private
> > entry/device-exclusive entry 'swap' PTE and thus abort the operation in
> > that instance.
> >
> > Therefore, no odd things happen in the instance of reclaim being attempted
> > upon a file-backed guard region.
> >
> > Hole Punching
> > ~~~~~~~~~~~~~
> >
> > This updates the page cache and ultimately invokes unmap_mapping_range(),
> > which explicitly leaves PTE markers in place.
> >
> > Because the establishment of guard regions zapped any existing mappings to
> > file-backed folios, once the guard regions are removed then the
> > hole-punched region will be faulted in as usual and everything will behave
> > as expected.
> >
> > Lorenzo Stoakes (4):
> > mm: allow guard regions in file-backed and read-only mappings
> > selftests/mm: rename guard-pages to guard-regions
> > tools/selftests: expand all guard region tests to file-backed
> > tools/selftests: add file/shmem-backed mapping guard region tests
> >
> > mm/madvise.c | 8 +-
> > tools/testing/selftests/mm/.gitignore | 2 +-
> > tools/testing/selftests/mm/Makefile | 2 +-
> > .../mm/{guard-pages.c => guard-regions.c} | 921 ++++++++++++++++--
> > 4 files changed, 821 insertions(+), 112 deletions(-)
> > rename tools/testing/selftests/mm/{guard-pages.c => guard-regions.c} (58%)
> >
> > --
> > 2.48.1
* Re: [PATCH 0/4] mm: permit guard regions for file-backed/shmem mappings
2025-02-19 8:35 ` Kalesh Singh
@ 2025-02-19 9:15 ` Lorenzo Stoakes
2025-02-19 17:32 ` Liam R. Howlett
1 sibling, 0 replies; 64+ messages in thread
From: Lorenzo Stoakes @ 2025-02-19 9:15 UTC (permalink / raw)
To: Kalesh Singh
Cc: Andrew Morton, Suren Baghdasaryan, Liam R . Howlett,
Matthew Wilcox, Vlastimil Babka, Paul E . McKenney, Jann Horn,
David Hildenbrand, linux-mm, linux-kernel, Shuah Khan,
linux-kselftest, linux-api, John Hubbard, Juan Yescas
On Wed, Feb 19, 2025 at 12:35:12AM -0800, Kalesh Singh wrote:
> On Wed, Feb 19, 2025 at 12:25 AM Kalesh Singh <kaleshsingh@google.com> wrote:
> >
> > On Thu, Feb 13, 2025 at 10:18 AM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > The guard regions feature was initially implemented to support anonymous
> > > mappings only, excluding shmem.
> > >
> > > This was done such as to introduce the feature carefully and incrementally
> > > and to be conservative when considering the various caveats and corner
> > > cases that are applicable to file-backed mappings but not to anonymous
> > > ones.
> > >
> > > Now this feature has landed in 6.13, it is time to revisit this and to
> > > extend this functionality to file-backed and shmem mappings.
> > >
> > > In order to make this maximally useful, and since one may map file-backed
> > > mappings read-only (for instance ELF images), we also remove the
> > > restriction on read-only mappings and permit the establishment of guard
> > > regions in any non-hugetlb, non-mlock()'d mapping.
> >
> > Hi Lorenzo,
> >
> > Thank you for your work on this.
> >
> > Have we thought about how guard regions are represented in /proc/*/[s]maps?
> >
> > In the field, I've found that many applications read the ranges from
> > /proc/self/[s]maps to determine what they can access (usually related
> > to obfuscation techniques). If they don't know of the guard regions it
> > would cause them to crash; I think that we'll need similar entries to
> > PROT_NONE (---p) for these, and generally to maintain consistency
> > between the behavior and what is being said from /proc/*/[s]maps.
>
> To clarify why the applications may not be aware of their guard
> regions -- in the case of the ELF mappings these PROT_NONE (guard
> regions) would be installed by the dynamic loader; or may be inherited
> from the parent (zygote in Android's case).
I'm still really confused as to how you propose to use guard regions in
this scenario. If it's trying to 'clone' some set of mappings somehow then
guard regions are just not compatible with your use case at all.
I'm not sure what would work here, honestly... perhaps you could find a way
to modify the loader so that it treats what would otherwise be PROT_NONE
mappings as a cue to install guard regions instead, if that's workable?
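To make that concrete, here's a minimal, hypothetical sketch of what such a
loader change might do (install_elf_guard() is an invented name purely for
illustration; MADV_GUARD_INSTALL/MADV_GUARD_REMOVE are the madvise()
operations the guard region feature provides, and the numeric fallback
assumes the asm-generic UAPI value):

	#include <errno.h>
	#include <sys/mman.h>

	#ifndef MADV_GUARD_INSTALL
	#define MADV_GUARD_INSTALL 102	/* assuming the asm-generic UAPI value */
	#endif

	/*
	 * Hypothetical loader helper: rather than mmap()ing a separate
	 * PROT_NONE VMA to cover an inaccessible hole, install a guard
	 * region over the already-mapped range. No additional VMA is
	 * created; any access in [addr, addr + len) faults until the guard
	 * is removed again with MADV_GUARD_REMOVE.
	 */
	int install_elf_guard(void *addr, size_t len)
	{
		if (madvise(addr, len, MADV_GUARD_INSTALL))
			return -errno;
		return 0;
	}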
You can't simultaneously have no VMA (the object that describes a mapping
with common attributes), keep the state in the page tables, and yet somehow
still have something that describes a mapping with said page table state :)
There's simply no such beast.
Everything is a trade-off: guard regions provide the option to retain state
regarding faulting behaviour in the page tables rather than in VMAs,
entirely opt-in and at the behest of the user.
Now extended to file-backed and shmem!
While I hoped it would help your scenario based on your talk at LPC, this
is very clearly an important change for guard regions (most notably for
shmem): it removes an important limitation and opens up the possibility of
using them for things such as ELF mappings, for those who don't also
require specific [s]maps state (which explicitly NEEDS VMA metadata).
I suspect you can find a way of making this work in your scenario
though.
An alternative, if you're willing to pay the cost of reading page table
state, is to expose a new /proc/$pid/... endpoint that explicitly outputs
guard regions.
Would this solve your problem...? I had always considered implementing this
if there was a demand.
Maybe /proc/$pid/gmaps? :P 'g' could mean guard regions or... ;)
>
> >
> > -- Kalesh
> >
> > >
> > > It is permissible to permit the establishment of guard regions in read-only
> > > mappings because the guard regions only reduce access to the mapping, and
> > > when removed simply reinstate the existing attributes of the underlying
> > > VMA, meaning no access violations can occur.
> > >
> > > While the change in kernel code introduced in this series is small, the
> > > majority of the effort here is spent in extending the testing to assert
> > > that the feature works correctly across numerous file-backed mapping
> > > scenarios.
> > >
> > > Every single guard region self-test performed against anonymous memory
> > > (which is relevant and not anon-only) has now been updated to also be
> > > performed against shmem and a mapping of a file in the working directory.
> > >
> > > This confirms that all cases also function correctly for file-backed guard
> > > regions.
> > >
> > > In addition a number of other tests are added for specific file-backed
> > > mapping scenarios.
> > >
> > > There are a number of other concerns that one might have with regard to
> > > guard regions, addressed below:
> > >
> > > Readahead
> > > ~~~~~~~~~
> > >
> > > Readahead is a process through which the page cache is populated on the
> > > assumption that sequential reads will occur, thus amortising I/O and,
> > > through a clever use of the PG_readahead folio flag establishing during
> > > major fault and checked upon minor fault, provides for asynchronous I/O to
> > > occur as dat is processed, reducing I/O stalls as data is faulted in.
> > >
> > > Guard regions do not alter this mechanism which operations at the folio and
> > > fault level, but do of course prevent the faulting of folios that would
> > > otherwise be mapped.
> > >
> > > In the instance of a major fault prior to a guard region, synchronous
> > > readahead will occur including populating folios in the page cache which
> > > the guard regions will, in the case of the mapping in question, prevent
> > > access to.
> > >
> > > In addition, if PG_readahead is placed in a folio that is now inaccessible,
> > > this will prevent asynchronous readahead from occurring as it would
> > > otherwise do.
> > >
> > > However, there are mechanisms for heuristically resetting this within
> > > readahead regardless, which will 'recover' correct readahead behaviour.
> > >
> > > Readahead presumes sequential data access, the presence of a guard region
> > > clearly indicates that, at least in the guard region, no such sequential
> > > access will occur, as it cannot occur there.
> > >
> > > So this should have very little impact on any real workload. The far more
> > > important point is as to whether readahead causes incorrect or
> > > inappropriate mapping of ranges disallowed by the presence of guard
> > > regions - this is not the case, as readahead does not 'pre-fault' memory in
> > > this fashion.
> > >
> > > At any rate, any mechanism which would attempt to do so would hit the usual
> > > page fault paths, which correctly handle PTE markers as with anonymous
> > > mappings.
> > >
> > > Fault-Around
> > > ~~~~~~~~~~~~
> > >
> > > The fault-around logic, in a similar vein to readahead, attempts to improve
> > > efficiency with regard to file-backed memory mappings, however it differs
> > > in that it does not try to fetch folios into the page cache that are about
> > > to be accessed, but rather pre-maps a range of folios around the faulting
> > > address.
> > >
> > > Guard regions making use of PTE markers makes this relatively trivial, as
> > > this case is already handled - see filemap_map_folio_range() and
> > > filemap_map_order0_folio() - in both instances, the solution is to simply
> > > keep the established page table mappings and let the fault handler take
> > > care of PTE markers, as per the comment:
> > >
> > > /*
> > > * NOTE: If there're PTE markers, we'll leave them to be
> > > * handled in the specific fault path, and it'll prohibit
> > > * the fault-around logic.
> > > */
> > >
> > > This works, as establishing guard regions results in page table mappings
> > > with PTE markers, and clearing them removes them.
> > >
> > > Truncation
> > > ~~~~~~~~~~
> > >
> > > File truncation will not eliminate existing guard regions, as the
> > > truncation operation will ultimately zap the range via
> > > unmap_mapping_range(), which specifically excludes PTE markers.
> > >
> > > Zapping
> > > ~~~~~~~
> > >
> > > Zapping is, as with anonymous mappings, handled by zap_nonpresent_ptes(),
> > > which specifically deals with guard entries, leaving them intact except in
> > > instances such as process teardown or munmap() where they need to be
> > > removed.
> > >
> > > Reclaim
> > > ~~~~~~~
> > >
> > > When reclaim is performed on file-backed folios, it ultimately invokes
> > > try_to_unmap_one() via the rmap. If the folio is non-large, then map_pte()
> > > will ultimately abort the operation for the guard region mapping. If large,
> > > then check_pte() will determine that this is a non-device private
> > > entry/device-exclusive entry 'swap' PTE and thus abort the operation in
> > > that instance.
> > >
> > > Therefore, no odd things happen in the instance of reclaim being attempted
> > > upon a file-backed guard region.
> > >
> > > Hole Punching
> > > ~~~~~~~~~~~~~
> > >
> > > This updates the page cache and ultimately invokes unmap_mapping_range(),
> > > which explicitly leaves PTE markers in place.
> > >
> > > Because the establishment of guard regions zapped any existing mappings to
> > > file-backed folios, once the guard regions are removed then the
> > > hole-punched region will be faulted in as usual and everything will behave
> > > as expected.
> > >
> > > Lorenzo Stoakes (4):
> > > mm: allow guard regions in file-backed and read-only mappings
> > > selftests/mm: rename guard-pages to guard-regions
> > > tools/selftests: expand all guard region tests to file-backed
> > > tools/selftests: add file/shmem-backed mapping guard region tests
> > >
> > > mm/madvise.c | 8 +-
> > > tools/testing/selftests/mm/.gitignore | 2 +-
> > > tools/testing/selftests/mm/Makefile | 2 +-
> > > .../mm/{guard-pages.c => guard-regions.c} | 921 ++++++++++++++++--
> > > 4 files changed, 821 insertions(+), 112 deletions(-)
> > > rename tools/testing/selftests/mm/{guard-pages.c => guard-regions.c} (58%)
> > >
> > > --
> > > 2.48.1
* Re: [PATCH 0/4] mm: permit guard regions for file-backed/shmem mappings
2025-02-19 9:03 ` Lorenzo Stoakes
@ 2025-02-19 9:15 ` David Hildenbrand
2025-02-19 9:17 ` Lorenzo Stoakes
0 siblings, 1 reply; 64+ messages in thread
From: David Hildenbrand @ 2025-02-19 9:15 UTC (permalink / raw)
To: Lorenzo Stoakes, Kalesh Singh
Cc: Andrew Morton, Suren Baghdasaryan, Liam R . Howlett,
Matthew Wilcox, Vlastimil Babka, Paul E . McKenney, Jann Horn,
linux-mm, linux-kernel, Shuah Khan, linux-kselftest, linux-api,
John Hubbard, Juan Yescas
On 19.02.25 10:03, Lorenzo Stoakes wrote:
> On Wed, Feb 19, 2025 at 12:25:51AM -0800, Kalesh Singh wrote:
>> On Thu, Feb 13, 2025 at 10:18 AM Lorenzo Stoakes
>> <lorenzo.stoakes@oracle.com> wrote:
>>>
>>> The guard regions feature was initially implemented to support anonymous
>>> mappings only, excluding shmem.
>>>
>>> This was done such as to introduce the feature carefully and incrementally
>>> and to be conservative when considering the various caveats and corner
>>> cases that are applicable to file-backed mappings but not to anonymous
>>> ones.
>>>
>>> Now this feature has landed in 6.13, it is time to revisit this and to
>>> extend this functionality to file-backed and shmem mappings.
>>>
>>> In order to make this maximally useful, and since one may map file-backed
>>> mappings read-only (for instance ELF images), we also remove the
>>> restriction on read-only mappings and permit the establishment of guard
>>> regions in any non-hugetlb, non-mlock()'d mapping.
>>
>> Hi Lorenzo,
>>
>> Thank you for your work on this.
>
> You're welcome.
>
>>
>> Have we thought about how guard regions are represented in /proc/*/[s]maps?
>
> This is off-topic here but... Yes, extensively. No they do not appear
> there.
>
> I thought you had attended LPC and my talk where I mentioned this
> purposefully as a drawback?
>
> I went out of my way to advertise this limitation at the LPC talk, in the
> original series, etc. so it's a little disappointing that this is being
> brought up so late, but nobody else has raised objections to this issue so
> I think in general it's not a limitation that matters in practice.
>
>>
>> In the field, I've found that many applications read the ranges from
>> /proc/self/[s]maps to determine what they can access (usually related
>> to obfuscation techniques). If they don't know of the guard regions it
>> would cause them to crash; I think that we'll need similar entries to
>> PROT_NONE (---p) for these, and generally to maintain consistency
>> between the behavior and what is being said from /proc/*/[s]maps.
>
> No, we cannot have these, sorry.
>
> Firstly /proc/$pid/[s]maps describes VMAs. The entire purpose of this
> feature is to avoid having to accumulate VMAs for regions which are not
> intended to be accessible.
>
> Secondly, there is no practical means for this to be accomplished in
> /proc/$pid/maps in _any_ way - as no metadata relating to a VMA indicates
> they have guard regions.
>
> This is intentional, because setting such metadata is simply not practical
> - why? Because when you try to split the VMA, how do you know which bit
> gets the metadata and which doesn't? You can't without _reading page
> tables_.
>
> /proc/$pid/smaps _does_ read page tables, but we can't start pretending
> VMAs exist when they don't, this would be completely inaccurate, would
> break assumptions for things like mremap (which require a single VMA) and
> would be unworkable.
>
> The best that _could_ be achieved is to have a marker in /proc/$pid/smaps
> saying 'hey this region has guard regions somewhere'.
And then simply expose it in /proc/$pid/pagemap, which is a better
interface for this pte-level information inside of VMAs. We should still
have a spare bit for that purpose in the pagemap entries.
--
Cheers,
David / dhildenb
* Re: [PATCH 0/4] mm: permit guard regions for file-backed/shmem mappings
2025-02-19 9:15 ` David Hildenbrand
@ 2025-02-19 9:17 ` Lorenzo Stoakes
2025-02-19 18:52 ` Kalesh Singh
0 siblings, 1 reply; 64+ messages in thread
From: Lorenzo Stoakes @ 2025-02-19 9:17 UTC (permalink / raw)
To: David Hildenbrand
Cc: Kalesh Singh, Andrew Morton, Suren Baghdasaryan, Liam R . Howlett,
Matthew Wilcox, Vlastimil Babka, Paul E . McKenney, Jann Horn,
linux-mm, linux-kernel, Shuah Khan, linux-kselftest, linux-api,
John Hubbard, Juan Yescas
On Wed, Feb 19, 2025 at 10:15:47AM +0100, David Hildenbrand wrote:
> On 19.02.25 10:03, Lorenzo Stoakes wrote:
> > On Wed, Feb 19, 2025 at 12:25:51AM -0800, Kalesh Singh wrote:
> > > On Thu, Feb 13, 2025 at 10:18 AM Lorenzo Stoakes
> > > <lorenzo.stoakes@oracle.com> wrote:
> > > >
> > > > The guard regions feature was initially implemented to support anonymous
> > > > mappings only, excluding shmem.
> > > >
> > > > This was done such as to introduce the feature carefully and incrementally
> > > > and to be conservative when considering the various caveats and corner
> > > > cases that are applicable to file-backed mappings but not to anonymous
> > > > ones.
> > > >
> > > > Now this feature has landed in 6.13, it is time to revisit this and to
> > > > extend this functionality to file-backed and shmem mappings.
> > > >
> > > > In order to make this maximally useful, and since one may map file-backed
> > > > mappings read-only (for instance ELF images), we also remove the
> > > > restriction on read-only mappings and permit the establishment of guard
> > > > regions in any non-hugetlb, non-mlock()'d mapping.
> > >
> > > Hi Lorenzo,
> > >
> > > Thank you for your work on this.
> >
> > You're welcome.
> >
> > >
> > > Have we thought about how guard regions are represented in /proc/*/[s]maps?
> >
> > This is off-topic here but... Yes, extensively. No they do not appear
> > there.
> >
> > I thought you had attended LPC and my talk where I mentioned this
> > purposefully as a drawback?
> >
> > I went out of my way to advertise this limitation at the LPC talk, in the
> > original series, etc. so it's a little disappointing that this is being
> > brought up so late, but nobody else has raised objections to this issue so
> > I think in general it's not a limitation that matters in practice.
> >
> > >
> > > In the field, I've found that many applications read the ranges from
> > > /proc/self/[s]maps to determine what they can access (usually related
> > > to obfuscation techniques). If they don't know of the guard regions it
> > > would cause them to crash; I think that we'll need similar entries to
> > > PROT_NONE (---p) for these, and generally to maintain consistency
> > > between the behavior and what is being said from /proc/*/[s]maps.
> >
> > No, we cannot have these, sorry.
> >
> > Firstly /proc/$pid/[s]maps describes VMAs. The entire purpose of this
> > feature is to avoid having to accumulate VMAs for regions which are not
> > intended to be accessible.
> >
> > Secondly, there is no practical means for this to be accomplished in
> > /proc/$pid/maps in _any_ way - as no metadata relating to a VMA indicates
> > they have guard regions.
> >
> > This is intentional, because setting such metadata is simply not practical
> > - why? Because when you try to split the VMA, how do you know which bit
> > gets the metadata and which doesn't? You can't without _reading page
> > tables_.
> >
> > /proc/$pid/smaps _does_ read page tables, but we can't start pretending
> > VMAs exist when they don't, this would be completely inaccurate, would
> > break assumptions for things like mremap (which require a single VMA) and
> > would be unworkable.
> >
> > The best that _could_ be achieved is to have a marker in /proc/$pid/smaps
> > saying 'hey this region has guard regions somewhere'.
>
> And then simply expose it in /proc/$pid/pagemap, which is a better interface
> for this pte-level information inside of VMAs. We should still have a spare
> bit for that purpose in the pagemap entries.
Ah yeah thanks David forgot about that!
This is also a possibility if that'd solve your problems Kalesh?
This bit will be fought over haha
>
> --
> Cheers,
>
> David / dhildenb
>
* Re: [PATCH 0/4] mm: permit guard regions for file-backed/shmem mappings
2025-02-19 8:35 ` Kalesh Singh
2025-02-19 9:15 ` Lorenzo Stoakes
@ 2025-02-19 17:32 ` Liam R. Howlett
1 sibling, 0 replies; 64+ messages in thread
From: Liam R. Howlett @ 2025-02-19 17:32 UTC (permalink / raw)
To: Kalesh Singh
Cc: Lorenzo Stoakes, Andrew Morton, Suren Baghdasaryan,
Matthew Wilcox, Vlastimil Babka, Paul E . McKenney, Jann Horn,
David Hildenbrand, linux-mm, linux-kernel, Shuah Khan,
linux-kselftest, linux-api, John Hubbard, Juan Yescas
* Kalesh Singh <kaleshsingh@google.com> [250219 03:35]:
> On Wed, Feb 19, 2025 at 12:25 AM Kalesh Singh <kaleshsingh@google.com> wrote:
> >
> > On Thu, Feb 13, 2025 at 10:18 AM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > The guard regions feature was initially implemented to support anonymous
> > > mappings only, excluding shmem.
> > >
> > > This was done such as to introduce the feature carefully and incrementally
> > > and to be conservative when considering the various caveats and corner
> > > cases that are applicable to file-backed mappings but not to anonymous
> > > ones.
> > >
> > > Now this feature has landed in 6.13, it is time to revisit this and to
> > > extend this functionality to file-backed and shmem mappings.
> > >
> > > In order to make this maximally useful, and since one may map file-backed
> > > mappings read-only (for instance ELF images), we also remove the
> > > restriction on read-only mappings and permit the establishment of guard
> > > regions in any non-hugetlb, non-mlock()'d mapping.
> >
> > Hi Lorenzo,
> >
> > Thank you for your work on this.
> >
> > Have we thought about how guard regions are represented in /proc/*/[s]maps?
> >
> > In the field, I've found that many applications read the ranges from
> > /proc/self/[s]maps to determine what they can access (usually related
> > to obfuscation techniques). If they don't know of the guard regions it
> > would cause them to crash; I think that we'll need similar entries to
> > PROT_NONE (---p) for these, and generally to maintain consistency
> > between the behavior and what is being said from /proc/*/[s]maps.
>
> To clarify why the applications may not be aware of their guard
> regions -- in the case of the ELF mappings these PROT_NONE (guard
> regions) would be installed by the dynamic loader; or may be inherited
> from the parent (zygote in Android's case).
The guard regions are a method to reduce vma count and speed up guard
installation and removal. This very much will change the proc interface
by design, since it removes information from the maps file in exchange
for speed and fewer resources. I thought that was pretty clear from the
start.
The proc interface is also not a very good way to programmatically get
this information, as I'm sure you are aware. I'm also sure you know
that reading that file takes the mmap read lock, and tearing can occur.
I think this implies you are taking a lot of time to get the information
you want in the way you are getting it (parsing a text file, and not
doing any mmap write work)?
If this is a common pattern, I think we need a better interface to query
this type of information. We have an ioctl going in for getting vma
information, but that will lack these new guard regions as well.
David mentioned /proc/self/pagemap - that's certainly worth exploring,
and doesn't involve text parsing.
If it's not that common, then maybe your zygote/loader can communicate
the information in a way that does not involve read locking the program's
vm area?
Are you really parsing the same library information for each application
launched to find the guards, instead of asking or being told what they
are? I must be missing something...
Thanks,
Liam
* Re: [PATCH 0/4] mm: permit guard regions for file-backed/shmem mappings
2025-02-19 9:17 ` Lorenzo Stoakes
@ 2025-02-19 18:52 ` Kalesh Singh
2025-02-19 19:20 ` Lorenzo Stoakes
0 siblings, 1 reply; 64+ messages in thread
From: Kalesh Singh @ 2025-02-19 18:52 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: David Hildenbrand, Andrew Morton, Suren Baghdasaryan,
Liam R . Howlett, Matthew Wilcox, Vlastimil Babka,
Paul E . McKenney, Jann Horn, linux-mm, linux-kernel, Shuah Khan,
linux-kselftest, linux-api, John Hubbard, Juan Yescas
On Wed, Feb 19, 2025 at 1:17 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Wed, Feb 19, 2025 at 10:15:47AM +0100, David Hildenbrand wrote:
> > On 19.02.25 10:03, Lorenzo Stoakes wrote:
> > > On Wed, Feb 19, 2025 at 12:25:51AM -0800, Kalesh Singh wrote:
> > > > On Thu, Feb 13, 2025 at 10:18 AM Lorenzo Stoakes
> > > > <lorenzo.stoakes@oracle.com> wrote:
> > > > >
> > > > > The guard regions feature was initially implemented to support anonymous
> > > > > mappings only, excluding shmem.
> > > > >
> > > > > This was done such as to introduce the feature carefully and incrementally
> > > > > and to be conservative when considering the various caveats and corner
> > > > > cases that are applicable to file-backed mappings but not to anonymous
> > > > > ones.
> > > > >
> > > > > Now this feature has landed in 6.13, it is time to revisit this and to
> > > > > extend this functionality to file-backed and shmem mappings.
> > > > >
> > > > > In order to make this maximally useful, and since one may map file-backed
> > > > > mappings read-only (for instance ELF images), we also remove the
> > > > > restriction on read-only mappings and permit the establishment of guard
> > > > > regions in any non-hugetlb, non-mlock()'d mapping.
> > > >
> > > > Hi Lorenzo,
> > > >
> > > > Thank you for your work on this.
> > >
> > > You're welcome.
> > >
> > > >
> > > > Have we thought about how guard regions are represented in /proc/*/[s]maps?
> > >
> > > This is off-topic here but... Yes, extensively. No they do not appear
> > > there.
> > >
> > > I thought you had attended LPC and my talk where I mentioned this
> > > purposefully as a drawback?
> > >
> > > I went out of my way to advertise this limitation at the LPC talk, in the
> > > original series, etc. so it's a little disappointing that this is being
> > > brought up so late, but nobody else has raised objections to this issue so
> > > I think in general it's not a limitation that matters in practice.
> > >
Sorry for raising this now; yes, at the time I believe we discussed
reducing the vma slab memory usage for the PROT_NONE mappings. I
didn't imagine that apps could have dependencies on the mapped ELF
ranges in /proc/self/[s]maps until recent breakages from a similar
feature. Android itself doesn't depend on this, but what I've seen is
banking apps and apps that use obfuscation to prevent reverse
engineering (the particulars of such obfuscation are a black box).
> > > >
> > > > In the field, I've found that many applications read the ranges from
> > > > /proc/self/[s]maps to determine what they can access (usually related
> > > > to obfuscation techniques). If they don't know of the guard regions it
> > > > would cause them to crash; I think that we'll need similar entries to
> > > > PROT_NONE (---p) for these, and generally to maintain consistency
> > > > between the behavior and what is being said from /proc/*/[s]maps.
> > >
> > > No, we cannot have these, sorry.
> > >
> > > Firstly /proc/$pid/[s]maps describes VMAs. The entire purpose of this
> > > feature is to avoid having to accumulate VMAs for regions which are not
> > > intended to be accessible.
> > >
> > > Secondly, there is no practical means for this to be accomplished in
> > > /proc/$pid/maps in _any_ way - as no metadata relating to a VMA indicates
> > > they have guard regions.
> > >
> > > This is intentional, because setting such metadata is simply not practical
> > > - why? Because when you try to split the VMA, how do you know which bit
> > > gets the metadata and which doesn't? You can't without _reading page
> > > tables_.
Yeah, the splitting becomes complicated with any vm flags for this...
meaning any attempt to expose this in /proc/*/maps would have to
unconditionally walk the page tables :(
> > >
> > > /proc/$pid/smaps _does_ read page tables, but we can't start pretending
> > > VMAs exist when they don't, this would be completely inaccurate, would
> > > break assumptions for things like mremap (which require a single VMA) and
> > > would be unworkable.
> > >
> > > The best that _could_ be achieved is to have a marker in /proc/$pid/smaps
> > > saying 'hey this region has guard regions somewhere'.
> >
> > And then simply expose it in /proc/$pid/pagemap, which is a better interface
> > for this pte-level information inside of VMAs. We should still have a spare
> > bit for that purpose in the pagemap entries.
>
> Ah yeah thanks David forgot about that!
>
> This is also a possibility if that'd solve your problems Kalesh?
I'm not sure what the correct interface to advertise these is. Maybe
smaps, as you suggested, since we already walk the page tables there?
And a pagemap bit for the exact pages as well? It won't solve this
particular issue, as 1000s of in-field apps do look at this through
/proc/*/maps. But maybe we have to live with that...
We can argue that such apps are broken since they may trip on the
SIGBUS off the end of the file -- usually this isn't the case for the
ELF segment mappings.
This is still useful for other cases, I just wanted to get some ideas
if this can be extended to further use cases.
Thanks,
Kalesh
>
> This bit will be fought over haha
>
> >
> > --
> > Cheers,
> >
> > David / dhildenb
> >
* Re: [PATCH 0/4] mm: permit guard regions for file-backed/shmem mappings
2025-02-19 18:52 ` Kalesh Singh
@ 2025-02-19 19:20 ` Lorenzo Stoakes
2025-02-19 20:56 ` Kalesh Singh
0 siblings, 1 reply; 64+ messages in thread
From: Lorenzo Stoakes @ 2025-02-19 19:20 UTC (permalink / raw)
To: Kalesh Singh
Cc: David Hildenbrand, Andrew Morton, Suren Baghdasaryan,
Liam R . Howlett, Matthew Wilcox, Vlastimil Babka,
Paul E . McKenney, Jann Horn, linux-mm, linux-kernel, Shuah Khan,
linux-kselftest, linux-api, John Hubbard, Juan Yescas
On Wed, Feb 19, 2025 at 10:52:04AM -0800, Kalesh Singh wrote:
> On Wed, Feb 19, 2025 at 1:17 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Wed, Feb 19, 2025 at 10:15:47AM +0100, David Hildenbrand wrote:
> > > On 19.02.25 10:03, Lorenzo Stoakes wrote:
> > > > On Wed, Feb 19, 2025 at 12:25:51AM -0800, Kalesh Singh wrote:
> > > > > On Thu, Feb 13, 2025 at 10:18 AM Lorenzo Stoakes
> > > > > <lorenzo.stoakes@oracle.com> wrote:
> > > > > >
> > > > > > The guard regions feature was initially implemented to support anonymous
> > > > > > mappings only, excluding shmem.
> > > > > >
> > > > > > This was done such as to introduce the feature carefully and incrementally
> > > > > > and to be conservative when considering the various caveats and corner
> > > > > > cases that are applicable to file-backed mappings but not to anonymous
> > > > > > ones.
> > > > > >
> > > > > > Now this feature has landed in 6.13, it is time to revisit this and to
> > > > > > extend this functionality to file-backed and shmem mappings.
> > > > > >
> > > > > > In order to make this maximally useful, and since one may map file-backed
> > > > > > mappings read-only (for instance ELF images), we also remove the
> > > > > > restriction on read-only mappings and permit the establishment of guard
> > > > > > regions in any non-hugetlb, non-mlock()'d mapping.
> > > > >
> > > > > Hi Lorenzo,
> > > > >
> > > > > Thank you for your work on this.
> > > >
> > > > You're welcome.
> > > >
> > > > >
> > > > > Have we thought about how guard regions are represented in /proc/*/[s]maps?
> > > >
> > > > This is off-topic here but... Yes, extensively. No they do not appear
> > > > there.
> > > >
> > > > I thought you had attended LPC and my talk where I mentioned this
> > > > purposefully as a drawback?
> > > >
> > > > I went out of my way to advertise this limitation at the LPC talk, in the
> > > > original series, etc. so it's a little disappointing that this is being
> > > > brought up so late, but nobody else has raised objections to this issue so
> > > > I think in general it's not a limitation that matters in practice.
> > > >
>
> Sorry for raising this now, yes at the time I believe we discussed
> reducing the vma slab memory usage for the PROT_NONE mappings. I
> didn't imagine that apps could have dependencies on the mapped ELF
> ranges in /proc/self/[s]maps until recent breakages from a similar
> feature. Android itself doesn't depend on this but what I've seen is
> banking apps and apps that have obfuscation to prevent reverse
> engineering (the particulars of such obfuscation are a black box).
Ack ok fair enough, sorry, but obviously you can understand it's
frustrating when I went to great lengths to advertise this not only at the
talk but in the original series.
Really important to have these discussions early. Not that we can really do
much about this, as inherently this feature cannot give you what you need.
Is it _only_ banking apps that do this? And do they exclusively read
/proc/$pid/maps? I mean there's nothing we can do about that, sorry. If
that's immutable, then unless you do your own very, very, very slow custom
android maps implementation (that will absolutely break the /proc/$pid/maps
scalability efforts atm) this is just a no-go.
>
> > > > >
> > > > > In the field, I've found that many applications read the ranges from
> > > > > /proc/self/[s]maps to determine what they can access (usually related
> > > > > to obfuscation techniques). If they don't know of the guard regions it
> > > > > would cause them to crash; I think that we'll need similar entries to
> > > > > PROT_NONE (---p) for these, and generally to maintain consistency
> > > > > between the behavior and what is being said from /proc/*/[s]maps.
> > > >
> > > > No, we cannot have these, sorry.
> > > >
> > > > Firstly /proc/$pid/[s]maps describes VMAs. The entire purpose of this
> > > > feature is to avoid having to accumulate VMAs for regions which are not
> > > > intended to be accessible.
> > > >
> > > > Secondly, there is no practical means for this to be accomplished in
> > > > /proc/$pid/maps in _any_ way - as no metadata relating to a VMA indicates
> > > > they have guard regions.
> > > >
> > > > This is intentional, because setting such metadata is simply not practical
> > > > - why? Because when you try to split the VMA, how do you know which bit
> > > > gets the metadata and which doesn't? You can't without _reading page
> > > > tables_.
>
> Yeah the splitting becomes complicated with any vm flags for this...
> meaning any attempt to expose this in /proc/*/maps have to
> unconditionally walk the page tables :(
It's not really complicated, it's _impossible_ unless you made literally
all VMA code walk page tables for every single operation. Which we are
emphatically not going to do :)
And no, /proc/$pid/maps is _never_ going to walk page tables. For obvious
performance reasons.
>
> > > >
> > > > /proc/$pid/smaps _does_ read page tables, but we can't start pretending
> > > > VMAs exist when they don't, this would be completely inaccurate, would
> > > > break assumptions for things like mremap (which require a single VMA) and
> > > > would be unworkable.
> > > >
> > > > The best that _could_ be achieved is to have a marker in /proc/$pid/smaps
> > > > saying 'hey this region has guard regions somewhere'.
> > >
> > > And then simply expose it in /proc/$pid/pagemap, which is a better interface
> > > for this pte-level information inside of VMAs. We should still have a spare
> > > bit for that purpose in the pagemap entries.
> >
> > Ah yeah thanks David forgot about that!
> >
> > This is also a possibility if that'd solve your problems Kalesh?
>
> I'm not sure what is the correct interface to advertise these. Maybe
> smaps as you suggested since we already walk the page tables there?
> and pagemap bit for the exact pages as well? It won't solve this
> particular issue, as 1000s of in field apps do look at this through
> /proc/*/maps. But maybe we have to live with that...
I mean why are we even considering this if you can't change this anywhere?
Confused by that.
I'm afraid upstream can't radically change interfaces to suit this
scenario.
We also can't change smaps in the way you want, it _has_ to still give
output per VMA information.
The proposed change that would be there would be a flag or something
indicating that the VMA has guard regions _SOMEWHERE_ in it.
Since this doesn't solve your problem, adds complexity, and nobody else
seems to need it, I would suggest this is not worthwhile and I'd rather not
do this.
Therefore for your needs there are literally only two choices here:
1. Add a bit to /proc/$pid/pagemap OR
2. a new interface.
I am not in favour of a new interface here, if we can just extend pagemap.
What you'd have to do is:
1. Find virtual ranges via /proc/$pid/maps
2. iterate through /proc/$pid/pagemaps to retrieve state for all ranges.
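As a rough illustration of those two steps, here's a user-space sketch. It
is purely hypothetical: it assumes a guard-region bit were allocated in one
of the spare pagemap bits David mentions (no such bit exists upstream at
the time of writing), and PM_GUARD_BIT/dump_guards() are invented names:

	#include <fcntl.h>
	#include <inttypes.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <sys/types.h>
	#include <unistd.h>

	/* Invented for illustration; not an upstream pagemap bit. */
	#define PM_GUARD_BIT	58

	/* Step 2: scan pagemap entries for [start, end), a virtual range
	 * obtained from /proc/self/maps in step 1. */
	static void dump_guards(uint64_t start, uint64_t end)
	{
		long psize = sysconf(_SC_PAGESIZE);
		int fd = open("/proc/self/pagemap", O_RDONLY);
		uint64_t va;

		if (fd < 0)
			return;

		for (va = start; va < end; va += psize) {
			uint64_t entry;
			off_t off = (off_t)(va / psize) * sizeof(entry);

			if (pread(fd, &entry, sizeof(entry), off) != sizeof(entry))
				break;
			if (entry & (UINT64_C(1) << PM_GUARD_BIT))
				printf("guard page at 0x%" PRIx64 "\n", va);
		}
		close(fd);
	}

	int main(void)
	{
		/* Step 1 (abbreviated): take the first range listed in maps. */
		FILE *maps = fopen("/proc/self/maps", "r");
		uint64_t start, end;

		if (maps && fscanf(maps, "%" SCNx64 "-%" SCNx64, &start, &end) == 2)
			dump_guards(start, end);
		if (maps)
			fclose(maps);
		return 0;
	}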
Since anything that would retrieve guard region state would need to walk
page tables, any approach would be slow and I don't think this would be any
less slow than any other interface.
This way you'd be able to find all guard regions all the time.
This is just the trade-off for this feature unfortunately - its whole
design ethos is to allow modification of -faulting- behaviour without
having to modify -VMA- behaviour.
But if it's banking apps whose code you can't control (surprised you don't
lock down these interfaces), I mean is this even useful to you?
If your requirement is 'you have to change /proc/$pid/maps to show guard
regions' I mean the answer is that we can't.
>
> We can argue that such apps are broken since they may trip on the
> SIGBUS off the end of the file -- usually this isn't the case for the
> ELF segment mappings.
Or tearing of the maps interface, or things getting unmapped or or
or... It's really not a sane thing to do.
>
> This is still useful for other cases, I just wanted to get some ideas
> if this can be extended to further use cases.
Well I'm glad that you guys find it useful for _something_ ;)
Again this wasn't written only for you (it is broadly a good feature for
upstream), but I did have your use case in mind, so I'm a little
disappointed that it doesn't help, as I like to solve problems.
But I'm glad it solves at least some for you...
>
> Thanks,
> Kalesh
>
>
> >
> > This bit will be fought over haha
> >
> > >
> > > --
> > > Cheers,
> > >
> > > David / dhildenb
> > >
* Re: [PATCH 0/4] mm: permit guard regions for file-backed/shmem mappings
2025-02-19 19:20 ` Lorenzo Stoakes
@ 2025-02-19 20:56 ` Kalesh Singh
2025-02-20 8:51 ` Lorenzo Stoakes
0 siblings, 1 reply; 64+ messages in thread
From: Kalesh Singh @ 2025-02-19 20:56 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: David Hildenbrand, Andrew Morton, Suren Baghdasaryan,
Liam R . Howlett, Matthew Wilcox, Vlastimil Babka,
Paul E . McKenney, Jann Horn, linux-mm, linux-kernel, Shuah Khan,
linux-kselftest, linux-api, John Hubbard, Juan Yescas
On Wed, Feb 19, 2025 at 11:20 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Wed, Feb 19, 2025 at 10:52:04AM -0800, Kalesh Singh wrote:
> > On Wed, Feb 19, 2025 at 1:17 AM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > On Wed, Feb 19, 2025 at 10:15:47AM +0100, David Hildenbrand wrote:
> > > > On 19.02.25 10:03, Lorenzo Stoakes wrote:
> > > > > On Wed, Feb 19, 2025 at 12:25:51AM -0800, Kalesh Singh wrote:
> > > > > > On Thu, Feb 13, 2025 at 10:18 AM Lorenzo Stoakes
> > > > > > <lorenzo.stoakes@oracle.com> wrote:
> > > > > > >
> > > > > > > The guard regions feature was initially implemented to support anonymous
> > > > > > > mappings only, excluding shmem.
> > > > > > >
> > > > > > > This was done such as to introduce the feature carefully and incrementally
> > > > > > > and to be conservative when considering the various caveats and corner
> > > > > > > cases that are applicable to file-backed mappings but not to anonymous
> > > > > > > ones.
> > > > > > >
> > > > > > > Now this feature has landed in 6.13, it is time to revisit this and to
> > > > > > > extend this functionality to file-backed and shmem mappings.
> > > > > > >
> > > > > > > In order to make this maximally useful, and since one may map file-backed
> > > > > > > mappings read-only (for instance ELF images), we also remove the
> > > > > > > restriction on read-only mappings and permit the establishment of guard
> > > > > > > regions in any non-hugetlb, non-mlock()'d mapping.
> > > > > >
> > > > > > Hi Lorenzo,
> > > > > >
> > > > > > Thank you for your work on this.
> > > > >
> > > > > You're welcome.
> > > > >
> > > > > >
> > > > > > Have we thought about how guard regions are represented in /proc/*/[s]maps?
> > > > >
> > > > > This is off-topic here but... Yes, extensively. No they do not appear
> > > > > there.
> > > > >
> > > > > I thought you had attended LPC and my talk where I mentioned this
> > > > > purposefully as a drawback?
> > > > >
> > > > > I went out of my way to advertise this limitation at the LPC talk, in the
> > > > > original series, etc. so it's a little disappointing that this is being
> > > > > brought up so late, but nobody else has raised objections to this issue so
> > > > > I think in general it's not a limitation that matters in practice.
> > > > >
> >
> > Sorry for raising this now, yes at the time I believe we discussed
> > reducing the vma slab memory usage for the PROT_NONE mappings. I
> > didn't imagine that apps could have dependencies on the mapped ELF
> > ranges in /proc/self/[s]maps until recent breakages from a similar
> > feature. Android itself doesn't depend on this but what I've seen is
> > banking apps and apps that have obfuscation to prevent reverse
> > engineering (the particulars of such obfuscation are a black box).
>
> Ack ok fair enough, sorry, but obviously you can understand it's
> frustrating when I went to great lengths to advertise this not only at the
> talk but in the original series.
>
> Really important to have these discussions early. Not that really we can do
> much about this, as inherently this feature cannot give you what you need.
>
> Is it _only_ banking apps that do this? And do they exclusively read
> /proc/$pid/maps? I mean there's nothing we can do about that, sorry.
Not only banking apps but that's a common category.
> If that's immutable, then unless you do your own very, very, very slow custom
> android maps implementation (that will absolutely break the /proc/$pid/maps
> scalability efforts atm) this is just a no-go.
>
Yeah, unfortunately that's immutable, as app versions are mostly
independent of the OS version.
We do have something that handles this by encoding the guard regions
in the vm_flags, but as you can imagine it's not generic enough for
upstream.
> >
> > > > > >
> > > > > > In the field, I've found that many applications read the ranges from
> > > > > > /proc/self/[s]maps to determine what they can access (usually related
> > > > > > to obfuscation techniques). If they don't know of the guard regions it
> > > > > > would cause them to crash; I think that we'll need similar entries to
> > > > > > PROT_NONE (---p) for these, and generally to maintain consistency
> > > > > > between the behavior and what is being said from /proc/*/[s]maps.
> > > > >
> > > > > No, we cannot have these, sorry.
> > > > >
> > > > > Firstly /proc/$pid/[s]maps describes VMAs. The entire purpose of this
> > > > > feature is to avoid having to accumulate VMAs for regions which are not
> > > > > intended to be accessible.
> > > > >
> > > > > Secondly, there is no practical means for this to be accomplished in
> > > > > /proc/$pid/maps in _any_ way - as no metadata relating to a VMA indicates
> > > > > they have guard regions.
> > > > >
> > > > > This is intentional, because setting such metadata is simply not practical
> > > > > - why? Because when you try to split the VMA, how do you know which bit
> > > > > gets the metadata and which doesn't? You can't without _reading page
> > > > > tables_.
> >
> > Yeah the splitting becomes complicated with any vm flags for this...
> > meaning any attempt to expose this in /proc/*/maps have to
> > unconditionally walk the page tables :(
>
> It's not really complicated, it's _impossible_ unless you made literally
> all VMA code walk page tables for every single operation. Which we are
> emphatically not going to do :)
>
> And no, /proc/$pid/maps is _never_ going to walk page tables. For obvious
> performance reasons.
>
> >
> > > > >
> > > > > /proc/$pid/smaps _does_ read page tables, but we can't start pretending
> > > > > VMAs exist when they don't, this would be completely inaccurate, would
> > > > > break assumptions for things like mremap (which require a single VMA) and
> > > > > would be unworkable.
> > > > >
> > > > > The best that _could_ be achieved is to have a marker in /proc/$pid/smaps
> > > > > saying 'hey this region has guard regions somewhere'.
> > > >
> > > > And then simply expose it in /proc/$pid/pagemap, which is a better interface
> > > > for this pte-level information inside of VMAs. We should still have a spare
> > > > bit for that purpose in the pagemap entries.
> > >
> > > Ah yeah thanks David forgot about that!
> > >
> > > This is also a possibility if that'd solve your problems Kalesh?
> >
> > I'm not sure what is the correct interface to advertise these. Maybe
> > smaps as you suggested since we already walk the page tables there?
> > and pagemap bit for the exact pages as well? It won't solve this
> > particular issue, as 1000s of in field apps do look at this through
> > /proc/*/maps. But maybe we have to live with that...
>
> I mean why are we even considering this if you can't change this anywhere?
> Confused by that.
>
> I'm afraid upstream can't radically change interfaces to suit this
> scenario.
>
> We also can't change smaps in the way you want, it _has_ to still give
> output per VMA information.
Sorry I wasn't suggesting to change the entries in smaps, rather
agreeing to your marker suggestion. Maybe a set of ranges for each
smaps entry that has guards? It doesn't solve the use case, but does
make these regions visible to userspace.
>
> The proposed change that would be there would be a flag or something
> indicating that the VMA has guard regions _SOMEWHERE_ in it.
>
> Since this doesn't solve your problem, adds complexity, and nobody else
> seems to need it, I would suggest this is not worthwhile and I'd rather not
> do this.
>
> Therefore for your needs there are literally only two choices here:
>
> 1. Add a bit to /proc/$pid/pagemap OR
> 2. a new interface.
>
> I am not in favour of a new interface here, if we can just extend pagemap.
>
> What you'd have to do is:
>
> 1. Find virtual ranges via /proc/$pid/maps
> 2. iterate through /proc/$pid/pagemaps to retrieve state for all ranges.
>
Could we also consider an smaps field like:
VmGuards: [AAA, BBB), [CCC, DDD), ...
or something of that sort?
> Since anything that would retrieve guard region state would need to walk
> page tables, any approach would be slow and I don't think this would be any
> less slow than any other interface.
>
> This way you'd be able to find all guard regions all the time.
>
> This is just the trade-off for this feature unfortunately - its whole
> design ethos is to allow modification of -faulting- behaviour without
> having to modify -VMA- behaviour.
>
> But if it's banking apps whose code you can't control (surprised you don't
> lock down these interfaces), I mean is this even useful to you?
>
> If your requirement is 'you have to change /proc/$pid/maps to show guard
> regions' I mean the answer is that we can't.
>
> >
> > We can argue that such apps are broken since they may trip on the
> > SIGBUS off the end of the file -- usually this isn't the case for the
> > ELF segment mappings.
>
> Or tearing of the maps interface, or things getting unmapped or or
> or... It's really not a sane thing to do.
>
> >
> > This is still useful for other cases, I just wanted to get some ideas
> > if this can be extended to further use cases.
>
> Well I'm glad that you guys find it useful for _something_ ;)
>
> Again this wasn't written only for you (it is broadly a good feature for
> upstream), but I did have your use case in mind, so I'm a little
> disappointed that it doesn't help, as I like to solve problems.
>
> But I'm glad it solves at least some for you...
I recall Liam had a proposal to store the guard ranges in the maple tree?
I wonder if that can be used in combination with this approach to have
a better representation of this?
>
> >
> > Thanks,
> > Kalesh
> >
> >
> > >
> > > This bit will be fought over haha
> > >
> > > >
> > > > --
> > > > Cheers,
> > > >
> > > > David / dhildenb
> > > >
* Re: [PATCH 0/4] mm: permit guard regions for file-backed/shmem mappings
2025-02-19 20:56 ` Kalesh Singh
@ 2025-02-20 8:51 ` Lorenzo Stoakes
2025-02-20 8:57 ` David Hildenbrand
2025-02-20 9:22 ` Vlastimil Babka
0 siblings, 2 replies; 64+ messages in thread
From: Lorenzo Stoakes @ 2025-02-20 8:51 UTC (permalink / raw)
To: Kalesh Singh
Cc: David Hildenbrand, Andrew Morton, Suren Baghdasaryan,
Liam R . Howlett, Matthew Wilcox, Vlastimil Babka,
Paul E . McKenney, Jann Horn, linux-mm, linux-kernel, Shuah Khan,
linux-kselftest, linux-api, John Hubbard, Juan Yescas
On Wed, Feb 19, 2025 at 12:56:31PM -0800, Kalesh Singh wrote:
> > We also can't change smaps in the way you want, it _has_ to still give
> > output per VMA information.
>
> Sorry I wasn't suggesting to change the entries in smaps, rather
> agreeing to your marker suggestion. Maybe a set of ranges for each
> smaps entry that has guards? It doesn't solve the use case, but does
> make these regions visible to userspace.
No, you are not providing a use case for this. /proc/$pid/pagemaps does not
contaminate the smaps output, mess with efforts to make it RCU readable,
require updating the ioctl interface, etc. so it is clearly the better
choice.
>
> >
> > The proposed change that would be there would be a flag or something
> > indicating that the VMA has guard regions _SOMEWHERE_ in it.
> >
> > Since this doesn't solve your problem, adds complexity, and nobody else
> > seems to need it, I would suggest this is not worthwhile and I'd rather not
> > do this.
> >
> > Therefore for your needs there are literally only two choices here:
> >
> > 1. Add a bit to /proc/$pid/pagemap OR
> > 2. a new interface.
> >
> > I am not in favour of a new interface here, if we can just extend pagemap.
> >
> > What you'd have to do is:
> >
> > 1. Find virtual ranges via /proc/$pid/maps
> > 2. iterate through /proc/$pid/pagemaps to retrieve state for all ranges.
> >
>
> Could we also consider an smaps field like:
>
> VmGuards: [AAA, BBB), [CCC, DDD), ...
>
> or something of that sort?
No, absolutely, categorically not. You realise these could be thousands of
characters long right?
/proc/$pid/pagemaps resolves this without contaminating this output.
> > Well I'm glad that you guys find it useful for _something_ ;)
> >
> > Again this wasn't written only for you (it is broadly a good feature for
> > upstream), but I did have your use case in mind, so I'm a little
> > disappointed that it doesn't help, as I like to solve problems.
> >
> > But I'm glad it solves at least some for you...
>
> I recall Liam had a proposal to store the guard ranges in the maple tree?
>
> I wonder if that can be used in combination with this approach to have
> a better representation of this?
This was an alternative proposal made prior to the feature being
implemented (and you and others at Google were welcome to comment and many
were cc'd, etc.).
There is no 'in combination with'. This feature would take weeks/months to
implement, fundamentally impact the maple tree VMA implementation
and... not actually achieve anything + immediately be redundant.
Plus it'd likely be slower, have locking implications, would have kernel
memory allocation implications, a lot more complexity and probably other
problems besides (we discussed this at length at the time and a number of
issues came up, I can't recall all of them).
To be crystal clear - we are emphatically NOT changing /proc/$pid/maps to
lie about VMAs regardless of underlying implementation, nor adding
thousands of characters to /proc/$pid/smaps entries.
This is independent of implementation and would have been the case had we
gone with a maple node version.
So in no world is your problem solved here; unfortunately you have
inextricably tied yourself to a VMA representation.
I still wonder if you could find some means of abstracting this, but
/proc/$pid/pagemaps is where I am likely to expose this information for
anybody who needs it, and will likely send a series for this relatively
soon.
If you _can_ abstract this in some way, then if we provide this information
_anywhere_ you can get it.
As I said to you earlier, the _best_ we could do in smaps would be to add a
flag like 'Grd' or something to indicate some part of the VMA is
guarded. But I won't do that unless somebody has an -actual use case- for
it.
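Purely for illustration (no such flag exists upstream and the mnemonic is
invented), that would most naturally be an extra mnemonic on the existing
VmFlags line of an smaps entry, e.g.:

	VmFlags: rd mr mw me ac gu

i.e. nothing more than a per-VMA hint that a page table scan (via pagemap
or similar) may find guard regions somewhere in the range, not an attempt
to enumerate them.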
David's /proc/$pid/pagemaps suggestion is excellent, avoids all the
pitfalls, exposes guard regions to anybody who really really wants to know
and doesn't interfere with anything else, so this is what we'll go with.
Regards, Lorenzo
* Re: [PATCH 0/4] mm: permit guard regions for file-backed/shmem mappings
2025-02-20 8:51 ` Lorenzo Stoakes
@ 2025-02-20 8:57 ` David Hildenbrand
2025-02-20 9:04 ` Lorenzo Stoakes
2025-02-20 9:22 ` Vlastimil Babka
1 sibling, 1 reply; 64+ messages in thread
From: David Hildenbrand @ 2025-02-20 8:57 UTC (permalink / raw)
To: Lorenzo Stoakes, Kalesh Singh
Cc: Andrew Morton, Suren Baghdasaryan, Liam R . Howlett,
Matthew Wilcox, Vlastimil Babka, Paul E . McKenney, Jann Horn,
linux-mm, linux-kernel, Shuah Khan, linux-kselftest, linux-api,
John Hubbard, Juan Yescas
On 20.02.25 09:51, Lorenzo Stoakes wrote:
> On Wed, Feb 19, 2025 at 12:56:31PM -0800, Kalesh Singh wrote:
>>> We also can't change smaps in the way you want, it _has_ to still give
>>> output per VMA information.
>>
>> Sorry I wasn't suggesting to change the entries in smaps, rather
>> agreeing to your marker suggestion. Maybe a set of ranges for each
>> smaps entry that has guards? It doesn't solve the use case, but does
>> make these regions visible to userspace.
>
> No, you are not providing a usecase for this. /proc/$pid/pagemaps does not
> contaminate the smaps output, mess with efforts to make it RCU readable,
> require updating the ioctl interface, etc. so it is clearly the better
> choice.
>
>>
>>>
>>> The proposed change that would be there would be a flag or something
>>> indicating that the VMA has guard regions _SOMEWHERE_ in it.
>>>
>>> Since this doesn't solve your problem, adds complexity, and nobody else
>>> seems to need it, I would suggest this is not worthwhile and I'd rather not
>>> do this.
>>>
>>> Therefore for your needs there are literally only two choices here:
>>>
>>> 1. Add a bit to /proc/$pid/pagemap OR
>>> 2. a new interface.
>>>
>>> I am not in favour of a new interface here, if we can just extend pagemap.
>>>
>>> What you'd have to do is:
>>>
>>> 1. Find virtual ranges via /proc/$pid/maps
>>> 2. iterate through /proc/$pid/pagemaps to retrieve state for all ranges.
>>>
>>
>> Could we also consider an smaps field like:
>>
>> VmGuards: [AAA, BBB), [CCC, DDD), ...
>>
>> or something of that sort?
>
> No, absolutely, categorically not. You realise these could be thousands of
> characters long right?
>
> /proc/$pid/pagemaps resolves this without contaminating this output.
>
>>> Well I'm glad that you guys find it useful for _something_ ;)
>>>
>>> Again this wasn't written only for you (it is broadly a good feature for
>>> upstream), but I did have your use case in mind, so I'm a little
>>> disappointed that it doesn't help, as I like to solve problems.
>>>
>>> But I'm glad it solves at least some for you...
>>
>> I recall Liam had a proposal to store the guard ranges in the maple tree?
>>
>> I wonder if that can be used in combination with this approach to have
>> a better representation of this?
>
> This was an alternative proposal made prior to the feature being
> implemented (and you and others at Google were welcome to comment and many
> were cc'd, etc.).
>
> There is no 'in combination with'. This feature would take weeks/months to
> implement, fundamentally impact the maple tree VMA implementation
> and... not actually achieve anything + immediately be redundant.
>
> Plus it'd likely be slower, have locking implications, would have kernel
> memory allocation implications, a lot more complexity and probably other
> problems besides (we discussed this at length at the time and a number of
> issues came up, I can't recall all of them).
>
> To be crystal clear - we are emphatically NOT changing /proc/$pid/maps to
> lie about VMAs regardless of underlying implementation, nor adding
> thousands of characters to /proc/$pid/smaps entries.
Yes. Calling it a "guard region" might be part of the problem
(/"misunderstanding"), because it reminds people of "virtual memory
regions".
"Guard markers" or similar might have been clearer that these operate on
individual PTEs, require page table scanning etc ... which makes them a
lot more scalable and fine-grained and provides all these benefits, with
the downside being that we don't end up with that many "virtual memory
regions" that maps/smaps operate on.
[...]
>
> As I said to you earlier, the _best_ we could do in smaps would be to add a
> flag like 'Grd' or something to indicate some part of the VMA is
> guarded. But I won't do that unless somebody has an -actual use case- for
> it.
Right, and that would limit where you have to manually scan. Something
similar is being done with uffd-wp markers IIRC.
--
Cheers,
David / dhildenb
* Re: [PATCH 0/4] mm: permit guard regions for file-backed/shmem mappings
2025-02-20 8:57 ` David Hildenbrand
@ 2025-02-20 9:04 ` Lorenzo Stoakes
2025-02-20 9:23 ` David Hildenbrand
0 siblings, 1 reply; 64+ messages in thread
From: Lorenzo Stoakes @ 2025-02-20 9:04 UTC (permalink / raw)
To: David Hildenbrand
Cc: Kalesh Singh, Andrew Morton, Suren Baghdasaryan, Liam R . Howlett,
Matthew Wilcox, Vlastimil Babka, Paul E . McKenney, Jann Horn,
linux-mm, linux-kernel, Shuah Khan, linux-kselftest, linux-api,
John Hubbard, Juan Yescas
On Thu, Feb 20, 2025 at 09:57:37AM +0100, David Hildenbrand wrote:
> On 20.02.25 09:51, Lorenzo Stoakes wrote:
> > On Wed, Feb 19, 2025 at 12:56:31PM -0800, Kalesh Singh wrote:
> > > > We also can't change smaps in the way you want, it _has_ to still give
> > > > output per VMA information.
> > >
> > > Sorry I wasn't suggesting to change the entries in smaps, rather
> > > agreeing to your marker suggestion. Maybe a set of ranges for each
> > > smaps entry that has guards? It doesn't solve the use case, but does
> > > make these regions visible to userspace.
> >
> > No, you are not providing a usecase for this. /proc/$pid/pagemaps does not
> > contaminate the smaps output, mess with efforts to make it RCU readable,
> > require updating the ioctl interface, etc. so it is clearly the better
> > choice.
> >
> > >
> > > >
> > > > The proposed change that would be there would be a flag or something
> > > > indicating that the VMA has guard regions _SOMEWHERE_ in it.
> > > >
> > > > Since this doesn't solve your problem, adds complexity, and nobody else
> > > > seems to need it, I would suggest this is not worthwhile and I'd rather not
> > > > do this.
> > > >
> > > > Therefore for your needs there are literally only two choices here:
> > > >
> > > > 1. Add a bit to /proc/$pid/pagemap OR
> > > > 2. a new interface.
> > > >
> > > > I am not in favour of a new interface here, if we can just extend pagemap.
> > > >
> > > > What you'd have to do is:
> > > >
> > > > 1. Find virtual ranges via /proc/$pid/maps
> > > > 2. iterate through /proc/$pid/pagemaps to retrieve state for all ranges.
> > > >
> > >
> > > Could we also consider an smaps field like:
> > >
> > > VmGuards: [AAA, BBB), [CCC, DDD), ...
> > >
> > > or something of that sort?
> >
> > No, absolutely, categorically not. You realise these could be thousands of
> > characters long right?
> >
> > /proc/$pid/pagemaps resolves this without contaminating this output.
> >
> > > > Well I'm glad that you guys find it useful for _something_ ;)
> > > >
> > > > Again this wasn't written only for you (it is broadly a good feature for
> > > > upstream), but I did have your use case in mind, so I'm a little
> > > > disappointed that it doesn't help, as I like to solve problems.
> > > >
> > > > But I'm glad it solves at least some for you...
> > >
> > > I recall Liam had a proposal to store the guard ranges in the maple tree?
> > >
> > > I wonder if that can be used in combination with this approach to have
> > > a better representation of this?
> >
> > This was an alternative proposal made prior to the feature being
> > implemented (and you and others at Google were welcome to comment and many
> > were cc'd, etc.).
> >
> > There is no 'in combination with'. This feature would take weeks/months to
> > implement, fundamentally impact the maple tree VMA implementation
> > and... not actually achieve anything + immediately be redundant.
> >
> > Plus it'd likely be slower, have locking implications, would have kernel
> > memory allocation implications, a lot more complexity and probably other
> > problems besides (we discussed this at length at the time and a number of
> > issues came up, I can't recall all of them).
> >
> > To be crystal clear - we are emphatically NOT changing /proc/$pid/maps to
> > lie about VMAs regardless of underlying implementation, nor adding
> > thousands of characters to /proc/$pid/smaps entries.
>
> Yes. Calling it a "guard region" might be part of the problem
> (/"misunderstanding"), because it reminds people of "virtual memory
> regions".
>
> "Guard markers" or similar might have been clearer that these operate on
> individual PTEs, require page table scanning etc ... which makes them a lot
> more scalable and fine-grained and provides all these benefits, with the
> downside being that we don't end up with that many "virtual memory regions"
> that maps/smaps operate on.
Honestly David you and the naming... :P
I disagree, sorry. Saying 'guard' anything might make people think one
thing or another. We can't account for that. I mean don't get me started on
'pinning' or any of the million other overloaded terms we use...
I _hugely_ publicly went out of my way to express the limitations, I gave a
talk, we had meetings, I mentioned it in the series.
Honestly if at that point you still don't realise, that's not a naming
problem. It's a 'did not participate with upstream' problem.
I like guard regions, as they're not pages as we previously referred to
them. People have no idea what a marker is, it doesn't sound like it spans
ranges, no don't like it sorry.
And sorry but this naming topic is closed :) I already let you change the
naming of the MADV_'s, which broke my heart, there will not be a second
heart breaking...
>
> [...]
>
> >
> > As I said to you earlier, the _best_ we could do in smaps would be to add a
> > flag like 'Grd' or something to indicate some part of the VMA is
> > guarded. But I won't do that unless somebody has an -actual use case- for
> > it.
>
> Right, and that would limit where you have to manually scan. Something
> similar is being done with uffd-wp markers IIRC.
Yeah that's a good point, but honestly if you're reading smaps that reads
the page tables, then reading /proc/$pid/pagemaps and reading page tables
TWICE that seems inefficient vs. just reading /proc/$pid/maps, then reading
/proc/$pid/pagemaps and reading page tables once.
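To make the access pattern concrete, a rough userspace sketch of that single
pass might look like the following. The guard bit number is a placeholder,
since no pagemap bit had been assigned at this point, and note the actual
file is /proc/$pid/pagemap:

#include <fcntl.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define PAGEMAP_ENTRY_SIZE 8
#define GUARD_BIT 58 /* placeholder: whichever pagemap bit ends up carrying guard markers */

static void scan_range(int pm_fd, uint64_t start, uint64_t end)
{
        long page_size = sysconf(_SC_PAGESIZE);

        for (uint64_t addr = start; addr < end; addr += page_size) {
                uint64_t entry;
                off_t off = (off_t)(addr / page_size) * PAGEMAP_ENTRY_SIZE;

                if (pread(pm_fd, &entry, sizeof(entry), off) != sizeof(entry))
                        return;
                if (entry & (1ULL << GUARD_BIT))
                        printf("guard marker at %#" PRIx64 "\n", addr);
        }
}

int main(void)
{
        FILE *maps = fopen("/proc/self/maps", "r");
        int pm_fd = open("/proc/self/pagemap", O_RDONLY);
        char line[512];

        if (!maps || pm_fd < 0)
                return EXIT_FAILURE;

        /* One pass over the VMA list... */
        while (fgets(line, sizeof(line), maps)) {
                uint64_t start, end;

                /* ...and one pass over pagemap for each range. */
                if (sscanf(line, "%" SCNx64 "-%" SCNx64, &start, &end) == 2)
                        scan_range(pm_fd, start, end);
        }

        fclose(maps);
        close(pm_fd);
        return EXIT_SUCCESS;
}

A real tool would read pagemap entries for whole ranges at once rather than
one page per pread(), but the shape is the same: one pass over maps, one pass
over the page tables.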
>
> --
> Cheers,
>
> David / dhildenb
>
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH 0/4] mm: permit guard regions for file-backed/shmem mappings
2025-02-20 8:51 ` Lorenzo Stoakes
2025-02-20 8:57 ` David Hildenbrand
@ 2025-02-20 9:22 ` Vlastimil Babka
2025-02-20 9:53 ` Lorenzo Stoakes
1 sibling, 1 reply; 64+ messages in thread
From: Vlastimil Babka @ 2025-02-20 9:22 UTC (permalink / raw)
To: Lorenzo Stoakes, Kalesh Singh
Cc: David Hildenbrand, Andrew Morton, Suren Baghdasaryan,
Liam R . Howlett, Matthew Wilcox, Paul E . McKenney, Jann Horn,
linux-mm, linux-kernel, Shuah Khan, linux-kselftest, linux-api,
John Hubbard, Juan Yescas
On 2/20/25 09:51, Lorenzo Stoakes wrote:
> On Wed, Feb 19, 2025 at 12:56:31PM -0800, Kalesh Singh wrote:
>
> As I said to you earlier, the _best_ we could do in smaps would be to add a
> flag like 'Grd' or something to indicate some part of the VMA is
In smaps we could say how many kB is covered by guard ptes and it would be
in line with the current output, and the fact it's already scanning page
tables, so virtually no new overhead. But it wouldn't tell you the address
ranges, of course.
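Purely as an illustration, such a field (the name 'Guarded' below is
hypothetical, nothing of the sort exists in smaps today) would just be one
more per-VMA counter alongside the existing ones:

7f3b2a000000-7f3b2a800000 rw-p 00000000 00:00 0
Size:               8192 kB
Rss:                 256 kB
Pss:                 256 kB
...
Guarded:              64 kB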
> guarded. But I won't do that unless somebody has an -actual use case- for
> it.
Right, we need to hear that first. Also I'm a bit confused about what's the
issue with the existing Android apps and the proposals. Do the existing apps
expect to see particular PROT_NONE regions, which would be gone (become part
of the surrounding vma's) when using guards instead? Then clearly there's no
way to use guards for them without breaking their assumptions.
But assuming future versions of these apps for future android versions would
have to adapt to guards instead of PROT_NONE, why do we have to promise them
to represent the guards, if the point is just so they adapt to the new state
of smaps in their dubious security checking code, and not break anymore?
(but geez, if android apps are to use the android apis, which is java based
(IIRC?) why were they ever allowed to read /proc in the first place? such a
mistake, sigh)
> David's /proc/$pid/pagemaps suggestion is excellent, avoids all the
> pitfalls, exposes guard regions to anybody who really really wants to know
> and doesn't interfere with anything else, so this is what we'll go with.
>
> Regards, Lorenzo
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH 0/4] mm: permit guard regions for file-backed/shmem mappings
2025-02-20 9:04 ` Lorenzo Stoakes
@ 2025-02-20 9:23 ` David Hildenbrand
2025-02-20 9:47 ` Lorenzo Stoakes
0 siblings, 1 reply; 64+ messages in thread
From: David Hildenbrand @ 2025-02-20 9:23 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Kalesh Singh, Andrew Morton, Suren Baghdasaryan, Liam R . Howlett,
Matthew Wilcox, Vlastimil Babka, Paul E . McKenney, Jann Horn,
linux-mm, linux-kernel, Shuah Khan, linux-kselftest, linux-api,
John Hubbard, Juan Yescas
On 20.02.25 10:04, Lorenzo Stoakes wrote:
> On Thu, Feb 20, 2025 at 09:57:37AM +0100, David Hildenbrand wrote:
>> On 20.02.25 09:51, Lorenzo Stoakes wrote:
>>> On Wed, Feb 19, 2025 at 12:56:31PM -0800, Kalesh Singh wrote:
>>>>> We also can't change smaps in the way you want, it _has_ to still give
>>>>> output per VMA information.
>>>>
>>>> Sorry I wasn't suggesting to change the entries in smaps, rather
>>>> agreeing to your marker suggestion. Maybe a set of ranges for each
>>>> smaps entry that has guards? It doesn't solve the use case, but does
>>>> make these regions visible to userspace.
>>>
>>> No, you are not providing a usecase for this. /proc/$pid/pagemaps does not
>>> contaminate the smaps output, mess with efforts to make it RCU readable,
>>> require updating the ioctl interface, etc. so it is clearly the better
>>> choice.
>>>
>>>>
>>>>>
>>>>> The proposed change that would be there would be a flag or something
>>>>> indicating that the VMA has guard regions _SOMEWHERE_ in it.
>>>>>
>>>>> Since this doesn't solve your problem, adds complexity, and nobody else
>>>>> seems to need it, I would suggest this is not worthwhile and I'd rather not
>>>>> do this.
>>>>>
>>>>> Therefore for your needs there are literally only two choices here:
>>>>>
>>>>> 1. Add a bit to /proc/$pid/pagemap OR
>>>>> 2. a new interface.
>>>>>
>>>>> I am not in favour of a new interface here, if we can just extend pagemap.
>>>>>
>>>>> What you'd have to do is:
>>>>>
>>>>> 1. Find virtual ranges via /proc/$pid/maps
>>>>> 2. iterate through /proc/$pid/pagemaps to retrieve state for all ranges.
>>>>>
>>>>
>>>> Could we also consider an smaps field like:
>>>>
>>>> VmGuards: [AAA, BBB), [CCC, DDD), ...
>>>>
>>>> or something of that sort?
>>>
>>> No, absolutely, categorically not. You realise these could be thousands of
>>> characters long right?
>>>
>>> /proc/$pid/pagemaps resolves this without contaminating this output.
>>>
>>>>> Well I'm glad that you guys find it useful for _something_ ;)
>>>>>
>>>>> Again this wasn't written only for you (it is broadly a good feature for
>>>>> upstream), but I did have your use case in mind, so I'm a little
>>>>> disappointed that it doesn't help, as I like to solve problems.
>>>>>
>>>>> But I'm glad it solves at least some for you...
>>>>
>>>> I recall Liam had a proposal to store the guard ranges in the maple tree?
>>>>
>>>> I wonder if that can be used in combination with this approach to have
>>>> a better representation of this?
>>>
>>> This was an alternative proposal made prior to the feature being
>>> implemented (and you and others at Google were welcome to comment and many
>>> were cc'd, etc.).
>>>
>>> There is no 'in combination with'. This feature would take weeks/months to
>>> implement, fundamentally impact the maple tree VMA implementation
>>> and... not actually achieve anything + immediately be redundant.
>>>
>>> Plus it'd likely be slower, have locking implications, would have kernel
>>> memory allocation implications, a lot more complexity and probably other
>>> problems besides (we discussed this at length at the time and a number of
>>> issues came up, I can't recall all of them).
>>>
>>> To be crystal clear - we are emphatically NOT changing /proc/$pid/maps to
>>> lie about VMAs regardless of underlying implementation, nor adding
>>> thousands of characters to /proc/$pid/smaps entries.
>>
>> Yes. Calling it a "guard region" might be part of the problem
>> (/"misunderstanding"), because it reminds people of "virtual memory
>> regions".
>>
>> "Guard markers" or similar might have been clearer that these operate on
>> individual PTEs, require page table scanning etc ... which makes them a lot
>> more scalable and fine-grained and provides all these benefits, with the
>> downside being that we don't end up with that many "virtual memory regions"
>> that maps/smaps operate on.
>
> Honestly David you and the naming... :P
>
> I disagree, sorry. Saying 'guard' anything might make people think one
> thing or another. We can't account for that. I mean don't get me started on
> 'pinning' or any of the million other overloaded terms we use...
>
> I _hugely_ publicly went out of my way to express the limitations, I gave a
> talk, we had meetings, I mentioned it in the series.
>
> Honestly if at that point you still don't realise, that's not a naming
> problem. It's a 'did not participate with upstream' problem.
>
> I like guard regions, as they're not pages as we previously referred to
> them. People have no idea what a marker is, it doesn't sound like it spans
> ranges, no don't like it sorry.
>
> And sorry but this naming topic is closed :) I already let you change the
> naming of the MADV_'s, which broke my heart, there will not be a second
> heart breaking...
Lorenzo, I was not pushing for it to be changed *now*, that ship has
sailed, and personally, I *don't* find it confusing because I know how
it works under the hood.
I was trying to find a reason *why* people would think that it would
show up in smaps in the first place. For example, just when reading the
MAN page *today*.
Doesn't really matter now, it is named the way it is, and all we can do
is try making documentation clearer if it keeps confusing people.
Your conclusion is 'did not participate with upstream'; I don't agree
with that. But maybe you and Kalesh have a history on that that lets
you react on his questions IMHO more emotionally than it should have been.
>
>>
>> [...]
>>
>>>
>>> As I said to you earlier, the _best_ we could do in smaps would be to add a
>>> flag like 'Grd' or something to indicate some part of the VMA is
>>> guarded. But I won't do that unless somebody has an -actual use case- for
>>> it.
>>
>> Right, and that would limit where you have to manually scan. Something
>> similar is being done with uffd-wp markers IIRC.
>
> Yeah that's a good point, but honestly if you're reading smaps that reads
> the page tables, then reading /proc/$pid/pagemaps and reading page tables
> TWICE that seems inefficient vs. just reading /proc/$pid/maps, then reading
> /proc/$pid/pagemaps and reading page tables once.
Right; I recently wished that we would have an interface to obtain more
VMA flags without having to go through smaps
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH 0/4] mm: permit guard regions for file-backed/shmem mappings
2025-02-20 9:23 ` David Hildenbrand
@ 2025-02-20 9:47 ` Lorenzo Stoakes
2025-02-20 10:03 ` David Hildenbrand
0 siblings, 1 reply; 64+ messages in thread
From: Lorenzo Stoakes @ 2025-02-20 9:47 UTC (permalink / raw)
To: David Hildenbrand
Cc: Kalesh Singh, Andrew Morton, Suren Baghdasaryan, Liam R . Howlett,
Matthew Wilcox, Vlastimil Babka, Paul E . McKenney, Jann Horn,
linux-mm, linux-kernel, Shuah Khan, linux-kselftest, linux-api,
John Hubbard, Juan Yescas
On Thu, Feb 20, 2025 at 10:23:57AM +0100, David Hildenbrand wrote:
> On 20.02.25 10:04, Lorenzo Stoakes wrote:
> > On Thu, Feb 20, 2025 at 09:57:37AM +0100, David Hildenbrand wrote:
> > > On 20.02.25 09:51, Lorenzo Stoakes wrote:
> > > > On Wed, Feb 19, 2025 at 12:56:31PM -0800, Kalesh Singh wrote:
> > > > > > We also can't change smaps in the way you want, it _has_ to still give
> > > > > > output per VMA information.
> > > > >
> > > > > Sorry I wasn't suggesting to change the entries in smaps, rather
> > > > > agreeing to your marker suggestion. Maybe a set of ranges for each
> > > > > smaps entry that has guards? It doesn't solve the use case, but does
> > > > > make these regions visible to userspace.
> > > >
> > > > No, you are not providing a usecase for this. /proc/$pid/pagemaps does not
> > > > contaminate the smaps output, mess with efforts to make it RCU readable,
> > > > require updating the ioctl interface, etc. so it is clearly the better
> > > > choice.
> > > >
> > > > >
> > > > > >
> > > > > > The proposed change that would be there would be a flag or something
> > > > > > indicating that the VMA has guard regions _SOMEWHERE_ in it.
> > > > > >
> > > > > > Since this doesn't solve your problem, adds complexity, and nobody else
> > > > > > seems to need it, I would suggest this is not worthwhile and I'd rather not
> > > > > > do this.
> > > > > >
> > > > > > Therefore for your needs there are literally only two choices here:
> > > > > >
> > > > > > 1. Add a bit to /proc/$pid/pagemap OR
> > > > > > 2. a new interface.
> > > > > >
> > > > > > I am not in favour of a new interface here, if we can just extend pagemap.
> > > > > >
> > > > > > What you'd have to do is:
> > > > > >
> > > > > > 1. Find virtual ranges via /proc/$pid/maps
> > > > > > 2. iterate through /proc/$pid/pagemaps to retrieve state for all ranges.
> > > > > >
> > > > >
> > > > > Could we also consider an smaps field like:
> > > > >
> > > > > VmGuards: [AAA, BBB), [CCC, DDD), ...
> > > > >
> > > > > or something of that sort?
> > > >
> > > > No, absolutely, categorically not. You realise these could be thousands of
> > > > characters long right?
> > > >
> > > > /proc/$pid/pagemaps resolves this without contaminating this output.
> > > >
> > > > > > Well I'm glad that you guys find it useful for _something_ ;)
> > > > > >
> > > > > > Again this wasn't written only for you (it is broadly a good feature for
> > > > > > upstream), but I did have your use case in mind, so I'm a little
> > > > > > disappointed that it doesn't help, as I like to solve problems.
> > > > > >
> > > > > > But I'm glad it solves at least some for you...
> > > > >
> > > > > I recall Liam had a proposal to store the guard ranges in the maple tree?
> > > > >
> > > > > I wonder if that can be used in combination with this approach to have
> > > > > a better representation of this?
> > > >
> > > > This was an alternative proposal made prior to the feature being
> > > > implemented (and you and others at Google were welcome to comment and many
> > > > were cc'd, etc.).
> > > >
> > > > There is no 'in combination with'. This feature would take weeks/months to
> > > > implement, fundamentally impact the maple tree VMA implementation
> > > > and... not actually achieve anything + immediately be redundant.
> > > >
> > > > Plus it'd likely be slower, have locking implications, would have kernel
> > > > memory allocation implications, a lot more complexity and probably other
> > > > problems besides (we discussed this at length at the time and a number of
> > > > issues came up, I can't recall all of them).
> > > >
> > > > To be crystal clear - we are emphatically NOT changing /proc/$pid/maps to
> > > > lie about VMAs regardless of underlying implementation, nor adding
> > > > thousands of characters to /proc/$pid/smaps entries.
> > >
> > > Yes. Calling it a "guard region" might be part of the problem
> > > (/"misunderstanding"), because it reminds people of "virtual memory
> > > regions".
> > >
> > > "Guard markers" or similar might have been clearer that these operate on
> > > individual PTEs, require page table scanning etc ... which makes them a lot
> > > more scalable and fine-grained and provides all these benefits, with the
> > > downside being that we don't end up with that many "virtual memory regions"
> > > that maps/smaps operate on.
> >
> > Honestly David you and the naming... :P
> >
> > I disagree, sorry. Saying 'guard' anything might make people think one
> > thing or another. We can't account for that. I mean don't get me started on
> > 'pinning' or any of the million other overloaded terms we use...
> >
> > I _hugely_ publicly went out of my way to express the limitations, I gave a
> > talk, we had meetings, I mentioned it in the series.
> >
> > Honestly if at that point you still don't realise, that's not a naming
> > problem. It's a 'did not participate with upstream' problem.
> >
> > I like guard regions, as they're not pages as we previously referred to
> > them. People have no idea what a marker is, it doesn't sound like it spans
> > ranges, no don't like it sorry.
> >
> > And sorry but this naming topic is closed :) I already let you change the
> > naming of the MADV_'s, which broke my heart, there will not be a second
> > heart breaking...
>
> Lorenzo, I was not pushing for it to be changed *now*, that ship has sailed,
> and personally, I *don't* find it confusing because I know how it works
> under the hood.
>
> I was trying to find a reason *why* people would think that it would show up
> in smaps in the first place. For example, just when reading the MAN page
> *today*.
>
> Doesn't really matter now, it is named the way it is, and all we can do is
> try making documentation clearer if it keeps confusing people.
Right, but I disagree with your analysis. In any case, as you say, it doesn't
really matter at this stage.
>
> Your conclusion is 'did not participate with upstream'; I don't agree with
> that. But maybe you and Kalesh have a history on that that let's you react
> on his questions IMHO more emotionally than it should have been.
This is wholly unfair, I have been very reasonable in response to this
thread. I have offered to find solutions, I have tried to understand the
problem in spite of having gone to great lengths to try to discuss the
limitations of the proposed approach in every venue I possibly could.
I go out of my way to deal professionally and objectively with what is
presented. Nothing here is emotional. So I'd ask that you please abstain
from making commentary like this which has no basis.
My point about not participating with upstream is that Kalesh is now asking
for an -entirely different- approach to guard regions than was implemented,
months after it was merged.
This is after many discussions were had, including with people at Google
among other organisations, an RFC which clearly delineated the limitations,
a talk at LPC also.
I feel I communicate these things better than many people actually, I go out
of my way to document, add extensive self-documenting tests, etc. etc. and
to participate, engage and take onboard feedback from others.
So I'm suggesting that clearly - something broke down here. There was a
miscommunication, or there was a lack of awareness of a key requirement.
I mean a large motivator for file-backed support here came from the LPC
talk given by Kalesh and Juan re: ELF guard regions. So obviously that this
breakdown in communication with upstream occurred is very unfortunate.
I am not blaming anybody or being 'emotional'. I am simply stating what
seems to me to be a clear fact.
I genuinely don't understand how it could be seen any other way? How can
requesting an entirely different alternative approach months after the
fact be anything other than some failure to engage with upstream?
I am emphatically _not_ blaming Kalesh by the way, whom I respect and with
whom I have no problem, in any way whatsoever. I apologised when I realised
that he simply was not aware of this limitation at LPC if you look through
the thread, having politely suggested disappointment at this not having
been brought up then.
I am suggesting that within Google clearly there has been some form of
miscommunication or failure for one aspect of things (this limitation) to
be expressed at the right time to engage with upstream.
Sorry if you misinterpreted that as something else, this is the _only_
point I am making here.
And again, I am going out of my way to find practical and helpful ways
forward - though I cannot see how we can possibly fulfil Kalesh's needs
here.
>
>
> >
> > >
> > > [...]
> > >
> > > >
> > > > As I said to you earlier, the _best_ we could do in smaps would be to add a
> > > > flag like 'Grd' or something to indicate some part of the VMA is
> > > > guarded. But I won't do that unless somebody has an -actual use case- for
> > > > it.
> > >
> > > Right, and that would limit where you have to manually scan. Something
> > > similar is being done with uffd-wp markers IIRC.
> >
> > Yeah that's a good point, but honestly if you're reading smaps that reads
> > the page tables, then reading /proc/$pid/pagemaps and reading page tables
> > TWICE that seems inefficient vs. just reading /proc/$pid/maps, then reading
> > /proc/$pid/pagemaps and reading page tables once.
>
> Right; I recently wished that we would have an interface to obtain more VMA
> flags without having to go through smaps
Well maybe that lends itself to the idea of adding a whole new interface in
general...
>
> --
> Cheers,
>
> David / dhildenb
>
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH 0/4] mm: permit guard regions for file-backed/shmem mappings
2025-02-20 9:22 ` Vlastimil Babka
@ 2025-02-20 9:53 ` Lorenzo Stoakes
0 siblings, 0 replies; 64+ messages in thread
From: Lorenzo Stoakes @ 2025-02-20 9:53 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Kalesh Singh, David Hildenbrand, Andrew Morton,
Suren Baghdasaryan, Liam R . Howlett, Matthew Wilcox,
Paul E . McKenney, Jann Horn, linux-mm, linux-kernel, Shuah Khan,
linux-kselftest, linux-api, John Hubbard, Juan Yescas
On Thu, Feb 20, 2025 at 10:22:29AM +0100, Vlastimil Babka wrote:
> On 2/20/25 09:51, Lorenzo Stoakes wrote:
> > On Wed, Feb 19, 2025 at 12:56:31PM -0800, Kalesh Singh wrote:
> >
> > As I said to you earlier, the _best_ we could do in smaps would be to add a
> > flag like 'Grd' or something to indicate some part of the VMA is
>
> In smaps we could say how many kB is covered by guard ptes and it would be
> in line with the current output, and the fact it's already scanning page
> tables, so virtually no new overhead. But it wouldn't tell you the address
> ranges, of course.
Yes right. Sorry, I didn't think to suggest this, the fundamental point
being that there is essentially a 'flag' or 'entry' saying 'there are
guard regions here'. And indeed that could include ranges.
But it doesn't tell you the actual address ranges which is what Kalesh
appears to require.
You could equally well find these, and the sizes (albeit less conveniently)
with a combination of /proc/$pid/maps and adding a guard bit to
/proc/$pid/pagemap.
>
> > guarded. But I won't do that unless somebody has an -actual use case- for
> > it.
>
> Right, we need to hear that first. Also I'm a bit confused about what's the
> issue with the existing Android apps and the proposals. Do the existing apps
> expect to see particular PROT_NONE regions, which would be gone (become part
> of the surrounding vma's) when using guards instead? Then clearly there's no
> way to use guards for them without breaking their assumptions.
>
> But assuming future versions of these apps for future android versions would
> have to adapt to guards instead of PROT_NONE, why do we have to promise them
> to represent the guards, if the point is just so they adapt to the new state
> of smaps in their dubious security checking code, and not break anymore?
>
> (but geez, if android apps are to use the android apis, which is java based
> (IIRC?) why were there ever allowed to read /proc in the first place? such a
> mistake, sigh)
>
> > David's /proc/$pid/pagemaps suggestion is excellent, avoids all the
> > pitfalls, exposes guard regions to anybody who really really wants to know
> > and doesn't interfere with anything else, so this is what we'll go with.
> >
> > Regards, Lorenzo
>
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH 0/4] mm: permit guard regions for file-backed/shmem mappings
2025-02-20 9:47 ` Lorenzo Stoakes
@ 2025-02-20 10:03 ` David Hildenbrand
2025-02-20 10:15 ` Lorenzo Stoakes
0 siblings, 1 reply; 64+ messages in thread
From: David Hildenbrand @ 2025-02-20 10:03 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Kalesh Singh, Andrew Morton, Suren Baghdasaryan, Liam R . Howlett,
Matthew Wilcox, Vlastimil Babka, Paul E . McKenney, Jann Horn,
linux-mm, linux-kernel, Shuah Khan, linux-kselftest, linux-api,
John Hubbard, Juan Yescas
>> Your conclusion is 'did not participate with upstream'; I don't agree with
>> that. But maybe you and Kalesh have a history on that that let's you react
>> on his questions IMHO more emotionally than it should have been.
>
> This is wholly unfair, I have been very reasonable in response to this
> thread. I have offered to find solutions, I have tried to understand the
> problem in spite of having gone to great lengths to try to discuss the
> limitations of the proposed approach in every venue I possibly could.
>
> I go out of my way to deal professionally and objectively with what is
> presented. Nothing here is emotional. So I'd ask that you please abstain
> from making commentary like this which has no basis.
I appreciate everything you write below. But this request is just
impossible. I will keep raising my opinion and misunderstandings will
happen.
Note that the whole "Honestly David you and the naming. .." thing could
have been written as "I don't think it's a naming problem."
>>
>>
>>>
>>>>
>>>> [...]
>>>>
>>>>>
>>>>> As I said to you earlier, the _best_ we could do in smaps would be to add a
>>>>> flag like 'Grd' or something to indicate some part of the VMA is
>>>>> guarded. But I won't do that unless somebody has an -actual use case- for
>>>>> it.
>>>>
>>>> Right, and that would limit where you have to manually scan. Something
>>>> similar is being done with uffd-wp markers IIRC.
>>>
>>> Yeah that's a good point, but honestly if you're reading smaps that reads
>>> the page tables, then reading /proc/$pid/pagemaps and reading page tables
>>> TWICE that seems inefficient vs. just reading /proc/$pid/maps, then reading
>>> /proc/$pid/pagemaps and reading page tables once.
>>
>> Right; I recently wished that we would have an interface to obtain more VMA
>> flags without having to go through smaps
>
> Well maybe that lends itself to the idea of adding a whole new interface in
> general...
An extended "maps" interface might be reasonable, that allows for
exposing more things without walking the page tables. (e.g., flags)
Maybe one could have an indicator that says "ever had guard regions in
this mapping" without actually walking the page tables.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH 0/4] mm: permit guard regions for file-backed/shmem mappings
2025-02-20 10:03 ` David Hildenbrand
@ 2025-02-20 10:15 ` Lorenzo Stoakes
2025-02-20 12:44 ` David Hildenbrand
0 siblings, 1 reply; 64+ messages in thread
From: Lorenzo Stoakes @ 2025-02-20 10:15 UTC (permalink / raw)
To: David Hildenbrand
Cc: Kalesh Singh, Andrew Morton, Suren Baghdasaryan, Liam R . Howlett,
Matthew Wilcox, Vlastimil Babka, Paul E . McKenney, Jann Horn,
linux-mm, linux-kernel, Shuah Khan, linux-kselftest, linux-api,
John Hubbard, Juan Yescas
On Thu, Feb 20, 2025 at 11:03:02AM +0100, David Hildenbrand wrote:
> > > Your conclusion is 'did not participate with upstream'; I don't agree with
> > > that. But maybe you and Kalesh have a history on that that let's you react
> > > on his questions IMHO more emotionally than it should have been.
> >
> > This is wholly unfair, I have been very reasonable in response to this
> > thread. I have offered to find solutions, I have tried to understand the
> > problem in spite of having gone to great lengths to try to discuss the
> > limitations of the proposed approach in every venue I possibly could.
> >
> > I go out of my way to deal professionally and objectively with what is
> > presented. Nothing here is emotional. So I'd ask that you please abstain
> > from making commentary like this which has no basis.
>
> I appreciate everything you write below. But this request is just
> impossible. I will keep raising my opinion and misunderstandings will
> happen.
Well I wouldn't ask you not to express your opinion David, you know I respect
and like you, and by all means push back hard or call out what you think is bad
behaviour :)
I just meant to say, in my view, that there was no basis, but I appreciate
miscommunications happen.
So apologies if I came off as being difficult or rude, it actually wasn't
intended. And to re-emphasise - I have zero personal issue with anybody in this
thread whatsoever!
I just want to find the best way forward, technically and am willing to do
whatever work is required to make the guard region implementation as good as it
possibly can be.
>
> Note that the whole "Honestly David you and the naming. .." thing could have
> been written as "I don't think it's a naming problem."
I feel like I _always_ get in trouble when I try to write in a 'tongue-in-cheek'
style, which is what this was meant to be... so I think herein lies the basis of
the miscommunication :)
I apologise, the household is ill, which maybe affects my judgment in how I
write these, but in general text is a very poor medium. It was meant to be said
in a jolly tone with a wink...
I think maybe I should learn my lesson with these things, I thought the ':p'
would make this clear but yeah, text, poor medium.
Anyway apologies if this seemed disrespectful.
>
> > >
> > >
> > > >
> > > > >
> > > > > [...]
> > > > >
> > > > > >
> > > > > > As I said to you earlier, the _best_ we could do in smaps would be to add a
> > > > > > flag like 'Grd' or something to indicate some part of the VMA is
> > > > > > guarded. But I won't do that unless somebody has an -actual use case- for
> > > > > > it.
> > > > >
> > > > > Right, and that would limit where you have to manually scan. Something
> > > > > similar is being done with uffd-wp markers IIRC.
> > > >
> > > > Yeah that's a good point, but honestly if you're reading smaps that reads
> > > > the page tables, then reading /proc/$pid/pagemaps and reading page tables
> > > > TWICE that seems inefficient vs. just reading /proc/$pid/maps, then reading
> > > > /proc/$pid/pagemaps and reading page tables once.
> > >
> > > Right; I recently wished that we would have an interface to obtain more VMA
> > > flags without having to go through smaps
> >
> > Well maybe that lends itself to the idea of adding a whole new interface in
> > general...
>
> An extended "maps" interface might be reasonable, that allows for exposing
> more things without walking the page tables. (e.g., flags)
>
> Maybe one could have an indicator that says "ever had guard regions in this
> mapping" without actually walking the page tables.
Yeah this is something we've discussed before, but it's a little fraught. Let's
say it was a VMA flag, in this case we'd have to make this flag 'sticky' and not
impact merging (easy enough) to account for splits/merges.
The problem comes in that we would then need to acquire the VMA write lock to do
it, something we don't currently require on application of guard regions.
We'd also have to make sure nothing else makes any assumptions about VMA flags
implying differences in VMAs in this one instance (though we do already do this
for VM_SOFTDIRTY).
I saw this as possibly something like VM_MAYBE_GUARD_REGIONS or something.
>
> --
> Cheers,
>
> David / dhildenb
>
>
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH 0/4] mm: permit guard regions for file-backed/shmem mappings
2025-02-20 10:15 ` Lorenzo Stoakes
@ 2025-02-20 12:44 ` David Hildenbrand
2025-02-20 13:18 ` Lorenzo Stoakes
0 siblings, 1 reply; 64+ messages in thread
From: David Hildenbrand @ 2025-02-20 12:44 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Kalesh Singh, Andrew Morton, Suren Baghdasaryan, Liam R . Howlett,
Matthew Wilcox, Vlastimil Babka, Paul E . McKenney, Jann Horn,
linux-mm, linux-kernel, Shuah Khan, linux-kselftest, linux-api,
John Hubbard, Juan Yescas
On 20.02.25 11:15, Lorenzo Stoakes wrote:
> On Thu, Feb 20, 2025 at 11:03:02AM +0100, David Hildenbrand wrote:
>>>> Your conclusion is 'did not participate with upstream'; I don't agree with
>>>> that. But maybe you and Kalesh have a history on that that let's you react
>>>> on his questions IMHO more emotionally than it should have been.
>>>
>>> This is wholly unfair, I have been very reasonable in response to this
>>> thread. I have offered to find solutions, I have tried to understand the
>>> problem in spite of having gone to great lengths to try to discuss the
>>> limitations of the proposed approach in every venue I possibly could.
>>>
>>> I go out of my way to deal professionally and objectively with what is
>>> presented. Nothing here is emotional. So I'd ask that you please abstain
>>> from making commentary like this which has no basis.
>>
>> I appreciate everything you write below. But this request is just
>> impossible. I will keep raising my opinion and misunderstandings will
>> happen.
>
> Well I wouldn't ask you not to express your opinion David, you know I respect
> and like you, and by all means push back hard or call out what you think is bad
> behaviour :)
>
> I just meant to say, in my view, that there was no basis, but I appreciate
> miscommunications happen.
>
> So apologies if I came off as being difficult or rude, it actually wasn't
> intended. And to re-emphasise - I have zero personal issue with anybody in this
> thread whatsoever!
It sounded to me like you were trying to defend your work (again, IMHO
too emotionally, and I might have completely misinterpreted that) and
slowly switching to "friendly fire" (towards me). Apologies from my side
if I completely misunderstood/misinterpreted that.
To recap: what we have upstream is great; you did a great job. Yes, the
mechanism has its drawbacks, but that's just part of the design.
Some people maybe have wrong expectations, maybe there were
misunderstandings, or maybe there are requirements that only now pop up;
it's sometimes unavoidable, and that's ok.
We can try to document it better (and I was trying to find clues why
people might be misled), and see if/how we could sort out these
requirements. But we can likely not make it perfect in any possible way
(I'm sure there are plenty of use cases where what we currently have is
more than sufficient).
>
> I just want to find the best way forward, technically and am willing to do
> whatever work is required to make the guard region implementation as good as it
> possibly can be.
>
>>
>> Note that the whole "Honestly David you and the naming. .." thing could have
>> been written as "I don't think it's a naming problem."
>
> I feel like I _always_ get in trouble when I try to write in a 'tongue-in-cheek'
> style, which is what this was meant to be... so I think herein lies the basis of
> the miscommunication :)
>
> I apologise, the household is ill, which maybe affects my judgment in how I
> write these, but in general text is a very poor medium. It was meant to be said
> in a jolly tone with a wink...
>
> I think maybe I should learn my lesson with these things, I thought the ':p'
> would make this clear but yeah, text, poor medium.
>
> Anyway apologies if this seemed disrespectful.
No worries, it's hard to really make me angry, and I appreciate your
openness and your apology (well, and you and your work, obviously).
I'll note, though, if my memory serves me right, that nobody so far ever
criticized the way I communicate upstream, or even told me to abstain
from certain communication.
That probably hurt most, now that a couple of hours passed. Nothing that
a couple of beers and a bit of self-reflection on my communication style
can't fix ;)
[...]
>>>>> Yeah that's a good point, but honestly if you're reading smaps that reads
>>>>> the page tables, then reading /proc/$pid/pagemaps and reading page tables
>>>>> TWICE that seems inefficient vs. just reading /proc/$pid/maps, then reading
>>>>> /proc/$pid/pagemaps and reading page tables once.
>>>>
>>>> Right; I recently wished that we would have an interface to obtain more VMA
>>>> flags without having to go through smaps
>>>
>>> Well maybe that lends itself to the idea of adding a whole new interface in
>>> general...
>>
>> An extended "maps" interface might be reasonable, that allows for exposing
>> more things without walking the page tables. (e.g., flags)
>>
>> Maybe one could have an indicator that says "ever had guard regions in this
>> mapping" without actually walking the page tables.
>
> Yeah this is something we've discussed before, but it's a little fraught. Let's
> say it was a VMA flag, in this case we'd have to make this flag 'sticky' and not
> impact merging (easy enough) to account for splits/merges.
>
> The problem comes in that we would then need to acquire the VMA write lock to do
> it, something we don't currently require on application of guard regions.
Right, and we shouldn't write-lock the mmap. We'd need some way to just
atomically set such an indicator on a VMA.
I'll also note that it might be helpful for smallish regions, but
especially for large ones (including when they are split and the
indicator is wrong), it's less helpful. I don't have to tell you about
the VMA merging implications, probably it would be like VM_SOFTDIRTY
handling :)
>
> We'd also have to make sure nothing else makes any assumptions about VMA flags
> implying differences in VMAs in this one instance (though we do already do this
> for VM_SOFTDIRTY).
>
> I saw this as possibly something like VM_MAYBE_GUARD_REGIONS or something.
Yes.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH 0/4] mm: permit guard regions for file-backed/shmem mappings
2025-02-20 12:44 ` David Hildenbrand
@ 2025-02-20 13:18 ` Lorenzo Stoakes
2025-02-20 16:21 ` Suren Baghdasaryan
0 siblings, 1 reply; 64+ messages in thread
From: Lorenzo Stoakes @ 2025-02-20 13:18 UTC (permalink / raw)
To: David Hildenbrand
Cc: Kalesh Singh, Andrew Morton, Suren Baghdasaryan, Liam R . Howlett,
Matthew Wilcox, Vlastimil Babka, Paul E . McKenney, Jann Horn,
linux-mm, linux-kernel, Shuah Khan, linux-kselftest, linux-api,
John Hubbard, Juan Yescas
On Thu, Feb 20, 2025 at 01:44:20PM +0100, David Hildenbrand wrote:
> On 20.02.25 11:15, Lorenzo Stoakes wrote:
> > On Thu, Feb 20, 2025 at 11:03:02AM +0100, David Hildenbrand wrote:
> > > > > Your conclusion is 'did not participate with upstream'; I don't agree with
> > > > > that. But maybe you and Kalesh have a history on that that let's you react
> > > > > on his questions IMHO more emotionally than it should have been.
> > > >
> > > > This is wholly unfair, I have been very reasonable in response to this
> > > > thread. I have offered to find solutions, I have tried to understand the
> > > > problem in spite of having gone to great lengths to try to discuss the
> > > > limitations of the proposed approach in every venue I possibly could.
> > > >
> > > > I go out of my way to deal professionally and objectively with what is
> > > > presented. Nothing here is emotional. So I'd ask that you please abstain
> > > > from making commentary like this which has no basis.
> > >
> > > I appreciate everything you write below. But this request is just
> > > impossible. I will keep raising my opinion and misunderstandings will
> > > happen.
> >
> > Well I wouldn't ask you not to express your opinion David, you know I respect
> > and like you, and by all means push back hard or call out what you think is bad
> > behaviour :)
> >
> > I just meant to say, in my view, that there was no basis, but I appreciate
> > miscommunications happen.
> >
> > So apologies if I came off as being difficult or rude, it actually wasn't
> > intended. And to re-emphasise - I have zero personal issue with anybody in this
> > thread whatsoever!
>
> It sounded to me like you were trying to defend your work (again, IMHO too
> emotionally, and I might have completely misinterpreted that) and slowly
> switching to "friendly fire" (towards me). Apologies from my side if I
> completely misunderstood/misinterpreted that.
Right this was not at all my intent, sorry if it seemed that way. I may well
have communicated terribly, so apologies on my side too.
>
> To recap: what we have upstream is great; you did a great job. Yes, the
> mechanism has its drawbacks, but that's just part of the design.
Thanks :)
>
> Some people maybe have wrong expectations, maybe there were
> misunderstandings, or maybe there are requirements that only now pop up;
> it's sometimes unavoidable, and that's ok.
>
> We can try to document it better (and I was trying to find clues why people
> might be mislead), and see if/how we could sort out these requirements. But
> we can likely not make it perfect in any possible way (I'm sure there are
> plenty of use cases where what we currently have is more than sufficient).
Sure and I"m very open to adding a documentation page for guard regions, in
fact was considering this very thing recently. I already added man pages
but be good to be able to go into more depth.
>
> > I just want to find the best way forward, technically and am willing to do
> > whatever work is required to make the guard region implementation as good as it
> > possibly can be.
> >
> > >
> > > Note that the whole "Honestly David you and the naming. .." thing could have
> > > been written as "I don't think it's a naming problem."
> >
> > I feel like I _always_ get in trouble when I try to write in a 'tongue-in-cheek'
> > style, which is what this was meant to be... so I think herein lies the basis of
> > the miscommunication :)
> >
> > I apologise, the household is ill, which maybe affects my judgment in how I
> > write these, but in general text is a very poor medium. It was meant to be said
> > in a jolly tone with a wink...
> >
> > I think maybe I should learn my lesson with these things, I thought the ':p'
> > would make this clear but yeah, text, poor medium.
> >
> > Anyway apologies if this seemed disrespectful.
>
> No worries, it's hard to really make me angry, and I appreciate your
> openness and your apology (well, and you and your work, obviously).
>
> I'll note, though, if my memory serves me right, that nobody so far ever
> criticized the way I communicate upstream, or even told me to abstain from
> certain communication.
I wish I could say the same haha, so perhaps this was a problem on my side
honestly. I do have a habit of being 'tongue in cheek' and failing to
communicate that (which I did say last time I wouldn't repeat). It is not
intended, I promise.
As for the 'abstain', it was more a British turn of phrase, meaning to say - I
dispute the claim that this is an emotional thing and please don't say this
if it isn't so.
But I understand that of course, you may have interpreted it as so, due to
my having failed to communicate it well.
Again, I must say, text remains replete with possibilities for
miscommunication, misunderstanding and it can so often be difficult to
communicate one's intent.
But again of course, I apologise if I overstepped the line in any way!
>
> That probably hurt most, now that a couple of hours passed. Nothing that a
> couple of beers and a bit of self-reflection on my communication style can't
> fix ;)
Ugh sorry, man. Not my intent. And it seems - I literally OWE YOU pints
now. :) we will fix this at lsf...
Perhaps owe Kalesh some too should he be there... will budget
accordingly... :P
>
> [...]
>
> > > > > > Yeah that's a good point, but honestly if you're reading smaps that reads
> > > > > > the page tables, then reading /proc/$pid/pagemaps and reading page tables
> > > > > > TWICE that seems inefficient vs. just reading /proc/$pid/maps, then reading
> > > > > > /proc/$pid/pagemaps and reading page tables once.
> > > > >
> > > > > Right; I recently wished that we would have an interface to obtain more VMA
> > > > > flags without having to go through smaps
> > > >
> > > > Well maybe that lends itself to the idea of adding a whole new interface in
> > > > general...
> > >
> > > An extended "maps" interface might be reasonable, that allows for exposing
> > > more things without walking the page tables. (e.g., flags)
> > >
> > > Maybe one could have an indicator that says "ever had guard regions in this
> > > mapping" without actually walking the page tables.
> >
> > Yeah this is something we've discussed before, but it's a little fraught. Let's
> > say it was a VMA flag, in this case we'd have to make this flag 'sticky' and not
> > impact merging (easy enough) to account for splits/merges.
> >
> > The problem comes in that we would then need to acquire the VMA write lock to do
> > it, something we don't currently require on application of guard regions.
>
> Right, and we shouldn't write-lock the mmap. We'd need some way to just
> atomically set such an indicator on a VMA.
Hm yeah, could be tricky, we definitely can't manage a new field in
vm_area_struct, this is a very sensitive subject at the moment really with
Suren's work with VMAs allocated via SLAB_TYPESAFE_BY_RCU, putting the lock
into the VMA and the alignment requirements.
Not sure what precedent we'd have with atomic setting of a VMA flag for
this... could be tricky.
>
> I'll also note that it might be helpful for smallish region, but especially
> for large ones (including when they are split and the indicator is wrong),
> it's less helpful. I don't have to tell you about the VMA merging
> implications, probably it would be like VM_SOFTDIRTY handling :)
Yeah indeed now we've simplified merging a lot of possibilities emerge,
this is one!
>
> >
> > We'd also have to make sure nothing else makes any assumptions about VMA flags
> > implying differences in VMAs in this one instance (though we do already do this
> > for VM_SOFTDIRTY).
> >
> > I saw this as possibly something like VM_MAYBE_GUARD_REGIONS or something.
>
> Yes.
>
> --
> Cheers,
>
> David / dhildenb
>
Best, Lorenzo
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH 0/4] mm: permit guard regions for file-backed/shmem mappings
2025-02-20 13:18 ` Lorenzo Stoakes
@ 2025-02-20 16:21 ` Suren Baghdasaryan
2025-02-20 18:08 ` Kalesh Singh
0 siblings, 1 reply; 64+ messages in thread
From: Suren Baghdasaryan @ 2025-02-20 16:21 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: David Hildenbrand, Kalesh Singh, Andrew Morton, Liam R . Howlett,
Matthew Wilcox, Vlastimil Babka, Paul E . McKenney, Jann Horn,
linux-mm, linux-kernel, Shuah Khan, linux-kselftest, linux-api,
John Hubbard, Juan Yescas
On Thu, Feb 20, 2025 at 5:18 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Thu, Feb 20, 2025 at 01:44:20PM +0100, David Hildenbrand wrote:
> > On 20.02.25 11:15, Lorenzo Stoakes wrote:
> > > On Thu, Feb 20, 2025 at 11:03:02AM +0100, David Hildenbrand wrote:
> > > > > > Your conclusion is 'did not participate with upstream'; I don't agree with
> > > > > > that. But maybe you and Kalesh have a history on that that let's you react
> > > > > > on his questions IMHO more emotionally than it should have been.
> > > > >
> > > > > This is wholly unfair, I have been very reasonable in response to this
> > > > > thread. I have offered to find solutions, I have tried to understand the
> > > > > problem in spite of having gone to great lengths to try to discuss the
> > > > > limitations of the proposed approach in every venue I possibly could.
> > > > >
> > > > > I go out of my way to deal professionally and objectively with what is
> > > > > presented. Nothing here is emotional. So I'd ask that you please abstain
> > > > > from making commentary like this which has no basis.
> > > >
> > > > I appreciate everything you write below. But this request is just
> > > > impossible. I will keep raising my opinion and misunderstandings will
> > > > happen.
> > >
> > > Well I wouldn't ask you not to express your opinion David, you know I respect
> > > and like you, and by all means push back hard or call out what you think is bad
> > > behaviour :)
> > >
> > > I just meant to say, in my view, that there was no basis, but I appreciate
> > > miscommunications happen.
> > >
> > > So apologies if I came off as being difficult or rude, it actually wasn't
> > > intended. And to re-emphasise - I have zero personal issue with anybody in this
> > > thread whatsoever!
> >
> > It sounded to me like you were trying to defend your work (again, IMHO too
> > emotionally, and I might have completely misinterpreted that) and slowly
> > switching to "friendly fire" (towards me). Apologies from my side if I
> > completely misunderstood/misinterpreted that.
>
> Right this was not at all my intent, sorry if it seemed that way. I may well
> have communicated terribly, so apologies on my side too.
Sorry for being late to the party. Was sick for a couple of days.
Lorenzo is right, there was a breakdown in communication at Google and
he has every right to be upset. The issue with obfuscators should
have been communicated once it was discovered. I was in regular
discussions with Lorenzo but wasn't directly involved with this
particular project and wasn't aware or did not realize that the
obfuscator issue renders guards unusable for this usecase. My
apologies, I should have asked more questions about it. I suspect
Lorenzo would have implemented this anyway...
To make guard regions work for this usecase, first we (Android) need
to abstract /proc/pid/maps accesses. Only then can we use additional
interfaces like /proc/pid/pagemaps to obtain guard region information.
I'll start figuring out what it takes to insert such an abstraction.
Thanks,
Suren.
>
> >
> > To recap: what we have upstream is great; you did a great job. Yes, the
> > mechanism has its drawbacks, but that's just part of the design.
>
> Thanks :)
>
> >
> > Some people maybe have wrong expectations, maybe there were
> > misunderstandings, or maybe there are requirements that only now pop up;
> > it's sometimes unavoidable, and that's ok.
> >
> > We can try to document it better (and I was trying to find clues why people
> > might be mislead), and see if/how we could sort out these requirements. But
> > we can likely not make it perfect in any possible way (I'm sure there are
> > plenty of use cases where what we currently have is more than sufficient).
>
> Sure and I"m very open to adding a documentation page for guard regions, in
> fact was considering this very thing recently. I already added man pages
> but be good to be able to go into more depth.
>
> >
> > > I just want to find the best way forward, technically and am willing to do
> > > whatever work is required to make the guard region implementation as good as it
> > > possibly can be.
> > >
> > > >
> > > > Note that the whole "Honestly David you and the naming. .." thing could have
> > > > been written as "I don't think it's a naming problem."
> > >
> > > I feel like I _always_ get in trouble when I try to write in a 'tongue-in-cheek'
> > > style, which is what this was meant to be... so I think herein lies the basis of
> > > the miscommunication :)
> > >
> > > I apologise, the household is ill, which maybe affects my judgment in how I
> > > write these, but in general text is a very poor medium. It was meant to be said
> > > in a jolly tone with a wink...
> > >
> > > I think maybe I should learn my lesson with these things, I thought the ':p'
> > > would make this clear but yeah, text, poor medium.
> > >
> > > Anyway apologies if this seemed disrespectful.
> >
> > No worries, it's hard to really make me angry, and I appreciate your
> > openness and your apology (well, and you and your work, obviously).
> >
> > I'll note, though, if my memory serves me right, that nobody so far ever
> > criticized the way I communicate upstream, or even told me to abstain from
> > certain communication.
>
> I wish I could say the same haha, so perhaps this was a problem on my side
> honestly. I do have a habit of being 'tongue in cheek' and failing to
> communicate that which I did say the last time I wouldn't repeat. It is not
> intended, I promise.
>
> As the abstain, was more a British turn of phrase, meaning to say - I
> dispute the claim that this is an emotional thing and please don't say this
> if it isn't so.
>
> But I understand that of course, you may have interpreted it as so, due to
> my having failed to communicate it well.
>
> Again, I must say, text remains replete with possibilities for
> miscommunication, misunderstanding and it can so often be difficult to
> communicate one's intent.
>
> But again of course, I apologise if I overstepped the line in any way!
>
> >
> > That probably hurt most, now that a couple of hours passed. Nothing that a
> > couple of beers and a bit of self-reflection on my communication style can't
> > fix ;)
>
> Ugh sorry, man. Not my intent. And it seems - I literally OWE YOU pints
> now. :) we will fix this at lsf...
>
> Perhaps owe Kalesh some too should he be there... will budget
> accordingly... :P
>
> >
> > [...]
> >
> > > > > > > Yeah that's a good point, but honestly if you're reading smaps that reads
> > > > > > > the page tables, then reading /proc/$pid/pagemaps and reading page tables
> > > > > > > TWICE that seems inefficient vs. just reading /proc/$pid/maps, then reading
> > > > > > > /proc/$pid/pagemaps and reading page tables once.
> > > > > >
> > > > > > Right; I recently wished that we would have an interface to obtain more VMA
> > > > > > flags without having to go through smaps
> > > > >
> > > > > Well maybe that lends itself to the idea of adding a whole new interface in
> > > > > general...
> > > >
> > > > An extended "maps" interface might be reasonable, that allows for exposing
> > > > more things without walking the page tables. (e.g., flags)
> > > >
> > > > Maybe one could have an indicator that says "ever had guard regions in this
> > > > mapping" without actually walking the page tables.
> > >
> > > Yeah this is something we've discussed before, but it's a little fraught. Let's
> > > say it was a VMA flag, in this case we'd have to make this flag 'sticky' and not
> > > impact merging (easy enough) to account for splits/merges.
> > >
> > > The problem comes in that we would then need to acquire the VMA write lock to do
> > > it, something we don't currently require on application of guard regions.
> >
> > Right, and we shouldn't write-lock the mmap. We'd need some way to just
> > atomically set such an indicator on a VMA.
>
> Hm yeah, could be tricky, we definitely can't manage a new field in
> vm_area_struct, this is a very sensitive subject at the moment really with
> Suren's work with VMAs allocated via SLAB_TYPESAFE_BY_RCU, putting the lock
> into the VMA and the alignment requirements.
>
> Not sure what precedent we'd have with atomic setting of a VMA flag for
> this... could be tricky.
>
> >
> > I'll also note that it might be helpful for smallish region, but especially
> > for large ones (including when they are split and the indicator is wrong),
> > it's less helpful. I don't have to tell you about the VMA merging
> > implications, probably it would be like VM_SOFTDIRTY handling :)
>
> Yeah indeed now we've simplified merging a lot of possibilities emerge,
> this is one!
>
> >
> > >
> > > We'd also have to make sure nothing else makes any assumptions about VMA flags
> > > implying differences in VMAs in this one instance (though we do already do this
> > > for VM_SOFTDIRTY).
> > >
> > > I saw this as possibly something like VM_MAYBE_GUARD_REGIONS or something.
> >
> > Yes.
> >
> > --
> > Cheers,
> >
> > David / dhildenb
> >
>
> Best, Lorenzo
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH 0/4] mm: permit guard regions for file-backed/shmem mappings
2025-02-20 16:21 ` Suren Baghdasaryan
@ 2025-02-20 18:08 ` Kalesh Singh
2025-02-21 11:04 ` Lorenzo Stoakes
0 siblings, 1 reply; 64+ messages in thread
From: Kalesh Singh @ 2025-02-20 18:08 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Lorenzo Stoakes, David Hildenbrand, Andrew Morton,
Liam R . Howlett, Matthew Wilcox, Vlastimil Babka,
Paul E . McKenney, Jann Horn, linux-mm, linux-kernel, Shuah Khan,
linux-kselftest, linux-api, John Hubbard, Juan Yescas
On Thu, Feb 20, 2025 at 8:22 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Thu, Feb 20, 2025 at 5:18 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Thu, Feb 20, 2025 at 01:44:20PM +0100, David Hildenbrand wrote:
> > > On 20.02.25 11:15, Lorenzo Stoakes wrote:
> > > > On Thu, Feb 20, 2025 at 11:03:02AM +0100, David Hildenbrand wrote:
> > > > > > > Your conclusion is 'did not participate with upstream'; I don't agree with
> > > > > > > that. But maybe you and Kalesh have a history on that that let's you react
> > > > > > > on his questions IMHO more emotionally than it should have been.
> > > > > >
> > > > > > This is wholly unfair, I have been very reasonable in response to this
> > > > > > thread. I have offered to find solutions, I have tried to understand the
> > > > > > problem in spite of having gone to great lengths to try to discuss the
> > > > > > limitations of the proposed approach in every venue I possibly could.
> > > > > >
> > > > > > I go out of my way to deal professionally and objectively with what is
> > > > > > presented. Nothing here is emotional. So I'd ask that you please abstain
> > > > > > from making commentary like this which has no basis.
> > > > >
> > > > > I appreciate everything you write below. But this request is just
> > > > > impossible. I will keep raising my opinion and misunderstandings will
> > > > > happen.
> > > >
> > > > Well I wouldn't ask you not to express your opinion David, you know I respect
> > > > and like you, and by all means push back hard or call out what you think is bad
> > > > behaviour :)
> > > >
> > > > I just meant to say, in my view, that there was no basis, but I appreciate
> > > > miscommunications happen.
> > > >
> > > > So apologies if I came off as being difficult or rude, it actually wasn't
> > > > intended. And to re-emphasise - I have zero personal issue with anybody in this
> > > > thread whatsoever!
> > >
> > > It sounded to me like you were trying to defend your work (again, IMHO too
> > > emotionally, and I might have completely misinterpreted that) and slowly
> > > switching to "friendly fire" (towards me). Apologies from my side if I
> > > completely misunderstood/misinterpreted that.
> >
> > Right this was not at all my intent, sorry if it seemed that way. I may well
> > have communicated terribly, so apologies on my side too.
Hi everyone,
Thank you for all the discussion.
I don't find any personal issues with the communication in this
thread, but I appreciate David being the objective voice of reason.
I understand it can be frustrating since you have made many efforts to
communicate these tradeoffs. Unfortunately, these issues were not known
for the file-backed ELF guard regions in my particular use case.
>
> Sorry for being late to the party. Was sick for a couple of days.
> Lorenzo is right, there was a breakdown in communication at Google and
> he has every right to be upset. The issue with obfuscators should
> have been communicated once it was discovered. I was in regular
> discussions with Lorenzo but wasn't directly involved with this
> particular project and wasn't aware or did not realize that the
> obfuscator issue renders guards unusable for this usecase. My
> apologies, I should have asked more questions about it. I suspect
> Lorenzo would have implemented this anyway...
>
Suren's use case is different from mine and this design fits perfectly
for anon guard regions from the allocator. :)
So I think in conclusion, these aren't VMAs and shouldn't be treated
as such; we will advertise them from pagemap for those who need to
know.
-- Kalesh
> To make guard regions work for this usecase, first we (Android) need
> to abstract /proc/pid/maps accesses. Only then can we use additional
> interfaces like /proc/pid/pagemaps to obtain guard region information.
> I'll start figuring out what it takes to insert such an abstraction.
> Thanks,
> Suren.
>
>
> >
> > >
> > > To recap: what we have upstream is great; you did a great job. Yes, the
> > > mechanism has its drawbacks, but that's just part of the design.
> >
> > Thanks :)
> >
> > >
> > > Some people maybe have wrong expectations, maybe there were
> > > misunderstandings, or maybe there are requirements that only now pop up;
> > > it's sometimes unavoidable, and that's ok.
> > >
> > > We can try to document it better (and I was trying to find clues why people
> > > might be mislead), and see if/how we could sort out these requirements. But
> > > we can likely not make it perfect in any possible way (I'm sure there are
> > > plenty of use cases where what we currently have is more than sufficient).
> >
> > Sure and I'm very open to adding a documentation page for guard regions, in
> > fact was considering this very thing recently. I already added man pages
> > but be good to be able to go into more depth.
> >
> > >
> > > > I just want to find the best way forward, technically and am willing to do
> > > > whatever work is required to make the guard region implementation as good as it
> > > > possibly can be.
> > > >
> > > > >
> > > > > Note that the whole "Honestly David you and the naming. .." thing could have
> > > > > been written as "I don't think it's a naming problem."
> > > >
> > > > I feel like I _always_ get in trouble when I try to write in a 'tongue-in-cheek'
> > > > style, which is what this was meant to be... so I think herein lies the basis of
> > > > the miscommunication :)
> > > >
> > > > I apologise, the household is ill, which maybe affects my judgment in how I
> > > > write these, but in general text is a very poor medium. It was meant to be said
> > > > in a jolly tone with a wink...
> > > >
> > > > I think maybe I should learn my lesson with these things, I thought the ':p'
> > > > would make this clear but yeah, text, poor medium.
> > > >
> > > > Anyway apologies if this seemed disrespectful.
> > >
> > > No worries, it's hard to really make me angry, and I appreciate your
> > > openness and your apology (well, and you and your work, obviously).
> > >
> > > I'll note, though, if my memory serves me right, that nobody so far ever
> > > criticized the way I communicate upstream, or even told me to abstain from
> > > certain communication.
> >
> > I wish I could say the same haha, so perhaps this was a problem on my side
> > honestly. I do have a habit of being 'tongue in cheek' and failing to
> > communicate that which I did say the last time I wouldn't repeat. It is not
> > intended, I promise.
> >
> > As the abstain, was more a British turn of phrase, meaning to say - I
> > dispute the claim that this is an emotional thing and please don't say this
> > if it isn't so.
> >
> > But I understand that of course, you may have interpreted it as so, due to
> > my having failed to communicate it well.
> >
> > Again, I must say, text remains replete with possibilities for
> > miscommunication, misunderstanding and it can so often be difficult to
> > communicate one's intent.
> >
> > But again of course, I apologise if I overstepped the line in any way!
> >
> > >
> > > That probably hurt most, now that a couple of hours passed. Nothing that a
> > > couple of beers and a bit of self-reflection on my communication style can't
> > > fix ;)
> >
> > Ugh sorry, man. Not my intent. And it seems - I literally OWE YOU pints
> > now. :) we will fix this at lsf...
> >
> > Perhaps owe Kalesh some too should he be there... will budget
> > accordingly... :P
> >
> > >
> > > [...]
> > >
> > > > > > > > Yeah that's a good point, but honestly if you're reading smaps that reads
> > > > > > > > the page tables, then reading /proc/$pid/pagemaps and reading page tables
> > > > > > > > TWICE that seems inefficient vs. just reading /proc/$pid/maps, then reading
> > > > > > > > /proc/$pid/pagemaps and reading page tables once.
> > > > > > >
> > > > > > > Right; I recently wished that we would have an interface to obtain more VMA
> > > > > > > flags without having to go through smaps
> > > > > >
> > > > > > Well maybe that lends itself to the idea of adding a whole new interface in
> > > > > > general...
> > > > >
> > > > > An extended "maps" interface might be reasonable, that allows for exposing
> > > > > more things without walking the page tables. (e.g., flags)
> > > > >
> > > > > Maybe one could have an indicator that says "ever had guard regions in this
> > > > > mapping" without actually walking the page tables.
> > > >
> > > > Yeah this is something we've discussed before, but it's a little fraught. Let's
> > > > say it was a VMA flag, in this case we'd have to make this flag 'sticky' and not
> > > > impact merging (easy enough) to account for splits/merges.
> > > > The problem comes in that we would then need to acquire the VMA write
> > > > lock to do it, something we don't currently require on application of
> > > > guard regions.
> > >
> > > Right, and we shouldn't write-lock the mmap. We'd need some way to just
> > > atomically set such an indicator on a VMA.
> >
> > Hm yeah, could be tricky, we definitely can't manage a new field in
> > vm_area_struct, this is a very sensitive subject at the moment really with
> > Suren's work with VMAs allocated via SLAB_TYPESAFE_BY_RCU, putting the lock
> > into the VMA and the alignment requirements.
> >
> > Not sure what precedent we'd have with atomic setting of a VMA flag for
> > this... could be tricky.
> >
> > >
> > > I'll also note that it might be helpful for smallish region, but especially
> > > for large ones (including when they are split and the indicator is wrong),
> > > it's less helpful. I don't have to tell you about the VMA merging
> > > implications, probably it would be like VM_SOFTDIRTY handling :)
> >
> > Yeah indeed now we've simplified merging a lot of possibilities emerge,
> > this is one!
> >
> > >
> > > >
> > > > We'd also have to make sure nothing else makes any assumptions about VMA flags
> > > > implying differences in VMAs in this one instance (though we do already do this
> > > > for VM_SOFTDIRTY).
> > > >
> > > > I saw this as possibly something like VM_MAYBE_GUARD_REGIONS or something.
> > >
> > > Yes.
> > >
> > > --
> > > Cheers,
> > >
> > > David / dhildenb
> > >
> >
> > Best, Lorenzo
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH 0/4] mm: permit guard regions for file-backed/shmem mappings
2025-02-20 18:08 ` Kalesh Singh
@ 2025-02-21 11:04 ` Lorenzo Stoakes
2025-02-21 17:24 ` Kalesh Singh
0 siblings, 1 reply; 64+ messages in thread
From: Lorenzo Stoakes @ 2025-02-21 11:04 UTC (permalink / raw)
To: Kalesh Singh
Cc: Suren Baghdasaryan, David Hildenbrand, Andrew Morton,
Liam R . Howlett, Matthew Wilcox, Vlastimil Babka,
Paul E . McKenney, Jann Horn, linux-mm, linux-kernel, Shuah Khan,
linux-kselftest, linux-api, John Hubbard, Juan Yescas
On Thu, Feb 20, 2025 at 10:08:36AM -0800, Kalesh Singh wrote:
> On Thu, Feb 20, 2025 at 8:22 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Thu, Feb 20, 2025 at 5:18 AM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > On Thu, Feb 20, 2025 at 01:44:20PM +0100, David Hildenbrand wrote:
> > > > On 20.02.25 11:15, Lorenzo Stoakes wrote:
> > > > > On Thu, Feb 20, 2025 at 11:03:02AM +0100, David Hildenbrand wrote:
> > > > > > > > Your conclusion is 'did not participate with upstream'; I don't agree with
> > > > > > > > that. But maybe you and Kalesh have a history on that that let's you react
> > > > > > > > on his questions IMHO more emotionally than it should have been.
> > > > > > >
> > > > > > > This is wholly unfair, I have been very reasonable in response to this
> > > > > > > thread. I have offered to find solutions, I have tried to understand the
> > > > > > > problem in spite of having gone to great lengths to try to discuss the
> > > > > > > limitations of the proposed approach in every venue I possibly could.
> > > > > > >
> > > > > > > I go out of my way to deal professionally and objectively with what is
> > > > > > > presented. Nothing here is emotional. So I'd ask that you please abstain
> > > > > > > from making commentary like this which has no basis.
> > > > > >
> > > > > > I appreciate everything you write below. But this request is just
> > > > > > impossible. I will keep raising my opinion and misunderstandings will
> > > > > > happen.
> > > > >
> > > > > Well I wouldn't ask you not to express your opinion David, you know I respect
> > > > > and like you, and by all means push back hard or call out what you think is bad
> > > > > behaviour :)
> > > > >
> > > > > I just meant to say, in my view, that there was no basis, but I appreciate
> > > > > miscommunications happen.
> > > > > > So apologies if I came off as being difficult or rude, it actually
> > > > wasn't
> > > > > intended. And to re-emphasise - I have zero personal issue with anybody in this
> > > > > thread whatsoever!
> > > >
> > > > It sounded to me like you were trying to defend your work (again, IMHO too
> > > > emotionally, and I might have completely misinterpreted that) and slowly
> > > > switching to "friendly fire" (towards me). Apologies from my side if I
> > > > completely misunderstood/misinterpreted that.
> > >
> > > Right this was not at all my intent, sorry if it seemed that way. I may well
> > > have communicated terribly, so apologies on my side too.
>
> Hi everyone,
>
> Thank you for all the discussion.
>
> I don't find any personal issues with the communication in this
> thread, but I appreciate David being the object voice of reason.
>
> I understand it can be frustrating since you have made many efforts to
> communicate these tradeoffs. Unfortunately these issues were not known
> for the file-backed ELF guard regions for my particular use case.
>
> >
> > Sorry for being late to the party. Was sick for a couple of days.
> > Lorenzo is right, there was a breakdown in communication at Google and
> > he has all the rights to be upset. The issue with obfuscators should
> > have been communicated once it was discovered. I was in regular
> > discussions with Lorenzo but wasn't directly involved with this
> > particular project and wasn't aware or did not realize that the
> > obfuscator issue renders guards unusable for this usecase. My
> > apologies, I should have asked more questions about it. I suspect
> > Lorenzo would have implemented this anyway...
> >
>
> Suren's use case is different from mine and this design fits perfectly
> for anon guard regions from the allocator. :)
>
> So I think in conclusion, these aren't VMAs and shouldn't be treated
> as such; we will advertise them from pagemap for those who need to
> know.
>
Thanks Kalesh, glad there were no issues here and we have found
constructive common ground! :)
It turns out implementing the pagemap side of things is _really_
straightforward, so I'll be sending a series for that shortly. Hopefully
this provides some basis for whichever use cases need this information, as
it is the best and least invasive place for this information at this stage.
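To give a flavour of what a consumer might do once that lands, below is a
rough userspace sketch of checking pagemap entries for guard markers. Note
that the bit position used is purely an assumption for illustration - the
real bit is whatever the series ends up defining, see
Documentation/admin-guide/mm/pagemap.rst once merged:

/*
 * Rough sketch only: scan /proc/self/pagemap for a guard-region marker.
 * PM_GUARD_REGION_BIT below is an assumed value for illustration.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define PM_GUARD_REGION_BIT 58 /* assumption - check pagemap.rst */

static int page_is_guard(int pagemap_fd, unsigned long vaddr, long page_size)
{
        uint64_t entry;
        off_t offset = (off_t)(vaddr / page_size) * sizeof(entry);

        /* One 64-bit entry per virtual page, indexed by vaddr / page size. */
        if (pread(pagemap_fd, &entry, sizeof(entry), offset) != sizeof(entry))
                return -1;

        return !!(entry & (1ULL << PM_GUARD_REGION_BIT));
}

int main(void)
{
        long page_size = sysconf(_SC_PAGESIZE);
        int fd = open("/proc/self/pagemap", O_RDONLY);
        int probe;

        if (fd < 0)
                return 1;

        /* Query an arbitrary address - a stack variable, purely as a demo. */
        printf("guard marker? %d\n",
               page_is_guard(fd, (unsigned long)&probe, page_size));
        close(fd);
        return 0;
}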
Cheers, Lorenzo
> -- Kalesh
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH 0/4] mm: permit guard regions for file-backed/shmem mappings
2025-02-21 11:04 ` Lorenzo Stoakes
@ 2025-02-21 17:24 ` Kalesh Singh
0 siblings, 0 replies; 64+ messages in thread
From: Kalesh Singh @ 2025-02-21 17:24 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Suren Baghdasaryan, David Hildenbrand, Andrew Morton,
Liam R . Howlett, Matthew Wilcox, Vlastimil Babka,
Paul E . McKenney, Jann Horn, linux-mm, linux-kernel, Shuah Khan,
linux-kselftest, linux-api, John Hubbard, Juan Yescas
On Fri, Feb 21, 2025 at 3:04 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Thu, Feb 20, 2025 at 10:08:36AM -0800, Kalesh Singh wrote:
> > On Thu, Feb 20, 2025 at 8:22 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > On Thu, Feb 20, 2025 at 5:18 AM Lorenzo Stoakes
> > > <lorenzo.stoakes@oracle.com> wrote:
> > > >
> > > > On Thu, Feb 20, 2025 at 01:44:20PM +0100, David Hildenbrand wrote:
> > > > > On 20.02.25 11:15, Lorenzo Stoakes wrote:
> > > > > > On Thu, Feb 20, 2025 at 11:03:02AM +0100, David Hildenbrand wrote:
> > > > > > > > > Your conclusion is 'did not participate with upstream'; I don't agree with
> > > > > > > > > that. But maybe you and Kalesh have a history on that that let's you react
> > > > > > > > > on his questions IMHO more emotionally than it should have been.
> > > > > > > >
> > > > > > > > This is wholly unfair, I have been very reasonable in response to this
> > > > > > > > thread. I have offered to find solutions, I have tried to understand the
> > > > > > > > problem in spite of having gone to great lengths to try to discuss the
> > > > > > > > limitations of the proposed approach in every venue I possibly could.
> > > > > > > >
> > > > > > > > I go out of my way to deal professionally and objectively with what is
> > > > > > > > presented. Nothing here is emotional. So I'd ask that you please abstain
> > > > > > > > from making commentary like this which has no basis.
> > > > > > >
> > > > > > > I appreciate everything you write below. But this request is just
> > > > > > > impossible. I will keep raising my opinion and misunderstandings will
> > > > > > > happen.
> > > > > >
> > > > > > Well I wouldn't ask you not to express your opinion David, you know I respect
> > > > > > and like you, and by all means push back hard or call out what you think is bad
> > > > > > behaviour :)
> > > > > >
> > > > > > I just meant to say, in my view, that there was no basis, but I appreciate
> > > > > > miscommunications happen.
> > > > > > > So apologies if I came off as being difficult or rude, it actually
> > > > > wasn't
> > > > > > intended. And to re-emphasise - I have zero personal issue with anybody in this
> > > > > > thread whatsoever!
> > > > >
> > > > > It sounded to me like you were trying to defend your work (again, IMHO too
> > > > > emotionally, and I might have completely misinterpreted that) and slowly
> > > > > switching to "friendly fire" (towards me). Apologies from my side if I
> > > > > completely misunderstood/misinterpreted that.
> > > >
> > > > Right this was not at all my intent, sorry if it seemed that way. I may well
> > > > have communicated terribly, so apologies on my side too.
> >
> > Hi everyone,
> >
> > Thank you for all the discussion.
> >
> > I don't find any personal issues with the communication in this
> > thread, but I appreciate David being the object voice of reason.
> >
> > I understand it can be frustrating since you have made many efforts to
> > communicate these tradeoffs. Unfortunately these issues were not known
> > for the file-backed ELF guard regions for my particular use case.
> >
> > >
> > > Sorry for being late to the party. Was sick for a couple of days.
> > > Lorenzo is right, there was a breakdown in communication at Google and
> > > he has all the rights to be upset. The issue with obfuscators should
> > > have been communicated once it was discovered. I was in regular
> > > discussions with Lorenzo but wasn't directly involved with this
> > > particular project and wasn't aware or did not realize that the
> > > obfuscator issue renders guards unusable for this usecase. My
> > > apologies, I should have asked more questions about it. I suspect
> > > Lorenzo would have implemented this anyway...
> > >
> >
> > Suren's use case is different from mine and this design fits perfectly
> > for anon guard regions from the allocator. :)
> >
> > So I think in conclusion, these aren't VMAs and shouldn't be treated
> > as such; we will advertise them from pagemap for those who need to
> > know.
> >
>
> Thanks Kalesh, glad there were no issues here and we have found
> constructive common ground! :)
>
> It turns out implementing the pagemap side of things is _really_
> straightforward, so I'll be sending a series for that shortly. Hopefully
> this provides some basis for whichever use cases need this information, as
> it is the best and least invasive place for this information at this stage.
Hi Lorenzo,
Reviewed your patches, agreed that is the cleanest way to advertise
this information.
Thanks,
Kalesh
>
> Cheers, Lorenzo
>
> > -- Kalesh
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH 1/4] mm: allow guard regions in file-backed and read-only mappings
2025-02-13 18:17 ` [PATCH 1/4] mm: allow guard regions in file-backed and read-only mappings Lorenzo Stoakes
2025-02-18 14:15 ` Vlastimil Babka
2025-02-18 16:01 ` David Hildenbrand
@ 2025-02-24 14:02 ` Lorenzo Stoakes
2 siblings, 0 replies; 64+ messages in thread
From: Lorenzo Stoakes @ 2025-02-24 14:02 UTC (permalink / raw)
To: Andrew Morton
Cc: Suren Baghdasaryan, Liam R . Howlett, Matthew Wilcox,
Vlastimil Babka, Paul E . McKenney, Jann Horn, David Hildenbrand,
linux-mm, linux-kernel, Shuah Khan, linux-kselftest, linux-api,
John Hubbard, Juan Yescas, Kalesh Singh
One thing to note with this series is that file-backed VMAs in which guard
regions are installed will now have an anon_vma installed if not already
present (i.e. if not post-CoW MAP_PRIVATE).
I have audited the kernel source for instances of vma->anon_vma checks and
found no instance where this would be problematic for pure file-backed mappings.
I also discussed (off-list) with Matthew who confirmed he can't see any
issue with this.
In effect, we treat these VMAs as if they are MAP_PRIVATE, only with 0
CoW'd pages. As a result, the rmap never has a reason to reference the
anon_vma from folios at any point and thus no unexpected or weird behaviour
results.
The anon_vma logic tries to avoid unnecessary anon_vma propagation on fork
so we ought to at least minimise overhead.
However, this is still overhead, and unwelcome overhead.
The whole reason we do this (in madvise_guard_install()) is to ensure that
fork _copies page tables_. Otherwise, in vma_needs_copy(), nothing will
indicate that we should do so.
This was already an unpleasant thing to have to do, but without a new VMA
flag we really have no reasonable means of ensuring this happens.
Going forward, I intend to add a new VMA flag, VM_MAYBE_GUARDED or
something like this.
This would have specific behaviour - for the purposes of merging, it would
be ignored. However on both split and merge, it will be propagated. It is
therefore 'sticky'.
This is to avoid having to traverse page tables to determine which parts of
a VMA contain guard regions and of course to maintain the desirable
qualities of guard regions - the lack of VMA propagation (+ thus slab
allocations of VMAs).
Adding this flag and adjusting vma_needs_copy() to reference it would
resolve the issue.
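For reference, the fork-time check is roughly of the following shape (a
simplified sketch of mm/memory.c, not a verbatim copy); the final comment
marks where a sticky flag could slot in and remove the need for the
anon_vma trick:

/* Simplified sketch of the fork-time decision - not verbatim kernel code. */
static bool vma_needs_copy(struct vm_area_struct *dst_vma,
                           struct vm_area_struct *src_vma)
{
        /* uffd-wp markers in the page tables must survive fork. */
        if (userfaultfd_wp(dst_vma))
                return true;

        /* PFN/mixed maps cannot simply be re-faulted, so copy up front. */
        if (src_vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
                return true;

        /*
         * Anonymous memory present in the mapping implies the page tables
         * must be copied - installing an anon_vma on guard install
         * piggybacks on this.
         */
        if (src_vma->anon_vma)
                return true;

        /*
         * A hypothetical sticky VM_MAYBE_GUARDED flag would be checked
         * here, removing the need to install an anon_vma purely to force
         * the copy.
         */
        return false;
}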
However :) we have a VMA flag space issue, so it'd render this a 64-bit
feature only.
Having discussed with Matthew a plan by which we might extend the available
flags on 32-bit, we may be able to avoid this going forward. But this may
be a longer-term project.
In the meantime, we'd have to resort to the anon_vma hack for 32-bit, using
the flag for 64-bit. The issue with this, however, is that if we then intend
to allow the flag to enable /proc/$pid/maps visibility (something this could
allow), it would also end up being 64-bit only, which would be a pity.
Regardless - I wanted to highlight this behaviour as it is perhaps somewhat
surprising.
On Thu, Feb 13, 2025 at 06:17:00PM +0000, Lorenzo Stoakes wrote:
> There is no reason to disallow guard regions in file-backed mappings -
> readahead and fault-around both function correctly in the presence of PTE
> markers, equally other operations relating to memory-mapped files function
> correctly.
>
> Additionally, introducing guard regions into read-only mappings only
> restricts the mappings further, which means there is no violation of any
> access rights by permitting this to be so.
>
> Removing this restriction allows guard regions to be used correctly with
> read-only mapped files (such as executable files), which would otherwise
> not be permitted.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
> mm/madvise.c | 8 +-------
> 1 file changed, 1 insertion(+), 7 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 6ecead476a80..e01e93e179a8 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1051,13 +1051,7 @@ static bool is_valid_guard_vma(struct vm_area_struct *vma, bool allow_locked)
> if (!allow_locked)
> disallowed |= VM_LOCKED;
>
> - if (!vma_is_anonymous(vma))
> - return false;
> -
> - if ((vma->vm_flags & (VM_MAYWRITE | disallowed)) != VM_MAYWRITE)
> - return false;
> -
> - return true;
> + return !(vma->vm_flags & disallowed);
> }
>
> static bool is_guard_pte_marker(pte_t ptent)
> --
> 2.48.1
>
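To make concrete what the hunk above permits, a minimal userspace sketch of
installing and removing a guard region within a read-only, file-backed
mapping. The numeric fallbacks are assumptions taken from recent kernel
headers - prefer the uapi definitions where available:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_GUARD_INSTALL
#define MADV_GUARD_INSTALL 102 /* assumed value - see uapi mman-common.h */
#endif
#ifndef MADV_GUARD_REMOVE
#define MADV_GUARD_REMOVE 103 /* assumed value - see uapi mman-common.h */
#endif

int main(void)
{
        long page_size = sysconf(_SC_PAGESIZE);
        int fd = open("/bin/true", O_RDONLY); /* any read-only file will do */
        char *ptr;

        if (fd < 0)
                return 1;

        ptr = mmap(NULL, 4 * page_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (ptr == MAP_FAILED)
                return 1;

        /* Rejected for file-backed/read-only VMAs before this patch. */
        if (madvise(ptr + page_size, page_size, MADV_GUARD_INSTALL))
                perror("MADV_GUARD_INSTALL");

        /* Removal simply reinstates the VMA's existing attributes. */
        if (madvise(ptr + page_size, page_size, MADV_GUARD_REMOVE))
                perror("MADV_GUARD_REMOVE");

        munmap(ptr, 4 * page_size);
        close(fd);
        return 0;
}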
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH 0/4] mm: permit guard regions for file-backed/shmem mappings
2025-02-18 17:28 ` Lorenzo Stoakes
2025-02-18 17:31 ` David Hildenbrand
@ 2025-02-25 15:54 ` Vlastimil Babka
2025-02-25 16:31 ` David Hildenbrand
2025-02-25 16:37 ` Lorenzo Stoakes
1 sibling, 2 replies; 64+ messages in thread
From: Vlastimil Babka @ 2025-02-25 15:54 UTC (permalink / raw)
To: Lorenzo Stoakes, David Hildenbrand
Cc: Andrew Morton, Suren Baghdasaryan, Liam R . Howlett,
Matthew Wilcox, Paul E . McKenney, Jann Horn, linux-mm,
linux-kernel, Shuah Khan, linux-kselftest, linux-api,
John Hubbard, Juan Yescas, Kalesh Singh
On 2/18/25 18:28, Lorenzo Stoakes wrote:
> On Tue, Feb 18, 2025 at 06:25:35PM +0100, David Hildenbrand wrote:
>>
>> > > >
>> > > > It fails because it tries to 'touch' the memory, but 'touching' guard
>> > > > region memory causes a segfault. This kind of breaks the idea of
>> > > > mlock()'ing guard regions.
>> > > >
>> > > > I think adding workarounds to make this possible in any way is not really
>> > > > worth it (and would probably be pretty gross).
>> > > >
>> > > > We already document that 'mlock()ing lightweight guard regions will fail'
>> > > > as per man page so this is all in line with that.
>> > >
>> > > Right, and I claim that supporting VM_LOCKONFAULT might likely be as easy as
>> > > allowing install/remove of guard regions when that flag is set.
>> >
>> > We already allow this flag! VM_LOCKED and VM_HUGETLB are the only flags we
>> > disallow.
>>
>>
>> See mlock2();
>>
>> SYSCALL_DEFINE3(mlock2, unsigned long, start, size_t, len, int, flags)
>> {
>> vm_flags_t vm_flags = VM_LOCKED;
>>
>> if (flags & ~MLOCK_ONFAULT)
>> return -EINVAL;
>>
>> if (flags & MLOCK_ONFAULT)
>> vm_flags |= VM_LOCKONFAULT;
>>
>> return do_mlock(start, len, vm_flags);
>> }
>>
>>
>> VM_LOCKONFAULT always has VM_LOCKED set as well.
>
> OK cool, that makes sense.
>
> As with much kernel stuff, I knew this in the past. Then I forgot. Then I knew
> again, then... :P if only somebody would write it down in a book...
>
> Yeah then that makes sense to check explicitly for (VM_LOCKED | VM_LOCKONFAULT)
> in any MADV_GUARD_INSTALL_LOCKED variant as obviously this would be passively
> excluded right now.
Sorry for the late reply. So AFAIU from your conversations, guards can't be
compatible with VM_LOCKED, which means e.g. any attempts of glibc to use
guards for stacks will soon discover that mlockall() users exist and are
broken by this, and the attempts will fail? That's a bummer.
As for compatibility with VM_LOCKONFAULT, do we need a new
MADV_GUARD_INSTALL_LOCKED or can we say MADV_GUARD_INSTALL is new enough
that it can be just retrofitted (like you retrofit file backed mappings)?
AFAIU the only risk would be breaking somebody that already relies on a
failure for VM_LOCKONFAULT, and it's unlikely there's such a somebody now.
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH 0/4] mm: permit guard regions for file-backed/shmem mappings
2025-02-25 15:54 ` Vlastimil Babka
@ 2025-02-25 16:31 ` David Hildenbrand
2025-02-25 16:37 ` Lorenzo Stoakes
1 sibling, 0 replies; 64+ messages in thread
From: David Hildenbrand @ 2025-02-25 16:31 UTC (permalink / raw)
To: Vlastimil Babka, Lorenzo Stoakes
Cc: Andrew Morton, Suren Baghdasaryan, Liam R . Howlett,
Matthew Wilcox, Paul E . McKenney, Jann Horn, linux-mm,
linux-kernel, Shuah Khan, linux-kselftest, linux-api,
John Hubbard, Juan Yescas, Kalesh Singh
On 25.02.25 16:54, Vlastimil Babka wrote:
> On 2/18/25 18:28, Lorenzo Stoakes wrote:
>> On Tue, Feb 18, 2025 at 06:25:35PM +0100, David Hildenbrand wrote:
>>>
>>>>>>
>>>>>> It fails because it tries to 'touch' the memory, but 'touching' guard
>>>>>> region memory causes a segfault. This kind of breaks the idea of
>>>>>> mlock()'ing guard regions.
>>>>>>
>>>>>> I think adding workarounds to make this possible in any way is not really
>>>>>> worth it (and would probably be pretty gross).
>>>>>>
>>>>>> We already document that 'mlock()ing lightweight guard regions will fail'
>>>>>> as per man page so this is all in line with that.
>>>>>
>>>>> Right, and I claim that supporting VM_LOCKONFAULT might likely be as easy as
>>>>> allowing install/remove of guard regions when that flag is set.
>>>>
>>>> We already allow this flag! VM_LOCKED and VM_HUGETLB are the only flags we
>>>> disallow.
>>>
>>>
>>> See mlock2();
>>>
>>> SYSCALL_DEFINE3(mlock2, unsigned long, start, size_t, len, int, flags)
>>> {
>>> vm_flags_t vm_flags = VM_LOCKED;
>>>
>>> if (flags & ~MLOCK_ONFAULT)
>>> return -EINVAL;
>>>
>>> if (flags & MLOCK_ONFAULT)
>>> vm_flags |= VM_LOCKONFAULT;
>>>
>>> return do_mlock(start, len, vm_flags);
>>> }
>>>
>>>
>>> VM_LOCKONFAULT always as VM_LOCKED set as well.
>>
>> OK cool, that makes sense.
>>
>> As with much kernel stuff, I knew this in the past. Then I forgot. Then I knew
>> again, then... :P if only somebody would write it down in a book...
>>
>> Yeah then that makes sense to check explicitly for (VM_LOCKED | VM_LOCKONFAULT)
>> in any MADV_GUARD_INSTALL_LOCKED variant as obviously this would be passively
>> excluded right now.
>
> Sorry for the late reply. So AFAIU from your conversations, guards can't be
> compatible with VM_LOCKED, which means e.g. any attempts of glibc to use
> guards for stacks will soon discover that mlockall() users exist and are
> broken by this, and the attempts will fail? That's a bummer.
>
> As for compatibility with VM_LOCKONFAULT, do we need a new
> MADV_GUARD_INSTALL_LOCKED or can we say MADV_GUARD_INSTALL is new enough
> that it can be just retrofitted (like you retrofit file backed mappings)?
> AFAIU the only risk would be breaking somebody that already relies on a
> failure for VM_LOCKONFAULT, and it's unlikely there's such a somebody now.
Exactly my thinking, which I didn't have time to phrase during that
conversation. IMHO, we were careful with MADV_DONTNEED because it was
... around for a little bit longer :)
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH 0/4] mm: permit guard regions for file-backed/shmem mappings
2025-02-25 15:54 ` Vlastimil Babka
2025-02-25 16:31 ` David Hildenbrand
@ 2025-02-25 16:37 ` Lorenzo Stoakes
2025-02-25 16:48 ` David Hildenbrand
1 sibling, 1 reply; 64+ messages in thread
From: Lorenzo Stoakes @ 2025-02-25 16:37 UTC (permalink / raw)
To: Vlastimil Babka
Cc: David Hildenbrand, Andrew Morton, Suren Baghdasaryan,
Liam R . Howlett, Matthew Wilcox, Paul E . McKenney, Jann Horn,
linux-mm, linux-kernel, Shuah Khan, linux-kselftest, linux-api,
John Hubbard, Juan Yescas, Kalesh Singh
On Tue, Feb 25, 2025 at 04:54:22PM +0100, Vlastimil Babka wrote:
> On 2/18/25 18:28, Lorenzo Stoakes wrote:
> > On Tue, Feb 18, 2025 at 06:25:35PM +0100, David Hildenbrand wrote:
> >>
> >> > > >
> >> > > > It fails because it tries to 'touch' the memory, but 'touching' guard
> >> > > > region memory causes a segfault. This kind of breaks the idea of
> >> > > > mlock()'ing guard regions.
> >> > > >
> >> > > > I think adding workarounds to make this possible in any way is not really
> >> > > > worth it (and would probably be pretty gross).
> >> > > >
> >> > > > We already document that 'mlock()ing lightweight guard regions will fail'
> >> > > > as per man page so this is all in line with that.
> >> > >
> >> > > Right, and I claim that supporting VM_LOCKONFAULT might likely be as easy as
> >> > > allowing install/remove of guard regions when that flag is set.
> >> >
> >> > We already allow this flag! VM_LOCKED and VM_HUGETLB are the only flags we
> >> > disallow.
> >>
> >>
> >> See mlock2();
> >>
> >> SYSCALL_DEFINE3(mlock2, unsigned long, start, size_t, len, int, flags)
> >> {
> >> vm_flags_t vm_flags = VM_LOCKED;
> >>
> >> if (flags & ~MLOCK_ONFAULT)
> >> return -EINVAL;
> >>
> >> if (flags & MLOCK_ONFAULT)
> >> vm_flags |= VM_LOCKONFAULT;
> >>
> >> return do_mlock(start, len, vm_flags);
> >> }
> >>
> >>
> >> VM_LOCKONFAULT always as VM_LOCKED set as well.
> >
> > OK cool, that makes sense.
> >
> > As with much kernel stuff, I knew this in the past. Then I forgot. Then I knew
> > again, then... :P if only somebody would write it down in a book...
> >
> > Yeah then that makes sense to check explicitly for (VM_LOCKED | VM_LOCKONFAULT)
> > in any MADV_GUARD_INSTALL_LOCKED variant as obviously this would be passively
> > excluded right now.
>
> Sorry for the late reply. So AFAIU from your conversations, guards can't be
> compatible with VM_LOCKED, which means e.g. any attempts of glibc to use
> guards for stacks will soon discover that mlockall() users exist and are
> broken by this, and the attempts will fail? That's a bummer.
>
Yeah damn, this pushes up the priority on this.
Yeah unfortunately we cannot support this with guard regions being installed
_after_ the mlockall() but can for before.
Let me write this on my eternal whiteboard of doom (TM), because that ups the
priority on this. I want to have a good think and see if it might after all be
possible to find a way to make things work here for the sake of this case.
This thing is already shipped now, so it is inevitably going to be an add-on.
I will try some experiments when I get a sec.
Thanks very much for bringing this point up! This is pretty key.
> As for compatibility with VM_LOCKONFAULT, do we need a new
> MADV_GUARD_INSTALL_LOCKED or can we say MADV_GUARD_INSTALL is new enough
> that it can be just retrofitted (like you retrofit file backed mappings)?
> AFAIU the only risk would be breaking somebody that already relies on a
> failure for VM_LOCKONFAULT, and it's unlikely there's such a somebody now.
>
>
Hmm yeah I suppose. I guess just to be consistent with the other _LOCKED
variants? (which seem to be... undocumented at least in man pages :P, and yes I
realise this is me semi-volunteering to do that obviously...).
But on the other hand, we could also expand this if you (and, I see, also
Dave) feel this makes sense and wouldn't be confusing.
Agreed entirely that it'd be very very odd for a user to rely on that so I think
we'll be fine.
I shall return to this topic later, in the form of a series, probably!
Cheers!
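P.S. to make the shape of that concrete, a purely hypothetical sketch (not
upstream code) of how the locking check might be relaxed for
VM_LOCKONFAULT, whether via a _LOCKED variant or a straight retrofit:

/* Hypothetical sketch only - not upstream code. */
static bool guard_region_locking_ok(struct vm_area_struct *vma,
                                    bool allow_locked)
{
        /* Explicitly allowed, e.g. by a _LOCKED madvise variant. */
        if (allow_locked)
                return true;

        /*
         * Lock-on-fault never requires up-front residency, so guard
         * markers do not conflict with it, even though VM_LOCKED is
         * also set on such VMAs.
         */
        if (vma->vm_flags & VM_LOCKONFAULT)
                return true;

        /* Plain VM_LOCKED demands resident pages, which guards prevent. */
        return !(vma->vm_flags & VM_LOCKED);
}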
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH 0/4] mm: permit guard regions for file-backed/shmem mappings
2025-02-25 16:37 ` Lorenzo Stoakes
@ 2025-02-25 16:48 ` David Hildenbrand
0 siblings, 0 replies; 64+ messages in thread
From: David Hildenbrand @ 2025-02-25 16:48 UTC (permalink / raw)
To: Lorenzo Stoakes, Vlastimil Babka
Cc: Andrew Morton, Suren Baghdasaryan, Liam R . Howlett,
Matthew Wilcox, Paul E . McKenney, Jann Horn, linux-mm,
linux-kernel, Shuah Khan, linux-kselftest, linux-api,
John Hubbard, Juan Yescas, Kalesh Singh
>> As for compatibility with VM_LOCKONFAULT, do we need a new
>> MADV_GUARD_INSTALL_LOCKED or can we say MADV_GUARD_INSTALL is new enough
>> that it can be just retrofitted (like you retrofit file backed mappings)?
>> AFAIU the only risk would be breaking somebody that already relies on a
>> failure for VM_LOCKONFAULT, and it's unlikely there's such a somebody now.
>>
>>
>
> Hmm yeah I suppose. I guess just to be consistent with the other _LOCKED
> variants? (which seem to be... undocumented at least in man pages :P, and yes I
> realise this is me semi-volunteering to do that obviously...).
>
> But on the other hand, we could also expand this if you and I see also Dave feel
> this makes sense and wouldn't be confusing.
Just my 2 cents: one thing that came to mind: an existing library would
have to be updated to use the _LOCKED variant if the app were using
mlockall(MCL_FUTURE), which is a bit unfortunate -- and if it could be
avoided, it would be great.
But yeah, devil is in the detail ...
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH 2/4] selftests/mm: rename guard-pages to guard-regions
2025-02-13 18:17 ` [PATCH 2/4] selftests/mm: rename guard-pages to guard-regions Lorenzo Stoakes
2025-02-18 14:15 ` Vlastimil Babka
@ 2025-03-02 8:35 ` Lorenzo Stoakes
1 sibling, 0 replies; 64+ messages in thread
From: Lorenzo Stoakes @ 2025-03-02 8:35 UTC (permalink / raw)
To: Andrew Morton
Cc: Suren Baghdasaryan, Liam R . Howlett, Matthew Wilcox,
Vlastimil Babka, Paul E . McKenney, Jann Horn, David Hildenbrand,
linux-mm, linux-kernel, Shuah Khan, linux-kselftest, linux-api,
John Hubbard, Juan Yescas, Kalesh Singh
On Thu, Feb 13, 2025 at 06:17:01PM +0000, Lorenzo Stoakes wrote:
> The feature formerly referred to as guard pages is more correctly referred
> to as 'guard regions', as in fact no pages are ever allocated in the
> process of installing the regions.
>
> To avoid confusion, rename the tests accordingly.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Hi Andrew, apologies I realise I missed updating run_vmtests.sh here, mea culpa.
Quick fix-patch enclosed, thanks!
----8<----
From 1b9a2d1fad165f9b7dd327a9e369e16c87661e9b Mon Sep 17 00:00:00 2001
From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Date: Sun, 2 Mar 2025 08:31:57 +0000
Subject: [PATCH] tools/selftests: fix guard regions invocation
---
tools/testing/selftests/mm/run_vmtests.sh | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh
index fe3606de37f9..55d4453793eb 100755
--- a/tools/testing/selftests/mm/run_vmtests.sh
+++ b/tools/testing/selftests/mm/run_vmtests.sh
@@ -404,7 +404,7 @@ CATEGORY="mremap" run_test ./mremap_dontunmap
CATEGORY="hmm" run_test bash ./test_hmm.sh smoke
# MADV_GUARD_INSTALL and MADV_GUARD_REMOVE tests
-CATEGORY="madv_guard" run_test ./guard-pages
+CATEGORY="madv_guard" run_test ./guard-regions
# MADV_POPULATE_READ and MADV_POPULATE_WRITE tests
CATEGORY="madv_populate" run_test ./madv_populate
--
2.48.1
^ permalink raw reply related [flat|nested] 64+ messages in thread
* Re: [PATCH 3/4] tools/selftests: expand all guard region tests to file-backed
2025-02-13 18:17 ` [PATCH 3/4] tools/selftests: expand all guard region tests to file-backed Lorenzo Stoakes
2025-02-18 14:17 ` Vlastimil Babka
@ 2025-04-22 10:37 ` Ryan Roberts
2025-04-22 10:47 ` Lorenzo Stoakes
1 sibling, 1 reply; 64+ messages in thread
From: Ryan Roberts @ 2025-04-22 10:37 UTC (permalink / raw)
To: Lorenzo Stoakes, Andrew Morton
Cc: Suren Baghdasaryan, Liam R . Howlett, Matthew Wilcox,
Vlastimil Babka, Paul E . McKenney, Jann Horn, David Hildenbrand,
linux-mm, linux-kernel, Shuah Khan, linux-kselftest, linux-api,
John Hubbard, Juan Yescas, Kalesh Singh
On 13/02/2025 18:17, Lorenzo Stoakes wrote:
> Extend the guard region tests to allow for test fixture variants for anon,
> shmem, and local file files.
>
> This allows us to assert that each of the expected behaviours of anonymous
> memory also applies correctly to file-backed (both shmem and an a file
> created locally in the current working directory) and thus asserts the same
> correctness guarantees as all the remaining tests do.
>
> The fixture teardown is now performed in the parent process rather than
> child forked ones, meaning cleanup is always performed, including unlinking
> any generated temporary files.
>
> Additionally the variant fixture data type now contains an enum value
> indicating the type of backing store and the mmap() invocation is
> abstracted to allow for the mapping of whichever backing store the variant
> is testing.
>
> We adjust tests as necessary to account for the fact they may now reference
> files rather than anonymous memory.
Hi Lorenzo,
I'm getting a test failure in v6.15-rc3 on arm64:
----8<----
# RUN guard_regions.shmem.uffd ...
# guard-regions.c:1467:uffd:Expected ioctl(uffd, UFFDIO_REGISTER, &reg) (-1) ==
0 (0)
# uffd: Test terminated by assertion
# FAIL guard_regions.shmem.uffd
not ok 45 guard_regions.shmem.uffd
----8<----
The ioctl is returning EINVAL.
[...]
> @@ -1281,6 +1398,9 @@ TEST_F(guard_regions, uffd)
> struct uffdio_register reg;
> struct uffdio_range range;
>
> + if (!is_anon_backed(variant))
Perhaps this should be filtering out shmem too? Although the manual for
userfaultfd implies that shmem is supported:
"""
Up to Linux 4.11, userfaultfd can be used only with anonymous private memory
mappings. Since Linux 4.11, userfaultfd can be also used with hugetlbfs and
shared memory mappings.
"""
But I'm not sure if that's referring specifically to UFFDIO_REGISTER_MODE_MISSING?
Any ideas before I start debugging further?
Thanks,
Ryan
> + SKIP(return, "uffd only works on anon backing");
> +
> /* Set up uffd. */
> uffd = userfaultfd(0);
> if (uffd == -1 && errno == EPERM)
> @@ -1290,8 +1410,8 @@ TEST_F(guard_regions, uffd)
> ASSERT_EQ(ioctl(uffd, UFFDIO_API, &api), 0);
>
> /* Map 10 pages. */
> - ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
> - MAP_ANON | MAP_PRIVATE, -1, 0);
> + ptr = mmap_(self, variant, NULL, 10 * page_size,
> + PROT_READ | PROT_WRITE, 0, 0);
> ASSERT_NE(ptr, MAP_FAILED);
>
> /* Register the range with uffd. */
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH 3/4] tools/selftests: expand all guard region tests to file-backed
2025-04-22 10:37 ` Ryan Roberts
@ 2025-04-22 10:47 ` Lorenzo Stoakes
2025-04-22 11:03 ` Ryan Roberts
0 siblings, 1 reply; 64+ messages in thread
From: Lorenzo Stoakes @ 2025-04-22 10:47 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, Suren Baghdasaryan, Liam R . Howlett,
Matthew Wilcox, Vlastimil Babka, Paul E . McKenney, Jann Horn,
David Hildenbrand, linux-mm, linux-kernel, Shuah Khan,
linux-kselftest, linux-api, John Hubbard, Juan Yescas,
Kalesh Singh
On Tue, Apr 22, 2025 at 11:37:57AM +0100, Ryan Roberts wrote:
> On 13/02/2025 18:17, Lorenzo Stoakes wrote:
> > Extend the guard region tests to allow for test fixture variants for anon,
> > shmem, and local file files.
> >
> > This allows us to assert that each of the expected behaviours of anonymous
> > memory also applies correctly to file-backed (both shmem and an a file
> > created locally in the current working directory) and thus asserts the same
> > correctness guarantees as all the remaining tests do.
> >
> > The fixture teardown is now performed in the parent process rather than
> > child forked ones, meaning cleanup is always performed, including unlinking
> > any generated temporary files.
> >
> > Additionally the variant fixture data type now contains an enum value
> > indicating the type of backing store and the mmap() invocation is
> > abstracted to allow for the mapping of whichever backing store the variant
> > is testing.
> >
> > We adjust tests as necessary to account for the fact they may now reference
> > files rather than anonymous memory.
>
> Hi Lorenzo,
>
> I'm getting a test failure in v6.15-rc3 on arm64:
>
> ----8<----
> # RUN guard_regions.shmem.uffd ...
> # guard-regions.c:1467:uffd:Expected ioctl(uffd, UFFDIO_REGISTER, &reg) (-1) ==
> 0 (0)
> # uffd: Test terminated by assertion
> # FAIL guard_regions.shmem.uffd
> not ok 45 guard_regions.shmem.uffd
> ----8<----
>
> The ioctl is returning EINVAL.
Hm strange, that works fine <resists urge to say 'on my machine'> on x86-64. Is
userfaultfd enabled in your config, to ask a silly question?
It'd be odd for this to vary depending upon arch.
So a factor here is a _stupidity_ in the testing - does your system mount /tmp
as tmpfs or an actual file system? As the test code unconditionally assumes /tmp
is indeed going to get you a shmem file.
It's shameful to be honest. But actually I suspect this more than anything
else...
>
> [...]
>
> > @@ -1281,6 +1398,9 @@ TEST_F(guard_regions, uffd)
> > struct uffdio_register reg;
> > struct uffdio_range range;
> >
> > + if (!is_anon_backed(variant))
>
> Perhaps this should be filtering out shmem too? Although the manual for
> userfaultfd implies that shmem is supported:
Yeah it should work with it fine.
>
> """
> Up to Linux 4.11, userfaultfd can be used only with anonymous private memory
> mappings. Since Linux 4.11, userfaultfd can be also used with hugetlbfs and
> shared memory mappings.
> """
>
> But I'm not sure if that's referring specifically to UFFDIO_REGISTER_MODE_MISSING?
>
> Any ideas before I start debugging further?
>
> Thanks,
> Ryan
>
> > + SKIP(return, "uffd only works on anon backing");
> > +
> > /* Set up uffd. */
> > uffd = userfaultfd(0);
> > if (uffd == -1 && errno == EPERM)
> > @@ -1290,8 +1410,8 @@ TEST_F(guard_regions, uffd)
> > ASSERT_EQ(ioctl(uffd, UFFDIO_API, &api), 0);
> >
> > /* Map 10 pages. */
> > - ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
> > - MAP_ANON | MAP_PRIVATE, -1, 0);
> > + ptr = mmap_(self, variant, NULL, 10 * page_size,
> > + PROT_READ | PROT_WRITE, 0, 0);
> > ASSERT_NE(ptr, MAP_FAILED);
> >
> > /* Register the range with uffd. */
>
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH 3/4] tools/selftests: expand all guard region tests to file-backed
2025-04-22 10:47 ` Lorenzo Stoakes
@ 2025-04-22 11:03 ` Ryan Roberts
2025-04-22 11:07 ` Lorenzo Stoakes
0 siblings, 1 reply; 64+ messages in thread
From: Ryan Roberts @ 2025-04-22 11:03 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, Suren Baghdasaryan, Liam R . Howlett,
Matthew Wilcox, Vlastimil Babka, Paul E . McKenney, Jann Horn,
David Hildenbrand, linux-mm, linux-kernel, Shuah Khan,
linux-kselftest, linux-api, John Hubbard, Juan Yescas,
Kalesh Singh
On 22/04/2025 11:47, Lorenzo Stoakes wrote:
> On Tue, Apr 22, 2025 at 11:37:57AM +0100, Ryan Roberts wrote:
>> On 13/02/2025 18:17, Lorenzo Stoakes wrote:
>>> Extend the guard region tests to allow for test fixture variants for anon,
>>> shmem, and local file files.
>>>
>>> This allows us to assert that each of the expected behaviours of anonymous
>>> memory also applies correctly to file-backed (both shmem and an a file
>>> created locally in the current working directory) and thus asserts the same
>>> correctness guarantees as all the remaining tests do.
>>>
>>> The fixture teardown is now performed in the parent process rather than
>>> child forked ones, meaning cleanup is always performed, including unlinking
>>> any generated temporary files.
>>>
>>> Additionally the variant fixture data type now contains an enum value
>>> indicating the type of backing store and the mmap() invocation is
>>> abstracted to allow for the mapping of whichever backing store the variant
>>> is testing.
>>>
>>> We adjust tests as necessary to account for the fact they may now reference
>>> files rather than anonymous memory.
>>
>> Hi Lorenzo,
>>
>> I'm getting a test failure in v6.15-rc3 on arm64:
>>
>> ----8<----
>> # RUN guard_regions.shmem.uffd ...
>> # guard-regions.c:1467:uffd:Expected ioctl(uffd, UFFDIO_REGISTER, &reg) (-1) ==
>> 0 (0)
>> # uffd: Test terminated by assertion
>> # FAIL guard_regions.shmem.uffd
>> not ok 45 guard_regions.shmem.uffd
>> ----8<----
>>
>> The ioctl is returning EINVAL.
>
> Hm strange, that works fine <resists urge to say 'on my machine'> on x86-64. Is
> userfaultfd enabled in your config, to ask a silly question?
Yep, and the anon version of the test is passing, as are all the uffd tests.
>
> It'd be odd for this to vary depending upon arch.
>
> So a factor here is a _stupidity_ in the testing - does your system mount /tmp
> as tmpfs or an actual file system? As the test code unconditionally assumes /tmp
> is indeed going to get you a shmem file.
Ahh that's probably it. I'm on Ubuntu and it looks like /tmp is just a dir on
the rootfs (XFS in my case).
Forcing a tmpfs to /tmp solved it.
Looks like uffd-unit-tests (see shmem_allocate_area()) is just using memfd.
Would it be reasonable to take that approach? Or just use anon+shared via mmap?
Thanks,
Ryan
>
> It's shameful to be honest. But actually I suspect this more than anything
> else...
>
>>
>> [...]
>>
>>> @@ -1281,6 +1398,9 @@ TEST_F(guard_regions, uffd)
>>> struct uffdio_register reg;
>>> struct uffdio_range range;
>>>
>>> + if (!is_anon_backed(variant))
>>
>> Perhaps this should be filtering out shmem too? Although the manual for
>> userfaultfd implies that shmem is supported:
>
> Yeah it should work with it fine.
>
>>
>> """
>> Up to Linux 4.11, userfaultfd can be used only with anonymous private memory
>> mappings. Since Linux 4.11, userfaultfd can be also used with hugetlbfs and
>> shared memory mappings.
>> """
>>
>> But I'm not sure if that's referring specifically to UFFDIO_REGISTER_MODE_MISSING?
>>
>> Any ideas before I start debugging further?
>>
>> Thanks,
>> Ryan
>>
>>> + SKIP(return, "uffd only works on anon backing");
>>> +
>>> /* Set up uffd. */
>>> uffd = userfaultfd(0);
>>> if (uffd == -1 && errno == EPERM)
>>> @@ -1290,8 +1410,8 @@ TEST_F(guard_regions, uffd)
>>> ASSERT_EQ(ioctl(uffd, UFFDIO_API, &api), 0);
>>>
>>> /* Map 10 pages. */
>>> - ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
>>> - MAP_ANON | MAP_PRIVATE, -1, 0);
>>> + ptr = mmap_(self, variant, NULL, 10 * page_size,
>>> + PROT_READ | PROT_WRITE, 0, 0);
>>> ASSERT_NE(ptr, MAP_FAILED);
>>>
>>> /* Register the range with uffd. */
>>
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH 3/4] tools/selftests: expand all guard region tests to file-backed
2025-04-22 11:03 ` Ryan Roberts
@ 2025-04-22 11:07 ` Lorenzo Stoakes
2025-04-22 11:11 ` Ryan Roberts
0 siblings, 1 reply; 64+ messages in thread
From: Lorenzo Stoakes @ 2025-04-22 11:07 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, Suren Baghdasaryan, Liam R . Howlett,
Matthew Wilcox, Vlastimil Babka, Paul E . McKenney, Jann Horn,
David Hildenbrand, linux-mm, linux-kernel, Shuah Khan,
linux-kselftest, linux-api, John Hubbard, Juan Yescas,
Kalesh Singh
On Tue, Apr 22, 2025 at 12:03:08PM +0100, Ryan Roberts wrote:
> On 22/04/2025 11:47, Lorenzo Stoakes wrote:
> > On Tue, Apr 22, 2025 at 11:37:57AM +0100, Ryan Roberts wrote:
[snip]
> >>
> >> Hi Lorenzo,
> >>
> >> I'm getting a test failure in v6.15-rc3 on arm64:
> >>
> >> ----8<----
> >> # RUN guard_regions.shmem.uffd ...
> >> # guard-regions.c:1467:uffd:Expected ioctl(uffd, UFFDIO_REGISTER, ®) (-1) ==
> >> 0 (0)
> >> # uffd: Test terminated by assertion
> >> # FAIL guard_regions.shmem.uffd
> >> not ok 45 guard_regions.shmem.uffd
> >> ----8<----
> >>
> >> The ioctl is returning EINVAL.
> >
> > Hm strange, that works fine <resists urge to say 'on my machine'> on x86-64. Is
> > userfaultfd enabled in your config, to ask a silly question?
>
> Yep, and the anon version of the test is passing, as are all the uffd tests.
>
> >
> > It'd be odd for this to vary depending upon arch.
> >
> > So a factor here is a _stupidity_ in the testing - does your system mount /tmp
> > as tmpfs or an actual file system? As the test code unconditionally assumes /tmp
> > is indeed going to get you a shmem file.
>
> Ahh that's probably it. I'm on Ubuntu and it looks like /tmp is just a dir on
> the rootfs (XFS in my case).
>
> Forcing a tmpfs to /tmp solved it.
>
> Looks like uffd-unit-tests (see shmem_allocate_area()) is just using memfd.
> Would it be reasonable to take that approach? Or just use anon+shared via mmap?
Yeah could be either memfd or MAP_ANON | MAP_SHARED. I think I did it this
way to make the code a little easier given all the variant stuff.
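For the record, the memfd approach would look roughly like the below - a
sketch of the general shape only, not the actual selftest code (the helper
name is made up):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>

/* Always shmem-backed, regardless of whether /tmp is tmpfs. */
static void *map_shmem_pages(size_t nr_pages, int *fd_out)
{
        size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
        size_t len = nr_pages * page_size;
        int fd = memfd_create("guard-regions-test", 0);
        void *ptr;

        if (fd < 0)
                return MAP_FAILED;

        if (ftruncate(fd, len)) {
                close(fd);
                return MAP_FAILED;
        }

        /* MAP_SHARED so the mapping is genuinely shmem, not CoW anon. */
        ptr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (ptr == MAP_FAILED)
                close(fd);
        else
                *fd_out = fd;

        return ptr;
}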
Unless you felt especially keen, I can add fixing this to my todos :)
Sorry about this! Entirely my fault, knew it wasn't ideal when I wrote it,
but with test code sometimes you (perhaps incorrectly) tolerate things you
wouldn't elsewhere...
Cheers, Lorenzo
>
> Thanks,
> Ryan
>
[snip]
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH 3/4] tools/selftests: expand all guard region tests to file-backed
2025-04-22 11:07 ` Lorenzo Stoakes
@ 2025-04-22 11:11 ` Ryan Roberts
0 siblings, 0 replies; 64+ messages in thread
From: Ryan Roberts @ 2025-04-22 11:11 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, Suren Baghdasaryan, Liam R . Howlett,
Matthew Wilcox, Vlastimil Babka, Paul E . McKenney, Jann Horn,
David Hildenbrand, linux-mm, linux-kernel, Shuah Khan,
linux-kselftest, linux-api, John Hubbard, Juan Yescas,
Kalesh Singh
On 22/04/2025 12:07, Lorenzo Stoakes wrote:
> On Tue, Apr 22, 2025 at 12:03:08PM +0100, Ryan Roberts wrote:
>> On 22/04/2025 11:47, Lorenzo Stoakes wrote:
>>> On Tue, Apr 22, 2025 at 11:37:57AM +0100, Ryan Roberts wrote:
>
> [snip]
>
>>>>
>>>> Hi Lorenzo,
>>>>
>>>> I'm getting a test failure in v6.15-rc3 on arm64:
>>>>
>>>> ----8<----
>>>> # RUN guard_regions.shmem.uffd ...
>>>> # guard-regions.c:1467:uffd:Expected ioctl(uffd, UFFDIO_REGISTER, ®) (-1) ==
>>>> 0 (0)
>>>> # uffd: Test terminated by assertion
>>>> # FAIL guard_regions.shmem.uffd
>>>> not ok 45 guard_regions.shmem.uffd
>>>> ----8<----
>>>>
>>>> The ioctl is returning EINVAL.
>>>
>>> Hm strange, that works fine <resists urge to say 'on my machine'> on x86-64. Is
>>> userfaultfd enabled in your config, to ask a silly question?
>>
>> Yep, and the anon version of the test is passing, as are all the uffd tests.
>>
>>>
>>> It'd be odd for this to vary depending upon arch.
>>>
>>> So a factor here is a _stupidity_ in the testing - does your system mount /tmp
>>> as tmpfs or an actual file system? As the test code unconditionally assumes /tmp
>>> is indeed going to get you a shmem file.
>>
>> Ahh that's probably it. I'm on Ubuntu and it looks like /tmp is just a dir on
>> the rootfs (XFS in my case).
>>
>> Forcing a tmpfs to /tmp solved it.
>>
>> Looks like uffd-unit-tests (see shmem_allocate_area()) is just using memfd.
>> Would it be reasonable to take that approach? Or just use anon+shared via mmap?
>
> Yeah could be either memfd or MAP_ANON | MAP_SHARED. I think I did it this
> way to make the code a little easier given all the variant stuff.
>
> Unless you felt especially keen, I can add fixing this to my todos :)
I'm not keen, but if it lingers I might have a crack. Otherwise it's just more
noise to ignore when I'm running tests...
>
> Sorry about this! Entirely my fault, knew it wasn't ideal when I wrote it,
> but with test code sometimes you (perhaps incorrectly) tolerate things you
> wouldn't elsewhere...
No problem; I would have done the same thing. Given we know the root cause is a
test issue, this is not a high priority issue from my perspective.
>
> Cheers, Lorenzo
>
>>
>> Thanks,
>> Ryan
>>
>
> [snip]
^ permalink raw reply [flat|nested] 64+ messages in thread
end of thread (newest: 2025-04-22 11:11 UTC)
Thread overview: 64+ messages
2025-02-13 18:16 [PATCH 0/4] mm: permit guard regions for file-backed/shmem mappings Lorenzo Stoakes
2025-02-13 18:17 ` [PATCH 1/4] mm: allow guard regions in file-backed and read-only mappings Lorenzo Stoakes
2025-02-18 14:15 ` Vlastimil Babka
2025-02-18 16:01 ` David Hildenbrand
2025-02-18 16:12 ` Lorenzo Stoakes
2025-02-18 16:17 ` David Hildenbrand
2025-02-18 16:21 ` Lorenzo Stoakes
2025-02-18 16:27 ` David Hildenbrand
2025-02-18 16:49 ` Lorenzo Stoakes
2025-02-18 17:00 ` David Hildenbrand
2025-02-18 17:04 ` Lorenzo Stoakes
2025-02-24 14:02 ` Lorenzo Stoakes
2025-02-13 18:17 ` [PATCH 2/4] selftests/mm: rename guard-pages to guard-regions Lorenzo Stoakes
2025-02-18 14:15 ` Vlastimil Babka
2025-03-02 8:35 ` Lorenzo Stoakes
2025-02-13 18:17 ` [PATCH 3/4] tools/selftests: expand all guard region tests to file-backed Lorenzo Stoakes
2025-02-18 14:17 ` Vlastimil Babka
2025-04-22 10:37 ` Ryan Roberts
2025-04-22 10:47 ` Lorenzo Stoakes
2025-04-22 11:03 ` Ryan Roberts
2025-04-22 11:07 ` Lorenzo Stoakes
2025-04-22 11:11 ` Ryan Roberts
2025-02-13 18:17 ` [PATCH 4/4] tools/selftests: add file/shmem-backed mapping guard region tests Lorenzo Stoakes
2025-02-18 14:18 ` Vlastimil Babka
2025-02-18 12:12 ` [PATCH 0/4] mm: permit guard regions for file-backed/shmem mappings Vlastimil Babka
2025-02-18 13:05 ` Lorenzo Stoakes
2025-02-18 14:35 ` David Hildenbrand
2025-02-18 14:53 ` Lorenzo Stoakes
2025-02-18 15:20 ` David Hildenbrand
2025-02-18 16:43 ` Lorenzo Stoakes
2025-02-18 17:14 ` David Hildenbrand
2025-02-18 17:20 ` Lorenzo Stoakes
2025-02-18 17:25 ` David Hildenbrand
2025-02-18 17:28 ` Lorenzo Stoakes
2025-02-18 17:31 ` David Hildenbrand
2025-02-25 15:54 ` Vlastimil Babka
2025-02-25 16:31 ` David Hildenbrand
2025-02-25 16:37 ` Lorenzo Stoakes
2025-02-25 16:48 ` David Hildenbrand
2025-02-19 8:25 ` Kalesh Singh
2025-02-19 8:35 ` Kalesh Singh
2025-02-19 9:15 ` Lorenzo Stoakes
2025-02-19 17:32 ` Liam R. Howlett
2025-02-19 9:03 ` Lorenzo Stoakes
2025-02-19 9:15 ` David Hildenbrand
2025-02-19 9:17 ` Lorenzo Stoakes
2025-02-19 18:52 ` Kalesh Singh
2025-02-19 19:20 ` Lorenzo Stoakes
2025-02-19 20:56 ` Kalesh Singh
2025-02-20 8:51 ` Lorenzo Stoakes
2025-02-20 8:57 ` David Hildenbrand
2025-02-20 9:04 ` Lorenzo Stoakes
2025-02-20 9:23 ` David Hildenbrand
2025-02-20 9:47 ` Lorenzo Stoakes
2025-02-20 10:03 ` David Hildenbrand
2025-02-20 10:15 ` Lorenzo Stoakes
2025-02-20 12:44 ` David Hildenbrand
2025-02-20 13:18 ` Lorenzo Stoakes
2025-02-20 16:21 ` Suren Baghdasaryan
2025-02-20 18:08 ` Kalesh Singh
2025-02-21 11:04 ` Lorenzo Stoakes
2025-02-21 17:24 ` Kalesh Singh
2025-02-20 9:22 ` Vlastimil Babka
2025-02-20 9:53 ` Lorenzo Stoakes