* [PATCH v3 0/9] ioctl()-based API to query VMAs from /proc/<pid>/maps
@ 2024-06-05 0:24 Andrii Nakryiko
2024-06-05 0:24 ` [PATCH v3 1/9] mm: add find_vma()-like API but RCU protected and taking VMA lock Andrii Nakryiko
` (8 more replies)
0 siblings, 9 replies; 31+ messages in thread
From: Andrii Nakryiko @ 2024-06-05 0:24 UTC (permalink / raw)
To: linux-fsdevel, brauner, viro, akpm
Cc: linux-kernel, bpf, gregkh, linux-mm, liam.howlett, surenb, rppt,
Andrii Nakryiko
Implement a binary ioctl()-based interface to the /proc/<pid>/maps file to
allow applications to query VMA information more efficiently than reading
*all* VMAs nonselectively through the text-based /proc/<pid>/maps interface.
Patch #3 goes into a lot of detail and background on some common patterns of
using /proc/<pid>/maps in the area of performance profiling and subsequent
symbolization of captured stack traces. As mentioned in that patch, patterns
of VMA querying can differ depending on specific use case, but can generally
be grouped into two main categories: the need to query a small subset of VMAs
covering a given batch of addresses, or reading/storing/caching all
(typically, executable) VMAs upfront for later processing.
The new PROCMAP_QUERY ioctl() API added in this patch set was motivated by the
former pattern of usage. Patch #9 adds a tool that faithfully reproduces an
efficient VMA matching pass of a symbolizer, collecting a subset of covering
VMAs for a given set of addresses as efficiently as possible. This tool
serves both as a testing ground and as a benchmarking tool.
It implements everything both for the currently existing text-based
/proc/<pid>/maps interface and for the newly-added PROCMAP_QUERY ioctl().
Based on discussion of the previous revision of this patch set, it turned out
that this ioctl() API is competitive with the highly-optimized text-based
pre-processing pattern that the perf tool uses. Based on that discussion, this
revision adds more flexibility in specifying a subset of VMAs that are of
interest. Now it's possible to specify desired permissions of VMAs (e.g.,
request only executable ones) and/or restrict to only a subset of VMAs that
have file backing. This further improves the efficiency when using this new
API thanks to more selective (executable VMAs only) querying.
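To give a feel for the API, below is a minimal usage sketch (error handling
omitted; maps_fd is assumed to be an FD obtained by open()ing
/proc/<pid>/maps; see patch #3 and the UAPI header for authoritative
details):

  #include <string.h>
  #include <sys/ioctl.h>
  #include <linux/fs.h>

  /* find the closest executable, file-backed VMA at or after addr */
  static int query_exec_vma(int maps_fd, unsigned long addr,
                            struct procmap_query *q)
  {
          memset(q, 0, sizeof(*q));
          q->size = sizeof(*q);
          q->query_addr = addr;
          q->query_flags = PROCMAP_QUERY_COVERING_OR_NEXT_VMA |
                           PROCMAP_QUERY_FILE_BACKED_VMA |
                           PROCMAP_QUERY_VMA_EXECUTABLE;
          /* on success, q->vma_start/vma_end/vma_offset/inode/etc. are set */
          return ioctl(maps_fd, PROCMAP_QUERY, q);
  }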
In addition to a custom benchmarking tool from patch #9, and experimental perf
integration (available at [0]), Daniel Mueller has since also implemented an
experimental integration into blazesym (see [1]), a library used for stack
trace symbolization by our server fleet-wide profiler and another on-device
profiler agent that runs on weaker ARM devices. The latter ARM-based device
profiler is especially sensitive to performance, so we benchmarked and
compared the text-based /proc/<pid>/maps solution to the equivalent one
using the PROCMAP_QUERY ioctl().
The results are very encouraging, giving us a 5x improvement for the
end-to-end so-called "address normalization" pass, which is the part of the
symbolization process that happens locally on the ARM device, before data
is sent out for further heavier-weight processing on a more powerful remote
server. Note that this is not an artificial microbenchmark. It's a full
end-to-end API call being measured with real-world data on a real-world
device.
TEXT-BASED
==========
Benchmarking main/normalize_process_no_build_ids_uncached_maps
main/normalize_process_no_build_ids_uncached_maps
time: [49.777 µs 49.982 µs 50.250 µs]
IOCTL-BASED
===========
Benchmarking main/normalize_process_no_build_ids_uncached_maps
main/normalize_process_no_build_ids_uncached_maps
time: [10.328 µs 10.391 µs 10.457 µs]
change: [−79.453% −79.304% −79.166%] (p = 0.00 < 0.05)
Performance has improved.
You can see above that we get a drop from 50µs down to 10µs for exactly
the same amount of work, with the same data and target process.
Results for more synthetic benchmarks that hammer /proc/<pid>/maps processing
specifically can be found in patch #9. In short, we see about a ~40x
improvement with our custom benchmark tool (it varies depending on the
captured set of addresses; the previous revision used a different set of
captured addresses, giving about a ~35x improvement). And even for the
perf-based benchmark it's on par or slightly ahead when using
permission-based filtering (fetching only executable VMAs).
Another big change since v1 is the use of the RCU-protected per-VMA lock
during querying, which mm folks requested in preference to the
mmap_lock-based protection used by the text-based /proc/<pid>/maps
implementation. For that, we added a new internal API that is equivalent to
find_vma(), see patch #1.
One thing that did not change was basing this new API on the
/proc/<pid>/maps file as an ioctl() command. An ioctl()-based API on top of
pidfd was considered, but it has its own downsides. Implementing ioctl()
directly on a pidfd would cause access permission checks on every single
ioctl(), which leads to performance concerns and potential spam of capable()
audit messages. It also prevents a nice pattern, possible with
/proc/<pid>/maps, in which an application opens a /proc/self/maps FD
(requiring no additional capabilities) and passes this FD to a profiling
agent for querying. To achieve a similar pattern, a new file would have to
be created from a pidfd just for VMA querying, which is considered inferior
to just querying a /proc/<pid>/maps FD as proposed in the current approach.
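(For completeness, the FD-passing part of that pattern is just the standard
SCM_RIGHTS mechanism, nothing specific to this series; a sketch:)

  #include <string.h>
  #include <sys/socket.h>
  #include <sys/uio.h>

  /* pass an already-open /proc/self/maps FD to a profiling agent over
   * a connected Unix-domain socket */
  static int send_maps_fd(int sock, int maps_fd)
  {
          char dummy = 0;
          struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
          union {
                  struct cmsghdr align;
                  char buf[CMSG_SPACE(sizeof(int))];
          } u;
          struct msghdr msg = {
                  .msg_iov = &iov, .msg_iovlen = 1,
                  .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
          };
          struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

          cmsg->cmsg_level = SOL_SOCKET;
          cmsg->cmsg_type = SCM_RIGHTS;
          cmsg->cmsg_len = CMSG_LEN(sizeof(int));
          memcpy(CMSG_DATA(cmsg), &maps_fd, sizeof(int));
          return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
  }

The agent then receives the FD via recvmsg() and can issue PROCMAP_QUERY on
it directly.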
These aspects were discussed in the hallway track at the recent LSF/MM/BPF
2024, and sticking to the procfs ioctl() was the final agreement we arrived
at.
This patch set is based on top of next-20240604 tag in linux-next tree.
Note: currently there is a race when using RCU-protected per-VMA locking,
which Liam is fixing. RFC patches can be found at [2]. This patch set can
proceed in parallel with that solution.
[0] https://github.com/anakryiko/linux/commits/procfs-proc-maps-ioctl-v2/
[1] https://github.com/libbpf/blazesym/pull/675
[2] https://lore.kernel.org/linux-mm/20240531163217.1584450-1-Liam.Howlett@oracle.com/
v2->v3:
- drop mmap_lock aggressively under CONFIG_PER_VMA_LOCK (Liam);
- code massaging to abstract per-VMA vs mmap_lock differences (Liam);
v1->v2:
- per-VMA lock is used, if possible (Liam, Suren);
- added file-backed VMA querying (perf folks);
- added permission-based VMA querying (perf folks);
- split out build ID into separate patch (Suren);
- better documented API, added mention of ioctl() into procfs docs (Greg).
Andrii Nakryiko (9):
mm: add find_vma()-like API but RCU protected and taking VMA lock
fs/procfs: extract logic for getting VMA name constituents
fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps
fs/procfs: use per-VMA RCU-protected locking in PROCMAP_QUERY API
fs/procfs: add build ID fetching to PROCMAP_QUERY API
docs/procfs: call out ioctl()-based PROCMAP_QUERY command existence
tools: sync uapi/linux/fs.h header into tools subdir
selftests/bpf: make use of PROCMAP_QUERY ioctl if available
selftests/bpf: add simple benchmark tool for /proc/<pid>/maps APIs
Documentation/filesystems/proc.rst | 8 +
fs/proc/task_mmu.c | 412 +++++++++++++--
include/linux/mm.h | 8 +
include/uapi/linux/fs.h | 156 +++++-
mm/memory.c | 62 +++
tools/include/uapi/linux/fs.h | 550 ++++++++++++++++++++
tools/testing/selftests/bpf/.gitignore | 1 +
tools/testing/selftests/bpf/Makefile | 2 +-
tools/testing/selftests/bpf/procfs_query.c | 386 ++++++++++++++
tools/testing/selftests/bpf/test_progs.c | 3 +
tools/testing/selftests/bpf/test_progs.h | 2 +
tools/testing/selftests/bpf/trace_helpers.c | 104 +++-
12 files changed, 1623 insertions(+), 71 deletions(-)
create mode 100644 tools/include/uapi/linux/fs.h
create mode 100644 tools/testing/selftests/bpf/procfs_query.c
--
2.43.0
^ permalink raw reply [flat|nested] 31+ messages in thread
* [PATCH v3 1/9] mm: add find_vma()-like API but RCU protected and taking VMA lock
2024-06-05 0:24 [PATCH v3 0/9] ioctl()-based API to query VMAs from /proc/<pid>/maps Andrii Nakryiko
@ 2024-06-05 0:24 ` Andrii Nakryiko
2024-06-05 0:57 ` Matthew Wilcox
2024-06-05 0:24 ` [PATCH v3 2/9] fs/procfs: extract logic for getting VMA name constituents Andrii Nakryiko
` (7 subsequent siblings)
8 siblings, 1 reply; 31+ messages in thread
From: Andrii Nakryiko @ 2024-06-05 0:24 UTC (permalink / raw)
To: linux-fsdevel, brauner, viro, akpm
Cc: linux-kernel, bpf, gregkh, linux-mm, liam.howlett, surenb, rppt,
Andrii Nakryiko
The existing lock_vma_under_rcu() API assumes an exact VMA match, so it's not
a 100% equivalent of find_vma(). There are use cases that do want
find_vma() semantics of finding an exact VMA or the next one.
Also, it's important for such an API to let the user distinguish between not
being able to get the per-VMA lock and not having any VMA at or after the
provided address.
As such, this patch adds a new find_vma()-like API,
find_and_lock_vma_rcu(), which finds the exact or next VMA, attempts to take
the per-VMA lock, and, if that fails, returns ERR_PTR(-EBUSY). It still
returns NULL if there is no VMA at or after the address. In the successful
case it returns a valid and non-isolated VMA with the VMA lock taken.
This API will be used in a subsequent patch in this patch set to implement
a new user-facing API for querying process VMAs.
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
include/linux/mm.h | 8 ++++++
mm/memory.c | 62 ++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 70 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index c41c82bcbec2..3ab52b7e124c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -776,6 +776,8 @@ static inline void assert_fault_locked(struct vm_fault *vmf)
mmap_assert_locked(vmf->vma->vm_mm);
}
+struct vm_area_struct *find_and_lock_vma_rcu(struct mm_struct *mm,
+ unsigned long address);
struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
unsigned long address);
@@ -790,6 +792,12 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
static inline void vma_mark_detached(struct vm_area_struct *vma,
bool detached) {}
+static inline struct vm_area_struct *find_and_lock_vma_rcu(struct mm_struct *mm,
+ unsigned long address)
+{
+ return ERR_PTR(-EOPNOTSUPP);
+}
+
static inline struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
unsigned long address)
{
diff --git a/mm/memory.c b/mm/memory.c
index eef4e482c0c2..c9517742bd6d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5913,6 +5913,68 @@ struct vm_area_struct *lock_mm_and_find_vma(struct mm_struct *mm,
#endif
#ifdef CONFIG_PER_VMA_LOCK
+/*
+ * find_and_lock_vma_rcu() - Find and lock the VMA for a given address, or the
+ * next VMA. Search is done under RCU protection, without taking or assuming
+ * mmap_lock. Returned VMA is guaranteed to be stable and not isolated.
+ *
+ * @mm: The mm_struct to check
+ * @address: The address
+ *
+ * Returns: The VMA associated with addr, or the next VMA.
+ * May return %NULL in the case of no VMA at addr or above.
+ * If the VMA is being modified and can't be locked, -EBUSY is returned.
+ */
+struct vm_area_struct *find_and_lock_vma_rcu(struct mm_struct *mm,
+ unsigned long address)
+{
+ MA_STATE(mas, &mm->mm_mt, address, address);
+ struct vm_area_struct *vma;
+ int err;
+
+ rcu_read_lock();
+retry:
+ vma = mas_find(&mas, ULONG_MAX);
+ if (!vma) {
+ err = 0; /* no VMA, return NULL */
+ goto inval;
+ }
+
+ if (!vma_start_read(vma)) {
+ err = -EBUSY;
+ goto inval;
+ }
+
+ /*
+ * Check since vm_start/vm_end might change before we lock the VMA.
+ * Note, unlike lock_vma_under_rcu() we are searching for VMA covering
+ * address or the next one, so we only make sure VMA wasn't updated to
+ * end before the address.
+ */
+ if (unlikely(vma->vm_end <= address)) {
+ err = -EBUSY;
+ goto inval_end_read;
+ }
+
+ /* Check if the VMA got isolated after we found it */
+ if (vma->detached) {
+ vma_end_read(vma);
+ count_vm_vma_lock_event(VMA_LOCK_MISS);
+ /* The area was replaced with another one */
+ goto retry;
+ }
+
+ rcu_read_unlock();
+ return vma;
+
+inval_end_read:
+ vma_end_read(vma);
+inval:
+ rcu_read_unlock();
+ count_vm_vma_lock_event(VMA_LOCK_ABORT);
+ return ERR_PTR(err);
+}
+
/*
* Lookup and lock a VMA under RCU protection. Returned VMA is guaranteed to be
* stable and not isolated. If the VMA is not found or is being modified the
--
2.43.0
^ permalink raw reply related [flat|nested] 31+ messages in thread
* [PATCH v3 2/9] fs/procfs: extract logic for getting VMA name constituents
2024-06-05 0:24 [PATCH v3 0/9] ioctl()-based API to query VMAs from /proc/<pid>/maps Andrii Nakryiko
2024-06-05 0:24 ` [PATCH v3 1/9] mm: add find_vma()-like API but RCU protected and taking VMA lock Andrii Nakryiko
@ 2024-06-05 0:24 ` Andrii Nakryiko
2024-06-05 0:24 ` [PATCH v3 3/9] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps Andrii Nakryiko
` (6 subsequent siblings)
8 siblings, 0 replies; 31+ messages in thread
From: Andrii Nakryiko @ 2024-06-05 0:24 UTC (permalink / raw)
To: linux-fsdevel, brauner, viro, akpm
Cc: linux-kernel, bpf, gregkh, linux-mm, liam.howlett, surenb, rppt,
Andrii Nakryiko
Extract generic logic to fetch the relevant pieces of data that describe a
VMA's name. This could be just some string (either a special constant or
user-provided), or a string with some formatted wrapping text (e.g.,
"[anon_shmem:<something>]"), or, commonly, a file path. The seq_file-based
logic has different methods to handle all three cases, but they are
currently mixed in with extracting the underlying sources of data.
This patch splits this into data fetching and data formatting, so that
data fetching can be reused later on.
There should be no functional changes.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
fs/proc/task_mmu.c | 125 +++++++++++++++++++++++++--------------------
1 file changed, 71 insertions(+), 54 deletions(-)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index f8d35f993fe5..334ae210a95a 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -239,6 +239,67 @@ static int do_maps_open(struct inode *inode, struct file *file,
sizeof(struct proc_maps_private));
}
+static void get_vma_name(struct vm_area_struct *vma,
+ const struct path **path,
+ const char **name,
+ const char **name_fmt)
+{
+ struct anon_vma_name *anon_name = vma->vm_mm ? anon_vma_name(vma) : NULL;
+
+ *name = NULL;
+ *path = NULL;
+ *name_fmt = NULL;
+
+ /*
+ * Print the dentry name for named mappings, and a
+ * special [heap] marker for the heap:
+ */
+ if (vma->vm_file) {
+ /*
+ * If user named this anon shared memory via
+ * prctl(PR_SET_VMA ..., use the provided name.
+ */
+ if (anon_name) {
+ *name_fmt = "[anon_shmem:%s]";
+ *name = anon_name->name;
+ } else {
+ *path = file_user_path(vma->vm_file);
+ }
+ return;
+ }
+
+ if (vma->vm_ops && vma->vm_ops->name) {
+ *name = vma->vm_ops->name(vma);
+ if (*name)
+ return;
+ }
+
+ *name = arch_vma_name(vma);
+ if (*name)
+ return;
+
+ if (!vma->vm_mm) {
+ *name = "[vdso]";
+ return;
+ }
+
+ if (vma_is_initial_heap(vma)) {
+ *name = "[heap]";
+ return;
+ }
+
+ if (vma_is_initial_stack(vma)) {
+ *name = "[stack]";
+ return;
+ }
+
+ if (anon_name) {
+ *name_fmt = "[anon:%s]";
+ *name = anon_name->name;
+ return;
+ }
+}
+
static void show_vma_header_prefix(struct seq_file *m,
unsigned long start, unsigned long end,
vm_flags_t flags, unsigned long long pgoff,
@@ -262,17 +323,15 @@ static void show_vma_header_prefix(struct seq_file *m,
static void
show_map_vma(struct seq_file *m, struct vm_area_struct *vma)
{
- struct anon_vma_name *anon_name = NULL;
- struct mm_struct *mm = vma->vm_mm;
- struct file *file = vma->vm_file;
+ const struct path *path;
+ const char *name_fmt, *name;
vm_flags_t flags = vma->vm_flags;
unsigned long ino = 0;
unsigned long long pgoff = 0;
unsigned long start, end;
dev_t dev = 0;
- const char *name = NULL;
- if (file) {
+ if (vma->vm_file) {
const struct inode *inode = file_user_inode(vma->vm_file);
dev = inode->i_sb->s_dev;
@@ -283,57 +342,15 @@ show_map_vma(struct seq_file *m, struct vm_area_struct *vma)
start = vma->vm_start;
end = vma->vm_end;
show_vma_header_prefix(m, start, end, flags, pgoff, dev, ino);
- if (mm)
- anon_name = anon_vma_name(vma);
- /*
- * Print the dentry name for named mappings, and a
- * special [heap] marker for the heap:
- */
- if (file) {
+ get_vma_name(vma, &path, &name, &name_fmt);
+ if (path) {
seq_pad(m, ' ');
- /*
- * If user named this anon shared memory via
- * prctl(PR_SET_VMA ..., use the provided name.
- */
- if (anon_name)
- seq_printf(m, "[anon_shmem:%s]", anon_name->name);
- else
- seq_path(m, file_user_path(file), "\n");
- goto done;
- }
-
- if (vma->vm_ops && vma->vm_ops->name) {
- name = vma->vm_ops->name(vma);
- if (name)
- goto done;
- }
-
- name = arch_vma_name(vma);
- if (!name) {
- if (!mm) {
- name = "[vdso]";
- goto done;
- }
-
- if (vma_is_initial_heap(vma)) {
- name = "[heap]";
- goto done;
- }
-
- if (vma_is_initial_stack(vma)) {
- name = "[stack]";
- goto done;
- }
-
- if (anon_name) {
- seq_pad(m, ' ');
- seq_printf(m, "[anon:%s]", anon_name->name);
- }
- }
-
-done:
- if (name) {
+ seq_path(m, path, "\n");
+ } else if (name_fmt) {
+ seq_pad(m, ' ');
+ seq_printf(m, name_fmt, name);
+ } else if (name) {
seq_pad(m, ' ');
seq_puts(m, name);
}
--
2.43.0
^ permalink raw reply related [flat|nested] 31+ messages in thread
* [PATCH v3 3/9] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps
2024-06-05 0:24 [PATCH v3 0/9] ioctl()-based API to query VMAs from /proc/<pid>/maps Andrii Nakryiko
2024-06-05 0:24 ` [PATCH v3 1/9] mm: add find_vma()-like API but RCU protected and taking VMA lock Andrii Nakryiko
2024-06-05 0:24 ` [PATCH v3 2/9] fs/procfs: extract logic for getting VMA name constituents Andrii Nakryiko
@ 2024-06-05 0:24 ` Andrii Nakryiko
2024-06-07 22:31 ` Andrei Vagin
2024-06-05 0:24 ` [PATCH v3 4/9] fs/procfs: use per-VMA RCU-protected locking in PROCMAP_QUERY API Andrii Nakryiko
` (5 subsequent siblings)
8 siblings, 1 reply; 31+ messages in thread
From: Andrii Nakryiko @ 2024-06-05 0:24 UTC (permalink / raw)
To: linux-fsdevel, brauner, viro, akpm
Cc: linux-kernel, bpf, gregkh, linux-mm, liam.howlett, surenb, rppt,
Andrii Nakryiko
The /proc/<pid>/maps file is extremely useful in practice for various tasks
involving figuring out process memory layout, what files are backing any
given memory range, etc. One important class of applications that
absolutely rely on this is profilers/stack symbolizers (the perf tool being
one of them). Patterns of use differ, but they generally fall into two
categories.
In the on-demand pattern, a profiler/symbolizer would normally capture a
stack trace containing absolute memory addresses of some functions, and
would then use the /proc/<pid>/maps file to find the corresponding backing
ELF files (normally, only executable VMAs are of interest), file offsets
within them, and then continue from there to get yet more information (ELF
symbols, DWARF information) to produce human-readable symbolic information.
This pattern is used by Meta's fleet-wide profiler, as one example.
In the preprocessing pattern, the application doesn't know the set of
addresses of interest, so it has to fetch all relevant VMAs (again, probably
only executable ones), store or cache them, then proceed with profiling and
stack trace capture. Once done, it performs symbolization based on the
stored VMA information. This can happen at a much later point in time.
This pattern is used by the perf tool, as an example.
In either case, there are both performance and correctness requirements
involved. The address-to-VMA information translation has to be done as
efficiently as possible, but it also must not miss any VMA (especially in
the case of loading/unloading shared libraries). In practice, correctness
can't be guaranteed (due to the process dying before VMA data can be
captured, or a shared library being unloaded, etc.), but any effort to
maximize the chance of finding the VMA is appreciated.
Unfortunately, for all the /proc/<pid>/maps file's universality and
usefulness, it doesn't fit the above use cases 100%.
First, its main purpose is to emit all VMAs sequentially, but in
practice captured addresses would fall into only a smaller subset of all
process' VMAs, mainly containing executable text. Yet, a library would
need to parse most or all of the contents to find the needed VMAs, as there
is no way to skip VMAs that are of no use. An efficient library can keep the
linear pass relatively cheap, but it's definitely an overhead that can be
avoided, if there was a way to do more targeted querying of the relevant
VMA information.
Second, it's a text-based interface, which makes its programmatic use from
applications and libraries more cumbersome and inefficient due to the
need to handle text parsing to get the necessary pieces of information. The
overhead is actually paid both by the kernel, formatting originally binary
VMA data into text, and then by the user space application, parsing it back
into binary data for further use.
For the on-demand pattern of usage, described above, another problem
when writing a generic stack trace symbolization library is an unfortunate
performance-vs-correctness tradeoff that needs to be made. The library has
to decide whether to cache parsed contents of /proc/<pid>/maps (after
initial processing) to service future requests (if the application asks to
symbolize another set of addresses for the same process, captured at some
later time, which is typical for periodic/continuous profiling cases), or
to re-open and re-parse /proc/<pid>/maps contents from scratch on every
request. In the former case, more memory is used for the cache and there is
a risk of getting stale data if the application loads or unloads shared
libraries, or otherwise changes its set of VMAs somehow, e.g., through
additional mmap() calls. In the latter case, it's the performance hit that
comes from re-opening the file and re-parsing its contents all over again.
This patch aims to solve this problem by providing a new API built on
top of /proc/<pid>/maps. It's meant to address both the non-selectiveness
and the text nature of /proc/<pid>/maps: it gives the user more control over
what sort of VMA(s) need to be queried, and, being a binary interface, it
eliminates the overhead of text formatting (on the kernel side) and parsing
(on the user space side).
It's also designed to be extensible and forward/backward compatible by
including a required struct size field, which the user has to provide. We
use the established copy_struct_from_user() approach to handle extensibility.
The user has a choice to pick either getting the VMA that covers the
provided address, or -ENOENT if none is found (the exact, least surprising,
case). Or, with an extra query flag (PROCMAP_QUERY_COVERING_OR_NEXT_VMA),
they can get either the VMA that covers the address (if there is one), or
the closest next VMA (i.e., the VMA with the smallest vm_start > addr). The
latter allows more efficient use but, given it could be surprising behavior,
requires an explicit opt-in.
There is another query flag that is useful for some use cases.
PROCMAP_QUERY_FILE_BACKED_VMA instructs this API to only return
file-backed VMAs. Combining this with PROCMAP_QUERY_COVERING_OR_NEXT_VMA
makes it possible to efficiently iterate only file-backed VMAs of the
process, which is what profilers/symbolizers are normally interested in.
All the above querying flags can be combined with an (also optional) set of
desired VMA permission flags. This allows, for example, iterating only the
executable subset of VMAs, which is what the preprocessing pattern used by
the perf tool would benefit from, as the assumption is that captured stack
traces contain addresses of executable code. This saves time by efficiently
skipping non-executable VMAs altogether.
All these querying flags (modifiers) are orthogonal and can be combined
in a semantically meaningful and natural way.
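As an illustration, with these flags combined, iterating all executable
file-backed VMAs of a process boils down to a loop along these lines (a
sketch; handle_vma() is a hypothetical callback, maps_fd an open
/proc/<pid>/maps FD):

  struct procmap_query q = {
          .size = sizeof(q),
          .query_addr = 0,
          .query_flags = PROCMAP_QUERY_COVERING_OR_NEXT_VMA |
                         PROCMAP_QUERY_FILE_BACKED_VMA |
                         PROCMAP_QUERY_VMA_EXECUTABLE,
  };

  /* VMAs are returned in increasing address order; ioctl() fails with
   * -ENOENT once no matching VMA is left */
  while (ioctl(maps_fd, PROCMAP_QUERY, &q) == 0) {
          handle_vma(q.vma_start, q.vma_end, q.vma_offset);
          q.query_addr = q.vma_end; /* vma_end is exclusive */
  }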
Basing this ioctl()-based API on top of a /proc/<pid>/maps FD makes
sense given it queries the same set of VMA data. It's also beneficial
because permission checks for /proc/<pid>/maps are performed once at
open time, and subsequent reads of its text contents are done without
further permission checks. We piggyback on this pattern with the
ioctl()-based API as well, as that's a desired property, both for
performance reasons and for security and flexibility reasons.
Allowing an application to open an FD for /proc/self/maps without any
extra capabilities, and then pass it to some sort of profiling agent
through a Unix-domain socket, allows such a profiling agent to avoid
some of the capabilities that are otherwise expected when opening the
/proc/<pid>/maps file of *another* process. This is a desirable
property for some more restricted setups.
This new ioctl()-based implementation doesn't interfere with the
seq_file-based implementation of the /proc/<pid>/maps textual interface, so
the two can be used together or independently without paying any price for
that.
Note also that fetching the VMA name (e.g., backing file path, or a special
hard-coded or user-provided name) is optional, just like the build ID
fetching added later in this series. If the user sets vma_name_size to zero,
kernel code won't attempt to retrieve it, saving resources.
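E.g., a caller interested in the name would do roughly the following (a
sketch; the PATH_MAX-sized buffer is an arbitrary, if natural, choice):

  char vma_name[PATH_MAX];
  struct procmap_query q = {
          .size = sizeof(q),
          .query_addr = addr,
          .vma_name_addr = (__u64)(unsigned long)vma_name,
          .vma_name_size = sizeof(vma_name),
  };
  int err = ioctl(maps_fd, PROCMAP_QUERY, &q);

  /* on success, q.vma_name_size is the number of bytes filled in,
   * including the terminating zero, or zero if VMA has no name */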
To simplify reviewing, per-VMA locking is not yet added in this patch,
but the overall code structure is ready for it and will be adjusted in
the next patch to take per-VMA locking into account.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
fs/proc/task_mmu.c | 218 ++++++++++++++++++++++++++++++++++++++++
include/uapi/linux/fs.h | 128 ++++++++++++++++++++++-
2 files changed, 345 insertions(+), 1 deletion(-)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 334ae210a95a..614fbe5d0667 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -375,11 +375,229 @@ static int pid_maps_open(struct inode *inode, struct file *file)
return do_maps_open(inode, file, &proc_pid_maps_op);
}
+#define PROCMAP_QUERY_VMA_FLAGS ( \
+ PROCMAP_QUERY_VMA_READABLE | \
+ PROCMAP_QUERY_VMA_WRITABLE | \
+ PROCMAP_QUERY_VMA_EXECUTABLE | \
+ PROCMAP_QUERY_VMA_SHARED \
+)
+
+#define PROCMAP_QUERY_VALID_FLAGS_MASK ( \
+ PROCMAP_QUERY_COVERING_OR_NEXT_VMA | \
+ PROCMAP_QUERY_FILE_BACKED_VMA | \
+ PROCMAP_QUERY_VMA_FLAGS \
+)
+
+static int query_vma_setup(struct mm_struct *mm)
+{
+ return mmap_read_lock_killable(mm);
+}
+
+static void query_vma_teardown(struct mm_struct *mm, struct vm_area_struct *vma)
+{
+ mmap_read_unlock(mm);
+}
+
+static struct vm_area_struct *query_vma_find_by_addr(struct mm_struct *mm, unsigned long addr)
+{
+ return find_vma(mm, addr);
+}
+
+static struct vm_area_struct *query_matching_vma(struct mm_struct *mm,
+ unsigned long addr, u32 flags)
+{
+ struct vm_area_struct *vma;
+
+next_vma:
+ vma = query_vma_find_by_addr(mm, addr);
+ if (!vma)
+ goto no_vma;
+
+ /* user requested only file-backed VMA, keep iterating */
+ if ((flags & PROCMAP_QUERY_FILE_BACKED_VMA) && !vma->vm_file)
+ goto skip_vma;
+
+ /* VMA permissions should satisfy query flags */
+ if (flags & PROCMAP_QUERY_VMA_FLAGS) {
+ u32 perm = 0;
+
+ if (flags & PROCMAP_QUERY_VMA_READABLE)
+ perm |= VM_READ;
+ if (flags & PROCMAP_QUERY_VMA_WRITABLE)
+ perm |= VM_WRITE;
+ if (flags & PROCMAP_QUERY_VMA_EXECUTABLE)
+ perm |= VM_EXEC;
+ if (flags & PROCMAP_QUERY_VMA_SHARED)
+ perm |= VM_MAYSHARE;
+
+ if ((vma->vm_flags & perm) != perm)
+ goto skip_vma;
+ }
+
+ /* found covering VMA or user is OK with the matching next VMA */
+ if ((flags & PROCMAP_QUERY_COVERING_OR_NEXT_VMA) || vma->vm_start <= addr)
+ return vma;
+
+skip_vma:
+ /*
+ * If the user needs closest matching VMA, keep iterating.
+ */
+ addr = vma->vm_end;
+ if (flags & PROCMAP_QUERY_COVERING_OR_NEXT_VMA)
+ goto next_vma;
+no_vma:
+ return ERR_PTR(-ENOENT);
+}
+
+static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
+{
+ struct procmap_query karg;
+ struct vm_area_struct *vma;
+ struct mm_struct *mm;
+ const char *name = NULL;
+ char *name_buf = NULL;
+ __u64 usize;
+ int err;
+
+ if (copy_from_user(&usize, (void __user *)uarg, sizeof(usize)))
+ return -EFAULT;
+ /* argument struct can never be that large, reject abuse */
+ if (usize > PAGE_SIZE)
+ return -E2BIG;
+ /* argument struct should have at least query_flags and query_addr fields */
+ if (usize < offsetofend(struct procmap_query, query_addr))
+ return -EINVAL;
+ err = copy_struct_from_user(&karg, sizeof(karg), uarg, usize);
+ if (err)
+ return err;
+
+ /* reject unknown flags */
+ if (karg.query_flags & ~PROCMAP_QUERY_VALID_FLAGS_MASK)
+ return -EINVAL;
+ /* either both buffer address and size are set, or both should be zero */
+ if (!!karg.vma_name_size != !!karg.vma_name_addr)
+ return -EINVAL;
+
+ mm = priv->mm;
+ if (!mm || !mmget_not_zero(mm))
+ return -ESRCH;
+
+ err = query_vma_setup(mm);
+ if (err) {
+ mmput(mm);
+ return err;
+ }
+
+ vma = query_matching_vma(mm, karg.query_addr, karg.query_flags);
+ if (IS_ERR(vma)) {
+ err = PTR_ERR(vma);
+ vma = NULL;
+ goto out;
+ }
+
+ karg.vma_start = vma->vm_start;
+ karg.vma_end = vma->vm_end;
+
+ if (vma->vm_file) {
+ const struct inode *inode = file_user_inode(vma->vm_file);
+
+ karg.vma_offset = ((__u64)vma->vm_pgoff) << PAGE_SHIFT;
+ karg.dev_major = MAJOR(inode->i_sb->s_dev);
+ karg.dev_minor = MINOR(inode->i_sb->s_dev);
+ karg.inode = inode->i_ino;
+ } else {
+ karg.vma_offset = 0;
+ karg.dev_major = 0;
+ karg.dev_minor = 0;
+ karg.inode = 0;
+ }
+
+ karg.vma_flags = 0;
+ if (vma->vm_flags & VM_READ)
+ karg.vma_flags |= PROCMAP_QUERY_VMA_READABLE;
+ if (vma->vm_flags & VM_WRITE)
+ karg.vma_flags |= PROCMAP_QUERY_VMA_WRITABLE;
+ if (vma->vm_flags & VM_EXEC)
+ karg.vma_flags |= PROCMAP_QUERY_VMA_EXECUTABLE;
+ if (vma->vm_flags & VM_MAYSHARE)
+ karg.vma_flags |= PROCMAP_QUERY_VMA_SHARED;
+
+ if (karg.vma_name_size) {
+ size_t name_buf_sz = min_t(size_t, PATH_MAX, karg.vma_name_size);
+ const struct path *path;
+ const char *name_fmt;
+ size_t name_sz = 0;
+
+ get_vma_name(vma, &path, &name, &name_fmt);
+
+ if (path || name_fmt || name) {
+ name_buf = kmalloc(name_buf_sz, GFP_KERNEL);
+ if (!name_buf) {
+ err = -ENOMEM;
+ goto out;
+ }
+ }
+ if (path) {
+ name = d_path(path, name_buf, name_buf_sz);
+ if (IS_ERR(name)) {
+ err = PTR_ERR(name);
+ goto out;
+ }
+ name_sz = name_buf + name_buf_sz - name;
+ } else if (name || name_fmt) {
+ name_sz = 1 + snprintf(name_buf, name_buf_sz, name_fmt ?: "%s", name);
+ name = name_buf;
+ }
+ if (name_sz > name_buf_sz) {
+ err = -ENAMETOOLONG;
+ goto out;
+ }
+ karg.vma_name_size = name_sz;
+ }
+
+ /* unlock vma or mmap_lock, and put mm_struct before copying data to user */
+ query_vma_teardown(mm, vma);
+ mmput(mm);
+
+ if (karg.vma_name_size && copy_to_user((void __user *)karg.vma_name_addr,
+ name, karg.vma_name_size)) {
+ kfree(name_buf);
+ return -EFAULT;
+ }
+ kfree(name_buf);
+
+ if (copy_to_user(uarg, &karg, min_t(size_t, sizeof(karg), usize)))
+ return -EFAULT;
+
+ return 0;
+
+out:
+ query_vma_teardown(mm, vma);
+ mmput(mm);
+ kfree(name_buf);
+ return err;
+}
+
+static long procfs_procmap_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
+{
+ struct seq_file *seq = file->private_data;
+ struct proc_maps_private *priv = seq->private;
+
+ switch (cmd) {
+ case PROCMAP_QUERY:
+ return do_procmap_query(priv, (void __user *)arg);
+ default:
+ return -ENOIOCTLCMD;
+ }
+}
+
const struct file_operations proc_pid_maps_operations = {
.open = pid_maps_open,
.read = seq_read,
.llseek = seq_lseek,
.release = proc_map_release,
+ .unlocked_ioctl = procfs_procmap_ioctl,
+ .compat_ioctl = procfs_procmap_ioctl,
};
/*
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 45e4e64fd664..f25e7004972d 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -333,8 +333,10 @@ typedef int __bitwise __kernel_rwf_t;
#define RWF_SUPPORTED (RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
RWF_APPEND | RWF_NOAPPEND)
+#define PROCFS_IOCTL_MAGIC 'f'
+
/* Pagemap ioctl */
-#define PAGEMAP_SCAN _IOWR('f', 16, struct pm_scan_arg)
+#define PAGEMAP_SCAN _IOWR(PROCFS_IOCTL_MAGIC, 16, struct pm_scan_arg)
/* Bitmasks provided in pm_scan_args masks and reported in page_region.categories. */
#define PAGE_IS_WPALLOWED (1 << 0)
@@ -393,4 +395,128 @@ struct pm_scan_arg {
__u64 return_mask;
};
+/* /proc/<pid>/maps ioctl */
+#define PROCMAP_QUERY _IOWR(PROCFS_IOCTL_MAGIC, 17, struct procmap_query)
+
+enum procmap_query_flags {
+ /*
+ * VMA permission flags.
+ *
+ * Can be used as part of procmap_query.query_flags field to look up
+ * only VMAs satisfying specified subset of permissions. E.g., specifying
+ * PROCMAP_QUERY_VMA_READABLE only will return both readable and read/write VMAs,
+ * while having PROCMAP_QUERY_VMA_READABLE | PROCMAP_QUERY_VMA_WRITABLE will only
+ * return read/write VMAs, though both executable/non-executable and
+ * private/shared will be ignored.
+ *
+ * PROCMAP_QUERY_VMA_* flags are also returned in procmap_query.vma_flags
+ * field to specify actual VMA permissions.
+ */
+ PROCMAP_QUERY_VMA_READABLE = 0x01,
+ PROCMAP_QUERY_VMA_WRITABLE = 0x02,
+ PROCMAP_QUERY_VMA_EXECUTABLE = 0x04,
+ PROCMAP_QUERY_VMA_SHARED = 0x08,
+ /*
+ * Query modifier flags.
+ *
+ * By default VMA that covers provided address is returned, or -ENOENT
+ * is returned. With PROCMAP_QUERY_COVERING_OR_NEXT_VMA flag set, closest
+ * VMA with vma_start > addr will be returned if no covering VMA is
+ * found.
+ *
+ * PROCMAP_QUERY_FILE_BACKED_VMA instructs query to consider only VMAs that
+ * have file backing. Can be combined with PROCMAP_QUERY_COVERING_OR_NEXT_VMA
+ * to iterate all VMAs with file backing.
+ */
+ PROCMAP_QUERY_COVERING_OR_NEXT_VMA = 0x10,
+ PROCMAP_QUERY_FILE_BACKED_VMA = 0x20,
+};
+
+/*
+ * Input/output argument structure passed into the ioctl() call. It can be used
+ * to query a set of VMAs (Virtual Memory Areas) of a process.
+ *
+ * Each field can be one of three kinds, marked in a short comment to the
+ * right of the field:
+ * - "in", input argument, user has to provide this value, kernel doesn't modify it;
+ * - "out", output argument, kernel sets this field with VMA data;
+ * - "in/out", input and output argument; user provides initial value (used
+ * to specify maximum allowable buffer size), and kernel sets it to actual
+ * amount of data written (or zero, if there is no data).
+ *
+ * If a matching VMA is found (according to criteria specified by
+ * query_addr/query_flags), all the out fields are filled out, and ioctl()
+ * returns 0. If there is no matching VMA, -ENOENT will be returned.
+ * In case of any other error, negative error code other than -ENOENT is
+ * returned.
+ *
+ * Most of the data is similar to the one returned as text in /proc/<pid>/maps
+ * file, but procmap_query provides more querying flexibility. There are no
+ * consistency guarantees between subsequent ioctl() calls, but data returned
+ * for matched VMA is self-consistent.
+ */
+struct procmap_query {
+ /* Query struct size, for backwards/forward compatibility */
+ __u64 size;
+ /*
+ * Query flags, a combination of enum procmap_query_flags values.
+ * Defines query filtering and behavior, see enum procmap_query_flags.
+ *
+ * Input argument, provided by user. Kernel doesn't modify it.
+ */
+ __u64 query_flags; /* in */
+ /*
+ * Query address. By default, VMA that covers this address will
+ * be looked up. PROCMAP_QUERY_* flags above modify this default
+ * behavior further.
+ *
+ * Input argument, provided by user. Kernel doesn't modify it.
+ */
+ __u64 query_addr; /* in */
+ /* VMA starting (inclusive) and ending (exclusive) address, if VMA is found. */
+ __u64 vma_start; /* out */
+ __u64 vma_end; /* out */
+ /* VMA permissions flags. A combination of PROCMAP_QUERY_VMA_* flags. */
+ __u64 vma_flags; /* out */
+ /*
+ * VMA file offset. If VMA has file backing, this specifies offset
+ * within the file that VMA's start address corresponds to.
+ * Is set to zero if VMA has no backing file.
+ */
+ __u64 vma_offset; /* out */
+ /* Backing file's inode number, or zero, if VMA has no backing file. */
+ __u64 inode; /* out */
+ /* Backing file's device major/minor number, or zero, if VMA has no backing file. */
+ __u32 dev_major; /* out */
+ __u32 dev_minor; /* out */
+ /*
+ * If set to non-zero value, signals the request to return VMA name
+ * (i.e., VMA's backing file's absolute path, with " (deleted)" suffix
+ * appended, if file was unlinked from FS) for matched VMA. VMA name
+ * can also be some special name (e.g., "[heap]", "[stack]") or could
+ * even be user-supplied with prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME).
+ *
+ * Kernel will set this field to zero, if VMA has no associated name.
+ * Otherwise kernel will return actual amount of bytes filled in
+ * user-supplied buffer (see vma_name_addr field below), including the
+ * terminating zero.
+ *
+ * If VMA name is longer than user-supplied maximum buffer size,
+ * -E2BIG error is returned.
+ *
+ * If this field is set to non-zero value, vma_name_addr should point
+ * to valid user space memory buffer of at least vma_name_size bytes.
+ * If set to zero, vma_name_addr should be set to zero as well.
+ */
+ __u32 vma_name_size; /* in/out */
+ /*
+ * User-supplied address of a buffer of at least vma_name_size bytes
+ * for kernel to fill with matched VMA's name (see vma_name_size field
+ * description above for details).
+ *
+ * Should be set to zero if VMA name should not be returned.
+ */
+ __u64 vma_name_addr; /* in */
+};
+
#endif /* _UAPI_LINUX_FS_H */
--
2.43.0
^ permalink raw reply related [flat|nested] 31+ messages in thread
* [PATCH v3 4/9] fs/procfs: use per-VMA RCU-protected locking in PROCMAP_QUERY API
2024-06-05 0:24 [PATCH v3 0/9] ioctl()-based API to query VMAs from /proc/<pid>/maps Andrii Nakryiko
` (2 preceding siblings ...)
2024-06-05 0:24 ` [PATCH v3 3/9] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps Andrii Nakryiko
@ 2024-06-05 0:24 ` Andrii Nakryiko
2024-06-05 23:15 ` Suren Baghdasaryan
2024-06-05 0:24 ` [PATCH v3 5/9] fs/procfs: add build ID fetching to " Andrii Nakryiko
` (4 subsequent siblings)
8 siblings, 1 reply; 31+ messages in thread
From: Andrii Nakryiko @ 2024-06-05 0:24 UTC (permalink / raw)
To: linux-fsdevel, brauner, viro, akpm
Cc: linux-kernel, bpf, gregkh, linux-mm, liam.howlett, surenb, rppt,
Andrii Nakryiko
Attempt to use the RCU-protected per-VMA lock when looking up the requested
VMA as much as possible, only falling back to mmap_lock if the per-VMA lock
fails. This is done so that querying of VMAs doesn't interfere with
other critical tasks, like page fault handling.
This has been suggested by mm folks, and we make use of a newly added
internal API that works like find_vma(), but tries to use the per-VMA lock.
We have two sets of setup/query/teardown helper functions with different
implementations depending on the availability of the per-VMA lock
(conditioned on CONFIG_PER_VMA_LOCK) to abstract the per-VMA lock subtleties.
When the per-VMA lock is available, lookup is done under RCU, attempting to
take the per-VMA lock. If that fails, we fall back to mmap_lock, but then
proceed to unconditionally grab the per-VMA lock again, dropping mmap_lock
immediately. In this configuration mmap_lock is never held for long,
minimizing disruptions while querying.
When the per-VMA lock is compiled out, we take mmap_lock once, query VMAs
using the find_vma() API, and then unlock mmap_lock once at the very end as
well. In this setup we avoid locking/unlocking mmap_lock on every looked-up
VMA (depending on query parameters we might need to iterate a few of them).
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
fs/proc/task_mmu.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 46 insertions(+)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 614fbe5d0667..140032ffc551 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -388,6 +388,49 @@ static int pid_maps_open(struct inode *inode, struct file *file)
PROCMAP_QUERY_VMA_FLAGS \
)
+#ifdef CONFIG_PER_VMA_LOCK
+static int query_vma_setup(struct mm_struct *mm)
+{
+ /* in the presence of per-VMA lock we don't need any setup/teardown */
+ return 0;
+}
+
+static void query_vma_teardown(struct mm_struct *mm, struct vm_area_struct *vma)
+{
+ /* in the presence of per-VMA lock we need to unlock vma, if present */
+ if (vma)
+ vma_end_read(vma);
+}
+
+static struct vm_area_struct *query_vma_find_by_addr(struct mm_struct *mm, unsigned long addr)
+{
+ struct vm_area_struct *vma;
+
+ /* try to use less disruptive per-VMA lock */
+ vma = find_and_lock_vma_rcu(mm, addr);
+ if (IS_ERR(vma)) {
+ /* failed to take per-VMA lock, fallback to mmap_lock */
+ if (mmap_read_lock_killable(mm))
+ return ERR_PTR(-EINTR);
+
+ vma = find_vma(mm, addr);
+ if (vma) {
+ /*
+ * We cannot use vma_start_read() as it may fail due to
+ * false locked (see comment in vma_start_read()). We
+ * can avoid that by directly locking vm_lock under
+ * mmap_lock, which guarantees that nobody can lock the
+ * vma for write (vma_start_write()) under us.
+ */
+ down_read(&vma->vm_lock->lock);
+ }
+
+ mmap_read_unlock(mm);
+ }
+
+ return vma;
+}
+#else
static int query_vma_setup(struct mm_struct *mm)
{
return mmap_read_lock_killable(mm);
@@ -402,6 +445,7 @@ static struct vm_area_struct *query_vma_find_by_addr(struct mm_struct *mm, unsig
{
return find_vma(mm, addr);
}
+#endif
static struct vm_area_struct *query_matching_vma(struct mm_struct *mm,
unsigned long addr, u32 flags)
@@ -441,8 +485,10 @@ static struct vm_area_struct *query_matching_vma(struct mm_struct *mm,
skip_vma:
/*
* If the user needs closest matching VMA, keep iterating.
+ * But before we proceed we might need to unlock current VMA.
*/
addr = vma->vm_end;
+ vma_end_read(vma); /* no-op under !CONFIG_PER_VMA_LOCK */
if (flags & PROCMAP_QUERY_COVERING_OR_NEXT_VMA)
goto next_vma;
no_vma:
--
2.43.0
^ permalink raw reply related [flat|nested] 31+ messages in thread
* [PATCH v3 5/9] fs/procfs: add build ID fetching to PROCMAP_QUERY API
2024-06-05 0:24 [PATCH v3 0/9] ioctl()-based API to query VMAs from /proc/<pid>/maps Andrii Nakryiko
` (3 preceding siblings ...)
2024-06-05 0:24 ` [PATCH v3 4/9] fs/procfs: use per-VMA RCU-protected locking in PROCMAP_QUERY API Andrii Nakryiko
@ 2024-06-05 0:24 ` Andrii Nakryiko
2024-06-05 0:24 ` [PATCH v3 6/9] docs/procfs: call out ioctl()-based PROCMAP_QUERY command existence Andrii Nakryiko
` (3 subsequent siblings)
8 siblings, 0 replies; 31+ messages in thread
From: Andrii Nakryiko @ 2024-06-05 0:24 UTC (permalink / raw)
To: linux-fsdevel, brauner, viro, akpm
Cc: linux-kernel, bpf, gregkh, linux-mm, liam.howlett, surenb, rppt,
Andrii Nakryiko
The need to get ELF build ID reliably is an important aspect when
dealing with profiling and stack trace symbolization, and
/proc/<pid>/maps textual representation doesn't help with this.
To get the backing file's ELF build ID, an application has to first resolve
the VMA, then use its start/end address range to follow a special
/proc/<pid>/map_files/<start>-<end> symlink to open the ELF file (this
is necessary because the backing file might have been removed from the disk
or already replaced with another binary at the same file path).
Such an approach, beyond just adding the complexity of having to do a bunch
of extra work, has extra security implications. Because the application
opens the underlying ELF file and needs read access to its entire contents
(as far as the kernel is concerned), the kernel puts additional capable()
checks on following the /proc/<pid>/map_files/<start>-<end> symlink. And
that makes sense in general.
But in the case of build ID, the profiler/symbolizer doesn't need the
contents of the ELF file, per se. It's only the build ID that is of
interest, and the ELF build ID itself doesn't provide any sensitive
information.
So this patch adds a way to request the backing file's ELF build ID along
with the rest of the VMA information in the same API. The user has control
over whether this piece of information is requested or not by setting the
build_id_size field either to zero or to the non-zero maximum buffer size
they provided through the build_id_addr field (which encodes a user pointer
as a __u64 field). This is a completely optional piece of information, and
so it has no performance implications for use cases that don't care about
build ID, while improving performance and simplifying the setup for
those applications that do need it.
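Requesting the build ID then mirrors the VMA name request; roughly (a
sketch; 20 bytes matches the kernel's BUILD_ID_SIZE_MAX and is enough for a
SHA-1-based GNU build ID):

  char build_id[20];
  struct procmap_query q = {
          .size = sizeof(q),
          .query_addr = addr,
          .build_id_addr = (__u64)(unsigned long)build_id,
          .build_id_size = sizeof(build_id),
  };
  int err = ioctl(maps_fd, PROCMAP_QUERY, &q);

  /* on success, q.build_id_size is the actual number of build ID bytes,
   * or zero if the backing file has no build ID embedded */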
The kernel already implements build ID fetching, which is used by the BPF
subsystem. We are reusing this code here, but plan follow-up changes to
make it work better under the more relaxed assumption (compared to what the
existing code assumes) of being called from user process context, in
which page faults are allowed. The BPF-specific implementation currently
bails out if the necessary part of the ELF file is not paged in, all due to
extra BPF-specific restrictions (like the need to fetch build ID in
restrictive contexts such as an NMI handler).
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
fs/proc/task_mmu.c | 25 ++++++++++++++++++++++++-
include/uapi/linux/fs.h | 28 ++++++++++++++++++++++++++++
2 files changed, 52 insertions(+), 1 deletion(-)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 140032ffc551..4b7251fb1a4b 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -22,6 +22,7 @@
#include <linux/pkeys.h>
#include <linux/minmax.h>
#include <linux/overflow.h>
+#include <linux/buildid.h>
#include <asm/elf.h>
#include <asm/tlb.h>
@@ -491,6 +492,7 @@ static struct vm_area_struct *query_matching_vma(struct mm_struct *mm,
vma_end_read(vma); /* no-op under !CONFIG_PER_VMA_LOCK */
if (flags & PROCMAP_QUERY_COVERING_OR_NEXT_VMA)
goto next_vma;
+
no_vma:
return ERR_PTR(-ENOENT);
}
@@ -501,7 +503,7 @@ static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
struct vm_area_struct *vma;
struct mm_struct *mm;
const char *name = NULL;
- char *name_buf = NULL;
+ char build_id_buf[BUILD_ID_SIZE_MAX], *name_buf = NULL;
__u64 usize;
int err;
@@ -523,6 +525,8 @@ static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
/* either both buffer address and size are set, or both should be zero */
if (!!karg.vma_name_size != !!karg.vma_name_addr)
return -EINVAL;
+ if (!!karg.build_id_size != !!karg.build_id_addr)
+ return -EINVAL;
mm = priv->mm;
if (!mm || !mmget_not_zero(mm))
@@ -568,6 +572,21 @@ static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
if (vma->vm_flags & VM_MAYSHARE)
karg.vma_flags |= PROCMAP_QUERY_VMA_SHARED;
+ if (karg.build_id_size) {
+ __u32 build_id_sz;
+
+ err = build_id_parse(vma, build_id_buf, &build_id_sz);
+ if (err) {
+ karg.build_id_size = 0;
+ } else {
+ if (karg.build_id_size < build_id_sz) {
+ err = -ENAMETOOLONG;
+ goto out;
+ }
+ karg.build_id_size = build_id_sz;
+ }
+ }
+
if (karg.vma_name_size) {
size_t name_buf_sz = min_t(size_t, PATH_MAX, karg.vma_name_size);
const struct path *path;
@@ -612,6 +631,10 @@ static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
}
kfree(name_buf);
+ if (karg.build_id_size && copy_to_user((void __user *)karg.build_id_addr,
+ build_id_buf, karg.build_id_size))
+ return -EFAULT;
+
if (copy_to_user(uarg, &karg, min_t(size_t, sizeof(karg), usize)))
return -EFAULT;
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index f25e7004972d..7306022780d3 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -509,6 +509,26 @@ struct procmap_query {
* If set to zero, vma_name_addr should be set to zero as well
*/
__u32 vma_name_size; /* in/out */
+ /*
+ * If set to non-zero value, signals the request to extract and return
+ * VMA's backing file's build ID, if the backing file is an ELF file
+ * and it contains embedded build ID.
+ *
+ * Kernel will set this field to zero, if VMA has no backing file,
+ * backing file is not an ELF file, or ELF file has no build ID
+ * embedded.
+ *
+ * Build ID is a binary value (not a string). Kernel will set
+ * build_id_size field to exact number of bytes used for build ID.
+ * If build ID is requested and present, but needs more bytes than
+ * user-supplied maximum buffer size (see build_id_addr field below),
+ * -E2BIG error will be returned.
+ *
+ * If this field is set to non-zero value, build_id_addr should point
+ * to valid user space memory buffer of at least build_id_size bytes.
+ * If set to zero, build_id_addr should be set to zero as well.
+ */
+ __u32 build_id_size; /* in/out */
/*
* User-supplied address of a buffer of at least vma_name_size bytes
* for kernel to fill with matched VMA's name (see vma_name_size field
@@ -517,6 +537,14 @@ struct procmap_query {
* Should be set to zero if VMA name should not be returned.
*/
__u64 vma_name_addr; /* in */
+ /*
+ * User-supplied address of a buffer of at least build_id_size bytes
+ * for kernel to fill with matched VMA's ELF build ID, if available
+ * (see build_id_size field description above for details).
+ *
+ * Should be set to zero if build ID should not be returned.
+ */
+ __u64 build_id_addr; /* in */
};
#endif /* _UAPI_LINUX_FS_H */
--
2.43.0
^ permalink raw reply related [flat|nested] 31+ messages in thread
* [PATCH v3 6/9] docs/procfs: call out ioctl()-based PROCMAP_QUERY command existence
2024-06-05 0:24 [PATCH v3 0/9] ioctl()-based API to query VMAs from /proc/<pid>/maps Andrii Nakryiko
` (4 preceding siblings ...)
2024-06-05 0:24 ` [PATCH v3 5/9] fs/procfs: add build ID fetching to " Andrii Nakryiko
@ 2024-06-05 0:24 ` Andrii Nakryiko
2024-06-05 0:24 ` [PATCH v3 7/9] tools: sync uapi/linux/fs.h header into tools subdir Andrii Nakryiko
` (2 subsequent siblings)
8 siblings, 0 replies; 31+ messages in thread
From: Andrii Nakryiko @ 2024-06-05 0:24 UTC (permalink / raw)
To: linux-fsdevel, brauner, viro, akpm
Cc: linux-kernel, bpf, gregkh, linux-mm, liam.howlett, surenb, rppt,
Andrii Nakryiko
Call out the PROCMAP_QUERY ioctl()'s existence in the section describing the
/proc/PID/maps file in the documentation. We refer the user to the UAPI
header for low-level details of this programmatic interface.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
Documentation/filesystems/proc.rst | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 7c3a565ffbef..f2bbd1e86204 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -443,6 +443,14 @@ is not associated with a file:
or if empty, the mapping is anonymous.
+Starting with the 6.11 kernel, /proc/PID/maps provides an alternative
+ioctl()-based API that gives the ability to flexibly and efficiently query
+and filter individual VMAs. This interface is binary and is meant for more
+efficient programmatic use. `struct procmap_query`, defined in the linux/fs.h
+UAPI header, serves as an input/output argument to the `PROCMAP_QUERY` ioctl()
+command. See comments in the linux/fs.h UAPI header for details on query
+semantics, supported flags, data returned, and general API usage information.
+
The /proc/PID/smaps is an extension based on maps, showing the memory
consumption for each of the process's mappings. For each mapping (aka Virtual
Memory Area, or VMA) there is a series of lines such as the following::
--
2.43.0
^ permalink raw reply related [flat|nested] 31+ messages in thread
* [PATCH v3 7/9] tools: sync uapi/linux/fs.h header into tools subdir
2024-06-05 0:24 [PATCH v3 0/9] ioctl()-based API to query VMAs from /proc/<pid>/maps Andrii Nakryiko
` (5 preceding siblings ...)
2024-06-05 0:24 ` [PATCH v3 6/9] docs/procfs: call out ioctl()-based PROCMAP_QUERY command existence Andrii Nakryiko
@ 2024-06-05 0:24 ` Andrii Nakryiko
2024-06-05 0:24 ` [PATCH v3 8/9] selftests/bpf: make use of PROCMAP_QUERY ioctl if available Andrii Nakryiko
2024-06-05 0:24 ` [PATCH v3 9/9] selftests/bpf: add simple benchmark tool for /proc/<pid>/maps APIs Andrii Nakryiko
8 siblings, 0 replies; 31+ messages in thread
From: Andrii Nakryiko @ 2024-06-05 0:24 UTC (permalink / raw)
To: linux-fsdevel, brauner, viro, akpm
Cc: linux-kernel, bpf, gregkh, linux-mm, liam.howlett, surenb, rppt,
Andrii Nakryiko
We need this UAPI header in tools/include subdirectory for using it from
BPF selftests.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
tools/include/uapi/linux/fs.h | 550 ++++++++++++++++++++++++++++++++++
1 file changed, 550 insertions(+)
create mode 100644 tools/include/uapi/linux/fs.h
diff --git a/tools/include/uapi/linux/fs.h b/tools/include/uapi/linux/fs.h
new file mode 100644
index 000000000000..7306022780d3
--- /dev/null
+++ b/tools/include/uapi/linux/fs.h
@@ -0,0 +1,550 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_LINUX_FS_H
+#define _UAPI_LINUX_FS_H
+
+/*
+ * This file has definitions for some important file table structures
+ * and constants and structures used by various generic file system
+ * ioctl's. Please do not make any changes in this file before
+ * sending patches for review to linux-fsdevel@vger.kernel.org and
+ * linux-api@vger.kernel.org.
+ */
+
+#include <linux/limits.h>
+#include <linux/ioctl.h>
+#include <linux/types.h>
+#ifndef __KERNEL__
+#include <linux/fscrypt.h>
+#endif
+
+/* Use of MS_* flags within the kernel is restricted to core mount(2) code. */
+#if !defined(__KERNEL__)
+#include <linux/mount.h>
+#endif
+
+/*
+ * It's silly to have NR_OPEN bigger than NR_FILE, but you can change
+ * the file limit at runtime and only root can increase the per-process
+ * nr_file rlimit, so it's safe to set up a ridiculously high absolute
+ * upper limit on files-per-process.
+ *
+ * Some programs (notably those using select()) may have to be
+ * recompiled to take full advantage of the new limits..
+ */
+
+/* Fixed constants first: */
+#undef NR_OPEN
+#define INR_OPEN_CUR 1024 /* Initial setting for nfile rlimits */
+#define INR_OPEN_MAX 4096 /* Hard limit for nfile rlimits */
+
+#define BLOCK_SIZE_BITS 10
+#define BLOCK_SIZE (1<<BLOCK_SIZE_BITS)
+
+#define SEEK_SET 0 /* seek relative to beginning of file */
+#define SEEK_CUR 1 /* seek relative to current file position */
+#define SEEK_END 2 /* seek relative to end of file */
+#define SEEK_DATA 3 /* seek to the next data */
+#define SEEK_HOLE 4 /* seek to the next hole */
+#define SEEK_MAX SEEK_HOLE
+
+#define RENAME_NOREPLACE (1 << 0) /* Don't overwrite target */
+#define RENAME_EXCHANGE (1 << 1) /* Exchange source and dest */
+#define RENAME_WHITEOUT (1 << 2) /* Whiteout source */
+
+struct file_clone_range {
+ __s64 src_fd;
+ __u64 src_offset;
+ __u64 src_length;
+ __u64 dest_offset;
+};
+
+struct fstrim_range {
+ __u64 start;
+ __u64 len;
+ __u64 minlen;
+};
+
+/*
+ * We include a length field because some filesystems (vfat) have an identifier
+ * that we do want to expose as a UUID, but doesn't have the standard length.
+ *
+ * We use a fixed size buffer because this interface will, by fiat, never
+ * support "UUIDs" longer than 16 bytes; we don't want to force all downstream
+ * users to have to deal with that.
+ */
+struct fsuuid2 {
+ __u8 len;
+ __u8 uuid[16];
+};
+
+struct fs_sysfs_path {
+ __u8 len;
+ __u8 name[128];
+};
+
+/* extent-same (dedupe) ioctls; these MUST match the btrfs ioctl definitions */
+#define FILE_DEDUPE_RANGE_SAME 0
+#define FILE_DEDUPE_RANGE_DIFFERS 1
+
+/* from struct btrfs_ioctl_file_extent_same_info */
+struct file_dedupe_range_info {
+ __s64 dest_fd; /* in - destination file */
+ __u64 dest_offset; /* in - start of extent in destination */
+ __u64 bytes_deduped; /* out - total # of bytes we were able
+ * to dedupe from this file. */
+ /* status of this dedupe operation:
+ * < 0 for error
+ * == FILE_DEDUPE_RANGE_SAME if dedupe succeeds
+ * == FILE_DEDUPE_RANGE_DIFFERS if data differs
+ */
+ __s32 status; /* out - see above description */
+ __u32 reserved; /* must be zero */
+};
+
+/* from struct btrfs_ioctl_file_extent_same_args */
+struct file_dedupe_range {
+ __u64 src_offset; /* in - start of extent in source */
+ __u64 src_length; /* in - length of extent */
+ __u16 dest_count; /* in - total elements in info array */
+ __u16 reserved1; /* must be zero */
+ __u32 reserved2; /* must be zero */
+ struct file_dedupe_range_info info[];
+};
+
+/* And dynamically-tunable limits and defaults: */
+struct files_stat_struct {
+ unsigned long nr_files; /* read only */
+ unsigned long nr_free_files; /* read only */
+ unsigned long max_files; /* tunable */
+};
+
+struct inodes_stat_t {
+ long nr_inodes;
+ long nr_unused;
+ long dummy[5]; /* padding for sysctl ABI compatibility */
+};
+
+
+#define NR_FILE 8192 /* this can well be larger on a larger system */
+
+/*
+ * Structure for FS_IOC_FSGETXATTR[A] and FS_IOC_FSSETXATTR.
+ */
+struct fsxattr {
+ __u32 fsx_xflags; /* xflags field value (get/set) */
+ __u32 fsx_extsize; /* extsize field value (get/set)*/
+ __u32 fsx_nextents; /* nextents field value (get) */
+ __u32 fsx_projid; /* project identifier (get/set) */
+ __u32 fsx_cowextsize; /* CoW extsize field value (get/set)*/
+ unsigned char fsx_pad[8];
+};
+
+/*
+ * Flags for the fsx_xflags field
+ */
+#define FS_XFLAG_REALTIME 0x00000001 /* data in realtime volume */
+#define FS_XFLAG_PREALLOC 0x00000002 /* preallocated file extents */
+#define FS_XFLAG_IMMUTABLE 0x00000008 /* file cannot be modified */
+#define FS_XFLAG_APPEND 0x00000010 /* all writes append */
+#define FS_XFLAG_SYNC 0x00000020 /* all writes synchronous */
+#define FS_XFLAG_NOATIME 0x00000040 /* do not update access time */
+#define FS_XFLAG_NODUMP 0x00000080 /* do not include in backups */
+#define FS_XFLAG_RTINHERIT 0x00000100 /* create with rt bit set */
+#define FS_XFLAG_PROJINHERIT 0x00000200 /* create with parents projid */
+#define FS_XFLAG_NOSYMLINKS 0x00000400 /* disallow symlink creation */
+#define FS_XFLAG_EXTSIZE 0x00000800 /* extent size allocator hint */
+#define FS_XFLAG_EXTSZINHERIT 0x00001000 /* inherit inode extent size */
+#define FS_XFLAG_NODEFRAG 0x00002000 /* do not defragment */
+#define FS_XFLAG_FILESTREAM 0x00004000 /* use filestream allocator */
+#define FS_XFLAG_DAX 0x00008000 /* use DAX for IO */
+#define FS_XFLAG_COWEXTSIZE 0x00010000 /* CoW extent size allocator hint */
+#define FS_XFLAG_HASATTR 0x80000000 /* no DIFLAG for this */
+
+/* the read-only stuff doesn't really belong here, but any other place is
+ probably as bad and I don't want to create yet another include file. */
+
+#define BLKROSET _IO(0x12,93) /* set device read-only (0 = read-write) */
+#define BLKROGET _IO(0x12,94) /* get read-only status (0 = read_write) */
+#define BLKRRPART _IO(0x12,95) /* re-read partition table */
+#define BLKGETSIZE _IO(0x12,96) /* return device size /512 (long *arg) */
+#define BLKFLSBUF _IO(0x12,97) /* flush buffer cache */
+#define BLKRASET _IO(0x12,98) /* set read ahead for block device */
+#define BLKRAGET _IO(0x12,99) /* get current read ahead setting */
+#define BLKFRASET _IO(0x12,100)/* set filesystem (mm/filemap.c) read-ahead */
+#define BLKFRAGET _IO(0x12,101)/* get filesystem (mm/filemap.c) read-ahead */
+#define BLKSECTSET _IO(0x12,102)/* set max sectors per request (ll_rw_blk.c) */
+#define BLKSECTGET _IO(0x12,103)/* get max sectors per request (ll_rw_blk.c) */
+#define BLKSSZGET _IO(0x12,104)/* get block device sector size */
+#if 0
+#define BLKPG _IO(0x12,105)/* See blkpg.h */
+
+/* Some people are morons. Do not use sizeof! */
+
+#define BLKELVGET _IOR(0x12,106,size_t)/* elevator get */
+#define BLKELVSET _IOW(0x12,107,size_t)/* elevator set */
+/* This was here just to show that the number is taken -
+ probably all these _IO(0x12,*) ioctls should be moved to blkpg.h. */
+#endif
+/* A jump here: 108-111 have been used for various private purposes. */
+#define BLKBSZGET _IOR(0x12,112,size_t)
+#define BLKBSZSET _IOW(0x12,113,size_t)
+#define BLKGETSIZE64 _IOR(0x12,114,size_t) /* return device size in bytes (u64 *arg) */
+#define BLKTRACESETUP _IOWR(0x12,115,struct blk_user_trace_setup)
+#define BLKTRACESTART _IO(0x12,116)
+#define BLKTRACESTOP _IO(0x12,117)
+#define BLKTRACETEARDOWN _IO(0x12,118)
+#define BLKDISCARD _IO(0x12,119)
+#define BLKIOMIN _IO(0x12,120)
+#define BLKIOOPT _IO(0x12,121)
+#define BLKALIGNOFF _IO(0x12,122)
+#define BLKPBSZGET _IO(0x12,123)
+#define BLKDISCARDZEROES _IO(0x12,124)
+#define BLKSECDISCARD _IO(0x12,125)
+#define BLKROTATIONAL _IO(0x12,126)
+#define BLKZEROOUT _IO(0x12,127)
+#define BLKGETDISKSEQ _IOR(0x12,128,__u64)
+/*
+ * A jump here: 130-136 are reserved for zoned block devices
+ * (see uapi/linux/blkzoned.h)
+ */
+
+#define BMAP_IOCTL 1 /* obsolete - kept for compatibility */
+#define FIBMAP _IO(0x00,1) /* bmap access */
+#define FIGETBSZ _IO(0x00,2) /* get the block size used for bmap */
+#define FIFREEZE _IOWR('X', 119, int) /* Freeze */
+#define FITHAW _IOWR('X', 120, int) /* Thaw */
+#define FITRIM _IOWR('X', 121, struct fstrim_range) /* Trim */
+#define FICLONE _IOW(0x94, 9, int)
+#define FICLONERANGE _IOW(0x94, 13, struct file_clone_range)
+#define FIDEDUPERANGE _IOWR(0x94, 54, struct file_dedupe_range)
+
+#define FSLABEL_MAX 256 /* Max chars for the interface; each fs may differ */
+
+#define FS_IOC_GETFLAGS _IOR('f', 1, long)
+#define FS_IOC_SETFLAGS _IOW('f', 2, long)
+#define FS_IOC_GETVERSION _IOR('v', 1, long)
+#define FS_IOC_SETVERSION _IOW('v', 2, long)
+#define FS_IOC_FIEMAP _IOWR('f', 11, struct fiemap)
+#define FS_IOC32_GETFLAGS _IOR('f', 1, int)
+#define FS_IOC32_SETFLAGS _IOW('f', 2, int)
+#define FS_IOC32_GETVERSION _IOR('v', 1, int)
+#define FS_IOC32_SETVERSION _IOW('v', 2, int)
+#define FS_IOC_FSGETXATTR _IOR('X', 31, struct fsxattr)
+#define FS_IOC_FSSETXATTR _IOW('X', 32, struct fsxattr)
+#define FS_IOC_GETFSLABEL _IOR(0x94, 49, char[FSLABEL_MAX])
+#define FS_IOC_SETFSLABEL _IOW(0x94, 50, char[FSLABEL_MAX])
+/* Returns the external filesystem UUID, the same one blkid returns */
+#define FS_IOC_GETFSUUID _IOR(0x15, 0, struct fsuuid2)
+/*
+ * Returns the path component under /sys/fs/ that refers to this filesystem;
+ * also /sys/kernel/debug/ for filesystems with debugfs exports
+ */
+#define FS_IOC_GETFSSYSFSPATH _IOR(0x15, 1, struct fs_sysfs_path)
+
+/*
+ * Inode flags (FS_IOC_GETFLAGS / FS_IOC_SETFLAGS)
+ *
+ * Note: for historical reasons, these flags were originally used and
+ * defined for use by ext2/ext3, and then other file systems started
+ * using these flags so they wouldn't need to write their own version
+ * of chattr/lsattr (which was shipped as part of e2fsprogs). You
+ * should think twice before trying to use these flags in new
+ * contexts, or trying to assign these flags, since they are used both
+ * as the UAPI and the on-disk encoding for ext2/3/4. Also, we are
+ * almost out of 32-bit flags. :-)
+ *
+ * We have recently hoisted FS_IOC_FSGETXATTR / FS_IOC_FSSETXATTR from
+ * XFS to the generic FS level interface. This uses a structure that
+ * has padding and hence has more room to grow, so it may be more
+ * appropriate for many new use cases.
+ *
+ * Please do not change these flags or interfaces before checking with
+ * linux-fsdevel@vger.kernel.org and linux-api@vger.kernel.org.
+ */
+#define FS_SECRM_FL 0x00000001 /* Secure deletion */
+#define FS_UNRM_FL 0x00000002 /* Undelete */
+#define FS_COMPR_FL 0x00000004 /* Compress file */
+#define FS_SYNC_FL 0x00000008 /* Synchronous updates */
+#define FS_IMMUTABLE_FL 0x00000010 /* Immutable file */
+#define FS_APPEND_FL 0x00000020 /* writes to file may only append */
+#define FS_NODUMP_FL 0x00000040 /* do not dump file */
+#define FS_NOATIME_FL 0x00000080 /* do not update atime */
+/* Reserved for compression usage... */
+#define FS_DIRTY_FL 0x00000100
+#define FS_COMPRBLK_FL 0x00000200 /* One or more compressed clusters */
+#define FS_NOCOMP_FL 0x00000400 /* Don't compress */
+/* End compression flags --- maybe not all used */
+#define FS_ENCRYPT_FL 0x00000800 /* Encrypted file */
+#define FS_BTREE_FL 0x00001000 /* btree format dir */
+#define FS_INDEX_FL 0x00001000 /* hash-indexed directory */
+#define FS_IMAGIC_FL 0x00002000 /* AFS directory */
+#define FS_JOURNAL_DATA_FL 0x00004000 /* Reserved for ext3 */
+#define FS_NOTAIL_FL 0x00008000 /* file tail should not be merged */
+#define FS_DIRSYNC_FL 0x00010000 /* dirsync behaviour (directories only) */
+#define FS_TOPDIR_FL 0x00020000 /* Top of directory hierarchies*/
+#define FS_HUGE_FILE_FL 0x00040000 /* Reserved for ext4 */
+#define FS_EXTENT_FL 0x00080000 /* Extents */
+#define FS_VERITY_FL 0x00100000 /* Verity protected inode */
+#define FS_EA_INODE_FL 0x00200000 /* Inode used for large EA */
+#define FS_EOFBLOCKS_FL 0x00400000 /* Reserved for ext4 */
+#define FS_NOCOW_FL 0x00800000 /* Do not cow file */
+#define FS_DAX_FL 0x02000000 /* Inode is DAX */
+#define FS_INLINE_DATA_FL 0x10000000 /* Reserved for ext4 */
+#define FS_PROJINHERIT_FL 0x20000000 /* Create with parents projid */
+#define FS_CASEFOLD_FL 0x40000000 /* Folder is case insensitive */
+#define FS_RESERVED_FL 0x80000000 /* reserved for ext2 lib */
+
+#define FS_FL_USER_VISIBLE 0x0003DFFF /* User visible flags */
+#define FS_FL_USER_MODIFIABLE 0x000380FF /* User modifiable flags */
+
+
+#define SYNC_FILE_RANGE_WAIT_BEFORE 1
+#define SYNC_FILE_RANGE_WRITE 2
+#define SYNC_FILE_RANGE_WAIT_AFTER 4
+#define SYNC_FILE_RANGE_WRITE_AND_WAIT (SYNC_FILE_RANGE_WRITE | \
+ SYNC_FILE_RANGE_WAIT_BEFORE | \
+ SYNC_FILE_RANGE_WAIT_AFTER)
+
+/*
+ * Flags for preadv2/pwritev2:
+ */
+
+typedef int __bitwise __kernel_rwf_t;
+
+/* high priority request, poll if possible */
+#define RWF_HIPRI ((__force __kernel_rwf_t)0x00000001)
+
+/* per-IO O_DSYNC */
+#define RWF_DSYNC ((__force __kernel_rwf_t)0x00000002)
+
+/* per-IO O_SYNC */
+#define RWF_SYNC ((__force __kernel_rwf_t)0x00000004)
+
+/* per-IO, return -EAGAIN if operation would block */
+#define RWF_NOWAIT ((__force __kernel_rwf_t)0x00000008)
+
+/* per-IO O_APPEND */
+#define RWF_APPEND ((__force __kernel_rwf_t)0x00000010)
+
+/* per-IO negation of O_APPEND */
+#define RWF_NOAPPEND ((__force __kernel_rwf_t)0x00000020)
+
+/* mask of flags supported by the kernel */
+#define RWF_SUPPORTED (RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
+ RWF_APPEND | RWF_NOAPPEND)
+
+#define PROCFS_IOCTL_MAGIC 'f'
+
+/* Pagemap ioctl */
+#define PAGEMAP_SCAN _IOWR(PROCFS_IOCTL_MAGIC, 16, struct pm_scan_arg)
+
+/* Bitmasks provided in pm_scan_args masks and reported in page_region.categories. */
+#define PAGE_IS_WPALLOWED (1 << 0)
+#define PAGE_IS_WRITTEN (1 << 1)
+#define PAGE_IS_FILE (1 << 2)
+#define PAGE_IS_PRESENT (1 << 3)
+#define PAGE_IS_SWAPPED (1 << 4)
+#define PAGE_IS_PFNZERO (1 << 5)
+#define PAGE_IS_HUGE (1 << 6)
+#define PAGE_IS_SOFT_DIRTY (1 << 7)
+
+/*
+ * struct page_region - Page region with flags
+ * @start: Start of the region
+ * @end: End of the region (exclusive)
+ * @categories: PAGE_IS_* category bitmask for the region
+ */
+struct page_region {
+ __u64 start;
+ __u64 end;
+ __u64 categories;
+};
+
+/* Flags for PAGEMAP_SCAN ioctl */
+#define PM_SCAN_WP_MATCHING (1 << 0) /* Write protect the pages matched. */
+#define PM_SCAN_CHECK_WPASYNC (1 << 1) /* Abort the scan when a non-WP-enabled page is found. */
+
+/*
+ * struct pm_scan_arg - Pagemap ioctl argument
+ * @size: Size of the structure
+ * @flags: Flags for the IOCTL
+ * @start: Starting address of the region
+ * @end: Ending address of the region
+ * @walk_end: Address where the scan stopped (written by kernel).
+ *            walk_end == end (address tags cleared) indicates that the scan completed over the entire range.
+ * @vec: Address of page_region struct array for output
+ * @vec_len: Length of the page_region struct array
+ * @max_pages: Optional limit for number of returned pages (0 = disabled)
+ * @category_inverted: PAGE_IS_* categories whose values match if 0 instead of 1
+ * @category_mask: Skip pages for which any category doesn't match
+ * @category_anyof_mask: Skip pages for which no category matches
+ * @return_mask: PAGE_IS_* categories that are to be reported in `page_region`s returned
+ */
+struct pm_scan_arg {
+ __u64 size;
+ __u64 flags;
+ __u64 start;
+ __u64 end;
+ __u64 walk_end;
+ __u64 vec;
+ __u64 vec_len;
+ __u64 max_pages;
+ __u64 category_inverted;
+ __u64 category_mask;
+ __u64 category_anyof_mask;
+ __u64 return_mask;
+};
+
+/* /proc/<pid>/maps ioctl */
+#define PROCMAP_QUERY _IOWR(PROCFS_IOCTL_MAGIC, 17, struct procmap_query)
+
+enum procmap_query_flags {
+ /*
+ * VMA permission flags.
+ *
+ * Can be used as part of procmap_query.query_flags field to look up
+	 * only VMAs satisfying a specified subset of permissions. E.g., specifying just
+	 * PROCMAP_QUERY_VMA_READABLE will return both readable and read/write VMAs,
+	 * while PROCMAP_QUERY_VMA_READABLE | PROCMAP_QUERY_VMA_WRITABLE will only
+	 * return read/write VMAs; in either case the executable/non-executable and
+	 * private/shared distinctions are ignored.
+ *
+ * PROCMAP_QUERY_VMA_* flags are also returned in procmap_query.vma_flags
+ * field to specify actual VMA permissions.
+ */
+ PROCMAP_QUERY_VMA_READABLE = 0x01,
+ PROCMAP_QUERY_VMA_WRITABLE = 0x02,
+ PROCMAP_QUERY_VMA_EXECUTABLE = 0x04,
+ PROCMAP_QUERY_VMA_SHARED = 0x08,
+ /*
+ * Query modifier flags.
+ *
+	 * By default, the VMA that covers the provided address is returned, or
+	 * -ENOENT if there is none. With the PROCMAP_QUERY_COVERING_OR_NEXT_VMA
+	 * flag set, the closest VMA with vma_start > addr will be returned if
+	 * no covering VMA is found.
+ *
+	 * PROCMAP_QUERY_FILE_BACKED_VMA instructs the query to consider only VMAs that
+ * have file backing. Can be combined with PROCMAP_QUERY_COVERING_OR_NEXT_VMA
+ * to iterate all VMAs with file backing.
+ */
+ PROCMAP_QUERY_COVERING_OR_NEXT_VMA = 0x10,
+ PROCMAP_QUERY_FILE_BACKED_VMA = 0x20,
+};
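+
+/*
+ * Illustrative flag combination (a usage sketch, not part of the ABI):
+ * to iterate all executable file-backed VMAs of a process, one can set
+ *
+ *	query_flags = PROCMAP_QUERY_VMA_EXECUTABLE |
+ *		      PROCMAP_QUERY_FILE_BACKED_VMA |
+ *		      PROCMAP_QUERY_COVERING_OR_NEXT_VMA;
+ *
+ * and, after each successful query, advance query_addr to the returned
+ * vma_end to move on to the next matching VMA.
+ */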
+
+/*
+ * Input/output argument structure passed into the ioctl() call. It can be used
+ * to query a set of VMAs (Virtual Memory Areas) of a process.
+ *
+ * Each field can be one of three kinds, marked in a short comment to the
+ * right of the field:
+ * - "in", input argument, user has to provide this value, kernel doesn't modify it;
+ * - "out", output argument, kernel sets this field with VMA data;
+ * - "in/out", input and output argument; user provides initial value (used
+ * to specify maximum allowable buffer size), and kernel sets it to actual
+ * amount of data written (or zero, if there is no data).
+ *
+ * If a matching VMA is found (according to the criteria specified by
+ * query_addr/query_flags), all the out fields are filled out, and ioctl()
+ * returns 0. If there is no matching VMA, -ENOENT will be returned.
+ * In case of any other error, a negative error code other than -ENOENT
+ * is returned.
+ *
+ * Most of the data is similar to what is returned as text in /proc/<pid>/maps
+ * file, but procmap_query provides more querying flexibility. There are no
+ * consistency guarantees between subsequent ioctl() calls, but data returned
+ * for matched VMA is self-consistent.
+ */
+struct procmap_query {
+ /* Query struct size, for backwards/forward compatibility */
+ __u64 size;
+ /*
+ * Query flags, a combination of enum procmap_query_flags values.
+ * Defines query filtering and behavior, see enum procmap_query_flags.
+ *
+ * Input argument, provided by user. Kernel doesn't modify it.
+ */
+ __u64 query_flags; /* in */
+ /*
+ * Query address. By default, VMA that covers this address will
+ * be looked up. PROCMAP_QUERY_* flags above modify this default
+ * behavior further.
+ *
+ * Input argument, provided by user. Kernel doesn't modify it.
+ */
+ __u64 query_addr; /* in */
+ /* VMA starting (inclusive) and ending (exclusive) address, if VMA is found. */
+ __u64 vma_start; /* out */
+ __u64 vma_end; /* out */
+ /* VMA permissions flags. A combination of PROCMAP_QUERY_VMA_* flags. */
+ __u64 vma_flags; /* out */
+ /*
+ * VMA file offset. If VMA has file backing, this specifies offset
+ * within the file that VMA's start address corresponds to.
+ * Is set to zero if VMA has no backing file.
+ */
+ __u64 vma_offset; /* out */
+ /* Backing file's inode number, or zero, if VMA has no backing file. */
+ __u64 inode; /* out */
+ /* Backing file's device major/minor number, or zero, if VMA has no backing file. */
+ __u32 dev_major; /* out */
+ __u32 dev_minor; /* out */
+ /*
+ * If set to non-zero value, signals the request to return VMA name
+ * (i.e., VMA's backing file's absolute path, with " (deleted)" suffix
+ * appended, if file was unlinked from FS) for matched VMA. VMA name
+ * can also be some special name (e.g., "[heap]", "[stack]") or could
+ * be even user-supplied with prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME).
+ *
+ * Kernel will set this field to zero, if VMA has no associated name.
+	 * Otherwise kernel will set it to the actual number of bytes filled into
+	 * the user-supplied buffer (see vma_name_addr field below), including the
+ * terminating zero.
+ *
+	 * If VMA name is longer than user-supplied maximum buffer size,
+ * -E2BIG error is returned.
+ *
+ * If this field is set to non-zero value, vma_name_addr should point
+ * to valid user space memory buffer of at least vma_name_size bytes.
+	 * If set to zero, vma_name_addr should be set to zero as well.
+ */
+ __u32 vma_name_size; /* in/out */
+ /*
+ * If set to non-zero value, signals the request to extract and return
+ * VMA's backing file's build ID, if the backing file is an ELF file
+ * and it contains embedded build ID.
+ *
+ * Kernel will set this field to zero, if VMA has no backing file,
+ * backing file is not an ELF file, or ELF file has no build ID
+ * embedded.
+ *
+	 * Build ID is a binary value (not a string). Kernel will set the
+	 * build_id_size field to the exact number of bytes used for the build ID.
+ * If build ID is requested and present, but needs more bytes than
+ * user-supplied maximum buffer size (see build_id_addr field below),
+ * -E2BIG error will be returned.
+ *
+ * If this field is set to non-zero value, build_id_addr should point
+ * to valid user space memory buffer of at least build_id_size bytes.
+	 * If set to zero, build_id_addr should be set to zero as well.
+ */
+ __u32 build_id_size; /* in/out */
+ /*
+ * User-supplied address of a buffer of at least vma_name_size bytes
+ * for kernel to fill with matched VMA's name (see vma_name_size field
+ * description above for details).
+ *
+ * Should be set to zero if VMA name should not be returned.
+ */
+ __u64 vma_name_addr; /* in */
+ /*
+ * User-supplied address of a buffer of at least build_id_size bytes
+ * for kernel to fill with matched VMA's ELF build ID, if available
+ * (see build_id_size field description above for details).
+ *
+ * Should be set to zero if build ID should not be returned.
+ */
+ __u64 build_id_addr; /* in */
+};
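+
+/*
+ * Illustrative usage sketch (not part of the ABI; error handling and
+ * buffer sizing are up to the caller, and procfs_fd is assumed to be an
+ * open file descriptor for /proc/<pid>/maps):
+ *
+ *	struct procmap_query q;
+ *	char name[256];
+ *
+ *	memset(&q, 0, sizeof(q));
+ *	q.size = sizeof(q);
+ *	q.query_flags = PROCMAP_QUERY_COVERING_OR_NEXT_VMA;
+ *	q.query_addr = addr;
+ *	q.vma_name_addr = (__u64)(unsigned long)name;
+ *	q.vma_name_size = sizeof(name);
+ *	if (ioctl(procfs_fd, PROCMAP_QUERY, &q) == 0)
+ *		... use q.vma_start, q.vma_end, q.vma_flags, and name ...
+ *	else if (errno == ENOENT)
+ *		... no (matching) VMA at or after addr ...
+ */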
+
+#endif /* _UAPI_LINUX_FS_H */
--
2.43.0
^ permalink raw reply related [flat|nested] 31+ messages in thread
* [PATCH v3 8/9] selftests/bpf: make use of PROCMAP_QUERY ioctl if available
2024-06-05 0:24 [PATCH v3 0/9] ioctl()-based API to query VMAs from /proc/<pid>/maps Andrii Nakryiko
` (6 preceding siblings ...)
2024-06-05 0:24 ` [PATCH v3 7/9] tools: sync uapi/linux/fs.h header into tools subdir Andrii Nakryiko
@ 2024-06-05 0:24 ` Andrii Nakryiko
2024-06-05 0:24 ` [PATCH v3 9/9] selftests/bpf: add simple benchmark tool for /proc/<pid>/maps APIs Andrii Nakryiko
8 siblings, 0 replies; 31+ messages in thread
From: Andrii Nakryiko @ 2024-06-05 0:24 UTC (permalink / raw)
To: linux-fsdevel, brauner, viro, akpm
Cc: linux-kernel, bpf, gregkh, linux-mm, liam.howlett, surenb, rppt,
Andrii Nakryiko
Instead of parsing the text-based /proc/<pid>/maps file, try to use the
PROCMAP_QUERY ioctl() to simplify and speed up data fetching.
This logic is used to do uprobe file offset calculation, so any bugs in
this logic would manifest as failing uprobe BPF selftests.
This also serves as a simple demonstration of one of the intended uses.
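The overall pattern is (a sketch; see the diff below for the complete
logic, including the text-parsing fallback):
    err = procmap_query(fd, addr, PROCMAP_QUERY_VMA_EXECUTABLE,
                        &start, &base, &flags);
    if (err == -EOPNOTSUPP) {
        /* kernel without PROCMAP_QUERY support, parse text instead */
    } else if (err) {
        /* unexpected error, bail out */
    } else {
        /* use start/base/flags */
    }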
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
tools/testing/selftests/bpf/test_progs.c | 3 +
tools/testing/selftests/bpf/test_progs.h | 2 +
tools/testing/selftests/bpf/trace_helpers.c | 104 +++++++++++++++++---
3 files changed, 94 insertions(+), 15 deletions(-)
diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c
index 89ff704e9dad..6a19970f2531 100644
--- a/tools/testing/selftests/bpf/test_progs.c
+++ b/tools/testing/selftests/bpf/test_progs.c
@@ -19,6 +19,8 @@
#include <bpf/btf.h>
#include "json_writer.h"
+int env_verbosity = 0;
+
static bool verbose(void)
{
return env.verbosity > VERBOSE_NONE;
@@ -848,6 +850,7 @@ static error_t parse_arg(int key, char *arg, struct argp_state *state)
return -EINVAL;
}
}
+	env_verbosity = env.verbosity;
if (verbose()) {
if (setenv("SELFTESTS_VERBOSE", "1", 1) == -1) {
diff --git a/tools/testing/selftests/bpf/test_progs.h b/tools/testing/selftests/bpf/test_progs.h
index 0ba5a20b19ba..6eae7fdab0d7 100644
--- a/tools/testing/selftests/bpf/test_progs.h
+++ b/tools/testing/selftests/bpf/test_progs.h
@@ -95,6 +95,8 @@ struct test_state {
FILE *stdout;
};
+extern int env_verbosity;
+
struct test_env {
struct test_selector test_selector;
struct test_selector subtest_selector;
diff --git a/tools/testing/selftests/bpf/trace_helpers.c b/tools/testing/selftests/bpf/trace_helpers.c
index 70e29f316fe7..6723186c46bb 100644
--- a/tools/testing/selftests/bpf/trace_helpers.c
+++ b/tools/testing/selftests/bpf/trace_helpers.c
@@ -10,6 +10,8 @@
#include <pthread.h>
#include <unistd.h>
#include <linux/perf_event.h>
+#include <linux/fs.h>
+#include <sys/ioctl.h>
#include <sys/mman.h>
#include "trace_helpers.h"
#include <linux/limits.h>
@@ -233,29 +235,91 @@ int kallsyms_find(const char *sym, unsigned long long *addr)
return err;
}
+#ifdef PROCMAP_QUERY
+int env_verbosity __weak = 0;
+
+int procmap_query(int fd, const void *addr, __u32 query_flags, size_t *start, size_t *offset, int *flags)
+{
+ char path_buf[PATH_MAX], build_id_buf[20];
+ struct procmap_query q;
+ int err;
+
+ memset(&q, 0, sizeof(q));
+ q.size = sizeof(q);
+ q.query_flags = query_flags;
+ q.query_addr = (__u64)addr;
+ q.vma_name_addr = (__u64)path_buf;
+ q.vma_name_size = sizeof(path_buf);
+ q.build_id_addr = (__u64)build_id_buf;
+ q.build_id_size = sizeof(build_id_buf);
+
+ err = ioctl(fd, PROCMAP_QUERY, &q);
+ if (err < 0) {
+ err = -errno;
+ if (err == -ENOTTY)
+ return -EOPNOTSUPP; /* ioctl() not implemented yet */
+ if (err == -ENOENT)
+ return -ESRCH; /* vma not found */
+ return err;
+ }
+
+ if (env_verbosity >= 1) {
+ printf("VMA FOUND (addr %08lx): %08lx-%08lx %c%c%c%c %08lx %02x:%02x %ld %s (build ID: %s, %d bytes)\n",
+ (long)addr, (long)q.vma_start, (long)q.vma_end,
+ (q.vma_flags & PROCMAP_QUERY_VMA_READABLE) ? 'r' : '-',
+ (q.vma_flags & PROCMAP_QUERY_VMA_WRITABLE) ? 'w' : '-',
+ (q.vma_flags & PROCMAP_QUERY_VMA_EXECUTABLE) ? 'x' : '-',
+ (q.vma_flags & PROCMAP_QUERY_VMA_SHARED) ? 's' : 'p',
+ (long)q.vma_offset, q.dev_major, q.dev_minor, (long)q.inode,
+ q.vma_name_size ? path_buf : "",
+ q.build_id_size ? "YES" : "NO",
+ q.build_id_size);
+ }
+
+ *start = q.vma_start;
+ *offset = q.vma_offset;
+ *flags = q.vma_flags;
+ return 0;
+}
+#else
+int procmap_query(int fd, const void *addr, __u32 query_flags, size_t *start, size_t *offset, int *flags)
+{
+ return -EOPNOTSUPP;
+}
+#endif
+
ssize_t get_uprobe_offset(const void *addr)
{
- size_t start, end, base;
- char buf[256];
- bool found = false;
+ size_t start, base, end;
FILE *f;
+ char buf[256];
+ int err, flags;
f = fopen("/proc/self/maps", "r");
if (!f)
return -errno;
- while (fscanf(f, "%zx-%zx %s %zx %*[^\n]\n", &start, &end, buf, &base) == 4) {
- if (buf[2] == 'x' && (uintptr_t)addr >= start && (uintptr_t)addr < end) {
- found = true;
- break;
+	/* request executable VMAs only */
+ err = procmap_query(fileno(f), addr, PROCMAP_QUERY_VMA_EXECUTABLE, &start, &base, &flags);
+ if (err == -EOPNOTSUPP) {
+ bool found = false;
+
+ while (fscanf(f, "%zx-%zx %s %zx %*[^\n]\n", &start, &end, buf, &base) == 4) {
+ if (buf[2] == 'x' && (uintptr_t)addr >= start && (uintptr_t)addr < end) {
+ found = true;
+ break;
+ }
+ }
+ if (!found) {
+ fclose(f);
+ return -ESRCH;
}
+ } else if (err) {
+ fclose(f);
+ return err;
}
-
fclose(f);
- if (!found)
- return -ESRCH;
-
#if defined(__powerpc64__) && defined(_CALL_ELF) && _CALL_ELF == 2
#define OP_RT_RA_MASK 0xffff0000UL
@@ -296,15 +360,25 @@ ssize_t get_rel_offset(uintptr_t addr)
size_t start, end, offset;
char buf[256];
FILE *f;
+ int err, flags;
f = fopen("/proc/self/maps", "r");
if (!f)
return -errno;
- while (fscanf(f, "%zx-%zx %s %zx %*[^\n]\n", &start, &end, buf, &offset) == 4) {
- if (addr >= start && addr < end) {
- fclose(f);
- return (size_t)addr - start + offset;
+ err = procmap_query(fileno(f), (const void *)addr, 0, &start, &offset, &flags);
+ if (err == 0) {
+ fclose(f);
+ return (size_t)addr - start + offset;
+ } else if (err != -EOPNOTSUPP) {
+ fclose(f);
+ return err;
+	} else {
+ while (fscanf(f, "%zx-%zx %s %zx %*[^\n]\n", &start, &end, buf, &offset) == 4) {
+ if (addr >= start && addr < end) {
+ fclose(f);
+ return (size_t)addr - start + offset;
+ }
}
}
--
2.43.0
^ permalink raw reply related [flat|nested] 31+ messages in thread
* [PATCH v3 9/9] selftests/bpf: add simple benchmark tool for /proc/<pid>/maps APIs
2024-06-05 0:24 [PATCH v3 0/9] ioctl()-based API to query VMAs from /proc/<pid>/maps Andrii Nakryiko
` (7 preceding siblings ...)
2024-06-05 0:24 ` [PATCH v3 8/9] selftests/bpf: make use of PROCMAP_QUERY ioctl if available Andrii Nakryiko
@ 2024-06-05 0:24 ` Andrii Nakryiko
8 siblings, 0 replies; 31+ messages in thread
From: Andrii Nakryiko @ 2024-06-05 0:24 UTC (permalink / raw)
To: linux-fsdevel, brauner, viro, akpm
Cc: linux-kernel, bpf, gregkh, linux-mm, liam.howlett, surenb, rppt,
Andrii Nakryiko
Implement a simple tool/benchmark for comparing address "resolution"
logic based on the textual /proc/<pid>/maps interface and the new binary
ioctl-based PROCMAP_QUERY command.
The tool expects a file with a list of hex addresses and the relevant
PID, and provides control over whether the textual or the binary
ioctl-based way to process VMAs should be used.
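For example, the addresses file contains just hex addresses, one per
line (the values below are made up for illustration):
    7f8b2c041000
    7f8b2c25f730
    55d8f9c02240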
The overall logic implements an efficient way to do batched processing
of a given set of (unsorted) addresses. We first sort them in increasing
order (remembering their original positions so the original order can be
restored, if necessary), and then process all VMAs from /proc/<pid>/maps,
matching addresses to VMAs and calculating file offsets for matches. For
the ioctl-based approach the idea is similar, but implemented even more
efficiently, requesting only the VMAs that cover the given addresses and
skipping all the irrelevant VMAs altogether.
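The core of the matching pass, common to both backends (a simplified
excerpt of the tool below):
    /* addrs are sorted by address, remembering original indices */
    /* then, for each VMA, in increasing address order: */
        /* skip addrs falling before this VMA; they stay unresolved */
        for (; i < addr_cnt && a->addr < vma_start; i++, a++)
            ;
        /* resolve all addrs covered by this VMA */
        for (; i < addr_cnt && a->addr < vma_end; i++, a++)
            resolved[a->idx].file_off = a->addr - vma_start + vma_offset;
    /* afterwards, addrs are re-sorted back into the original order */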
To be able to compare the efficiency of both APIs, the tool has a
"benchmark" mode. The user provides a number of processing runs to
execute in a tight loop, timing only the /proc/<pid>/maps parsing and
processing parts of the logic. Address sorting and re-sorting is
excluded. This gives a more direct way to compare ioctl- vs text-based
APIs.
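E.g., to benchmark the ioctl-based mode over 10000 runs against a
hypothetical PID 1234 (see the tool's source below for all options):
    # ./procfs_query -p 1234 -f addrs.txt -B 10000 -Q -q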
We used a medium-sized production application to do representative
benchmark. A bunch of stack traces were captured, resulting in 3244
user space addresses (464 unique ones, but we didn't deduplicate them).
Application itself had 655 VMAs reported in /proc/<pid>/maps.
Averaging the time taken to process all addresses over 10000 runs showed that:
- text-based approach took 333 microseconds *per one batch run*;
- ioctl-based approach took 8 microseconds *per (identical) batch run*.
This gives about a 40x speedup for doing exactly the same amount of work
(build IDs were not fetched for the ioctl-based benchmark; fetching build
IDs resulted in a 2x slowdown compared to the no-build-ID case). The
ratio will vary depending on the exact set of addresses and how many
VMAs they map to. So 40x isn't something to take for granted, but it
does show the kind of improvement that is achievable.
I also did an strace run of both cases. In the text-based one the tool
did 27 read() syscalls, fetching up to 4KB of data in one go (a seq_file
limitation: bumping the buffer size has no effect, as data is always
capped at 4KB). In comparison, the ioctl-based implementation had to do
only 5 ioctl() calls to fetch all relevant VMAs.
It is projected that the savings from processing big production
applications would only widen the gap in favor of the binary querying
ioctl() API, as bigger applications tend to have even more
non-executable VMA mappings relative to executable ones. E.g., one of
the larger production applications in the server fleet has upwards of
20000 VMAs, which would make the benchmark even more unfair to
processing the /proc/<pid>/maps file.
This tool implements one of the usage patterns, referred to as the
"on-demand profiling" use case in the main patch implementing the
ioctl() API. perf is an example of the pre-processing pattern, in which
all (or all executable) VMAs are loaded and stored for further querying.
We implemented an experimental change to perf to benchmark text-based
and ioctl-based APIs, and in perf benchmarks the ioctl-based interface
was no worse than the optimized text-based parsing. Filtering to only
executable VMAs made the ioctl-based benchmarks faster still, as perf
would query only about 1/3 of all VMAs, compared to having to read and
parse all of them.
E.g., running `perf bench internals synthesize --mt -M 8`, we get:
TEXT-BASED
==========
# ./perf-parse bench internals synthesize --mt -M 8
# Running 'internals/synthesize' benchmark:
Computing performance of multi threaded perf event synthesis by
synthesizing events on CPU 0:
Number of synthesis threads: 1
Average synthesis took: 10238.600 usec (+- 309.656 usec)
Average num. events: 3744.000 (+- 0.000)
Average time per event 2.735 usec
...
Number of synthesis threads: 8
Average synthesis took: 6814.600 usec (+- 149.418 usec)
Average num. events: 3744.000 (+- 0.000)
Average time per event 1.820 usec
IOCTL-BASED, FETCHING ALL VMAS
==============================
# ./perf-ioctl-all bench internals synthesize --mt -M 8
# Running 'internals/synthesize' benchmark:
Computing performance of multi threaded perf event synthesis by
synthesizing events on CPU 0:
Number of synthesis threads: 1
Average synthesis took: 9944.800 usec (+- 381.794 usec)
Average num. events: 3593.000 (+- 0.000)
Average time per event 2.768 usec
...
Number of synthesis threads: 8
Average synthesis took: 6598.600 usec (+- 137.503 usec)
Average num. events: 3595.000 (+- 0.000)
Average time per event 1.835 usec
IOCTL-BASED, FETCHING EXECUTABLE VMAS
=====================================
# ./perf-ioctl-exec bench internals synthesize --mt -M 8
# Running 'internals/synthesize' benchmark:
Computing performance of multi threaded perf event synthesis by
synthesizing events on CPU 0:
Number of synthesis threads: 1
Average synthesis took: 8539.600 usec (+- 364.875 usec)
Average num. events: 3569.000 (+- 0.000)
Average time per event 2.393 usec
...
Number of synthesis threads: 8
Average synthesis took: 5657.600 usec (+- 107.219 usec)
Average num. events: 3571.000 (+- 0.000)
Average time per event 1.584 usec
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
tools/testing/selftests/bpf/.gitignore | 1 +
tools/testing/selftests/bpf/Makefile | 2 +-
tools/testing/selftests/bpf/procfs_query.c | 386 +++++++++++++++++++++
3 files changed, 388 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/bpf/procfs_query.c
diff --git a/tools/testing/selftests/bpf/.gitignore b/tools/testing/selftests/bpf/.gitignore
index 5025401323af..903b14931bfe 100644
--- a/tools/testing/selftests/bpf/.gitignore
+++ b/tools/testing/selftests/bpf/.gitignore
@@ -44,6 +44,7 @@ test_cpp
/veristat
/sign-file
/uprobe_multi
+/procfs_query
*.ko
*.tmp
xskxceiver
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index e0b3887b3d2d..0afa667a54e5 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -144,7 +144,7 @@ TEST_GEN_PROGS_EXTENDED = test_skb_cgroup_id_user \
flow_dissector_load test_flow_dissector test_tcp_check_syncookie_user \
test_lirc_mode2_user xdping test_cpp runqslower bench bpf_testmod.ko \
xskxceiver xdp_redirect_multi xdp_synproxy veristat xdp_hw_metadata \
- xdp_features bpf_test_no_cfi.ko
+ xdp_features bpf_test_no_cfi.ko procfs_query
TEST_GEN_FILES += liburandom_read.so urandom_read sign-file uprobe_multi
diff --git a/tools/testing/selftests/bpf/procfs_query.c b/tools/testing/selftests/bpf/procfs_query.c
new file mode 100644
index 000000000000..63e06568f1ff
--- /dev/null
+++ b/tools/testing/selftests/bpf/procfs_query.c
@@ -0,0 +1,386 @@
+// SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
+#include <argp.h>
+#include <stdio.h>
+#include <string.h>
+#include <stdlib.h>
+#include <time.h>
+#include <stdbool.h>
+#include <stdint.h>
+#include <sys/ioctl.h>
+#include <linux/fs.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <time.h>
+
+static bool verbose;
+static bool quiet;
+static bool use_ioctl;
+static bool request_build_id;
+static char *addrs_path;
+static int pid;
+static int bench_runs;
+
+const char *argp_program_version = "procfs_query 0.0";
+const char *argp_program_bug_address = "<bpf@vger.kernel.org>";
+
+static inline uint64_t get_time_ns(void)
+{
+ struct timespec t;
+
+ clock_gettime(CLOCK_MONOTONIC, &t);
+
+ return (uint64_t)t.tv_sec * 1000000000 + t.tv_nsec;
+}
+
+static const struct argp_option opts[] = {
+ { "verbose", 'v', NULL, 0, "Verbose mode" },
+ { "quiet", 'q', NULL, 0, "Quiet mode (no output)" },
+ { "pid", 'p', "PID", 0, "PID of the process" },
+ { "addrs-path", 'f', "PATH", 0, "File with addresses to resolve" },
+ { "benchmark", 'B', "RUNS", 0, "Benchmark mode" },
+ { "query", 'Q', NULL, 0, "Use ioctl()-based point query API (by default text parsing is done)" },
+ { "build-id", 'b', NULL, 0, "Fetch build ID, if available (only for ioctl mode)" },
+ {},
+};
+
+static error_t parse_arg(int key, char *arg, struct argp_state *state)
+{
+ switch (key) {
+ case 'v':
+ verbose = true;
+ break;
+ case 'q':
+ quiet = true;
+ break;
+ case 'Q':
+ use_ioctl = true;
+ break;
+ case 'b':
+ request_build_id = true;
+ break;
+ case 'p':
+ pid = strtol(arg, NULL, 10);
+ break;
+ case 'f':
+ addrs_path = strdup(arg);
+ break;
+ case 'B':
+ bench_runs = strtol(arg, NULL, 10);
+ if (bench_runs <= 0) {
+ fprintf(stderr, "Invalid benchmark run count: %s\n", arg);
+ return -EINVAL;
+ }
+ break;
+ case ARGP_KEY_ARG:
+ argp_usage(state);
+ break;
+ default:
+ return ARGP_ERR_UNKNOWN;
+ }
+ return 0;
+}
+
+static const struct argp argp = {
+ .options = opts,
+ .parser = parse_arg,
+};
+
+struct addr {
+ unsigned long long addr;
+ int idx;
+};
+
+static struct addr *addrs;
+static size_t addr_cnt, addr_cap;
+
+struct resolved_addr {
+ unsigned long long file_off;
+ const char *vma_name;
+ int build_id_sz;
+ char build_id[20];
+};
+
+static struct resolved_addr *resolved;
+
+static int resolve_addrs_ioctl(void)
+{
+ char buf[32], build_id_buf[20], vma_name[PATH_MAX];
+ struct procmap_query q;
+ int fd, err, i;
+ struct addr *a = &addrs[0];
+ struct resolved_addr *r;
+
+ snprintf(buf, sizeof(buf), "/proc/%d/maps", pid);
+ fd = open(buf, O_RDONLY);
+ if (fd < 0) {
+ err = -errno;
+ fprintf(stderr, "Failed to open process map file (%s): %d\n", buf, err);
+ return err;
+ }
+
+ memset(&q, 0, sizeof(q));
+ q.size = sizeof(q);
+ q.query_flags = PROCMAP_QUERY_COVERING_OR_NEXT_VMA;
+ q.vma_name_addr = (__u64)vma_name;
+ if (request_build_id)
+ q.build_id_addr = (__u64)build_id_buf;
+
+ for (i = 0; i < addr_cnt; ) {
+ char *name = NULL;
+
+ q.query_addr = (__u64)a->addr;
+ q.vma_name_size = sizeof(vma_name);
+ if (request_build_id)
+ q.build_id_size = sizeof(build_id_buf);
+
+ err = ioctl(fd, PROCMAP_QUERY, &q);
+ if (err < 0 && errno == ENOTTY) {
+ close(fd);
+ fprintf(stderr, "PROCMAP_QUERY ioctl() command is not supported on this kernel!\n");
+ return -EOPNOTSUPP; /* ioctl() not implemented yet */
+ }
+ if (err < 0 && errno == ENOENT) {
+ fprintf(stderr, "ENOENT addr %lx\n", (long)q.query_addr);
+ i++;
+ a++;
+ continue; /* unresolved address */
+ }
+ if (err < 0) {
+ err = -errno;
+ close(fd);
+ fprintf(stderr, "PROCMAP_QUERY ioctl() returned error: %d\n", err);
+ return err;
+ }
+
+ if (verbose) {
+ printf("VMA FOUND (addr %08lx): %08lx-%08lx %c%c%c%c %08lx %02x:%02x %ld %s (build ID: %s, %d bytes)\n",
+ (long)q.query_addr, (long)q.vma_start, (long)q.vma_end,
+ (q.vma_flags & PROCMAP_QUERY_VMA_READABLE) ? 'r' : '-',
+ (q.vma_flags & PROCMAP_QUERY_VMA_WRITABLE) ? 'w' : '-',
+ (q.vma_flags & PROCMAP_QUERY_VMA_EXECUTABLE) ? 'x' : '-',
+ (q.vma_flags & PROCMAP_QUERY_VMA_SHARED) ? 's' : 'p',
+ (long)q.vma_offset, q.dev_major, q.dev_minor, (long)q.inode,
+ q.vma_name_size ? vma_name : "",
+ q.build_id_size ? "YES" : "NO",
+ q.build_id_size);
+ }
+
+ /* skip addrs falling before current VMA */
+ for (; i < addr_cnt && a->addr < q.vma_start; i++, a++) {
+ }
+ /* process addrs covered by current VMA */
+ for (; i < addr_cnt && a->addr < q.vma_end; i++, a++) {
+ r = &resolved[a->idx];
+ r->file_off = a->addr - q.vma_start + q.vma_offset;
+
+ /* reuse name, if it was already strdup()'ed */
+ if (q.vma_name_size)
+ name = name ?: strdup(vma_name);
+ r->vma_name = name;
+
+ if (q.build_id_size) {
+ r->build_id_sz = q.build_id_size;
+ memcpy(r->build_id, build_id_buf, q.build_id_size);
+ }
+ }
+ }
+
+ close(fd);
+ return 0;
+}
+
+static int resolve_addrs_parse(void)
+{
+ size_t vma_start, vma_end, vma_offset, ino;
+ uint32_t dev_major, dev_minor;
+ char perms[4], buf[32], vma_name[PATH_MAX], fbuf[4096];
+ FILE *f;
+ int err, idx = 0;
+ struct addr *a = &addrs[idx];
+ struct resolved_addr *r;
+
+ snprintf(buf, sizeof(buf), "/proc/%d/maps", pid);
+ f = fopen(buf, "r");
+ if (!f) {
+ err = -errno;
+ fprintf(stderr, "Failed to open process map file (%s): %d\n", buf, err);
+ return err;
+ }
+
+ err = setvbuf(f, fbuf, _IOFBF, sizeof(fbuf));
+ if (err) {
+ err = -errno;
+ fprintf(stderr, "Failed to set custom file buffer size: %d\n", err);
+ return err;
+ }
+
+ while ((err = fscanf(f, "%zx-%zx %c%c%c%c %zx %x:%x %zu %[^\n]\n",
+ &vma_start, &vma_end,
+ &perms[0], &perms[1], &perms[2], &perms[3],
+ &vma_offset, &dev_major, &dev_minor, &ino, vma_name)) >= 10) {
+ const char *name = NULL;
+
+ /* skip addrs before current vma, they stay unresolved */
+ for (; idx < addr_cnt && a->addr < vma_start; idx++, a++) {
+ }
+
+ /* resolve all addrs within current vma now */
+ for (; idx < addr_cnt && a->addr < vma_end; idx++, a++) {
+ r = &resolved[a->idx];
+ r->file_off = a->addr - vma_start + vma_offset;
+
+ /* reuse name, if it was already strdup()'ed */
+ if (err > 10)
+ name = name ?: strdup(vma_name);
+ else
+ name = NULL;
+ r->vma_name = name;
+ }
+
+ /* ran out of addrs to resolve, stop early */
+ if (idx >= addr_cnt)
+ break;
+ }
+
+ fclose(f);
+ return 0;
+}
+
+static int cmp_by_addr(const void *a, const void *b)
+{
+ const struct addr *x = a, *y = b;
+
+ if (x->addr != y->addr)
+ return x->addr < y->addr ? -1 : 1;
+ return x->idx < y->idx ? -1 : 1;
+}
+
+static int cmp_by_idx(const void *a, const void *b)
+{
+ const struct addr *x = a, *y = b;
+
+ return x->idx < y->idx ? -1 : 1;
+}
+
+int main(int argc, char **argv)
+{
+	FILE *f;
+ int err, i;
+ unsigned long long addr;
+ uint64_t start_ns;
+ double total_ns;
+
+ /* Parse command line arguments */
+ err = argp_parse(&argp, argc, argv, 0, NULL, NULL);
+ if (err)
+ return err;
+
+ if (pid <= 0 || !addrs_path) {
+ fprintf(stderr, "Please provide PID and file with addresses to process!\n");
+ exit(1);
+ }
+
+ if (verbose) {
+ fprintf(stderr, "PID: %d\n", pid);
+ fprintf(stderr, "PATH: %s\n", addrs_path);
+ }
+
+ f = fopen(addrs_path, "r");
+ if (!f) {
+ err = -errno;
+ fprintf(stderr, "Failed to open '%s': %d\n", addrs_path, err);
+ goto out;
+ }
+
+ while ((err = fscanf(f, "%llx\n", &addr)) == 1) {
+ if (addr_cnt == addr_cap) {
+ addr_cap = addr_cap == 0 ? 16 : (addr_cap * 3 / 2);
+ addrs = realloc(addrs, sizeof(*addrs) * addr_cap);
+ memset(addrs + addr_cnt, 0, (addr_cap - addr_cnt) * sizeof(*addrs));
+ }
+
+ addrs[addr_cnt].addr = addr;
+ addrs[addr_cnt].idx = addr_cnt;
+
+ addr_cnt++;
+ }
+ if (verbose)
+ fprintf(stderr, "READ %zu addrs!\n", addr_cnt);
+ if (!feof(f)) {
+ fprintf(stderr, "Failure parsing full list of addresses at '%s'!\n", addrs_path);
+ err = -EINVAL;
+ fclose(f);
+ goto out;
+ }
+ fclose(f);
+ if (addr_cnt == 0) {
+ fprintf(stderr, "No addresses provided, bailing out!\n");
+ err = -ENOENT;
+ goto out;
+ }
+
+ resolved = calloc(addr_cnt, sizeof(*resolved));
+
+ qsort(addrs, addr_cnt, sizeof(*addrs), cmp_by_addr);
+ if (verbose) {
+ fprintf(stderr, "SORTED ADDRS (%zu):\n", addr_cnt);
+ for (i = 0; i < addr_cnt; i++) {
+ fprintf(stderr, "ADDR #%d: %#llx\n", addrs[i].idx, addrs[i].addr);
+ }
+ }
+
+ start_ns = get_time_ns();
+ for (i = bench_runs ?: 1; i > 0; i--) {
+ if (use_ioctl) {
+ err = resolve_addrs_ioctl();
+ } else {
+ err = resolve_addrs_parse();
+ }
+ if (err) {
+ fprintf(stderr, "Failed to resolve addrs: %d!\n", err);
+ goto out;
+ }
+ }
+ total_ns = get_time_ns() - start_ns;
+
+ if (bench_runs) {
+ fprintf(stderr, "BENCHMARK MODE. RUNS: %d TOTAL TIME (ms): %.3lf TIME/RUN (ms): %.3lf TIME/ADDR (us): %.3lf\n",
+ bench_runs, total_ns / 1000000.0, total_ns / bench_runs / 1000000.0,
+ total_ns / bench_runs / addr_cnt / 1000.0);
+ }
+
+ /* sort them back into the original order */
+ qsort(addrs, addr_cnt, sizeof(*addrs), cmp_by_idx);
+
+ if (!quiet) {
+ printf("RESOLVED ADDRS (%zu):\n", addr_cnt);
+ for (i = 0; i < addr_cnt; i++) {
+ const struct addr *a = &addrs[i];
+ const struct resolved_addr *r = &resolved[a->idx];
+
+ if (r->file_off) {
+ printf("RESOLVED #%d: %#llx -> OFF %#llx",
+ a->idx, a->addr, r->file_off);
+ if (r->vma_name)
+ printf(" NAME %s", r->vma_name);
+ if (r->build_id_sz) {
+ char build_id_str[41];
+ int j;
+
+ for (j = 0; j < r->build_id_sz; j++)
+ sprintf(&build_id_str[j * 2], "%02hhx", r->build_id[j]);
+ printf(" BUILDID %s", build_id_str);
+ }
+ printf("\n");
+ } else {
+ printf("UNRESOLVED #%d: %#llx\n", a->idx, a->addr);
+ }
+ }
+ }
+out:
+ free(addrs);
+ free(addrs_path);
+ free(resolved);
+
+ return err < 0 ? -err : 0;
+}
--
2.43.0
^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [PATCH v3 1/9] mm: add find_vma()-like API but RCU protected and taking VMA lock
2024-06-05 0:24 ` [PATCH v3 1/9] mm: add find_vma()-like API but RCU protected and taking VMA lock Andrii Nakryiko
@ 2024-06-05 0:57 ` Matthew Wilcox
2024-06-05 13:33 ` Liam R. Howlett
0 siblings, 1 reply; 31+ messages in thread
From: Matthew Wilcox @ 2024-06-05 0:57 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: linux-fsdevel, brauner, viro, akpm, linux-kernel, bpf, gregkh,
linux-mm, liam.howlett, surenb, rppt
On Tue, Jun 04, 2024 at 05:24:46PM -0700, Andrii Nakryiko wrote:
> +/*
> + * find_and_lock_vma_rcu() - Find and lock the VMA for a given address, or the
> + * next VMA. Search is done under RCU protection, without taking or assuming
> + * mmap_lock. Returned VMA is guaranteed to be stable and not isolated.
You know this is supposed to be the _short_ description, right?
Three lines is way too long. The full description goes between the
arguments and the Return: line.
> + * @mm: The mm_struct to check
> + * @addr: The address
> + *
> + * Returns: The VMA associated with addr, or the next VMA.
> + * May return %NULL in the case of no VMA at addr or above.
> + * If the VMA is being modified and can't be locked, -EBUSY is returned.
> + */
> +struct vm_area_struct *find_and_lock_vma_rcu(struct mm_struct *mm,
> + unsigned long address)
> +{
> + MA_STATE(mas, &mm->mm_mt, address, address);
> + struct vm_area_struct *vma;
> + int err;
> +
> + rcu_read_lock();
> +retry:
> + vma = mas_find(&mas, ULONG_MAX);
> + if (!vma) {
> + err = 0; /* no VMA, return NULL */
> + goto inval;
> + }
> +
> + if (!vma_start_read(vma)) {
> + err = -EBUSY;
> + goto inval;
> + }
> +
> + /*
> + * Check since vm_start/vm_end might change before we lock the VMA.
> + * Note, unlike lock_vma_under_rcu() we are searching for VMA covering
> + * address or the next one, so we only make sure VMA wasn't updated to
> + * end before the address.
> + */
> + if (unlikely(vma->vm_end <= address)) {
> + err = -EBUSY;
> + goto inval_end_read;
> + }
> +
> + /* Check if the VMA got isolated after we found it */
> + if (vma->detached) {
> + vma_end_read(vma);
> + count_vm_vma_lock_event(VMA_LOCK_MISS);
> + /* The area was replaced with another one */
Surely you need to mas_reset() before you goto retry?
> + goto retry;
> + }
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v3 1/9] mm: add find_vma()-like API but RCU protected and taking VMA lock
2024-06-05 0:57 ` Matthew Wilcox
@ 2024-06-05 13:33 ` Liam R. Howlett
2024-06-05 16:13 ` Andrii Nakryiko
0 siblings, 1 reply; 31+ messages in thread
From: Liam R. Howlett @ 2024-06-05 13:33 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Andrii Nakryiko, linux-fsdevel, brauner, viro, akpm, linux-kernel,
bpf, gregkh, linux-mm, surenb, rppt
* Matthew Wilcox <willy@infradead.org> [240604 20:57]:
> On Tue, Jun 04, 2024 at 05:24:46PM -0700, Andrii Nakryiko wrote:
> > +/*
> > + * find_and_lock_vma_rcu() - Find and lock the VMA for a given address, or the
> > + * next VMA. Search is done under RCU protection, without taking or assuming
> > + * mmap_lock. Returned VMA is guaranteed to be stable and not isolated.
>
> You know this is supposed to be the _short_ description, right?
> Three lines is way too long. The full description goes between the
> arguments and the Return: line.
>
> > + * @mm: The mm_struct to check
> > + * @addr: The address
> > + *
> > + * Returns: The VMA associated with addr, or the next VMA.
> > + * May return %NULL in the case of no VMA at addr or above.
> > + * If the VMA is being modified and can't be locked, -EBUSY is returned.
> > + */
> > +struct vm_area_struct *find_and_lock_vma_rcu(struct mm_struct *mm,
> > + unsigned long address)
> > +{
> > + MA_STATE(mas, &mm->mm_mt, address, address);
> > + struct vm_area_struct *vma;
> > + int err;
> > +
> > + rcu_read_lock();
> > +retry:
> > + vma = mas_find(&mas, ULONG_MAX);
> > + if (!vma) {
> > + err = 0; /* no VMA, return NULL */
> > + goto inval;
> > + }
> > +
> > + if (!vma_start_read(vma)) {
> > + err = -EBUSY;
> > + goto inval;
> > + }
> > +
> > + /*
> > + * Check since vm_start/vm_end might change before we lock the VMA.
> > + * Note, unlike lock_vma_under_rcu() we are searching for VMA covering
> > + * address or the next one, so we only make sure VMA wasn't updated to
> > + * end before the address.
> > + */
> > + if (unlikely(vma->vm_end <= address)) {
> > + err = -EBUSY;
> > + goto inval_end_read;
> > + }
> > +
> > + /* Check if the VMA got isolated after we found it */
> > + if (vma->detached) {
> > + vma_end_read(vma);
> > + count_vm_vma_lock_event(VMA_LOCK_MISS);
> > + /* The area was replaced with another one */
>
> Surely you need to mas_reset() before you goto retry?
Probably more than that. We've found and may have adjusted the
index/last; we should reconfigure the maple state. You should probably
use mas_set(), which will reset the maple state and set the index and
last to address.
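I.e., something like this (untested):
		/* The area was replaced with another one */
		mas_set(&mas, address);
		goto retry;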
>
> > + goto retry;
> > + }
>
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v3 1/9] mm: add find_vma()-like API but RCU protected and taking VMA lock
2024-06-05 13:33 ` Liam R. Howlett
@ 2024-06-05 16:13 ` Andrii Nakryiko
2024-06-05 16:24 ` Andrii Nakryiko
0 siblings, 1 reply; 31+ messages in thread
From: Andrii Nakryiko @ 2024-06-05 16:13 UTC (permalink / raw)
To: Liam R. Howlett, Matthew Wilcox, Andrii Nakryiko, linux-fsdevel,
brauner, viro, akpm, linux-kernel, bpf, gregkh, linux-mm, surenb,
rppt
On Wed, Jun 5, 2024 at 6:33 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Matthew Wilcox <willy@infradead.org> [240604 20:57]:
> > On Tue, Jun 04, 2024 at 05:24:46PM -0700, Andrii Nakryiko wrote:
> > > +/*
> > > + * find_and_lock_vma_rcu() - Find and lock the VMA for a given address, or the
> > > + * next VMA. Search is done under RCU protection, without taking or assuming
> > > + * mmap_lock. Returned VMA is guaranteed to be stable and not isolated.
> >
> > You know this is supposed to be the _short_ description, right?
> > Three lines is way too long. The full description goes between the
> > arguments and the Return: line.
Sure, I'll adjust.
> >
> > > + * @mm: The mm_struct to check
> > > + * @addr: The address
> > > + *
> > > + * Returns: The VMA associated with addr, or the next VMA.
> > > + * May return %NULL in the case of no VMA at addr or above.
> > > + * If the VMA is being modified and can't be locked, -EBUSY is returned.
> > > + */
> > > +struct vm_area_struct *find_and_lock_vma_rcu(struct mm_struct *mm,
> > > + unsigned long address)
> > > +{
> > > + MA_STATE(mas, &mm->mm_mt, address, address);
> > > + struct vm_area_struct *vma;
> > > + int err;
> > > +
> > > + rcu_read_lock();
> > > +retry:
> > > + vma = mas_find(&mas, ULONG_MAX);
> > > + if (!vma) {
> > > + err = 0; /* no VMA, return NULL */
> > > + goto inval;
> > > + }
> > > +
> > > + if (!vma_start_read(vma)) {
> > > + err = -EBUSY;
> > > + goto inval;
> > > + }
> > > +
> > > + /*
> > > + * Check since vm_start/vm_end might change before we lock the VMA.
> > > + * Note, unlike lock_vma_under_rcu() we are searching for VMA covering
> > > + * address or the next one, so we only make sure VMA wasn't updated to
> > > + * end before the address.
> > > + */
> > > + if (unlikely(vma->vm_end <= address)) {
> > > + err = -EBUSY;
> > > + goto inval_end_read;
> > > + }
> > > +
> > > + /* Check if the VMA got isolated after we found it */
> > > + if (vma->detached) {
> > > + vma_end_read(vma);
> > > + count_vm_vma_lock_event(VMA_LOCK_MISS);
> > > + /* The area was replaced with another one */
> >
> > Surely you need to mas_reset() before you goto retry?
>
> Probably more than that. We've found and may have adjusted the
> index/last; we should reconfigure the maple state. You should probably
> use mas_set(), which will reset the maple state and set the index and
> long to address.
Yep, makes sense, thanks. As for the `unlikely(vma->vm_end <=
address)` case, I presume we want to do the same, right? Basically, on
each retry start from the `address` unconditionally, no matter what's
the reason for retry.
>
>
> >
> > > + goto retry;
> > > + }
> >
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v3 1/9] mm: add find_vma()-like API but RCU protected and taking VMA lock
2024-06-05 16:13 ` Andrii Nakryiko
@ 2024-06-05 16:24 ` Andrii Nakryiko
2024-06-05 16:27 ` Andrii Nakryiko
0 siblings, 1 reply; 31+ messages in thread
From: Andrii Nakryiko @ 2024-06-05 16:24 UTC (permalink / raw)
To: Liam R. Howlett, Matthew Wilcox, Andrii Nakryiko, linux-fsdevel,
brauner, viro, akpm, linux-kernel, bpf, gregkh, linux-mm, surenb,
rppt
On Wed, Jun 5, 2024 at 9:13 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Wed, Jun 5, 2024 at 6:33 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> >
> > * Matthew Wilcox <willy@infradead.org> [240604 20:57]:
> > > On Tue, Jun 04, 2024 at 05:24:46PM -0700, Andrii Nakryiko wrote:
> > > > +/*
> > > > + * find_and_lock_vma_rcu() - Find and lock the VMA for a given address, or the
> > > > + * next VMA. Search is done under RCU protection, without taking or assuming
> > > > + * mmap_lock. Returned VMA is guaranteed to be stable and not isolated.
> > >
> > > You know this is supposed to be the _short_ description, right?
> > > Three lines is way too long. The full description goes between the
> > > arguments and the Return: line.
>
> Sure, I'll adjust.
>
> > >
> > > > + * @mm: The mm_struct to check
> > > > + * @addr: The address
> > > > + *
> > > > + * Returns: The VMA associated with addr, or the next VMA.
> > > > + * May return %NULL in the case of no VMA at addr or above.
> > > > + * If the VMA is being modified and can't be locked, -EBUSY is returned.
> > > > + */
> > > > +struct vm_area_struct *find_and_lock_vma_rcu(struct mm_struct *mm,
> > > > + unsigned long address)
> > > > +{
> > > > + MA_STATE(mas, &mm->mm_mt, address, address);
> > > > + struct vm_area_struct *vma;
> > > > + int err;
> > > > +
> > > > + rcu_read_lock();
> > > > +retry:
> > > > + vma = mas_find(&mas, ULONG_MAX);
> > > > + if (!vma) {
> > > > + err = 0; /* no VMA, return NULL */
> > > > + goto inval;
> > > > + }
> > > > +
> > > > + if (!vma_start_read(vma)) {
> > > > + err = -EBUSY;
> > > > + goto inval;
> > > > + }
> > > > +
> > > > + /*
> > > > + * Check since vm_start/vm_end might change before we lock the VMA.
> > > > + * Note, unlike lock_vma_under_rcu() we are searching for VMA covering
> > > > + * address or the next one, so we only make sure VMA wasn't updated to
> > > > + * end before the address.
> > > > + */
> > > > + if (unlikely(vma->vm_end <= address)) {
> > > > + err = -EBUSY;
> > > > + goto inval_end_read;
> > > > + }
> > > > +
> > > > + /* Check if the VMA got isolated after we found it */
> > > > + if (vma->detached) {
> > > > + vma_end_read(vma);
> > > > + count_vm_vma_lock_event(VMA_LOCK_MISS);
> > > > + /* The area was replaced with another one */
> > >
> > > Surely you need to mas_reset() before you goto retry?
> >
> > Probably more than that. We've found and may have adjusted the
> > index/last; we should reconfigure the maple state. You should probably
> > use mas_set(), which will reset the maple state and set the index and
> > last to address.
>
> Yep, makes sense, thanks. As for the `unlikely(vma->vm_end <=
> address)` case, I presume we want to do the same, right? Basically, on
> each retry start from the `address` unconditionally, no matter what's
> the reason for retry.
ah, never mind, we don't retry in that situation, I'll just put
`mas_set(&mas, address);` right before `goto retry;`. Unless we should
actually retry in the case when VMA got moved before the requested
address, not sure, let me know what you think. Presumably retrying
will allow us to get the correct VMA without the need to fall back to
mmap_lock?
>
> >
> >
> > >
> > > > + goto retry;
> > > > + }
> > >
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v3 1/9] mm: add find_vma()-like API but RCU protected and taking VMA lock
2024-06-05 16:24 ` Andrii Nakryiko
@ 2024-06-05 16:27 ` Andrii Nakryiko
2024-06-05 17:03 ` Liam R. Howlett
0 siblings, 1 reply; 31+ messages in thread
From: Andrii Nakryiko @ 2024-06-05 16:27 UTC (permalink / raw)
To: Liam R. Howlett, Matthew Wilcox, Andrii Nakryiko, linux-fsdevel,
brauner, viro, akpm, linux-kernel, bpf, gregkh, linux-mm, surenb,
rppt
On Wed, Jun 5, 2024 at 9:24 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Wed, Jun 5, 2024 at 9:13 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Wed, Jun 5, 2024 at 6:33 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > >
> > > * Matthew Wilcox <willy@infradead.org> [240604 20:57]:
> > > > On Tue, Jun 04, 2024 at 05:24:46PM -0700, Andrii Nakryiko wrote:
> > > > > +/*
> > > > > + * find_and_lock_vma_rcu() - Find and lock the VMA for a given address, or the
> > > > > + * next VMA. Search is done under RCU protection, without taking or assuming
> > > > > + * mmap_lock. Returned VMA is guaranteed to be stable and not isolated.
> > > >
> > > > You know this is supposed to be the _short_ description, right?
> > > > Three lines is way too long. The full description goes between the
> > > > arguments and the Return: line.
> >
> > Sure, I'll adjust.
> >
> > > >
> > > > > + * @mm: The mm_struct to check
> > > > > + * @addr: The address
> > > > > + *
> > > > > + * Returns: The VMA associated with addr, or the next VMA.
> > > > > + * May return %NULL in the case of no VMA at addr or above.
> > > > > + * If the VMA is being modified and can't be locked, -EBUSY is returned.
> > > > > + */
> > > > > +struct vm_area_struct *find_and_lock_vma_rcu(struct mm_struct *mm,
> > > > > + unsigned long address)
> > > > > +{
> > > > > + MA_STATE(mas, &mm->mm_mt, address, address);
> > > > > + struct vm_area_struct *vma;
> > > > > + int err;
> > > > > +
> > > > > + rcu_read_lock();
> > > > > +retry:
> > > > > + vma = mas_find(&mas, ULONG_MAX);
> > > > > + if (!vma) {
> > > > > + err = 0; /* no VMA, return NULL */
> > > > > + goto inval;
> > > > > + }
> > > > > +
> > > > > + if (!vma_start_read(vma)) {
> > > > > + err = -EBUSY;
> > > > > + goto inval;
> > > > > + }
> > > > > +
> > > > > + /*
> > > > > + * Check since vm_start/vm_end might change before we lock the VMA.
> > > > > + * Note, unlike lock_vma_under_rcu() we are searching for VMA covering
> > > > > + * address or the next one, so we only make sure VMA wasn't updated to
> > > > > + * end before the address.
> > > > > + */
> > > > > + if (unlikely(vma->vm_end <= address)) {
> > > > > + err = -EBUSY;
> > > > > + goto inval_end_read;
> > > > > + }
> > > > > +
> > > > > + /* Check if the VMA got isolated after we found it */
> > > > > + if (vma->detached) {
> > > > > + vma_end_read(vma);
> > > > > + count_vm_vma_lock_event(VMA_LOCK_MISS);
> > > > > + /* The area was replaced with another one */
> > > >
> > > > Surely you need to mas_reset() before you goto retry?
> > >
> > > Probably more than that. We've found and may have adjusted the
> > > index/last; we should reconfigure the maple state. You should probably
> > > use mas_set(), which will reset the maple state and set the index and
> > > last to address.
> >
> > Yep, makes sense, thanks. As for the `unlikely(vma->vm_end <=
> > address)` case, I presume we want to do the same, right? Basically, on
> > each retry start from the `address` unconditionally, no matter what's
> > the reason for retry.
>
> ah, never mind, we don't retry in that situation, I'll just put
> `mas_set(&mas, address);` right before `goto retry;`. Unless we should
> actually retry in the case when VMA got moved before the requested
> address, not sure, let me know what you think. Presumably retrying
> will allow us to get the correct VMA without the need to fall back to
> mmap_lock?
sorry, one more question as I look some more around this (unfamiliar
to me) piece of code. I see that lock_vma_under_rcu counts
VMA_LOCK_MISS on retry, but I see that there is actually a
VMA_LOCK_RETRY stat as well. Any reason it's a MISS instead of RETRY?
Should I use MISS as well, or actually count a RETRY?
>
> >
> > >
> > >
> > > >
> > > > > + goto retry;
> > > > > + }
> > > >
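For illustration, the fix being discussed would land in the retry path
roughly like this (a sketch only, assuming mas_set() resets the maple
state as Liam describes; not the final patch):

	/* Check if the VMA got isolated after we found it */
	if (vma->detached) {
		vma_end_read(vma);
		count_vm_vma_lock_event(VMA_LOCK_MISS);
		/* The area was replaced; restart the walk from address */
		mas_set(&mas, address);
		goto retry;
	}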
* Re: [PATCH v3 1/9] mm: add find_vma()-like API but RCU protected and taking VMA lock
2024-06-05 16:27 ` Andrii Nakryiko
@ 2024-06-05 17:03 ` Liam R. Howlett
2024-06-05 23:22 ` Suren Baghdasaryan
0 siblings, 1 reply; 31+ messages in thread
From: Liam R. Howlett @ 2024-06-05 17:03 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Matthew Wilcox, Andrii Nakryiko, linux-fsdevel, brauner, viro,
akpm, linux-kernel, bpf, gregkh, linux-mm, surenb, rppt
* Andrii Nakryiko <andrii.nakryiko@gmail.com> [240605 12:27]:
> On Wed, Jun 5, 2024 at 9:24 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Wed, Jun 5, 2024 at 9:13 AM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > >
> > > On Wed, Jun 5, 2024 at 6:33 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > > >
> > > > * Matthew Wilcox <willy@infradead.org> [240604 20:57]:
> > > > > On Tue, Jun 04, 2024 at 05:24:46PM -0700, Andrii Nakryiko wrote:
> > > > > > +/*
> > > > > > + * find_and_lock_vma_rcu() - Find and lock the VMA for a given address, or the
> > > > > > + * next VMA. Search is done under RCU protection, without taking or assuming
> > > > > > + * mmap_lock. Returned VMA is guaranteed to be stable and not isolated.
> > > > >
> > > > > You know this is supposed to be the _short_ description, right?
> > > > > Three lines is way too long. The full description goes between the
> > > > > arguments and the Return: line.
> > >
> > > Sure, I'll adjust.
> > >
> > > > >
> > > > > > + * @mm: The mm_struct to check
> > > > > > + * @addr: The address
> > > > > > + *
> > > > > > + * Returns: The VMA associated with addr, or the next VMA.
> > > > > > + * May return %NULL in the case of no VMA at addr or above.
> > > > > > + * If the VMA is being modified and can't be locked, -EBUSY is returned.
> > > > > > + */
> > > > > > +struct vm_area_struct *find_and_lock_vma_rcu(struct mm_struct *mm,
> > > > > > + unsigned long address)
> > > > > > +{
> > > > > > + MA_STATE(mas, &mm->mm_mt, address, address);
> > > > > > + struct vm_area_struct *vma;
> > > > > > + int err;
> > > > > > +
> > > > > > + rcu_read_lock();
> > > > > > +retry:
> > > > > > + vma = mas_find(&mas, ULONG_MAX);
> > > > > > + if (!vma) {
> > > > > > + err = 0; /* no VMA, return NULL */
> > > > > > + goto inval;
> > > > > > + }
> > > > > > +
> > > > > > + if (!vma_start_read(vma)) {
> > > > > > + err = -EBUSY;
> > > > > > + goto inval;
> > > > > > + }
> > > > > > +
> > > > > > + /*
> > > > > > + * Check since vm_start/vm_end might change before we lock the VMA.
> > > > > > + * Note, unlike lock_vma_under_rcu() we are searching for VMA covering
> > > > > > + * address or the next one, so we only make sure VMA wasn't updated to
> > > > > > + * end before the address.
> > > > > > + */
> > > > > > + if (unlikely(vma->vm_end <= address)) {
> > > > > > + err = -EBUSY;
> > > > > > + goto inval_end_read;
> > > > > > + }
> > > > > > +
> > > > > > + /* Check if the VMA got isolated after we found it */
> > > > > > + if (vma->detached) {
> > > > > > + vma_end_read(vma);
> > > > > > + count_vm_vma_lock_event(VMA_LOCK_MISS);
> > > > > > + /* The area was replaced with another one */
> > > > >
> > > > > Surely you need to mas_reset() before you goto retry?
> > > >
> > > > Probably more than that. We've found and may have adjusted the
> > > > index/last; we should reconfigure the maple state. You should probably
> > > > use mas_set(), which will reset the maple state and set the index and
> > > > last to address.
> > >
> > > Yep, makes sense, thanks. As for the `unlikely(vma->vm_end <=
> > > address)` case, I presume we want to do the same, right? Basically, on
> > > each retry start from the `address` unconditionally, no matter what's
> > > the reason for retry.
> >
> > ah, never mind, we don't retry in that situation, I'll just put
> > `mas_set(&mas, address);` right before `goto retry;`. Unless we should
> > actually retry in the case when VMA got moved before the requested
> > address, not sure, let me know what you think. Presumably retrying
> > will allow us to get the correct VMA without the need to fall back to
> > mmap_lock?
>
> sorry, one more question as I look some more around this (unfamiliar
> to me) piece of code. I see that lock_vma_under_rcu counts
> VMA_LOCK_MISS on retry, but I see that there is actually a
> VMA_LOCK_RETRY stat as well. Any reason it's a MISS instead of RETRY?
> Should I use MISS as well, or actually count a RETRY?
>
VMA_LOCK_MISS is used here because we missed the VMA due to a write
happening to move the vma (rather rare). The VMA_LOCK missed the vma.
VMA_LOCK_RETRY is used to indicate we need to retry under the mmap lock.
A retry is needed after the VMA_LOCK did not work under rcu locking.
Thanks,
Liam
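To illustrate the distinction, simplified from the page fault path (a
sketch, not verbatim kernel code):

	/*
	 * VMA_LOCK_MISS is counted inside lock_vma_under_rcu() when the
	 * lockless walk has to restart because the vma moved under it.
	 */
	vma = lock_vma_under_rcu(mm, address);
	if (!vma)
		goto retry_under_mmap_lock;
	fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
	if (fault & VM_FAULT_RETRY)
		/* the per-VMA path did not work; retry under mmap_lock */
		count_vm_vma_lock_event(VMA_LOCK_RETRY);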
* Re: [PATCH v3 4/9] fs/procfs: use per-VMA RCU-protected locking in PROCMAP_QUERY API
2024-06-05 0:24 ` [PATCH v3 4/9] fs/procfs: use per-VMA RCU-protected locking in PROCMAP_QUERY API Andrii Nakryiko
@ 2024-06-05 23:15 ` Suren Baghdasaryan
2024-06-06 16:51 ` Andrii Nakryiko
0 siblings, 1 reply; 31+ messages in thread
From: Suren Baghdasaryan @ 2024-06-05 23:15 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: linux-fsdevel, brauner, viro, akpm, linux-kernel, bpf, gregkh,
linux-mm, liam.howlett, rppt
On Tue, Jun 4, 2024 at 5:25 PM Andrii Nakryiko <andrii@kernel.org> wrote:
>
> Attempt to use RCU-protected per-VMA lock when looking up requested VMA
> as much as possible, only falling back to mmap_lock if per-VMA lock
> failed. This is done so that querying of VMAs doesn't interfere with
> other critical tasks, like page fault handling.
>
> This has been suggested by mm folks, and we make use of a newly added
> internal API that works like find_vma(), but tries to use per-VMA lock.
>
> We have two sets of setup/query/teardown helper functions with different
> implementations depending on availability of per-VMA lock (conditioned
> on CONFIG_PER_VMA_LOCK) to abstract per-VMA lock subtleties.
>
> When per-VMA lock is available, lookup is done under RCU, attempting to
> take a per-VMA lock. If that fails, we fallback to mmap_lock, but then
> proceed to unconditionally grab per-VMA lock again, dropping mmap_lock
> immediately. In this configuration mmap_lock is never held for long,
> minimizing disruptions while querying.
>
> When per-VMA lock is compiled out, we take mmap_lock once, query VMAs
> using find_vma() API, and then unlock mmap_lock at the very end once as
> well. In this setup we avoid locking/unlocking mmap_lock on every looked
> up VMA (depending on query parameters we might need to iterate a few of
> them).
>
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> ---
> fs/proc/task_mmu.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 46 insertions(+)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 614fbe5d0667..140032ffc551 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -388,6 +388,49 @@ static int pid_maps_open(struct inode *inode, struct file *file)
> PROCMAP_QUERY_VMA_FLAGS \
> )
>
> +#ifdef CONFIG_PER_VMA_LOCK
> +static int query_vma_setup(struct mm_struct *mm)
> +{
> + /* in the presence of per-VMA lock we don't need any setup/teardown */
> + return 0;
> +}
> +
> +static void query_vma_teardown(struct mm_struct *mm, struct vm_area_struct *vma)
> +{
> + /* in the presence of per-VMA lock we need to unlock vma, if present */
> + if (vma)
> + vma_end_read(vma);
> +}
> +
> +static struct vm_area_struct *query_vma_find_by_addr(struct mm_struct *mm, unsigned long addr)
> +{
> + struct vm_area_struct *vma;
> +
> + /* try to use less disruptive per-VMA lock */
> + vma = find_and_lock_vma_rcu(mm, addr);
> + if (IS_ERR(vma)) {
> + /* failed to take per-VMA lock, fallback to mmap_lock */
> + if (mmap_read_lock_killable(mm))
> + return ERR_PTR(-EINTR);
> +
> + vma = find_vma(mm, addr);
> + if (vma) {
> + /*
> + * We cannot use vma_start_read() as it may fail due to
> + * a false-locked result (see comment in vma_start_read()). We
> + * can avoid that by directly locking vm_lock under
> + * mmap_lock, which guarantees that nobody can lock the
> + * vma for write (vma_start_write()) under us.
> + */
> + down_read(&vma->vm_lock->lock);
Hi Andrii,
The above pattern of locking VMA under mmap_lock and then dropping
mmap_lock is becoming more common. Matthew had an RFC proposal for an
API to do this here:
https://lore.kernel.org/all/ZivhG0yrbpFqORDw@casper.infradead.org/. It
might be worth reviving that discussion.
> + }
> +
> + mmap_read_unlock(mm);
Later on in your code you are calling get_vma_name() which might call
anon_vma_name() to retrieve user-defined VMA name. After this patch
this operation will be done without holding mmap_lock, however per
https://elixir.bootlin.com/linux/latest/source/include/linux/mm_types.h#L582
this function has to be called with mmap_lock held for read. Indeed
with debug flags enabled you should hit this assertion:
https://elixir.bootlin.com/linux/latest/source/mm/madvise.c#L96.
> + }
> +
> + return vma;
> +}
> +#else
> static int query_vma_setup(struct mm_struct *mm)
> {
> return mmap_read_lock_killable(mm);
> @@ -402,6 +445,7 @@ static struct vm_area_struct *query_vma_find_by_addr(struct mm_struct *mm, unsig
> {
> return find_vma(mm, addr);
> }
> +#endif
>
> static struct vm_area_struct *query_matching_vma(struct mm_struct *mm,
> unsigned long addr, u32 flags)
> @@ -441,8 +485,10 @@ static struct vm_area_struct *query_matching_vma(struct mm_struct *mm,
> skip_vma:
> /*
> * If the user needs closest matching VMA, keep iterating.
> + * But before we proceed we might need to unlock current VMA.
> */
> addr = vma->vm_end;
> + vma_end_read(vma); /* no-op under !CONFIG_PER_VMA_LOCK */
> if (flags & PROCMAP_QUERY_COVERING_OR_NEXT_VMA)
> goto next_vma;
> no_vma:
> --
> 2.43.0
>
* Re: [PATCH v3 1/9] mm: add find_vma()-like API but RCU protected and taking VMA lock
2024-06-05 17:03 ` Liam R. Howlett
@ 2024-06-05 23:22 ` Suren Baghdasaryan
2024-06-06 16:51 ` Andrii Nakryiko
0 siblings, 1 reply; 31+ messages in thread
From: Suren Baghdasaryan @ 2024-06-05 23:22 UTC (permalink / raw)
To: Liam R. Howlett, Andrii Nakryiko, Matthew Wilcox, Andrii Nakryiko,
linux-fsdevel, brauner, viro, akpm, linux-kernel, bpf, gregkh,
linux-mm, surenb, rppt
On Wed, Jun 5, 2024 at 10:03 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Andrii Nakryiko <andrii.nakryiko@gmail.com> [240605 12:27]:
> > On Wed, Jun 5, 2024 at 9:24 AM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > >
> > > On Wed, Jun 5, 2024 at 9:13 AM Andrii Nakryiko
> > > <andrii.nakryiko@gmail.com> wrote:
> > > >
> > > > On Wed, Jun 5, 2024 at 6:33 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > > > >
> > > > > * Matthew Wilcox <willy@infradead.org> [240604 20:57]:
> > > > > > On Tue, Jun 04, 2024 at 05:24:46PM -0700, Andrii Nakryiko wrote:
> > > > > > > +/*
> > > > > > > + * find_and_lock_vma_rcu() - Find and lock the VMA for a given address, or the
> > > > > > > + * next VMA. Search is done under RCU protection, without taking or assuming
> > > > > > > + * mmap_lock. Returned VMA is guaranteed to be stable and not isolated.
> > > > > >
> > > > > > You know this is supposed to be the _short_ description, right?
> > > > > > Three lines is way too long. The full description goes between the
> > > > > > arguments and the Return: line.
> > > >
> > > > Sure, I'll adjust.
> > > >
> > > > > >
> > > > > > > + * @mm: The mm_struct to check
> > > > > > > + * @addr: The address
> > > > > > > + *
> > > > > > > + * Returns: The VMA associated with addr, or the next VMA.
> > > > > > > + * May return %NULL in the case of no VMA at addr or above.
> > > > > > > + * If the VMA is being modified and can't be locked, -EBUSY is returned.
> > > > > > > + */
> > > > > > > +struct vm_area_struct *find_and_lock_vma_rcu(struct mm_struct *mm,
> > > > > > > + unsigned long address)
> > > > > > > +{
> > > > > > > + MA_STATE(mas, &mm->mm_mt, address, address);
> > > > > > > + struct vm_area_struct *vma;
> > > > > > > + int err;
> > > > > > > +
> > > > > > > + rcu_read_lock();
> > > > > > > +retry:
> > > > > > > + vma = mas_find(&mas, ULONG_MAX);
> > > > > > > + if (!vma) {
> > > > > > > + err = 0; /* no VMA, return NULL */
> > > > > > > + goto inval;
> > > > > > > + }
> > > > > > > +
> > > > > > > + if (!vma_start_read(vma)) {
> > > > > > > + err = -EBUSY;
> > > > > > > + goto inval;
> > > > > > > + }
> > > > > > > +
> > > > > > > + /*
> > > > > > > + * Check since vm_start/vm_end might change before we lock the VMA.
> > > > > > > + * Note, unlike lock_vma_under_rcu() we are searching for VMA covering
> > > > > > > + * address or the next one, so we only make sure VMA wasn't updated to
> > > > > > > + * end before the address.
> > > > > > > + */
> > > > > > > + if (unlikely(vma->vm_end <= address)) {
> > > > > > > + err = -EBUSY;
> > > > > > > + goto inval_end_read;
> > > > > > > + }
> > > > > > > +
> > > > > > > + /* Check if the VMA got isolated after we found it */
> > > > > > > + if (vma->detached) {
> > > > > > > + vma_end_read(vma);
> > > > > > > + count_vm_vma_lock_event(VMA_LOCK_MISS);
> > > > > > > + /* The area was replaced with another one */
> > > > > >
> > > > > > Surely you need to mas_reset() before you goto retry?
> > > > >
> > > > > Probably more than that. We've found and may have adjusted the
> > > > > index/last; we should reconfigure the maple state. You should probably
> > > > > use mas_set(), which will reset the maple state and set the index and
> > > > > > last to address.
> > > >
> > > > Yep, makes sense, thanks. As for the `unlikely(vma->vm_end <=
> > > > address)` case, I presume we want to do the same, right? Basically, on
> > > > each retry start from the `address` unconditionally, no matter what's
> > > > the reason for retry.
> > >
> > > ah, never mind, we don't retry in that situation, I'll just put
> > > `mas_set(&mas, address);` right before `goto retry;`. Unless we should
> > > actually retry in the case when VMA got moved before the requested
> > > address, not sure, let me know what you think. Presumably retrying
> > > will allow us to get the correct VMA without the need to fall back to
> > > mmap_lock?
> >
> > sorry, one more question as I look some more around this (unfamiliar
> > to me) piece of code. I see that lock_vma_under_rcu counts
> > VMA_LOCK_MISS on retry, but I see that there is actually a
> > VMA_LOCK_RETRY stat as well. Any reason it's a MISS instead of RETRY?
> > Should I use MISS as well, or actually count a RETRY?
> >
>
> VMA_LOCK_MISS is used here because we missed the VMA due to a write
> happening to move the vma (rather rare). The VMA_LOCK missed the vma.
>
> VMA_LOCK_RETRY is used to indicate we need to retry under the mmap lock.
> A retry is needed after the VMA_LOCK did not work under rcu locking.
Originally lock_vma_under_rcu() was used only inside page fault path,
so these counters helped us quantify how effective VMA locking is when
handling page faults. With more users of that function these counters
will be affected by other paths as well. I'm not sure but I think it
makes sense to use them only inside page fault path, IOW we should
probably move count_vm_vma_lock_event() calls outside of
lock_vma_under_rcu() and add them only when handling page faults.
>
> Thanks,
> Liam
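What Suren suggests would look something like this at the fault-path
call site (a hypothetical placement; the exact shape was left open):

	/* lock_vma_under_rcu() itself would stop counting events ... */
	vma = lock_vma_under_rcu(mm, address);
	if (!vma) {
		/* ... and the fault handler would count them instead */
		count_vm_vma_lock_event(VMA_LOCK_ABORT);
		goto lock_mmap;
	}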
* Re: [PATCH v3 4/9] fs/procfs: use per-VMA RCU-protected locking in PROCMAP_QUERY API
2024-06-05 23:15 ` Suren Baghdasaryan
@ 2024-06-06 16:51 ` Andrii Nakryiko
2024-06-06 17:12 ` Suren Baghdasaryan
2024-06-06 17:15 ` Liam R. Howlett
0 siblings, 2 replies; 31+ messages in thread
From: Andrii Nakryiko @ 2024-06-06 16:51 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Andrii Nakryiko, linux-fsdevel, brauner, viro, akpm, linux-kernel,
bpf, gregkh, linux-mm, liam.howlett, rppt
On Wed, Jun 5, 2024 at 4:16 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Tue, Jun 4, 2024 at 5:25 PM Andrii Nakryiko <andrii@kernel.org> wrote:
> >
> > Attempt to use RCU-protected per-VMA lock when looking up requested VMA
> > as much as possible, only falling back to mmap_lock if per-VMA lock
> > failed. This is done so that querying of VMAs doesn't interfere with
> > other critical tasks, like page fault handling.
> >
> > This has been suggested by mm folks, and we make use of a newly added
> > internal API that works like find_vma(), but tries to use per-VMA lock.
> >
> > We have two sets of setup/query/teardown helper functions with different
> > implementations depending on availability of per-VMA lock (conditioned
> > on CONFIG_PER_VMA_LOCK) to abstract per-VMA lock subtleties.
> >
> > When per-VMA lock is available, lookup is done under RCU, attempting to
> > take a per-VMA lock. If that fails, we fallback to mmap_lock, but then
> > proceed to unconditionally grab per-VMA lock again, dropping mmap_lock
> > immediately. In this configuration mmap_lock is never held for long,
> > minimizing disruptions while querying.
> >
> > When per-VMA lock is compiled out, we take mmap_lock once, query VMAs
> > using find_vma() API, and then unlock mmap_lock at the very end once as
> > well. In this setup we avoid locking/unlocking mmap_lock on every looked
> > up VMA (depending on query parameters we might need to iterate a few of
> > them).
> >
> > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > ---
> > fs/proc/task_mmu.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 46 insertions(+)
> >
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index 614fbe5d0667..140032ffc551 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -388,6 +388,49 @@ static int pid_maps_open(struct inode *inode, struct file *file)
> > PROCMAP_QUERY_VMA_FLAGS \
> > )
> >
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +static int query_vma_setup(struct mm_struct *mm)
> > +{
> > + /* in the presence of per-VMA lock we don't need any setup/teardown */
> > + return 0;
> > +}
> > +
> > +static void query_vma_teardown(struct mm_struct *mm, struct vm_area_struct *vma)
> > +{
> > + /* in the presence of per-VMA lock we need to unlock vma, if present */
> > + if (vma)
> > + vma_end_read(vma);
> > +}
> > +
> > +static struct vm_area_struct *query_vma_find_by_addr(struct mm_struct *mm, unsigned long addr)
> > +{
> > + struct vm_area_struct *vma;
> > +
> > + /* try to use less disruptive per-VMA lock */
> > + vma = find_and_lock_vma_rcu(mm, addr);
> > + if (IS_ERR(vma)) {
> > + /* failed to take per-VMA lock, fallback to mmap_lock */
> > + if (mmap_read_lock_killable(mm))
> > + return ERR_PTR(-EINTR);
> > +
> > + vma = find_vma(mm, addr);
> > + if (vma) {
> > + /*
> > + * We cannot use vma_start_read() as it may fail due to
> > + * a false-locked result (see comment in vma_start_read()). We
> > + * can avoid that by directly locking vm_lock under
> > + * mmap_lock, which guarantees that nobody can lock the
> > + * vma for write (vma_start_write()) under us.
> > + */
> > + down_read(&vma->vm_lock->lock);
>
> Hi Andrii,
> The above pattern of locking VMA under mmap_lock and then dropping
> mmap_lock is becoming more common. Matthew had an RFC proposal for an
> API to do this here:
> https://lore.kernel.org/all/ZivhG0yrbpFqORDw@casper.infradead.org/. It
> might be worth reviving that discussion.
Sure, it would be nice to have generic and blessed primitives to use
here. But the good news is that once this is all figured out by you mm
folks, it should be easy to make use of those primitives here, right?
>
> > + }
> > +
> > + mmap_read_unlock(mm);
>
> Later on in your code you are calling get_vma_name() which might call
> anon_vma_name() to retrieve user-defined VMA name. After this patch
> this operation will be done without holding mmap_lock, however per
> https://elixir.bootlin.com/linux/latest/source/include/linux/mm_types.h#L582
> this function has to be called with mmap_lock held for read. Indeed
> with debug flags enabled you should hit this assertion:
> https://elixir.bootlin.com/linux/latest/source/mm/madvise.c#L96.
Sigh... Ok, what's the suggestion then? Should it be some variant of
mmap_assert_locked() || vma_assert_locked() logic, or is it not so
simple?
Maybe I should just drop the CONFIG_PER_VMA_LOCK changes for now until
all these gotchas are figured out for /proc/<pid>/maps anyway, and
then we can adapt both text-based and ioctl-based /proc/<pid>/maps
APIs on top of whatever the final approach will end up being the right
one?
Liam, any objections to this? The whole point of this patch set is to
add a new API, not all the CONFIG_PER_VMA_LOCK gotchas. My
implementation is structured in a way that should be easily amenable
to CONFIG_PER_VMA_LOCK changes, but if there are a few more subtle
things that need to be figured out for existing text-based
/proc/<pid>/maps anyway, I think it would be best to use mmap_lock
for now for this new API, and then adopt the same final
CONFIG_PER_VMA_LOCK-aware solution.
>
> > + }
> > +
> > + return vma;
> > +}
> > +#else
> > static int query_vma_setup(struct mm_struct *mm)
> > {
> > return mmap_read_lock_killable(mm);
> > @@ -402,6 +445,7 @@ static struct vm_area_struct *query_vma_find_by_addr(struct mm_struct *mm, unsig
> > {
> > return find_vma(mm, addr);
> > }
> > +#endif
> >
> > static struct vm_area_struct *query_matching_vma(struct mm_struct *mm,
> > unsigned long addr, u32 flags)
> > @@ -441,8 +485,10 @@ static struct vm_area_struct *query_matching_vma(struct mm_struct *mm,
> > skip_vma:
> > /*
> > * If the user needs closest matching VMA, keep iterating.
> > + * But before we proceed we might need to unlock current VMA.
> > */
> > addr = vma->vm_end;
> > + vma_end_read(vma); /* no-op under !CONFIG_PER_VMA_LOCK */
> > if (flags & PROCMAP_QUERY_COVERING_OR_NEXT_VMA)
> > goto next_vma;
> > no_vma:
> > --
> > 2.43.0
> >
* Re: [PATCH v3 1/9] mm: add find_vma()-like API but RCU protected and taking VMA lock
2024-06-05 23:22 ` Suren Baghdasaryan
@ 2024-06-06 16:51 ` Andrii Nakryiko
2024-06-06 17:13 ` Suren Baghdasaryan
0 siblings, 1 reply; 31+ messages in thread
From: Andrii Nakryiko @ 2024-06-06 16:51 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Liam R. Howlett, Matthew Wilcox, Andrii Nakryiko, linux-fsdevel,
brauner, viro, akpm, linux-kernel, bpf, gregkh, linux-mm, rppt
On Wed, Jun 5, 2024 at 4:22 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Wed, Jun 5, 2024 at 10:03 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> >
> > * Andrii Nakryiko <andrii.nakryiko@gmail.com> [240605 12:27]:
> > > On Wed, Jun 5, 2024 at 9:24 AM Andrii Nakryiko
> > > <andrii.nakryiko@gmail.com> wrote:
> > > >
> > > > On Wed, Jun 5, 2024 at 9:13 AM Andrii Nakryiko
> > > > <andrii.nakryiko@gmail.com> wrote:
> > > > >
> > > > > On Wed, Jun 5, 2024 at 6:33 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > > > > >
> > > > > > * Matthew Wilcox <willy@infradead.org> [240604 20:57]:
> > > > > > > On Tue, Jun 04, 2024 at 05:24:46PM -0700, Andrii Nakryiko wrote:
> > > > > > > > +/*
> > > > > > > > + * find_and_lock_vma_rcu() - Find and lock the VMA for a given address, or the
> > > > > > > > + * next VMA. Search is done under RCU protection, without taking or assuming
> > > > > > > > + * mmap_lock. Returned VMA is guaranteed to be stable and not isolated.
> > > > > > >
> > > > > > > You know this is supposed to be the _short_ description, right?
> > > > > > > Three lines is way too long. The full description goes between the
> > > > > > > arguments and the Return: line.
> > > > >
> > > > > Sure, I'll adjust.
> > > > >
> > > > > > >
> > > > > > > > + * @mm: The mm_struct to check
> > > > > > > > + * @addr: The address
> > > > > > > > + *
> > > > > > > > + * Returns: The VMA associated with addr, or the next VMA.
> > > > > > > > + * May return %NULL in the case of no VMA at addr or above.
> > > > > > > > + * If the VMA is being modified and can't be locked, -EBUSY is returned.
> > > > > > > > + */
> > > > > > > > +struct vm_area_struct *find_and_lock_vma_rcu(struct mm_struct *mm,
> > > > > > > > + unsigned long address)
> > > > > > > > +{
> > > > > > > > + MA_STATE(mas, &mm->mm_mt, address, address);
> > > > > > > > + struct vm_area_struct *vma;
> > > > > > > > + int err;
> > > > > > > > +
> > > > > > > > + rcu_read_lock();
> > > > > > > > +retry:
> > > > > > > > + vma = mas_find(&mas, ULONG_MAX);
> > > > > > > > + if (!vma) {
> > > > > > > > + err = 0; /* no VMA, return NULL */
> > > > > > > > + goto inval;
> > > > > > > > + }
> > > > > > > > +
> > > > > > > > + if (!vma_start_read(vma)) {
> > > > > > > > + err = -EBUSY;
> > > > > > > > + goto inval;
> > > > > > > > + }
> > > > > > > > +
> > > > > > > > + /*
> > > > > > > > + * Check since vm_start/vm_end might change before we lock the VMA.
> > > > > > > > + * Note, unlike lock_vma_under_rcu() we are searching for VMA covering
> > > > > > > > + * address or the next one, so we only make sure VMA wasn't updated to
> > > > > > > > + * end before the address.
> > > > > > > > + */
> > > > > > > > + if (unlikely(vma->vm_end <= address)) {
> > > > > > > > + err = -EBUSY;
> > > > > > > > + goto inval_end_read;
> > > > > > > > + }
> > > > > > > > +
> > > > > > > > + /* Check if the VMA got isolated after we found it */
> > > > > > > > + if (vma->detached) {
> > > > > > > > + vma_end_read(vma);
> > > > > > > > + count_vm_vma_lock_event(VMA_LOCK_MISS);
> > > > > > > > + /* The area was replaced with another one */
> > > > > > >
> > > > > > > Surely you need to mas_reset() before you goto retry?
> > > > > >
> > > > > > Probably more than that. We've found and may have adjusted the
> > > > > > index/last; we should reconfigure the maple state. You should probably
> > > > > > use mas_set(), which will reset the maple state and set the index and
> > > > > > > last to address.
> > > > >
> > > > > Yep, makes sense, thanks. As for the `unlikely(vma->vm_end <=
> > > > > address)` case, I presume we want to do the same, right? Basically, on
> > > > > each retry start from the `address` unconditionally, no matter what's
> > > > > the reason for retry.
> > > >
> > > > ah, never mind, we don't retry in that situation, I'll just put
> > > > `mas_set(&mas, address);` right before `goto retry;`. Unless we should
> > > > actually retry in the case when VMA got moved before the requested
> > > > address, not sure, let me know what you think. Presumably retrying
> > > > will allow us to get the correct VMA without the need to fall back to
> > > > mmap_lock?
> > >
> > > sorry, one more question as I look some more around this (unfamiliar
> > > to me) piece of code. I see that lock_vma_under_rcu counts
> > > VMA_LOCK_MISS on retry, but I see that there is actually a
> > > VMA_LOCK_RETRY stat as well. Any reason it's a MISS instead of RETRY?
> > > Should I use MISS as well, or actually count a RETRY?
> > >
> >
> > VMA_LOCK_MISS is used here because we missed the VMA due to a write
> > happening to move the vma (rather rare). The VMA_LOCK missed the vma.
> >
> > VMA_LOCK_RETRY is used to indicate we need to retry under the mmap lock.
> > A retry is needed after the VMA_LOCK did not work under rcu locking.
>
> Originally lock_vma_under_rcu() was used only inside page fault path,
> so these counters helped us quantify how effective VMA locking is when
> handling page faults. With more users of that function these counters
> will be affected by other paths as well. I'm not sure but I think it
> makes sense to use them only inside page fault path, IOW we should
> probably move count_vm_vma_lock_event() calls outside of
> lock_vma_under_rcu() and add them only when handling page faults.
Alright, seems like I should then just drop count_vm_vma_lock_event()
from the API I'm adding.
>
> >
> > Thanks,
> > Liam
* Re: [PATCH v3 4/9] fs/procfs: use per-VMA RCU-protected locking in PROCMAP_QUERY API
2024-06-06 16:51 ` Andrii Nakryiko
@ 2024-06-06 17:12 ` Suren Baghdasaryan
2024-06-06 18:03 ` Andrii Nakryiko
2024-06-06 17:15 ` Liam R. Howlett
1 sibling, 1 reply; 31+ messages in thread
From: Suren Baghdasaryan @ 2024-06-06 17:12 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Andrii Nakryiko, linux-fsdevel, brauner, viro, akpm, linux-kernel,
bpf, gregkh, linux-mm, liam.howlett, rppt
On Thu, Jun 6, 2024 at 9:52 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Wed, Jun 5, 2024 at 4:16 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Tue, Jun 4, 2024 at 5:25 PM Andrii Nakryiko <andrii@kernel.org> wrote:
> > >
> > > Attempt to use RCU-protected per-VMA lock when looking up requested VMA
> > > as much as possible, only falling back to mmap_lock if per-VMA lock
> > > failed. This is done so that querying of VMAs doesn't interfere with
> > > other critical tasks, like page fault handling.
> > >
> > > This has been suggested by mm folks, and we make use of a newly added
> > > internal API that works like find_vma(), but tries to use per-VMA lock.
> > >
> > > We have two sets of setup/query/teardown helper functions with different
> > > implementations depending on availability of per-VMA lock (conditioned
> > > on CONFIG_PER_VMA_LOCK) to abstract per-VMA lock subtleties.
> > >
> > > When per-VMA lock is available, lookup is done under RCU, attempting to
> > > take a per-VMA lock. If that fails, we fallback to mmap_lock, but then
> > > proceed to unconditionally grab per-VMA lock again, dropping mmap_lock
> > > immediately. In this configuration mmap_lock is never held for long,
> > > minimizing disruptions while querying.
> > >
> > > When per-VMA lock is compiled out, we take mmap_lock once, query VMAs
> > > using find_vma() API, and then unlock mmap_lock at the very end once as
> > > well. In this setup we avoid locking/unlocking mmap_lock on every looked
> > > up VMA (depending on query parameters we might need to iterate a few of
> > > them).
> > >
> > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > > ---
> > > fs/proc/task_mmu.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
> > > 1 file changed, 46 insertions(+)
> > >
> > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > index 614fbe5d0667..140032ffc551 100644
> > > --- a/fs/proc/task_mmu.c
> > > +++ b/fs/proc/task_mmu.c
> > > @@ -388,6 +388,49 @@ static int pid_maps_open(struct inode *inode, struct file *file)
> > > PROCMAP_QUERY_VMA_FLAGS \
> > > )
> > >
> > > +#ifdef CONFIG_PER_VMA_LOCK
> > > +static int query_vma_setup(struct mm_struct *mm)
> > > +{
> > > + /* in the presence of per-VMA lock we don't need any setup/teardown */
> > > + return 0;
> > > +}
> > > +
> > > +static void query_vma_teardown(struct mm_struct *mm, struct vm_area_struct *vma)
> > > +{
> > > + /* in the presence of per-VMA lock we need to unlock vma, if present */
> > > + if (vma)
> > > + vma_end_read(vma);
> > > +}
> > > +
> > > +static struct vm_area_struct *query_vma_find_by_addr(struct mm_struct *mm, unsigned long addr)
> > > +{
> > > + struct vm_area_struct *vma;
> > > +
> > > + /* try to use less disruptive per-VMA lock */
> > > + vma = find_and_lock_vma_rcu(mm, addr);
> > > + if (IS_ERR(vma)) {
> > > + /* failed to take per-VMA lock, fallback to mmap_lock */
> > > + if (mmap_read_lock_killable(mm))
> > > + return ERR_PTR(-EINTR);
> > > +
> > > + vma = find_vma(mm, addr);
> > > + if (vma) {
> > > + /*
> > > + * We cannot use vma_start_read() as it may fail due to
> > > + * a false-locked result (see comment in vma_start_read()). We
> > > + * can avoid that by directly locking vm_lock under
> > > + * mmap_lock, which guarantees that nobody can lock the
> > > + * vma for write (vma_start_write()) under us.
> > > + */
> > > + down_read(&vma->vm_lock->lock);
> >
> > Hi Andrii,
> > The above pattern of locking VMA under mmap_lock and then dropping
> > mmap_lock is becoming more common. Matthew had an RFC proposal for an
> > API to do this here:
> > https://lore.kernel.org/all/ZivhG0yrbpFqORDw@casper.infradead.org/. It
> > might be worth reviving that discussion.
>
> Sure, it would be nice to have generic and blessed primitives to use
> here. But the good news is that once this is all figured out by you mm
> folks, it should be easy to make use of those primitives here, right?
>
> >
> > > + }
> > > +
> > > + mmap_read_unlock(mm);
> >
> > Later on in your code you are calling get_vma_name() which might call
> > anon_vma_name() to retrieve user-defined VMA name. After this patch
> > this operation will be done without holding mmap_lock, however per
> > https://elixir.bootlin.com/linux/latest/source/include/linux/mm_types.h#L582
> > this function has to be called with mmap_lock held for read. Indeed
> > with debug flags enabled you should hit this assertion:
> > https://elixir.bootlin.com/linux/latest/source/mm/madvise.c#L96.
>
> Sigh... Ok, what's the suggestion then? Should it be some variant of
> mmap_assert_locked() || vma_assert_locked() logic, or is it not so
> simple?
>
> Maybe I should just drop the CONFIG_PER_VMA_LOCK changes for now until
> all these gotchas are figured out for /proc/<pid>/maps anyway, and
> then we can adapt both text-based and ioctl-based /proc/<pid>/maps
> APIs on top of whatever the final approach will end up being the right
> one?
>
> Liam, any objections to this? The whole point of this patch set is to
> add a new API, not all the CONFIG_PER_VMA_LOCK gotchas. My
> implementation is structured in a way that should be easily amenable
> to CONFIG_PER_VMA_LOCK changes, but if there are a few more subtle
> things that need to be figured out for existing text-based
> /proc/<pid>/maps anyway, I think it would be best to use mmap_lock
> for now for this new API, and then adopt the same final
> CONFIG_PER_VMA_LOCK-aware solution.
I agree that you should start simple, using mmap_lock first and then
work on improvements. Would the proposed solution become useless with
coarse mmap_lock'ing?
>
> >
> > > + }
> > > +
> > > + return vma;
> > > +}
> > > +#else
> > > static int query_vma_setup(struct mm_struct *mm)
> > > {
> > > return mmap_read_lock_killable(mm);
> > > @@ -402,6 +445,7 @@ static struct vm_area_struct *query_vma_find_by_addr(struct mm_struct *mm, unsig
> > > {
> > > return find_vma(mm, addr);
> > > }
> > > +#endif
> > >
> > > static struct vm_area_struct *query_matching_vma(struct mm_struct *mm,
> > > unsigned long addr, u32 flags)
> > > @@ -441,8 +485,10 @@ static struct vm_area_struct *query_matching_vma(struct mm_struct *mm,
> > > skip_vma:
> > > /*
> > > * If the user needs closest matching VMA, keep iterating.
> > > + * But before we proceed we might need to unlock current VMA.
> > > */
> > > addr = vma->vm_end;
> > > + vma_end_read(vma); /* no-op under !CONFIG_PER_VMA_LOCK */
> > > if (flags & PROCMAP_QUERY_COVERING_OR_NEXT_VMA)
> > > goto next_vma;
> > > no_vma:
> > > --
> > > 2.43.0
> > >
* Re: [PATCH v3 1/9] mm: add find_vma()-like API but RCU protected and taking VMA lock
2024-06-06 16:51 ` Andrii Nakryiko
@ 2024-06-06 17:13 ` Suren Baghdasaryan
0 siblings, 0 replies; 31+ messages in thread
From: Suren Baghdasaryan @ 2024-06-06 17:13 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Liam R. Howlett, Matthew Wilcox, Andrii Nakryiko, linux-fsdevel,
brauner, viro, akpm, linux-kernel, bpf, gregkh, linux-mm, rppt
On Thu, Jun 6, 2024 at 9:52 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Wed, Jun 5, 2024 at 4:22 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Wed, Jun 5, 2024 at 10:03 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > >
> > > * Andrii Nakryiko <andrii.nakryiko@gmail.com> [240605 12:27]:
> > > > On Wed, Jun 5, 2024 at 9:24 AM Andrii Nakryiko
> > > > <andrii.nakryiko@gmail.com> wrote:
> > > > >
> > > > > On Wed, Jun 5, 2024 at 9:13 AM Andrii Nakryiko
> > > > > <andrii.nakryiko@gmail.com> wrote:
> > > > > >
> > > > > > On Wed, Jun 5, 2024 at 6:33 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > > > > > >
> > > > > > > * Matthew Wilcox <willy@infradead.org> [240604 20:57]:
> > > > > > > > On Tue, Jun 04, 2024 at 05:24:46PM -0700, Andrii Nakryiko wrote:
> > > > > > > > > +/*
> > > > > > > > > + * find_and_lock_vma_rcu() - Find and lock the VMA for a given address, or the
> > > > > > > > > + * next VMA. Search is done under RCU protection, without taking or assuming
> > > > > > > > > + * mmap_lock. Returned VMA is guaranteed to be stable and not isolated.
> > > > > > > >
> > > > > > > > You know this is supposed to be the _short_ description, right?
> > > > > > > > Three lines is way too long. The full description goes between the
> > > > > > > > arguments and the Return: line.
> > > > > >
> > > > > > Sure, I'll adjust.
> > > > > >
> > > > > > > >
> > > > > > > > > + * @mm: The mm_struct to check
> > > > > > > > > + * @addr: The address
> > > > > > > > > + *
> > > > > > > > > + * Returns: The VMA associated with addr, or the next VMA.
> > > > > > > > > + * May return %NULL in the case of no VMA at addr or above.
> > > > > > > > > + * If the VMA is being modified and can't be locked, -EBUSY is returned.
> > > > > > > > > + */
> > > > > > > > > +struct vm_area_struct *find_and_lock_vma_rcu(struct mm_struct *mm,
> > > > > > > > > + unsigned long address)
> > > > > > > > > +{
> > > > > > > > > + MA_STATE(mas, &mm->mm_mt, address, address);
> > > > > > > > > + struct vm_area_struct *vma;
> > > > > > > > > + int err;
> > > > > > > > > +
> > > > > > > > > + rcu_read_lock();
> > > > > > > > > +retry:
> > > > > > > > > + vma = mas_find(&mas, ULONG_MAX);
> > > > > > > > > + if (!vma) {
> > > > > > > > > + err = 0; /* no VMA, return NULL */
> > > > > > > > > + goto inval;
> > > > > > > > > + }
> > > > > > > > > +
> > > > > > > > > + if (!vma_start_read(vma)) {
> > > > > > > > > + err = -EBUSY;
> > > > > > > > > + goto inval;
> > > > > > > > > + }
> > > > > > > > > +
> > > > > > > > > + /*
> > > > > > > > > + * Check since vm_start/vm_end might change before we lock the VMA.
> > > > > > > > > + * Note, unlike lock_vma_under_rcu() we are searching for VMA covering
> > > > > > > > > + * address or the next one, so we only make sure VMA wasn't updated to
> > > > > > > > > + * end before the address.
> > > > > > > > > + */
> > > > > > > > > + if (unlikely(vma->vm_end <= address)) {
> > > > > > > > > + err = -EBUSY;
> > > > > > > > > + goto inval_end_read;
> > > > > > > > > + }
> > > > > > > > > +
> > > > > > > > > + /* Check if the VMA got isolated after we found it */
> > > > > > > > > + if (vma->detached) {
> > > > > > > > > + vma_end_read(vma);
> > > > > > > > > + count_vm_vma_lock_event(VMA_LOCK_MISS);
> > > > > > > > > + /* The area was replaced with another one */
> > > > > > > >
> > > > > > > > Surely you need to mas_reset() before you goto retry?
> > > > > > >
> > > > > > > Probably more than that. We've found and may have adjusted the
> > > > > > > index/last; we should reconfigure the maple state. You should probably
> > > > > > > use mas_set(), which will reset the maple state and set the index and
> > > > > > > last to address.
> > > > > >
> > > > > > Yep, makes sense, thanks. As for the `unlikely(vma->vm_end <=
> > > > > > address)` case, I presume we want to do the same, right? Basically, on
> > > > > > each retry start from the `address` unconditionally, no matter what's
> > > > > > the reason for retry.
> > > > >
> > > > > ah, never mind, we don't retry in that situation, I'll just put
> > > > > `mas_set(&mas, address);` right before `goto retry;`. Unless we should
> > > > > actually retry in the case when VMA got moved before the requested
> > > > > address, not sure, let me know what you think. Presumably retrying
> > > > > will allow us to get the correct VMA without the need to fall back to
> > > > > mmap_lock?
> > > >
> > > > sorry, one more question as I look some more around this (unfamiliar
> > > > to me) piece of code. I see that lock_vma_under_rcu counts
> > > > VMA_LOCK_MISS on retry, but I see that there is actually a
> > > > VMA_LOCK_RETRY stat as well. Any reason it's a MISS instead of RETRY?
> > > > Should I use MISS as well, or actually count a RETRY?
> > > >
> > >
> > > VMA_LOCK_MISS is used here because we missed the VMA due to a write
> > > happening to move the vma (rather rare). The VMA_LOCK missed the vma.
> > >
> > > VMA_LOCK_RETRY is used to indicate we need to retry under the mmap lock.
> > > A retry is needed after the VMA_LOCK did not work under rcu locking.
> >
> > Originally lock_vma_under_rcu() was used only inside page fault path,
> > so these counters helped us quantify how effective VMA locking is when
> > handling page faults. With more users of that function these counters
> > will be affected by other paths as well. I'm not sure but I think it
> > makes sense to use them only inside page fault path, IOW we should
> > probably move count_vm_vma_lock_event() calls outside of
> > lock_vma_under_rcu() and add them only when handling page faults.
>
> Alright, seems like I should then just drop count_vm_vma_lock_event()
> from the API I'm adding.
That would be my preference but as I said, I'm not 100% sure about
this direction.
>
> >
> > >
> > > Thanks,
> > > Liam
* Re: [PATCH v3 4/9] fs/procfs: use per-VMA RCU-protected locking in PROCMAP_QUERY API
2024-06-06 16:51 ` Andrii Nakryiko
2024-06-06 17:12 ` Suren Baghdasaryan
@ 2024-06-06 17:15 ` Liam R. Howlett
2024-06-06 17:33 ` Suren Baghdasaryan
2024-06-06 18:09 ` Andrii Nakryiko
1 sibling, 2 replies; 31+ messages in thread
From: Liam R. Howlett @ 2024-06-06 17:15 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Suren Baghdasaryan, Andrii Nakryiko, linux-fsdevel, brauner, viro,
akpm, linux-kernel, bpf, gregkh, linux-mm, rppt
* Andrii Nakryiko <andrii.nakryiko@gmail.com> [240606 12:52]:
> On Wed, Jun 5, 2024 at 4:16 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Tue, Jun 4, 2024 at 5:25 PM Andrii Nakryiko <andrii@kernel.org> wrote:
> > >
> > > Attempt to use RCU-protected per-VMA lock when looking up requested VMA
> > > as much as possible, only falling back to mmap_lock if per-VMA lock
> > > failed. This is done so that querying of VMAs doesn't interfere with
> > > other critical tasks, like page fault handling.
> > >
> > > This has been suggested by mm folks, and we make use of a newly added
> > > internal API that works like find_vma(), but tries to use per-VMA lock.
> > >
> > > We have two sets of setup/query/teardown helper functions with different
> > > implementations depending on availability of per-VMA lock (conditioned
> > > on CONFIG_PER_VMA_LOCK) to abstract per-VMA lock subtleties.
> > >
> > > When per-VMA lock is available, lookup is done under RCU, attempting to
> > > take a per-VMA lock. If that fails, we fallback to mmap_lock, but then
> > > proceed to unconditionally grab per-VMA lock again, dropping mmap_lock
> > > immediately. In this configuration mmap_lock is never held for long,
> > > minimizing disruptions while querying.
> > >
> > > When per-VMA lock is compiled out, we take mmap_lock once, query VMAs
> > > using find_vma() API, and then unlock mmap_lock at the very end once as
> > > well. In this setup we avoid locking/unlocking mmap_lock on every looked
> > > up VMA (depending on query parameters we might need to iterate a few of
> > > them).
> > >
> > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > > ---
> > > fs/proc/task_mmu.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
> > > 1 file changed, 46 insertions(+)
> > >
> > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > index 614fbe5d0667..140032ffc551 100644
> > > --- a/fs/proc/task_mmu.c
> > > +++ b/fs/proc/task_mmu.c
> > > @@ -388,6 +388,49 @@ static int pid_maps_open(struct inode *inode, struct file *file)
> > > PROCMAP_QUERY_VMA_FLAGS \
> > > )
> > >
> > > +#ifdef CONFIG_PER_VMA_LOCK
> > > +static int query_vma_setup(struct mm_struct *mm)
> > > +{
> > > + /* in the presence of per-VMA lock we don't need any setup/teardown */
> > > + return 0;
> > > +}
> > > +
> > > +static void query_vma_teardown(struct mm_struct *mm, struct vm_area_struct *vma)
> > > +{
> > > + /* in the presence of per-VMA lock we need to unlock vma, if present */
> > > + if (vma)
> > > + vma_end_read(vma);
> > > +}
> > > +
> > > +static struct vm_area_struct *query_vma_find_by_addr(struct mm_struct *mm, unsigned long addr)
> > > +{
> > > + struct vm_area_struct *vma;
> > > +
> > > + /* try to use less disruptive per-VMA lock */
> > > + vma = find_and_lock_vma_rcu(mm, addr);
> > > + if (IS_ERR(vma)) {
> > > + /* failed to take per-VMA lock, fallback to mmap_lock */
> > > + if (mmap_read_lock_killable(mm))
> > > + return ERR_PTR(-EINTR);
> > > +
> > > + vma = find_vma(mm, addr);
> > > + if (vma) {
> > > + /*
> > > + * We cannot use vma_start_read() as it may fail due to
> > > + * a false-locked result (see comment in vma_start_read()). We
> > > + * can avoid that by directly locking vm_lock under
> > > + * mmap_lock, which guarantees that nobody can lock the
> > > + * vma for write (vma_start_write()) under us.
> > > + */
> > > + down_read(&vma->vm_lock->lock);
> >
> > Hi Andrii,
> > The above pattern of locking VMA under mmap_lock and then dropping
> > mmap_lock is becoming more common. Matthew had an RFC proposal for an
> > API to do this here:
> > https://lore.kernel.org/all/ZivhG0yrbpFqORDw@casper.infradead.org/. It
> > might be worth reviving that discussion.
>
> Sure, it would be nice to have generic and blessed primitives to use
> here. But the good news is that once this is all figured out by you mm
> folks, it should be easy to make use of those primitives here, right?
>
> >
> > > + }
> > > +
> > > + mmap_read_unlock(mm);
> >
> > Later on in your code you are calling get_vma_name() which might call
> > anon_vma_name() to retrieve user-defined VMA name. After this patch
> > this operation will be done without holding mmap_lock, however per
> > https://elixir.bootlin.com/linux/latest/source/include/linux/mm_types.h#L582
> > this function has to be called with mmap_lock held for read. Indeed
> > with debug flags enabled you should hit this assertion:
> > https://elixir.bootlin.com/linux/latest/source/mm/madvise.c#L96.
The documentation on the first link says to hold the lock or take a
reference, but then we assert the lock. If you take a reference to the
anon vma name, then we will trigger the assert. Either the
documentation needs changing or the assert is incorrect - or I'm missing
something?
>
> Sigh... Ok, what's the suggestion then? Should it be some variant of
> mmap_assert_locked() || vma_assert_locked() logic, or is it not so
> simple?
>
> Maybe I should just drop the CONFIG_PER_VMA_LOCK changes for now until
> all these gotchas are figured out for /proc/<pid>/maps anyway, and
> then we can adapt both text-based and ioctl-based /proc/<pid>/maps
> APIs on top of whatever the final approach will end up being the right
> one?
>
> Liam, any objections to this? The whole point of this patch set is to
> add a new API, not all the CONFIG_PER_VMA_LOCK gotchas. My
> implementation is structured in a way that should be easily amenable
> to CONFIG_PER_VMA_LOCK changes, but if there are a few more subtle
> things that need to be figured out for existing text-based
> /proc/<pid>/maps anyway, I think it would be best to use mmap_lock
> for now for this new API, and then adopt the same final
> CONFIG_PER_VMA_LOCK-aware solution.
The reason I was hoping to have the new interface use the per-vma
locking from the start is to ensure the guarantees that we provide to
the users would not change. We'd also avoid shifting to yet another
mmap_lock user.
I also didn't think it would complicate your series too much, so I
understand why you want to revert to the old locking semantics. I'm
fine with you continuing with the series on the old lock. Thanks for
trying to make this work.
Regards,
Liam
* Re: [PATCH v3 4/9] fs/procfs: use per-VMA RCU-protected locking in PROCMAP_QUERY API
2024-06-06 17:15 ` Liam R. Howlett
@ 2024-06-06 17:33 ` Suren Baghdasaryan
2024-06-06 18:07 ` Liam R. Howlett
2024-06-06 18:09 ` Andrii Nakryiko
1 sibling, 1 reply; 31+ messages in thread
From: Suren Baghdasaryan @ 2024-06-06 17:33 UTC (permalink / raw)
To: Liam R. Howlett, Andrii Nakryiko, Suren Baghdasaryan,
Andrii Nakryiko, linux-fsdevel, brauner, viro, akpm, linux-kernel,
bpf, gregkh, linux-mm, rppt
On Thu, Jun 6, 2024 at 10:15 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Andrii Nakryiko <andrii.nakryiko@gmail.com> [240606 12:52]:
> > On Wed, Jun 5, 2024 at 4:16 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > On Tue, Jun 4, 2024 at 5:25 PM Andrii Nakryiko <andrii@kernel.org> wrote:
> > > >
> > > > Attempt to use RCU-protected per-VMA lock when looking up requested VMA
> > > > as much as possible, only falling back to mmap_lock if per-VMA lock
> > > > failed. This is done so that querying of VMAs doesn't interfere with
> > > > other critical tasks, like page fault handling.
> > > >
> > > > This has been suggested by mm folks, and we make use of a newly added
> > > > internal API that works like find_vma(), but tries to use per-VMA lock.
> > > >
> > > > We have two sets of setup/query/teardown helper functions with different
> > > > implementations depending on availability of per-VMA lock (conditioned
> > > > on CONFIG_PER_VMA_LOCK) to abstract per-VMA lock subtleties.
> > > >
> > > > When per-VMA lock is available, lookup is done under RCU, attempting to
> > > > take a per-VMA lock. If that fails, we fallback to mmap_lock, but then
> > > > proceed to unconditionally grab per-VMA lock again, dropping mmap_lock
> > > > immediately. In this configuration mmap_lock is never held for long,
> > > > minimizing disruptions while querying.
> > > >
> > > > When per-VMA lock is compiled out, we take mmap_lock once, query VMAs
> > > > using find_vma() API, and then unlock mmap_lock at the very end once as
> > > > well. In this setup we avoid locking/unlocking mmap_lock on every looked
> > > > up VMA (depending on query parameters we might need to iterate a few of
> > > > them).
> > > >
> > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > > > ---
> > > > fs/proc/task_mmu.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
> > > > 1 file changed, 46 insertions(+)
> > > >
> > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > > index 614fbe5d0667..140032ffc551 100644
> > > > --- a/fs/proc/task_mmu.c
> > > > +++ b/fs/proc/task_mmu.c
> > > > @@ -388,6 +388,49 @@ static int pid_maps_open(struct inode *inode, struct file *file)
> > > > PROCMAP_QUERY_VMA_FLAGS \
> > > > )
> > > >
> > > > +#ifdef CONFIG_PER_VMA_LOCK
> > > > +static int query_vma_setup(struct mm_struct *mm)
> > > > +{
> > > > + /* in the presence of per-VMA lock we don't need any setup/teardown */
> > > > + return 0;
> > > > +}
> > > > +
> > > > +static void query_vma_teardown(struct mm_struct *mm, struct vm_area_struct *vma)
> > > > +{
> > > > + /* in the presence of per-VMA lock we need to unlock vma, if present */
> > > > + if (vma)
> > > > + vma_end_read(vma);
> > > > +}
> > > > +
> > > > +static struct vm_area_struct *query_vma_find_by_addr(struct mm_struct *mm, unsigned long addr)
> > > > +{
> > > > + struct vm_area_struct *vma;
> > > > +
> > > > + /* try to use less disruptive per-VMA lock */
> > > > + vma = find_and_lock_vma_rcu(mm, addr);
> > > > + if (IS_ERR(vma)) {
> > > > + /* failed to take per-VMA lock, fallback to mmap_lock */
> > > > + if (mmap_read_lock_killable(mm))
> > > > + return ERR_PTR(-EINTR);
> > > > +
> > > > + vma = find_vma(mm, addr);
> > > > + if (vma) {
> > > > + /*
> > > > + * We cannot use vma_start_read() as it may fail due to
> > > > + * a false-locked result (see comment in vma_start_read()). We
> > > > + * can avoid that by directly locking vm_lock under
> > > > + * mmap_lock, which guarantees that nobody can lock the
> > > > + * vma for write (vma_start_write()) under us.
> > > > + */
> > > > + down_read(&vma->vm_lock->lock);
> > >
> > > Hi Andrii,
> > > The above pattern of locking VMA under mmap_lock and then dropping
> > > mmap_lock is becoming more common. Matthew had an RFC proposal for an
> > > API to do this here:
> > > https://lore.kernel.org/all/ZivhG0yrbpFqORDw@casper.infradead.org/. It
> > > might be worth reviving that discussion.
> >
> > Sure, it would be nice to have generic and blessed primitives to use
> > here. But the good news is that once this is all figured out by you mm
> > folks, it should be easy to make use of those primitives here, right?
> >
> > >
> > > > + }
> > > > +
> > > > + mmap_read_unlock(mm);
> > >
> > > Later on in your code you are calling get_vma_name() which might call
> > > anon_vma_name() to retrieve user-defined VMA name. After this patch
> > > this operation will be done without holding mmap_lock, however per
> > > https://elixir.bootlin.com/linux/latest/source/include/linux/mm_types.h#L582
> > > this function has to be called with mmap_lock held for read. Indeed
> > > with debug flags enabled you should hit this assertion:
> > > https://elixir.bootlin.com/linux/latest/source/mm/madvise.c#L96.
>
> The documentation on the first link says to hold the lock or take a
> reference, but then we assert the lock. If you take a reference to the
> anon vma name, then we will trigger the assert. Either the
> documentation needs changing or the assert is incorrect - or I'm missing
> something?
I think the documentation is correct. It says that at the time of
calling anon_vma_name() the mmap_lock should be locked (hence the
assertion). Then the user can raise anon_vma_name refcount, drop
mmap_lock and safely continue using anon_vma_name object. IOW this is
fine:
mmap_read_lock(vma->mm);
anon_name = anon_vma_name(vma);
anon_vma_name_get(anon_name);
mmap_read_unlock(vma->mm);
// keep using anon_name
anon_vma_name_put(anon_name);
>
> >
> > Sigh... Ok, what's the suggestion then? Should it be some variant of
> > mmap_assert_locked() || vma_assert_locked() logic, or is it not so
> > simple?
> >
> > Maybe I should just drop the CONFIG_PER_VMA_LOCK changes for now until
> > all these gotchas are figured out for /proc/<pid>/maps anyway, and
> > then we can adapt both text-based and ioctl-based /proc/<pid>/maps
> > APIs on top of whatever the final approach will end up being the right
> > one?
> >
> > Liam, any objections to this? The whole point of this patch set is to
> > add a new API, not all the CONFIG_PER_VMA_LOCK gotchas. My
> > implementation is structured in a way that should be easily amenable
> > to CONFIG_PER_VMA_LOCK changes, but if there are a few more subtle
> > things that need to be figured out for existing text-based
> > /proc/<pid>/maps anyway, I think it would be best to use mmap_lock
> > for now for this new API, and then adopt the same final
> > CONFIG_PER_VMA_LOCK-aware solution.
>
> The reason I was hoping to have the new interface use the per-vma
> locking from the start is to ensure the guarantees that we provide to
> the users would not change. We'd also avoid shifting to yet another
> mmap_lock user.
>
> I also didn't think it would complicate your series too much, so I
> understand why you want to revert to the old locking semantics. I'm
> fine with you continuing with the series on the old lock. Thanks for
> trying to make this work.
>
> Regards,
> Liam
* Re: [PATCH v3 4/9] fs/procfs: use per-VMA RCU-protected locking in PROCMAP_QUERY API
2024-06-06 17:12 ` Suren Baghdasaryan
@ 2024-06-06 18:03 ` Andrii Nakryiko
0 siblings, 0 replies; 31+ messages in thread
From: Andrii Nakryiko @ 2024-06-06 18:03 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Andrii Nakryiko, linux-fsdevel, brauner, viro, akpm, linux-kernel,
bpf, gregkh, linux-mm, liam.howlett, rppt
On Thu, Jun 6, 2024 at 10:12 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Thu, Jun 6, 2024 at 9:52 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Wed, Jun 5, 2024 at 4:16 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > On Tue, Jun 4, 2024 at 5:25 PM Andrii Nakryiko <andrii@kernel.org> wrote:
> > > >
> > > > Attempt to use RCU-protected per-VMA lock when looking up requested VMA
> > > > as much as possible, only falling back to mmap_lock if per-VMA lock
> > > > failed. This is done so that querying of VMAs doesn't interfere with
> > > > other critical tasks, like page fault handling.
> > > >
> > > > This has been suggested by mm folks, and we make use of a newly added
> > > > internal API that works like find_vma(), but tries to use per-VMA lock.
> > > >
> > > > We have two sets of setup/query/teardown helper functions with different
> > > > implementations depending on availability of per-VMA lock (conditioned
> > > > on CONFIG_PER_VMA_LOCK) to abstract per-VMA lock subtleties.
> > > >
> > > > When per-VMA lock is available, lookup is done under RCU, attempting to
> > > > take a per-VMA lock. If that fails, we fall back to mmap_lock, but then
> > > > proceed to unconditionally grab per-VMA lock again, dropping mmap_lock
> > > > immediately. In this configuration mmap_lock is never held for long,
> > > > minimizing disruptions while querying.
> > > >
> > > > When per-VMA lock is compiled out, we take mmap_lock once, query VMAs
> > > > using find_vma() API, and then unlock mmap_lock at the very end once as
> > > > well. In this setup we avoid locking/unlocking mmap_lock on every looked
> > > > up VMA (depending on query parameters we might need to iterate a few of
> > > > them).
> > > >
> > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > > > ---
> > > > fs/proc/task_mmu.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
> > > > 1 file changed, 46 insertions(+)
> > > >
> > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > > index 614fbe5d0667..140032ffc551 100644
> > > > --- a/fs/proc/task_mmu.c
> > > > +++ b/fs/proc/task_mmu.c
> > > > @@ -388,6 +388,49 @@ static int pid_maps_open(struct inode *inode, struct file *file)
> > > > PROCMAP_QUERY_VMA_FLAGS \
> > > > )
> > > >
> > > > +#ifdef CONFIG_PER_VMA_LOCK
> > > > +static int query_vma_setup(struct mm_struct *mm)
> > > > +{
> > > > + /* in the presence of per-VMA lock we don't need any setup/teardown */
> > > > + return 0;
> > > > +}
> > > > +
> > > > +static void query_vma_teardown(struct mm_struct *mm, struct vm_area_struct *vma)
> > > > +{
> > > > + /* in the presence of per-VMA lock we need to unlock vma, if present */
> > > > + if (vma)
> > > > + vma_end_read(vma);
> > > > +}
> > > > +
> > > > +static struct vm_area_struct *query_vma_find_by_addr(struct mm_struct *mm, unsigned long addr)
> > > > +{
> > > > + struct vm_area_struct *vma;
> > > > +
> > > > + /* try to use less disruptive per-VMA lock */
> > > > + vma = find_and_lock_vma_rcu(mm, addr);
> > > > + if (IS_ERR(vma)) {
> > > > + /* failed to take per-VMA lock, fallback to mmap_lock */
> > > > + if (mmap_read_lock_killable(mm))
> > > > + return ERR_PTR(-EINTR);
> > > > +
> > > > + vma = find_vma(mm, addr);
> > > > + if (vma) {
> > > > + /*
> > > > + * We cannot use vma_start_read() as it may fail due to
> > > > + * false locked (see comment in vma_start_read()). We
> > > > + * can avoid that by directly locking vm_lock under
> > > > + * mmap_lock, which guarantees that nobody can lock the
> > > > + * vma for write (vma_start_write()) under us.
> > > > + */
> > > > + down_read(&vma->vm_lock->lock);
> > >
> > > Hi Andrii,
> > > The above pattern of locking VMA under mmap_lock and then dropping
> > > mmap_lock is becoming more common. Matthew had an RFC proposal for an
> > > API to do this here:
> > > https://lore.kernel.org/all/ZivhG0yrbpFqORDw@casper.infradead.org/. It
> > > might be worth reviving that discussion.
> >
> > Sure, it would be nice to have generic and blessed primitives to use
> > here. But the good news is that once this is all figured out by you mm
> > folks, it should be easy to make use of those primitives here, right?
> >
> > >
> > > > + }
> > > > +
> > > > + mmap_read_unlock(mm);
> > >
> > > Later on in your code you are calling get_vma_name() which might call
> > > anon_vma_name() to retrieve user-defined VMA name. After this patch
> > > this operation will be done without holding mmap_lock, however per
> > > https://elixir.bootlin.com/linux/latest/source/include/linux/mm_types.h#L582
> > > this function has to be called with mmap_lock held for read. Indeed
> > > with debug flags enabled you should hit this assertion:
> > > https://elixir.bootlin.com/linux/latest/source/mm/madvise.c#L96.
> >
> > Sigh... Ok, what's the suggestion then? Should it be some variant of
> > mmap_assert_locked() || vma_assert_locked() logic, or is it not so
> > simple?
> >
> > Maybe I should just drop the CONFIG_PER_VMA_LOCK changes for now until
> > all these gotchas are figured out for /proc/<pid>/maps anyway, and
> > then we can adapt both text-based and ioctl-based /proc/<pid>/maps
> > APIs on top of whatever approach ends up being the right
> > one?
> >
> > Liam, any objections to this? The whole point of this patch set is to
> > add a new API, not all the CONFIG_PER_VMA_LOCK gotchas. My
> > implementation is structured in a way that should be easily amenable
> > to CONFIG_PER_VMA_LOCK changes, but if there are a few more subtle
> > things that need to be figured out for the existing text-based
> > /proc/<pid>/maps anyway, I think it would be best to use mmap_lock
> > for now for this new API, and then adopt the same final
> > CONFIG_PER_VMA_LOCK-aware solution.
>
> I agree that you should start simple, using mmap_lock first and then
> work on improvements. Would the proposed solution become useless with
> coarse mmap_lock'ing?
Sorry, it's not clear what you mean by "proposed solution". If you
mean this new ioctl-based API, then no, it's still very useful and fast
even if we take mmap_lock.
But if you mean vm_lock, then I'd say the anon_vma_name() complication
makes vm_lock unattractive, because the VMA name will be requested
pretty much always. And if we need to take mmap_lock anyway, then
what's the point of the per-VMA lock, right?
I'd like to be a good citizen here and help you guys not add new
mmap_lock users (and adopt per-VMA lock more widely), but I'm not sure
I can solve the anon_vma_name() conundrum, unfortunately.
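For illustration, the naive shape of that relaxation might be something
like this (a hypothetical sketch on top of the vma_assert_locked()
helper mentioned earlier, not a tested patch, and it doesn't address the
reference-vs-lock documentation question Liam raised):

struct anon_vma_name *anon_vma_name(struct vm_area_struct *vma)
{
	/* hypothetical: accept either mmap_lock or the per-VMA read lock */
	if (!rwsem_is_locked(&vma->vm_mm->mmap_lock))
		vma_assert_locked(vma);
	return vma->anon_name;
}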
Ultimately, I do care the most about having this new API available for
my customers to take advantage of, of course.
>
> >
> > >
> > > > + }
> > > > +
> > > > + return vma;
> > > > +}
> > > > +#else
> > > > static int query_vma_setup(struct mm_struct *mm)
> > > > {
> > > > return mmap_read_lock_killable(mm);
> > > > @@ -402,6 +445,7 @@ static struct vm_area_struct *query_vma_find_by_addr(struct mm_struct *mm, unsig
> > > > {
> > > > return find_vma(mm, addr);
> > > > }
> > > > +#endif
> > > >
> > > > static struct vm_area_struct *query_matching_vma(struct mm_struct *mm,
> > > > unsigned long addr, u32 flags)
> > > > @@ -441,8 +485,10 @@ static struct vm_area_struct *query_matching_vma(struct mm_struct *mm,
> > > > skip_vma:
> > > > /*
> > > > * If the user needs closest matching VMA, keep iterating.
> > > > + * But before we proceed we might need to unlock current VMA.
> > > > */
> > > > addr = vma->vm_end;
> > > > + vma_end_read(vma); /* no-op under !CONFIG_PER_VMA_LOCK */
> > > > if (flags & PROCMAP_QUERY_COVERING_OR_NEXT_VMA)
> > > > goto next_vma;
> > > > no_vma:
> > > > --
> > > > 2.43.0
> > > >
* Re: [PATCH v3 4/9] fs/procfs: use per-VMA RCU-protected locking in PROCMAP_QUERY API
2024-06-06 17:33 ` Suren Baghdasaryan
@ 2024-06-06 18:07 ` Liam R. Howlett
0 siblings, 0 replies; 31+ messages in thread
From: Liam R. Howlett @ 2024-06-06 18:07 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Andrii Nakryiko, Andrii Nakryiko, linux-fsdevel, brauner, viro,
akpm, linux-kernel, bpf, gregkh, linux-mm, rppt
* Suren Baghdasaryan <surenb@google.com> [240606 13:33]:
> On Thu, Jun 6, 2024 at 10:15 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> >
> > * Andrii Nakryiko <andrii.nakryiko@gmail.com> [240606 12:52]:
> > > On Wed, Jun 5, 2024 at 4:16 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > >
> > > > On Tue, Jun 4, 2024 at 5:25 PM Andrii Nakryiko <andrii@kernel.org> wrote:
> > > > >
> > > > > Attempt to use RCU-protected per-VMA lock when looking up requested VMA
> > > > > as much as possible, only falling back to mmap_lock if per-VMA lock
> > > > > failed. This is done so that querying of VMAs doesn't interfere with
> > > > > other critical tasks, like page fault handling.
> > > > >
> > > > > This has been suggested by mm folks, and we make use of a newly added
> > > > > internal API that works like find_vma(), but tries to use per-VMA lock.
> > > > >
> > > > > We have two sets of setup/query/teardown helper functions with different
> > > > > implementations depending on availability of per-VMA lock (conditioned
> > > > > on CONFIG_PER_VMA_LOCK) to abstract per-VMA lock subtleties.
> > > > >
> > > > > When per-VMA lock is available, lookup is done under RCU, attempting to
> > > > > take a per-VMA lock. If that fails, we fall back to mmap_lock, but then
> > > > > proceed to unconditionally grab per-VMA lock again, dropping mmap_lock
> > > > > immediately. In this configuration mmap_lock is never held for long,
> > > > > minimizing disruptions while querying.
> > > > >
> > > > > When per-VMA lock is compiled out, we take mmap_lock once, query VMAs
> > > > > using find_vma() API, and then unlock mmap_lock at the very end once as
> > > > > well. In this setup we avoid locking/unlocking mmap_lock on every looked
> > > > > up VMA (depending on query parameters we might need to iterate a few of
> > > > > them).
> > > > >
> > > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > > > > ---
> > > > > fs/proc/task_mmu.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
> > > > > 1 file changed, 46 insertions(+)
> > > > >
> > > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > > > index 614fbe5d0667..140032ffc551 100644
> > > > > --- a/fs/proc/task_mmu.c
> > > > > +++ b/fs/proc/task_mmu.c
> > > > > @@ -388,6 +388,49 @@ static int pid_maps_open(struct inode *inode, struct file *file)
> > > > > PROCMAP_QUERY_VMA_FLAGS \
> > > > > )
> > > > >
> > > > > +#ifdef CONFIG_PER_VMA_LOCK
> > > > > +static int query_vma_setup(struct mm_struct *mm)
> > > > > +{
> > > > > + /* in the presence of per-VMA lock we don't need any setup/teardown */
> > > > > + return 0;
> > > > > +}
> > > > > +
> > > > > +static void query_vma_teardown(struct mm_struct *mm, struct vm_area_struct *vma)
> > > > > +{
> > > > > + /* in the presence of per-VMA lock we need to unlock vma, if present */
> > > > > + if (vma)
> > > > > + vma_end_read(vma);
> > > > > +}
> > > > > +
> > > > > +static struct vm_area_struct *query_vma_find_by_addr(struct mm_struct *mm, unsigned long addr)
> > > > > +{
> > > > > + struct vm_area_struct *vma;
> > > > > +
> > > > > + /* try to use less disruptive per-VMA lock */
> > > > > + vma = find_and_lock_vma_rcu(mm, addr);
> > > > > + if (IS_ERR(vma)) {
> > > > > + /* failed to take per-VMA lock, fallback to mmap_lock */
> > > > > + if (mmap_read_lock_killable(mm))
> > > > > + return ERR_PTR(-EINTR);
> > > > > +
> > > > > + vma = find_vma(mm, addr);
> > > > > + if (vma) {
> > > > > + /*
> > > > > + * We cannot use vma_start_read() as it may fail due to
> > > > > + * false locked (see comment in vma_start_read()). We
> > > > > + * can avoid that by directly locking vm_lock under
> > > > > + * mmap_lock, which guarantees that nobody can lock the
> > > > > + * vma for write (vma_start_write()) under us.
> > > > > + */
> > > > > + down_read(&vma->vm_lock->lock);
> > > >
> > > > Hi Andrii,
> > > > The above pattern of locking VMA under mmap_lock and then dropping
> > > > mmap_lock is becoming more common. Matthew had an RFC proposal for an
> > > > API to do this here:
> > > > https://lore.kernel.org/all/ZivhG0yrbpFqORDw@casper.infradead.org/. It
> > > > might be worth reviving that discussion.
> > >
> > > Sure, it would be nice to have generic and blessed primitives to use
> > > here. But the good news is that once this is all figured out by you mm
> > > folks, it should be easy to make use of those primitives here, right?
> > >
> > > >
> > > > > + }
> > > > > +
> > > > > + mmap_read_unlock(mm);
> > > >
> > > > Later on in your code you are calling get_vma_name() which might call
> > > > anon_vma_name() to retrieve user-defined VMA name. After this patch
> > > > this operation will be done without holding mmap_lock, however per
> > > > https://elixir.bootlin.com/linux/latest/source/include/linux/mm_types.h#L582
> > > > this function has to be called with mmap_lock held for read. Indeed
> > > > with debug flags enabled you should hit this assertion:
> > > > https://elixir.bootlin.com/linux/latest/source/mm/madvise.c#L96.
> >
> > The documentation on the first link says to hold the lock or take a
> > reference, but then we assert the lock. If you take a reference to the
> > anon vma name instead of holding the lock, we will trigger the assert.
> > Either the
> > documentation needs changing or the assert is incorrect - or I'm missing
> > something?
>
> I think the documentation is correct. It says that at the time of
> calling anon_vma_name() the mmap_lock should be locked (hence the
> assertion). Then the user can raise the anon_vma_name refcount, drop
> mmap_lock, and safely continue using the anon_vma_name object. IOW this is
> fine:
>
> mmap_read_lock(vma->mm);
> anon_name = anon_vma_name(vma);
> anon_vma_name_get(anon_name);
> mmap_read_unlock(vma->mm);
> // keep using anon_name
> anon_vma_name_put(anon_name);
>
I consider that an optimistic view of what will happen, but okay.
* Re: [PATCH v3 4/9] fs/procfs: use per-VMA RCU-protected locking in PROCMAP_QUERY API
2024-06-06 17:15 ` Liam R. Howlett
2024-06-06 17:33 ` Suren Baghdasaryan
@ 2024-06-06 18:09 ` Andrii Nakryiko
2024-06-06 18:32 ` Liam R. Howlett
1 sibling, 1 reply; 31+ messages in thread
From: Andrii Nakryiko @ 2024-06-06 18:09 UTC (permalink / raw)
To: Liam R. Howlett, Andrii Nakryiko, Suren Baghdasaryan,
Andrii Nakryiko, linux-fsdevel, brauner, viro, akpm, linux-kernel,
bpf, gregkh, linux-mm, rppt
On Thu, Jun 6, 2024 at 10:15 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Andrii Nakryiko <andrii.nakryiko@gmail.com> [240606 12:52]:
> > On Wed, Jun 5, 2024 at 4:16 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > On Tue, Jun 4, 2024 at 5:25 PM Andrii Nakryiko <andrii@kernel.org> wrote:
> > > >
> > > > Attempt to use RCU-protected per-VMA lock when looking up requested VMA
> > > > as much as possible, only falling back to mmap_lock if per-VMA lock
> > > > failed. This is done so that querying of VMAs doesn't interfere with
> > > > other critical tasks, like page fault handling.
> > > >
> > > > This has been suggested by mm folks, and we make use of a newly added
> > > > internal API that works like find_vma(), but tries to use per-VMA lock.
> > > >
> > > > We have two sets of setup/query/teardown helper functions with different
> > > > implementations depending on availability of per-VMA lock (conditioned
> > > > on CONFIG_PER_VMA_LOCK) to abstract per-VMA lock subtleties.
> > > >
> > > > When per-VMA lock is available, lookup is done under RCU, attempting to
> > > > take a per-VMA lock. If that fails, we fall back to mmap_lock, but then
> > > > proceed to unconditionally grab per-VMA lock again, dropping mmap_lock
> > > > immediately. In this configuration mmap_lock is never held for long,
> > > > minimizing disruptions while querying.
> > > >
> > > > When per-VMA lock is compiled out, we take mmap_lock once, query VMAs
> > > > using find_vma() API, and then unlock mmap_lock at the very end once as
> > > > well. In this setup we avoid locking/unlocking mmap_lock on every looked
> > > > up VMA (depending on query parameters we might need to iterate a few of
> > > > them).
> > > >
> > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > > > ---
> > > > fs/proc/task_mmu.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
> > > > 1 file changed, 46 insertions(+)
> > > >
> > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > > index 614fbe5d0667..140032ffc551 100644
> > > > --- a/fs/proc/task_mmu.c
> > > > +++ b/fs/proc/task_mmu.c
> > > > @@ -388,6 +388,49 @@ static int pid_maps_open(struct inode *inode, struct file *file)
> > > > PROCMAP_QUERY_VMA_FLAGS \
> > > > )
> > > >
> > > > +#ifdef CONFIG_PER_VMA_LOCK
> > > > +static int query_vma_setup(struct mm_struct *mm)
> > > > +{
> > > > + /* in the presence of per-VMA lock we don't need any setup/teardown */
> > > > + return 0;
> > > > +}
> > > > +
> > > > +static void query_vma_teardown(struct mm_struct *mm, struct vm_area_struct *vma)
> > > > +{
> > > > + /* in the presence of per-VMA lock we need to unlock vma, if present */
> > > > + if (vma)
> > > > + vma_end_read(vma);
> > > > +}
> > > > +
> > > > +static struct vm_area_struct *query_vma_find_by_addr(struct mm_struct *mm, unsigned long addr)
> > > > +{
> > > > + struct vm_area_struct *vma;
> > > > +
> > > > + /* try to use less disruptive per-VMA lock */
> > > > + vma = find_and_lock_vma_rcu(mm, addr);
> > > > + if (IS_ERR(vma)) {
> > > > + /* failed to take per-VMA lock, fallback to mmap_lock */
> > > > + if (mmap_read_lock_killable(mm))
> > > > + return ERR_PTR(-EINTR);
> > > > +
> > > > + vma = find_vma(mm, addr);
> > > > + if (vma) {
> > > > + /*
> > > > + * We cannot use vma_start_read() as it may fail due to
> > > > + * false locked (see comment in vma_start_read()). We
> > > > + * can avoid that by directly locking vm_lock under
> > > > + * mmap_lock, which guarantees that nobody can lock the
> > > > + * vma for write (vma_start_write()) under us.
> > > > + */
> > > > + down_read(&vma->vm_lock->lock);
> > >
> > > Hi Andrii,
> > > The above pattern of locking VMA under mmap_lock and then dropping
> > > mmap_lock is becoming more common. Matthew had an RFC proposal for an
> > > API to do this here:
> > > https://lore.kernel.org/all/ZivhG0yrbpFqORDw@casper.infradead.org/. It
> > > might be worth reviving that discussion.
> >
> > Sure, it would be nice to have generic and blessed primitives to use
> > here. But the good news is that once this is all figured out by you mm
> > folks, it should be easy to make use of those primitives here, right?
> >
> > >
> > > > + }
> > > > +
> > > > + mmap_read_unlock(mm);
> > >
> > > Later on in your code you are calling get_vma_name() which might call
> > > anon_vma_name() to retrieve user-defined VMA name. After this patch
> > > this operation will be done without holding mmap_lock, however per
> > > https://elixir.bootlin.com/linux/latest/source/include/linux/mm_types.h#L582
> > > this function has to be called with mmap_lock held for read. Indeed
> > > with debug flags enabled you should hit this assertion:
> > > https://elixir.bootlin.com/linux/latest/source/mm/madvise.c#L96.
>
> The documentation on the first link says to hold the lock or take a
> reference, but then we assert the lock. If you take a reference to the
> anon vma name instead of holding the lock, we will trigger the assert.
> Either the
> documentation needs changing or the assert is incorrect - or I'm missing
> something?
>
> >
> > Sigh... Ok, what's the suggestion then? Should it be some variant of
> > mmap_assert_locked() || vma_assert_locked() logic, or is it not so
> > simple?
> >
> > Maybe I should just drop the CONFIG_PER_VMA_LOCK changes for now until
> > all these gotchas are figured out for /proc/<pid>/maps anyway, and
> > then we can adapt both text-based and ioctl-based /proc/<pid>/maps
> > APIs on top of whatever approach ends up being the right
> > one?
> >
> > Liam, any objections to this? The whole point of this patch set is to
> > add a new API, not all the CONFIG_PER_VMA_LOCK gotchas. My
> > implementation is structured in a way that should be easily amenable
> > to CONFIG_PER_VMA_LOCK changes, but if there are a few more subtle
> > things that need to be figured out for the existing text-based
> > /proc/<pid>/maps anyway, I think it would be best to use mmap_lock
> > for now for this new API, and then adopt the same final
> > CONFIG_PER_VMA_LOCK-aware solution.
>
> The reason I was hoping to have the new interface use the per-vma
> locking from the start is to ensure the guarantees that we provide to
> the users would not change. We'd also avoid adding yet another
> mmap_lock user.
>
Yep, it's completely understandable. And you see that I changed the
structure quite a lot to abstract away mmap_lock vs vm_lock details.
I'm afraid anon_vma_name() is quite an obstacle, unfortunately, and it
seems like it should be addressed first, but I'm just not qualified
enough to do this.
> I also didn't think it would complicate your series too much, so I
> understand why you want to revert to the old locking semantics. I'm
> fine with you continuing with the series on the old lock. Thanks for
> trying to make this work.
>
I'm happy to keep the existing structure of the code, and
(intentionally) all the CONFIG_PER_VMA_LOCK logic is in separate
patches, so it's easy to do. I'd love to help adopt a per-VMA lock
once all the pieces are figured out. Hopefully anon_vma_name() is the
last one remaining :) So please keep me cc'ed on relevant patches.
As I mentioned, I just don't feel like I would be able to solve the
anon_vma_name() problem, but of course I wouldn't want to be
completely blocked by it either.
> Regards,
> Liam
* Re: [PATCH v3 4/9] fs/procfs: use per-VMA RCU-protected locking in PROCMAP_QUERY API
2024-06-06 18:09 ` Andrii Nakryiko
@ 2024-06-06 18:32 ` Liam R. Howlett
0 siblings, 0 replies; 31+ messages in thread
From: Liam R. Howlett @ 2024-06-06 18:32 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Suren Baghdasaryan, Andrii Nakryiko, linux-fsdevel, brauner, viro,
akpm, linux-kernel, bpf, gregkh, linux-mm, rppt
* Andrii Nakryiko <andrii.nakryiko@gmail.com> [240606 14:09]:
...
> > > Liam, any objections to this? The whole point of this patch set is to
> > > add a new API, not all the CONFIG_PER_VMA_LOCK gotchas. My
> > > implementation is structured in a way that should be easily amenable
> > > to CONFIG_PER_VMA_LOCK changes, but if there are a few more subtle
> > > things that need to be figured out for the existing text-based
> > > /proc/<pid>/maps anyway, I think it would be best to use mmap_lock
> > > for now for this new API, and then adopt the same final
> > > CONFIG_PER_VMA_LOCK-aware solution.
> >
> > The reason I was hoping to have the new interface use the per-vma
> > locking from the start is to ensure the guarantees that we provide to
> > the users would not change. We'd also avoid adding yet another
> > mmap_lock user.
> >
>
> Yep, it's completely understandable. And you see that I changed the
> structure quite a lot to abstract away mmap_lock vs vm_lock details.
> I'm afraid anon_vma_name() is quite an obstacle, unfortunately, and
> seems like it should be addressed first, but I'm just not qualified
> enough to do this.
>
> > I also didn't think it would complicate your series too much, so I
> > understand why you want to revert to the old locking semantics. I'm
> > fine with you continuing with the series on the old lock. Thanks for
> > trying to make this work.
> >
>
> I'm happy to keep the existing structure of the code, and
> (intentionally) all the CONFIG_PER_VMA_LOCK logic is in separate
> patches, so it's easy to do. I'd love to help adopt a per-VMA lock
> once all the pieces are figured out. Hopefully anon_vma_name() is the
> last one remaining :) So please keep me cc'ed on relevant patches.
>
> As I mentioned, I just don't feel like I would be able to solve the
> anon_vma_name() problem, but of course I wouldn't want to be
> completely blocked by it either.
>
Absolutely. Thanks for trying. To be clear, I'm fine with you dropping
the per-vma locking from this interface as well.
Thanks,
Liam
* Re: [PATCH v3 3/9] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps
2024-06-05 0:24 ` [PATCH v3 3/9] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps Andrii Nakryiko
@ 2024-06-07 22:31 ` Andrei Vagin
2024-06-10 8:17 ` Andrii Nakryiko
0 siblings, 1 reply; 31+ messages in thread
From: Andrei Vagin @ 2024-06-07 22:31 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: linux-fsdevel, brauner, viro, akpm, linux-kernel, bpf, gregkh,
linux-mm, liam.howlett, surenb, rppt
On Tue, Jun 04, 2024 at 05:24:48PM -0700, Andrii Nakryiko wrote:
> /proc/<pid>/maps file is extremely useful in practice for various tasks
> involving figuring out process memory layout, what files are backing any
> given memory range, etc. One important class of applications that
> absolutely rely on this are profilers/stack symbolizers (perf tool being one
> of them). Patterns of use differ, but they generally would fall into two
> categories.
>
> In the on-demand pattern, a profiler/symbolizer would normally capture a stack
> trace containing absolute memory addresses of some functions, and would
> then use /proc/<pid>/maps file to find corresponding backing ELF files
> (normally, only executable VMAs are of interest), file offsets within
> them, and then continue from there to get yet more information (ELF
> symbols, DWARF information) to get human-readable symbolic information.
> This pattern is used by Meta's fleet-wide profiler, as one example.
>
> In the preprocessing pattern, the application doesn't know the set of addresses
> of interest, so it has to fetch all relevant VMAs (again, probably only
> executable ones), store or cache them, then proceed with profiling and
> stack trace capture. Once done, it would do symbolization based on
> stored VMA information. This can happen at a much later point in time.
> This pattern is used by the perf tool, as an example.
>
> In either case, there are both performance and correctness requirements
> involved. This translation from address to VMA information has to be done
> as efficiently as possible, but must also not miss any VMA (especially in the
> case of loading/unloading shared libraries). In practice, correctness
> can't be guaranteed (due to process dying before VMA data can be
> captured, or shared library being unloaded, etc), but any effort to
> maximize the chance of finding the VMA is appreciated.
>
> Unfortunately, for all the /proc/<pid>/maps file universality and
> usefulness, it doesn't fit the above use cases 100%.
>
> First, its main purpose is to emit all VMAs sequentially, but in
> practice captured addresses would fall only into a smaller subset of all
> the process' VMAs, mainly those containing executable text. Yet, a library
> would need to parse most or all of the contents to find the needed VMAs,
> as there is no way to skip VMAs that are of no use. An efficient library
> can do the linear pass relatively cheaply, but it's still an
> overhead that could be avoided if there were a way to do more targeted
> querying of the relevant VMA information.
>
> Second, it's a text-based interface, which makes its programmatic use from
> applications and libraries more cumbersome and inefficient due to the
> need to handle text parsing to get the necessary pieces of information. The
> overhead is actually paid both by the kernel, formatting originally binary
> VMA data into text, and then by the user space application, parsing it back
> into binary data for further use.
I was trying to solve all these issues in a more generic way:
https://lwn.net/Articles/683371/
We are definitely interested in using this new interface in CRIU.
<snip>
> +
> + if (karg.vma_name_size) {
> + size_t name_buf_sz = min_t(size_t, PATH_MAX, karg.vma_name_size);
> + const struct path *path;
> + const char *name_fmt;
> + size_t name_sz = 0;
> +
> + get_vma_name(vma, &path, &name, &name_fmt);
> +
> + if (path || name_fmt || name) {
> + name_buf = kmalloc(name_buf_sz, GFP_KERNEL);
> + if (!name_buf) {
> + err = -ENOMEM;
> + goto out;
> + }
> + }
> + if (path) {
> + name = d_path(path, name_buf, name_buf_sz);
> + if (IS_ERR(name)) {
> + err = PTR_ERR(name);
> + goto out;
It always fails if a file path name is longer than PATH_MAX.
Can we add a flag to indicate whether file names need to be
resolved? In CRIU, we use special names like "vvar" and "vdso", but we
dump files via /proc/pid/map_files.
> + }
> + name_sz = name_buf + name_buf_sz - name;
> + } else if (name || name_fmt) {
> + name_sz = 1 + snprintf(name_buf, name_buf_sz, name_fmt ?: "%s", name);
> + name = name_buf;
> + }
> + if (name_sz > name_buf_sz) {
> + err = -ENAMETOOLONG;
> + goto out;
> + }
> + karg.vma_name_size = name_sz;
> + }
Thanks,
Andrei
* Re: [PATCH v3 3/9] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps
2024-06-07 22:31 ` Andrei Vagin
@ 2024-06-10 8:17 ` Andrii Nakryiko
2024-06-12 17:48 ` Andrei Vagin
0 siblings, 1 reply; 31+ messages in thread
From: Andrii Nakryiko @ 2024-06-10 8:17 UTC (permalink / raw)
To: Andrei Vagin
Cc: Andrii Nakryiko, linux-fsdevel, brauner, viro, akpm, linux-kernel,
bpf, gregkh, linux-mm, liam.howlett, surenb, rppt
On Fri, Jun 7, 2024 at 11:31 PM Andrei Vagin <avagin@gmail.com> wrote:
>
> On Tue, Jun 04, 2024 at 05:24:48PM -0700, Andrii Nakryiko wrote:
> > /proc/<pid>/maps file is extremely useful in practice for various tasks
> > involving figuring out process memory layout, what files are backing any
> > given memory range, etc. One important class of applications that
> > absolutely rely on this are profilers/stack symbolizers (perf tool being one
> > of them). Patterns of use differ, but they generally would fall into two
> > categories.
> >
> > In the on-demand pattern, a profiler/symbolizer would normally capture a stack
> > trace containing absolute memory addresses of some functions, and would
> > then use /proc/<pid>/maps file to find corresponding backing ELF files
> > (normally, only executable VMAs are of interest), file offsets within
> > them, and then continue from there to get yet more information (ELF
> > symbols, DWARF information) to get human-readable symbolic information.
> > This pattern is used by Meta's fleet-wide profiler, as one example.
> >
> > In the preprocessing pattern, the application doesn't know the set of addresses
> > of interest, so it has to fetch all relevant VMAs (again, probably only
> > executable ones), store or cache them, then proceed with profiling and
> > stack trace capture. Once done, it would do symbolization based on
> > stored VMA information. This can happen at a much later point in time.
> > This pattern is used by the perf tool, as an example.
> >
> > In either case, there are both performance and correctness requirements
> > involved. This translation from address to VMA information has to be done
> > as efficiently as possible, but must also not miss any VMA (especially in the
> > case of loading/unloading shared libraries). In practice, correctness
> > can't be guaranteed (due to process dying before VMA data can be
> > captured, or shared library being unloaded, etc), but any effort to
> > maximize the chance of finding the VMA is appreciated.
> >
> > Unfortunately, for all the /proc/<pid>/maps file universality and
> > usefulness, it doesn't fit the above use cases 100%.
> >
> > First, its main purpose is to emit all VMAs sequentially, but in
> > practice captured addresses would fall only into a smaller subset of all
> > the process' VMAs, mainly those containing executable text. Yet, a library
> > would need to parse most or all of the contents to find the needed VMAs,
> > as there is no way to skip VMAs that are of no use. An efficient library
> > can do the linear pass relatively cheaply, but it's still an
> > overhead that could be avoided if there were a way to do more targeted
> > querying of the relevant VMA information.
> >
> > Second, it's a text-based interface, which makes its programmatic use from
> > applications and libraries more cumbersome and inefficient due to the
> > need to handle text parsing to get the necessary pieces of information. The
> > overhead is actually paid both by the kernel, formatting originally binary
> > VMA data into text, and then by the user space application, parsing it back
> > into binary data for further use.
>
> I was trying to solve all these issues in a more generic way:
> https://lwn.net/Articles/683371/
>
Can you please provide a tl;dr summary of that effort?
> We are definitely interested in using this new interface in CRIU.
>
> <snip>
>
> > +
> > + if (karg.vma_name_size) {
> > + size_t name_buf_sz = min_t(size_t, PATH_MAX, karg.vma_name_size);
> > + const struct path *path;
> > + const char *name_fmt;
> > + size_t name_sz = 0;
> > +
> > + get_vma_name(vma, &path, &name, &name_fmt);
> > +
> > + if (path || name_fmt || name) {
> > + name_buf = kmalloc(name_buf_sz, GFP_KERNEL);
> > + if (!name_buf) {
> > + err = -ENOMEM;
> > + goto out;
> > + }
> > + }
> > + if (path) {
> > + name = d_path(path, name_buf, name_buf_sz);
> > + if (IS_ERR(name)) {
> > + err = PTR_ERR(name);
> > + goto out;
>
> It always fails if a file path name is longer than PATH_MAX.
>
> Can we add a flag to indicate whether file names need to be
It's already supported. Getting a VMA name is optional. See the big
comment next to the vma_name_size field in the UAPI header. If
vma_name_size is set to zero, the VMA name is not retrieved at all,
avoiding both the overhead and this PATH_MAX issue.
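Roughly, usage looks like this (a sketch reconstructed from this
series' UAPI proposal, so field names may differ slightly; error
handling trimmed):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/fs.h>	/* struct procmap_query, PROCMAP_QUERY */

static int query_one(int maps_fd, unsigned long long addr)
{
	struct procmap_query q;
	char name[256];

	memset(&q, 0, sizeof(q));
	q.size = sizeof(q);
	q.query_flags = PROCMAP_QUERY_COVERING_OR_NEXT_VMA;
	q.query_addr = addr;
	/* set vma_name_size to 0 to skip name resolution entirely */
	q.vma_name_size = sizeof(name);
	q.vma_name_addr = (unsigned long long)(unsigned long)name;

	if (ioctl(maps_fd, PROCMAP_QUERY, &q) < 0)
		return -1;

	printf("[%llx, %llx) '%s'\n",
	       (unsigned long long)q.vma_start,
	       (unsigned long long)q.vma_end,
	       q.vma_name_size ? name : "");
	return 0;
}

Here maps_fd is just open("/proc/<pid>/maps", O_RDONLY), and a
symbolizer would call this in a loop over a sorted batch of addresses.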
> resolved? In CRIU, we use special names like "vvar" and "vdso", but we dump
> files via /proc/pid/map_files.
>
> > + }
> > + name_sz = name_buf + name_buf_sz - name;
> > + } else if (name || name_fmt) {
> > + name_sz = 1 + snprintf(name_buf, name_buf_sz, name_fmt ?: "%s", name);
> > + name = name_buf;
> > + }
> > + if (name_sz > name_buf_sz) {
> > + err = -ENAMETOOLONG;
> > + goto out;
> > + }
> > + karg.vma_name_size = name_sz;
> > + }
>
> Thanks,
> Andrei
* Re: [PATCH v3 3/9] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps
2024-06-10 8:17 ` Andrii Nakryiko
@ 2024-06-12 17:48 ` Andrei Vagin
0 siblings, 0 replies; 31+ messages in thread
From: Andrei Vagin @ 2024-06-12 17:48 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Andrii Nakryiko, linux-fsdevel, brauner, viro, akpm, linux-kernel,
bpf, gregkh, linux-mm, liam.howlett, surenb, rppt
On Mon, Jun 10, 2024 at 1:17 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Fri, Jun 7, 2024 at 11:31 PM Andrei Vagin <avagin@gmail.com> wrote:
> >
> > On Tue, Jun 04, 2024 at 05:24:48PM -0700, Andrii Nakryiko wrote:
> > > /proc/<pid>/maps file is extremely useful in practice for various tasks
> > > involving figuring out process memory layout, what files are backing any
> > > given memory range, etc. One important class of applications that
> > > absolutely rely on this are profilers/stack symbolizers (perf tool being one
> > > of them). Patterns of use differ, but they generally would fall into two
> > > categories.
> > >
> > > In the on-demand pattern, a profiler/symbolizer would normally capture a stack
> > > trace containing absolute memory addresses of some functions, and would
> > > then use /proc/<pid>/maps file to find corresponding backing ELF files
> > > (normally, only executable VMAs are of interest), file offsets within
> > > them, and then continue from there to get yet more information (ELF
> > > symbols, DWARF information) to get human-readable symbolic information.
> > > This pattern is used by Meta's fleet-wide profiler, as one example.
> > >
> > > In the preprocessing pattern, the application doesn't know the set of addresses
> > > of interest, so it has to fetch all relevant VMAs (again, probably only
> > > executable ones), store or cache them, then proceed with profiling and
> > > stack trace capture. Once done, it would do symbolization based on
> > > stored VMA information. This can happen at a much later point in time.
> > > This pattern is used by the perf tool, as an example.
> > >
> > > In either case, there are both performance and correctness requirements
> > > involved. This translation from address to VMA information has to be done
> > > as efficiently as possible, but must also not miss any VMA (especially in the
> > > case of loading/unloading shared libraries). In practice, correctness
> > > can't be guaranteed (due to process dying before VMA data can be
> > > captured, or shared library being unloaded, etc), but any effort to
> > > maximize the chance of finding the VMA is appreciated.
> > >
> > > Unfortunately, for all the /proc/<pid>/maps file universality and
> > > usefulness, it doesn't fit the above use cases 100%.
> > >
> > > First, its main purpose is to emit all VMAs sequentially, but in
> > > practice captured addresses would fall only into a smaller subset of all
> > > the process' VMAs, mainly those containing executable text. Yet, a library
> > > would need to parse most or all of the contents to find the needed VMAs,
> > > as there is no way to skip VMAs that are of no use. An efficient library
> > > can do the linear pass relatively cheaply, but it's still an
> > > overhead that could be avoided if there were a way to do more targeted
> > > querying of the relevant VMA information.
> > >
> > > Second, it's a text-based interface, which makes its programmatic use from
> > > applications and libraries more cumbersome and inefficient due to the
> > > need to handle text parsing to get the necessary pieces of information. The
> > > overhead is actually paid both by the kernel, formatting originally binary
> > > VMA data into text, and then by the user space application, parsing it back
> > > into binary data for further use.
> >
> > I was trying to solve all these issues in a more generic way:
> > https://lwn.net/Articles/683371/
> >
>
> Can you please provide a tl;dr summary of that effort?
task_diag is a generic interface designed to efficiently gather
information about running processes. It addresses the limitations of
traditional /proc/PID/* files. This binary interface utilizes the
netlink protocol, inspired by the socket diag interface. Input is
provided as a netlink message detailing the desired information, and the
kernel responds with a set of netlink messages containing the results.
Compared to struct-based interfaces like this one or statx, the
netlink-based approach can be more flexible, particularly when
dealing with numerous optional parameters. BTW, David Ahern made
some adjustments in task_diag to optimize the same things that are
targeted here.
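In userspace the exchange looks roughly like this (illustrative only:
the TASK_DIAG_* values are placeholders rather than the real, unmerged
task_diag UAPI -- see the LWN link above for the actual series):

#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/types.h>

#define TASK_DIAG_PROTO	NETLINK_GENERIC	/* placeholder protocol */
#define TASK_DIAG_CMD	0x100		/* placeholder message type */

struct task_diag_req {			/* hypothetical request payload */
	__u32 pid;
	__u32 show_flags;		/* which attribute groups to dump */
};

static int task_diag_query(__u32 pid, __u32 show_flags)
{
	struct { struct nlmsghdr nlh; struct task_diag_req req; } msg;
	struct sockaddr_nl sa = { .nl_family = AF_NETLINK };
	char buf[4096];
	int fd, len, ret = -1;

	fd = socket(AF_NETLINK, SOCK_RAW, TASK_DIAG_PROTO);
	if (fd < 0)
		return -1;

	memset(&msg, 0, sizeof(msg));
	msg.nlh.nlmsg_len = sizeof(msg);
	msg.nlh.nlmsg_type = TASK_DIAG_CMD;
	msg.nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;
	msg.req.pid = pid;
	msg.req.show_flags = show_flags;

	if (sendto(fd, &msg, sizeof(msg), 0,
		   (struct sockaddr *)&sa, sizeof(sa)) < 0)
		goto out;

	/* the kernel streams the results back as a multi-part sequence
	 * of netlink messages, terminated by NLMSG_DONE
	 */
	while ((len = recv(fd, buf, sizeof(buf), 0)) > 0) {
		struct nlmsghdr *h = (struct nlmsghdr *)buf;

		if (h->nlmsg_type == NLMSG_DONE) {
			ret = 0;
			break;
		}
		/* ... walk NLMSG_OK()/NLMSG_NEXT() and parse attributes ... */
	}
out:
	close(fd);
	return ret;
}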
task_diag hasn't been merged into the kernel. I don't remember all the
arguments; it was some time ago. The primary concern was the
introduction of redundant functionality. It would have been the second
interface offering similar capabilities, without a plan to deprecate the
older interface. Furthermore, there wasn't sufficient demand to justify
the addition of a new interface at the time.
Thanks,
Andrei
Thread overview: 31+ messages
2024-06-05 0:24 [PATCH v3 0/9] ioctl()-based API to query VMAs from /proc/<pid>/maps Andrii Nakryiko
2024-06-05 0:24 ` [PATCH v3 1/9] mm: add find_vma()-like API but RCU protected and taking VMA lock Andrii Nakryiko
2024-06-05 0:57 ` Matthew Wilcox
2024-06-05 13:33 ` Liam R. Howlett
2024-06-05 16:13 ` Andrii Nakryiko
2024-06-05 16:24 ` Andrii Nakryiko
2024-06-05 16:27 ` Andrii Nakryiko
2024-06-05 17:03 ` Liam R. Howlett
2024-06-05 23:22 ` Suren Baghdasaryan
2024-06-06 16:51 ` Andrii Nakryiko
2024-06-06 17:13 ` Suren Baghdasaryan
2024-06-05 0:24 ` [PATCH v3 2/9] fs/procfs: extract logic for getting VMA name constituents Andrii Nakryiko
2024-06-05 0:24 ` [PATCH v3 3/9] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps Andrii Nakryiko
2024-06-07 22:31 ` Andrei Vagin
2024-06-10 8:17 ` Andrii Nakryiko
2024-06-12 17:48 ` Andrei Vagin
2024-06-05 0:24 ` [PATCH v3 4/9] fs/procfs: use per-VMA RCU-protected locking in PROCMAP_QUERY API Andrii Nakryiko
2024-06-05 23:15 ` Suren Baghdasaryan
2024-06-06 16:51 ` Andrii Nakryiko
2024-06-06 17:12 ` Suren Baghdasaryan
2024-06-06 18:03 ` Andrii Nakryiko
2024-06-06 17:15 ` Liam R. Howlett
2024-06-06 17:33 ` Suren Baghdasaryan
2024-06-06 18:07 ` Liam R. Howlett
2024-06-06 18:09 ` Andrii Nakryiko
2024-06-06 18:32 ` Liam R. Howlett
2024-06-05 0:24 ` [PATCH v3 5/9] fs/procfs: add build ID fetching to " Andrii Nakryiko
2024-06-05 0:24 ` [PATCH v3 6/9] docs/procfs: call out ioctl()-based PROCMAP_QUERY command existence Andrii Nakryiko
2024-06-05 0:24 ` [PATCH v3 7/9] tools: sync uapi/linux/fs.h header into tools subdir Andrii Nakryiko
2024-06-05 0:24 ` [PATCH v3 8/9] selftests/bpf: make use of PROCMAP_QUERY ioctl if available Andrii Nakryiko
2024-06-05 0:24 ` [PATCH v3 9/9] selftests/bpf: add simple benchmark tool for /proc/<pid>/maps APIs Andrii Nakryiko