* [RFC 0/3] reading proc/pid/maps under RCU
@ 2024-01-15 18:38 Suren Baghdasaryan
2024-01-15 18:38 ` [RFC 1/3] mm: make vm_area_struct anon_name field RCU-safe Suren Baghdasaryan
` (3 more replies)
0 siblings, 4 replies; 9+ messages in thread
From: Suren Baghdasaryan @ 2024-01-15 18:38 UTC (permalink / raw)
To: akpm
Cc: viro, brauner, jack, dchinner, casey, ben.wolsieffer, paulmck,
david, avagin, usama.anjum, peterx, hughd, ryan.roberts,
wangkefeng.wang, Liam.Howlett, yuzhao, axelrasmussen, lstoakes,
talumbau, willy, vbabka, mgorman, jhubbard, vishal.moola,
mathieu.desnoyers, dhowells, jgg, sidhartha.kumar,
andriy.shevchenko, yangxingui, keescook, linux-kernel,
linux-fsdevel, linux-mm, kernel-team, surenb
The issue this patchset is trying to address is mmap_lock contention when
a low priority task (monitoring, data collecting, etc.) blocks a higher
priority task from making updates to the address space. The contention is
due to the mmap_lock being held for read when reading proc/pid/maps.
With the introduction of maple_tree, VMA tree traversals are RCU-safe, and
per-vma locks make VMA access RCU-safe. This provides an opportunity for
lock-less reading of proc/pid/maps. We still need to overcome a couple of
obstacles:
1. Make all VMA pointer fields used for proc/pid/maps content generation
RCU-safe;
2. Ensure that proc/pid/maps data tearing, which is currently possible at
page boundaries only, does not get worse.
The patchset deals with these issues but there is a downside which I would
like to get input on:
This change introduces unfairness towards the reader of proc/pid/maps,
which can be blocked by an overly active/malicious address space modifier.
A couple of ways I thought we could address this issue are:
1. After several lock-less retries (or some time limit), fall back to
taking mmap_lock.
2. Employ lock-less reading only if the reader has low priority,
indicating that blocking it is not critical.
3. Introduce a separate procfs file which publishes the same data in a
lock-less manner.
I imagine a combination of these approaches can also be employed.
I would like to get feedback on this from the Linux community.
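To illustrate option 1 above, here is a rough user-space sketch of a bounded
lock-less retry loop with a fallback. All names are illustrative inventions of
this sketch: `seq` stands in for the kernel's mm_lock_seq, and this is a
single-threaded model, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of option 1: a writer bumps `seq` on every
 * modification; the reader retries locklessly a bounded number of
 * times before falling back to "taking mmap_lock". */
struct addr_space {
	unsigned int seq;	/* incremented by writers */
	int data;		/* stands in for the maps content */
};

#define MAX_LOCKLESS_RETRIES 3

typedef void (*writer_fn)(struct addr_space *);

static int read_maps(struct addr_space *as, writer_fn racing_writer)
{
	for (int attempt = 0; attempt < MAX_LOCKLESS_RETRIES; attempt++) {
		unsigned int seq = as->seq;	/* smp_load_acquire() in real code */
		int snapshot = as->data;	/* generate output from the VMA tree */

		if (racing_writer)
			racing_writer(as);	/* a writer may slip in here */
		if (as->seq == seq)		/* validate: no concurrent update */
			return snapshot;
	}
	/* Fall back: a blocking read under mmap_lock always succeeds. */
	return as->data;
}

/* A pathological writer that invalidates every lock-less attempt. */
static void always_racing(struct addr_space *as)
{
	as->seq++;
	as->data++;
}
```

With a quiet address space the first lock-less attempt succeeds; with the
pathological writer the reader burns its retries and falls back, which bounds
how long a malicious modifier can block it.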
Note: the mmap_read_lock/mmap_read_unlock sequence inside validate_map()
can be replaced with the more efficient rwsem_wait() proposed by Matthew
in [1].
[1] https://lore.kernel.org/all/ZZ1+ZicgN8dZ3zj3@casper.infradead.org/
Suren Baghdasaryan (3):
mm: make vm_area_struct anon_name field RCU-safe
seq_file: add validate() operation to seq_operations
mm/maps: read proc/pid/maps under RCU
fs/proc/internal.h | 3 +
fs/proc/task_mmu.c | 130 ++++++++++++++++++++++++++++++++++----
fs/seq_file.c | 24 ++++++-
include/linux/mm_inline.h | 10 ++-
include/linux/mm_types.h | 3 +-
include/linux/seq_file.h | 1 +
mm/madvise.c | 30 +++++++--
7 files changed, 181 insertions(+), 20 deletions(-)
--
2.43.0.381.gb435a96ce8-goog
^ permalink raw reply [flat|nested] 9+ messages in thread
* [RFC 1/3] mm: make vm_area_struct anon_name field RCU-safe
2024-01-15 18:38 [RFC 0/3] reading proc/pid/maps under RCU Suren Baghdasaryan
@ 2024-01-15 18:38 ` Suren Baghdasaryan
2024-01-15 18:38 ` [RFC 2/3] seq_file: add validate() operation to seq_operations Suren Baghdasaryan
` (2 subsequent siblings)
3 siblings, 0 replies; 9+ messages in thread
From: Suren Baghdasaryan @ 2024-01-15 18:38 UTC (permalink / raw)
To: akpm
Cc: viro, brauner, jack, dchinner, casey, ben.wolsieffer, paulmck,
david, avagin, usama.anjum, peterx, hughd, ryan.roberts,
wangkefeng.wang, Liam.Howlett, yuzhao, axelrasmussen, lstoakes,
talumbau, willy, vbabka, mgorman, jhubbard, vishal.moola,
mathieu.desnoyers, dhowells, jgg, sidhartha.kumar,
andriy.shevchenko, yangxingui, keescook, linux-kernel,
linux-fsdevel, linux-mm, kernel-team, surenb
For lockless /proc/pid/maps reading we have to ensure all the fields
used when generating the output are RCU-safe. The only pointer fields
in vm_area_struct which are used to generate that file's output are
vm_file and anon_name. vm_file is RCU-safe but anon_name is not. Make
anon_name RCU-safe as well.
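The lifetime rule that the new anon_vma_name_get_rcu() helper relies on — a
reader found the object via rcu_dereference() and must not take a reference if
the refcount already hit zero, because kfree_rcu() is pending — can be modeled
in user space roughly as follows. The toy_* names are inventions of this
single-threaded sketch, not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of the RCU + refcount lifetime rule: a reader racing with
 * the final anon_vma_name_put() must refuse to resurrect a refcount
 * that has already dropped to zero. */
struct toy_name {
	int refcount;	/* struct kref in the kernel */
};

static bool toy_get_unless_zero(struct toy_name *n)
{
	if (n->refcount == 0)	/* final put already ran; kfree_rcu pending */
		return false;	/* caller must behave as if the name was NULL */
	n->refcount++;		/* atomic (kref_get_unless_zero) in the kernel */
	return true;
}

/* Mirrors anon_vma_name_get_rcu(): NULL name or a lost race both
 * report "no name" to the caller. */
static struct toy_name *toy_name_get_rcu(struct toy_name *name)
{
	if (!name)
		return NULL;
	return toy_get_unless_zero(name) ? name : NULL;
}
```

This is why plain kref_get() would be unsafe here: between rcu_dereference()
and the get, the last reference may be dropped, and only the get-unless-zero
form detects that.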
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
include/linux/mm_inline.h | 10 +++++++++-
include/linux/mm_types.h | 3 ++-
mm/madvise.c | 30 ++++++++++++++++++++++++++----
3 files changed, 37 insertions(+), 6 deletions(-)
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index f4fe593c1400..bbdb0ca857f1 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -389,7 +389,7 @@ static inline void dup_anon_vma_name(struct vm_area_struct *orig_vma,
struct anon_vma_name *anon_name = anon_vma_name(orig_vma);
if (anon_name)
- new_vma->anon_name = anon_vma_name_reuse(anon_name);
+ rcu_assign_pointer(new_vma->anon_name, anon_vma_name_reuse(anon_name));
}
static inline void free_anon_vma_name(struct vm_area_struct *vma)
@@ -411,6 +411,8 @@ static inline bool anon_vma_name_eq(struct anon_vma_name *anon_name1,
!strcmp(anon_name1->name, anon_name2->name);
}
+struct anon_vma_name *anon_vma_name_get_rcu(struct vm_area_struct *vma);
+
#else /* CONFIG_ANON_VMA_NAME */
static inline void anon_vma_name_get(struct anon_vma_name *anon_name) {}
static inline void anon_vma_name_put(struct anon_vma_name *anon_name) {}
@@ -424,6 +426,12 @@ static inline bool anon_vma_name_eq(struct anon_vma_name *anon_name1,
return true;
}
+static inline
+struct anon_vma_name *anon_vma_name_get_rcu(struct vm_area_struct *vma)
+{
+ return NULL;
+}
+
#endif /* CONFIG_ANON_VMA_NAME */
static inline void init_tlb_flush_pending(struct mm_struct *mm)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index b2d3a88a34d1..1f0a30c00795 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -545,6 +545,7 @@ struct vm_userfaultfd_ctx {};
struct anon_vma_name {
struct kref kref;
+ struct rcu_head rcu;
/* The name needs to be at the end because it is dynamically sized. */
char name[];
};
@@ -699,7 +700,7 @@ struct vm_area_struct {
* terminated string containing the name given to the vma, or NULL if
* unnamed. Serialized by mmap_lock. Use anon_vma_name to access.
*/
- struct anon_vma_name *anon_name;
+ struct anon_vma_name __rcu *anon_name;
#endif
#ifdef CONFIG_SWAP
atomic_long_t swap_readahead_info;
diff --git a/mm/madvise.c b/mm/madvise.c
index 912155a94ed5..0f222d464254 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -88,14 +88,15 @@ void anon_vma_name_free(struct kref *kref)
{
struct anon_vma_name *anon_name =
container_of(kref, struct anon_vma_name, kref);
- kfree(anon_name);
+ kfree_rcu(anon_name, rcu);
}
struct anon_vma_name *anon_vma_name(struct vm_area_struct *vma)
{
mmap_assert_locked(vma->vm_mm);
- return vma->anon_name;
+ return rcu_dereference_protected(vma->anon_name,
+ rwsem_is_locked(&vma->vm_mm->mmap_lock));
}
/* mmap_lock should be write-locked */
@@ -105,7 +106,7 @@ static int replace_anon_vma_name(struct vm_area_struct *vma,
struct anon_vma_name *orig_name = anon_vma_name(vma);
if (!anon_name) {
- vma->anon_name = NULL;
+ rcu_assign_pointer(vma->anon_name, NULL);
anon_vma_name_put(orig_name);
return 0;
}
@@ -113,11 +114,32 @@ static int replace_anon_vma_name(struct vm_area_struct *vma,
if (anon_vma_name_eq(orig_name, anon_name))
return 0;
- vma->anon_name = anon_vma_name_reuse(anon_name);
+ rcu_assign_pointer(vma->anon_name, anon_vma_name_reuse(anon_name));
anon_vma_name_put(orig_name);
return 0;
}
+
+/*
+ * Returned anon_vma_name is stable due to elevated refcount but not guaranteed
+ * to be assigned to the original VMA after the call.
+ */
+struct anon_vma_name *anon_vma_name_get_rcu(struct vm_area_struct *vma)
+{
+ struct anon_vma_name *anon_name;
+
+ WARN_ON_ONCE(!rcu_read_lock_held());
+
+ anon_name = rcu_dereference(vma->anon_name);
+ if (!anon_name)
+ return NULL;
+
+ if (unlikely(!kref_get_unless_zero(&anon_name->kref)))
+ return NULL;
+
+ return anon_name;
+}
+
#else /* CONFIG_ANON_VMA_NAME */
static int replace_anon_vma_name(struct vm_area_struct *vma,
struct anon_vma_name *anon_name)
--
2.43.0.381.gb435a96ce8-goog
* [RFC 2/3] seq_file: add validate() operation to seq_operations
2024-01-15 18:38 [RFC 0/3] reading proc/pid/maps under RCU Suren Baghdasaryan
2024-01-15 18:38 ` [RFC 1/3] mm: make vm_area_struct anon_name field RCU-safe Suren Baghdasaryan
@ 2024-01-15 18:38 ` Suren Baghdasaryan
2024-01-15 18:38 ` [RFC 3/3] mm/maps: read proc/pid/maps under RCU Suren Baghdasaryan
2024-01-16 14:42 ` [RFC 0/3] reading " Vlastimil Babka
3 siblings, 0 replies; 9+ messages in thread
From: Suren Baghdasaryan @ 2024-01-15 18:38 UTC (permalink / raw)
To: akpm
Cc: viro, brauner, jack, dchinner, casey, ben.wolsieffer, paulmck,
david, avagin, usama.anjum, peterx, hughd, ryan.roberts,
wangkefeng.wang, Liam.Howlett, yuzhao, axelrasmussen, lstoakes,
talumbau, willy, vbabka, mgorman, jhubbard, vishal.moola,
mathieu.desnoyers, dhowells, jgg, sidhartha.kumar,
andriy.shevchenko, yangxingui, keescook, linux-kernel,
linux-fsdevel, linux-mm, kernel-team, surenb
seq_file outputs data in chunks, using seq_file.buf as the intermediate
storage before outputting the generated data for the current chunk. It is
possible for already buffered data to become stale before it gets reported.
In certain situations it is desirable to regenerate that data instead of
reporting the stale version. Provide a validate() operation, called before
outputting the buffered data, to allow users to validate the buffered data.
To indicate valid data, the user's validate callback should return 0; to
request regeneration of the stale data it should return -EAGAIN; any other
error will be considered fatal and the read operation will be aborted.
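The control flow this adds to seq_read_iter() can be sketched in user space as
follows. This is an illustrative model only: EAGAIN_SIM, chunk_source and the
helper names are inventions of the sketch, standing in for -EAGAIN and the
seq_operations callbacks.

```c
#include <assert.h>
#include <string.h>

#define EAGAIN_SIM 11	/* stands in for the kernel's -EAGAIN */

struct chunk_source {
	const char *data;
	int stale_generations;	/* how many times the buffer goes stale */
	int regenerations;	/* observed retry count */
};

/* Stands in for op->validate: 0 means publish, -EAGAIN means the
 * buffered data went stale and must be regenerated. */
static int validate(struct chunk_source *s)
{
	if (s->stale_generations > 0) {
		s->stale_generations--;
		return -EAGAIN_SIM;
	}
	return 0;
}

static int read_chunk(struct chunk_source *s, char *buf, size_t len)
{
	int err;

	do {
		strncpy(buf, s->data, len - 1);	/* ->start/->show fill the buffer */
		buf[len - 1] = '\0';
		err = validate(s);		/* ->validate before copy_to_iter() */
		if (err == -EAGAIN_SIM)
			s->regenerations++;	/* the "goto Again" path */
	} while (err == -EAGAIN_SIM);
	return err;				/* other errors abort the read */
}
```

The loop makes the fairness concern from the cover letter visible: a source
that keeps reporting stale generations keeps the reader regenerating.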
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
fs/seq_file.c | 24 +++++++++++++++++++++++-
include/linux/seq_file.h | 1 +
2 files changed, 24 insertions(+), 1 deletion(-)
diff --git a/fs/seq_file.c b/fs/seq_file.c
index f5fdaf3b1572..77833bbe5909 100644
--- a/fs/seq_file.c
+++ b/fs/seq_file.c
@@ -172,6 +172,8 @@ ssize_t seq_read_iter(struct kiocb *iocb, struct iov_iter *iter)
{
struct seq_file *m = iocb->ki_filp->private_data;
size_t copied = 0;
+ loff_t orig_index;
+ size_t orig_count;
size_t n;
void *p;
int err = 0;
@@ -220,6 +222,10 @@ ssize_t seq_read_iter(struct kiocb *iocb, struct iov_iter *iter)
if (m->count) // hadn't managed to copy everything
goto Done;
}
+
+ orig_index = m->index;
+ orig_count = m->count;
+Again:
// get a non-empty record in the buffer
m->from = 0;
p = m->op->start(m, &m->index);
@@ -278,6 +284,22 @@ ssize_t seq_read_iter(struct kiocb *iocb, struct iov_iter *iter)
}
}
m->op->stop(m, p);
+ /* Note: we validate even if err < 0 to avoid publishing stale buffered data */
+ if (m->op->validate) {
+ int val_err = m->op->validate(m, p);
+
+ if (val_err) {
+ if (val_err == -EAGAIN) {
+ m->index = orig_index;
+ m->count = orig_count;
+ // data is stale, retry
+ goto Again;
+ }
+ // data is invalid, return the last error
+ err = val_err;
+ goto Done;
+ }
+ }
n = copy_to_iter(m->buf, m->count, iter);
copied += n;
m->count -= n;
@@ -572,7 +594,7 @@ static void single_stop(struct seq_file *p, void *v)
int single_open(struct file *file, int (*show)(struct seq_file *, void *),
void *data)
{
- struct seq_operations *op = kmalloc(sizeof(*op), GFP_KERNEL_ACCOUNT);
+ struct seq_operations *op = kzalloc(sizeof(*op), GFP_KERNEL_ACCOUNT);
int res = -ENOMEM;
if (op) {
diff --git a/include/linux/seq_file.h b/include/linux/seq_file.h
index 234bcdb1fba4..d0fefac2990f 100644
--- a/include/linux/seq_file.h
+++ b/include/linux/seq_file.h
@@ -34,6 +34,7 @@ struct seq_operations {
void (*stop) (struct seq_file *m, void *v);
void * (*next) (struct seq_file *m, void *v, loff_t *pos);
int (*show) (struct seq_file *m, void *v);
+ int (*validate)(struct seq_file *m, void *v);
};
#define SEQ_SKIP 1
--
2.43.0.381.gb435a96ce8-goog
* [RFC 3/3] mm/maps: read proc/pid/maps under RCU
2024-01-15 18:38 [RFC 0/3] reading proc/pid/maps under RCU Suren Baghdasaryan
2024-01-15 18:38 ` [RFC 1/3] mm: make vm_area_struct anon_name field RCU-safe Suren Baghdasaryan
2024-01-15 18:38 ` [RFC 2/3] seq_file: add validate() operation to seq_operations Suren Baghdasaryan
@ 2024-01-15 18:38 ` Suren Baghdasaryan
2024-01-16 14:42 ` [RFC 0/3] reading " Vlastimil Babka
3 siblings, 0 replies; 9+ messages in thread
From: Suren Baghdasaryan @ 2024-01-15 18:38 UTC (permalink / raw)
To: akpm
Cc: viro, brauner, jack, dchinner, casey, ben.wolsieffer, paulmck,
david, avagin, usama.anjum, peterx, hughd, ryan.roberts,
wangkefeng.wang, Liam.Howlett, yuzhao, axelrasmussen, lstoakes,
talumbau, willy, vbabka, mgorman, jhubbard, vishal.moola,
mathieu.desnoyers, dhowells, jgg, sidhartha.kumar,
andriy.shevchenko, yangxingui, keescook, linux-kernel,
linux-fsdevel, linux-mm, kernel-team, surenb
With maple_tree supporting VMA tree traversal under RCU and per-vma locks
making VMA access RCU-safe, /proc/pid/maps can be read under RCU and
without the need to read-lock mmap_lock. However, VMA content can change
from under us, therefore we need to pin the pointer fields used when
generating the output (currently only vm_file and anon_name).
In addition, we validate data before publishing it to the user using the
new seq_file validate() interface. This way we keep this mechanism
consistent with the previous behavior, where data tearing is possible only
at page boundaries.
This change is designed to reduce mmap_lock contention and prevent a
process reading /proc/pid/maps files (often a low priority task, such as
monitoring/data collection services) from blocking address space updates.
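The sequence-number check that validate_map() performs can be sketched in user
space as follows. mm_model, maps_reader and the -1 error code are inventions
of this sketch; the kernel side uses mm->mm_lock_seq, smp_load_acquire() and
-EAGAIN.

```c
#include <assert.h>

/* Sketch of the validate_map() idea: snapshot a modification sequence
 * number when the lock-less walk starts, and declare the buffered
 * output stale if any writer bumped it in the meantime. */
struct mm_model {
	unsigned int mm_lock_seq;	/* bumped on each mmap_lock write cycle */
};

struct maps_reader {
	unsigned int saved_seq;
};

static void reader_start(struct maps_reader *r, const struct mm_model *mm)
{
	r->saved_seq = mm->mm_lock_seq;	/* smp_load_acquire() in the patch */
}

/* 0: buffered data may be published; -1: discard and retry (the patch
 * first waits for the writer via an mmap_read_lock/unlock pair, or the
 * proposed rwsem_wait(), before retrying). */
static int reader_validate(const struct maps_reader *r,
			   const struct mm_model *mm)
{
	return mm->mm_lock_seq == r->saved_seq ? 0 : -1;
}
```

A stable sequence number means no writer took mmap_lock for write during the
walk, so the buffered chunk is at least as consistent as a read taken under
the lock.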
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
fs/proc/internal.h | 3 ++
fs/proc/task_mmu.c | 130 ++++++++++++++++++++++++++++++++++++++++-----
2 files changed, 120 insertions(+), 13 deletions(-)
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index a71ac5379584..47233408550b 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -290,6 +290,9 @@ struct proc_maps_private {
struct task_struct *task;
struct mm_struct *mm;
struct vma_iterator iter;
+ int mm_lock_seq;
+ struct anon_vma_name *anon_name;
+ struct file *vm_file;
#ifdef CONFIG_NUMA
struct mempolicy *task_mempolicy;
#endif
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 62b16f42d5d2..d4305cfdca58 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -141,6 +141,22 @@ static struct vm_area_struct *proc_get_vma(struct proc_maps_private *priv,
return vma;
}
+static const struct seq_operations proc_pid_maps_op;
+
+static inline bool needs_mmap_lock(struct seq_file *m)
+{
+#ifdef CONFIG_PER_VMA_LOCK
+ /*
+ * smaps and numa_maps perform page table walk, therefore require
+ * mmap_lock but maps can be read under RCU.
+ */
+ return m->op != &proc_pid_maps_op;
+#else
+ /* Without per-vma locks VMA access is not RCU-safe */
+ return true;
+#endif
+}
+
static void *m_start(struct seq_file *m, loff_t *ppos)
{
struct proc_maps_private *priv = m->private;
@@ -162,11 +178,17 @@ static void *m_start(struct seq_file *m, loff_t *ppos)
return NULL;
}
- if (mmap_read_lock_killable(mm)) {
- mmput(mm);
- put_task_struct(priv->task);
- priv->task = NULL;
- return ERR_PTR(-EINTR);
+ if (needs_mmap_lock(m)) {
+ if (mmap_read_lock_killable(mm)) {
+ mmput(mm);
+ put_task_struct(priv->task);
+ priv->task = NULL;
+ return ERR_PTR(-EINTR);
+ }
+ } else {
+ /* For memory barrier see the comment for mm_lock_seq in mm_struct */
+ priv->mm_lock_seq = smp_load_acquire(&priv->mm->mm_lock_seq);
+ rcu_read_lock();
}
vma_iter_init(&priv->iter, mm, last_addr);
@@ -195,7 +217,10 @@ static void m_stop(struct seq_file *m, void *v)
return;
release_task_mempolicy(priv);
- mmap_read_unlock(mm);
+ if (needs_mmap_lock(m))
+ mmap_read_unlock(mm);
+ else
+ rcu_read_unlock();
mmput(mm);
put_task_struct(priv->task);
priv->task = NULL;
@@ -283,8 +308,10 @@ show_map_vma(struct seq_file *m, struct vm_area_struct *vma)
start = vma->vm_start;
end = vma->vm_end;
show_vma_header_prefix(m, start, end, flags, pgoff, dev, ino);
- if (mm)
- anon_name = anon_vma_name(vma);
+ if (mm) {
+ anon_name = needs_mmap_lock(m) ? anon_vma_name(vma) :
+ anon_vma_name_get_rcu(vma);
+ }
/*
* Print the dentry name for named mappings, and a
@@ -338,19 +365,96 @@ show_map_vma(struct seq_file *m, struct vm_area_struct *vma)
seq_puts(m, name);
}
seq_putc(m, '\n');
+ if (anon_name && !needs_mmap_lock(m))
+ anon_vma_name_put(anon_name);
+}
+
+/*
+ * Pin vm_area_struct fields used by show_map_vma. We also copy pinned fields
+ * into proc_maps_private because by the time put_vma_fields() is called, VMA
+ * might have changed and these fields might be pointing to different objects.
+ */
+static bool get_vma_fields(struct vm_area_struct *vma, struct proc_maps_private *priv)
+{
+ if (vma->vm_file) {
+ priv->vm_file = get_file_rcu(&vma->vm_file);
+ if (!priv->vm_file)
+ return false;
+ } else {
+ priv->vm_file = NULL;
+ }
+
+ if (vma->anon_name) {
+ priv->anon_name = anon_vma_name_get_rcu(vma);
+ if (!priv->anon_name) {
+ /* Name was freed from under us; drop the pinned file and bail out */
+ if (priv->vm_file)
+ fput(priv->vm_file);
+ return false;
+ }
+ } else {
+ priv->anon_name = NULL;
+ }
+
+ return true;
+}
+
+static void put_vma_fields(struct proc_maps_private *priv)
+{
+ if (priv->anon_name)
+ anon_vma_name_put(priv->anon_name);
+ if (priv->vm_file)
+ fput(priv->vm_file);
}
static int show_map(struct seq_file *m, void *v)
{
- show_map_vma(m, v);
+ struct proc_maps_private *priv = m->private;
+
+ if (needs_mmap_lock(m))
+ show_map_vma(m, v);
+ else {
+ /*
+ * Stop immediately if the VMA changed from under us.
+ * Validation step will prevent publishing already cached data.
+ */
+ if (!get_vma_fields(v, priv))
+ return -EAGAIN;
+
+ show_map_vma(m, v);
+ put_vma_fields(priv);
+ }
+
return 0;
}
+static int validate_map(struct seq_file *m, void *v)
+{
+ if (!needs_mmap_lock(m)) {
+ struct proc_maps_private *priv = m->private;
+ int mm_lock_seq;
+
+ /* For memory barrier see the comment for mm_lock_seq in mm_struct */
+ mm_lock_seq = smp_load_acquire(&priv->mm->mm_lock_seq);
+ if (mm_lock_seq != priv->mm_lock_seq) {
+ /*
+ * mmap_lock contention is detected. Wait for mmap_lock
+ * write to be released, discard stale data and retry.
+ */
+ mmap_read_lock(priv->mm);
+ mmap_read_unlock(priv->mm);
+ return -EAGAIN;
+ }
+ }
+ return 0;
+
+}
+
static const struct seq_operations proc_pid_maps_op = {
- .start = m_start,
- .next = m_next,
- .stop = m_stop,
- .show = show_map
+ .start = m_start,
+ .next = m_next,
+ .stop = m_stop,
+ .show = show_map,
+ .validate = validate_map,
};
static int pid_maps_open(struct inode *inode, struct file *file)
--
2.43.0.381.gb435a96ce8-goog
* Re: [RFC 0/3] reading proc/pid/maps under RCU
2024-01-15 18:38 [RFC 0/3] reading proc/pid/maps under RCU Suren Baghdasaryan
` (2 preceding siblings ...)
2024-01-15 18:38 ` [RFC 3/3] mm/maps: read proc/pid/maps under RCU Suren Baghdasaryan
@ 2024-01-16 14:42 ` Vlastimil Babka
2024-01-16 14:46 ` Vlastimil Babka
3 siblings, 1 reply; 9+ messages in thread
From: Vlastimil Babka @ 2024-01-16 14:42 UTC (permalink / raw)
To: Suren Baghdasaryan, akpm
Cc: viro, brauner, jack, dchinner, casey, ben.wolsieffer, paulmck,
david, avagin, usama.anjum, peterx, hughd, ryan.roberts,
wangkefeng.wang, Liam.Howlett, yuzhao, axelrasmussen, lstoakes,
talumbau, willy, mgorman, jhubbard, vishal.moola,
mathieu.desnoyers, dhowells, jgg, sidhartha.kumar,
andriy.shevchenko, yangxingui, keescook, linux-kernel,
linux-fsdevel, linux-mm, kernel-team
On 1/15/24 19:38, Suren Baghdasaryan wrote:
Hi,
> The issue this patchset is trying to address is mmap_lock contention when
> a low priority task (monitoring, data collecting, etc.) blocks a higher
> priority task from making updates to the address space. The contention is
> due to the mmap_lock being held for read when reading proc/pid/maps.
> With the introduction of maple_tree, VMA tree traversals are RCU-safe, and
> per-vma locks make VMA access RCU-safe. This provides an opportunity for
> lock-less reading of proc/pid/maps. We still need to overcome a couple of
> obstacles:
> 1. Make all VMA pointer fields used for proc/pid/maps content generation
> RCU-safe;
> 2. Ensure that proc/pid/maps data tearing, which is currently possible at
> page boundaries only, does not get worse.
Hm, I thought we were only to choose this more complicated approach in case
additional tearing becomes a problem, and at first assume that if software
can deal with page boundary tearing, it can deal with sub-page tearing too?
> The patchset deals with these issues but there is a downside which I would
> like to get input on:
> This change introduces unfairness towards the reader of proc/pid/maps,
> which can be blocked by an overly active/malicious address space modifier.
So this is a consequence of the validate() operation, right? We could avoid
this if we allowed sub-page tearing.
> A couple of ways I thought we could address this issue are:
> 1. After several lock-less retries (or some time limit), fall back to
> taking mmap_lock.
> 2. Employ lock-less reading only if the reader has low priority,
> indicating that blocking it is not critical.
> 3. Introduce a separate procfs file which publishes the same data in a
> lock-less manner.
>
> I imagine a combination of these approaches can also be employed.
> I would like to get feedback on this from the Linux community.
>
> Note: the mmap_read_lock/mmap_read_unlock sequence inside validate_map()
> can be replaced with the more efficient rwsem_wait() proposed by Matthew
> in [1].
>
> [1] https://lore.kernel.org/all/ZZ1+ZicgN8dZ3zj3@casper.infradead.org/
>
> Suren Baghdasaryan (3):
> mm: make vm_area_struct anon_name field RCU-safe
> seq_file: add validate() operation to seq_operations
> mm/maps: read proc/pid/maps under RCU
>
> fs/proc/internal.h | 3 +
> fs/proc/task_mmu.c | 130 ++++++++++++++++++++++++++++++++++----
> fs/seq_file.c | 24 ++++++-
> include/linux/mm_inline.h | 10 ++-
> include/linux/mm_types.h | 3 +-
> include/linux/seq_file.h | 1 +
> mm/madvise.c | 30 +++++++--
> 7 files changed, 181 insertions(+), 20 deletions(-)
>
* Re: [RFC 0/3] reading proc/pid/maps under RCU
2024-01-16 14:42 ` [RFC 0/3] reading " Vlastimil Babka
@ 2024-01-16 14:46 ` Vlastimil Babka
2024-01-16 17:57 ` Suren Baghdasaryan
0 siblings, 1 reply; 9+ messages in thread
From: Vlastimil Babka @ 2024-01-16 14:46 UTC (permalink / raw)
To: Suren Baghdasaryan, akpm
Cc: viro, brauner, jack, dchinner, casey, ben.wolsieffer, paulmck,
david, avagin, usama.anjum, peterx, hughd, ryan.roberts,
wangkefeng.wang, Liam.Howlett, yuzhao, axelrasmussen, lstoakes,
talumbau, willy, mgorman, jhubbard, vishal.moola,
mathieu.desnoyers, dhowells, jgg, sidhartha.kumar,
andriy.shevchenko, yangxingui, keescook, linux-kernel,
linux-fsdevel, linux-mm, kernel-team
On 1/16/24 15:42, Vlastimil Babka wrote:
> On 1/15/24 19:38, Suren Baghdasaryan wrote:
>
> Hi,
>
>> The issue this patchset is trying to address is mmap_lock contention when
>> a low priority task (monitoring, data collecting, etc.) blocks a higher
>> priority task from making updates to the address space. The contention is
>> due to the mmap_lock being held for read when reading proc/pid/maps.
>> With the introduction of maple_tree, VMA tree traversals are RCU-safe, and
>> per-vma locks make VMA access RCU-safe. This provides an opportunity for
>> lock-less reading of proc/pid/maps. We still need to overcome a couple of
>> obstacles:
>> 1. Make all VMA pointer fields used for proc/pid/maps content generation
>> RCU-safe;
>> 2. Ensure that proc/pid/maps data tearing, which is currently possible at
>> page boundaries only, does not get worse.
>
> Hm, I thought we were only to choose this more complicated approach in case
> additional tearing becomes a problem, and at first assume that if software
> can deal with page boundary tearing, it can deal with sub-page tearing too?
>
>> The patchset deals with these issues but there is a downside which I would
>> like to get input on:
>> This change introduces unfairness towards the reader of proc/pid/maps,
>> which can be blocked by an overly active/malicious address space modifier.
>
> So this is a consequence of the validate() operation, right? We could avoid
> this if we allowed sub-page tearing.
>
>> A couple of ways I thought we could address this issue are:
>> 1. After several lock-less retries (or some time limit), fall back to
>> taking mmap_lock.
>> 2. Employ lock-less reading only if the reader has low priority,
>> indicating that blocking it is not critical.
>> 3. Introduce a separate procfs file which publishes the same data in a
>> lock-less manner.
Oh, and if this option 3 becomes necessary, then such a new file shouldn't
validate() either, and whoever wants to avoid the reader contention and
converts their monitoring to the new file will have to account for this
possible extra tearing from the start. So I would suggest trying to change
the existing file with no validate() first, and if existing userspace gets
broken, employ option 3. This would mean no validate() in either case?
>> I imagine a combination of these approaches can also be employed.
>> I would like to get feedback on this from the Linux community.
>>
>> Note: the mmap_read_lock/mmap_read_unlock sequence inside validate_map()
>> can be replaced with the more efficient rwsem_wait() proposed by Matthew
>> in [1].
>>
>> [1] https://lore.kernel.org/all/ZZ1+ZicgN8dZ3zj3@casper.infradead.org/
>>
>> Suren Baghdasaryan (3):
>> mm: make vm_area_struct anon_name field RCU-safe
>> seq_file: add validate() operation to seq_operations
>> mm/maps: read proc/pid/maps under RCU
>>
>> fs/proc/internal.h | 3 +
>> fs/proc/task_mmu.c | 130 ++++++++++++++++++++++++++++++++++----
>> fs/seq_file.c | 24 ++++++-
>> include/linux/mm_inline.h | 10 ++-
>> include/linux/mm_types.h | 3 +-
>> include/linux/seq_file.h | 1 +
>> mm/madvise.c | 30 +++++++--
>> 7 files changed, 181 insertions(+), 20 deletions(-)
>>
>
* Re: [RFC 0/3] reading proc/pid/maps under RCU
2024-01-16 14:46 ` Vlastimil Babka
@ 2024-01-16 17:57 ` Suren Baghdasaryan
2024-01-18 17:58 ` Suren Baghdasaryan
0 siblings, 1 reply; 9+ messages in thread
From: Suren Baghdasaryan @ 2024-01-16 17:57 UTC (permalink / raw)
To: Vlastimil Babka
Cc: akpm, viro, brauner, jack, dchinner, casey, ben.wolsieffer,
paulmck, david, avagin, usama.anjum, peterx, hughd, ryan.roberts,
wangkefeng.wang, Liam.Howlett, yuzhao, axelrasmussen, lstoakes,
talumbau, willy, mgorman, jhubbard, vishal.moola,
mathieu.desnoyers, dhowells, jgg, sidhartha.kumar,
andriy.shevchenko, yangxingui, keescook, linux-kernel,
linux-fsdevel, linux-mm, kernel-team
On Tue, Jan 16, 2024 at 6:46 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 1/16/24 15:42, Vlastimil Babka wrote:
> > On 1/15/24 19:38, Suren Baghdasaryan wrote:
> >
> > Hi,
> >
> >> The issue this patchset is trying to address is mmap_lock contention when
> >> a low priority task (monitoring, data collecting, etc.) blocks a higher
> >> priority task from making updates to the address space. The contention is
> >> due to the mmap_lock being held for read when reading proc/pid/maps.
> >> With the introduction of maple_tree, VMA tree traversals are RCU-safe, and
> >> per-vma locks make VMA access RCU-safe. This provides an opportunity for
> >> lock-less reading of proc/pid/maps. We still need to overcome a couple of
> >> obstacles:
> >> 1. Make all VMA pointer fields used for proc/pid/maps content generation
> >> RCU-safe;
> >> 2. Ensure that proc/pid/maps data tearing, which is currently possible at
> >> page boundaries only, does not get worse.
> >
> > Hm, I thought we were only to choose this more complicated approach in case
> > additional tearing becomes a problem, and at first assume that if software
> > can deal with page boundary tearing, it can deal with sub-page tearing too?
Hi Vlastimil,
Thanks for the feedback!
Yes, originally I thought we wouldn't be able to avoid additional
tearing without a big change, but then realized it's not that hard, so
I tried to keep the change in behavior transparent to userspace.
> >
> >> The patchset deals with these issues but there is a downside which I would
> >> like to get input on:
> >> This change introduces unfairness towards the reader of proc/pid/maps,
> >> which can be blocked by an overly active/malicious address space modifier.
> >
> > So this is a consequence of the validate() operation, right? We could avoid
> > this if we allowed sub-page tearing.
Yes, if we don't care about sub-page tearing then we could get rid of
the validate step, and this issue with updaters blocking the reader would
go away. If we choose that direction there will be one more issue to
fix, namely the temporary inconsistent maple_tree state when a VMA is
replaced with another one and we might observe NULL there. We might be
able to use Matthew's rwsem_wait() to deal with that issue.
> >
> >> A couple of ways I thought we could address this issue are:
> >> 1. After several lock-less retries (or some time limit), fall back to
> >> taking mmap_lock.
> >> 2. Employ lock-less reading only if the reader has low priority,
> >> indicating that blocking it is not critical.
> >> 3. Introduce a separate procfs file which publishes the same data in a
> >> lock-less manner.
>
> Oh and if this option 3 becomes necessary, then such new file shouldn't
> validate() either, and whoever wants to avoid the reader contention and
> converts their monitoring to the new file will have to account for this
> possible extra tearing from the start. So I would suggest trying to change
> the existing file with no validate() first, and if existing userspace gets
> broken, employ option 3. This would mean no validate() in either case?
Yes, but I was trying to avoid introducing an additional file which
publishes the same content in a slightly different way. We would have
to explain when userspace should use one vs. the other, and that would
require going into low-level implementation details, I think. I don't
know if that's acceptable/preferable.
Thanks,
Suren.
>
> >> I imagine a combination of these approaches can also be employed.
> >> I would like to get feedback on this from the Linux community.
> >>
> >> Note: the mmap_read_lock/mmap_read_unlock sequence inside validate_map()
> >> can be replaced with the more efficient rwsem_wait() proposed by Matthew
> >> in [1].
> >>
> >> [1] https://lore.kernel.org/all/ZZ1+ZicgN8dZ3zj3@casper.infradead.org/
> >>
> >> Suren Baghdasaryan (3):
> >> mm: make vm_area_struct anon_name field RCU-safe
> >> seq_file: add validate() operation to seq_operations
> >> mm/maps: read proc/pid/maps under RCU
> >>
> >> fs/proc/internal.h | 3 +
> >> fs/proc/task_mmu.c | 130 ++++++++++++++++++++++++++++++++++----
> >> fs/seq_file.c | 24 ++++++-
> >> include/linux/mm_inline.h | 10 ++-
> >> include/linux/mm_types.h | 3 +-
> >> include/linux/seq_file.h | 1 +
> >> mm/madvise.c | 30 +++++++--
> >> 7 files changed, 181 insertions(+), 20 deletions(-)
> >>
> >
>
* Re: [RFC 0/3] reading proc/pid/maps under RCU
2024-01-16 17:57 ` Suren Baghdasaryan
@ 2024-01-18 17:58 ` Suren Baghdasaryan
2024-01-22 7:23 ` Suren Baghdasaryan
0 siblings, 1 reply; 9+ messages in thread
From: Suren Baghdasaryan @ 2024-01-18 17:58 UTC (permalink / raw)
To: Vlastimil Babka
Cc: akpm, viro, brauner, jack, dchinner, casey, ben.wolsieffer,
paulmck, david, avagin, usama.anjum, peterx, hughd, ryan.roberts,
wangkefeng.wang, Liam.Howlett, yuzhao, axelrasmussen, lstoakes,
talumbau, willy, mgorman, jhubbard, vishal.moola,
mathieu.desnoyers, dhowells, jgg, sidhartha.kumar,
andriy.shevchenko, yangxingui, keescook, linux-kernel,
linux-fsdevel, linux-mm, kernel-team
On Tue, Jan 16, 2024 at 9:57 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Tue, Jan 16, 2024 at 6:46 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >
> > On 1/16/24 15:42, Vlastimil Babka wrote:
> > > On 1/15/24 19:38, Suren Baghdasaryan wrote:
> > >
> > > Hi,
> > >
> > >> The issue this patchset is trying to address is mmap_lock contention when
> > >> a low priority task (monitoring, data collecting, etc.) blocks a higher
> > >> priority task from making updates to the address space. The contention is
> > >> due to the mmap_lock being held for read when reading proc/pid/maps.
> > >> With maple_tree introduction, VMA tree traversals are RCU-safe and per-vma
> > >> locks make VMA access RCU-safe. This provides an opportunity for lock-less
> > >> reading of proc/pid/maps. We still need to overcome a couple obstacles:
> > >> 1. Make all VMA pointer fields used for proc/pid/maps content generation
> > >> RCU-safe;
> > >> 2. Ensure that proc/pid/maps data tearing, which is currently possible at
> > >> page boundaries only, does not get worse.
> > >
> > > Hm I thought we were to only choose this more complicated approach in case additional
> > > tearing becomes a problem, and at first assume that if software can deal
> > > with page boundary tearing, it can deal with sub-page tearing too?
>
> Hi Vlastimil,
> Thanks for the feedback!
> Yes, originally I thought we wouldn't be able to avoid additional
> tearing without a big change but then realized it's not that hard, so
> I tried to keep the change in behavior transparent to the userspace.
In the absence of other feedback I'm going to implement and post the
originally envisioned approach: remove the validation step and avoid any
possibility of blocking while allowing for sub-page tearing. I will use
Matthew's rwsem_wait() to deal with possible inconsistent maple_tree
state.
Thanks,
Suren.
>
> > >
> > >> The patchset deals with these issues but there is a downside which I would
> > >> like to get input on:
> > >> This change introduces unfairness towards the reader of proc/pid/maps,
> > >> which can be blocked by an overly active/malicious address space modifier.
> > >
> > > So this is a consequence of the validate() operation, right? We could avoid
> > > this if we allowed sub-page tearing.
>
> > Yes, if we don't care about sub-page tearing then we could get rid of
> > the validate step, and this issue with updaters blocking the reader would
> go away. If we choose that direction there will be one more issue to
> fix, namely the maple_tree temporary inconsistent state when a VMA is
> replaced with another one and we might observe NULL there. We might be
> able to use Matthew's rwsem_wait() to deal with that issue.
>
> > >
> > >> A couple of ways I thought we could address this issue are:
> > >> 1. After several lock-less retries (or some time limit), fall back to
> > >> taking mmap_lock.
> > >> 2. Employ lock-less reading only if the reader has low priority,
> > >> indicating that blocking it is not critical.
> > >> 3. Introduce a separate procfs file which publishes the same data in
> > >> a lock-less manner.
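[Editor's sketch of option 1 above, the retry-then-fallback idea: try the lock-less read a bounded number of times, then take mmap_lock for read. This is a minimal userspace illustration under stated assumptions; `read_maps_lockless()`, `read_maps_locked()` and the failure-injection knob are made-up stand-ins, not real kernel APIs.]

```c
#include <stdbool.h>

/* Simulates an overly active writer: how many lock-less attempts fail. */
static int failures_to_inject;

/* Stand-in for a speculative lock-less pass; returns false when a
 * concurrent writer invalidated the snapshot and we must retry. */
static bool read_maps_lockless(void)
{
	if (failures_to_inject > 0) {
		failures_to_inject--;
		return false;
	}
	return true;
}

/* Stand-in for the fallback path: take mmap_lock for read, which is
 * guaranteed to produce a consistent view but may block writers. */
static void read_maps_locked(void)
{
}

#define MAX_LOCKLESS_RETRIES 3

/* Returns true if the lock-less fast path succeeded, false if the
 * reader had to fall back to the blocking path. */
static bool read_maps(void)
{
	for (int i = 0; i < MAX_LOCKLESS_RETRIES; i++)
		if (read_maps_lockless())
			return true;
	read_maps_locked();
	return false;
}
```

[A time limit instead of a retry count would bound latency the same way; the key property is that a malicious modifier can delay the reader only for a bounded window before the reader regains fairness by taking the lock.]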
> >
> > Oh and if this option 3 becomes necessary, then such a new file shouldn't
> > validate() either, and whoever wants to avoid the reader contention and
> > converts their monitoring to the new file will have to account for this
> > possible extra tearing from the start. So I would suggest trying to change
> > the existing file with no validate() first, and if existing userspace gets
> > broken, employ option 3. This would mean no validate() in either case?
>
> Yes, but I was trying to avoid introducing an additional file which
> publishes the same content in a slightly different way. We will have
> to explain when userspace should use one vs the other and that would
> require going into low level implementation details, I think. Don't
> know if that's acceptable/preferable.
> Thanks,
> Suren.
>
> >
> > >> I imagine a combination of these approaches can also be employed.
> > >> I would like to get feedback on this from the Linux community.
> > >>
> > >> Note: mmap_read_lock/mmap_read_unlock sequence inside validate_map()
> > >> can be replaced with the more efficient rwsem_wait() proposed by Matthew
> > >> in [1].
> > >>
> > >> [1] https://lore.kernel.org/all/ZZ1+ZicgN8dZ3zj3@casper.infradead.org/
> > >>
> > >> Suren Baghdasaryan (3):
> > >> mm: make vm_area_struct anon_name field RCU-safe
> > >> seq_file: add validate() operation to seq_operations
> > >> mm/maps: read proc/pid/maps under RCU
> > >>
> > >> fs/proc/internal.h | 3 +
> > >> fs/proc/task_mmu.c | 130 ++++++++++++++++++++++++++++++++++----
> > >> fs/seq_file.c | 24 ++++++-
> > >> include/linux/mm_inline.h | 10 ++-
> > >> include/linux/mm_types.h | 3 +-
> > >> include/linux/seq_file.h | 1 +
> > >> mm/madvise.c | 30 +++++++--
> > >> 7 files changed, 181 insertions(+), 20 deletions(-)
> > >>
> > >
> >
* Re: [RFC 0/3] reading proc/pid/maps under RCU
2024-01-18 17:58 ` Suren Baghdasaryan
@ 2024-01-22 7:23 ` Suren Baghdasaryan
0 siblings, 0 replies; 9+ messages in thread
From: Suren Baghdasaryan @ 2024-01-22 7:23 UTC (permalink / raw)
To: Vlastimil Babka
Cc: akpm, viro, brauner, jack, dchinner, casey, ben.wolsieffer,
paulmck, david, avagin, usama.anjum, peterx, hughd, ryan.roberts,
wangkefeng.wang, Liam.Howlett, yuzhao, axelrasmussen, lstoakes,
talumbau, willy, mgorman, jhubbard, vishal.moola,
mathieu.desnoyers, dhowells, jgg, sidhartha.kumar,
andriy.shevchenko, yangxingui, keescook, linux-kernel,
linux-fsdevel, linux-mm, kernel-team
On Thu, Jan 18, 2024 at 9:58 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Tue, Jan 16, 2024 at 9:57 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Tue, Jan 16, 2024 at 6:46 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> > >
> > > On 1/16/24 15:42, Vlastimil Babka wrote:
> > > > On 1/15/24 19:38, Suren Baghdasaryan wrote:
> > > >
> > > > Hi,
> > > >
> > > >> The issue this patchset is trying to address is mmap_lock contention when
> > > >> a low priority task (monitoring, data collecting, etc.) blocks a higher
> > > >> priority task from making updates to the address space. The contention is
> > > >> due to the mmap_lock being held for read when reading proc/pid/maps.
> > > >> With maple_tree introduction, VMA tree traversals are RCU-safe and per-vma
> > > >> locks make VMA access RCU-safe. This provides an opportunity for lock-less
> > > >> reading of proc/pid/maps. We still need to overcome a couple obstacles:
> > > >> 1. Make all VMA pointer fields used for proc/pid/maps content generation
> > > >> RCU-safe;
> > > >> 2. Ensure that proc/pid/maps data tearing, which is currently possible at
> > > >> page boundaries only, does not get worse.
> > > >
> > > > Hm I thought we were to only choose this more complicated approach in case additional
> > > > tearing becomes a problem, and at first assume that if software can deal
> > > > with page boundary tearing, it can deal with sub-page tearing too?
> >
> > Hi Vlastimil,
> > Thanks for the feedback!
> > Yes, originally I thought we wouldn't be able to avoid additional
> > tearing without a big change but then realized it's not that hard, so
> > I tried to keep the change in behavior transparent to the userspace.
>
> In the absence of other feedback I'm going to implement and post the
> originally envisioned approach: remove validation step and avoid any
> possibility of blocking but allowing for sub-page tearing. Will use
> Matthew's rwsem_wait() to deal with possible inconsistent maple_tree
> state.
I posted v1 at
https://lore.kernel.org/all/20240122071324.2099712-1-surenb@google.com/
In the RFC I used mm_struct.mm_lock_seq to detect if mm is being
changed but then I realized that won't work. mm_struct.mm_lock_seq is
incremented after mm is changed and right before mmap_lock is
write-unlocked. Instead I need a counter that changes once we
write-lock mmap_lock and before any mm changes. So the new patchset
introduces a separate counter to detect possible mm changes. In
addition, I could not use rwsem_wait() and instead had to take
mmap_lock for read to wait for the writer to finish and then record
the new counter while holding mmap_lock for read. That prevents
concurrent mm changes while we are recording the new counter value.
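
[Editor's sketch of the counter scheme described above, under stated assumptions: unlike mm_lock_seq (incremented just before write-unlock), this counter is bumped at write-lock time, before any modification, so a reader that records a baseline while no writer is active can detect any writer that starts during a speculative read. `mm_wr_seq` and all helper names are illustrative, not the real kernel fields.]

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_ulong mm_wr_seq; /* hypothetical counter, not mm_lock_seq */
static int vma_data;           /* stand-in for VMA state being read */

/* Writer: bump the counter BEFORE changing anything, i.e. at the point
 * where mmap_lock is taken for write, then perform the modifications. */
static void writer_modify(int v)
{
	atomic_fetch_add(&mm_wr_seq, 1);
	vma_data = v;
}

/* Reader baseline: in the patchset this value is recorded while briefly
 * holding mmap_lock for read, which excludes in-flight writers, so the
 * recorded value is known to correspond to a quiescent state. */
static unsigned long record_baseline(void)
{
	return atomic_load(&mm_wr_seq);
}

/* Speculative lock-less read: the copy is valid only if no writer
 * started since 'baseline'; otherwise the caller must re-synchronize
 * (re-take mmap_lock for read and record a fresh baseline). */
static bool read_speculative(unsigned long baseline, int *out)
{
	int copy = vma_data;
	if (atomic_load(&mm_wr_seq) != baseline)
		return false;
	*out = copy;
	return true;
}
```

[This is why mm_lock_seq alone is insufficient: bumped only at unlock time, it would let a reader start mid-write and see a torn copy with an unchanged counter.]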
> Thanks,
> Suren.
>
> >
> > > >
> > > >> The patchset deals with these issues but there is a downside which I would
> > > >> like to get input on:
> > > >> This change introduces unfairness towards the reader of proc/pid/maps,
> > > >> which can be blocked by an overly active/malicious address space modifier.
> > > >
> > > > So this is a consequence of the validate() operation, right? We could avoid
> > > > this if we allowed sub-page tearing.
> >
> > Yes, if we don't care about sub-page tearing then we could get rid of
> > the validate step, and this issue with updaters blocking the reader would
> > go away. If we choose that direction there will be one more issue to
> > fix, namely the maple_tree temporary inconsistent state when a VMA is
> > replaced with another one and we might observe NULL there. We might be
> > able to use Matthew's rwsem_wait() to deal with that issue.
> >
> > > >
> > > >> A couple of ways I thought we could address this issue are:
> > > >> 1. After several lock-less retries (or some time limit), fall back to
> > > >> taking mmap_lock.
> > > >> 2. Employ lock-less reading only if the reader has low priority,
> > > >> indicating that blocking it is not critical.
> > > >> 3. Introduce a separate procfs file which publishes the same data in
> > > >> a lock-less manner.
> > >
> > > Oh and if this option 3 becomes necessary, then such a new file shouldn't
> > > validate() either, and whoever wants to avoid the reader contention and
> > > converts their monitoring to the new file will have to account for this
> > > possible extra tearing from the start. So I would suggest trying to change
> > > the existing file with no validate() first, and if existing userspace gets
> > > broken, employ option 3. This would mean no validate() in either case?
> >
> > Yes, but I was trying to avoid introducing an additional file which
> > publishes the same content in a slightly different way. We will have
> > to explain when userspace should use one vs the other and that would
> > require going into low level implementation details, I think. Don't
> > know if that's acceptable/preferable.
> > Thanks,
> > Suren.
> >
> > >
> > > >> I imagine a combination of these approaches can also be employed.
> > > >> I would like to get feedback on this from the Linux community.
> > > >>
> > > >> Note: mmap_read_lock/mmap_read_unlock sequence inside validate_map()
> > > >> can be replaced with the more efficient rwsem_wait() proposed by Matthew
> > > >> in [1].
> > > >>
> > > >> [1] https://lore.kernel.org/all/ZZ1+ZicgN8dZ3zj3@casper.infradead.org/
> > > >>
> > > >> Suren Baghdasaryan (3):
> > > >> mm: make vm_area_struct anon_name field RCU-safe
> > > >> seq_file: add validate() operation to seq_operations
> > > >> mm/maps: read proc/pid/maps under RCU
> > > >>
> > > >> fs/proc/internal.h | 3 +
> > > >> fs/proc/task_mmu.c | 130 ++++++++++++++++++++++++++++++++++----
> > > >> fs/seq_file.c | 24 ++++++-
> > > >> include/linux/mm_inline.h | 10 ++-
> > > >> include/linux/mm_types.h | 3 +-
> > > >> include/linux/seq_file.h | 1 +
> > > >> mm/madvise.c | 30 +++++++--
> > > >> 7 files changed, 181 insertions(+), 20 deletions(-)
> > > >>
> > > >
> > >
end of thread, other threads:[~2024-01-22 7:24 UTC | newest]
Thread overview: 9+ messages
2024-01-15 18:38 [RFC 0/3] reading proc/pid/maps under RCU Suren Baghdasaryan
2024-01-15 18:38 ` [RFC 1/3] mm: make vm_area_struct anon_name field RCU-safe Suren Baghdasaryan
2024-01-15 18:38 ` [RFC 2/3] seq_file: add validate() operation to seq_operations Suren Baghdasaryan
2024-01-15 18:38 ` [RFC 3/3] mm/maps: read proc/pid/maps under RCU Suren Baghdasaryan
2024-01-16 14:42 ` [RFC 0/3] reading " Vlastimil Babka
2024-01-16 14:46 ` Vlastimil Babka
2024-01-16 17:57 ` Suren Baghdasaryan
2024-01-18 17:58 ` Suren Baghdasaryan
2024-01-22 7:23 ` Suren Baghdasaryan