linux-kernel.vger.kernel.org archive mirror
* [PATCH 1/2] mm: limit the scope of vma_start_read()
@ 2025-07-31  1:34 Suren Baghdasaryan
  2025-07-31  1:34 ` [PATCH 2/2] mm: change vma_start_read() to drop RCU lock on failure Suren Baghdasaryan
  2025-07-31 11:12 ` [PATCH 1/2] mm: limit the scope of vma_start_read() Lorenzo Stoakes
  0 siblings, 2 replies; 6+ messages in thread
From: Suren Baghdasaryan @ 2025-07-31  1:34 UTC (permalink / raw)
  To: akpm
  Cc: jannh, Liam.Howlett, lorenzo.stoakes, vbabka, pfalcato, linux-mm,
	linux-kernel, surenb

Limit the scope of vma_start_read() as it is used only as a helper for
higher-level locking functions implemented inside mmap_lock.c and we are
about to introduce more complex RCU rules for this function.
The change is pure code refactoring and has no functional changes.

Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 include/linux/mmap_lock.h | 85 ---------------------------------------
 mm/mmap_lock.c            | 85 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 85 insertions(+), 85 deletions(-)

diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index 11a078de9150..2c9fffa58714 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -147,91 +147,6 @@ static inline void vma_refcount_put(struct vm_area_struct *vma)
 	}
 }
 
-/*
- * Try to read-lock a vma. The function is allowed to occasionally yield false
- * locked result to avoid performance overhead, in which case we fall back to
- * using mmap_lock. The function should never yield false unlocked result.
- * False locked result is possible if mm_lock_seq overflows or if vma gets
- * reused and attached to a different mm before we lock it.
- * Returns the vma on success, NULL on failure to lock and EAGAIN if vma got
- * detached.
- *
- * WARNING! The vma passed to this function cannot be used if the function
- * fails to lock it because in certain cases RCU lock is dropped and then
- * reacquired. Once RCU lock is dropped the vma can be concurrently freed.
- */
-static inline struct vm_area_struct *vma_start_read(struct mm_struct *mm,
-						    struct vm_area_struct *vma)
-{
-	int oldcnt;
-
-	/*
-	 * Check before locking. A race might cause false locked result.
-	 * We can use READ_ONCE() for the mm_lock_seq here, and don't need
-	 * ACQUIRE semantics, because this is just a lockless check whose result
-	 * we don't rely on for anything - the mm_lock_seq read against which we
-	 * need ordering is below.
-	 */
-	if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence))
-		return NULL;
-
-	/*
-	 * If VMA_LOCK_OFFSET is set, __refcount_inc_not_zero_limited_acquire()
-	 * will fail because VMA_REF_LIMIT is less than VMA_LOCK_OFFSET.
-	 * Acquire fence is required here to avoid reordering against later
-	 * vm_lock_seq check and checks inside lock_vma_under_rcu().
-	 */
-	if (unlikely(!__refcount_inc_not_zero_limited_acquire(&vma->vm_refcnt, &oldcnt,
-							      VMA_REF_LIMIT))) {
-		/* return EAGAIN if vma got detached from under us */
-		return oldcnt ? NULL : ERR_PTR(-EAGAIN);
-	}
-
-	rwsem_acquire_read(&vma->vmlock_dep_map, 0, 1, _RET_IP_);
-
-	/*
-	 * If vma got attached to another mm from under us, that mm is not
-	 * stable and can be freed in the narrow window after vma->vm_refcnt
-	 * is dropped and before rcuwait_wake_up(mm) is called. Grab it before
-	 * releasing vma->vm_refcnt.
-	 */
-	if (unlikely(vma->vm_mm != mm)) {
-		/* Use a copy of vm_mm in case vma is freed after we drop vm_refcnt */
-		struct mm_struct *other_mm = vma->vm_mm;
-
-		/*
-		 * __mmdrop() is a heavy operation and we don't need RCU
-		 * protection here. Release RCU lock during these operations.
-		 * We reinstate the RCU read lock as the caller expects it to
-		 * be held when this function returns even on error.
-		 */
-		rcu_read_unlock();
-		mmgrab(other_mm);
-		vma_refcount_put(vma);
-		mmdrop(other_mm);
-		rcu_read_lock();
-		return NULL;
-	}
-
-	/*
-	 * Overflow of vm_lock_seq/mm_lock_seq might produce false locked result.
-	 * False unlocked result is impossible because we modify and check
-	 * vma->vm_lock_seq under vma->vm_refcnt protection and mm->mm_lock_seq
-	 * modification invalidates all existing locks.
-	 *
-	 * We must use ACQUIRE semantics for the mm_lock_seq so that if we are
-	 * racing with vma_end_write_all(), we only start reading from the VMA
-	 * after it has been unlocked.
-	 * This pairs with RELEASE semantics in vma_end_write_all().
-	 */
-	if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&mm->mm_lock_seq))) {
-		vma_refcount_put(vma);
-		return NULL;
-	}
-
-	return vma;
-}
-
 /*
  * Use only while holding mmap read lock which guarantees that locking will not
  * fail (nobody can concurrently write-lock the vma). vma_start_read() should
diff --git a/mm/mmap_lock.c b/mm/mmap_lock.c
index b006cec8e6fe..10826f347a9f 100644
--- a/mm/mmap_lock.c
+++ b/mm/mmap_lock.c
@@ -127,6 +127,91 @@ void vma_mark_detached(struct vm_area_struct *vma)
 	}
 }
 
+/*
+ * Try to read-lock a vma. The function is allowed to occasionally yield false
+ * locked result to avoid performance overhead, in which case we fall back to
+ * using mmap_lock. The function should never yield false unlocked result.
+ * False locked result is possible if mm_lock_seq overflows or if vma gets
+ * reused and attached to a different mm before we lock it.
+ * Returns the vma on success, NULL on failure to lock and EAGAIN if vma got
+ * detached.
+ *
+ * WARNING! The vma passed to this function cannot be used if the function
+ * fails to lock it because in certain cases RCU lock is dropped and then
+ * reacquired. Once RCU lock is dropped the vma can be concurrently freed.
+ */
+static inline struct vm_area_struct *vma_start_read(struct mm_struct *mm,
+						    struct vm_area_struct *vma)
+{
+	int oldcnt;
+
+	/*
+	 * Check before locking. A race might cause false locked result.
+	 * We can use READ_ONCE() for the mm_lock_seq here, and don't need
+	 * ACQUIRE semantics, because this is just a lockless check whose result
+	 * we don't rely on for anything - the mm_lock_seq read against which we
+	 * need ordering is below.
+	 */
+	if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence))
+		return NULL;
+
+	/*
+	 * If VMA_LOCK_OFFSET is set, __refcount_inc_not_zero_limited_acquire()
+	 * will fail because VMA_REF_LIMIT is less than VMA_LOCK_OFFSET.
+	 * Acquire fence is required here to avoid reordering against later
+	 * vm_lock_seq check and checks inside lock_vma_under_rcu().
+	 */
+	if (unlikely(!__refcount_inc_not_zero_limited_acquire(&vma->vm_refcnt, &oldcnt,
+							      VMA_REF_LIMIT))) {
+		/* return EAGAIN if vma got detached from under us */
+		return oldcnt ? NULL : ERR_PTR(-EAGAIN);
+	}
+
+	rwsem_acquire_read(&vma->vmlock_dep_map, 0, 1, _RET_IP_);
+
+	/*
+	 * If vma got attached to another mm from under us, that mm is not
+	 * stable and can be freed in the narrow window after vma->vm_refcnt
+	 * is dropped and before rcuwait_wake_up(mm) is called. Grab it before
+	 * releasing vma->vm_refcnt.
+	 */
+	if (unlikely(vma->vm_mm != mm)) {
+		/* Use a copy of vm_mm in case vma is freed after we drop vm_refcnt */
+		struct mm_struct *other_mm = vma->vm_mm;
+
+		/*
+		 * __mmdrop() is a heavy operation and we don't need RCU
+		 * protection here. Release RCU lock during these operations.
+		 * We reinstate the RCU read lock as the caller expects it to
+		 * be held when this function returns even on error.
+		 */
+		rcu_read_unlock();
+		mmgrab(other_mm);
+		vma_refcount_put(vma);
+		mmdrop(other_mm);
+		rcu_read_lock();
+		return NULL;
+	}
+
+	/*
+	 * Overflow of vm_lock_seq/mm_lock_seq might produce false locked result.
+	 * False unlocked result is impossible because we modify and check
+	 * vma->vm_lock_seq under vma->vm_refcnt protection and mm->mm_lock_seq
+	 * modification invalidates all existing locks.
+	 *
+	 * We must use ACQUIRE semantics for the mm_lock_seq so that if we are
+	 * racing with vma_end_write_all(), we only start reading from the VMA
+	 * after it has been unlocked.
+	 * This pairs with RELEASE semantics in vma_end_write_all().
+	 */
+	if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&mm->mm_lock_seq))) {
+		vma_refcount_put(vma);
+		return NULL;
+	}
+
+	return vma;
+}
+
 /*
  * Lookup and lock a VMA under RCU protection. Returned VMA is guaranteed to be
  * stable and not isolated. If the VMA is not found or is being modified the
-- 
2.50.1.552.g942d659e1b-goog



* [PATCH 2/2] mm: change vma_start_read() to drop RCU lock on failure
  2025-07-31  1:34 [PATCH 1/2] mm: limit the scope of vma_start_read() Suren Baghdasaryan
@ 2025-07-31  1:34 ` Suren Baghdasaryan
  2025-07-31 11:49   ` Lorenzo Stoakes
  2025-07-31 11:12 ` [PATCH 1/2] mm: limit the scope of vma_start_read() Lorenzo Stoakes
  1 sibling, 1 reply; 6+ messages in thread
From: Suren Baghdasaryan @ 2025-07-31  1:34 UTC (permalink / raw)
  To: akpm
  Cc: jannh, Liam.Howlett, lorenzo.stoakes, vbabka, pfalcato, linux-mm,
	linux-kernel, surenb

vma_start_read() can drop and reacquire the RCU lock in certain failure
cases. It is not apparent to the caller that the RCU read section it
started might be interrupted when vma_start_read() fails to lock the
vma. This might become a source of subtle bugs, so to prevent that we
change the locking rules for vma_start_read() to drop the RCU read lock
upon failure. This way it is more obvious that RCU-protected objects
are unsafe after vma locking fails.
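
For illustration, the expected caller pattern under the new rules looks
roughly like this (a simplified sketch; the vma lookup and the
error-specific retry handling done by the real callers are elided):

	rcu_read_lock();
	/* ... find the vma in the tree under RCU ... */
	vma = vma_start_read(mm, vma);
	if (IS_ERR_OR_NULL(vma)) {
		/* Locking failed; the RCU read lock is already dropped */
		return NULL;
	}
	/* vma is locked and stable, the RCU read section can be left */
	rcu_read_unlock();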

Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/mmap_lock.c | 76 +++++++++++++++++++++++++++++---------------------
 1 file changed, 44 insertions(+), 32 deletions(-)

diff --git a/mm/mmap_lock.c b/mm/mmap_lock.c
index 10826f347a9f..0129db8f652f 100644
--- a/mm/mmap_lock.c
+++ b/mm/mmap_lock.c
@@ -136,15 +136,21 @@ void vma_mark_detached(struct vm_area_struct *vma)
  * Returns the vma on success, NULL on failure to lock and EAGAIN if vma got
  * detached.
  *
- * WARNING! The vma passed to this function cannot be used if the function
- * fails to lock it because in certain cases RCU lock is dropped and then
- * reacquired. Once RCU lock is dropped the vma can be concurrently freed.
+ * WARNING! On entrance to this function RCU read lock should be held and it
+ * is released if function fails to lock the vma, therefore vma passed to this
+ * function cannot be used if the function fails to lock it.
+ * When vma is successfully locked, RCU read lock is kept intact and RCU read
+ * session is not interrupted. This is important when locking is done while
+ * walking the vma tree under RCU using vma_iterator because if the RCU lock
+ * is dropped, the iterator becomes invalid.
  */
 static inline struct vm_area_struct *vma_start_read(struct mm_struct *mm,
 						    struct vm_area_struct *vma)
 {
+	struct mm_struct *other_mm;
 	int oldcnt;
 
+	RCU_LOCKDEP_WARN(!rcu_read_lock_held(), "no rcu lock held");
 	/*
 	 * Check before locking. A race might cause false locked result.
 	 * We can use READ_ONCE() for the mm_lock_seq here, and don't need
@@ -152,8 +158,10 @@ static inline struct vm_area_struct *vma_start_read(struct mm_struct *mm,
 	 * we don't rely on for anything - the mm_lock_seq read against which we
 	 * need ordering is below.
 	 */
-	if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence))
-		return NULL;
+	if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence)) {
+		vma = NULL;
+		goto err;
+	}
 
 	/*
 	 * If VMA_LOCK_OFFSET is set, __refcount_inc_not_zero_limited_acquire()
@@ -164,7 +172,8 @@ static inline struct vm_area_struct *vma_start_read(struct mm_struct *mm,
 	if (unlikely(!__refcount_inc_not_zero_limited_acquire(&vma->vm_refcnt, &oldcnt,
 							      VMA_REF_LIMIT))) {
 		/* return EAGAIN if vma got detached from under us */
-		return oldcnt ? NULL : ERR_PTR(-EAGAIN);
+		vma = oldcnt ? NULL : ERR_PTR(-EAGAIN);
+		goto err;
 	}
 
 	rwsem_acquire_read(&vma->vmlock_dep_map, 0, 1, _RET_IP_);
@@ -175,23 +184,8 @@ static inline struct vm_area_struct *vma_start_read(struct mm_struct *mm,
 	 * is dropped and before rcuwait_wake_up(mm) is called. Grab it before
 	 * releasing vma->vm_refcnt.
 	 */
-	if (unlikely(vma->vm_mm != mm)) {
-		/* Use a copy of vm_mm in case vma is freed after we drop vm_refcnt */
-		struct mm_struct *other_mm = vma->vm_mm;
-
-		/*
-		 * __mmdrop() is a heavy operation and we don't need RCU
-		 * protection here. Release RCU lock during these operations.
-		 * We reinstate the RCU read lock as the caller expects it to
-		 * be held when this function returns even on error.
-		 */
-		rcu_read_unlock();
-		mmgrab(other_mm);
-		vma_refcount_put(vma);
-		mmdrop(other_mm);
-		rcu_read_lock();
-		return NULL;
-	}
+	if (unlikely(vma->vm_mm != mm))
+		goto err_unstable;
 
 	/*
 	 * Overflow of vm_lock_seq/mm_lock_seq might produce false locked result.
@@ -206,10 +200,26 @@ static inline struct vm_area_struct *vma_start_read(struct mm_struct *mm,
 	 */
 	if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&mm->mm_lock_seq))) {
 		vma_refcount_put(vma);
-		return NULL;
+		vma = NULL;
+		goto err;
 	}
 
 	return vma;
+err:
+	rcu_read_unlock();
+
+	return vma;
+err_unstable:
+	/* Use a copy of vm_mm in case vma is freed after we drop vm_refcnt */
+	other_mm = vma->vm_mm;
+
+	/* __mmdrop() is a heavy operation, do it after dropping RCU lock. */
+	rcu_read_unlock();
+	mmgrab(other_mm);
+	vma_refcount_put(vma);
+	mmdrop(other_mm);
+
+	return NULL;
 }
 
 /*
@@ -223,8 +233,8 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 	MA_STATE(mas, &mm->mm_mt, address, address);
 	struct vm_area_struct *vma;
 
-	rcu_read_lock();
 retry:
+	rcu_read_lock();
 	vma = mas_walk(&mas);
 	if (!vma)
 		goto inval;
@@ -241,6 +251,9 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 		/* Failed to lock the VMA */
 		goto inval;
 	}
+
+	rcu_read_unlock();
+
 	/*
 	 * At this point, we have a stable reference to a VMA: The VMA is
 	 * locked and we know it hasn't already been isolated.
@@ -249,16 +262,14 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 	 */
 
 	/* Check if the vma we locked is the right one. */
-	if (unlikely(address < vma->vm_start || address >= vma->vm_end))
-		goto inval_end_read;
+	if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
+		vma_end_read(vma);
+		goto inval;
+	}
 
-	rcu_read_unlock();
 	return vma;
 
-inval_end_read:
-	vma_end_read(vma);
 inval:
-	rcu_read_unlock();
 	count_vm_vma_lock_event(VMA_LOCK_ABORT);
 	return NULL;
 }
@@ -313,6 +324,7 @@ struct vm_area_struct *lock_next_vma(struct mm_struct *mm,
 		 */
 		if (PTR_ERR(vma) == -EAGAIN) {
 			/* reset to search from the last address */
+			rcu_read_lock();
 			vma_iter_set(vmi, from_addr);
 			goto retry;
 		}
@@ -342,9 +354,9 @@ struct vm_area_struct *lock_next_vma(struct mm_struct *mm,
 	return vma;
 
 fallback_unlock:
+	rcu_read_unlock();
 	vma_end_read(vma);
 fallback:
-	rcu_read_unlock();
 	vma = lock_next_vma_under_mmap_lock(mm, vmi, from_addr);
 	rcu_read_lock();
 	/* Reinitialize the iterator after re-entering rcu read section */
-- 
2.50.1.552.g942d659e1b-goog



* Re: [PATCH 1/2] mm: limit the scope of vma_start_read()
  2025-07-31  1:34 [PATCH 1/2] mm: limit the scope of vma_start_read() Suren Baghdasaryan
  2025-07-31  1:34 ` [PATCH 2/2] mm: change vma_start_read() to drop RCU lock on failure Suren Baghdasaryan
@ 2025-07-31 11:12 ` Lorenzo Stoakes
  1 sibling, 0 replies; 6+ messages in thread
From: Lorenzo Stoakes @ 2025-07-31 11:12 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, jannh, Liam.Howlett, vbabka, pfalcato, linux-mm,
	linux-kernel

One small nit, or maybe just _my_ personal preference :P but could you use
a cover letter for these? I find the 2/2 replying to the 1/2 thing a bit
weird.

Possibly a 'Lorenzo' thing though ;)

On Wed, Jul 30, 2025 at 06:34:03PM -0700, Suren Baghdasaryan wrote:
> Limit the scope of vma_start_read() as it is used only as a helper for
> higher-level locking functions implemented inside mmap_lock.c and we are
> about to introduce more complex RCU rules for this function.
> The change is pure code refactoring and has no functional changes.
>
> Suggested-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>

I've checked this carefully, it compiles locally + all fine with no
perceivable delta other than the move, so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  include/linux/mmap_lock.h | 85 ---------------------------------------
>  mm/mmap_lock.c            | 85 +++++++++++++++++++++++++++++++++++++++
>  2 files changed, 85 insertions(+), 85 deletions(-)
>
> diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
> index 11a078de9150..2c9fffa58714 100644
> --- a/include/linux/mmap_lock.h
> +++ b/include/linux/mmap_lock.h
> @@ -147,91 +147,6 @@ static inline void vma_refcount_put(struct vm_area_struct *vma)
>  	}
>  }
>
> -/*
> - * Try to read-lock a vma. The function is allowed to occasionally yield false
> - * locked result to avoid performance overhead, in which case we fall back to
> - * using mmap_lock. The function should never yield false unlocked result.
> - * False locked result is possible if mm_lock_seq overflows or if vma gets
> - * reused and attached to a different mm before we lock it.
> - * Returns the vma on success, NULL on failure to lock and EAGAIN if vma got
> - * detached.
> - *
> - * WARNING! The vma passed to this function cannot be used if the function
> - * fails to lock it because in certain cases RCU lock is dropped and then
> - * reacquired. Once RCU lock is dropped the vma can be concurently freed.
> - */
> -static inline struct vm_area_struct *vma_start_read(struct mm_struct *mm,
> -						    struct vm_area_struct *vma)
> -{
> -	int oldcnt;
> -
> -	/*
> -	 * Check before locking. A race might cause false locked result.
> -	 * We can use READ_ONCE() for the mm_lock_seq here, and don't need
> -	 * ACQUIRE semantics, because this is just a lockless check whose result
> -	 * we don't rely on for anything - the mm_lock_seq read against which we
> -	 * need ordering is below.
> -	 */
> -	if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence))
> -		return NULL;
> -
> -	/*
> -	 * If VMA_LOCK_OFFSET is set, __refcount_inc_not_zero_limited_acquire()
> -	 * will fail because VMA_REF_LIMIT is less than VMA_LOCK_OFFSET.
> -	 * Acquire fence is required here to avoid reordering against later
> -	 * vm_lock_seq check and checks inside lock_vma_under_rcu().
> -	 */
> -	if (unlikely(!__refcount_inc_not_zero_limited_acquire(&vma->vm_refcnt, &oldcnt,
> -							      VMA_REF_LIMIT))) {
> -		/* return EAGAIN if vma got detached from under us */
> -		return oldcnt ? NULL : ERR_PTR(-EAGAIN);
> -	}
> -
> -	rwsem_acquire_read(&vma->vmlock_dep_map, 0, 1, _RET_IP_);
> -
> -	/*
> -	 * If vma got attached to another mm from under us, that mm is not
> -	 * stable and can be freed in the narrow window after vma->vm_refcnt
> -	 * is dropped and before rcuwait_wake_up(mm) is called. Grab it before
> -	 * releasing vma->vm_refcnt.
> -	 */
> -	if (unlikely(vma->vm_mm != mm)) {
> -		/* Use a copy of vm_mm in case vma is freed after we drop vm_refcnt */
> -		struct mm_struct *other_mm = vma->vm_mm;
> -
> -		/*
> -		 * __mmdrop() is a heavy operation and we don't need RCU
> -		 * protection here. Release RCU lock during these operations.
> -		 * We reinstate the RCU read lock as the caller expects it to
> -		 * be held when this function returns even on error.
> -		 */
> -		rcu_read_unlock();
> -		mmgrab(other_mm);
> -		vma_refcount_put(vma);
> -		mmdrop(other_mm);
> -		rcu_read_lock();
> -		return NULL;
> -	}
> -
> -	/*
> -	 * Overflow of vm_lock_seq/mm_lock_seq might produce false locked result.
> -	 * False unlocked result is impossible because we modify and check
> -	 * vma->vm_lock_seq under vma->vm_refcnt protection and mm->mm_lock_seq
> -	 * modification invalidates all existing locks.
> -	 *
> -	 * We must use ACQUIRE semantics for the mm_lock_seq so that if we are
> -	 * racing with vma_end_write_all(), we only start reading from the VMA
> -	 * after it has been unlocked.
> -	 * This pairs with RELEASE semantics in vma_end_write_all().
> -	 */
> -	if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&mm->mm_lock_seq))) {
> -		vma_refcount_put(vma);
> -		return NULL;
> -	}
> -
> -	return vma;
> -}
> -
>  /*
>   * Use only while holding mmap read lock which guarantees that locking will not
>   * fail (nobody can concurrently write-lock the vma). vma_start_read() should
> diff --git a/mm/mmap_lock.c b/mm/mmap_lock.c
> index b006cec8e6fe..10826f347a9f 100644
> --- a/mm/mmap_lock.c
> +++ b/mm/mmap_lock.c
> @@ -127,6 +127,91 @@ void vma_mark_detached(struct vm_area_struct *vma)
>  	}
>  }
>
> +/*
> + * Try to read-lock a vma. The function is allowed to occasionally yield false
> + * locked result to avoid performance overhead, in which case we fall back to
> + * using mmap_lock. The function should never yield false unlocked result.
> + * False locked result is possible if mm_lock_seq overflows or if vma gets
> + * reused and attached to a different mm before we lock it.
> + * Returns the vma on success, NULL on failure to lock and EAGAIN if vma got
> + * detached.
> + *
> + * WARNING! The vma passed to this function cannot be used if the function
> + * fails to lock it because in certain cases RCU lock is dropped and then
> + * reacquired. Once RCU lock is dropped the vma can be concurently freed.
> + */
> +static inline struct vm_area_struct *vma_start_read(struct mm_struct *mm,
> +						    struct vm_area_struct *vma)
> +{
> +	int oldcnt;
> +
> +	/*
> +	 * Check before locking. A race might cause false locked result.
> +	 * We can use READ_ONCE() for the mm_lock_seq here, and don't need
> +	 * ACQUIRE semantics, because this is just a lockless check whose result
> +	 * we don't rely on for anything - the mm_lock_seq read against which we
> +	 * need ordering is below.
> +	 */
> +	if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence))
> +		return NULL;
> +
> +	/*
> +	 * If VMA_LOCK_OFFSET is set, __refcount_inc_not_zero_limited_acquire()
> +	 * will fail because VMA_REF_LIMIT is less than VMA_LOCK_OFFSET.
> +	 * Acquire fence is required here to avoid reordering against later
> +	 * vm_lock_seq check and checks inside lock_vma_under_rcu().
> +	 */
> +	if (unlikely(!__refcount_inc_not_zero_limited_acquire(&vma->vm_refcnt, &oldcnt,
> +							      VMA_REF_LIMIT))) {
> +		/* return EAGAIN if vma got detached from under us */
> +		return oldcnt ? NULL : ERR_PTR(-EAGAIN);
> +	}
> +
> +	rwsem_acquire_read(&vma->vmlock_dep_map, 0, 1, _RET_IP_);
> +
> +	/*
> +	 * If vma got attached to another mm from under us, that mm is not
> +	 * stable and can be freed in the narrow window after vma->vm_refcnt
> +	 * is dropped and before rcuwait_wake_up(mm) is called. Grab it before
> +	 * releasing vma->vm_refcnt.
> +	 */
> +	if (unlikely(vma->vm_mm != mm)) {
> +		/* Use a copy of vm_mm in case vma is freed after we drop vm_refcnt */
> +		struct mm_struct *other_mm = vma->vm_mm;
> +
> +		/*
> +		 * __mmdrop() is a heavy operation and we don't need RCU
> +		 * protection here. Release RCU lock during these operations.
> +		 * We reinstate the RCU read lock as the caller expects it to
> +		 * be held when this function returns even on error.
> +		 */
> +		rcu_read_unlock();
> +		mmgrab(other_mm);
> +		vma_refcount_put(vma);
> +		mmdrop(other_mm);
> +		rcu_read_lock();
> +		return NULL;
> +	}
> +
> +	/*
> +	 * Overflow of vm_lock_seq/mm_lock_seq might produce false locked result.
> +	 * False unlocked result is impossible because we modify and check
> +	 * vma->vm_lock_seq under vma->vm_refcnt protection and mm->mm_lock_seq
> +	 * modification invalidates all existing locks.
> +	 *
> +	 * We must use ACQUIRE semantics for the mm_lock_seq so that if we are
> +	 * racing with vma_end_write_all(), we only start reading from the VMA
> +	 * after it has been unlocked.
> +	 * This pairs with RELEASE semantics in vma_end_write_all().
> +	 */
> +	if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&mm->mm_lock_seq))) {
> +		vma_refcount_put(vma);
> +		return NULL;
> +	}
> +
> +	return vma;
> +}
> +
>  /*
>   * Lookup and lock a VMA under RCU protection. Returned VMA is guaranteed to be
>   * stable and not isolated. If the VMA is not found or is being modified the
> --
> 2.50.1.552.g942d659e1b-goog
>


* Re: [PATCH 2/2] mm: change vma_start_read() to drop RCU lock on failure
  2025-07-31  1:34 ` [PATCH 2/2] mm: change vma_start_read() to drop RCU lock on failure Suren Baghdasaryan
@ 2025-07-31 11:49   ` Lorenzo Stoakes
  2025-07-31 14:06     ` Suren Baghdasaryan
  0 siblings, 1 reply; 6+ messages in thread
From: Lorenzo Stoakes @ 2025-07-31 11:49 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, jannh, Liam.Howlett, vbabka, pfalcato, linux-mm,
	linux-kernel

So this patch is broken :P

Am getting:

[    2.002807] ------------[ cut here ]------------
[    2.003014] Voluntary context switch within RCU read-side critical section!
[    2.003022] WARNING: CPU: 1 PID: 202 at kernel/rcu/tree_plugin.h:332 rcu_note_context_switch+0x506/0x580
[    2.003643] Modules linked in:
[    2.003765] CPU: 1 UID: 0 PID: 202 Comm: dhcpcd Not tainted 6.16.0-rc5+ #41 PREEMPT(voluntary)
[    2.004103] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.17.0-1-1 04/01/2014
[    2.004460] RIP: 0010:rcu_note_context_switch+0x506/0x580
[    2.004669] Code: 00 00 00 0f 85 f5 fd ff ff 49 89 90 a8 00 00 00 e9 e9 fd ff ff c6 05 86 69 90 01 01 90 48 c7 c7 98 c3 90 b8 e8 cb b4 f5 ff 90 <0f> 0b 90 90 e9 38 fb ff ff 48 8b 7d 20 48 89 3c 24 e8 64 26 d5 00
[    2.005382] RSP: 0018:ffffb36b40607aa8 EFLAGS: 00010082
[    2.005585] RAX: 000000000000003f RBX: ffff9c044128ad00 RCX: 0000000000000027
[    2.005866] RDX: ffff9c0577c97f48 RSI: 0000000000000001 RDI: ffff9c0577c97f40
[    2.006136] RBP: ffff9c0577ca9f80 R08: 40000000ffffe1f7 R09: 0000000000000000
[    2.006411] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[    2.006692] R13: ffff9c04fb0423d0 R14: ffffffffb82e2600 R15: ffff9c044128ad00
[    2.006968] FS:  00007fd12e7d9740(0000) GS:ffff9c05be614000(0000) knlGS:0000000000000000
[    2.007281] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    2.007498] CR2: 00007ffe2f0d2798 CR3: 00000001bb8b1000 CR4: 0000000000750ef0
[    2.007770] PKRU: 55555554
[    2.007880] Call Trace:
[    2.007985]  <TASK>
[    2.008076]  __schedule+0x94/0xee0
[    2.008212]  ? __pfx_bit_wait_io+0x10/0x10
[    2.008370]  schedule+0x22/0xd0
[    2.008517]  io_schedule+0x41/0x60
[    2.008653]  bit_wait_io+0xc/0x60
[    2.008783]  __wait_on_bit+0x25/0x90
[    2.008925]  out_of_line_wait_on_bit+0x85/0x90
[    2.009104]  ? __pfx_wake_bit_function+0x10/0x10
[    2.009288]  __ext4_find_entry+0x2b2/0x470
[    2.009449]  ? __d_alloc+0x117/0x1d0
[    2.009591]  ext4_lookup+0x6b/0x1f0
[    2.009733]  path_openat+0x895/0x1030
[    2.009880]  do_filp_open+0xc3/0x150
[    2.010021]  ? do_anonymous_page+0x5b1/0xae0
[    2.010195]  do_sys_openat2+0x76/0xc0
[    2.010339]  __x64_sys_openat+0x4f/0x70
[    2.010490]  do_syscall_64+0xa4/0x260
[    2.010638]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
[    2.010840] RIP: 0033:0x7fd12e0a2006
[    2.010984] Code: 5d e8 41 8b 93 08 03 00 00 59 5e 48 83 f8 fc 75 19 83 e2 39 83 fa 08 75 11 e8 26 ff ff ff 66 0f 1f 44 00 00 48 8b 45 10 0f 05 <48> 8b 5d f8 c9 c3 0f 1f 40 00 f3 0f 1e fa 55 48 89 e5 48 83 ec 08

and

[   23.004231] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[   23.004464] rcu: 	Tasks blocked on level-0 rcu_node (CPUs 0-3): P202/6:b..l
[   23.004736] rcu: 	(detected by 2, t=21002 jiffies, g=-663, q=940 ncpus=4)
[   23.004992] task:dhcpcd          state:S stack:0     pid:202   tgid:202   ppid:196    task_flags:0x400140 flags:0x00004002
[   23.005416] Call Trace:
[   23.005515]  <TASK>
[   23.005603]  __schedule+0x3ca/0xee0
[   23.005754]  schedule+0x22/0xd0
[   23.005878]  schedule_hrtimeout_range_clock+0xf2/0x100
[   23.006075]  poll_schedule_timeout.constprop.0+0x32/0x80
[   23.006281]  do_sys_poll+0x3bb/0x550
[   23.006424]  ? __pfx_pollwake+0x10/0x10
[   23.006573]  ? __pfx_pollwake+0x10/0x10
[   23.006712]  ? __pfx_pollwake+0x10/0x10
[   23.006848]  __x64_sys_ppoll+0xc9/0x160
[   23.006993]  do_syscall_64+0xa4/0x260
[   23.007140]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
[   23.007339] RIP: 0033:0x7fd12e0a2006
[   23.007483] RSP: 002b:00007ffe2f0f28e0 EFLAGS: 00000202 ORIG_RAX: 000000000000010f
[   23.007770] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fd12e0a2006
[   23.008035] RDX: 0000000000000000 RSI: 0000000000000003 RDI: 000055abb0c5ae20
[   23.008309] RBP: 00007ffe2f0f2900 R08: 0000000000000008 R09: 0000000000000000
[   23.008588] R10: 00007ffe2f0f2c80 R11: 0000000000000202 R12: 000055abb0c3cd80
[   23.008869] R13: 00007ffe2f0f2c80 R14: 000055ab9297b5c0 R15: 0000000000000000
[   23.009141]  </TASK>

Here.

I identify the bug below.

On Wed, Jul 30, 2025 at 06:34:04PM -0700, Suren Baghdasaryan wrote:
> vma_start_read() can drop and reacquire RCU lock in certain failure
> cases. It's not apparent that the RCU session started by the caller of
> this function might be interrupted when vma_start_read() fails to lock
> the vma. This might become a source of subtle bugs and to prevent that
> we change the locking rules for vma_start_read() to drop RCU read lock
> upon failure. This way it's more obvious that RCU-protected objects are
> unsafe after vma locking fails.
>
> Suggested-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  mm/mmap_lock.c | 76 +++++++++++++++++++++++++++++---------------------
>  1 file changed, 44 insertions(+), 32 deletions(-)
>
> diff --git a/mm/mmap_lock.c b/mm/mmap_lock.c
> index 10826f347a9f..0129db8f652f 100644
> --- a/mm/mmap_lock.c
> +++ b/mm/mmap_lock.c
> @@ -136,15 +136,21 @@ void vma_mark_detached(struct vm_area_struct *vma)
>   * Returns the vma on success, NULL on failure to lock and EAGAIN if vma got
>   * detached.
>   *
> - * WARNING! The vma passed to this function cannot be used if the function
> - * fails to lock it because in certain cases RCU lock is dropped and then
> - * reacquired. Once RCU lock is dropped the vma can be concurently freed.
> + * WARNING! On entrance to this function RCU read lock should be held and it
> + * is released if function fails to lock the vma, therefore vma passed to this
> + * function cannot be used if the function fails to lock it.
> + * When vma is successfully locked, RCU read lock is kept intact and RCU read
> + * session is not interrupted. This is important when locking is done while
> + * walking the vma tree under RCU using vma_iterator because if the RCU lock
> + * is dropped, the iterator becomes invalid.
>   */

I feel like this is a bit of a wall of noise, can we add a clearly separated line like:

	...
	*

	* IMPORTANT: RCU lock must be held upon entering the function, but
        *            upon error IT IS RELEASED. The caller must handle this
	*            correctly.
	*/

>  static inline struct vm_area_struct *vma_start_read(struct mm_struct *mm,
>  						    struct vm_area_struct *vma)
>  {

I was thinking we could split this out into a wrapper __vma_start_read()
function but then the stability check won't really fit properly so never
mind :)

> +	struct mm_struct *other_mm;
>  	int oldcnt;
>
> +	RCU_LOCKDEP_WARN(!rcu_read_lock_held(), "no rcu lock held");

Good to add this.

>  	/*
>  	 * Check before locking. A race might cause false locked result.
>  	 * We can use READ_ONCE() for the mm_lock_seq here, and don't need
> @@ -152,8 +158,10 @@ static inline struct vm_area_struct *vma_start_read(struct mm_struct *mm,
>  	 * we don't rely on for anything - the mm_lock_seq read against which we
>  	 * need ordering is below.
>  	 */
> -	if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence))
> -		return NULL;
> +	if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence)) {
> +		vma = NULL;
> +		goto err;
> +	}
>
>  	/*
>  	 * If VMA_LOCK_OFFSET is set, __refcount_inc_not_zero_limited_acquire()
> @@ -164,7 +172,8 @@ static inline struct vm_area_struct *vma_start_read(struct mm_struct *mm,
>  	if (unlikely(!__refcount_inc_not_zero_limited_acquire(&vma->vm_refcnt, &oldcnt,
>  							      VMA_REF_LIMIT))) {
>  		/* return EAGAIN if vma got detached from under us */
> -		return oldcnt ? NULL : ERR_PTR(-EAGAIN);
> +		vma = oldcnt ? NULL : ERR_PTR(-EAGAIN);
> +		goto err;
>  	}
>
>  	rwsem_acquire_read(&vma->vmlock_dep_map, 0, 1, _RET_IP_);
> @@ -175,23 +184,8 @@ static inline struct vm_area_struct *vma_start_read(struct mm_struct *mm,
>  	 * is dropped and before rcuwait_wake_up(mm) is called. Grab it before
>  	 * releasing vma->vm_refcnt.
>  	 */

I feel like this comment above should be moved below to where the 'action' is.

> -	if (unlikely(vma->vm_mm != mm)) {
> -		/* Use a copy of vm_mm in case vma is freed after we drop vm_refcnt */
> -		struct mm_struct *other_mm = vma->vm_mm;
> -
> -		/*
> -		 * __mmdrop() is a heavy operation and we don't need RCU
> -		 * protection here. Release RCU lock during these operations.
> -		 * We reinstate the RCU read lock as the caller expects it to
> -		 * be held when this function returns even on error.
> -		 */
> -		rcu_read_unlock();
> -		mmgrab(other_mm);
> -		vma_refcount_put(vma);
> -		mmdrop(other_mm);
> -		rcu_read_lock();
> -		return NULL;
> -	}
> +	if (unlikely(vma->vm_mm != mm))
> +		goto err_unstable;
>
>  	/*
>  	 * Overflow of vm_lock_seq/mm_lock_seq might produce false locked result.
> @@ -206,10 +200,26 @@ static inline struct vm_area_struct *vma_start_read(struct mm_struct *mm,
>  	 */
>  	if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&mm->mm_lock_seq))) {
>  		vma_refcount_put(vma);
> -		return NULL;
> +		vma = NULL;
> +		goto err;
>  	}
>
>  	return vma;
> +err:
> +	rcu_read_unlock();
> +
> +	return vma;
> +err_unstable:

Move comment above here I think.

> +	/* Use a copy of vm_mm in case vma is freed after we drop vm_refcnt */
> +	other_mm = vma->vm_mm;
> +
> +	/* __mmdrop() is a heavy operation, do it after dropping RCU lock. */
> +	rcu_read_unlock();
> +	mmgrab(other_mm);
> +	vma_refcount_put(vma);
> +	mmdrop(other_mm);
> +
> +	return NULL;
>  }
>
>  /*
> @@ -223,8 +233,8 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
>  	MA_STATE(mas, &mm->mm_mt, address, address);
>  	struct vm_area_struct *vma;
>
> -	rcu_read_lock();
>  retry:
> +	rcu_read_lock();
>  	vma = mas_walk(&mas);
>  	if (!vma)
>  		goto inval;
                ^
                |---- this is incorrect, you took the RCU read lock above, but you don't unlock... :)

You can fix easily with:

	if (!vma) {
		rcu_read_unlock();
		goto inval;
	}

Which fixes the issue locally for me.
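
With that applied, the top of lock_vma_under_rcu() ends up looking roughly
like this (just a sketch of the flow, with the -EAGAIN handling in the
middle elided):

retry:
	rcu_read_lock();
	vma = mas_walk(&mas);
	if (!vma) {
		rcu_read_unlock();
		goto inval;
	}

	vma = vma_start_read(mm, vma);
	if (IS_ERR_OR_NULL(vma)) {
		/* -EAGAIN -> goto retry (elided), otherwise: */
		/* Failed to lock the VMA; RCU lock already dropped */
		goto inval;
	}

	rcu_read_unlock();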

> @@ -241,6 +251,9 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
>  		/* Failed to lock the VMA */
>  		goto inval;
>  	}
> +
> +	rcu_read_unlock();
> +
>  	/*
>  	 * At this point, we have a stable reference to a VMA: The VMA is
>  	 * locked and we know it hasn't already been isolated.
> @@ -249,16 +262,14 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
>  	 */
>
>  	/* Check if the vma we locked is the right one. */
> -	if (unlikely(address < vma->vm_start || address >= vma->vm_end))
> -		goto inval_end_read;
> +	if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
> +		vma_end_read(vma);
> +		goto inval;
> +	}
>
> -	rcu_read_unlock();
>  	return vma;
>
> -inval_end_read:
> -	vma_end_read(vma);
>  inval:
> -	rcu_read_unlock();
>  	count_vm_vma_lock_event(VMA_LOCK_ABORT);
>  	return NULL;
>  }
> @@ -313,6 +324,7 @@ struct vm_area_struct *lock_next_vma(struct mm_struct *mm,
>  		 */
>  		if (PTR_ERR(vma) == -EAGAIN) {
>  			/* reset to search from the last address */
> +			rcu_read_lock();
>  			vma_iter_set(vmi, from_addr);
>  			goto retry;
>  		}
> @@ -342,9 +354,9 @@ struct vm_area_struct *lock_next_vma(struct mm_struct *mm,
>  	return vma;
>
>  fallback_unlock:
> +	rcu_read_unlock();
>  	vma_end_read(vma);
>  fallback:
> -	rcu_read_unlock();
>  	vma = lock_next_vma_under_mmap_lock(mm, vmi, from_addr);
>  	rcu_read_lock();
>  	/* Reinitialize the iterator after re-entering rcu read section */
> --
> 2.50.1.552.g942d659e1b-goog
>


* Re: [PATCH 2/2] mm: change vma_start_read() to drop RCU lock on failure
  2025-07-31 11:49   ` Lorenzo Stoakes
@ 2025-07-31 14:06     ` Suren Baghdasaryan
  2025-08-01 10:41       ` Lorenzo Stoakes
  0 siblings, 1 reply; 6+ messages in thread
From: Suren Baghdasaryan @ 2025-07-31 14:06 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, jannh, Liam.Howlett, vbabka, pfalcato, linux-mm,
	linux-kernel

On Thu, Jul 31, 2025 at 4:49 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> So this patch is broken :P

Ugh, sorry about that. I must have had lockdep disabled when testing this...

>
> Am getting:
>
> [    2.002807] ------------[ cut here ]------------
> [    2.003014] Voluntary context switch within RCU read-side critical section!
> [    2.003022] WARNING: CPU: 1 PID: 202 at kernel/rcu/tree_plugin.h:332 rcu_note_context_switch+0x506/0x580
> [    2.003643] Modules linked in:
> [    2.003765] CPU: 1 UID: 0 PID: 202 Comm: dhcpcd Not tainted 6.16.0-rc5+ #41 PREEMPT(voluntary)
> [    2.004103] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.17.0-1-1 04/01/2014
> [    2.004460] RIP: 0010:rcu_note_context_switch+0x506/0x580
> [    2.004669] Code: 00 00 00 0f 85 f5 fd ff ff 49 89 90 a8 00 00 00 e9 e9 fd ff ff c6 05 86 69 90 01 01 90 48 c7 c7 98 c3 90 b8 e8 cb b4 f5 ff 90 <0f> 0b 90 90 e9 38 fb ff ff 48 8b 7d 20 48 89 3c 24 e8 64 26 d5 00
> [    2.005382] RSP: 0018:ffffb36b40607aa8 EFLAGS: 00010082
> [    2.005585] RAX: 000000000000003f RBX: ffff9c044128ad00 RCX: 0000000000000027
> [    2.005866] RDX: ffff9c0577c97f48 RSI: 0000000000000001 RDI: ffff9c0577c97f40
> [    2.006136] RBP: ffff9c0577ca9f80 R08: 40000000ffffe1f7 R09: 0000000000000000
> [    2.006411] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> [    2.006692] R13: ffff9c04fb0423d0 R14: ffffffffb82e2600 R15: ffff9c044128ad00
> [    2.006968] FS:  00007fd12e7d9740(0000) GS:ffff9c05be614000(0000) knlGS:0000000000000000
> [    2.007281] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    2.007498] CR2: 00007ffe2f0d2798 CR3: 00000001bb8b1000 CR4: 0000000000750ef0
> [    2.007770] PKRU: 55555554
> [    2.007880] Call Trace:
> [    2.007985]  <TASK>
> [    2.008076]  __schedule+0x94/0xee0
> [    2.008212]  ? __pfx_bit_wait_io+0x10/0x10
> [    2.008370]  schedule+0x22/0xd0
> [    2.008517]  io_schedule+0x41/0x60
> [    2.008653]  bit_wait_io+0xc/0x60
> [    2.008783]  __wait_on_bit+0x25/0x90
> [    2.008925]  out_of_line_wait_on_bit+0x85/0x90
> [    2.009104]  ? __pfx_wake_bit_function+0x10/0x10
> [    2.009288]  __ext4_find_entry+0x2b2/0x470
> [    2.009449]  ? __d_alloc+0x117/0x1d0
> [    2.009591]  ext4_lookup+0x6b/0x1f0
> [    2.009733]  path_openat+0x895/0x1030
> [    2.009880]  do_filp_open+0xc3/0x150
> [    2.010021]  ? do_anonymous_page+0x5b1/0xae0
> [    2.010195]  do_sys_openat2+0x76/0xc0
> [    2.010339]  __x64_sys_openat+0x4f/0x70
> [    2.010490]  do_syscall_64+0xa4/0x260
> [    2.010638]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> [    2.010840] RIP: 0033:0x7fd12e0a2006
> [    2.010984] Code: 5d e8 41 8b 93 08 03 00 00 59 5e 48 83 f8 fc 75 19 83 e2 39 83 fa 08 75 11 e8 26 ff ff ff 66 0f 1f 44 00 00 48 8b 45 10 0f 05 <48> 8b 5d f8 c9 c3 0f 1f 40 00 f3 0f 1e fa 55 48 89 e5 48 83 ec 08
>
> and
>
> [   23.004231] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> [   23.004464] rcu:     Tasks blocked on level-0 rcu_node (CPUs 0-3): P202/6:b..l
> [   23.004736] rcu:     (detected by 2, t=21002 jiffies, g=-663, q=940 ncpus=4)
> [   23.004992] task:dhcpcd          state:S stack:0     pid:202   tgid:202   ppid:196    task_flags:0x400140 flags:0x00004002
> [   23.005416] Call Trace:
> [   23.005515]  <TASK>
> [   23.005603]  __schedule+0x3ca/0xee0
> [   23.005754]  schedule+0x22/0xd0
> [   23.005878]  schedule_hrtimeout_range_clock+0xf2/0x100
> [   23.006075]  poll_schedule_timeout.constprop.0+0x32/0x80
> [   23.006281]  do_sys_poll+0x3bb/0x550
> [   23.006424]  ? __pfx_pollwake+0x10/0x10
> [   23.006573]  ? __pfx_pollwake+0x10/0x10
> [   23.006712]  ? __pfx_pollwake+0x10/0x10
> [   23.006848]  __x64_sys_ppoll+0xc9/0x160
> [   23.006993]  do_syscall_64+0xa4/0x260
> [   23.007140]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> [   23.007339] RIP: 0033:0x7fd12e0a2006
> [   23.007483] RSP: 002b:00007ffe2f0f28e0 EFLAGS: 00000202 ORIG_RAX: 000000000000010f
> [   23.007770] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fd12e0a2006
> [   23.008035] RDX: 0000000000000000 RSI: 0000000000000003 RDI: 000055abb0c5ae20
> [   23.008309] RBP: 00007ffe2f0f2900 R08: 0000000000000008 R09: 0000000000000000
> [   23.008588] R10: 00007ffe2f0f2c80 R11: 0000000000000202 R12: 000055abb0c3cd80
> [   23.008869] R13: 00007ffe2f0f2c80 R14: 000055ab9297b5c0 R15: 0000000000000000
> [   23.009141]  </TASK>
>
> Here.
>
> I identify the bug below.
>
> On Wed, Jul 30, 2025 at 06:34:04PM -0700, Suren Baghdasaryan wrote:
> > vma_start_read() can drop and reacquire RCU lock in certain failure
> > cases. It's not apparent that the RCU session started by the caller of
> > this function might be interrupted when vma_start_read() fails to lock
> > the vma. This might become a source of subtle bugs and to prevent that
> > we change the locking rules for vma_start_read() to drop RCU read lock
> > upon failure. This way it's more obvious that RCU-protected objects are
> > unsafe after vma locking fails.
> >
> > Suggested-by: Vlastimil Babka <vbabka@suse.cz>
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> >  mm/mmap_lock.c | 76 +++++++++++++++++++++++++++++---------------------
> >  1 file changed, 44 insertions(+), 32 deletions(-)
> >
> > diff --git a/mm/mmap_lock.c b/mm/mmap_lock.c
> > index 10826f347a9f..0129db8f652f 100644
> > --- a/mm/mmap_lock.c
> > +++ b/mm/mmap_lock.c
> > @@ -136,15 +136,21 @@ void vma_mark_detached(struct vm_area_struct *vma)
> >   * Returns the vma on success, NULL on failure to lock and EAGAIN if vma got
> >   * detached.
> >   *
> > - * WARNING! The vma passed to this function cannot be used if the function
> > - * fails to lock it because in certain cases RCU lock is dropped and then
> > - * reacquired. Once RCU lock is dropped the vma can be concurently freed.
> > + * WARNING! On entrance to this function RCU read lock should be held and it
> > + * is released if function fails to lock the vma, therefore vma passed to this
> > + * function cannot be used if the function fails to lock it.
> > + * When vma is successfully locked, RCU read lock is kept intact and RCU read
> > + * session is not interrupted. This is important when locking is done while
> > + * walking the vma tree under RCU using vma_iterator because if the RCU lock
> > + * is dropped, the iterator becomes invalid.
> >   */
>
> I feel like this is a bit of a wall of noise, can we add a clearly separated line like:
>
>         ...
>         *
>
>         * IMPORTANT: RCU lock must be held upon entering the function, but
>         *            upon error IT IS RELEASED. The caller must handle this
>         *            correctly.
>         */

Are you suggesting to replace my comment or amend it with this one? I
think the answer is "replace" but I want to confirm.

>
> >  static inline struct vm_area_struct *vma_start_read(struct mm_struct *mm,
> >                                                   struct vm_area_struct *vma)
> >  {
>
> I was thinking we could split this out into a wrapper __vma_start_read()
> function but then the stability check won't really fit properly so never
> mind :)
>
> > +     struct mm_struct *other_mm;
> >       int oldcnt;
> >
> > +     RCU_LOCKDEP_WARN(!rcu_read_lock_held(), "no rcu lock held");
>
> Good to add this.
>
> >       /*
> >        * Check before locking. A race might cause false locked result.
> >        * We can use READ_ONCE() for the mm_lock_seq here, and don't need
> > @@ -152,8 +158,10 @@ static inline struct vm_area_struct *vma_start_read(struct mm_struct *mm,
> >        * we don't rely on for anything - the mm_lock_seq read against which we
> >        * need ordering is below.
> >        */
> > -     if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence))
> > -             return NULL;
> > +     if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence)) {
> > +             vma = NULL;
> > +             goto err;
> > +     }
> >
> >       /*
> >        * If VMA_LOCK_OFFSET is set, __refcount_inc_not_zero_limited_acquire()
> > @@ -164,7 +172,8 @@ static inline struct vm_area_struct *vma_start_read(struct mm_struct *mm,
> >       if (unlikely(!__refcount_inc_not_zero_limited_acquire(&vma->vm_refcnt, &oldcnt,
> >                                                             VMA_REF_LIMIT))) {
> >               /* return EAGAIN if vma got detached from under us */
> > -             return oldcnt ? NULL : ERR_PTR(-EAGAIN);
> > +             vma = oldcnt ? NULL : ERR_PTR(-EAGAIN);
> > +             goto err;
> >       }
> >
> >       rwsem_acquire_read(&vma->vmlock_dep_map, 0, 1, _RET_IP_);
> > @@ -175,23 +184,8 @@ static inline struct vm_area_struct *vma_start_read(struct mm_struct *mm,
> >        * is dropped and before rcuwait_wake_up(mm) is called. Grab it before
> >        * releasing vma->vm_refcnt.
> >        */
>
> I feel like this comment above should be moved below to where the 'action' is.

Ack.

>
> > -     if (unlikely(vma->vm_mm != mm)) {
> > -             /* Use a copy of vm_mm in case vma is freed after we drop vm_refcnt */
> > -             struct mm_struct *other_mm = vma->vm_mm;
> > -
> > -             /*
> > -              * __mmdrop() is a heavy operation and we don't need RCU
> > -              * protection here. Release RCU lock during these operations.
> > -              * We reinstate the RCU read lock as the caller expects it to
> > -              * be held when this function returns even on error.
> > -              */
> > -             rcu_read_unlock();
> > -             mmgrab(other_mm);
> > -             vma_refcount_put(vma);
> > -             mmdrop(other_mm);
> > -             rcu_read_lock();
> > -             return NULL;
> > -     }
> > +     if (unlikely(vma->vm_mm != mm))
> > +             goto err_unstable;
> >
> >       /*
> >        * Overflow of vm_lock_seq/mm_lock_seq might produce false locked result.
> > @@ -206,10 +200,26 @@ static inline struct vm_area_struct *vma_start_read(struct mm_struct *mm,
> >        */
> >       if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&mm->mm_lock_seq))) {
> >               vma_refcount_put(vma);
> > -             return NULL;
> > +             vma = NULL;
> > +             goto err;
> >       }
> >
> >       return vma;
> > +err:
> > +     rcu_read_unlock();
> > +
> > +     return vma;
> > +err_unstable:
>
> Move comment above here I think.

Got it.

>
> > +     /* Use a copy of vm_mm in case vma is freed after we drop vm_refcnt */
> > +     other_mm = vma->vm_mm;
> > +
> > +     /* __mmdrop() is a heavy operation, do it after dropping RCU lock. */
> > +     rcu_read_unlock();
> > +     mmgrab(other_mm);
> > +     vma_refcount_put(vma);
> > +     mmdrop(other_mm);
> > +
> > +     return NULL;
> >  }
> >
> >  /*
> > @@ -223,8 +233,8 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> >       MA_STATE(mas, &mm->mm_mt, address, address);
> >       struct vm_area_struct *vma;
> >
> > -     rcu_read_lock();
> >  retry:
> > +     rcu_read_lock();
> >       vma = mas_walk(&mas);
> >       if (!vma)
> >               goto inval;
>                 ^
>                 |---- this is incorrect, you took the RCU read lock above, but you don't unlock... :)
>
> You can fix easily with:
>
>         if (!vma) {
>                 rcu_read_unlock();
>                 goto inval;
>         }
>
> Which fixes the issue locally for me.

Yes, I overlooked that here we did not yet attempt to lock the vma. Will fix.
Thanks!

>
> > @@ -241,6 +251,9 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> >               /* Failed to lock the VMA */
> >               goto inval;
> >       }
> > +
> > +     rcu_read_unlock();
> > +
> >       /*
> >        * At this point, we have a stable reference to a VMA: The VMA is
> >        * locked and we know it hasn't already been isolated.
> > @@ -249,16 +262,14 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> >        */
> >
> >       /* Check if the vma we locked is the right one. */
> > -     if (unlikely(address < vma->vm_start || address >= vma->vm_end))
> > -             goto inval_end_read;
> > +     if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
> > +             vma_end_read(vma);
> > +             goto inval;
> > +     }
> >
> > -     rcu_read_unlock();
> >       return vma;
> >
> > -inval_end_read:
> > -     vma_end_read(vma);
> >  inval:
> > -     rcu_read_unlock();
> >       count_vm_vma_lock_event(VMA_LOCK_ABORT);
> >       return NULL;
> >  }
> > @@ -313,6 +324,7 @@ struct vm_area_struct *lock_next_vma(struct mm_struct *mm,
> >                */
> >               if (PTR_ERR(vma) == -EAGAIN) {
> >                       /* reset to search from the last address */
> > +                     rcu_read_lock();
> >                       vma_iter_set(vmi, from_addr);
> >                       goto retry;
> >               }
> > @@ -342,9 +354,9 @@ struct vm_area_struct *lock_next_vma(struct mm_struct *mm,
> >       return vma;
> >
> >  fallback_unlock:
> > +     rcu_read_unlock();
> >       vma_end_read(vma);
> >  fallback:
> > -     rcu_read_unlock();
> >       vma = lock_next_vma_under_mmap_lock(mm, vmi, from_addr);
> >       rcu_read_lock();
> >       /* Reinitialize the iterator after re-entering rcu read section */
> > --
> > 2.50.1.552.g942d659e1b-goog
> >


* Re: [PATCH 2/2] mm: change vma_start_read() to drop RCU lock on failure
  2025-07-31 14:06     ` Suren Baghdasaryan
@ 2025-08-01 10:41       ` Lorenzo Stoakes
  0 siblings, 0 replies; 6+ messages in thread
From: Lorenzo Stoakes @ 2025-08-01 10:41 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, jannh, Liam.Howlett, vbabka, pfalcato, linux-mm,
	linux-kernel

On Thu, Jul 31, 2025 at 07:06:41AM -0700, Suren Baghdasaryan wrote:
> On Thu, Jul 31, 2025 at 4:49 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > So this patch is broken :P
>
> Ugh, sorry about that. I must have had lockdep disabled when testing this...

No worries, curiously it doesn't seem to be lockdep that picked this up. RCU
tiny doesn't seem to do this check, so perhaps you somehow had that enabled?
Not sure, strange one!

[snip]

> > > Suggested-by: Vlastimil Babka <vbabka@suse.cz>
> > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > ---
> > >  mm/mmap_lock.c | 76 +++++++++++++++++++++++++++++---------------------
> > >  1 file changed, 44 insertions(+), 32 deletions(-)
> > >
> > > diff --git a/mm/mmap_lock.c b/mm/mmap_lock.c
> > > index 10826f347a9f..0129db8f652f 100644
> > > --- a/mm/mmap_lock.c
> > > +++ b/mm/mmap_lock.c
> > > @@ -136,15 +136,21 @@ void vma_mark_detached(struct vm_area_struct *vma)
> > >   * Returns the vma on success, NULL on failure to lock and EAGAIN if vma got
> > >   * detached.
> > >   *
> > > - * WARNING! The vma passed to this function cannot be used if the function
> > > - * fails to lock it because in certain cases RCU lock is dropped and then
> > > - * reacquired. Once RCU lock is dropped the vma can be concurently freed.
> > > + * WARNING! On entrance to this function RCU read lock should be held and it
> > > + * is released if function fails to lock the vma, therefore vma passed to this
> > > + * function cannot be used if the function fails to lock it.
> > > + * When vma is successfully locked, RCU read lock is kept intact and RCU read
> > > + * session is not interrupted. This is important when locking is done while
> > > + * walking the vma tree under RCU using vma_iterator because if the RCU lock
> > > + * is dropped, the iterator becomes invalid.
> > >   */
> >
> > I feel like this is a bit of a wall of noise, can we add a clearly separated line like:
> >
> >         ...
> >         *
> >
> >         * IMPORTANT: RCU lock must be held upon entering the function, but
> >         *            upon error IT IS RELEASED. The caller must handle this
> >         *            correctly.
> >         */
>
> Are you suggesting to replace my comment or amend it with this one? I
> think the answer is "replace" but I want to confirm.

Oh no, amend is fine!

I see you sent a v2 so will check that :P

[snip]

> > >  /*
> > > @@ -223,8 +233,8 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> > >       MA_STATE(mas, &mm->mm_mt, address, address);
> > >       struct vm_area_struct *vma;
> > >
> > > -     rcu_read_lock();
> > >  retry:
> > > +     rcu_read_lock();
> > >       vma = mas_walk(&mas);
> > >       if (!vma)
> > >               goto inval;
> >                 ^
> >                 |---- this is incorrect, you took the RCU read lock above, but you don't unlock... :)
> >
> > You can fix easily with:
> >
> >         if (!vma) {
> >                 rcu_read_unlock();
> >                 goto inval;
> >         }
> >
> > Which fixes the issue locally for me.
>
> Yes, I overlooked that here we did not yet attempt to lock the vma. Will fix.
> Thanks!

Yeah easy one to miss!

This stuff is very subtle, but I hope that when I finally get the VMA lock
'extra docs' done (written really for us, or rather for mm kernel devs, rather
than for kernel devs more broadly) we can help address some of the subtle
aspects.

In reality of course, my main motivation with that is to ensure _I_ have docs I
can go reference when needed :P

[snip]

Cheers, Lorenzo

