* [PATCH 0/2] RFC: READ/WRITE_ONCE vma/mm cleanups
@ 2019-03-01  3:55 Andrea Arcangeli
  2019-03-01  3:55 ` [PATCH 1/2] coredump: use READ_ONCE to read mm->flags Andrea Arcangeli
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Andrea Arcangeli @ 2019-03-01  3:55 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, Hugh Dickins, Kirill A . Shutemov, Michal Hocko
Hello,
This was a well known issue for more than a decade, but until a few
months ago we relied on the compiler to stick to atomic accesses and
updates while walking and updating pagetables.
However now the 64bit native_set_pte finally uses WRITE_ONCE and
gup_pmd_range uses READ_ONCE as well.
This convert more racy VM places to avoid depending on the expected
compiler behavior to achieve kernel runtime correctness.
It mostly guarantees gcc to do atomic updates at 64bit granularity
(practically not needed) and it also prevents gcc to emit code that
risks getting confused if the memory unexpectedly changes under it
(unlikely to ever be needed).
The list of vm_start/end/pgoff to update isn't complete, I covered the
most obvious places, but before wasting too much time at doing a full
audit I thought it was safer to post it and get some comment. More
updates can be posted incrementally anyway.
Andrea Arcangeli (2):
  coredump: use READ_ONCE to read mm->flags
  mm: use READ/WRITE_ONCE to access anonymous vmas
    vm_start/vm_end/vm_pgoff
 fs/coredump.c |  2 +-
 mm/gup.c      | 23 +++++++++++++----------
 mm/internal.h |  3 ++-
 mm/memory.c   |  2 +-
 mm/mmap.c     | 16 ++++++++--------
 mm/rmap.c     |  3 ++-
 mm/vmacache.c |  3 ++-
 7 files changed, 29 insertions(+), 23 deletions(-)
^ permalink raw reply	[flat|nested] 9+ messages in thread
* [PATCH 1/2] coredump: use READ_ONCE to read mm->flags
  2019-03-01  3:55 [PATCH 0/2] RFC: READ/WRITE_ONCE vma/mm cleanups Andrea Arcangeli
@ 2019-03-01  3:55 ` Andrea Arcangeli
  2019-03-01  3:55 ` [PATCH 2/2] mm: use READ/WRITE_ONCE to access anonymous vmas vm_start/vm_end/vm_pgoff Andrea Arcangeli
  2019-03-01  9:37 ` [PATCH 0/2] RFC: READ/WRITE_ONCE vma/mm cleanups Kirill A. Shutemov
  2 siblings, 0 replies; 9+ messages in thread
From: Andrea Arcangeli @ 2019-03-01  3:55 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, Hugh Dickins, Kirill A . Shutemov, Michal Hocko
mm->flags can still change freely under the coredump using atomic
bitops in proc_coredump_filter_write(). So read the mm->flags with
READ_ONCE for correctness.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/coredump.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/coredump.c b/fs/coredump.c
index e42e17e55bfd..cc175d52090a 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -560,7 +560,7 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 		 * inconsistency of bit flags, since this flag is not protected
 		 * by any locks.
 		 */
-		.mm_flags = mm->flags,
+		.mm_flags = READ_ONCE(mm->flags),
 	};
 
 	audit_core_dumps(siginfo->si_signo);
^ permalink raw reply related	[flat|nested] 9+ messages in thread
* [PATCH 2/2] mm: use READ/WRITE_ONCE to access anonymous vmas vm_start/vm_end/vm_pgoff
  2019-03-01  3:55 [PATCH 0/2] RFC: READ/WRITE_ONCE vma/mm cleanups Andrea Arcangeli
  2019-03-01  3:55 ` [PATCH 1/2] coredump: use READ_ONCE to read mm->flags Andrea Arcangeli
@ 2019-03-01  3:55 ` Andrea Arcangeli
  2019-03-01  9:37 ` [PATCH 0/2] RFC: READ/WRITE_ONCE vma/mm cleanups Kirill A. Shutemov
  2 siblings, 0 replies; 9+ messages in thread
From: Andrea Arcangeli @ 2019-03-01  3:55 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, Hugh Dickins, Kirill A . Shutemov, Michal Hocko
This converts the updates under mmap_sem for reading, rmap lock for
writing and PT lock to vm_start/end/pgoff of anonymous vmas to use
WRITE_ONCE().
This also converts some of the accesses under mmap_sem for reading
that are concurrent with the aforementioned WRITE_ONCE()s to use
READ_ONCE().
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/gup.c      | 23 +++++++++++++----------
 mm/internal.h |  3 ++-
 mm/memory.c   |  2 +-
 mm/mmap.c     | 16 ++++++++--------
 mm/rmap.c     |  3 ++-
 mm/vmacache.c |  3 ++-
 6 files changed, 28 insertions(+), 22 deletions(-)
diff --git a/mm/gup.c b/mm/gup.c
index 75029649baca..5cac5c462b40 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -699,7 +699,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		unsigned int page_increm;
 
 		/* first iteration or cross vma bound */
-		if (!vma || start >= vma->vm_end) {
+		if (!vma || start >= READ_ONCE(vma->vm_end)) {
 			vma = find_extend_vma(mm, start);
 			if (!vma && in_gate_area(mm, start)) {
 				ret = get_gate_page(mm, start & PAGE_MASK,
@@ -850,7 +850,7 @@ int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
 
 retry:
 	vma = find_extend_vma(mm, address);
-	if (!vma || address < vma->vm_start)
+	if (!vma || address < READ_ONCE(vma->vm_start))
 		return -EFAULT;
 
 	if (!vma_permits_fault(vma, fault_flags))
@@ -1218,8 +1218,8 @@ long populate_vma_page_range(struct vm_area_struct *vma,
 
 	VM_BUG_ON(start & ~PAGE_MASK);
 	VM_BUG_ON(end   & ~PAGE_MASK);
-	VM_BUG_ON_VMA(start < vma->vm_start, vma);
-	VM_BUG_ON_VMA(end   > vma->vm_end, vma);
+	VM_BUG_ON_VMA(start < READ_ONCE(vma->vm_start), vma);
+	VM_BUG_ON_VMA(end   > READ_ONCE(vma->vm_end), vma);
 	VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_sem), mm);
 
 	gup_flags = FOLL_TOUCH | FOLL_POPULATE | FOLL_MLOCK;
@@ -1258,7 +1258,7 @@ long populate_vma_page_range(struct vm_area_struct *vma,
 int __mm_populate(unsigned long start, unsigned long len, int ignore_errors)
 {
 	struct mm_struct *mm = current->mm;
-	unsigned long end, nstart, nend;
+	unsigned long end, nstart, nend, vma_start, vma_end;
 	struct vm_area_struct *vma = NULL;
 	int locked = 0;
 	long ret = 0;
@@ -1274,19 +1274,22 @@ int __mm_populate(unsigned long start, unsigned long len, int ignore_errors)
 			locked = 1;
 			down_read(&mm->mmap_sem);
 			vma = find_vma(mm, nstart);
-		} else if (nstart >= vma->vm_end)
+		} else if (nstart >= vma_end)
 			vma = vma->vm_next;
-		if (!vma || vma->vm_start >= end)
+		if (!vma)
 			break;
+		vma_start = READ_ONCE(vma->vm_start);
+		if (vma_start >= end)
+			break;
+		vma_end = READ_ONCE(vma->vm_end);
 		/*
 		 * Set [nstart; nend) to intersection of desired address
 		 * range with the first VMA. Also, skip undesirable VMA types.
 		 */
-		nend = min(end, vma->vm_end);
+		nend = min(end, vma_end);
 		if (vma->vm_flags & (VM_IO | VM_PFNMAP))
 			continue;
-		if (nstart < vma->vm_start)
-			nstart = vma->vm_start;
+		nstart = max(nstart, vma_start);
 		/*
 		 * Now fault in a range of pages. populate_vma_page_range()
 		 * double checks the vma flags, so that it won't mlock pages
diff --git a/mm/internal.h b/mm/internal.h
index f4a7bb02decf..839dbcf3c7ed 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -337,7 +337,8 @@ static inline unsigned long
 __vma_address(struct page *page, struct vm_area_struct *vma)
 {
 	pgoff_t pgoff = page_to_pgoff(page);
-	return vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
+	return READ_ONCE(vma->vm_start) +
+		((pgoff - READ_ONCE(vma->vm_pgoff)) << PAGE_SHIFT);
 }
 
 static inline unsigned long
diff --git a/mm/memory.c b/mm/memory.c
index 896d8aa08c0a..b76b659a026d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4257,7 +4257,7 @@ int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
 			 * we can access using slightly different code.
 			 */
 			vma = find_vma(mm, addr);
-			if (!vma || vma->vm_start > addr)
+			if (!vma || READ_ONCE(vma->vm_start) > addr)
 				break;
 			if (vma->vm_ops && vma->vm_ops->access)
 				ret = vma->vm_ops->access(vma, addr, buf,
diff --git a/mm/mmap.c b/mm/mmap.c
index f901065c4c64..9b84617c11c6 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2240,9 +2240,9 @@ struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
 
 		tmp = rb_entry(rb_node, struct vm_area_struct, vm_rb);
 
-		if (tmp->vm_end > addr) {
+		if (READ_ONCE(tmp->vm_end) > addr) {
 			vma = tmp;
-			if (tmp->vm_start <= addr)
+			if (READ_ONCE(tmp->vm_start) <= addr)
 				break;
 			rb_node = rb_node->rb_left;
 		} else
@@ -2399,7 +2399,7 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
 					mm->locked_vm += grow;
 				vm_stat_account(mm, vma->vm_flags, grow);
 				anon_vma_interval_tree_pre_update_vma(vma);
-				vma->vm_end = address;
+				WRITE_ONCE(vma->vm_end, address);
 				anon_vma_interval_tree_post_update_vma(vma);
 				if (vma->vm_next)
 					vma_gap_update(vma->vm_next);
@@ -2480,8 +2480,8 @@ int expand_downwards(struct vm_area_struct *vma,
 					mm->locked_vm += grow;
 				vm_stat_account(mm, vma->vm_flags, grow);
 				anon_vma_interval_tree_pre_update_vma(vma);
-				vma->vm_start = address;
-				vma->vm_pgoff -= grow;
+				WRITE_ONCE(vma->vm_start, address);
+				WRITE_ONCE(vma->vm_pgoff, vma->vm_pgoff - grow);
 				anon_vma_interval_tree_post_update_vma(vma);
 				vma_gap_update(vma);
 				spin_unlock(&mm->page_table_lock);
@@ -2530,7 +2530,7 @@ find_extend_vma(struct mm_struct *mm, unsigned long addr)
 	if (!prev || expand_stack(prev, addr))
 		return NULL;
 	if (prev->vm_flags & VM_LOCKED)
-		populate_vma_page_range(prev, addr, prev->vm_end, NULL);
+		populate_vma_page_range(prev, addr, READ_ONCE(prev->vm_end), NULL);
 	return prev;
 }
 #else
@@ -2549,11 +2549,11 @@ find_extend_vma(struct mm_struct *mm, unsigned long addr)
 	vma = find_vma(mm, addr);
 	if (!vma)
 		return NULL;
-	if (vma->vm_start <= addr)
+	start = READ_ONCE(vma->vm_start);
+	if (start <= addr)
 		return vma;
 	if (!(vma->vm_flags & VM_GROWSDOWN))
 		return NULL;
-	start = vma->vm_start;
 	if (expand_stack(vma, addr))
 		return NULL;
 	if (vma->vm_flags & VM_LOCKED)
diff --git a/mm/rmap.c b/mm/rmap.c
index 0454ecc29537..d8d06bb87381 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -702,7 +702,8 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)
 	} else
 		return -EFAULT;
 	address = __vma_address(page, vma);
-	if (unlikely(address < vma->vm_start || address >= vma->vm_end))
+	if (unlikely(address < READ_ONCE(vma->vm_start) ||
+		     address >= READ_ONCE(vma->vm_end)))
 		return -EFAULT;
 	return address;
 }
diff --git a/mm/vmacache.c b/mm/vmacache.c
index cdc32a3b02fa..655554c85bdb 100644
--- a/mm/vmacache.c
+++ b/mm/vmacache.c
@@ -77,7 +77,8 @@ struct vm_area_struct *vmacache_find(struct mm_struct *mm, unsigned long addr)
 			if (WARN_ON_ONCE(vma->vm_mm != mm))
 				break;
 #endif
-			if (vma->vm_start <= addr && vma->vm_end > addr) {
+			if (READ_ONCE(vma->vm_start) <= addr &&
+			    READ_ONCE(vma->vm_end) > addr) {
 				count_vm_vmacache_event(VMACACHE_FIND_HITS);
 				return vma;
 			}
^ permalink raw reply related	[flat|nested] 9+ messages in thread
* Re: [PATCH 0/2] RFC: READ/WRITE_ONCE vma/mm cleanups
  2019-03-01  3:55 [PATCH 0/2] RFC: READ/WRITE_ONCE vma/mm cleanups Andrea Arcangeli
  2019-03-01  3:55 ` [PATCH 1/2] coredump: use READ_ONCE to read mm->flags Andrea Arcangeli
  2019-03-01  3:55 ` [PATCH 2/2] mm: use READ/WRITE_ONCE to access anonymous vmas vm_start/vm_end/vm_pgoff Andrea Arcangeli
@ 2019-03-01  9:37 ` Kirill A. Shutemov
  2019-03-01 13:04   ` Vlastimil Babka
  2 siblings, 1 reply; 9+ messages in thread
From: Kirill A. Shutemov @ 2019-03-01  9:37 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, linux-mm, Hugh Dickins, Kirill A . Shutemov,
	Michal Hocko
On Thu, Feb 28, 2019 at 10:55:48PM -0500, Andrea Arcangeli wrote:
> Hello,
> 
> This was a well known issue for more than a decade, but until a few
> months ago we relied on the compiler to stick to atomic accesses and
> updates while walking and updating pagetables.
> 
> However now the 64bit native_set_pte finally uses WRITE_ONCE and
> gup_pmd_range uses READ_ONCE as well.
> 
> This convert more racy VM places to avoid depending on the expected
> compiler behavior to achieve kernel runtime correctness.
> 
> It mostly guarantees gcc to do atomic updates at 64bit granularity
> (practically not needed) and it also prevents gcc to emit code that
> risks getting confused if the memory unexpectedly changes under it
> (unlikely to ever be needed).
> 
> The list of vm_start/end/pgoff to update isn't complete, I covered the
> most obvious places, but before wasting too much time at doing a full
> audit I thought it was safer to post it and get some comment. More
> updates can be posted incrementally anyway.
The intention is described well to my eyes.
Do I understand correctly, that it's attempt to get away with modifying
vma's fields under down_read(mmap_sem)?
I'm not fan of this.
It can help with producing stable value for the one field, but it doesn't
help if more than one thing changed under you. Like if both vm_start and
vm_end modifed under you, it can lead to inconsistency. Like vm_end <
vm_start.
-- 
 Kirill A. Shutemov
^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [PATCH 0/2] RFC: READ/WRITE_ONCE vma/mm cleanups
  2019-03-01  9:37 ` [PATCH 0/2] RFC: READ/WRITE_ONCE vma/mm cleanups Kirill A. Shutemov
@ 2019-03-01 13:04   ` Vlastimil Babka
  2019-03-01 16:54     ` Andrea Arcangeli
  0 siblings, 1 reply; 9+ messages in thread
From: Vlastimil Babka @ 2019-03-01 13:04 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrea Arcangeli
  Cc: Andrew Morton, linux-mm, Hugh Dickins, Kirill A . Shutemov,
	Michal Hocko
On 3/1/19 10:37 AM, Kirill A. Shutemov wrote:
> On Thu, Feb 28, 2019 at 10:55:48PM -0500, Andrea Arcangeli wrote:
>> Hello,
>>
>> This was a well known issue for more than a decade, but until a few
>> months ago we relied on the compiler to stick to atomic accesses and
>> updates while walking and updating pagetables.
>>
>> However now the 64bit native_set_pte finally uses WRITE_ONCE and
>> gup_pmd_range uses READ_ONCE as well.
>>
>> This convert more racy VM places to avoid depending on the expected
>> compiler behavior to achieve kernel runtime correctness.
>>
>> It mostly guarantees gcc to do atomic updates at 64bit granularity
>> (practically not needed) and it also prevents gcc to emit code that
>> risks getting confused if the memory unexpectedly changes under it
>> (unlikely to ever be needed).
>>
>> The list of vm_start/end/pgoff to update isn't complete, I covered the
>> most obvious places, but before wasting too much time at doing a full
>> audit I thought it was safer to post it and get some comment. More
>> updates can be posted incrementally anyway.
> 
> The intention is described well to my eyes.
> 
> Do I understand correctly, that it's attempt to get away with modifying
> vma's fields under down_read(mmap_sem)?
If that's the intention, then IMHO it's not that well described. It
talks about "racy VM places" but e.g. the __mm_populate() changes are
for code protected by down_read(). So what's going on here?
> I'm not fan of this.
> 
> It can help with producing stable value for the one field, but it doesn't
> help if more than one thing changed under you. Like if both vm_start and
> vm_end modifed under you, it can lead to inconsistency. Like vm_end <
> vm_start.
> 
^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [PATCH 0/2] RFC: READ/WRITE_ONCE vma/mm cleanups
  2019-03-01 13:04   ` Vlastimil Babka
@ 2019-03-01 16:54     ` Andrea Arcangeli
  2019-03-01 18:49       ` Davidlohr Bueso
  2019-03-04 10:12       ` Kirill A. Shutemov
  0 siblings, 2 replies; 9+ messages in thread
From: Andrea Arcangeli @ 2019-03-01 16:54 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Kirill A. Shutemov, Andrew Morton, linux-mm, Hugh Dickins,
	Kirill A . Shutemov, Michal Hocko
Hello Kirill and Vlastimil,
On Fri, Mar 01, 2019 at 02:04:38PM +0100, Vlastimil Babka wrote:
> On 3/1/19 10:37 AM, Kirill A. Shutemov wrote:
> > On Thu, Feb 28, 2019 at 10:55:48PM -0500, Andrea Arcangeli wrote:
> >> Hello,
> >>
> >> This was a well known issue for more than a decade, but until a few
> >> months ago we relied on the compiler to stick to atomic accesses and
> >> updates while walking and updating pagetables.
> >>
> >> However now the 64bit native_set_pte finally uses WRITE_ONCE and
> >> gup_pmd_range uses READ_ONCE as well.
> >>
> >> This convert more racy VM places to avoid depending on the expected
> >> compiler behavior to achieve kernel runtime correctness.
> >>
> >> It mostly guarantees gcc to do atomic updates at 64bit granularity
> >> (practically not needed) and it also prevents gcc to emit code that
> >> risks getting confused if the memory unexpectedly changes under it
> >> (unlikely to ever be needed).
> >>
> >> The list of vm_start/end/pgoff to update isn't complete, I covered the
> >> most obvious places, but before wasting too much time at doing a full
> >> audit I thought it was safer to post it and get some comment. More
> >> updates can be posted incrementally anyway.
> > 
> > The intention is described well to my eyes.
> > 
> > Do I understand correctly, that it's attempt to get away with modifying
> > vma's fields under down_read(mmap_sem)?
The issue is that we already get away with it, but we do it without
READ/WRITE_ONCE. The patch should changes nothing, it should only
reduce the dependency on the compiler to do what we expect.
> If that's the intention, then IMHO it's not that well described. It
> talks about "racy VM places" but e.g. the __mm_populate() changes are
> for code protected by down_read(). So what's going on here?
expand_stack can move anonymous vma vm_end up or vm_start/pgoff down,
while we hold the mmap_sem for writing. See the location of the three
WRITE_ONCE in the patch.
So whenever we deal with a vma that we don't know if it's filebacked
(filebacked vmas cannot growsup/down) and that we don't know if it has
VM_GROWSDOWN/UP set, we shall use READ_ONCE to access
vm_start/end/pgoff. This is the only thing the patch is about, and it
should make no runtime difference at all, but then the WRITE_ONCE in
native_set_pte also should make no runtime difference just like the
READ_ONCE in gup_pmd_range should make no runtime difference. I mean
we don't trust the compiler with gup_fast but then we trust it with
expand_stack vs find_vma.
^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [PATCH 0/2] RFC: READ/WRITE_ONCE vma/mm cleanups
  2019-03-01 16:54     ` Andrea Arcangeli
@ 2019-03-01 18:49       ` Davidlohr Bueso
  2019-03-04 10:12       ` Kirill A. Shutemov
  1 sibling, 0 replies; 9+ messages in thread
From: Davidlohr Bueso @ 2019-03-01 18:49 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Vlastimil Babka, Kirill A. Shutemov, Andrew Morton, linux-mm,
	Hugh Dickins, Kirill A . Shutemov, Michal Hocko
On Fri, 01 Mar 2019, Andrea Arcangeli wrote:
>
>On Fri, Mar 01, 2019 at 02:04:38PM +0100, Vlastimil Babka wrote:
>> On 3/1/19 10:37 AM, Kirill A. Shutemov wrote:
>> > On Thu, Feb 28, 2019 at 10:55:48PM -0500, Andrea Arcangeli wrote:
>> >> Hello,
>> >>
>> >> This was a well known issue for more than a decade, but until a few
>> >> months ago we relied on the compiler to stick to atomic accesses and
>> >> updates while walking and updating pagetables.
>> >>
>> >> However now the 64bit native_set_pte finally uses WRITE_ONCE and
>> >> gup_pmd_range uses READ_ONCE as well.
>> >>
>> >> This convert more racy VM places to avoid depending on the expected
>> >> compiler behavior to achieve kernel runtime correctness.
>> >>
>> >> It mostly guarantees gcc to do atomic updates at 64bit granularity
>> >> (practically not needed) and it also prevents gcc to emit code that
>> >> risks getting confused if the memory unexpectedly changes under it
>> >> (unlikely to ever be needed).
>> >>
>> >> The list of vm_start/end/pgoff to update isn't complete, I covered the
>> >> most obvious places, but before wasting too much time at doing a full
>> >> audit I thought it was safer to post it and get some comment. More
>> >> updates can be posted incrementally anyway.
>> >
>> > The intention is described well to my eyes.
>> >
>> > Do I understand correctly, that it's attempt to get away with modifying
>> > vma's fields under down_read(mmap_sem)?
>
>The issue is that we already get away with it, but we do it without
>READ/WRITE_ONCE. The patch should changes nothing, it should only
>reduce the dependency on the compiler to do what we expect.
>
>> If that's the intention, then IMHO it's not that well described. It
>> talks about "racy VM places" but e.g. the __mm_populate() changes are
>> for code protected by down_read(). So what's going on here?
>
>expand_stack can move anonymous vma vm_end up or vm_start/pgoff down,
>while we hold the mmap_sem for writing. See the location of the three
>WRITE_ONCE in the patch.
You mean for reading, right? Yes, with expand_stack being held for read
such members were never really serialized by the mmap_sem and thus we
should not be computing stale values.
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Thanks.
^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [PATCH 0/2] RFC: READ/WRITE_ONCE vma/mm cleanups
  2019-03-01 16:54     ` Andrea Arcangeli
  2019-03-01 18:49       ` Davidlohr Bueso
@ 2019-03-04 10:12       ` Kirill A. Shutemov
  2019-03-05 13:00         ` Michal Hocko
  1 sibling, 1 reply; 9+ messages in thread
From: Kirill A. Shutemov @ 2019-03-04 10:12 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Vlastimil Babka, Andrew Morton, linux-mm, Hugh Dickins,
	Kirill A . Shutemov, Michal Hocko, Peter Zijlstra
On Fri, Mar 01, 2019 at 11:54:52AM -0500, Andrea Arcangeli wrote:
> Hello Kirill and Vlastimil,
> 
> On Fri, Mar 01, 2019 at 02:04:38PM +0100, Vlastimil Babka wrote:
> > On 3/1/19 10:37 AM, Kirill A. Shutemov wrote:
> > > On Thu, Feb 28, 2019 at 10:55:48PM -0500, Andrea Arcangeli wrote:
> > >> Hello,
> > >>
> > >> This was a well known issue for more than a decade, but until a few
> > >> months ago we relied on the compiler to stick to atomic accesses and
> > >> updates while walking and updating pagetables.
> > >>
> > >> However now the 64bit native_set_pte finally uses WRITE_ONCE and
> > >> gup_pmd_range uses READ_ONCE as well.
> > >>
> > >> This convert more racy VM places to avoid depending on the expected
> > >> compiler behavior to achieve kernel runtime correctness.
> > >>
> > >> It mostly guarantees gcc to do atomic updates at 64bit granularity
> > >> (practically not needed) and it also prevents gcc to emit code that
> > >> risks getting confused if the memory unexpectedly changes under it
> > >> (unlikely to ever be needed).
> > >>
> > >> The list of vm_start/end/pgoff to update isn't complete, I covered the
> > >> most obvious places, but before wasting too much time at doing a full
> > >> audit I thought it was safer to post it and get some comment. More
> > >> updates can be posted incrementally anyway.
> > > 
> > > The intention is described well to my eyes.
> > > 
> > > Do I understand correctly, that it's attempt to get away with modifying
> > > vma's fields under down_read(mmap_sem)?
> 
> The issue is that we already get away with it, but we do it without
> READ/WRITE_ONCE. The patch should changes nothing, it should only
> reduce the dependency on the compiler to do what we expect.
Yes, it is pre-existing problem. And yes, complier may screw this up.
The patch may reduce dependency on the compiler, but it doesn't mean it
reduces chance of race.
Consider your changes into __mm_populate() and populate_vma_page_range().
You put READ_ONCE() in both functions. But populate_vma_page_range() gets
called from __mm_populate(). Before your change compiler may optimize the
code and load from the memory once for a field. With your changes complier
will issue two loads.
It *increases* chances of the race, not reduces them.
The current locking scheme doesn't allow modifying VMA field without
down_write(mmap_sem).
We do have hacks[1] that try to bypass the limitation, but AFAIK we never
had a solid explanation why this should work. Sparkling READ_ONCE()
doesn't help with this, but makes it appears legitimate.
[1] I believe we also touch vm_flags without proper locking to set/clear
VM_LOCKED.
-- 
 Kirill A. Shutemov
^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [PATCH 0/2] RFC: READ/WRITE_ONCE vma/mm cleanups
  2019-03-04 10:12       ` Kirill A. Shutemov
@ 2019-03-05 13:00         ` Michal Hocko
  0 siblings, 0 replies; 9+ messages in thread
From: Michal Hocko @ 2019-03-05 13:00 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Vlastimil Babka, Andrew Morton, linux-mm,
	Hugh Dickins, Kirill A . Shutemov, Peter Zijlstra
On Mon 04-03-19 13:12:10, Kirill A. Shutemov wrote:
> On Fri, Mar 01, 2019 at 11:54:52AM -0500, Andrea Arcangeli wrote:
> > Hello Kirill and Vlastimil,
> > 
> > On Fri, Mar 01, 2019 at 02:04:38PM +0100, Vlastimil Babka wrote:
> > > On 3/1/19 10:37 AM, Kirill A. Shutemov wrote:
> > > > On Thu, Feb 28, 2019 at 10:55:48PM -0500, Andrea Arcangeli wrote:
> > > >> Hello,
> > > >>
> > > >> This was a well known issue for more than a decade, but until a few
> > > >> months ago we relied on the compiler to stick to atomic accesses and
> > > >> updates while walking and updating pagetables.
> > > >>
> > > >> However now the 64bit native_set_pte finally uses WRITE_ONCE and
> > > >> gup_pmd_range uses READ_ONCE as well.
> > > >>
> > > >> This convert more racy VM places to avoid depending on the expected
> > > >> compiler behavior to achieve kernel runtime correctness.
> > > >>
> > > >> It mostly guarantees gcc to do atomic updates at 64bit granularity
> > > >> (practically not needed) and it also prevents gcc to emit code that
> > > >> risks getting confused if the memory unexpectedly changes under it
> > > >> (unlikely to ever be needed).
> > > >>
> > > >> The list of vm_start/end/pgoff to update isn't complete, I covered the
> > > >> most obvious places, but before wasting too much time at doing a full
> > > >> audit I thought it was safer to post it and get some comment. More
> > > >> updates can be posted incrementally anyway.
> > > > 
> > > > The intention is described well to my eyes.
> > > > 
> > > > Do I understand correctly, that it's attempt to get away with modifying
> > > > vma's fields under down_read(mmap_sem)?
> > 
> > The issue is that we already get away with it, but we do it without
> > READ/WRITE_ONCE. The patch should changes nothing, it should only
> > reduce the dependency on the compiler to do what we expect.
> 
> Yes, it is pre-existing problem. And yes, complier may screw this up.
> The patch may reduce dependency on the compiler, but it doesn't mean it
> reduces chance of race.
> 
> Consider your changes into __mm_populate() and populate_vma_page_range().
> You put READ_ONCE() in both functions. But populate_vma_page_range() gets
> called from __mm_populate(). Before your change compiler may optimize the
> code and load from the memory once for a field. With your changes complier
> will issue two loads.
> 
> It *increases* chances of the race, not reduces them.
> 
> The current locking scheme doesn't allow modifying VMA field without
> down_write(mmap_sem).
> 
> We do have hacks[1] that try to bypass the limitation, but AFAIK we never
> had a solid explanation why this should work. Sparkling READ_ONCE()
> doesn't help with this, but makes it appears legitimate.
I do agree with Kirill here. Sprinkling {READ,WRITE}_ONCE around just
doesn't solve anything. I am pretty sure that people will not think
about it and we will end up in a similar half covered situation in few
years again. I would rather remove all those hacks and use a saner
locking scheme instead.
> [1] I believe we also touch vm_flags without proper locking to set/clear
> VM_LOCKED.
-- 
Michal Hocko
SUSE Labs
^ permalink raw reply	[flat|nested] 9+ messages in thread
end of thread, other threads:[~2019-03-05 13:01 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2019-03-01  3:55 [PATCH 0/2] RFC: READ/WRITE_ONCE vma/mm cleanups Andrea Arcangeli
2019-03-01  3:55 ` [PATCH 1/2] coredump: use READ_ONCE to read mm->flags Andrea Arcangeli
2019-03-01  3:55 ` [PATCH 2/2] mm: use READ/WRITE_ONCE to access anonymous vmas vm_start/vm_end/vm_pgoff Andrea Arcangeli
2019-03-01  9:37 ` [PATCH 0/2] RFC: READ/WRITE_ONCE vma/mm cleanups Kirill A. Shutemov
2019-03-01 13:04   ` Vlastimil Babka
2019-03-01 16:54     ` Andrea Arcangeli
2019-03-01 18:49       ` Davidlohr Bueso
2019-03-04 10:12       ` Kirill A. Shutemov
2019-03-05 13:00         ` Michal Hocko
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).