* [PATCH 0/2] RFC: READ/WRITE_ONCE vma/mm cleanups
From: Andrea Arcangeli @ 2019-03-01  3:55 UTC
To: Andrew Morton
Cc: linux-mm, Hugh Dickins, Kirill A. Shutemov, Michal Hocko

Hello,

This was a well known issue for more than a decade, but until a few
months ago we relied on the compiler to stick to atomic accesses and
updates while walking and updating pagetables.

However, now the 64bit native_set_pte finally uses WRITE_ONCE and
gup_pmd_range uses READ_ONCE as well.

This converts more racy VM places to avoid depending on the expected
compiler behavior to achieve kernel runtime correctness.

It mostly guarantees that gcc does atomic updates at 64bit granularity
(practically not needed) and it also prevents gcc from emitting code
that risks getting confused if the memory unexpectedly changes under it
(unlikely to ever be needed).

The list of vm_start/end/pgoff to update isn't complete; I covered the
most obvious places, but before wasting too much time on a full audit I
thought it was safer to post it and get some comments. More updates can
be posted incrementally anyway.

Andrea Arcangeli (2):
  coredump: use READ_ONCE to read mm->flags
  mm: use READ/WRITE_ONCE to access anonymous vmas vm_start/vm_end/vm_pgoff

 fs/coredump.c |  2 +-
 mm/gup.c      | 23 +++++++++++++----------
 mm/internal.h |  3 ++-
 mm/memory.c   |  2 +-
 mm/mmap.c     | 16 ++++++++--------
 mm/rmap.c     |  3 ++-
 mm/vmacache.c |  3 ++-
 7 files changed, 29 insertions(+), 23 deletions(-)
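As a quick illustration of what the cover letter is relying on, here is a
minimal standalone userspace sketch (not part of the patch; the macro
definitions only roughly mirror the kernel's, and the helper names are made
up) of the two hazards it mentions: a plain store may in theory be split or
repeated by the compiler, and a plain load may be silently re-fetched, while
a volatile access forces exactly one full-width load or store.

#include <stdio.h>

/* roughly what the kernel macros do for word-sized scalar types */
#define READ_ONCE(x)		(*(const volatile typeof(x) *)&(x))
#define WRITE_ONCE(x, val)	(*(volatile typeof(x) *)&(x) = (val))

static unsigned long vm_end;	/* stand-in for vma->vm_end */

/* writer side: the compiler must perform exactly this one store */
static void grow_up(unsigned long new_end)
{
	WRITE_ONCE(vm_end, new_end);
}

/*
 * Reader side: one explicit load into a local; later uses of 'end'
 * cannot be turned into fresh reads of vm_end behind our back.
 */
static int covers(unsigned long addr)
{
	unsigned long end = READ_ONCE(vm_end);

	return addr < end;
}

int main(void)
{
	grow_up(0x7ffffffff000UL);
	printf("%d\n", covers(0x7fffffffe000UL));
	return 0;
}

With a plain assignment and a plain read in their place, the compiler is in
theory free to split the 64bit store or to reload vm_end for every use,
which is exactly the behavior the series stops depending on.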
* [PATCH 1/2] coredump: use READ_ONCE to read mm->flags
From: Andrea Arcangeli @ 2019-03-01  3:55 UTC
To: Andrew Morton
Cc: linux-mm, Hugh Dickins, Kirill A. Shutemov, Michal Hocko

mm->flags can still change freely under the coredump using atomic bitops
in proc_coredump_filter_write(). So read the mm->flags with READ_ONCE
for correctness.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/coredump.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index e42e17e55bfd..cc175d52090a 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -560,7 +560,7 @@ void do_coredump(const kernel_siginfo_t *siginfo)
		 * inconsistency of bit flags, since this flag is not protected
		 * by any locks.
		 */
-		.mm_flags = mm->flags,
+		.mm_flags = READ_ONCE(mm->flags),
	};

	audit_core_dumps(siginfo->si_signo);
* [PATCH 2/2] mm: use READ/WRITE_ONCE to access anonymous vmas vm_start/vm_end/vm_pgoff
From: Andrea Arcangeli @ 2019-03-01  3:55 UTC
To: Andrew Morton
Cc: linux-mm, Hugh Dickins, Kirill A. Shutemov, Michal Hocko

This converts the updates to vm_start/end/pgoff of anonymous vmas (done
under mmap_sem held for reading, the rmap lock held for writing, and the
PT lock) to use WRITE_ONCE(). It also converts some of the accesses done
with mmap_sem held for reading that are concurrent with the
aforementioned WRITE_ONCE()s to use READ_ONCE().

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/gup.c      | 23 +++++++++++++----------
 mm/internal.h |  3 ++-
 mm/memory.c   |  2 +-
 mm/mmap.c     | 16 ++++++++--------
 mm/rmap.c     |  3 ++-
 mm/vmacache.c |  3 ++-
 6 files changed, 28 insertions(+), 22 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 75029649baca..5cac5c462b40 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -699,7 +699,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		unsigned int page_increm;
 
 		/* first iteration or cross vma bound */
-		if (!vma || start >= vma->vm_end) {
+		if (!vma || start >= READ_ONCE(vma->vm_end)) {
 			vma = find_extend_vma(mm, start);
 			if (!vma && in_gate_area(mm, start)) {
 				ret = get_gate_page(mm, start & PAGE_MASK,
@@ -850,7 +850,7 @@ int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
 
 retry:
 	vma = find_extend_vma(mm, address);
-	if (!vma || address < vma->vm_start)
+	if (!vma || address < READ_ONCE(vma->vm_start))
 		return -EFAULT;
 
 	if (!vma_permits_fault(vma, fault_flags))
@@ -1218,8 +1218,8 @@ long populate_vma_page_range(struct vm_area_struct *vma,
 
 	VM_BUG_ON(start & ~PAGE_MASK);
 	VM_BUG_ON(end & ~PAGE_MASK);
-	VM_BUG_ON_VMA(start < vma->vm_start, vma);
-	VM_BUG_ON_VMA(end > vma->vm_end, vma);
+	VM_BUG_ON_VMA(start < READ_ONCE(vma->vm_start), vma);
+	VM_BUG_ON_VMA(end > READ_ONCE(vma->vm_end), vma);
 	VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_sem), mm);
 
 	gup_flags = FOLL_TOUCH | FOLL_POPULATE | FOLL_MLOCK;
@@ -1258,7 +1258,7 @@ long populate_vma_page_range(struct vm_area_struct *vma,
 int __mm_populate(unsigned long start, unsigned long len, int ignore_errors)
 {
 	struct mm_struct *mm = current->mm;
-	unsigned long end, nstart, nend;
+	unsigned long end, nstart, nend, vma_start, vma_end;
 	struct vm_area_struct *vma = NULL;
 	int locked = 0;
 	long ret = 0;
@@ -1274,19 +1274,22 @@ int __mm_populate(unsigned long start, unsigned long len, int ignore_errors)
 			locked = 1;
 			down_read(&mm->mmap_sem);
 			vma = find_vma(mm, nstart);
-		} else if (nstart >= vma->vm_end)
+		} else if (nstart >= vma_end)
 			vma = vma->vm_next;
-		if (!vma || vma->vm_start >= end)
+		if (!vma)
 			break;
+		vma_start = READ_ONCE(vma->vm_start);
+		if (vma_start >= end)
+			break;
+		vma_end = READ_ONCE(vma->vm_end);
 		/*
 		 * Set [nstart; nend) to intersection of desired address
 		 * range with the first VMA. Also, skip undesirable VMA types.
 		 */
-		nend = min(end, vma->vm_end);
+		nend = min(end, vma_end);
 		if (vma->vm_flags & (VM_IO | VM_PFNMAP))
 			continue;
-		if (nstart < vma->vm_start)
-			nstart = vma->vm_start;
+		nstart = max(nstart, vma_start);
 		/*
 		 * Now fault in a range of pages. populate_vma_page_range()
 		 * double checks the vma flags, so that it won't mlock pages
diff --git a/mm/internal.h b/mm/internal.h
index f4a7bb02decf..839dbcf3c7ed 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -337,7 +337,8 @@ static inline unsigned long
 __vma_address(struct page *page, struct vm_area_struct *vma)
 {
 	pgoff_t pgoff = page_to_pgoff(page);
-	return vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
+	return READ_ONCE(vma->vm_start) +
+	       ((pgoff - READ_ONCE(vma->vm_pgoff)) << PAGE_SHIFT);
 }
 
 static inline unsigned long
diff --git a/mm/memory.c b/mm/memory.c
index 896d8aa08c0a..b76b659a026d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4257,7 +4257,7 @@ int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
		 * we can access using slightly different code.
		 */
		vma = find_vma(mm, addr);
-		if (!vma || vma->vm_start > addr)
+		if (!vma || READ_ONCE(vma->vm_start) > addr)
			break;
		if (vma->vm_ops && vma->vm_ops->access)
			ret = vma->vm_ops->access(vma, addr, buf,
diff --git a/mm/mmap.c b/mm/mmap.c
index f901065c4c64..9b84617c11c6 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2240,9 +2240,9 @@ struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
 
 		tmp = rb_entry(rb_node, struct vm_area_struct, vm_rb);
 
-		if (tmp->vm_end > addr) {
+		if (READ_ONCE(tmp->vm_end) > addr) {
 			vma = tmp;
-			if (tmp->vm_start <= addr)
+			if (READ_ONCE(tmp->vm_start) <= addr)
 				break;
 			rb_node = rb_node->rb_left;
 		} else
@@ -2399,7 +2399,7 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
				mm->locked_vm += grow;
			vm_stat_account(mm, vma->vm_flags, grow);
			anon_vma_interval_tree_pre_update_vma(vma);
-			vma->vm_end = address;
+			WRITE_ONCE(vma->vm_end, address);
			anon_vma_interval_tree_post_update_vma(vma);
			if (vma->vm_next)
				vma_gap_update(vma->vm_next);
@@ -2480,8 +2480,8 @@ int expand_downwards(struct vm_area_struct *vma,
				mm->locked_vm += grow;
			vm_stat_account(mm, vma->vm_flags, grow);
			anon_vma_interval_tree_pre_update_vma(vma);
-			vma->vm_start = address;
-			vma->vm_pgoff -= grow;
+			WRITE_ONCE(vma->vm_start, address);
+			WRITE_ONCE(vma->vm_pgoff, vma->vm_pgoff - grow);
			anon_vma_interval_tree_post_update_vma(vma);
			vma_gap_update(vma);
			spin_unlock(&mm->page_table_lock);
@@ -2530,7 +2530,7 @@ find_extend_vma(struct mm_struct *mm, unsigned long addr)
	if (!prev || expand_stack(prev, addr))
		return NULL;
	if (prev->vm_flags & VM_LOCKED)
-		populate_vma_page_range(prev, addr, prev->vm_end, NULL);
+		populate_vma_page_range(prev, addr, READ_ONCE(prev->vm_end), NULL);
	return prev;
 }
 #else
@@ -2549,11 +2549,11 @@ find_extend_vma(struct mm_struct *mm, unsigned long addr)
	vma = find_vma(mm, addr);
	if (!vma)
		return NULL;
-	if (vma->vm_start <= addr)
+	start = READ_ONCE(vma->vm_start);
+	if (start <= addr)
		return vma;
	if (!(vma->vm_flags & VM_GROWSDOWN))
		return NULL;
-	start = vma->vm_start;
	if (expand_stack(vma, addr))
		return NULL;
	if (vma->vm_flags & VM_LOCKED)
diff --git a/mm/rmap.c b/mm/rmap.c
index 0454ecc29537..d8d06bb87381 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -702,7 +702,8 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)
	} else
		return -EFAULT;
	address = __vma_address(page, vma);
-	if (unlikely(address < vma->vm_start || address >= vma->vm_end))
+	if (unlikely(address < READ_ONCE(vma->vm_start) ||
+		     address >= READ_ONCE(vma->vm_end)))
		return -EFAULT;
	return address;
 }
diff --git a/mm/vmacache.c b/mm/vmacache.c
index cdc32a3b02fa..655554c85bdb 100644
--- a/mm/vmacache.c
+++ b/mm/vmacache.c
@@ -77,7 +77,8 @@ struct vm_area_struct *vmacache_find(struct mm_struct *mm, unsigned long addr)
			if (WARN_ON_ONCE(vma->vm_mm != mm))
				break;
 #endif
-			if (vma->vm_start <= addr && vma->vm_end > addr) {
+			if (READ_ONCE(vma->vm_start) <= addr &&
+			    READ_ONCE(vma->vm_end) > addr) {
				count_vm_vmacache_event(VMACACHE_FIND_HITS);
				return vma;
			}
* Re: [PATCH 0/2] RFC: READ/WRITE_ONCE vma/mm cleanups
From: Kirill A. Shutemov @ 2019-03-01  9:37 UTC
To: Andrea Arcangeli
Cc: Andrew Morton, linux-mm, Hugh Dickins, Kirill A. Shutemov, Michal Hocko

On Thu, Feb 28, 2019 at 10:55:48PM -0500, Andrea Arcangeli wrote:
> Hello,
>
> This was a well known issue for more than a decade, but until a few
> months ago we relied on the compiler to stick to atomic accesses and
> updates while walking and updating pagetables.
>
> However, now the 64bit native_set_pte finally uses WRITE_ONCE and
> gup_pmd_range uses READ_ONCE as well.
>
> This converts more racy VM places to avoid depending on the expected
> compiler behavior to achieve kernel runtime correctness.
>
> It mostly guarantees that gcc does atomic updates at 64bit granularity
> (practically not needed) and it also prevents gcc from emitting code
> that risks getting confused if the memory unexpectedly changes under it
> (unlikely to ever be needed).
>
> The list of vm_start/end/pgoff to update isn't complete; I covered the
> most obvious places, but before wasting too much time on a full audit I
> thought it was safer to post it and get some comments. More updates can
> be posted incrementally anyway.

The intention is described well, to my eyes.

Do I understand correctly that it's an attempt to get away with
modifying vma's fields under down_read(mmap_sem)?

I'm not a fan of this.

It can help with producing a stable value for one field, but it doesn't
help if more than one thing changed under you. If both vm_start and
vm_end are modified under you, it can lead to inconsistency, like
vm_end < vm_start.

-- 
 Kirill A. Shutemov
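A standalone sketch of the inconsistency described above (illustrative
only; the variables and the mover() thread are made up, it is not kernel
code, and it may take several runs to trigger): each READ_ONCE() returns a
sane value for its own field, but two separate READ_ONCE()s are not a
consistent snapshot of the pair, so a reader can observe vm_end < vm_start
while a writer moves the range. Build with something like gcc -pthread.

#include <pthread.h>
#include <stdio.h>

#define READ_ONCE(x)		(*(const volatile typeof(x) *)&(x))
#define WRITE_ONCE(x, val)	(*(volatile typeof(x) *)&(x) = (val))

static unsigned long vm_start = 0x1000, vm_end = 0x2000;
static int stop;

/* move the range upward, one field at a time, as an updater would */
static void *mover(void *arg)
{
	unsigned long base = 0x4000;

	while (!READ_ONCE(stop)) {
		WRITE_ONCE(vm_start, base);		/* window opens  */
		WRITE_ONCE(vm_end, base + 0x1000);	/* window closes */
		base += 0x3000;
	}
	return NULL;
}

int main(void)
{
	pthread_t t;
	unsigned long inversions = 0, i;

	pthread_create(&t, NULL, mover, NULL);
	for (i = 0; i < 10000000UL; i++) {
		unsigned long start = READ_ONCE(vm_start);
		unsigned long end = READ_ONCE(vm_end);

		if (end < start)	/* inconsistent pair observed */
			inversions++;
	}
	WRITE_ONCE(stop, 1);
	pthread_join(&t, NULL);
	printf("saw vm_end < vm_start %lu times\n", inversions);
	return 0;
}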
* Re: [PATCH 0/2] RFC: READ/WRITE_ONCE vma/mm cleanups
From: Vlastimil Babka @ 2019-03-01 13:04 UTC
To: Kirill A. Shutemov, Andrea Arcangeli
Cc: Andrew Morton, linux-mm, Hugh Dickins, Kirill A. Shutemov, Michal Hocko

On 3/1/19 10:37 AM, Kirill A. Shutemov wrote:
> On Thu, Feb 28, 2019 at 10:55:48PM -0500, Andrea Arcangeli wrote:
>> Hello,
>>
>> This was a well known issue for more than a decade, but until a few
>> months ago we relied on the compiler to stick to atomic accesses and
>> updates while walking and updating pagetables.
>>
>> However, now the 64bit native_set_pte finally uses WRITE_ONCE and
>> gup_pmd_range uses READ_ONCE as well.
>>
>> This converts more racy VM places to avoid depending on the expected
>> compiler behavior to achieve kernel runtime correctness.
>>
>> It mostly guarantees that gcc does atomic updates at 64bit granularity
>> (practically not needed) and it also prevents gcc from emitting code
>> that risks getting confused if the memory unexpectedly changes under it
>> (unlikely to ever be needed).
>>
>> The list of vm_start/end/pgoff to update isn't complete; I covered the
>> most obvious places, but before wasting too much time on a full audit I
>> thought it was safer to post it and get some comments. More updates can
>> be posted incrementally anyway.
>
> The intention is described well, to my eyes.
>
> Do I understand correctly that it's an attempt to get away with
> modifying vma's fields under down_read(mmap_sem)?

If that's the intention, then IMHO it's not that well described. It
talks about "racy VM places" but e.g. the __mm_populate() changes are
for code protected by down_read(). So what's going on here?

> I'm not a fan of this.
>
> It can help with producing a stable value for one field, but it doesn't
> help if more than one thing changed under you. If both vm_start and
> vm_end are modified under you, it can lead to inconsistency, like
> vm_end < vm_start.
* Re: [PATCH 0/2] RFC: READ/WRITE_ONCE vma/mm cleanups
From: Andrea Arcangeli @ 2019-03-01 16:54 UTC
To: Vlastimil Babka
Cc: Kirill A. Shutemov, Andrew Morton, linux-mm, Hugh Dickins, Kirill A. Shutemov, Michal Hocko

Hello Kirill and Vlastimil,

On Fri, Mar 01, 2019 at 02:04:38PM +0100, Vlastimil Babka wrote:
> On 3/1/19 10:37 AM, Kirill A. Shutemov wrote:
> > On Thu, Feb 28, 2019 at 10:55:48PM -0500, Andrea Arcangeli wrote:
> >> Hello,
> >>
> >> This was a well known issue for more than a decade, but until a few
> >> months ago we relied on the compiler to stick to atomic accesses and
> >> updates while walking and updating pagetables.
> >>
> >> However, now the 64bit native_set_pte finally uses WRITE_ONCE and
> >> gup_pmd_range uses READ_ONCE as well.
> >>
> >> This converts more racy VM places to avoid depending on the expected
> >> compiler behavior to achieve kernel runtime correctness.
> >>
> >> It mostly guarantees that gcc does atomic updates at 64bit granularity
> >> (practically not needed) and it also prevents gcc from emitting code
> >> that risks getting confused if the memory unexpectedly changes under it
> >> (unlikely to ever be needed).
> >>
> >> The list of vm_start/end/pgoff to update isn't complete; I covered the
> >> most obvious places, but before wasting too much time on a full audit I
> >> thought it was safer to post it and get some comments. More updates can
> >> be posted incrementally anyway.
> >
> > The intention is described well, to my eyes.
> >
> > Do I understand correctly that it's an attempt to get away with
> > modifying vma's fields under down_read(mmap_sem)?

The issue is that we already get away with it, but we do it without
READ/WRITE_ONCE. The patch should change nothing; it should only reduce
the dependency on the compiler to do what we expect.

> If that's the intention, then IMHO it's not that well described. It
> talks about "racy VM places" but e.g. the __mm_populate() changes are
> for code protected by down_read(). So what's going on here?

expand_stack can move anonymous vma vm_end up or vm_start/pgoff down,
while we hold the mmap_sem for writing. See the location of the three
WRITE_ONCE in the patch.

So whenever we deal with a vma that we don't know to be file-backed
(file-backed vmas cannot grow up/down) and that we don't know to have
VM_GROWSDOWN/UP set, we shall use READ_ONCE to access
vm_start/end/pgoff.

This is the only thing the patch is about, and it should make no runtime
difference at all, but then the WRITE_ONCE in native_set_pte also should
make no runtime difference, just like the READ_ONCE in gup_pmd_range
should make no runtime difference. I mean, we don't trust the compiler
with gup_fast but then we trust it with expand_stack vs find_vma.
* Re: [PATCH 0/2] RFC: READ/WRITE_ONCE vma/mm cleanups
From: Davidlohr Bueso @ 2019-03-01 18:49 UTC
To: Andrea Arcangeli
Cc: Vlastimil Babka, Kirill A. Shutemov, Andrew Morton, linux-mm, Hugh Dickins, Kirill A. Shutemov, Michal Hocko

On Fri, 01 Mar 2019, Andrea Arcangeli wrote:
>
>On Fri, Mar 01, 2019 at 02:04:38PM +0100, Vlastimil Babka wrote:
>> On 3/1/19 10:37 AM, Kirill A. Shutemov wrote:
>> > On Thu, Feb 28, 2019 at 10:55:48PM -0500, Andrea Arcangeli wrote:
>> >> Hello,
>> >>
>> >> This was a well known issue for more than a decade, but until a few
>> >> months ago we relied on the compiler to stick to atomic accesses and
>> >> updates while walking and updating pagetables.
>> >>
>> >> However, now the 64bit native_set_pte finally uses WRITE_ONCE and
>> >> gup_pmd_range uses READ_ONCE as well.
>> >>
>> >> This converts more racy VM places to avoid depending on the expected
>> >> compiler behavior to achieve kernel runtime correctness.
>> >>
>> >> It mostly guarantees that gcc does atomic updates at 64bit granularity
>> >> (practically not needed) and it also prevents gcc from emitting code
>> >> that risks getting confused if the memory unexpectedly changes under it
>> >> (unlikely to ever be needed).
>> >>
>> >> The list of vm_start/end/pgoff to update isn't complete; I covered the
>> >> most obvious places, but before wasting too much time on a full audit I
>> >> thought it was safer to post it and get some comments. More updates can
>> >> be posted incrementally anyway.
>> >
>> > The intention is described well, to my eyes.
>> >
>> > Do I understand correctly that it's an attempt to get away with
>> > modifying vma's fields under down_read(mmap_sem)?
>
>The issue is that we already get away with it, but we do it without
>READ/WRITE_ONCE. The patch should change nothing; it should only reduce
>the dependency on the compiler to do what we expect.
>
>> If that's the intention, then IMHO it's not that well described. It
>> talks about "racy VM places" but e.g. the __mm_populate() changes are
>> for code protected by down_read(). So what's going on here?
>
>expand_stack can move anonymous vma vm_end up or vm_start/pgoff down,
>while we hold the mmap_sem for writing. See the location of the three
>WRITE_ONCE in the patch.

You mean for reading, right? Yes, with expand_stack being called with
the mmap_sem held for read, such members were never really serialized
by the mmap_sem and thus we should not be computing stale values.

Acked-by: Davidlohr Bueso <dbueso@suse.de>

Thanks.
* Re: [PATCH 0/2] RFC: READ/WRITE_ONCE vma/mm cleanups
From: Kirill A. Shutemov @ 2019-03-04 10:12 UTC
To: Andrea Arcangeli
Cc: Vlastimil Babka, Andrew Morton, linux-mm, Hugh Dickins, Kirill A. Shutemov, Michal Hocko, Peter Zijlstra

On Fri, Mar 01, 2019 at 11:54:52AM -0500, Andrea Arcangeli wrote:
> Hello Kirill and Vlastimil,
>
> On Fri, Mar 01, 2019 at 02:04:38PM +0100, Vlastimil Babka wrote:
> > On 3/1/19 10:37 AM, Kirill A. Shutemov wrote:
> > > On Thu, Feb 28, 2019 at 10:55:48PM -0500, Andrea Arcangeli wrote:
> > >> Hello,
> > >>
> > >> This was a well known issue for more than a decade, but until a few
> > >> months ago we relied on the compiler to stick to atomic accesses and
> > >> updates while walking and updating pagetables.
> > >>
> > >> However, now the 64bit native_set_pte finally uses WRITE_ONCE and
> > >> gup_pmd_range uses READ_ONCE as well.
> > >>
> > >> This converts more racy VM places to avoid depending on the expected
> > >> compiler behavior to achieve kernel runtime correctness.
> > >>
> > >> It mostly guarantees that gcc does atomic updates at 64bit granularity
> > >> (practically not needed) and it also prevents gcc from emitting code
> > >> that risks getting confused if the memory unexpectedly changes under it
> > >> (unlikely to ever be needed).
> > >>
> > >> The list of vm_start/end/pgoff to update isn't complete; I covered the
> > >> most obvious places, but before wasting too much time on a full audit I
> > >> thought it was safer to post it and get some comments. More updates can
> > >> be posted incrementally anyway.
> > >
> > > The intention is described well, to my eyes.
> > >
> > > Do I understand correctly that it's an attempt to get away with
> > > modifying vma's fields under down_read(mmap_sem)?
>
> The issue is that we already get away with it, but we do it without
> READ/WRITE_ONCE. The patch should change nothing; it should only reduce
> the dependency on the compiler to do what we expect.

Yes, it is a pre-existing problem. And yes, the compiler may screw this
up. The patch may reduce the dependency on the compiler, but it doesn't
mean it reduces the chance of a race.

Consider your changes to __mm_populate() and populate_vma_page_range().
You put READ_ONCE() in both functions. But populate_vma_page_range()
gets called from __mm_populate(). Before your change the compiler may
optimize the code and load from memory once per field. With your
changes the compiler will issue two loads.

It *increases* the chances of the race, not reduces them.

The current locking scheme doesn't allow modifying a VMA field without
down_write(mmap_sem). We do have hacks[1] that try to bypass the
limitation, but AFAIK we never had a solid explanation why this should
work. Sprinkling READ_ONCE() doesn't help with this, but makes it
appear legitimate.

[1] I believe we also touch vm_flags without proper locking to
set/clear VM_LOCKED.

-- 
 Kirill A. Shutemov
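A small sketch of the caller/callee double-load point raised above
(hypothetical function names, not kernel code): when a caller and its helper
each do their own READ_ONCE() of the same field, the two loads may
legitimately return different values, so the helper can act on a bound the
caller never checked; sampling the field once and passing the local down
keeps the two consistent.

#include <stdio.h>

#define READ_ONCE(x)	(*(const volatile typeof(x) *)&(x))

struct range {
	unsigned long start;
	unsigned long end;	/* may change under the reader */
};

/* helper re-reads the field: a second, independent load */
static unsigned long length_rereading(struct range *r, unsigned long start)
{
	return READ_ONCE(r->end) - start;
}

/* helper trusts a snapshot its caller already validated */
static unsigned long length_from_snapshot(unsigned long end, unsigned long start)
{
	return end - start;
}

static unsigned long populate(struct range *r, unsigned long start)
{
	unsigned long end = READ_ONCE(r->end);	/* sample exactly once */

	if (start >= end)
		return 0;
	/* stays consistent with the check above even if r->end changes now */
	return length_from_snapshot(end, start);
}

int main(void)
{
	struct range r = { .start = 0x1000, .end = 0x9000 };

	printf("%lu %lu\n", populate(&r, 0x2000), length_rereading(&r, 0x2000));
	return 0;
}

In the patch as posted, __mm_populate() does take vma_start/vma_end
snapshots, while populate_vma_page_range() still performs its own
READ_ONCE() on the same fields, which appears to be the pattern being
questioned here.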
* Re: [PATCH 0/2] RFC: READ/WRITE_ONCE vma/mm cleanups
From: Michal Hocko @ 2019-03-05 13:00 UTC
To: Kirill A. Shutemov
Cc: Andrea Arcangeli, Vlastimil Babka, Andrew Morton, linux-mm, Hugh Dickins, Kirill A. Shutemov, Peter Zijlstra

On Mon 04-03-19 13:12:10, Kirill A. Shutemov wrote:
> On Fri, Mar 01, 2019 at 11:54:52AM -0500, Andrea Arcangeli wrote:
> > Hello Kirill and Vlastimil,
> >
> > On Fri, Mar 01, 2019 at 02:04:38PM +0100, Vlastimil Babka wrote:
> > > On 3/1/19 10:37 AM, Kirill A. Shutemov wrote:
> > > > On Thu, Feb 28, 2019 at 10:55:48PM -0500, Andrea Arcangeli wrote:
> > > >> Hello,
> > > >>
> > > >> This was a well known issue for more than a decade, but until a few
> > > >> months ago we relied on the compiler to stick to atomic accesses and
> > > >> updates while walking and updating pagetables.
> > > >>
> > > >> However, now the 64bit native_set_pte finally uses WRITE_ONCE and
> > > >> gup_pmd_range uses READ_ONCE as well.
> > > >>
> > > >> This converts more racy VM places to avoid depending on the expected
> > > >> compiler behavior to achieve kernel runtime correctness.
> > > >>
> > > >> It mostly guarantees that gcc does atomic updates at 64bit granularity
> > > >> (practically not needed) and it also prevents gcc from emitting code
> > > >> that risks getting confused if the memory unexpectedly changes under it
> > > >> (unlikely to ever be needed).
> > > >>
> > > >> The list of vm_start/end/pgoff to update isn't complete; I covered the
> > > >> most obvious places, but before wasting too much time on a full audit I
> > > >> thought it was safer to post it and get some comments. More updates can
> > > >> be posted incrementally anyway.
> > > >
> > > > The intention is described well, to my eyes.
> > > >
> > > > Do I understand correctly that it's an attempt to get away with
> > > > modifying vma's fields under down_read(mmap_sem)?
> >
> > The issue is that we already get away with it, but we do it without
> > READ/WRITE_ONCE. The patch should change nothing; it should only reduce
> > the dependency on the compiler to do what we expect.
>
> Yes, it is a pre-existing problem. And yes, the compiler may screw this
> up. The patch may reduce the dependency on the compiler, but it doesn't
> mean it reduces the chance of a race.
>
> Consider your changes to __mm_populate() and populate_vma_page_range().
> You put READ_ONCE() in both functions. But populate_vma_page_range()
> gets called from __mm_populate(). Before your change the compiler may
> optimize the code and load from memory once per field. With your
> changes the compiler will issue two loads.
>
> It *increases* the chances of the race, not reduces them.
>
> The current locking scheme doesn't allow modifying a VMA field without
> down_write(mmap_sem). We do have hacks[1] that try to bypass the
> limitation, but AFAIK we never had a solid explanation why this should
> work. Sprinkling READ_ONCE() doesn't help with this, but makes it
> appear legitimate.

I do agree with Kirill here. Sprinkling {READ,WRITE}_ONCE around just
doesn't solve anything. I am pretty sure that people will not think
about it and we will end up in a similar half-covered situation in a
few years again. I would rather remove all those hacks and use a saner
locking scheme instead.

> [1] I believe we also touch vm_flags without proper locking to
> set/clear VM_LOCKED.

-- 
Michal Hocko
SUSE Labs