From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Wed, 7 Oct 2015 15:39:30 -0600
From: Ross Zwisler
Subject: Re: [PATCH v4 1/2] Revert "mm: take i_mmap_lock in unmap_mapping_range() for DAX"
Message-ID: <20151007213930.GA11743@linux.intel.com>
References: <1444170529-12814-1-git-send-email-ross.zwisler@linux.intel.com> <1444170529-12814-2-git-send-email-ross.zwisler@linux.intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Sender: owner-linux-mm@kvack.org
To: Dan Williams
Cc: Ross Zwisler, "linux-kernel@vger.kernel.org", Alexander Viro, Matthew Wilcox, linux-fsdevel, Linux MM, Andrew Morton, Dave Chinner, Jan Kara, "Kirill A. Shutemov", "linux-nvdimm@lists.01.org", Matthew Wilcox

On Wed, Oct 07, 2015 at 09:19:28AM -0700, Dan Williams wrote:
> On Tue, Oct 6, 2015 at 3:28 PM, Ross Zwisler wrote:
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 9cb2747..5ec066f 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -2426,10 +2426,17 @@ void unmap_mapping_range(struct address_space *mapping,
> >  	if (details.last_index < details.first_index)
> >  		details.last_index = ULONG_MAX;
> >
> > -	i_mmap_lock_write(mapping);
> > +
> > +	/*
> > +	 * DAX already holds i_mmap_lock to serialise file truncate vs
> > +	 * page fault and page fault vs page fault.
> > +	 */
> > +	if (!IS_DAX(mapping->host))
> > +		i_mmap_lock_write(mapping);
> >  	if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap)))
> >  		unmap_mapping_range_tree(&mapping->i_mmap, &details);
> > -	i_mmap_unlock_write(mapping);
> > +	if (!IS_DAX(mapping->host))
> > +		i_mmap_unlock_write(mapping);
> >  }
> >  EXPORT_SYMBOL(unmap_mapping_range);
>
> What about cases where unmap_mapping_range() is called without an fs
> lock?  For the get_user_pages() and ZONE_DEVICE implementation I'm
> looking to call truncate_pagecache() from the driver shutdown path to
> revoke usage of the struct page's that were allocated by
> devm_memremap_pages().
>
> Likely I'm introducing a path through unmap_mapping_range() that does
> not exist today, but I don't like that unmap_mapping_range() with this
> change is presuming a given locking context.  It's not clear to me how
> this routine is safe when it optionally takes i_mmap_lock_write(), at
> a minimum this needs documenting, and possibly assertions if the
> locking assumptions are violated.

Yep, this is very confusing - these changes were undone by the second revert
in the series (they were done and then undone by separate patches, both of
which are getting reverted).

After the series is applied in total unmap_mapping_range() takes the locks
unconditionally:

	/* DAX uses i_mmap_lock to serialise file truncate vs page fault */
	i_mmap_lock_write(mapping);
	if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap)))
		unmap_mapping_range_tree(&mapping->i_mmap, &details);
	i_mmap_unlock_write(mapping);
}
EXPORT_SYMBOL(unmap_mapping_range);

Yes, I totally agree this is confusing - I'll just bite the bullet, collapse
the two reverts together and call it "dax locking fixes" or something.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
In-Reply-To: <1444170529-12814-2-git-send-email-ross.zwisler@linux.intel.com>
References: <1444170529-12814-1-git-send-email-ross.zwisler@linux.intel.com> <1444170529-12814-2-git-send-email-ross.zwisler@linux.intel.com>
Date: Wed, 7 Oct 2015 09:19:28 -0700
Subject: Re: [PATCH v4 1/2] Revert "mm: take i_mmap_lock in unmap_mapping_range() for DAX"
From: Dan Williams
Content-Type: text/plain; charset=UTF-8
Sender: owner-linux-mm@kvack.org
To: Ross Zwisler
Cc: "linux-kernel@vger.kernel.org", Alexander Viro, Matthew Wilcox, linux-fsdevel, Linux MM, Andrew Morton, Dave Chinner, Jan Kara, "Kirill A. Shutemov", "linux-nvdimm@lists.01.org", Matthew Wilcox

On Tue, Oct 6, 2015 at 3:28 PM, Ross Zwisler wrote:
> This reverts commits 46c043ede4711e8d598b9d63c5616c1fedb0605e
> and 8346c416d17bf5b4ea1508662959bb62e73fd6a5.
>
> The following two locking commits in the DAX code:
>
> commit 843172978bb9 ("dax: fix race between simultaneous faults")
> commit 46c043ede471 ("mm: take i_mmap_lock in unmap_mapping_range() for DAX")
>
> introduced a number of deadlocks and other issues, and need to be
> reverted for the v4.3 kernel.  The list of issues in DAX after these
> commits (some newly introduced by the commits, some preexisting) can be
> found here:
>
> https://lkml.org/lkml/2015/9/25/602
>
> This revert keeps the PMEM API changes to the zeroing code in
> __dax_pmd_fault(), which were added by this commit:
>
> commit d77e92e270ed ("dax: update PMD fault handler with PMEM API")
>
> It also keeps the code dropping mapping->i_mmap_rwsem before calling
> unmap_mapping_range(), but converts it to a read lock since that's what is
> now used by the rest of __dax_pmd_fault().  This is needed to avoid
> recursively acquiring mapping->i_mmap_rwsem, once with a read lock in
> __dax_pmd_fault() and once with a write lock in unmap_mapping_range().

I think it is safe to say that this has now morphed into a full blown
fix and the "revert" label no longer applies.  But, I'll let Andrew
weigh in if he wants that fixed up or will replace these patches in
-mm:

revert-mm-take-i_mmap_lock-in-unmap_mapping_range-for-dax.patch
revert-dax-fix-race-between-simultaneous-faults.patch
dax-temporarily-disable-dax-pmd-fault-path.patch

...with this new series.

However, a question below:

> Signed-off-by: Ross Zwisler
> ---
>  fs/dax.c    | 37 +++++++++++++------------------------
>  mm/memory.c | 11 +++++++++--
>  2 files changed, 22 insertions(+), 26 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index bcfb14b..f665bc9 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -569,36 +569,14 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>  	if (!buffer_size_valid(&bh) || bh.b_size < PMD_SIZE)
>  		goto fallback;
>
> -	sector = bh.b_blocknr << (blkbits - 9);
> -
> -	if (buffer_unwritten(&bh) || buffer_new(&bh)) {
> -		int i;
> -
> -		length = bdev_direct_access(bh.b_bdev, sector, &kaddr, &pfn,
> -						bh.b_size);
> -		if (length < 0) {
> -			result = VM_FAULT_SIGBUS;
> -			goto out;
> -		}
> -		if ((length < PMD_SIZE) || (pfn & PG_PMD_COLOUR))
> -			goto fallback;
> -
> -		for (i = 0; i < PTRS_PER_PMD; i++)
> -			clear_pmem(kaddr + i * PAGE_SIZE, PAGE_SIZE);
> -		wmb_pmem();
> -		count_vm_event(PGMAJFAULT);
> -		mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
> -		result |= VM_FAULT_MAJOR;
> -	}
> -
>  	/*
>  	 * If we allocated new storage, make sure no process has any
>  	 * zero pages covering this hole
>  	 */
>  	if (buffer_new(&bh)) {
> -		i_mmap_unlock_write(mapping);
> +		i_mmap_unlock_read(mapping);
>  		unmap_mapping_range(mapping, pgoff << PAGE_SHIFT, PMD_SIZE, 0);
> -		i_mmap_lock_write(mapping);
> +		i_mmap_lock_read(mapping);
>  	}
>
>  	/*
> @@ -635,6 +613,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>  		result = VM_FAULT_NOPAGE;
>  		spin_unlock(ptl);
>  	} else {
> +		sector = bh.b_blocknr << (blkbits - 9);
>  		length = bdev_direct_access(bh.b_bdev, sector, &kaddr, &pfn,
>  						bh.b_size);
>  		if (length < 0) {
> @@ -644,6 +623,16 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>  		if ((length < PMD_SIZE) || (pfn & PG_PMD_COLOUR))
>  			goto fallback;
>
> +		if (buffer_unwritten(&bh) || buffer_new(&bh)) {
> +			int i;
> +			for (i = 0; i < PTRS_PER_PMD; i++)
> +				clear_pmem(kaddr + i * PAGE_SIZE, PAGE_SIZE);
> +			wmb_pmem();
> +			count_vm_event(PGMAJFAULT);
> +			mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
> +			result |= VM_FAULT_MAJOR;
> +		}
> +
>  		result |= vmf_insert_pfn_pmd(vma, address, pmd, pfn, write);
>  	}
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 9cb2747..5ec066f 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2426,10 +2426,17 @@ void unmap_mapping_range(struct address_space *mapping,
>  	if (details.last_index < details.first_index)
>  		details.last_index = ULONG_MAX;
>
> -	i_mmap_lock_write(mapping);
> +
> +	/*
> +	 * DAX already holds i_mmap_lock to serialise file truncate vs
> +	 * page fault and page fault vs page fault.
> +	 */
> +	if (!IS_DAX(mapping->host))
> +		i_mmap_lock_write(mapping);
>  	if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap)))
>  		unmap_mapping_range_tree(&mapping->i_mmap, &details);
> -	i_mmap_unlock_write(mapping);
> +	if (!IS_DAX(mapping->host))
> +		i_mmap_unlock_write(mapping);
>  }
>  EXPORT_SYMBOL(unmap_mapping_range);

What about cases where unmap_mapping_range() is called without an fs
lock?  For the get_user_pages() and ZONE_DEVICE implementation I'm
looking to call truncate_pagecache() from the driver shutdown path to
revoke usage of the struct page's that were allocated by
devm_memremap_pages().

Likely I'm introducing a path through unmap_mapping_range() that does
not exist today, but I don't like that unmap_mapping_range() with this
change is presuming a given locking context.  It's not clear to me how
this routine is safe when it optionally takes i_mmap_lock_write(), at
a minimum this needs documenting, and possibly assertions if the
locking assumptions are violated.

invalidate_inode_pages2_range() seems to call unmap_mapping_range()
without the correct locking, but this was just a quick scan.

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Ross Zwisler
Subject: [PATCH v4 1/2] Revert "mm: take i_mmap_lock in unmap_mapping_range() for DAX"
Date: Tue, 6 Oct 2015 16:28:48 -0600
Message-Id: <1444170529-12814-2-git-send-email-ross.zwisler@linux.intel.com>
In-Reply-To: <1444170529-12814-1-git-send-email-ross.zwisler@linux.intel.com>
References: <1444170529-12814-1-git-send-email-ross.zwisler@linux.intel.com>
Sender: owner-linux-mm@kvack.org
To: linux-kernel@vger.kernel.org
Cc: Ross Zwisler, Alexander Viro, Matthew Wilcox, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Andrew Morton, Dan Williams, Dave Chinner, Jan Kara, "Kirill A. Shutemov", linux-nvdimm@lists.01.org, Matthew Wilcox

This reverts commits 46c043ede4711e8d598b9d63c5616c1fedb0605e
and 8346c416d17bf5b4ea1508662959bb62e73fd6a5.

The following two locking commits in the DAX code:

commit 843172978bb9 ("dax: fix race between simultaneous faults")
commit 46c043ede471 ("mm: take i_mmap_lock in unmap_mapping_range() for DAX")

introduced a number of deadlocks and other issues, and need to be
reverted for the v4.3 kernel.  The list of issues in DAX after these
commits (some newly introduced by the commits, some preexisting) can be
found here:

https://lkml.org/lkml/2015/9/25/602

This revert keeps the PMEM API changes to the zeroing code in
__dax_pmd_fault(), which were added by this commit:

commit d77e92e270ed ("dax: update PMD fault handler with PMEM API")

It also keeps the code dropping mapping->i_mmap_rwsem before calling
unmap_mapping_range(), but converts it to a read lock since that's what is
now used by the rest of __dax_pmd_fault().  This is needed to avoid
recursively acquiring mapping->i_mmap_rwsem, once with a read lock in
__dax_pmd_fault() and once with a write lock in unmap_mapping_range().

Signed-off-by: Ross Zwisler
---
 fs/dax.c    | 37 +++++++++++++------------------------
 mm/memory.c | 11 +++++++++--
 2 files changed, 22 insertions(+), 26 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index bcfb14b..f665bc9 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -569,36 +569,14 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	if (!buffer_size_valid(&bh) || bh.b_size < PMD_SIZE)
 		goto fallback;

-	sector = bh.b_blocknr << (blkbits - 9);
-
-	if (buffer_unwritten(&bh) || buffer_new(&bh)) {
-		int i;
-
-		length = bdev_direct_access(bh.b_bdev, sector, &kaddr, &pfn,
-						bh.b_size);
-		if (length < 0) {
-			result = VM_FAULT_SIGBUS;
-			goto out;
-		}
-		if ((length < PMD_SIZE) || (pfn & PG_PMD_COLOUR))
-			goto fallback;
-
-		for (i = 0; i < PTRS_PER_PMD; i++)
-			clear_pmem(kaddr + i * PAGE_SIZE, PAGE_SIZE);
-		wmb_pmem();
-		count_vm_event(PGMAJFAULT);
-		mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
-		result |= VM_FAULT_MAJOR;
-	}
-
 	/*
 	 * If we allocated new storage, make sure no process has any
 	 * zero pages covering this hole
 	 */
 	if (buffer_new(&bh)) {
-		i_mmap_unlock_write(mapping);
+		i_mmap_unlock_read(mapping);
 		unmap_mapping_range(mapping, pgoff << PAGE_SHIFT, PMD_SIZE, 0);
-		i_mmap_lock_write(mapping);
+		i_mmap_lock_read(mapping);
 	}

 	/*
@@ -635,6 +613,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		result = VM_FAULT_NOPAGE;
 		spin_unlock(ptl);
 	} else {
+		sector = bh.b_blocknr << (blkbits - 9);
 		length = bdev_direct_access(bh.b_bdev, sector, &kaddr, &pfn,
 						bh.b_size);
 		if (length < 0) {
@@ -644,6 +623,16 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		if ((length < PMD_SIZE) || (pfn & PG_PMD_COLOUR))
 			goto fallback;

+		if (buffer_unwritten(&bh) || buffer_new(&bh)) {
+			int i;
+			for (i = 0; i < PTRS_PER_PMD; i++)
+				clear_pmem(kaddr + i * PAGE_SIZE, PAGE_SIZE);
+			wmb_pmem();
+			count_vm_event(PGMAJFAULT);
+			mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
+			result |= VM_FAULT_MAJOR;
+		}
+
 		result |= vmf_insert_pfn_pmd(vma, address, pmd, pfn, write);
 	}

diff --git a/mm/memory.c b/mm/memory.c
index 9cb2747..5ec066f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2426,10 +2426,17 @@ void unmap_mapping_range(struct address_space *mapping,
 	if (details.last_index < details.first_index)
 		details.last_index = ULONG_MAX;

-	i_mmap_lock_write(mapping);
+
+	/*
+	 * DAX already holds i_mmap_lock to serialise file truncate vs
+	 * page fault and page fault vs page fault.
+	 */
+	if (!IS_DAX(mapping->host))
+		i_mmap_lock_write(mapping);
 	if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap)))
 		unmap_mapping_range_tree(&mapping->i_mmap, &details);
-	i_mmap_unlock_write(mapping);
+	if (!IS_DAX(mapping->host))
+		i_mmap_unlock_write(mapping);
 }
 EXPORT_SYMBOL(unmap_mapping_range);

--
2.1.0

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Ross Zwisler
Subject: [PATCH v4 2/2] Revert "dax: fix race between simultaneous faults"
Date: Tue, 6 Oct 2015 16:28:49 -0600
Message-Id: <1444170529-12814-3-git-send-email-ross.zwisler@linux.intel.com>
In-Reply-To: <1444170529-12814-1-git-send-email-ross.zwisler@linux.intel.com>
References: <1444170529-12814-1-git-send-email-ross.zwisler@linux.intel.com>
Sender: owner-linux-mm@kvack.org
To: linux-kernel@vger.kernel.org
Cc: Ross Zwisler, Alexander Viro, Matthew Wilcox, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Andrew Morton, Dan Williams, Dave Chinner, Jan Kara, "Kirill A. Shutemov", linux-nvdimm@lists.01.org, Matthew Wilcox

This reverts commit 843172978bb92997310d2f7fbc172ece423cfc02.

The following two locking commits in the DAX code:

commit 843172978bb9 ("dax: fix race between simultaneous faults")
commit 46c043ede471 ("mm: take i_mmap_lock in unmap_mapping_range() for DAX")

introduced a number of deadlocks and other issues, and need to be
reverted for the v4.3 kernel.  The list of issues in DAX after these
commits (some newly introduced by the commits, some preexisting) can be
found here:

https://lkml.org/lkml/2015/9/25/602

Signed-off-by: Ross Zwisler
---
 fs/dax.c    | 33 ++++++++++++++++-----------------
 mm/memory.c | 11 +++--------
 2 files changed, 19 insertions(+), 25 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index f665bc9..a86d3cc 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -285,6 +285,7 @@ static int copy_user_bh(struct page *to, struct buffer_head *bh,
 static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 			struct vm_area_struct *vma, struct vm_fault *vmf)
 {
+	struct address_space *mapping = inode->i_mapping;
 	sector_t sector = bh->b_blocknr << (inode->i_blkbits - 9);
 	unsigned long vaddr = (unsigned long)vmf->virtual_address;
 	void __pmem *addr;
@@ -292,6 +293,8 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 	pgoff_t size;
 	int error;

+	i_mmap_lock_read(mapping);
+
 	/*
 	 * Check truncate didn't happen while we were allocating a block.
 	 * If it did, this block may or may not be still allocated to the
@@ -321,6 +324,8 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 	error = vm_insert_mixed(vma, vaddr, pfn);

  out:
+	i_mmap_unlock_read(mapping);
+
 	return error;
 }

@@ -382,17 +387,15 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 			 * from a read fault and we've raced with a truncate
 			 */
 			error = -EIO;
-			goto unlock;
+			goto unlock_page;
 		}
-	} else {
-		i_mmap_lock_write(mapping);
 	}

 	error = get_block(inode, block, &bh, 0);
 	if (!error && (bh.b_size < PAGE_SIZE))
 		error = -EIO;		/* fs corruption? */
 	if (error)
-		goto unlock;
+		goto unlock_page;

 	if (!buffer_mapped(&bh) && !buffer_unwritten(&bh) && !vmf->cow_page) {
 		if (vmf->flags & FAULT_FLAG_WRITE) {
@@ -403,9 +406,8 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 			if (!error && (bh.b_size < PAGE_SIZE))
 				error = -EIO;
 			if (error)
-				goto unlock;
+				goto unlock_page;
 		} else {
-			i_mmap_unlock_write(mapping);
 			return dax_load_hole(mapping, page, vmf);
 		}
 	}
@@ -417,15 +419,17 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		else
 			clear_user_highpage(new_page, vaddr);
 		if (error)
-			goto unlock;
+			goto unlock_page;
 		vmf->page = page;
 		if (!page) {
+			i_mmap_lock_read(mapping);
 			/* Check we didn't race with truncate */
 			size = (i_size_read(inode) + PAGE_SIZE - 1) >>
 								PAGE_SHIFT;
 			if (vmf->pgoff >= size) {
+				i_mmap_unlock_read(mapping);
 				error = -EIO;
-				goto unlock;
+				goto out;
 			}
 		}
 		return VM_FAULT_LOCKED;
@@ -461,8 +465,6 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		WARN_ON_ONCE(!(vmf->flags & FAULT_FLAG_WRITE));
 	}

-	if (!page)
-		i_mmap_unlock_write(mapping);
  out:
 	if (error == -ENOMEM)
 		return VM_FAULT_OOM | major;
@@ -471,14 +473,11 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		return VM_FAULT_SIGBUS | major;
 	return VM_FAULT_NOPAGE | major;

- unlock:
+ unlock_page:
 	if (page) {
 		unlock_page(page);
 		page_cache_release(page);
-	} else {
-		i_mmap_unlock_write(mapping);
 	}
-
 	goto out;
 }
 EXPORT_SYMBOL(__dax_fault);
@@ -556,10 +555,10 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	block = (sector_t)pgoff << (PAGE_SHIFT - blkbits);

 	bh.b_size = PMD_SIZE;
-	i_mmap_lock_write(mapping);
 	length = get_block(inode, block, &bh, write);
 	if (length)
 		return VM_FAULT_SIGBUS;
+	i_mmap_lock_read(mapping);

 	/*
 	 * If the filesystem isn't willing to tell us the length of a hole,
@@ -637,11 +636,11 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	}

  out:
+	i_mmap_unlock_read(mapping);
+
 	if (buffer_unwritten(&bh))
 		complete_unwritten(&bh, !(result & VM_FAULT_ERROR));

-	i_mmap_unlock_write(mapping);
-
 	return result;

  fallback:
diff --git a/mm/memory.c b/mm/memory.c
index 5ec066f..deb679c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2427,16 +2427,11 @@ void unmap_mapping_range(struct address_space *mapping,
 		details.last_index = ULONG_MAX;


-	/*
-	 * DAX already holds i_mmap_lock to serialise file truncate vs
-	 * page fault and page fault vs page fault.
-	 */
-	if (!IS_DAX(mapping->host))
-		i_mmap_lock_write(mapping);
+	/* DAX uses i_mmap_lock to serialise file truncate vs page fault */
+	i_mmap_lock_write(mapping);
 	if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap)))
 		unmap_mapping_range_tree(&mapping->i_mmap, &details);
-	if (!IS_DAX(mapping->host))
-		i_mmap_unlock_write(mapping);
+	i_mmap_unlock_write(mapping);
 }
 EXPORT_SYMBOL(unmap_mapping_range);

--
2.1.0

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Ross Zwisler
Subject: [PATCH v4 0/2] Revert locking changes in DAX for v4.3
Date: Tue, 6 Oct 2015 16:28:47 -0600
Message-Id: <1444170529-12814-1-git-send-email-ross.zwisler@linux.intel.com>
Sender: owner-linux-mm@kvack.org
To: linux-kernel@vger.kernel.org
Cc: Ross Zwisler, Alexander Viro, Matthew Wilcox, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Andrew Morton, Dan Williams, Dave Chinner, Jan Kara, "Kirill A. Shutemov", linux-nvdimm@lists.01.org, Matthew Wilcox

This series reverts some recent changes to the locking scheme in DAX introduced
by these two commits:

commit 843172978bb9 ("dax: fix race between simultaneous faults")
commit 46c043ede471 ("mm: take i_mmap_lock in unmap_mapping_range() for DAX")

Changes from v3:
 - reduced the revert of 46c043ede471 in patch 1 so that we still drop the
   mapping->i_mmap_rwsem before calling unmap_mapping_range().
   This prevents the deadlock in the __dax_pmd_fault() path so there is no
   longer a need to temporarily disable DAX PMD faults.

Ross Zwisler (2):
  Revert "mm: take i_mmap_lock in unmap_mapping_range() for DAX"
  Revert "dax: fix race between simultaneous faults"

 fs/dax.c    | 70 +++++++++++++++++++++++++------------------------------------
 mm/memory.c |  2 ++
 2 files changed, 31 insertions(+), 41 deletions(-)

--
2.1.0

From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753370AbbJFW3O (ORCPT ); Tue, 6 Oct 2015 18:29:14 -0400 Received: from mga02.intel.com ([134.134.136.20]:15970 "EHLO mga02.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753080AbbJFW3J (ORCPT ); Tue, 6 Oct 2015 18:29:09 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.17,646,1437462000"; d="scan'208";a="805120777" From: Ross Zwisler To: linux-kernel@vger.kernel.org Cc: Ross Zwisler , Alexander Viro , Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Andrew Morton , Dan Williams , Dave Chinner , Jan Kara , "Kirill A. Shutemov" , linux-nvdimm@ml01.01.org, Matthew Wilcox Subject: [PATCH v4 2/2] Revert "dax: fix race between simultaneous faults" Date: Tue, 6 Oct 2015 16:28:49 -0600 Message-Id: <1444170529-12814-3-git-send-email-ross.zwisler@linux.intel.com> X-Mailer: git-send-email 2.1.0 In-Reply-To: <1444170529-12814-1-git-send-email-ross.zwisler@linux.intel.com> References: <1444170529-12814-1-git-send-email-ross.zwisler@linux.intel.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org This reverts commit 843172978bb92997310d2f7fbc172ece423cfc02. The following two locking commits in the DAX code: commit 843172978bb9 ("dax: fix race between simultaneous faults") commit 46c043ede471 ("mm: take i_mmap_lock in unmap_mapping_range() for DAX") introduced a number of deadlocks and other issues, and need to be reverted for the v4.3 kernel.
The list of issues in DAX after these commits (some newly introduced by the commits, some preexisting) can be found here: https://lkml.org/lkml/2015/9/25/602 Signed-off-by: Ross Zwisler --- fs/dax.c | 33 ++++++++++++++++----------------- mm/memory.c | 11 +++-------- 2 files changed, 19 insertions(+), 25 deletions(-) diff --git a/fs/dax.c b/fs/dax.c index f665bc9..a86d3cc 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -285,6 +285,7 @@ static int copy_user_bh(struct page *to, struct buffer_head *bh, static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh, struct vm_area_struct *vma, struct vm_fault *vmf) { + struct address_space *mapping = inode->i_mapping; sector_t sector = bh->b_blocknr << (inode->i_blkbits - 9); unsigned long vaddr = (unsigned long)vmf->virtual_address; void __pmem *addr; @@ -292,6 +293,8 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh, pgoff_t size; int error; + i_mmap_lock_read(mapping); + /* * Check truncate didn't happen while we were allocating a block. * If it did, this block may or may not be still allocated to the @@ -321,6 +324,8 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh, error = vm_insert_mixed(vma, vaddr, pfn); out: + i_mmap_unlock_read(mapping); + return error; } @@ -382,17 +387,15 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, * from a read fault and we've raced with a truncate */ error = -EIO; - goto unlock; + goto unlock_page; } - } else { - i_mmap_lock_write(mapping); } error = get_block(inode, block, &bh, 0); if (!error && (bh.b_size < PAGE_SIZE)) error = -EIO; /* fs corruption? 
*/ if (error) - goto unlock; + goto unlock_page; if (!buffer_mapped(&bh) && !buffer_unwritten(&bh) && !vmf->cow_page) { if (vmf->flags & FAULT_FLAG_WRITE) { @@ -403,9 +406,8 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, if (!error && (bh.b_size < PAGE_SIZE)) error = -EIO; if (error) - goto unlock; + goto unlock_page; } else { - i_mmap_unlock_write(mapping); return dax_load_hole(mapping, page, vmf); } } @@ -417,15 +419,17 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, else clear_user_highpage(new_page, vaddr); if (error) - goto unlock; + goto unlock_page; vmf->page = page; if (!page) { + i_mmap_lock_read(mapping); /* Check we didn't race with truncate */ size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT; if (vmf->pgoff >= size) { + i_mmap_unlock_read(mapping); error = -EIO; - goto unlock; + goto out; } } return VM_FAULT_LOCKED; @@ -461,8 +465,6 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, WARN_ON_ONCE(!(vmf->flags & FAULT_FLAG_WRITE)); } - if (!page) - i_mmap_unlock_write(mapping); out: if (error == -ENOMEM) return VM_FAULT_OOM | major; @@ -471,14 +473,11 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, return VM_FAULT_SIGBUS | major; return VM_FAULT_NOPAGE | major; - unlock: + unlock_page: if (page) { unlock_page(page); page_cache_release(page); - } else { - i_mmap_unlock_write(mapping); } - goto out; } EXPORT_SYMBOL(__dax_fault); @@ -556,10 +555,10 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address, block = (sector_t)pgoff << (PAGE_SHIFT - blkbits); bh.b_size = PMD_SIZE; - i_mmap_lock_write(mapping); length = get_block(inode, block, &bh, write); if (length) return VM_FAULT_SIGBUS; + i_mmap_lock_read(mapping); /* * If the filesystem isn't willing to tell us the length of a hole, @@ -637,11 +636,11 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address, } out: + i_mmap_unlock_read(mapping); + if (buffer_unwritten(&bh)) 
complete_unwritten(&bh, !(result & VM_FAULT_ERROR)); - i_mmap_unlock_write(mapping); - return result; fallback: diff --git a/mm/memory.c b/mm/memory.c index 5ec066f..deb679c 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2427,16 +2427,11 @@ void unmap_mapping_range(struct address_space *mapping, details.last_index = ULONG_MAX; - /* - * DAX already holds i_mmap_lock to serialise file truncate vs - * page fault and page fault vs page fault. - */ - if (!IS_DAX(mapping->host)) - i_mmap_lock_write(mapping); + /* DAX uses i_mmap_lock to serialise file truncate vs page fault */ + i_mmap_lock_write(mapping); if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap))) unmap_mapping_range_tree(&mapping->i_mmap, &details); - if (!IS_DAX(mapping->host)) - i_mmap_unlock_write(mapping); + i_mmap_unlock_write(mapping); } EXPORT_SYMBOL(unmap_mapping_range); -- 2.1.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755285AbbJGQTc (ORCPT ); Wed, 7 Oct 2015 12:19:32 -0400 Received: from mail-wi0-f180.google.com ([209.85.212.180]:36191 "EHLO mail-wi0-f180.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754980AbbJGQTa (ORCPT ); Wed, 7 Oct 2015 12:19:30 -0400 MIME-Version: 1.0 In-Reply-To: <1444170529-12814-2-git-send-email-ross.zwisler@linux.intel.com> References: <1444170529-12814-1-git-send-email-ross.zwisler@linux.intel.com> <1444170529-12814-2-git-send-email-ross.zwisler@linux.intel.com> Date: Wed, 7 Oct 2015 09:19:28 -0700 Message-ID: Subject: Re: [PATCH v4 1/2] Revert "mm: take i_mmap_lock in unmap_mapping_range() for DAX" From: Dan Williams To: Ross Zwisler Cc: "linux-kernel@vger.kernel.org" , Alexander Viro , Matthew Wilcox , linux-fsdevel , Linux MM , Andrew Morton , Dave Chinner , Jan Kara , "Kirill A. 
Shutemov" , "linux-nvdimm@lists.01.org" , Matthew Wilcox Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Oct 6, 2015 at 3:28 PM, Ross Zwisler wrote: > This reverts commits 46c043ede4711e8d598b9d63c5616c1fedb0605e > and 8346c416d17bf5b4ea1508662959bb62e73fd6a5. > > The following two locking commits in the DAX code: > > commit 843172978bb9 ("dax: fix race between simultaneous faults") > commit 46c043ede471 ("mm: take i_mmap_lock in unmap_mapping_range() for DAX") > > introduced a number of deadlocks and other issues, and need to be > reverted for the v4.3 kernel. The list of issues in DAX after these > commits (some newly introduced by the commits, some preexisting) can be > found here: > > https://lkml.org/lkml/2015/9/25/602 > > This revert keeps the PMEM API changes to the zeroing code in > __dax_pmd_fault(), which were added by this commit: > > commit d77e92e270ed ("dax: update PMD fault handler with PMEM API") > > It also keeps the code dropping mapping->i_mmap_rwsem before calling > unmap_mapping_range(), but converts it to a read lock since that's what is > now used by the rest of __dax_pmd_fault(). This is needed to avoid > recursively acquiring mapping->i_mmap_rwsem, once with a read lock in > __dax_pmd_fault() and once with a write lock in unmap_mapping_range(). I think it is safe to say that this has now morphed into a full blown fix and the "revert" label no longer applies. But, I'll let Andrew weigh in if he wants that fixed up or will replace these patches in -mm: revert-mm-take-i_mmap_lock-in-unmap_mapping_range-for-dax.patch revert-dax-fix-race-between-simultaneous-faults.patch dax-temporarily-disable-dax-pmd-fault-path.patch ...with this new series. 
However, a question below: > Signed-off-by: Ross Zwisler > --- > fs/dax.c | 37 +++++++++++++------------------------ > mm/memory.c | 11 +++++++++-- > 2 files changed, 22 insertions(+), 26 deletions(-) > > diff --git a/fs/dax.c b/fs/dax.c > index bcfb14b..f665bc9 100644 > --- a/fs/dax.c > +++ b/fs/dax.c > @@ -569,36 +569,14 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address, > if (!buffer_size_valid(&bh) || bh.b_size < PMD_SIZE) > goto fallback; > > - sector = bh.b_blocknr << (blkbits - 9); > - > - if (buffer_unwritten(&bh) || buffer_new(&bh)) { > - int i; > - > - length = bdev_direct_access(bh.b_bdev, sector, &kaddr, &pfn, > - bh.b_size); > - if (length < 0) { > - result = VM_FAULT_SIGBUS; > - goto out; > - } > - if ((length < PMD_SIZE) || (pfn & PG_PMD_COLOUR)) > - goto fallback; > - > - for (i = 0; i < PTRS_PER_PMD; i++) > - clear_pmem(kaddr + i * PAGE_SIZE, PAGE_SIZE); > - wmb_pmem(); > - count_vm_event(PGMAJFAULT); > - mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT); > - result |= VM_FAULT_MAJOR; > - } > - > /* > * If we allocated new storage, make sure no process has any > * zero pages covering this hole > */ > if (buffer_new(&bh)) { > - i_mmap_unlock_write(mapping); > + i_mmap_unlock_read(mapping); > unmap_mapping_range(mapping, pgoff << PAGE_SHIFT, PMD_SIZE, 0); > - i_mmap_lock_write(mapping); > + i_mmap_lock_read(mapping); > } > > /* > @@ -635,6 +613,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address, > result = VM_FAULT_NOPAGE; > spin_unlock(ptl); > } else { > + sector = bh.b_blocknr << (blkbits - 9); > length = bdev_direct_access(bh.b_bdev, sector, &kaddr, &pfn, > bh.b_size); > if (length < 0) { > @@ -644,6 +623,16 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address, > if ((length < PMD_SIZE) || (pfn & PG_PMD_COLOUR)) > goto fallback; > > + if (buffer_unwritten(&bh) || buffer_new(&bh)) { > + int i; > + for (i = 0; i < PTRS_PER_PMD; i++) > + clear_pmem(kaddr + i * PAGE_SIZE, PAGE_SIZE); > 
+ wmb_pmem(); > + count_vm_event(PGMAJFAULT); > + mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT); > + result |= VM_FAULT_MAJOR; > + } > + > result |= vmf_insert_pfn_pmd(vma, address, pmd, pfn, write); > } > > diff --git a/mm/memory.c b/mm/memory.c > index 9cb2747..5ec066f 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -2426,10 +2426,17 @@ void unmap_mapping_range(struct address_space *mapping, > if (details.last_index < details.first_index) > details.last_index = ULONG_MAX; > > - i_mmap_lock_write(mapping); > + > + /* > + * DAX already holds i_mmap_lock to serialise file truncate vs > + * page fault and page fault vs page fault. > + */ > + if (!IS_DAX(mapping->host)) > + i_mmap_lock_write(mapping); > if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap))) > unmap_mapping_range_tree(&mapping->i_mmap, &details); > - i_mmap_unlock_write(mapping); > + if (!IS_DAX(mapping->host)) > + i_mmap_unlock_write(mapping); > } > EXPORT_SYMBOL(unmap_mapping_range); What about cases where unmap_mapping_range() is called without an fs lock? For the get_user_pages() and ZONE_DEVICE implementation I'm looking to call truncate_pagecache() from the driver shutdown path to revoke usage of the struct page's that were allocated by devm_memremap_pages(). Likely I'm introducing a path through unmap_mapping_range() that does not exist today, but I don't like that unmap_mapping_range() with this change is presuming a given locking context. It's not clear to me how this routine is safe when it optionally takes i_mmap_lock_write(), at a minimum this needs documenting, and possibly assertions if the locking assumptions are violated. invalidate_inode_pages2_range() seems to call unmap_mapping_range() without the correct locking, but this was just a quick scan.
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756496AbbJGVje (ORCPT ); Wed, 7 Oct 2015 17:39:34 -0400 Received: from mga01.intel.com ([192.55.52.88]:9129 "EHLO mga01.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756471AbbJGVjc (ORCPT ); Wed, 7 Oct 2015 17:39:32 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.17,651,1437462000"; d="scan'208";a="821766936" Date: Wed, 7 Oct 2015 15:39:30 -0600 From: Ross Zwisler To: Dan Williams Cc: Ross Zwisler , "linux-kernel@vger.kernel.org" , Alexander Viro , Matthew Wilcox , linux-fsdevel , Linux MM , Andrew Morton , Dave Chinner , Jan Kara , "Kirill A. Shutemov" , "linux-nvdimm@lists.01.org" , Matthew Wilcox Subject: Re: [PATCH v4 1/2] Revert "mm: take i_mmap_lock in unmap_mapping_range() for DAX" Message-ID: <20151007213930.GA11743@linux.intel.com> Mail-Followup-To: Ross Zwisler , Dan Williams , "linux-kernel@vger.kernel.org" , Alexander Viro , Matthew Wilcox , linux-fsdevel , Linux MM , Andrew Morton , Dave Chinner , Jan Kara , "Kirill A. 
Shutemov" , "linux-nvdimm@lists.01.org" , Matthew Wilcox References: <1444170529-12814-1-git-send-email-ross.zwisler@linux.intel.com> <1444170529-12814-2-git-send-email-ross.zwisler@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Oct 07, 2015 at 09:19:28AM -0700, Dan Williams wrote: > On Tue, Oct 6, 2015 at 3:28 PM, Ross Zwisler > wrote: > > diff --git a/mm/memory.c b/mm/memory.c > > index 9cb2747..5ec066f 100644 > > --- a/mm/memory.c > > +++ b/mm/memory.c > > @@ -2426,10 +2426,17 @@ void unmap_mapping_range(struct address_space *mapping, > > if (details.last_index < details.first_index) > > details.last_index = ULONG_MAX; > > > > - i_mmap_lock_write(mapping); > > + > > + /* > > + * DAX already holds i_mmap_lock to serialise file truncate vs > > + * page fault and page fault vs page fault. > > + */ > > + if (!IS_DAX(mapping->host)) > > + i_mmap_lock_write(mapping); > > if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap))) > > unmap_mapping_range_tree(&mapping->i_mmap, &details); > > - i_mmap_unlock_write(mapping); > > + if (!IS_DAX(mapping->host)) > > + i_mmap_unlock_write(mapping); > > } > > EXPORT_SYMBOL(unmap_mapping_range); > > What about cases where unmap_mapping_range() is called without an fs > lock? For the get_user_pages() and ZONE_DEVICE implementation I'm > looking to call truncate_pagecache() from the driver shutdown path to > revoke usage of the struct page's that were allocated by > devm_memremap_pages(). > > Likely I'm introducing a path through unmap_mapping_range() that does > not exist today, but I don't like that unmap_mapping_range() with this > change is presuming a given locking context. 
It's not clear to me how > this routine is safe when it optionally takes i_mmap_lock_write(), at > a minimum this needs documenting, and possibly assertions if the > locking assumptions are violated. Yep, this is very confusing - these changes were undone by the second revert in the series (they were done and then undone by separate patches, both of which are getting reverted). After the series is applied in total, unmap_mapping_range() takes the locks unconditionally: /* DAX uses i_mmap_lock to serialise file truncate vs page fault */ i_mmap_lock_write(mapping); if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap))) unmap_mapping_range_tree(&mapping->i_mmap, &details); i_mmap_unlock_write(mapping); } EXPORT_SYMBOL(unmap_mapping_range); Yes, I totally agree this is confusing - I'll just bite the bullet, collapse the two reverts together and call it "dax locking fixes" or something.