From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 995B03C552E for ; Fri, 10 Apr 2026 13:40:24 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1775828424; cv=none; b=JLNFP6kNCggNHzScx0vPs/UMKzGO3kra3Wup03MGj9W9Mj3RG5rxAUX8Qh7T3hrZykRV4iTGvAQ7xcj+qWpGGLj2vjkOEI8dobIuhfKfPfJXNIF/vpx2Z6XpaswHwpk378ZOL/7nj8EWiD39rWiadcNAE/AufmgX5UKcWZRDFj4= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1775828424; c=relaxed/simple; bh=Jmq4huoKFN3asEcsgm2oAg98uUDJCLdndkFfOq1NdGM=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=gCb4cmJfSYUqDXoBkQc1qZac9G4wClkukU67gTmsVHDyriSVtU4On/biF/s1dZ1Oze/ixFuXxRMo4OV+qzak/5OVm7iaYUwWQbpfdj+AF0YdmzjtpGD6AHxUARJndbkjZiU4Aow0UEK0PG9r895USXO7Q10W15BG+CB1K4+g0FY= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=UHSTE6IM; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="UHSTE6IM" Received: by smtp.kernel.org (Postfix) with ESMTPSA id A3AB4C19421; Fri, 10 Apr 2026 13:40:21 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1775828424; bh=Jmq4huoKFN3asEcsgm2oAg98uUDJCLdndkFfOq1NdGM=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=UHSTE6IMVQDomDbCGIFhYASkfTGfr6oz0VYFpZEq3J1oQsGPQ537Iv+Sl8WUzLKsC rkvyLW6o45EQPOXK8FeQZgcGTedLsnwXEhBIht0mPGhQMsLWePBoMcR+wPPEb9d073 uD2LYNWGvCaKObN1+A3zma6BHBSvpWcePvmyraAu1PqQbZDoVpQHAynmy0yU4SlzHP Jj5wBfR7j3UbhO4lrD+IiuPDHa+rvfMUeiuiwtTwcD7KVn+lFaXG8bJkdWLW8OfJiD i21JoqNUv4HgRNMTYTrYQZdNraep8pIHCx4EU1WOLENsfGUCE+B44BuDSuJ6jrBKOa SHdBrJXo951qg== Date: Fri, 10 Apr 2026 14:40:19 +0100 From: Lorenzo Stoakes To: jiang.kun2@zte.com.cn Cc: akpm@linux-foundation.org, liam.howlett@oracle.com, david@kernel.org, vbabka@kernel.org, jannh@google.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org, xu.xin16@zte.com.cn, wang.yaxin@zte.com.cn, lu.zhongjun@zte.com.cn Subject: Re: [PATCH v2] mm/madvise: prefer VMA lock for MADV_REMOVE Message-ID: References: <20260410160249749i98jwNgNLmLMKRNVeoKVe@zte.com.cn> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20260410160249749i98jwNgNLmLMKRNVeoKVe@zte.com.cn> On Fri, Apr 10, 2026 at 04:02:49PM +0800, jiang.kun2@zte.com.cn wrote: > From: Jiang Kun > > MADV_REMOVE prefers the per-VMA read lock for single-VMA, local-mm, > non-UFFD-armed ranges, avoiding mmap_lock contention for such ranges. I'm not sure what 'MADV_REMOVE prefers' means here? You're making the change, you should be making a case for it, ideally with performance numbers. I'm really not sure we should just be enabling the VMA lock for things arbitrarily unless there's a real need for it. > > However, calling into the filesystem while holding vm_lock (VMA lock) can Not sure what vm_lock is? :) > create lock ordering issues. syzbot reported a possible deadlock in > blkdev_fallocate() when vfs_fallocate() is called under vm_lock. It's not really worth mentioning an issue that came up in a v1 of this code in the commit message. The commit message should describe what you're actually doing, why, and any details that might be helpful to people looking at the patch. > > Fix this by dropping the VMA lock before invoking vfs_fallocate(), after > taking an extra reference to the file. Keep the existing mmap_lock fallback > path and its userfaultfd coordination unchanged. It's worth mentioning this, but not in terms of a bug that happened in v1 > > Repeated benchmark runs show no regression in the uncontended case, and show > benefit once mmap_lock contention is introduced. Numbers please :) ideally with some statistics like standard deviation etc. > > Link: https://ci.syzbot.org/series/30acb9df-ca55-4cbf-81ed-89b84da8edc1 > Link: https://lore.kernel.org/all/aWcZCwz__qwwKbxw@casper.infradead.org/ > Signed-off-by: Jiang Kun > Signed-off-by: Yaxin Wang Some things to address, I have attached a patch that reworks what you have here in a way that avoids duplication, so if you're good with that feel free to send as v3, I don't need attribution other than a Suggested-by if you like. The logic in general seems correct, but we need numbers! And the commit message to be improved as per the above also. Thanks, Lorenzo > --- > mm/madvise.c | 21 +++++++++++++++++---- > 1 file changed, 17 insertions(+), 4 deletions(-) > > diff --git a/mm/madvise.c b/mm/madvise.c > index 69708e953cf5..0932579bccb4 100644 > --- a/mm/madvise.c > +++ b/mm/madvise.c > @@ -1008,8 +1008,6 @@ static long madvise_remove(struct madvise_behavior *madv_behavior) > unsigned long start = madv_behavior->range.start; > unsigned long end = madv_behavior->range.end; > > - mark_mmap_lock_dropped(madv_behavior); > - > if (vma->vm_flags & VM_LOCKED) > return -EINVAL; > > @@ -1025,6 +1023,20 @@ static long madvise_remove(struct madvise_behavior *madv_behavior) > offset = (loff_t)(start - vma->vm_start) > + ((loff_t)vma->vm_pgoff << PAGE_SHIFT); > > + /* Avoid calling into the filesystem while holding a VMA lock. */ > + if (madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK) { > + get_file(f); > + vma_end_read(vma); > + madv_behavior->vma = NULL; > + error = vfs_fallocate(f, > + FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, > + offset, end - start); > + fput(f); > + return error; > + } > + > + mark_mmap_lock_dropped(madv_behavior); I don't like this code duplication, you're literally copy/pasting code here. Let's use what we already have. We can avoid duplicating like this, make mark_mmap_lock_dropped() valid for VMA locks too and rename it to mark_lock_dropped(), then change the logic around a bit and have something a bit nicer I think. See attached patch, you're welcome to use this as-is (you can add a Suggested-by if you like), but that avoids the duplication and keeps things unified. > + > /* > * Filesystem's fallocate may need to take i_rwsem. We need to > * explicitly grab a reference because the vma (and hence the > @@ -1677,7 +1689,8 @@ int madvise_walk_vmas(struct madvise_behavior *madv_behavior) > if (madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK && > try_vma_read_lock(madv_behavior)) { > error = madvise_vma_behavior(madv_behavior); > - vma_end_read(madv_behavior->vma); > + if (madv_behavior->vma) > + vma_end_read(madv_behavior->vma); > return error; > } > > @@ -1746,7 +1759,6 @@ static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavi > return MADVISE_NO_LOCK; > > switch (madv_behavior->behavior) { > - case MADV_REMOVE: > case MADV_WILLNEED: > case MADV_COLD: > case MADV_PAGEOUT: > @@ -1754,6 +1766,7 @@ static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavi > case MADV_POPULATE_WRITE: > case MADV_COLLAPSE: > return MADVISE_MMAP_READ_LOCK; > + case MADV_REMOVE: > case MADV_GUARD_INSTALL: > case MADV_GUARD_REMOVE: > case MADV_DONTNEED: > -- > 2.53.0 Suggested reworked version below: ----8<---- >From 97f6ff9749a38256e919a24ff2867a077a6f8cfc Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Fri, 10 Apr 2026 14:37:23 +0100 Subject: [PATCH] slightly reworked --- mm/madvise.c | 31 ++++++++++++++++--------------- 1 file changed, 16 insertions(+), 15 deletions(-) diff --git a/mm/madvise.c b/mm/madvise.c index 69708e953cf5..4af0565809f2 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -269,9 +269,8 @@ static void shmem_swapin_range(struct vm_area_struct *vma, } #endif /* CONFIG_SWAP */ -static void mark_mmap_lock_dropped(struct madvise_behavior *madv_behavior) +static void mark_lock_dropped(struct madvise_behavior *madv_behavior) { - VM_WARN_ON_ONCE(madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK); madv_behavior->lock_dropped = true; } @@ -315,7 +314,7 @@ static long madvise_willneed(struct madvise_behavior *madv_behavior) * vma's reference to the file) can go away as soon as we drop * mmap_lock. */ - mark_mmap_lock_dropped(madv_behavior); + mark_lock_dropped(madv_behavior); get_file(file); offset = (loff_t)(start - vma->vm_start) + ((loff_t)vma->vm_pgoff << PAGE_SHIFT); @@ -909,7 +908,7 @@ static long madvise_dontneed_free(struct madvise_behavior *madv_behavior) if (!userfaultfd_remove(madv_behavior->vma, range->start, range->end)) { struct vm_area_struct *vma; - mark_mmap_lock_dropped(madv_behavior); + mark_lock_dropped(madv_behavior); mmap_read_lock(mm); madv_behavior->vma = vma = vma_lookup(mm, range->start); if (!vma) @@ -1005,10 +1004,9 @@ static long madvise_remove(struct madvise_behavior *madv_behavior) struct file *f; struct mm_struct *mm = madv_behavior->mm; struct vm_area_struct *vma = madv_behavior->vma; - unsigned long start = madv_behavior->range.start; - unsigned long end = madv_behavior->range.end; - - mark_mmap_lock_dropped(madv_behavior); + const unsigned long start = madv_behavior->range.start; + const unsigned long end = madv_behavior->range.end; + const bool vma_locked = madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK; if (vma->vm_flags & VM_LOCKED) return -EINVAL; @@ -1029,18 +1027,20 @@ static long madvise_remove(struct madvise_behavior *madv_behavior) * Filesystem's fallocate may need to take i_rwsem. We need to * explicitly grab a reference because the vma (and hence the * vma's reference to the file) can go away as soon as we drop - * mmap_lock. + * its lock. */ get_file(f); - if (userfaultfd_remove(vma, start, end)) { - /* mmap_lock was not released by userfaultfd_remove() */ + mark_lock_dropped(madv_behavior); + if (vma_locked) + vma_end_read(vma); + else if (userfaultfd_remove(vma, start, end)) /* uffd didn't drop? */ mmap_read_unlock(mm); - } error = vfs_fallocate(f, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, offset, end - start); fput(f); - mmap_read_lock(mm); + if (!vma_locked) + mmap_read_lock(mm); return error; } @@ -1677,7 +1677,8 @@ int madvise_walk_vmas(struct madvise_behavior *madv_behavior) if (madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK && try_vma_read_lock(madv_behavior)) { error = madvise_vma_behavior(madv_behavior); - vma_end_read(madv_behavior->vma); + if (!madv_behavior->lock_dropped) + vma_end_read(madv_behavior->vma); return error; } @@ -1746,7 +1747,6 @@ static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavi return MADVISE_NO_LOCK; switch (madv_behavior->behavior) { - case MADV_REMOVE: case MADV_WILLNEED: case MADV_COLD: case MADV_PAGEOUT: @@ -1754,6 +1754,7 @@ static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavi case MADV_POPULATE_WRITE: case MADV_COLLAPSE: return MADVISE_MMAP_READ_LOCK; + case MADV_REMOVE: case MADV_GUARD_INSTALL: case MADV_GUARD_REMOVE: case MADV_DONTNEED: -- 2.53.0