From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from out-172.mta0.migadu.com (out-172.mta0.migadu.com [91.218.175.172]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 85EF038E113 for ; Thu, 26 Feb 2026 07:10:08 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=91.218.175.172 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1772089810; cv=none; b=n+cmqOORoeGNEbaNM5a+kAfaKv8r2dr0kO6UTjACFZ09ABmdI4lEf1d9t+BAwLbySsKQ3SLr7S1NWj0zgoiCNSYWMi/eKQkrBySZdz7Lxj/kLUM+FxRHKDx7/oc903bcnNH1U7kv4W/uQp7At0Uur8pb7X3XCLIw6iZ8gUWlY1I= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1772089810; c=relaxed/simple; bh=D6++cHH+a2/7y34ZoTBmygVERYOURHbInGKI2a7Vy7Y=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=oF4mApvCEWugi5EgHzZmS8hDJKToR4bWtDMGFvDC9wy5ncDuYFXtKCZMIrhuSDyiCnRlx+DtFOPNMcPvStCGyHOkrIjQtDyddYyewDoNcP2JIZa38zc9V5OUAU+h/q2jdzUZgFxM7gWBfObNdPJmqAlMkxrSrLCOL8XSTFl8vVk= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=T5RbrvYi; arc=none smtp.client-ip=91.218.175.172 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="T5RbrvYi" X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1772089806; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=blph2lJMOWUKWT5xAcasjhPc5lsVgPYBkYe9OiBnrFo=; b=T5RbrvYiPNIdRhfvw5JAikOpxhVjBz1xrmoTd92fOVEGmDQ2xWgabqHv755kCfxYLwuix/ WCR/+EXQ1md0c18KGEyzr/nZ40BHt9egIeTftkrFB8u0OPfAugKD67i84YwbQH3rOQ0kZT G+E9PoYBAV8U3l3PoT11fjOSTDxsIIg= From: Lance Yang To: david@kernel.org Cc: Liam.Howlett@oracle.com, akpm@linux-foundation.org, baohua@kernel.org, dev.jain@arm.com, harry.yoo@oracle.com, jannh@google.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org, lorenzo.stoakes@oracle.com, riel@surriel.com, stable@kernel.org, vbabka@kernel.org, Lance Yang Subject: Re: [PATCH] mm/rmap: fix incorrect pte restoration for lazyfree folios Date: Thu, 26 Feb 2026 15:09:40 +0800 Message-ID: <20260226070940.96226-1-lance.yang@linux.dev> In-Reply-To: <36e676b4-dc6f-45f7-b885-8685227ac6a8@kernel.org> References: <36e676b4-dc6f-45f7-b885-8685227ac6a8@kernel.org> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-Migadu-Flow: FLOW_OUT On Tue, Feb 24, 2026 at 05:01:50PM +0100, David Hildenbrand (Arm) wrote: >On 2/24/26 12:43, Lorenzo Stoakes wrote: >> On Tue, Feb 24, 2026 at 11:31:24AM +0000, Lorenzo Stoakes wrote: >>> Thanks Dev. >>> >>> Andrew - why was commit 354dffd29575 ("mm: support batched unmap for lazyfree >>> large folios during reclamation") merged? >>> >>> It had enormous amounts of review commentary at >>> https://lore.kernel.org/all/146b4cb1-aa1e-4519-9e03-f98cfb1135d2@redhat.com/ and >>> no tags, this should be a signal to wait for a respin _at least_, and really if >>> late in cycle suggests it should wait a cycle. >>> >>> I've said going forward I'm going to check THP series for tags and if not >>> present NAK if they hit mm-stable, I guess I'll extend that to rmap also. >> >> Sorry I misread the original mail rushing through this is old... so this is less >> pressing than I thought (for some reason I thought it was merged last cycle...!) >> but it's a good example of how stuff can go unnoticed for a while. >> >> In that case maybe a revert is a bit much and we just want the simplest possible >> fix for backporting. > >Dev volunteered to un-messify some of the stuff here. In particular, to >extend batching to all cases, not just some hand-selected ones. > >Support for file folios is on the way. > >> >> But is the proposed 'just assume wrprotect' sensible? David? > >In general, I think so. If PTEs were writable, they certainly have >PAE set. The write-fault handler can fully recover from that (as PAE is >set). If it's ever a performance problem (doubt), we can revisit. > >I'm wondering whether we should just perform the wrprotect earlier: > >diff --git a/mm/rmap.c b/mm/rmap.c >index 0f00570d1b9e..19b875ee3fad 100644 >--- a/mm/rmap.c >+++ b/mm/rmap.c >@@ -2150,6 +2150,16 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, > > /* Nuke the page table entry. */ > pteval = get_and_clear_ptes(mm, address, pvmw.pte, nr_pages); >+ >+ /* >+ * Our batch might include writable and read-only >+ * PTEs. When we have to restore the mapping, just >+ * assume read-only to not accidentally upgrade >+ * write permissions for PTEs that must not be >+ * writable. >+ */ >+ pteval = pte_wrprotect(pteval); >+ > /* > * We clear the PTE but do not flush so potentially > * a remote CPU could still be writing to the folio > > >Given that nobody asks for writability (pte_write()) later. > >Or does someone care? > >Staring at set_tlb_ubc_flush_pending()->pte_accessible() I am >not 100% sure. Could pte_wrprotect() turn a PTE inaccessible on some >architecture (write-only)? I don't think so. > > >We have the following options: > >1) pte_wrprotect(): fake that all was read-only. > >Either we do it like Dev suggests, or we do it as above early. > >The downside is that any code that might later want to know "was >this possibly writable" would get that information. Well, it wouldn't >get that information reliably *today* already (and that sounds a bit shaky). Makes sense to me :) >2) Tell batching logic to honor pte_write() > >Sounds suboptimal for some cases that really don't care in the future. > >3) Tell batching logic to tell us if any pte was writable: FPB_MERGE_WRITE > >... then we know for sure whether any PTE was writable and we could > >(a) Pass it as we did before around to all checks, like pte_accessible(). > >(b) Have an explicit restore PTE where we play save. > > >I raised to Dev in private that softdirty handling is also shaky, as we >batch over that. Meaning that we could lose or gain softdirty PTE bits in >a batch. I guess we won't lose soft_dirty bits - only gain them (false positive): 1) get_and_clear_ptes() merges dirty bits from all PTEs via pte_mkdirty() 2) pte_mkdirty() atomically sets both _PAGE_DIRTY and _PAGE_SOFT_DIRTY on all architectures that support soft_dirty (x86, s390, powerpc, riscv) 3) set_ptes() uses pte_advance_pfn() which keeps all flags intact So if any PTE in the batch was dirty, all PTEs become soft_dirty after restore. >For lazyfree and file folios it doesn't really matter I guess. But it will >matter once we unlock it for all anon folios. > > >1) sounds simplest, 3) sounds cleanest long-term. >