Subject: Re: [PATCH] mm/rmap: fix incorrect pte restoration for lazyfree folios
From: Lance Yang
Date: Thu, 26 Feb 2026 18:28:23 +0800
To: "David Hildenbrand (Arm)"
Cc: Liam.Howlett@oracle.com, akpm@linux-foundation.org, baohua@kernel.org, dev.jain@arm.com, harry.yoo@oracle.com, jannh@google.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org, lorenzo.stoakes@oracle.com, riel@surriel.com, stable@kernel.org, vbabka@kernel.org
References: <36e676b4-dc6f-45f7-b885-8685227ac6a8@kernel.org> <20260226070940.96226-1-lance.yang@linux.dev>

On 2026/2/26 18:06, David Hildenbrand (Arm) wrote:
> On 2/26/26 08:09, Lance Yang wrote:
>>
>> On Tue, Feb 24, 2026 at 05:01:50PM +0100, David Hildenbrand (Arm) wrote:
>>> On 2/24/26 12:43, Lorenzo Stoakes wrote:
>>>>
>>>> Sorry, I misread the original mail rushing through; this is old... so
>>>> this is less pressing than I thought (for some reason I thought it was
>>>> merged last cycle...!) but it's a good example of how stuff can go
>>>> unnoticed for a while.
>>>>
>>>> In that case maybe a revert is a bit much and we just want the simplest
>>>> possible fix for backporting.
>>>
>>> Dev volunteered to un-messify some of the stuff here. In particular, to
>>> extend batching to all cases, not just some hand-selected ones.
>>>
>>> Support for file folios is on the way.
>>>>
>>>> But is the proposed 'just assume wrprotect' sensible? David?
>>>
>>> In general, I think so. If PTEs were writable, they certainly have
>>> PAE set. The write-fault handler can fully recover from that (as PAE is
>>> set). If it's ever a performance problem (doubt), we can revisit.
>>>
>>> I'm wondering whether we should just perform the wrprotect earlier:
>>>
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index 0f00570d1b9e..19b875ee3fad 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -2150,6 +2150,16 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>>  
>>>                 /* Nuke the page table entry. */
>>>                 pteval = get_and_clear_ptes(mm, address, pvmw.pte, nr_pages);
>>> +
>>> +               /*
>>> +                * Our batch might include writable and read-only
>>> +                * PTEs. When we have to restore the mapping, just
>>> +                * assume read-only to not accidentally upgrade
>>> +                * write permissions for PTEs that must not be
>>> +                * writable.
>>> +                */
>>> +               pteval = pte_wrprotect(pteval);
>>> +
>>>                 /*
>>>                  * We clear the PTE but do not flush so potentially
>>>                  * a remote CPU could still be writing to the folio
>>>
>>> Given that nobody asks for writability (pte_write()) later.
>>>
>>> Or does someone care?
>>>
>>> Staring at set_tlb_ubc_flush_pending()->pte_accessible() I am
>>> not 100% sure. Could pte_wrprotect() turn a PTE inaccessible on some
>>> architecture (write-only)? I don't think so.
>>>
>>>
>>> We have the following options:
>>>
>>> 1) pte_wrprotect(): fake that all was read-only.
>>>
>>> Either we do it like Dev suggests, or we do it as above early.
>>>
>>> The downside is that any code that might later want to know "was
>>> this possibly writable" wouldn't get that information. Well, it wouldn't
>>> get that information reliably *today* already (and that sounds a bit
>>> shaky).
>>
>> Makes sense to me :)
>>
>>> 2) Tell batching logic to honor pte_write()
>>>
>>> Sounds suboptimal for some cases that really don't care in the future.
>>>
>>> 3) Tell batching logic to tell us if any pte was writable: FPB_MERGE_WRITE
>>>
>>> ... then we know for sure whether any PTE was writable and we could
>>>
>>> (a) Pass it as we did before around to all checks, like pte_accessible().
>>>
>>> (b) Have an explicit restore PTE where we play safe.
>>>
>>>
>>> I raised to Dev in private that softdirty handling is also shaky, as we
>>> batch over that. Meaning that we could lose or gain softdirty PTE bits
>>> in a batch.
>>
>> I guess we won't lose soft_dirty bits - only gain them (false positives):
>>
>> 1) get_and_clear_ptes() merges dirty bits from all PTEs via pte_mkdirty()
>> 2) pte_mkdirty() atomically sets both _PAGE_DIRTY and _PAGE_SOFT_DIRTY on
>>    all architectures that support soft_dirty (x86, s390, powerpc, riscv)
>> 3) set_ptes() uses pte_advance_pfn(), which keeps all flags intact
>>
>> So if any PTE in the batch was dirty, all PTEs become soft_dirty after
>> restore.
>
> PTEs can be softdirty without being dirty. That over-complicates the
> situation.

Ah, it's even trickier then :D