From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from out-179.mta1.migadu.com (out-179.mta1.migadu.com [95.215.58.179]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id B40E238735A for ; Tue, 20 Jan 2026 11:38:21 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=95.215.58.179 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1768909106; cv=none; b=EKJwOFC+at6dMoom4UfX8CH9OOVzvpNQJus+jaWlXCPeoNPMOnoXfg87LKaGhyGq3vKIerhRDryZa0wQ+qFeXN1CiEEnP2ojFUaFvjyj0pMr+nb4EJmgEMnQ9FaWlgpJMr89qmsv4WacX8APcaHXQ+P8wbE2i5g4WUAPvs00X54= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1768909106; c=relaxed/simple; bh=q+VH7mqFHfc3oaF1T5X9pRUTJ94GPcpmVL+thxNFvZo=; h=Message-ID:Date:MIME-Version:Subject:To:Cc:References:From: In-Reply-To:Content-Type; b=ddUhSr6Z0kX2Z/8aYplAIPAmD71cHLvA9vndSOL8BbaQ7iTNMXSQ+cRh/2/bwhjRrulYW/wX31y9FnwIvRKVs+qtD7+/pAo2OUB6XdLCZzb9M5ZuI/o+4QUiggr5rgAuB0OIktKIlb1EbLELQW6eLiOGRanLTlJ6rKx9Srgxv0o= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=xoQOYj1f; arc=none smtp.client-ip=95.215.58.179 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="xoQOYj1f" Message-ID: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1768909099; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=HLjY/4B4aU0tiyH4zKLoN88x79CrKM/2QcD82GdY0Aw=; b=xoQOYj1fvHhW3ghjYtoNiZgeOBuw8F5d3DA2srvPBuKdBnAIsaL9Hx+kz35x7AvUnVMJ0G K4NPHfEpaq0vmZLwMYRxp7jWRLCfGFQbrcCg6Gz3kyKdrTZemgwqm4CruTlCwLN5SNUCGC j0srSKc66/KBL0Qvedu67IEcm55VDZE= Date: Tue, 20 Jan 2026 19:38:05 +0800 Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Subject: Re: [PATCH v1 1/1] mm/khugepaged: move tlb_remove_table_sync_one out To: hughd@google.com, baolin.wang@linux.alibaba.com Cc: Liam.Howlett@oracle.com, akpm@linux-foundation.org, baohua@kernel.org, david@kernel.org, dev.jain@arm.com, ioworker0@gmail.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org, lorenzo.stoakes@oracle.com, mhocko@suse.com, npache@redhat.com, rppt@kernel.org, ryan.roberts@arm.com, surenb@google.com, vbabka@suse.cz, ziy@nvidia.com, qi.zheng@linux.dev References: <62e637cf-91e6-454d-a943-e5946bdf7784@linux.dev> <20260118083911.21523-1-lance.yang@linux.dev> Content-Language: en-US X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. From: Lance Yang In-Reply-To: <20260118083911.21523-1-lance.yang@linux.dev> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit X-Migadu-Flow: FLOW_OUT On 2026/1/18 16:39, Lance Yang wrote: > Hi Hugh, > > Could you check if my understanding is correct? > > On PAE, pmdp_get_lockless() reads pmd_low first, then pmd_high. There's a > risk of reading mismatched values if another CPU modifies the PMD between > the two reads. > > Commit 146b42e07494[1] introduced local_irq_save() to protect the > split-read, blocking TLB flush IPIs during the operation. > > After modifying the PMD, pmdp_get_lockless_sync() sends an IPI to ensure > all ongoing split-reads complete before proceeding with pte_free_defer(). > > As commit 146b42e07494[1] says: > > ``` > Complement this pmdp_get_lockless_start() and pmdp_get_lockless_end(), > used only locally in __pte_offset_map(), with a pmdp_get_lockless_sync() > synonym for tlb_remove_table_sync_one(): to send the necessary interrupt > at the right moment on those configs which do not already send it. > ``` > > And commit 1043173eb5eb[2] says: > > ``` > Follow the pattern in retract_page_tables(); and using pte_free_defer() > removes most of the need for tlb_remove_table_sync_one() here; but call > pmdp_get_lockless_sync() to use it in the PAE case. > ``` > > Regarding moving pmdp_get_lockless_sync() out from under PTL: Since > lockless readers (e.g., GUP-fast, __pte_offset_map()) are protected by > local_irq_save() rather than PTL, pmdp_get_lockless_sync() can be called > outside PTL as long as it's before pte_free_defer(). Looking at commit 146b42e07494[1] again, it says pmdp_get_lockless_sync() should be called "at the right moment". I now realize moving it outside PTL might not be safe. If we release PTL before calling pmdp_get_lockless_sync(), another CPU could set a new PMD while a lockless reader is still in local_irq_save() reading the old PMD (split-read). I'm not sure if this race is actually possible, but if it is, it would hit the ABA problem where the reader gets mismatched pmd_low (old) and pmd_high (new) - the "faint risk" mentioned in commit 146b42e07494[1]. On Native x86 PAE, pmdp_collapse_flush() sends IPI and waits, preventing this race. But on PV, the hypercall returns immediately, so we need pmdp_get_lockless_sync() to ensure all IRQ-off readers complete before releasing PTL. I should keep it under PTL to be safe. Sorry for the churn :( [1] https://github.com/torvalds/linux/commit/146b42e07494e45f7c7bcf2cbf7afd1424afd78e Thanks, Lance > > In contrast, for non-PAE, PMD reads are atomic, so pmdp_get_lockless_sync() > is a no-op. > > [1] https://github.com/torvalds/linux/commit/146b42e07494e45f7c7bcf2cbf7afd1424afd78e > [2] https://github.com/torvalds/linux/commit/1043173eb5eb351a1dba11cca12705075fe74a9e > > > Thanks, > Lance > > On Fri, 16 Jan 2026 09:25:54 +0800, Lance Yang wrote: >> >> >> On 2026/1/16 09:03, Baolin Wang wrote: >>> >>> >>> On 1/15/26 8:28 PM, Lance Yang wrote: >>>> >>>> >>>> On 2026/1/15 18:00, Baolin Wang wrote: >>>>> Hi Lance, >>>>> >>>>> On 1/15/26 3:16 PM, Lance Yang wrote: >>>>>> From: Lance Yang >>>>>> >>>>>> tlb_remove_table_sync_one() sends IPIs to all CPUs and waits for them, >>>>>> which we really don't want to do while holding PTL. >>>>> >>>>> Could you add more comments to explain why this is safe for the PAE >>>>> case? >>>> >>>> Yep, IIUC, it is safe because we've already done pmdp_collapse_flush() >>>> which ensures the PMD change is visible. >>>> >>>> pmdp_get_lockless_sync() (which calls tlb_remove_table_sync_one() on PAE) >>>> is just to ensure any ongoing lockless pmd readers (e.g., GUP-fast) >>>> complete >>>> before we proceed. It sends IPIs to all CPUs and waits for responses - >>>> a CPU >>>> can only respond when it's not between local_irq_save() and >>>> local_irq_restore(). >>>> >>>> Moving it out from under PTL doesn't change the synchronization >>>> semantics, >>>> since lockless readers don't depend on PTL anyway. >>> >>> Cc Hugh who introduced the pmdp_get_lockless_sync(), to double check. >>> >>> Sounds reasonable to me, please add these comments into the commit >>> message. Thanks. >> >> Yes, will do. Thanks! >> >>> >>>>> For the non-PAE case, you added a new tlb_remove_table_sync_one(), >>>>> why we need this (to solve what problem)? Please also add more >>>>> comments to explain. >>>> >>>> Oops, you're right, the original macro was a no-op for non-PAE. >>>> >>>> I should just move the macro call out from under PTL, rather than >>>> replacing it with direct tlb_remove_table_sync_one() calls. >>> >>> OK. >> >> Cheers, >> Lance >>