Subject: Re: [PATCH -next 1/1] mm: hugetlb_vmemmap: Fix WARN_ON in vmemmap_remap_pte
From: Muchun Song <muchun.song@linux.dev>
Date: Tue, 1 Nov 2022 17:29:49 +0800
To: Catalin Marinas
Cc: Anshuman Khandual, Wupeng Ma, Andrew Morton, Mike Kravetz, Muchun Song,
 Michal Hocko, Oscar Salvador, Linux Memory Management List,
 linux-kernel@vger.kernel.org
Message-Id: <1FD8D6EF-B7E9-4200-931C-3CCD81605741@linux.dev>
References: <20221025014215.3466904-1-mawupeng1@huawei.com>
 <614E3E83-1EAB-4C39-AF9C-83C0CCF26218@linux.dev>
 <35dd51eb-c266-f221-298a-21309c17971a@arm.com>
 <3D6FDA43-A812-4907-B9C8-C2B25567DBBC@linux.dev>
 <3c545133-71aa-9a8d-8a13-09186c4fa767@arm.com>

> On Oct 28, 2022, at 23:53, Catalin Marinas wrote:
>
> On Fri, Oct 28, 2022 at 10:45:09AM +0800, Muchun Song wrote:
>> On Oct 27, 2022, at 18:50, Catalin Marinas wrote:
>>> On Wed, Oct 26, 2022 at 02:06:00PM +0530, Anshuman Khandual wrote:
>>>> On 10/26/22 12:31, Muchun Song wrote:
>>>>>> On 10/25/22 12:06, Muchun Song wrote:
>>>>>>>> On Oct 25, 2022, at 09:42, Wupeng Ma wrote:
>>>>>>>> From: Ma Wupeng
>>>>>>>>
>>>>>>>> Commit f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with
>>>>>>>> each HugeTLB page") added vmemmap_remap_pte() to remap the tail pages as
>>>>>>>> read-only in order to catch illegal writes to the tail pages.
>>>>>>>>
>>>>>>>> However, this leads to a WARN_ON on arm64 in __check_racy_pte_update(),
>>>>>>>
>>>>>>> Thanks for finding this issue.
>>>>>>>
>>>>>>>> since it may result in the dirty state being cleared.
>>>>>>>> This check was introduced by
>>>>>>>> commit 2f4b829c625e ("arm64: Add support for hardware updates of the
>>>>>>>> access and dirty pte bits"), and the initial check was as follows:
>>>>>>>>
>>>>>>>>     BUG_ON(pte_write(*ptep) && !pte_dirty(pte));
>>>>>>>>
>>>>>>>> Since we do need to mark this pte as read-only to catch illegal writes
>>>>>>>> to the tail pages, use set_pte instead of set_pte_at to bypass this
>>>>>>>> check.
>>>>>>>
>>>>>>> In theory, the warning does not affect anything, since the tail vmemmap
>>>>>>> pages are supposed to be read-only. So, skipping this check for vmemmap
>>>>>>
>>>>>> Tail vmemmap pages are supposed to be read-only in practice, but their
>>>>>> backing pages do have pte_write() enabled. Otherwise the VM_WARN_ONCE()
>>>>>> warning would not have triggered.
>>>>>
>>>>> Right.
>>>>>
>>>>>>
>>>>>>     VM_WARN_ONCE(pte_write(old_pte) && !pte_dirty(pte),
>>>>>>                  "%s: racy dirty state clearing: 0x%016llx -> 0x%016llx",
>>>>>>                  __func__, pte_val(old_pte), pte_val(pte));
>>>>>>
>>>>>> Also, is it not true that the pte is being remapped, read-only, to a
>>>>>> different page than the one it mapped originally (which will be freed up),
>>>>>> i.e. the PFN in 'old_pte' and 'pte' will be different? Hence, is there still
>>>>>
>>>>> Right.
>>>>>
>>>>>> a possibility of a race condition even when the PFN changes?
>>>>>
>>>>> Sorry, I didn't get this question. Did you mean the PTE is changed from
>>>>> the new one (pte) back to the old one (old_pte) by the hardware, because of
>>>>> a dirty-bit update caused by a concurrent write to the tail vmemmap page?
>>>>
>>>> No, but doesn't vmemmap_remap_pte() reuse walk->reuse_page for all the
>>>> remaining tail pages? Isn't there a PFN change, along with an access
>>>> permission change, involved in this remapping process?
>>>
>>> For the record, as we discussed offline, changing the output address
>>> (pfn) of a pte is not safe without break-before-make if at least one of
>>> the mappings was writeable. The caller (vmemmap_remap_pte()) would need
>>> to be fixed to first invalidate the pte and then write the new pte. I
>>
>> Could you give more details about what issue this would cause? I am
>> not familiar with arm64.
>
> Well, it's not allowed by the architecture, so some CPU implementations
> may do weird things like accessing incorrect memory or triggering TLB
> conflict aborts if, for some reason, they end up with two entries in
> the TLB for the same VA but pointing to different pfns. The hardware
> expects an invalid PTE and TLB invalidation between such changes. In
> practice, most likely nothing happens and this works fine, but we need to
> stick to the architecture requirements in case some CPUs rely on them.

Got it. Thanks for your nice explanation.

>
>>> assume no other CPU accesses this part of the vmemmap while the pte is
>>> being remapped.
>>
>> However, there is no guarantee that no other CPU accesses this pte.
>> E.g. memory failure or memory compaction can both obtain the head page
>> from any tail struct page (read-only) at any time.
>
> Oh, so we cannot safely go through a break-before-make sequence here
> (zero the PTE, flush the TLB, write the new PTE), as some CPU may access
> this pte.

Right.

Muchun

>
> --
> Catalin
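
For reference, a minimal sketch of the break-before-make ordering discussed
above, written against the vmemmap_remap_pte() callback style and the
walk->reuse_page field of the existing vmemmap remapping code. The function
name and exact arguments here are illustrative, not the actual patch, and, as
the thread concludes, even this ordering is not safe for the vmemmap because
another CPU may read the tail struct pages through this mapping at any time:

	/*
	 * Illustrative only: break-before-make for a kernel (vmemmap) pte.
	 * "Break": clear the old writable mapping and invalidate any cached
	 * TLB entry; "make": install the new read-only mapping of the reused
	 * head page.  Not usable here in practice, since other CPUs may be
	 * reading the tail struct pages while the pte is invalid.
	 */
	static void vmemmap_remap_pte_bbm(pte_t *ptep, unsigned long addr,
					  struct vmemmap_remap_walk *walk)
	{
		pte_t entry = mk_pte(walk->reuse_page, PAGE_KERNEL_RO);

		/* Break: remove the old mapping and flush the TLB for it. */
		pte_clear(&init_mm, addr, ptep);
		flush_tlb_kernel_range(addr, addr + PAGE_SIZE);

		/* Make: write the new read-only pte for the reused page. */
		set_pte_at(&init_mm, addr, ptep, entry);
	}

This is what Catalin refers to as "first invalidate the pte and then write the
new pte"; Muchun's point is that the window in which the pte is invalid is
itself a problem, because readers of the tail struct pages can show up at any
time.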