Date: Wed, 28 Aug 2024 10:41:40 -0400
From: Peter Xu
To: Jiaqi Yan
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Gavin Shan,
	Catalin Marinas, x86@kernel.org, Ingo Molnar, Andrew Morton,
	Paolo Bonzini, Dave Hansen, Thomas Gleixner, Alistair Popple,
	kvm@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	Sean Christopherson, Oscar Salvador, Jason Gunthorpe,
	Borislav Petkov, Zi Yan, Axel Rasmussen, David Hildenbrand,
	Yan Zhao, Will Deacon, Kefeng Wang, Alex Williamson
Subject: Re: [PATCH v2 00/19] mm: Support huge pfnmaps
References: <20240826204353.2228736-1-peterx@redhat.com>
Content-Type: text/plain; charset=utf-8

On Tue, Aug 27, 2024 at 05:42:21PM -0700, Jiaqi Yan wrote:
> On Tue, Aug 27, 2024 at 3:57 PM Peter Xu wrote:
> >
> > On Tue, Aug 27, 2024 at 03:36:07PM -0700, Jiaqi Yan wrote:
> > > Hi Peter,
> >
> > Hi, Jiaqi,
> >
> > > I am curious if there is any work needed for unmap_mapping_range? If a
> > > driver hugely remap_pfn_range()ed at 1G granularity, can the driver
> > > unmap at PAGE_SIZE granularity? For example, when handling a PFN is
> >
> > Yes it can, but it'll invoke split_huge_pud(), which by default routes
> > to removal of the whole pud right now (it currently only covers either
> > DAX mappings or huge pfnmaps; it won't for anonymous memory if that
> > comes, for example).
> >
> > In that case it'll rely on the driver providing proper fault() /
> > huge_fault() handlers to refault things back at smaller sizes when the
> > range is accessed again.
>
> I see, so the driver needs to drive the recovery process, and the code
> needs to be in the driver.
>
> But it seems to me the recovery process will be more or less the same
> for different drivers? In that case, does it make sense for
> memory_failure to do the common things for all drivers?
>
> Instead of removing the whole pud, can the driver or memory_failure do
> something similar to a non-struct-page version of split_huge_page, so
> the driver doesn't need to re-fault the good pages back?

I think we can; it's just that we don't yet have a valid use case.  DAX
is definitely faultable.  As for the new huge pfnmaps, vfio is currently
the only user, and vfio only requires either zapping all or mapping all,
so there's no real need for what you described yet.  Meanwhile it's also
faultable, so if / when needed it should hopefully still do the work
properly.

I don't think it's a usual requirement for most of the other drivers
either, as most of them don't even support fault(), afaiu.
remap_pfn_range() can start to use huge mappings, but I'd expect most of
its users are not ready for random teardown of arbitrary MMIO mappings.

It sounds doable to me when there's a need for what you're describing,
but I don't think I know the use case well yet.
> > > poisoned in the 1G mapping, it would be great if the mapping could be
> > > split into 2M mappings + 4k mappings, so only the single poisoned PFN
> > > is lost.  (Pretty much like the past proposal* to use HGM** to improve
> > > hugetlb's memory failure handling.)
> >
> > Note that we're only talking about MMIO mappings here, in which case the
> > PFN doesn't even have a struct page, so the whole poison idea shouldn't
> > apply, afaiu.
>
> Yes, there won't be any struct page.  Ankit proposed this patchset* for
> handling poisoning.  I wonder: if someday the vfio-nvgrace-gpu-pci
> driver adopts your change via the new remap_pfn_range (installing a
> PMD/PUD instead of PTEs), and memory_failure_pfn still calls
> unmap_mapping_range(pfn_space->mapping, pfn << PAGE_SHIFT, PAGE_SIZE,
> 0), can it somehow just work, with no re-fault needed?
>
> * https://lore.kernel.org/lkml/20231123003513.24292-2-ankita@nvidia.com/#t

I see now, interesting.  Thanks for the link.

In that case of nvgpu usage, one way is to do what you said: we can
enhance the pmd/pud split for pfnmaps, but maybe that's overkill.

I saw that the nvgpu driver will need a fault() handler anyway so as to
detect poisoned PFNs.  Then it's also feasible that, when
nvgrace_gpu_vfio_pci_fault() supports huge pfnmaps, it tries to detect
whether the whole faulting range contains any poisoned PFN, and provides
FALLBACK if so (rather than VM_FAULT_HWPOISON).

E.g., when 4K out of a 2M mapping is poisoned, we'll zap the 2M mapping
completely.  When an access happens, as long as the accessed 4K page is
not the poisoned one, huge_fault() should still detect that a 4k range
within the 2M is poisoned, so it'll not install a pmd but return
FALLBACK; then fault() will see that the accessed 4k range itself is not
poisoned, and install a pte.

Thanks,

-- 
Peter Xu