From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-lf1-f48.google.com (mail-lf1-f48.google.com [209.85.167.48]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id DE38515ADB4 for ; Mon, 11 Aug 2025 09:16:03 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.167.48 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1754903765; cv=none; b=MzGEuCEYHcmGbHm5hqRJB1gXHgYgUO3iGV/HSv5Fm4WXdsY5PAdDml8VvAGjPaMMIHuMezcSc81lBfc+hLH6co+Zk3cP6e/lrGQsz25O5f/e5xMANS6EzNZRtmj/wfOWO0kboJthHXmbdX3BESp0xZr+3HxZ1Wf00rYMsTGfWKs= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1754903765; c=relaxed/simple; bh=JPSAJBEftYZ0F/ryaeFTW1S8tpF0D8X8nemI7lk4Ov0=; h=From:Date:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=E6zrjFm8Nj8nKh1ZS+MQ/i7DD4NAsecVbKCE2kiKa8UG8u9DVxTQ+s4Kee3RrzYOlZo5iBa3Mnpdqox3R00b21yogr+aQ6zx1hZjoiAsf6Mgz5UkDTxKHCNRGc/3rPqgK3l/L8UiBuhdZfb4N+F+Gu94e4AC7+PF9dVJQKExFZI= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=OydY1yyV; arc=none smtp.client-ip=209.85.167.48 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="OydY1yyV" Received: by mail-lf1-f48.google.com with SMTP id 2adb3069b0e04-55b733911b3so4710775e87.2 for ; Mon, 11 Aug 2025 02:16:03 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1754903762; x=1755508562; darn=lists.linux.dev; h=in-reply-to:content-transfer-encoding:content-disposition :mime-version:references:message-id:subject:cc:to:date:from:from:to :cc:subject:date:message-id:reply-to; bh=ZbLfZuJibKMWWd+Di+kV/C8asmfMe0JJbEqm1icB4tM=; b=OydY1yyVIoH82/dsDX7QVVF+A/CeHT7YPPUrLy2LV9kpkNeqSG9lRxvf3HRgeRI+0m LVCP3uCDRkO0UYs4EZqCjELvzn6FbV6ocFC54504RWKpIuJSiAjubeobT7HP2K1bMIDs mQbWr6FjSifn6Ww0YEllskjk044alplpLvdMH/jJespQrfdm+uZXinnisNhP/cBsTeKL RgLtBDISAPLTo2J3pF2xrWvHvVrITF23g1whVr2FIwrk00SUqox8DdeWW/Vjcmbhe8Nn 7EcVYbegCdAbyP4FUjUogEX1ljkEE2OuY6+n1Le/z3hPYEl7DbevzthiDxiPeT0xcWt9 jjjA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1754903762; x=1755508562; h=in-reply-to:content-transfer-encoding:content-disposition :mime-version:references:message-id:subject:cc:to:date:from :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=ZbLfZuJibKMWWd+Di+kV/C8asmfMe0JJbEqm1icB4tM=; b=fJ9AzztLL/RbwJIz6ewxEZmO9c/OPFfaPa7v7k3dQWlWrPEhPMjj71FVje84chgy/W wrh/i4cWhOnPcWS289seyUW74gaOc1DUq+v3GQLtF98eZZF41fymIwxGkMHDh1eQtJ+c kAap40piruFDnwhdTM131a6PQDerrmLOE9APwDZL40X2hK/msO+sP7vUUg1XaRBRlRZE kAKo4qxKnlBxlhLpKDnQ3jFI0B4A1CqN+ahbP6gQwvK2foKuKEFo0mqPXsXfS6c7SmJA IslwTqcvUVVo9aBs0rwLK26SFvUMyk3T0IYibMyr+15fPw6Y1r7p7efMZHvh7gczd5GI AX1Q== X-Forwarded-Encrypted: i=1; AJvYcCVms0CaJV+Dem48Jx7wPs0BdhStgzb101hg628YYjuBYUcIAQF/D5m755tHxMqRr9rlZeo3Zg==@lists.linux.dev X-Gm-Message-State: AOJu0YyghzmqfHtaMLPQISLftkZQoWS1OwLUyBZ0C8Rs7FFED6xe8Ob6 Lj7KbGgV13m9N/Q4pdTn7kK5hOE4oyky4NhML6RnDy4CY+g+73Ka/u2n X-Gm-Gg: ASbGnctLhu1oiWPNXsbnv1Ey3f9sKLIYkdUNb4/VrlKxvg4XDsvfZF2R6ncf50WtUsr 83kQQG9QoTXIg2iMbDWLsulIDnIfDOMDWVkw1l4n9GWAFysxBiB5f5J5SgBw28PQ49vAxJzp8Rt JsaLmZFySHGmASmlgLpd/hhZxXsKVoJTFoB/CI728HSAkXpqbXwmy0elKT5gw47H32zHRLMy/Tz dVmDjaoZ8EOLxyDUamgfj4lcD9B3AVgEAEq7ZP7JhHwKcR9416ieN8Er7694DYu9CF8jJ9j1MSo eKl7viPJG5zA1bfRlrwIOyX7bHqGgv+xfJy/9rAjCmqU8xpW7xxDa9+X2QmvSiSKBEn+dWSgOSt gCoWUVMECCI/bsEqIM+ZqdPICOa8dmxrLfcjNCexbIXqBQz+rYw== X-Google-Smtp-Source: AGHT+IElIGCLzD5l7Ej9guePvaniyav9FRhNwGtWqAJO0RmsmAmOPJQk/HhgSuRWmtJVFhe0j1KpUQ== X-Received: by 2002:a05:6512:63d3:20b0:55c:c9d5:d347 with SMTP id 2adb3069b0e04-55cc9d5dc3emr1292257e87.35.1754903761586; Mon, 11 Aug 2025 02:16:01 -0700 (PDT) Received: from pc636 (host-95-203-26-173.mobileonline.telia.com. [95.203.26.173]) by smtp.gmail.com with ESMTPSA id 2adb3069b0e04-55b8898c164sm4118657e87.23.2025.08.11.02.15.59 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 11 Aug 2025 02:16:00 -0700 (PDT) From: Uladzislau Rezki X-Google-Original-From: Uladzislau Rezki Date: Mon, 11 Aug 2025 11:15:58 +0200 To: Ethan Zhao , Baolu Lu Cc: Baolu Lu , Dave Hansen , Jason Gunthorpe , Joerg Roedel , Will Deacon , Robin Murphy , Kevin Tian , Jann Horn , Vasant Hegde , Alistair Popple , Peter Zijlstra , Uladzislau Rezki , Jean-Philippe Brucker , Andy Lutomirski , Yi Lai , iommu@lists.linux.dev, security@kernel.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org Subject: Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush Message-ID: References: <20250806052505.3113108-1-baolu.lu@linux.intel.com> <20250806155223.GV184255@nvidia.com> <20250806160904.GX184255@nvidia.com> <62d21545-9e75-41e3-89a3-f21dda15bf16@intel.com> <4a8df0e8-bd5a-44e4-acce-46ba75594846@linux.intel.com> <4ce79c80-1fc8-4684-920a-c8d82c4c3dc8@intel.com> <2611981e-3678-4619-b2ab-d9daace5a68a@gmail.com> Precedence: bulk X-Mailing-List: iommu@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <2611981e-3678-4619-b2ab-d9daace5a68a@gmail.com> On Sun, Aug 10, 2025 at 03:19:58PM +0800, Ethan Zhao wrote: > > > On 8/8/2025 1:15 PM, Baolu Lu wrote: > > On 8/7/25 23:31, Dave Hansen wrote: > > > > +void pte_free_kernel(struct mm_struct *mm, pte_t *pte) > > > > +{ > > > > +    struct page *page = virt_to_page(pte); > > > > + > > > > +    guard(spinlock)(&kernel_pte_work.lock); > > > > +    list_add(&page->lru, &kernel_pte_work.list); > > > > +    schedule_work(&kernel_pte_work.work); > > > > +} > > > > diff --git a/include/asm-generic/pgalloc.h > > > > b/include/asm-generic/ pgalloc.h > > > > index 3c8ec3bfea44..716ebab67636 100644 > > > > --- a/include/asm-generic/pgalloc.h > > > > +++ b/include/asm-generic/pgalloc.h > > > > @@ -46,6 +46,7 @@ static inline pte_t > > > > *pte_alloc_one_kernel_noprof(struct mm_struct *mm) > > > >   #define pte_alloc_one_kernel(...) > > > > alloc_hooks(pte_alloc_one_kernel_noprof(__VA_ARGS__)) > > > >   #endif > > > > > > > > +#ifndef __HAVE_ARCH_PTE_FREE_KERNEL > > > >   /** > > > >    * pte_free_kernel - free PTE-level kernel page table memory > > > >    * @mm: the mm_struct of the current context > > > > @@ -55,6 +56,7 @@ static inline void pte_free_kernel(struct mm_struct > > > > *mm, pte_t *pte) > > > >   { > > > >       pagetable_dtor_free(virt_to_ptdesc(pte)); > > > >   } > > > > +#endif > > > > > > > >   /** > > > >    * __pte_alloc_one - allocate memory for a PTE-level user page table > > > I'd much rather the arch-generic code looked like this: > > > > > > #ifdef CONFIG_ASYNC_PGTABLE_FREE > > > // code and struct here, or dump them over in some > > > // other file and do this in a header > > > #else > > > static void pte_free_kernel_async(struct page *page) {} > > > #endif > > > > > > void pte_free_kernel(struct mm_struct *mm, pte_t *pte) > > > { > > >      struct page *page = virt_to_page(pte); > > > > > >      if (IS_DEFINED(CONFIG_ASYNC_PGTABLE_FREE)) { > > >     pte_free_kernel_async(page); > > >      else > > >     pagetable_dtor_free(page_ptdesc(page)); > > > } > > > > > > Then in Kconfig, you end up with something like: > > > > > > config ASYNC_PGTABLE_FREE > > >     def_bool y > > >     depends on INTEL_IOMMU_WHATEVER > > > > > > That very much tells much more of the whole story in code. It also gives > > > the x86 folks that compile out the IOMMU the exact same code as the > > > arch-generic folks. It_also_ makes it dirt simple and obvious for the > > > x86 folks to optimize out the async behavior if they don't like it in > > > the future by replacing the compile-time IOMMU check with a runtime one. > > > > > > Also, if another crazy IOMMU implementation comes along that happens to > > > do what the x86 IOMMUs do, then they have a single Kconfig switch to > > > flip. If they follow what this patch tries to do, they'll start by > > > copying and pasting the x86 implementation. > > > > I'll do it like this.  Does that look good to you? > > > > diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig > > index 70d29b14d851..6f1113e024fa 100644 > > --- a/drivers/iommu/Kconfig > > +++ b/drivers/iommu/Kconfig > > @@ -160,6 +160,7 @@ config IOMMU_DMA > >  # Shared Virtual Addressing > >  config IOMMU_SVA > >      select IOMMU_MM_DATA > > +    select ASYNC_PGTABLE_FREE if X86 > >      bool > > > >  config IOMMU_IOPF > > diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h > > index 3c8ec3bfea44..dbddacdca2ce 100644 > > --- a/include/asm-generic/pgalloc.h > > +++ b/include/asm-generic/pgalloc.h > > @@ -46,6 +46,19 @@ static inline pte_t > > *pte_alloc_one_kernel_noprof(struct mm_struct *mm) > >  #define pte_alloc_one_kernel(...) > > alloc_hooks(pte_alloc_one_kernel_noprof(__VA_ARGS__)) > >  #endif > > > > +#ifdef CONFIG_ASYNC_PGTABLE_FREE > > +struct pgtable_free_work { > > +    struct list_head list; > > +    spinlock_t lock; > > +    struct work_struct work; > > +}; > > +extern struct pgtable_free_work kernel_pte_work; > > + > > +void pte_free_kernel_async(struct ptdesc *ptdesc); > > +#else > > +static inline void pte_free_kernel_async(struct ptdesc *ptdesc) {} > > +#endif > > + > >  /** > >   * pte_free_kernel - free PTE-level kernel page table memory > >   * @mm: the mm_struct of the current context > > @@ -53,7 +66,12 @@ static inline pte_t > > *pte_alloc_one_kernel_noprof(struct mm_struct *mm) > >   */ > >  static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte) > >  { > > -    pagetable_dtor_free(virt_to_ptdesc(pte)); > > +    struct ptdesc *ptdesc = virt_to_ptdesc(pte); > > + > > +    if (IS_ENABLED(CONFIG_ASYNC_PGTABLE_FREE)) > > +        pte_free_kernel_async(ptdesc); > > +    else > > +        pagetable_dtor_free(ptdesc); > >  } > > > >  /** > > diff --git a/mm/Kconfig b/mm/Kconfig > > index e443fe8cd6cf..528550cfa7fe 100644 > > --- a/mm/Kconfig > > +++ b/mm/Kconfig > > @@ -1346,6 +1346,13 @@ config LOCK_MM_AND_FIND_VMA > >  config IOMMU_MM_DATA > >      bool > > > > +config ASYNC_PGTABLE_FREE > > +    bool "Asynchronous kernel page table freeing" > > +    help > > +      Perform kernel page table freeing asynchronously. This is required > > +      for systems with IOMMU Shared Virtual Address (SVA) to flush IOTLB > > +      paging structure caches. > > + > >  config EXECMEM > >      bool > > > > diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c > > index 567e2d084071..6639ee6641d4 100644 > > --- a/mm/pgtable-generic.c > > +++ b/mm/pgtable-generic.c > > @@ -13,6 +13,7 @@ > >  #include > >  #include > >  #include > > +#include > >  #include > >  #include > > > > @@ -406,3 +407,32 @@ pte_t *__pte_offset_map_lock(struct mm_struct *mm, > > pmd_t *pmd, > >      pte_unmap_unlock(pte, ptl); > >      goto again; > >  } > > + > > +#ifdef CONFIG_ASYNC_PGTABLE_FREE > > +static void kernel_pte_work_func(struct work_struct *work); > > +struct pgtable_free_work kernel_pte_work = { > > +    .list = LIST_HEAD_INIT(kernel_pte_work.list), > > +    .lock = __SPIN_LOCK_UNLOCKED(kernel_pte_work.lock), > > +    .work = __WORK_INITIALIZER(kernel_pte_work.work, > > kernel_pte_work_func), > > +}; > > + > > +static void kernel_pte_work_func(struct work_struct *work) > > +{ > > +    struct ptdesc *ptdesc, *next; > > + > > +    iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL); > > + > > +    guard(spinlock)(&kernel_pte_work.lock); > > +    list_for_each_entry_safe(ptdesc, next, &kernel_pte_work.list, > > pt_list) { > > +        list_del_init(&ptdesc->pt_list); > > +        pagetable_dtor_free(ptdesc); > > +    } > > +} > > + > > +void pte_free_kernel_async(struct ptdesc *ptdesc) > > +{ > > +    guard(spinlock)(&kernel_pte_work.lock); > > +    list_add(&ptdesc->pt_list, &kernel_pte_work.list); > > +    schedule_work(&kernel_pte_work.work); > > +} > kernel_pte_work.list is global shared var, it would make the producer > pte_free_kernel() and the consumer kernel_pte_work_func() to operate in > serialized timing. In a large system, I don't think you design this > deliberately :) > Sorry for jumping. Agree, unless it is never considered as a hot path or something that can be really contented. It looks like you can use just a per-cpu llist to drain thinks. As for reference you can have a look at how vfree_atomic() handles deferred freeing. Thanks! -- Uladzislau Rezki