linux-kernel.vger.kernel.org archive mirror
* [REGRESSION][BISECTED] Performance Regression in IOMMU/VT-d Since Kernel 6.10
@ 2025-07-01 17:11 Ioanna Alifieraki
  2025-07-02  5:14 ` Baolu Lu
  0 siblings, 1 reply; 5+ messages in thread
From: Ioanna Alifieraki @ 2025-07-01 17:11 UTC (permalink / raw)
  To: baolu.lu, kevin.tian, jroedel, robin.murphy, will, joro, dwmw2,
	iommu, linux-kernel, regressions, stable

#regzbot introduced: 129dab6e1286

Hello everyone,

We've identified a performance regression that starts with Linux
kernel 6.10 and persists through 6.16 (tested at commit e540341508ce).
Bisection pointed to commit:
129dab6e1286 ("iommu/vt-d: Use cache_tag_flush_range_np() in iotlb_sync_map").

The issue occurs when running fio against two NVMe devices located
under the same PCIe bridge (dual-port NVMe configuration). Performance
drops compared to configurations where the devices are on different
bridges.

Observed Performance:
- Before the commit: ~6150 MiB/s, regardless of NVMe device placement.
- After the commit:
  -- Same PCIe bridge: ~4985 MiB/s
  -- Different PCIe bridges: ~6150 MiB/s
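(In relative terms, the numbers above put the same-bridge case at
roughly a 19% throughput drop against the baseline; a quick check:)

```python
# Rough size of the regression, from the figures reported above.
baseline = 6150      # MiB/s, before the commit (either placement)
same_bridge = 4985   # MiB/s, after the commit, same PCIe bridge

drop_pct = (baseline - same_bridge) / baseline * 100
print(f"{drop_pct:.1f}% throughput drop")  # → 18.9% throughput drop
```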


So far we can only reproduce the issue on a Z3 metal instance on
GCP, but we suspect it is reproducible on any machine with a
dual-port NVMe device. A more detailed description of the issue and
details on the reproducer are available at [1].

Could you please advise on the appropriate path forward to mitigate or
address this regression?

Thanks,
Jo

[1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2115738


* Re: [REGRESSION][BISECTED] Performance Regression in IOMMU/VT-d Since Kernel 6.10
  2025-07-01 17:11 [REGRESSION][BISECTED] Performance Regression in IOMMU/VT-d Since Kernel 6.10 Ioanna Alifieraki
@ 2025-07-02  5:14 ` Baolu Lu
  2025-07-02  9:00   ` Baolu Lu
  0 siblings, 1 reply; 5+ messages in thread
From: Baolu Lu @ 2025-07-02  5:14 UTC (permalink / raw)
  To: Ioanna Alifieraki, kevin.tian, jroedel, robin.murphy, will, joro,
	dwmw2, iommu, linux-kernel, regressions, stable

On 7/2/25 01:11, Ioanna Alifieraki wrote:
> #regzbot introduced: 129dab6e1286
> 
> Hello everyone,
> 
> We've identified a performance regression that starts with linux
> kernel 6.10 and persists through 6.16(tested at commit e540341508ce).
> Bisection pointed to commit:
> 129dab6e1286 ("iommu/vt-d: Use cache_tag_flush_range_np() in iotlb_sync_map").
> 
> The issue occurs when running fio against two NVMe devices located
> under the same PCIe bridge (dual-port NVMe configuration). Performance
> drops compared to configurations where the devices are on different
> bridges.
> 
> Observed Performance:
> - Before the commit: ~6150 MiB/s, regardless of NVMe device placement.
> - After the commit:
>    -- Same PCIe bridge: ~4985 MiB/s
>    -- Different PCIe bridges: ~6150 MiB/s
> 
> 
> Currently we can only reproduce the issue on a Z3 metal instance on
> gcp. I suspect the issue can be reproducible if you have a dual port
> nvme on any machine.
> At [1] there's a more detailed description of the issue and details
> on the reproducer.

This test was running on bare-metal hardware rather than in a
virtualization guest, right? If that's the case,
cache_tag_flush_range_np() is almost a no-op.

Can you please show me the capability register of the IOMMU by:

#cat /sys/bus/pci/devices/[pci_dev_name]/iommu/intel-iommu/cap

> 
> Could you please advise on the appropriate path forward to mitigate or
> address this regression?
> 
> Thanks,
> Jo
> 
> [1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2115738

Thanks,
baolu


* Re: [REGRESSION][BISECTED] Performance Regression in IOMMU/VT-d Since Kernel 6.10
  2025-07-02  5:14 ` Baolu Lu
@ 2025-07-02  9:00   ` Baolu Lu
  2025-07-02 16:45     ` Ioanna Alifieraki
  0 siblings, 1 reply; 5+ messages in thread
From: Baolu Lu @ 2025-07-02  9:00 UTC (permalink / raw)
  To: Ioanna Alifieraki, kevin.tian, jroedel, robin.murphy, will, joro,
	dwmw2, iommu, linux-kernel, regressions, stable

[-- Attachment #1: Type: text/plain, Size: 3513 bytes --]

On 7/2/2025 1:14 PM, Baolu Lu wrote:
> On 7/2/25 01:11, Ioanna Alifieraki wrote:
>> #regzbot introduced: 129dab6e1286
>>
>> Hello everyone,
>>
>> We've identified a performance regression that starts with linux
>> kernel 6.10 and persists through 6.16(tested at commit e540341508ce).
>> Bisection pointed to commit:
>> 129dab6e1286 ("iommu/vt-d: Use cache_tag_flush_range_np() in 
>> iotlb_sync_map").
>>
>> The issue occurs when running fio against two NVMe devices located
>> under the same PCIe bridge (dual-port NVMe configuration). Performance
>> drops compared to configurations where the devices are on different
>> bridges.
>>
>> Observed Performance:
>> - Before the commit: ~6150 MiB/s, regardless of NVMe device placement.
>> - After the commit:
>>    -- Same PCIe bridge: ~4985 MiB/s
>>    -- Different PCIe bridges: ~6150 MiB/s
>>
>>
>> Currently we can only reproduce the issue on a Z3 metal instance on
>> gcp. I suspect the issue can be reproducible if you have a dual port
>> nvme on any machine.
>> At [1] there's a more detailed description of the issue and details
>> on the reproducer.
> 
> This test was running on bare metal hardware instead of any
> virtualization guest, right? If that's the case,
> cache_tag_flush_range_np() is almost a no-op.
> 
> Can you please show me the capability register of the IOMMU by:
> 
> #cat /sys/bus/pci/devices/[pci_dev_name]/iommu/intel-iommu/cap

Also, can you please try whether the below changes make any difference?
I've also attached a patch file to this email so you can apply the
change more easily.

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 7aa3932251b2..f60201ee4be0 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1796,6 +1796,18 @@ static int domain_setup_first_level(struct intel_iommu *iommu,
  					  (pgd_t *)pgd, flags, old);
  }

+static bool domain_need_iotlb_sync_map(struct dmar_domain *domain,
+				       struct intel_iommu *iommu)
+{
+	if (cap_caching_mode(iommu->cap) && !domain->use_first_level)
+		return true;
+
+	if (rwbf_quirk || cap_rwbf(iommu->cap))
+		return true;
+
+	return false;
+}
+
  static int dmar_domain_attach_device(struct dmar_domain *domain,
  				     struct device *dev)
  {
@@ -1833,6 +1845,8 @@ static int dmar_domain_attach_device(struct dmar_domain *domain,
  	if (ret)
  		goto out_block_translation;

+	domain->iotlb_sync_map |= domain_need_iotlb_sync_map(domain, iommu);
+
  	return 0;

  out_block_translation:
@@ -3945,7 +3959,10 @@ static bool risky_device(struct pci_dev *pdev)
  static int intel_iommu_iotlb_sync_map(struct iommu_domain *domain,
  				      unsigned long iova, size_t size)
  {
-	cache_tag_flush_range_np(to_dmar_domain(domain), iova, iova + size - 1);
+	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+
+	if (dmar_domain->iotlb_sync_map)
+		cache_tag_flush_range_np(dmar_domain, iova, iova + size - 1);

  	return 0;
  }
diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index 3ddbcc603de2..7ab2c34a5ecc 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -614,6 +614,9 @@ struct dmar_domain {
  	u8 has_mappings:1;		/* Has mappings configured through
  					 * iommu_map() interface.
  					 */
+	u8 iotlb_sync_map:1;		/* Need to flush IOTLB cache or write
+					 * buffer when creating mappings.
+					 */

  	spinlock_t lock;		/* Protect device tracking lists */
  	struct list_head devices;	/* all devices' list */
-- 
2.43.0

Thanks,
baolu

[-- Attachment #2: 0001-iommu-vt-d-Avoid-unnecessary-cache_tag_flush_range_n.patch --]
[-- Type: text/plain, Size: 2362 bytes --]

From ddc4210a3365147df978bd0bf45d824b9c869877 Mon Sep 17 00:00:00 2001
From: Lu Baolu <baolu.lu@linux.intel.com>
Date: Wed, 2 Jul 2025 16:51:48 +0800
Subject: [PATCH 1/1] iommu/vt-d: Avoid unnecessary cache_tag_flush_range_np()

For test purposes only!

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/intel/iommu.c | 19 ++++++++++++++++++-
 drivers/iommu/intel/iommu.h |  3 +++
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 7aa3932251b2..f60201ee4be0 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1796,6 +1796,18 @@ static int domain_setup_first_level(struct intel_iommu *iommu,
 					  (pgd_t *)pgd, flags, old);
 }
 
+static bool domain_need_iotlb_sync_map(struct dmar_domain *domain,
+				       struct intel_iommu *iommu)
+{
+	if (cap_caching_mode(iommu->cap) && !domain->use_first_level)
+		return true;
+
+	if (rwbf_quirk || cap_rwbf(iommu->cap))
+		return true;
+
+	return false;
+}
+
 static int dmar_domain_attach_device(struct dmar_domain *domain,
 				     struct device *dev)
 {
@@ -1833,6 +1845,8 @@ static int dmar_domain_attach_device(struct dmar_domain *domain,
 	if (ret)
 		goto out_block_translation;
 
+	domain->iotlb_sync_map |= domain_need_iotlb_sync_map(domain, iommu);
+
 	return 0;
 
 out_block_translation:
@@ -3945,7 +3959,10 @@ static bool risky_device(struct pci_dev *pdev)
 static int intel_iommu_iotlb_sync_map(struct iommu_domain *domain,
 				      unsigned long iova, size_t size)
 {
-	cache_tag_flush_range_np(to_dmar_domain(domain), iova, iova + size - 1);
+	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+
+	if (dmar_domain->iotlb_sync_map)
+		cache_tag_flush_range_np(dmar_domain, iova, iova + size - 1);
 
 	return 0;
 }
diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index 3ddbcc603de2..7ab2c34a5ecc 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -614,6 +614,9 @@ struct dmar_domain {
 	u8 has_mappings:1;		/* Has mappings configured through
 					 * iommu_map() interface.
 					 */
+	u8 iotlb_sync_map:1;		/* Need to flush IOTLB cache or write
+					 * buffer when creating mappings.
+					 */
 
 	spinlock_t lock;		/* Protect device tracking lists */
 	struct list_head devices;	/* all devices' list */
-- 
2.43.0



* Re: [REGRESSION][BISECTED] Performance Regression in IOMMU/VT-d Since Kernel 6.10
  2025-07-02  9:00   ` Baolu Lu
@ 2025-07-02 16:45     ` Ioanna Alifieraki
  2025-07-03  2:03       ` Baolu Lu
  0 siblings, 1 reply; 5+ messages in thread
From: Ioanna Alifieraki @ 2025-07-02 16:45 UTC (permalink / raw)
  To: Baolu Lu
  Cc: kevin.tian, jroedel, robin.murphy, will, joro, dwmw2, iommu,
	linux-kernel, regressions, stable

On Wed, Jul 2, 2025 at 12:00 PM Baolu Lu <baolu.lu@linux.intel.com> wrote:
>
> On 7/2/2025 1:14 PM, Baolu Lu wrote:
> > On 7/2/25 01:11, Ioanna Alifieraki wrote:
> >> #regzbot introduced: 129dab6e1286
> >>
> >> Hello everyone,
> >>
> >> We've identified a performance regression that starts with linux
> >> kernel 6.10 and persists through 6.16(tested at commit e540341508ce).
> >> Bisection pointed to commit:
> >> 129dab6e1286 ("iommu/vt-d: Use cache_tag_flush_range_np() in
> >> iotlb_sync_map").
> >>
> >> The issue occurs when running fio against two NVMe devices located
> >> under the same PCIe bridge (dual-port NVMe configuration). Performance
> >> drops compared to configurations where the devices are on different
> >> bridges.
> >>
> >> Observed Performance:
> >> - Before the commit: ~6150 MiB/s, regardless of NVMe device placement.
> >> - After the commit:
> >>    -- Same PCIe bridge: ~4985 MiB/s
> >>    -- Different PCIe bridges: ~6150 MiB/s
> >>
> >>
> >> Currently we can only reproduce the issue on a Z3 metal instance on
> >> gcp. I suspect the issue can be reproducible if you have a dual port
> >> nvme on any machine.
> >> At [1] there's a more detailed description of the issue and details
> >> on the reproducer.
> >
> > This test was running on bare metal hardware instead of any
> > virtualization guest, right? If that's the case,
> > cache_tag_flush_range_np() is almost a no-op.
> >
> > Can you please show me the capability register of the IOMMU by:
> >
> > #cat /sys/bus/pci/devices/[pci_dev_name]/iommu/intel-iommu/cap
>
> Also, can you please try whether the below changes make any difference?
> I've also attached a patch file to this email so you can apply the
> change more easily.
Thanks for the patch, Baolu. I've tested it and can confirm we get
~6150 MiB/s for NVMe pairs both under the same bridge and under
different bridges.
The output of
cat /sys/bus/pci/devices/[pci_dev_name]/iommu/intel-iommu/cap
is 19ed008c40780c66 for all NVMe devices.
I got confirmation there's no virtualization happening on this
instance at all.
FWIW, I had run perf when initially investigating the issue, and it
showed quite some time spent in cache_tag_flush_range_np().
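(For reference, that cap value can be decoded with the same bit tests
the kernel's cap_rwbf() and cap_caching_mode() macros perform — RWBF is
bit 4 and Caching Mode bit 7 of the VT-d capability register. Both come
out clear here, consistent with the flush being unnecessary on this
machine. A minimal sketch:)

```python
# Decode the two capability bits that the proposed
# domain_need_iotlb_sync_map() tests, using the bit positions
# of the kernel's cap_rwbf()/cap_caching_mode() macros:
#   cap_rwbf(c)         = (c >> 4) & 1  (Required Write-Buffer Flushing)
#   cap_caching_mode(c) = (c >> 7) & 1  (Caching Mode)
cap = 0x19ed008c40780c66  # value reported above

rwbf = (cap >> 4) & 1
caching_mode = (cap >> 7) & 1
print(f"RWBF={rwbf} CM={caching_mode}")  # → RWBF=0 CM=0
```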

Thanks again!
Jo
>
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index 7aa3932251b2..f60201ee4be0 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -1796,6 +1796,18 @@ static int domain_setup_first_level(struct intel_iommu *iommu,
>                                           (pgd_t *)pgd, flags, old);
>   }
>
> +static bool domain_need_iotlb_sync_map(struct dmar_domain *domain,
> +                                      struct intel_iommu *iommu)
> +{
> +       if (cap_caching_mode(iommu->cap) && !domain->use_first_level)
> +               return true;
> +
> +       if (rwbf_quirk || cap_rwbf(iommu->cap))
> +               return true;
> +
> +       return false;
> +}
> +
>   static int dmar_domain_attach_device(struct dmar_domain *domain,
>                                      struct device *dev)
>   {
> @@ -1833,6 +1845,8 @@ static int dmar_domain_attach_device(struct dmar_domain *domain,
>         if (ret)
>                 goto out_block_translation;
>
> +       domain->iotlb_sync_map |= domain_need_iotlb_sync_map(domain, iommu);
> +
>         return 0;
>
>   out_block_translation:
> @@ -3945,7 +3959,10 @@ static bool risky_device(struct pci_dev *pdev)
>   static int intel_iommu_iotlb_sync_map(struct iommu_domain *domain,
>                                       unsigned long iova, size_t size)
>   {
> -       cache_tag_flush_range_np(to_dmar_domain(domain), iova, iova + size - 1);
> +       struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> +
> +       if (dmar_domain->iotlb_sync_map)
> +               cache_tag_flush_range_np(dmar_domain, iova, iova + size - 1);
>
>         return 0;
>   }
> diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
> index 3ddbcc603de2..7ab2c34a5ecc 100644
> --- a/drivers/iommu/intel/iommu.h
> +++ b/drivers/iommu/intel/iommu.h
> @@ -614,6 +614,9 @@ struct dmar_domain {
>         u8 has_mappings:1;              /* Has mappings configured through
>                                          * iommu_map() interface.
>                                          */
> +       u8 iotlb_sync_map:1;            /* Need to flush IOTLB cache or write
> +                                        * buffer when creating mappings.
> +                                        */
>
>         spinlock_t lock;                /* Protect device tracking lists */
>         struct list_head devices;       /* all devices' list */
> --
> 2.43.0
>
> Thanks,
> baolu


* Re: [REGRESSION][BISECTED] Performance Regression in IOMMU/VT-d Since Kernel 6.10
  2025-07-02 16:45     ` Ioanna Alifieraki
@ 2025-07-03  2:03       ` Baolu Lu
  0 siblings, 0 replies; 5+ messages in thread
From: Baolu Lu @ 2025-07-03  2:03 UTC (permalink / raw)
  To: Ioanna Alifieraki
  Cc: kevin.tian, jroedel, robin.murphy, will, joro, dwmw2, iommu,
	linux-kernel, regressions, stable

On 7/3/25 00:45, Ioanna Alifieraki wrote:
> On Wed, Jul 2, 2025 at 12:00 PM Baolu Lu <baolu.lu@linux.intel.com> wrote:
>> On 7/2/2025 1:14 PM, Baolu Lu wrote:
>>> On 7/2/25 01:11, Ioanna Alifieraki wrote:
>>>> #regzbot introduced: 129dab6e1286
>>>>
>>>> Hello everyone,
>>>>
>>>> We've identified a performance regression that starts with linux
>>>> kernel 6.10 and persists through 6.16(tested at commit e540341508ce).
>>>> Bisection pointed to commit:
>>>> 129dab6e1286 ("iommu/vt-d: Use cache_tag_flush_range_np() in
>>>> iotlb_sync_map").
>>>>
>>>> The issue occurs when running fio against two NVMe devices located
>>>> under the same PCIe bridge (dual-port NVMe configuration). Performance
>>>> drops compared to configurations where the devices are on different
>>>> bridges.
>>>>
>>>> Observed Performance:
>>>> - Before the commit: ~6150 MiB/s, regardless of NVMe device placement.
>>>> - After the commit:
>>>>     -- Same PCIe bridge: ~4985 MiB/s
>>>>     -- Different PCIe bridges: ~6150 MiB/s
>>>>
>>>>
>>>> Currently we can only reproduce the issue on a Z3 metal instance on
>>>> gcp. I suspect the issue can be reproducible if you have a dual port
>>>> nvme on any machine.
>>>> At [1] there's a more detailed description of the issue and details
>>>> on the reproducer.
>>> This test was running on bare metal hardware instead of any
>>> virtualization guest, right? If that's the case,
>>> cache_tag_flush_range_np() is almost a no-op.
>>>
>>> Can you please show me the capability register of the IOMMU by:
>>>
>>> #cat /sys/bus/pci/devices/[pci_dev_name]/iommu/intel-iommu/cap
>> Also, can you please try whether the below changes make any difference?
>> I've also attached a patch file to this email so you can apply the
>> change more easily.
> Thanks for the patch Baolu, I've tested and I can confirm we get ~6150MiB/s
> for nvme pairs both under the same and different bridge.
> The output of
> cat /sys/bus/pci/devices/[pci_dev_name]/iommu/intel-iommu/cap
> 19ed008c40780c66
> for all nvmes.
> I got confirmation there's no virtualization happening on this instance
> at all.
> FWIW, I had run perf when initially investigating the issue and it was
> showing quite some time spent in cache_tag_flush_range_np().

Okay, I will post a formal fix patch for this. Thank you!

Thanks,
baolu


end of thread, other threads:[~2025-07-03  2:04 UTC | newest]

Thread overview: 5+ messages
2025-07-01 17:11 [REGRESSION][BISECTED] Performance Regression in IOMMU/VT-d Since Kernel 6.10 Ioanna Alifieraki
2025-07-02  5:14 ` Baolu Lu
2025-07-02  9:00   ` Baolu Lu
2025-07-02 16:45     ` Ioanna Alifieraki
2025-07-03  2:03       ` Baolu Lu
