* [REGRESSION][BISECTED] Performance Regression in IOMMU/VT-d Since Kernel 6.10
@ 2025-07-01 17:11 Ioanna Alifieraki
  2025-07-02  5:14 ` Baolu Lu
  0 siblings, 1 reply; 5+ messages in thread
From: Ioanna Alifieraki @ 2025-07-01 17:11 UTC (permalink / raw)
  To: baolu.lu, kevin.tian, jroedel, robin.murphy, will, joro, dwmw2, iommu,
	linux-kernel, regressions, stable

#regzbot introduced: 129dab6e1286

Hello everyone,

We've identified a performance regression that starts with Linux
kernel 6.10 and persists through 6.16 (tested at commit e540341508ce).
Bisection pointed to commit:
129dab6e1286 ("iommu/vt-d: Use cache_tag_flush_range_np() in iotlb_sync_map").

The issue occurs when running fio against two NVMe devices located
under the same PCIe bridge (dual-port NVMe configuration). Performance
drops compared to configurations where the devices are on different
bridges.

Observed Performance:
- Before the commit: ~6150 MiB/s, regardless of NVMe device placement.
- After the commit:
-- Same PCIe bridge: ~4985 MiB/s
-- Different PCIe bridges: ~6150 MiB/s

Currently we can only reproduce the issue on a Z3 metal instance on
GCP. I suspect the issue is reproducible on any machine with a
dual-port NVMe.
At [1] there's a more detailed description of the issue and details
on the reproducer.

Could you please advise on the appropriate path forward to mitigate or
address this regression?

Thanks,
Jo

[1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2115738

^ permalink raw reply	[flat|nested] 5+ messages in thread
* Re: [REGRESSION][BISECTED] Performance Regression in IOMMU/VT-d Since Kernel 6.10
  2025-07-01 17:11 [REGRESSION][BISECTED] Performance Regression in IOMMU/VT-d Since Kernel 6.10 Ioanna Alifieraki
@ 2025-07-02  5:14 ` Baolu Lu
  2025-07-02  9:00   ` Baolu Lu
  0 siblings, 1 reply; 5+ messages in thread
From: Baolu Lu @ 2025-07-02  5:14 UTC (permalink / raw)
  To: Ioanna Alifieraki, kevin.tian, jroedel, robin.murphy, will, joro, dwmw2,
	iommu, linux-kernel, regressions, stable

On 7/2/25 01:11, Ioanna Alifieraki wrote:
> #regzbot introduced: 129dab6e1286
>
> Hello everyone,
>
> We've identified a performance regression that starts with Linux
> kernel 6.10 and persists through 6.16 (tested at commit e540341508ce).
> Bisection pointed to commit:
> 129dab6e1286 ("iommu/vt-d: Use cache_tag_flush_range_np() in iotlb_sync_map").
>
> The issue occurs when running fio against two NVMe devices located
> under the same PCIe bridge (dual-port NVMe configuration). Performance
> drops compared to configurations where the devices are on different
> bridges.
>
> Observed Performance:
> - Before the commit: ~6150 MiB/s, regardless of NVMe device placement.
> - After the commit:
> -- Same PCIe bridge: ~4985 MiB/s
> -- Different PCIe bridges: ~6150 MiB/s
>
> Currently we can only reproduce the issue on a Z3 metal instance on
> GCP. I suspect the issue is reproducible on any machine with a
> dual-port NVMe.
> At [1] there's a more detailed description of the issue and details
> on the reproducer.

This test was running on bare metal hardware instead of in a
virtualization guest, right? If that's the case,
cache_tag_flush_range_np() is almost a no-op.

Can you please show me the capability register of the IOMMU by:

# cat /sys/bus/pci/devices/[pci_dev_name]/iommu/intel-iommu/cap

> Could you please advise on the appropriate path forward to mitigate or
> address this regression?
>
> Thanks,
> Jo
>
> [1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2115738

Thanks,
baolu

^ permalink raw reply	[flat|nested] 5+ messages in thread
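[The capability register requested above encodes, among other fields, the
two bits that decide whether a map-time flush is architecturally required:
RWBF (required write-buffer flushing) and CM (caching mode). The following
stand-alone decoder is a minimal sketch, not part of the thread; the bit
positions (RWBF at bit 4, CM at bit 7) are assumed from the VT-d
specification and mirror the kernel's cap_rwbf()/cap_caching_mode() macros.

/* cap_decode.c - decode RWBF and CM from an intel-iommu capability value.
 *
 * Build: gcc -o cap_decode cap_decode.c
 * Usage: ./cap_decode $(cat /sys/bus/pci/devices/<dev>/iommu/intel-iommu/cap)
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define CAP_RWBF_BIT 4	/* Required Write-Buffer Flushing */
#define CAP_CM_BIT   7	/* Caching Mode */

int main(int argc, char **argv)
{
	uint64_t cap;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <hex cap value>\n", argv[0]);
		return 1;
	}

	/* The sysfs attribute prints the register as bare hex. */
	cap = strtoull(argv[1], NULL, 16);

	printf("cap  = 0x%016llx\n", (unsigned long long)cap);
	printf("RWBF = %llu\n", (unsigned long long)((cap >> CAP_RWBF_BIT) & 1));
	printf("CM   = %llu\n", (unsigned long long)((cap >> CAP_CM_BIT) & 1));
	return 0;
}

On bare-metal hardware both bits are usually 0, which is what makes the
per-map flush avoidable and motivates the question.]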
* Re: [REGRESSION][BISECTED] Performance Regression in IOMMU/VT-d Since Kernel 6.10
  2025-07-02  5:14 ` Baolu Lu
@ 2025-07-02  9:00   ` Baolu Lu
  2025-07-02 16:45     ` Ioanna Alifieraki
  0 siblings, 1 reply; 5+ messages in thread
From: Baolu Lu @ 2025-07-02  9:00 UTC (permalink / raw)
  To: Ioanna Alifieraki, kevin.tian, jroedel, robin.murphy, will, joro, dwmw2,
	iommu, linux-kernel, regressions, stable

[-- Attachment #1: Type: text/plain, Size: 3513 bytes --]

On 7/2/2025 1:14 PM, Baolu Lu wrote:
> On 7/2/25 01:11, Ioanna Alifieraki wrote:
>> #regzbot introduced: 129dab6e1286
>>
>> Hello everyone,
>>
>> We've identified a performance regression that starts with Linux
>> kernel 6.10 and persists through 6.16 (tested at commit e540341508ce).
>> Bisection pointed to commit:
>> 129dab6e1286 ("iommu/vt-d: Use cache_tag_flush_range_np() in
>> iotlb_sync_map").
>>
>> The issue occurs when running fio against two NVMe devices located
>> under the same PCIe bridge (dual-port NVMe configuration). Performance
>> drops compared to configurations where the devices are on different
>> bridges.
>>
>> Observed Performance:
>> - Before the commit: ~6150 MiB/s, regardless of NVMe device placement.
>> - After the commit:
>> -- Same PCIe bridge: ~4985 MiB/s
>> -- Different PCIe bridges: ~6150 MiB/s
>>
>> Currently we can only reproduce the issue on a Z3 metal instance on
>> GCP. I suspect the issue is reproducible on any machine with a
>> dual-port NVMe.
>> At [1] there's a more detailed description of the issue and details
>> on the reproducer.
>
> This test was running on bare metal hardware instead of in a
> virtualization guest, right? If that's the case,
> cache_tag_flush_range_np() is almost a no-op.
>
> Can you please show me the capability register of the IOMMU by:
>
> # cat /sys/bus/pci/devices/[pci_dev_name]/iommu/intel-iommu/cap

Also, can you please try whether the below changes make any difference?
I've also attached a patch file to this email so you can apply the
change more easily.
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 7aa3932251b2..f60201ee4be0 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1796,6 +1796,18 @@ static int domain_setup_first_level(struct intel_iommu *iommu,
 					  (pgd_t *)pgd, flags, old);
 }
 
+static bool domain_need_iotlb_sync_map(struct dmar_domain *domain,
+				       struct intel_iommu *iommu)
+{
+	if (cap_caching_mode(iommu->cap) && !domain->use_first_level)
+		return true;
+
+	if (rwbf_quirk || cap_rwbf(iommu->cap))
+		return true;
+
+	return false;
+}
+
 static int dmar_domain_attach_device(struct dmar_domain *domain,
 				     struct device *dev)
 {
@@ -1833,6 +1845,8 @@ static int dmar_domain_attach_device(struct dmar_domain *domain,
 	if (ret)
 		goto out_block_translation;
 
+	domain->iotlb_sync_map |= domain_need_iotlb_sync_map(domain, iommu);
+
 	return 0;
 
 out_block_translation:
@@ -3945,7 +3959,10 @@ static bool risky_device(struct pci_dev *pdev)
 static int intel_iommu_iotlb_sync_map(struct iommu_domain *domain,
 				      unsigned long iova, size_t size)
 {
-	cache_tag_flush_range_np(to_dmar_domain(domain), iova, iova + size - 1);
+	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+
+	if (dmar_domain->iotlb_sync_map)
+		cache_tag_flush_range_np(dmar_domain, iova, iova + size - 1);
 
 	return 0;
 }
diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index 3ddbcc603de2..7ab2c34a5ecc 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -614,6 +614,9 @@ struct dmar_domain {
 	u8 has_mappings:1;		/* Has mappings configured through
 					 * iommu_map() interface.
 					 */
+	u8 iotlb_sync_map:1;		/* Need to flush IOTLB cache or write
+					 * buffer when creating mappings.
+					 */
 
 	spinlock_t lock;		/* Protect device tracking lists */
 	struct list_head devices;	/* all devices' list */
-- 
2.43.0

Thanks,
baolu

[-- Attachment #2: 0001-iommu-vt-d-Avoid-unnecessary-cache_tag_flush_range_n.patch --]
[-- Type: text/plain, Size: 2362 bytes --]

From ddc4210a3365147df978bd0bf45d824b9c869877 Mon Sep 17 00:00:00 2001
From: Lu Baolu <baolu.lu@linux.intel.com>
Date: Wed, 2 Jul 2025 16:51:48 +0800
Subject: [PATCH 1/1] iommu/vt-d: Avoid unnecessary cache_tag_flush_range_np()

For test purpose only!
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/intel/iommu.c | 19 ++++++++++++++++++-
 drivers/iommu/intel/iommu.h |  3 +++
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 7aa3932251b2..f60201ee4be0 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1796,6 +1796,18 @@ static int domain_setup_first_level(struct intel_iommu *iommu,
 					  (pgd_t *)pgd, flags, old);
 }
 
+static bool domain_need_iotlb_sync_map(struct dmar_domain *domain,
+				       struct intel_iommu *iommu)
+{
+	if (cap_caching_mode(iommu->cap) && !domain->use_first_level)
+		return true;
+
+	if (rwbf_quirk || cap_rwbf(iommu->cap))
+		return true;
+
+	return false;
+}
+
 static int dmar_domain_attach_device(struct dmar_domain *domain,
 				     struct device *dev)
 {
@@ -1833,6 +1845,8 @@ static int dmar_domain_attach_device(struct dmar_domain *domain,
 	if (ret)
 		goto out_block_translation;
 
+	domain->iotlb_sync_map |= domain_need_iotlb_sync_map(domain, iommu);
+
 	return 0;
 
 out_block_translation:
@@ -3945,7 +3959,10 @@ static bool risky_device(struct pci_dev *pdev)
 static int intel_iommu_iotlb_sync_map(struct iommu_domain *domain,
 				      unsigned long iova, size_t size)
 {
-	cache_tag_flush_range_np(to_dmar_domain(domain), iova, iova + size - 1);
+	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+
+	if (dmar_domain->iotlb_sync_map)
+		cache_tag_flush_range_np(dmar_domain, iova, iova + size - 1);
 
 	return 0;
 }
diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index 3ddbcc603de2..7ab2c34a5ecc 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -614,6 +614,9 @@ struct dmar_domain {
 	u8 has_mappings:1;		/* Has mappings configured through
 					 * iommu_map() interface.
 					 */
+	u8 iotlb_sync_map:1;		/* Need to flush IOTLB cache or write
+					 * buffer when creating mappings.
+					 */
 
 	spinlock_t lock;		/* Protect device tracking lists */
 	struct list_head devices;	/* all devices' list */
-- 
2.43.0

^ permalink raw reply related	[flat|nested] 5+ messages in thread
* Re: [REGRESSION][BISECTED] Performance Regression in IOMMU/VT-d Since Kernel 6.10
  2025-07-02  9:00   ` Baolu Lu
@ 2025-07-02 16:45     ` Ioanna Alifieraki
  2025-07-03  2:03       ` Baolu Lu
  0 siblings, 1 reply; 5+ messages in thread
From: Ioanna Alifieraki @ 2025-07-02 16:45 UTC (permalink / raw)
  To: Baolu Lu
  Cc: kevin.tian, jroedel, robin.murphy, will, joro, dwmw2, iommu,
	linux-kernel, regressions, stable

On Wed, Jul 2, 2025 at 12:00 PM Baolu Lu <baolu.lu@linux.intel.com> wrote:
>
> On 7/2/2025 1:14 PM, Baolu Lu wrote:
> > On 7/2/25 01:11, Ioanna Alifieraki wrote:
> >> #regzbot introduced: 129dab6e1286
> >>
> >> Hello everyone,
> >>
> >> We've identified a performance regression that starts with Linux
> >> kernel 6.10 and persists through 6.16 (tested at commit e540341508ce).
> >> Bisection pointed to commit:
> >> 129dab6e1286 ("iommu/vt-d: Use cache_tag_flush_range_np() in
> >> iotlb_sync_map").
> >>
> >> The issue occurs when running fio against two NVMe devices located
> >> under the same PCIe bridge (dual-port NVMe configuration). Performance
> >> drops compared to configurations where the devices are on different
> >> bridges.
> >>
> >> Observed Performance:
> >> - Before the commit: ~6150 MiB/s, regardless of NVMe device placement.
> >> - After the commit:
> >> -- Same PCIe bridge: ~4985 MiB/s
> >> -- Different PCIe bridges: ~6150 MiB/s
> >>
> >> Currently we can only reproduce the issue on a Z3 metal instance on
> >> GCP. I suspect the issue is reproducible on any machine with a
> >> dual-port NVMe.
> >> At [1] there's a more detailed description of the issue and details
> >> on the reproducer.
> >
> > This test was running on bare metal hardware instead of in a
> > virtualization guest, right? If that's the case,
> > cache_tag_flush_range_np() is almost a no-op.
> >
> > Can you please show me the capability register of the IOMMU by:
> >
> > # cat /sys/bus/pci/devices/[pci_dev_name]/iommu/intel-iommu/cap
>
> Also, can you please try whether the below changes make any difference?
> I've also attached a patch file to this email so you can apply the
> change more easily.

Thanks for the patch Baolu. I've tested it and I can confirm we get
~6150 MiB/s for NVMe pairs both under the same bridge and under
different bridges.

The output of
cat /sys/bus/pci/devices/[pci_dev_name]/iommu/intel-iommu/cap
is 19ed008c40780c66 for all NVMes.

I got confirmation there's no virtualization happening on this instance
at all.

FWIW, I had run perf when initially investigating the issue and it was
showing quite some time spent in cache_tag_flush_range_np().

Thanks again!
Jo

> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index 7aa3932251b2..f60201ee4be0 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -1796,6 +1796,18 @@ static int domain_setup_first_level(struct intel_iommu *iommu,
>  					  (pgd_t *)pgd, flags, old);
>  }
>
> +static bool domain_need_iotlb_sync_map(struct dmar_domain *domain,
> +				       struct intel_iommu *iommu)
> +{
> +	if (cap_caching_mode(iommu->cap) && !domain->use_first_level)
> +		return true;
> +
> +	if (rwbf_quirk || cap_rwbf(iommu->cap))
> +		return true;
> +
> +	return false;
> +}
> +
>  static int dmar_domain_attach_device(struct dmar_domain *domain,
>  				     struct device *dev)
>  {
> @@ -1833,6 +1845,8 @@ static int dmar_domain_attach_device(struct dmar_domain *domain,
>  	if (ret)
>  		goto out_block_translation;
>
> +	domain->iotlb_sync_map |= domain_need_iotlb_sync_map(domain, iommu);
> +
>  	return 0;
>
>  out_block_translation:
> @@ -3945,7 +3959,10 @@ static bool risky_device(struct pci_dev *pdev)
>  static int intel_iommu_iotlb_sync_map(struct iommu_domain *domain,
>  				      unsigned long iova, size_t size)
>  {
> -	cache_tag_flush_range_np(to_dmar_domain(domain), iova, iova + size - 1);
> +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> +
> +	if (dmar_domain->iotlb_sync_map)
> +		cache_tag_flush_range_np(dmar_domain, iova, iova + size - 1);
>
>  	return 0;
>  }
> diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
> index 3ddbcc603de2..7ab2c34a5ecc 100644
> --- a/drivers/iommu/intel/iommu.h
> +++ b/drivers/iommu/intel/iommu.h
> @@ -614,6 +614,9 @@ struct dmar_domain {
>  	u8 has_mappings:1;		/* Has mappings configured through
>  					 * iommu_map() interface.
>  					 */
> +	u8 iotlb_sync_map:1;		/* Need to flush IOTLB cache or write
> +					 * buffer when creating mappings.
> +					 */
>
>  	spinlock_t lock;		/* Protect device tracking lists */
>  	struct list_head devices;	/* all devices' list */
> --
> 2.43.0
>
> Thanks,
> baolu

^ permalink raw reply	[flat|nested] 5+ messages in thread
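[As a worked check of the value reported above: 0x19ed008c40780c66 has a low
byte of 0x66 (binary 0110 0110), so bit 4 (RWBF) and bit 7 (CM) are both 0.
Assuming rwbf_quirk is not set on this platform, the proposed
domain_need_iotlb_sync_map() returns false, the patched
intel_iommu_iotlb_sync_map() skips cache_tag_flush_range_np() entirely, and
that is consistent with throughput returning to ~6150 MiB/s in the test
above.]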
* Re: [REGRESSION][BISECTED] Performance Regression in IOMMU/VT-d Since Kernel 6.10
  2025-07-02 16:45     ` Ioanna Alifieraki
@ 2025-07-03  2:03       ` Baolu Lu
  0 siblings, 0 replies; 5+ messages in thread
From: Baolu Lu @ 2025-07-03  2:03 UTC (permalink / raw)
  To: Ioanna Alifieraki
  Cc: kevin.tian, jroedel, robin.murphy, will, joro, dwmw2, iommu,
	linux-kernel, regressions, stable

On 7/3/25 00:45, Ioanna Alifieraki wrote:
> On Wed, Jul 2, 2025 at 12:00 PM Baolu Lu <baolu.lu@linux.intel.com> wrote:
>> On 7/2/2025 1:14 PM, Baolu Lu wrote:
>>> On 7/2/25 01:11, Ioanna Alifieraki wrote:
>>>> #regzbot introduced: 129dab6e1286
>>>>
>>>> Hello everyone,
>>>>
>>>> We've identified a performance regression that starts with Linux
>>>> kernel 6.10 and persists through 6.16 (tested at commit e540341508ce).
>>>> Bisection pointed to commit:
>>>> 129dab6e1286 ("iommu/vt-d: Use cache_tag_flush_range_np() in
>>>> iotlb_sync_map").
>>>>
>>>> The issue occurs when running fio against two NVMe devices located
>>>> under the same PCIe bridge (dual-port NVMe configuration). Performance
>>>> drops compared to configurations where the devices are on different
>>>> bridges.
>>>>
>>>> Observed Performance:
>>>> - Before the commit: ~6150 MiB/s, regardless of NVMe device placement.
>>>> - After the commit:
>>>> -- Same PCIe bridge: ~4985 MiB/s
>>>> -- Different PCIe bridges: ~6150 MiB/s
>>>>
>>>> Currently we can only reproduce the issue on a Z3 metal instance on
>>>> GCP. I suspect the issue is reproducible on any machine with a
>>>> dual-port NVMe.
>>>> At [1] there's a more detailed description of the issue and details
>>>> on the reproducer.
>>>
>>> This test was running on bare metal hardware instead of in a
>>> virtualization guest, right? If that's the case,
>>> cache_tag_flush_range_np() is almost a no-op.
>>>
>>> Can you please show me the capability register of the IOMMU by:
>>>
>>> # cat /sys/bus/pci/devices/[pci_dev_name]/iommu/intel-iommu/cap
>>
>> Also, can you please try whether the below changes make any difference?
>> I've also attached a patch file to this email so you can apply the
>> change more easily.
>
> Thanks for the patch Baolu. I've tested it and I can confirm we get
> ~6150 MiB/s for NVMe pairs both under the same bridge and under
> different bridges.
>
> The output of
> cat /sys/bus/pci/devices/[pci_dev_name]/iommu/intel-iommu/cap
> is 19ed008c40780c66 for all NVMes.
>
> I got confirmation there's no virtualization happening on this instance
> at all.
>
> FWIW, I had run perf when initially investigating the issue and it was
> showing quite some time spent in cache_tag_flush_range_np().

Okay, I will post a formal fix patch for this. Thank you!

Thanks,
baolu

^ permalink raw reply	[flat|nested] 5+ messages in thread