linux-pci.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH] PCI: avoid SBR for NVIDIA T4
@ 2023-03-29 11:58 Wu Zongyong
  2023-03-29 17:05 ` Bjorn Helgaas
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Wu Zongyong @ 2023-03-29 11:58 UTC (permalink / raw)
  To: sdonthineni, bhelgaas, linux-pci, linux-kernel
  Cc: Wu Zongyong, wllenyj, wutu.xq2, gerry

Secondary bus reset will fail if NVIDIA T4 card is direct attached to a
root port.
So avoid to do bus reset,  pci_parent_bus_reset() works nomarlly.

Maybe NVIDIA guys can do some detailed explanation abount the SBR
behaviour of GPUs.

Signed-off-by: Wu Zongyong <wuzongyong@linux.alibaba.com>
---
 drivers/pci/quirks.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 44cab813bf95..be86b93b9e38 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -3618,7 +3618,7 @@ static void quirk_no_bus_reset(struct pci_dev *dev)
  */
 static void quirk_nvidia_no_bus_reset(struct pci_dev *dev)
 {
-	if ((dev->device & 0xffc0) == 0x2340)
+	if ((dev->device & 0xffc0) == 0x2340 || dev->device == 0x1eb8)
 		quirk_no_bus_reset(dev);
 }
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
-- 
2.34.3


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH] PCI: avoid SBR for NVIDIA T4
  2023-03-29 11:58 [RFC PATCH] PCI: avoid SBR for NVIDIA T4 Wu Zongyong
@ 2023-03-29 17:05 ` Bjorn Helgaas
  2023-03-30  2:10   ` Wu Zongyong
  2023-03-30  2:41 ` Lukas Wunner
  2023-04-10 12:34 ` [PATCH] PCI: Mark NVIDIA T4 GPUs to avoid bus reset Wu Zongyong
  2 siblings, 1 reply; 13+ messages in thread
From: Bjorn Helgaas @ 2023-03-29 17:05 UTC (permalink / raw)
  To: Wu Zongyong
  Cc: sdonthineni, bhelgaas, linux-pci, linux-kernel, wllenyj, wutu.xq2,
	gerry

On Wed, Mar 29, 2023 at 07:58:45PM +0800, Wu Zongyong wrote:
> Secondary bus reset will fail if NVIDIA T4 card is direct attached to a
> root port.

Blank line between paragraphs.  Rewrap to fill 75 columns if it's a
single paragraph.

Is this only a problem when direct attached to a Root Port?  Why would
that be?  If it's *not* related to being directly under a Root Port,
don't mention that at all.

> So avoid to do bus reset,  pci_parent_bus_reset() works nomarlly.
> 
> Maybe NVIDIA guys can do some detailed explanation abount the SBR
> behaviour of GPUs.

This is a follow-on to 4c207e7121fa ("PCI: Mark some NVIDIA GPUs to
avoid bus reset"), so probably should have a Fixes: tag so it goes
whereever that commit goes.

Also copy the subject line from 4c207e7121fa, e.g.,

  PCI: Mark NVIDIA T4 GPUs to avoid bus reset

Are there any problem reports or bugzilla issues you can include a URL
to?

> Signed-off-by: Wu Zongyong <wuzongyong@linux.alibaba.com>
> ---
>  drivers/pci/quirks.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 44cab813bf95..be86b93b9e38 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -3618,7 +3618,7 @@ static void quirk_no_bus_reset(struct pci_dev *dev)
>   */
>  static void quirk_nvidia_no_bus_reset(struct pci_dev *dev)
>  {
> -	if ((dev->device & 0xffc0) == 0x2340)
> +	if ((dev->device & 0xffc0) == 0x2340 || dev->device == 0x1eb8)
>  		quirk_no_bus_reset(dev);
>  }
>  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
> -- 
> 2.34.3
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH] PCI: avoid SBR for NVIDIA T4
  2023-03-29 17:05 ` Bjorn Helgaas
@ 2023-03-30  2:10   ` Wu Zongyong
  2023-03-30 15:49     ` Bjorn Helgaas
  0 siblings, 1 reply; 13+ messages in thread
From: Wu Zongyong @ 2023-03-30  2:10 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: sdonthineni, bhelgaas, linux-pci, linux-kernel, wllenyj, wutu.xq2,
	gerry

On Wed, Mar 29, 2023 at 12:05:15PM -0500, Bjorn Helgaas wrote:
> On Wed, Mar 29, 2023 at 07:58:45PM +0800, Wu Zongyong wrote:
> > Secondary bus reset will fail if NVIDIA T4 card is direct attached to a
> > root port.
> 
> Blank line between paragraphs.  Rewrap to fill 75 columns if it's a
> single paragraph.
Will be fixed.
> 
> Is this only a problem when direct attached to a Root Port?  Why would
> that be?  If it's *not* related to being directly under a Root Port,
> don't mention that at all.
Yes, this problem occurs only when the T4 card is direct attached to a
Root Port.
I have test it with a T4 card attached to a PCIe Switch or a PCI Bridge,
and it works well.

> 
> > So avoid to do bus reset,  pci_parent_bus_reset() works nomarlly.
> > 
> > Maybe NVIDIA guys can do some detailed explanation abount the SBR
> > behaviour of GPUs.
> 
> This is a follow-on to 4c207e7121fa ("PCI: Mark some NVIDIA GPUs to
> avoid bus reset"), so probably should have a Fixes: tag so it goes
> whereever that commit goes.
> 
> Also copy the subject line from 4c207e7121fa, e.g.,
> 
>   PCI: Mark NVIDIA T4 GPUs to avoid bus reset
Will be fixed too.
> 
> Are there any problem reports or bugzilla issues you can include a URL
> to?
No, I just find the problem in our test environment and I didn't find a
similar report.
> 
> > Signed-off-by: Wu Zongyong <wuzongyong@linux.alibaba.com>
> > ---
> >  drivers/pci/quirks.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > index 44cab813bf95..be86b93b9e38 100644
> > --- a/drivers/pci/quirks.c
> > +++ b/drivers/pci/quirks.c
> > @@ -3618,7 +3618,7 @@ static void quirk_no_bus_reset(struct pci_dev *dev)
> >   */
> >  static void quirk_nvidia_no_bus_reset(struct pci_dev *dev)
> >  {
> > -	if ((dev->device & 0xffc0) == 0x2340)
> > +	if ((dev->device & 0xffc0) == 0x2340 || dev->device == 0x1eb8)
> >  		quirk_no_bus_reset(dev);
> >  }
> >  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
> > -- 
> > 2.34.3
> > 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH] PCI: avoid SBR for NVIDIA T4
  2023-03-29 11:58 [RFC PATCH] PCI: avoid SBR for NVIDIA T4 Wu Zongyong
  2023-03-29 17:05 ` Bjorn Helgaas
@ 2023-03-30  2:41 ` Lukas Wunner
  2023-04-10 12:34 ` [PATCH] PCI: Mark NVIDIA T4 GPUs to avoid bus reset Wu Zongyong
  2 siblings, 0 replies; 13+ messages in thread
From: Lukas Wunner @ 2023-03-30  2:41 UTC (permalink / raw)
  To: Wu Zongyong
  Cc: sdonthineni, bhelgaas, linux-pci, linux-kernel, wllenyj, wutu.xq2,
	gerry

On Wed, Mar 29, 2023 at 07:58:45PM +0800, Wu Zongyong wrote:
> Secondary bus reset will fail if NVIDIA T4 card is direct attached to a
> root port.
> So avoid to do bus reset,  pci_parent_bus_reset() works nomarlly.

If you cherry-pick commits

  8ef0217227b4 ("PCI/PM: Observe reset delay irrespective of bridge_d3")
  ac91e6980563 ("PCI: Unify delay handling for reset and resume")

from v6.3-rc1, does the issue still persist?

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH] PCI: avoid SBR for NVIDIA T4
  2023-03-30  2:10   ` Wu Zongyong
@ 2023-03-30 15:49     ` Bjorn Helgaas
  2023-03-31  2:11       ` Wu Zongyong
  0 siblings, 1 reply; 13+ messages in thread
From: Bjorn Helgaas @ 2023-03-30 15:49 UTC (permalink / raw)
  To: Wu Zongyong
  Cc: sdonthineni, bhelgaas, linux-pci, linux-kernel, wllenyj, wutu.xq2,
	gerry

On Thu, Mar 30, 2023 at 10:10:16AM +0800, Wu Zongyong wrote:
> On Wed, Mar 29, 2023 at 12:05:15PM -0500, Bjorn Helgaas wrote:
> > On Wed, Mar 29, 2023 at 07:58:45PM +0800, Wu Zongyong wrote:
> > > Secondary bus reset will fail if NVIDIA T4 card is direct attached to a
> > > root port.
> > 
> > Is this only a problem when direct attached to a Root Port?  Why would
> > that be?  If it's *not* related to being directly under a Root Port,
> > don't mention that at all.
>
> Yes, this problem occurs only when the T4 card is direct attached to a
> Root Port.
> I have test it with a T4 card attached to a PCIe Switch or a PCI Bridge,
> and it works well.

From an electrical and protocol point of view, the device should not
be able to tell the difference, so Lukas' suggestion that it may be
related to reset delays seems very pertinent.

Bjorn

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH] PCI: avoid SBR for NVIDIA T4
  2023-03-30 15:49     ` Bjorn Helgaas
@ 2023-03-31  2:11       ` Wu Zongyong
  2023-04-03  4:02         ` Wu Zongyong
  0 siblings, 1 reply; 13+ messages in thread
From: Wu Zongyong @ 2023-03-31  2:11 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: sdonthineni, bhelgaas, linux-pci, linux-kernel, wllenyj, wutu.xq2,
	gerry

On Thu, Mar 30, 2023 at 10:49:26AM -0500, Bjorn Helgaas wrote:
> On Thu, Mar 30, 2023 at 10:10:16AM +0800, Wu Zongyong wrote:
> > On Wed, Mar 29, 2023 at 12:05:15PM -0500, Bjorn Helgaas wrote:
> > > On Wed, Mar 29, 2023 at 07:58:45PM +0800, Wu Zongyong wrote:
> > > > Secondary bus reset will fail if NVIDIA T4 card is direct attached to a
> > > > root port.
> > > 
> > > Is this only a problem when direct attached to a Root Port?  Why would
> > > that be?  If it's *not* related to being directly under a Root Port,
> > > don't mention that at all.
> >
> > Yes, this problem occurs only when the T4 card is direct attached to a
> > Root Port.
> > I have test it with a T4 card attached to a PCIe Switch or a PCI Bridge,
> > and it works well.
> 
> From an electrical and protocol point of view, the device should not
> be able to tell the difference, so Lukas' suggestion that it may be
> related to reset delays seems very pertinent.
I will test it with the commits mentioned above.
But it may take some time since it is not easy to replace kernel in our
environment.
> 
> Bjorn

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH] PCI: avoid SBR for NVIDIA T4
  2023-03-31  2:11       ` Wu Zongyong
@ 2023-04-03  4:02         ` Wu Zongyong
  0 siblings, 0 replies; 13+ messages in thread
From: Wu Zongyong @ 2023-04-03  4:02 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: sdonthineni, bhelgaas, linux-pci, linux-kernel, wllenyj, wutu.xq2,
	gerry

On Fri, Mar 31, 2023 at 10:11:15AM +0800, Wu Zongyong wrote:
> On Thu, Mar 30, 2023 at 10:49:26AM -0500, Bjorn Helgaas wrote:
> > On Thu, Mar 30, 2023 at 10:10:16AM +0800, Wu Zongyong wrote:
> > > On Wed, Mar 29, 2023 at 12:05:15PM -0500, Bjorn Helgaas wrote:
> > > > On Wed, Mar 29, 2023 at 07:58:45PM +0800, Wu Zongyong wrote:
> > > > > Secondary bus reset will fail if NVIDIA T4 card is direct attached to a
> > > > > root port.
> > > > 
> > > > Is this only a problem when direct attached to a Root Port?  Why would
> > > > that be?  If it's *not* related to being directly under a Root Port,
> > > > don't mention that at all.
> > >
> > > Yes, this problem occurs only when the T4 card is direct attached to a
> > > Root Port.
> > > I have test it with a T4 card attached to a PCIe Switch or a PCI Bridge,
> > > and it works well.
> > 
> > From an electrical and protocol point of view, the device should not
> > be able to tell the difference, so Lukas' suggestion that it may be
> > related to reset delays seems very pertinent.
> I will test it with the commits mentioned above.
> But it may take some time since it is not easy to replace kernel in our
> environment.

I have tested it with Lukas' suggestion and it didn't work for T4 cards.

My base kernel is v5.10, and I cherry-picked the following patches:

  730643d33e2d ("PCI/PM: Resume subordinate bus in bus type callbacks")
  8ef0217227b4 ("PCI/PM: Observe reset delay irrespective of bridge_d3")
  ac91e6980563 ("PCI: Unify delay handling for reset and resume")

Any other necessary patches I should apply?

> > 
> > Bjorn

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH] PCI: Mark NVIDIA T4 GPUs to avoid bus reset
  2023-03-29 11:58 [RFC PATCH] PCI: avoid SBR for NVIDIA T4 Wu Zongyong
  2023-03-29 17:05 ` Bjorn Helgaas
  2023-03-30  2:41 ` Lukas Wunner
@ 2023-04-10 12:34 ` Wu Zongyong
  2023-04-26  8:13   ` Wu Zongyong
  2023-08-09 23:05   ` Bjorn Helgaas
  2 siblings, 2 replies; 13+ messages in thread
From: Wu Zongyong @ 2023-04-10 12:34 UTC (permalink / raw)
  To: lukas, sdonthineni, bhelgaas, linux-pci, linux-kernel
  Cc: Wu Zongyong, wllenyj, wutu.xq2, gerry

NVIDIA T4 GPUs do not work with SBR. This problem is found when the T4
card is direct attached to a Root Port only. So avoid bus reset by
marking T4 GPUs PCI_DEV_FLAGS_NO_BUS_RESET.

Fixes: 4c207e7121fa ("PCI: Mark some NVIDIA GPUs to avoid bus reset")
Signed-off-by: Wu Zongyong <wuzongyong@linux.alibaba.com>
---
 drivers/pci/quirks.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 44cab813bf95..be86b93b9e38 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -3618,7 +3618,7 @@ static void quirk_no_bus_reset(struct pci_dev *dev)
  */
 static void quirk_nvidia_no_bus_reset(struct pci_dev *dev)
 {
-	if ((dev->device & 0xffc0) == 0x2340)
+	if ((dev->device & 0xffc0) == 0x2340 || dev->device == 0x1eb8)
 		quirk_no_bus_reset(dev);
 }
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
-- 
2.34.3


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH] PCI: Mark NVIDIA T4 GPUs to avoid bus reset
  2023-04-10 12:34 ` [PATCH] PCI: Mark NVIDIA T4 GPUs to avoid bus reset Wu Zongyong
@ 2023-04-26  8:13   ` Wu Zongyong
  2023-08-09 23:05   ` Bjorn Helgaas
  1 sibling, 0 replies; 13+ messages in thread
From: Wu Zongyong @ 2023-04-26  8:13 UTC (permalink / raw)
  To: lukas, sdonthineni, bhelgaas, linux-pci, linux-kernel
  Cc: wllenyj, wutu.xq2, gerry

On Mon, Apr 10, 2023 at 08:34:11PM +0800, Wu Zongyong wrote:
> NVIDIA T4 GPUs do not work with SBR. This problem is found when the T4
> card is direct attached to a Root Port only. So avoid bus reset by
> marking T4 GPUs PCI_DEV_FLAGS_NO_BUS_RESET.
> 
> Fixes: 4c207e7121fa ("PCI: Mark some NVIDIA GPUs to avoid bus reset")
> Signed-off-by: Wu Zongyong <wuzongyong@linux.alibaba.com>
> ---
>  drivers/pci/quirks.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 44cab813bf95..be86b93b9e38 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -3618,7 +3618,7 @@ static void quirk_no_bus_reset(struct pci_dev *dev)
>   */
>  static void quirk_nvidia_no_bus_reset(struct pci_dev *dev)
>  {
> -	if ((dev->device & 0xffc0) == 0x2340)
> +	if ((dev->device & 0xffc0) == 0x2340 || dev->device == 0x1eb8)
>  		quirk_no_bus_reset(dev);
>  }
>  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
> -- 
> 2.34.3
Any further comments about this patch? 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] PCI: Mark NVIDIA T4 GPUs to avoid bus reset
  2023-04-10 12:34 ` [PATCH] PCI: Mark NVIDIA T4 GPUs to avoid bus reset Wu Zongyong
  2023-04-26  8:13   ` Wu Zongyong
@ 2023-08-09 23:05   ` Bjorn Helgaas
  2023-09-08  2:50     ` Wu Zongyong
  1 sibling, 1 reply; 13+ messages in thread
From: Bjorn Helgaas @ 2023-08-09 23:05 UTC (permalink / raw)
  To: Wu Zongyong
  Cc: lukas, sdonthineni, bhelgaas, linux-pci, linux-kernel, wllenyj,
	wutu.xq2, gerry

On Mon, Apr 10, 2023 at 08:34:11PM +0800, Wu Zongyong wrote:
> NVIDIA T4 GPUs do not work with SBR. This problem is found when the T4
> card is direct attached to a Root Port only. So avoid bus reset by
> marking T4 GPUs PCI_DEV_FLAGS_NO_BUS_RESET.
> 
> Fixes: 4c207e7121fa ("PCI: Mark some NVIDIA GPUs to avoid bus reset")
> Signed-off-by: Wu Zongyong <wuzongyong@linux.alibaba.com>

Applied to pci/virtualization for v6.6, thanks!

> ---
>  drivers/pci/quirks.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 44cab813bf95..be86b93b9e38 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -3618,7 +3618,7 @@ static void quirk_no_bus_reset(struct pci_dev *dev)
>   */
>  static void quirk_nvidia_no_bus_reset(struct pci_dev *dev)
>  {
> -	if ((dev->device & 0xffc0) == 0x2340)
> +	if ((dev->device & 0xffc0) == 0x2340 || dev->device == 0x1eb8)
>  		quirk_no_bus_reset(dev);
>  }
>  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
> -- 
> 2.34.3
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] PCI: Mark NVIDIA T4 GPUs to avoid bus reset
  2023-08-09 23:05   ` Bjorn Helgaas
@ 2023-09-08  2:50     ` Wu Zongyong
  2023-09-08  3:40       ` Alex Williamson
  0 siblings, 1 reply; 13+ messages in thread
From: Wu Zongyong @ 2023-09-08  2:50 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Wu Zongyong, lukas, sdonthineni, bhelgaas, linux-pci,
	linux-kernel, wllenyj, wutu.xq2, gerry, pjaroszynski

On Wed, Aug 09, 2023 at 06:05:18PM -0500, Bjorn Helgaas wrote:
> On Mon, Apr 10, 2023 at 08:34:11PM +0800, Wu Zongyong wrote:
> > NVIDIA T4 GPUs do not work with SBR. This problem is found when the T4
> > card is direct attached to a Root Port only. So avoid bus reset by
> > marking T4 GPUs PCI_DEV_FLAGS_NO_BUS_RESET.
> > 
> > Fixes: 4c207e7121fa ("PCI: Mark some NVIDIA GPUs to avoid bus reset")
> > Signed-off-by: Wu Zongyong <wuzongyong@linux.alibaba.com>
> 
> Applied to pci/virtualization for v6.6, thanks!

I talk about the issue with NVIDIA, and they think the issue is probably related
the pci link instead of the T4 GPU card.

I will try to describe the issue I met in detail.

The T4 card which is direct attached to a Root Port and I rebind it to
vfio-pci driver. Then I try to use to call some vfio-related api and the
ioctl VFIO_GROUP_GET_DEVICE_FD failed.

The stack is (base on kernel v5.10):
    vfio_group_fops_unl_ioctl
         vfio_group_get_device_fd
            vfio_pci_open
                vfio_pci_enable // return value is -19
                    pci_try_reset_function
                        __pci_reset_function_locked

After the __pci_reset_function_locked(), the dmesg shows:
   [12207494.508467] pcieport 0000:3f:00.0: pciehp: Slot(5-1): Link Down
   [12207494.508535] vfio-pci 0000:40:00.0: No device request channel registered, blocked until released by user
   [12207494.518426] pci 0000:40:00.0: Removing from iommu group 84
   [12207495.532365] pcieport 0000:3f:00.0: pciehp: Slot(5-1): Card present
   [12207495.532367] pcieport 0000:3f:00.0: pciehp: Slot(5-1): Link Up

NVIDIA people thinks this root port is not going through this reset logic and getting the
link down/hot plug interrupts[1].

Can you revert the patch I sent and maybe we should dig it deeply.

> 
> > ---
> >  drivers/pci/quirks.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > index 44cab813bf95..be86b93b9e38 100644
> > --- a/drivers/pci/quirks.c
> > +++ b/drivers/pci/quirks.c
> > @@ -3618,7 +3618,7 @@ static void quirk_no_bus_reset(struct pci_dev *dev)
> >   */
> >  static void quirk_nvidia_no_bus_reset(struct pci_dev *dev)
> >  {
> > -	if ((dev->device & 0xffc0) == 0x2340)
> > +	if ((dev->device & 0xffc0) == 0x2340 || dev->device == 0x1eb8)
> >  		quirk_no_bus_reset(dev);
> >  }
> >  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
> > -- 
> > 2.34.3
> > 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] PCI: Mark NVIDIA T4 GPUs to avoid bus reset
  2023-09-08  2:50     ` Wu Zongyong
@ 2023-09-08  3:40       ` Alex Williamson
  2023-09-08 20:11         ` Bjorn Helgaas
  0 siblings, 1 reply; 13+ messages in thread
From: Alex Williamson @ 2023-09-08  3:40 UTC (permalink / raw)
  To: Wu Zongyong
  Cc: Bjorn Helgaas, lukas, sdonthineni, bhelgaas, linux-pci,
	linux-kernel, wllenyj, wutu.xq2, gerry, pjaroszynski

On Fri, 8 Sep 2023 10:50:48 +0800
Wu Zongyong <wuzongyong@linux.alibaba.com> wrote:

> On Wed, Aug 09, 2023 at 06:05:18PM -0500, Bjorn Helgaas wrote:
> > On Mon, Apr 10, 2023 at 08:34:11PM +0800, Wu Zongyong wrote:  
> > > NVIDIA T4 GPUs do not work with SBR. This problem is found when the T4
> > > card is direct attached to a Root Port only. So avoid bus reset by
> > > marking T4 GPUs PCI_DEV_FLAGS_NO_BUS_RESET.
> > > 
> > > Fixes: 4c207e7121fa ("PCI: Mark some NVIDIA GPUs to avoid bus reset")
> > > Signed-off-by: Wu Zongyong <wuzongyong@linux.alibaba.com>  
> > 
> > Applied to pci/virtualization for v6.6, thanks!  
> 
> I talk about the issue with NVIDIA, and they think the issue is probably related
> the pci link instead of the T4 GPU card.
> 
> I will try to describe the issue I met in detail.
> 
> The T4 card which is direct attached to a Root Port and I rebind it to
> vfio-pci driver. Then I try to use to call some vfio-related api and the
> ioctl VFIO_GROUP_GET_DEVICE_FD failed.
> 
> The stack is (base on kernel v5.10):
>     vfio_group_fops_unl_ioctl
>          vfio_group_get_device_fd
>             vfio_pci_open
>                 vfio_pci_enable // return value is -19
>                     pci_try_reset_function
>                         __pci_reset_function_locked
> 
> After the __pci_reset_function_locked(), the dmesg shows:
>    [12207494.508467] pcieport 0000:3f:00.0: pciehp: Slot(5-1): Link Down
>    [12207494.508535] vfio-pci 0000:40:00.0: No device request channel registered, blocked until released by user
>    [12207494.518426] pci 0000:40:00.0: Removing from iommu group 84
>    [12207495.532365] pcieport 0000:3f:00.0: pciehp: Slot(5-1): Card present
>    [12207495.532367] pcieport 0000:3f:00.0: pciehp: Slot(5-1): Link Up
> 
> NVIDIA people thinks this root port is not going through this reset logic and getting the
> link down/hot plug interrupts[1].
> 
> Can you revert the patch I sent and maybe we should dig it deeply.

Yes, please revert, we do testing with T4 and have not seen any issues
with bus reset.  The T4 provides neither PM nor FLR reset, so masking
bus reset compromises this device for assignment scenarios.  I can send
a revert patch if requested.  Thanks,

Alex

> > > ---
> > >  drivers/pci/quirks.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > > index 44cab813bf95..be86b93b9e38 100644
> > > --- a/drivers/pci/quirks.c
> > > +++ b/drivers/pci/quirks.c
> > > @@ -3618,7 +3618,7 @@ static void quirk_no_bus_reset(struct pci_dev *dev)
> > >   */
> > >  static void quirk_nvidia_no_bus_reset(struct pci_dev *dev)
> > >  {
> > > -	if ((dev->device & 0xffc0) == 0x2340)
> > > +	if ((dev->device & 0xffc0) == 0x2340 || dev->device == 0x1eb8)
> > >  		quirk_no_bus_reset(dev);
> > >  }
> > >  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
> > > -- 
> > > 2.34.3
> > >   
> 


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] PCI: Mark NVIDIA T4 GPUs to avoid bus reset
  2023-09-08  3:40       ` Alex Williamson
@ 2023-09-08 20:11         ` Bjorn Helgaas
  0 siblings, 0 replies; 13+ messages in thread
From: Bjorn Helgaas @ 2023-09-08 20:11 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Wu Zongyong, lukas, sdonthineni, bhelgaas, linux-pci,
	linux-kernel, wllenyj, wutu.xq2, gerry, pjaroszynski

On Thu, Sep 07, 2023 at 09:40:37PM -0600, Alex Williamson wrote:
> On Fri, 8 Sep 2023 10:50:48 +0800
> Wu Zongyong <wuzongyong@linux.alibaba.com> wrote:
> 
> > On Wed, Aug 09, 2023 at 06:05:18PM -0500, Bjorn Helgaas wrote:
> > > On Mon, Apr 10, 2023 at 08:34:11PM +0800, Wu Zongyong wrote:  
> > > > NVIDIA T4 GPUs do not work with SBR. This problem is found when the T4
> > > > card is direct attached to a Root Port only. So avoid bus reset by
> > > > marking T4 GPUs PCI_DEV_FLAGS_NO_BUS_RESET.
> > > > 
> > > > Fixes: 4c207e7121fa ("PCI: Mark some NVIDIA GPUs to avoid bus reset")
> > > > Signed-off-by: Wu Zongyong <wuzongyong@linux.alibaba.com>  
> > > 
> > > Applied to pci/virtualization for v6.6, thanks!  
> > 
> > I talk about the issue with NVIDIA, and they think the issue is probably related
> > the pci link instead of the T4 GPU card.
> > 
> > I will try to describe the issue I met in detail.
> > 
> > The T4 card which is direct attached to a Root Port and I rebind it to
> > vfio-pci driver. Then I try to use to call some vfio-related api and the
> > ioctl VFIO_GROUP_GET_DEVICE_FD failed.
> > 
> > The stack is (base on kernel v5.10):
> >     vfio_group_fops_unl_ioctl
> >          vfio_group_get_device_fd
> >             vfio_pci_open
> >                 vfio_pci_enable // return value is -19
> >                     pci_try_reset_function
> >                         __pci_reset_function_locked
> > 
> > After the __pci_reset_function_locked(), the dmesg shows:
> >    [12207494.508467] pcieport 0000:3f:00.0: pciehp: Slot(5-1): Link Down
> >    [12207494.508535] vfio-pci 0000:40:00.0: No device request channel registered, blocked until released by user
> >    [12207494.518426] pci 0000:40:00.0: Removing from iommu group 84
> >    [12207495.532365] pcieport 0000:3f:00.0: pciehp: Slot(5-1): Card present
> >    [12207495.532367] pcieport 0000:3f:00.0: pciehp: Slot(5-1): Link Up
> > 
> > NVIDIA people thinks this root port is not going through this reset logic and getting the
> > link down/hot plug interrupts[1].
> > 
> > Can you revert the patch I sent and maybe we should dig it deeply.
> 
> Yes, please revert, we do testing with T4 and have not seen any issues
> with bus reset.  The T4 provides neither PM nor FLR reset, so masking
> bus reset compromises this device for assignment scenarios.  I can send
> a revert patch if requested.  Thanks,

Reverted as below.  Hopefully this will make v6.6-rc1.

commit 42f5c40846f3 ("Revert "PCI: Mark NVIDIA T4 GPUs to avoid bus reset"")
Author: Bjorn Helgaas <bhelgaas@google.com>
Date:   Fri Sep 8 14:55:30 2023 -0500

    Revert "PCI: Mark NVIDIA T4 GPUs to avoid bus reset"
    
    This reverts commit d5af729dc2071273f14cbb94abbc60608142fd83.
    
    d5af729dc207 ("PCI: Mark NVIDIA T4 GPUs to avoid bus reset") avoided
    Secondary Bus Reset on the T4 because the reset seemed to not work when the
    T4 was directly attached to a Root Port.
    
    But NVIDIA thinks the issue is probably related to some issue with the Root
    Port, not with the T4.  The T4 provides neither PM nor FLR reset, so
    masking bus reset compromises this device for assignment scenarios.
    
    Revert d5af729dc207 as requested by Wu Zongyong.  This will leave SBR
    broken in the specific configuration Wu tested, as it was in v6.5, so Wu
    will debug that further.
    
    Link: https://lore.kernel.org/r/ZPqMCDWvITlOLHgJ@wuzongyong-alibaba
    Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>


diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 5de09d2eb014..eeec1d6f9023 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -3726,7 +3726,7 @@ static void quirk_no_bus_reset(struct pci_dev *dev)
  */
 static void quirk_nvidia_no_bus_reset(struct pci_dev *dev)
 {
-	if ((dev->device & 0xffc0) == 0x2340 || dev->device == 0x1eb8)
+	if ((dev->device & 0xffc0) == 0x2340)
 		quirk_no_bus_reset(dev);
 }
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,

^ permalink raw reply related	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2023-09-08 20:11 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-03-29 11:58 [RFC PATCH] PCI: avoid SBR for NVIDIA T4 Wu Zongyong
2023-03-29 17:05 ` Bjorn Helgaas
2023-03-30  2:10   ` Wu Zongyong
2023-03-30 15:49     ` Bjorn Helgaas
2023-03-31  2:11       ` Wu Zongyong
2023-04-03  4:02         ` Wu Zongyong
2023-03-30  2:41 ` Lukas Wunner
2023-04-10 12:34 ` [PATCH] PCI: Mark NVIDIA T4 GPUs to avoid bus reset Wu Zongyong
2023-04-26  8:13   ` Wu Zongyong
2023-08-09 23:05   ` Bjorn Helgaas
2023-09-08  2:50     ` Wu Zongyong
2023-09-08  3:40       ` Alex Williamson
2023-09-08 20:11         ` Bjorn Helgaas

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).