linux-pci.vger.kernel.org archive mirror
From: Alex Williamson <alex.williamson@redhat.com>
To: Wu Zongyong <wuzongyong@linux.alibaba.com>
Cc: Bjorn Helgaas <helgaas@kernel.org>,
	lukas@wunner.de, sdonthineni@nvidia.com, bhelgaas@google.com,
	linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org,
	wllenyj@linux.alibaba.com, wutu.xq2@linux.alibaba.com,
	gerry@linux.alibaba.com, pjaroszynski@nvidia.com
Subject: Re: [PATCH] PCI: Mark NVIDIA T4 GPUs to avoid bus reset
Date: Thu, 7 Sep 2023 21:40:37 -0600
Message-ID: <20230907214037.7f35f26a.alex.williamson@redhat.com>
In-Reply-To: <ZPqMCDWvITlOLHgJ@wuzongyong-alibaba>

On Fri, 8 Sep 2023 10:50:48 +0800
Wu Zongyong <wuzongyong@linux.alibaba.com> wrote:

> On Wed, Aug 09, 2023 at 06:05:18PM -0500, Bjorn Helgaas wrote:
> > On Mon, Apr 10, 2023 at 08:34:11PM +0800, Wu Zongyong wrote:  
> > > NVIDIA T4 GPUs do not work with SBR. This problem is found when the T4
> > > card is direct attached to a Root Port only. So avoid bus reset by
> > > marking T4 GPUs PCI_DEV_FLAGS_NO_BUS_RESET.
> > > 
> > > Fixes: 4c207e7121fa ("PCI: Mark some NVIDIA GPUs to avoid bus reset")
> > > Signed-off-by: Wu Zongyong <wuzongyong@linux.alibaba.com>  
> > 
> > Applied to pci/virtualization for v6.6, thanks!  
> 
> I discussed the issue with NVIDIA, and they think it is probably related to
> the PCI link rather than the T4 GPU card itself.
> 
> I will try to describe the issue I hit in detail.
> 
> The T4 card is directly attached to a Root Port, and I rebound it to the
> vfio-pci driver. When I then called some VFIO APIs, the
> VFIO_GROUP_GET_DEVICE_FD ioctl failed.
> 
> The call stack is (based on kernel v5.10):
>     vfio_group_fops_unl_ioctl
>          vfio_group_get_device_fd
>             vfio_pci_open
>                 vfio_pci_enable // returns -19 (-ENODEV)
>                     pci_try_reset_function
>                         __pci_reset_function_locked
> 
> After __pci_reset_function_locked() runs, dmesg shows:
>    [12207494.508467] pcieport 0000:3f:00.0: pciehp: Slot(5-1): Link Down
>    [12207494.508535] vfio-pci 0000:40:00.0: No device request channel registered, blocked until released by user
>    [12207494.518426] pci 0000:40:00.0: Removing from iommu group 84
>    [12207495.532365] pcieport 0000:3f:00.0: pciehp: Slot(5-1): Card present
>    [12207495.532367] pcieport 0000:3f:00.0: pciehp: Slot(5-1): Link Up
> 
> The NVIDIA people think this Root Port is not going through this reset logic
> and is getting the link down/hotplug interrupts[1].
> 
> Can you revert the patch I sent? Maybe we should dig into this more deeply.

Yes, please revert.  We test with the T4 and have not seen any issues
with bus reset.  The T4 provides neither PM nor FLR reset, so masking
bus reset compromises this device for assignment scenarios.  I can send
a revert patch if requested.  Thanks,

Alex
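
The concern about reset coverage can be illustrated with a toy model. This
is only a sketch under stated assumptions: the enum, struct, and function
names below are hypothetical, the fallback order is reduced to the three
methods named in the message (FLR, PM, bus), and it assumes a bus reset
would otherwise be possible for the device. It is not kernel code.

```c
/* Simplified stand-ins for illustration; these are not kernel definitions. */
enum reset_method { RESET_NONE, RESET_FLR, RESET_PM, RESET_BUS };

struct model_dev {
	int has_flr;       /* device advertises Function Level Reset */
	int has_pm_reset;  /* device supports a PM (D3hot -> D0) reset */
	int no_bus_reset;  /* models the PCI_DEV_FLAGS_NO_BUS_RESET quirk */
};

/* Return the first usable method in this reduced order: FLR, PM, bus. */
enum reset_method pick_reset_method(const struct model_dev *dev)
{
	if (dev->has_flr)
		return RESET_FLR;
	if (dev->has_pm_reset)
		return RESET_PM;
	if (!dev->no_bus_reset)
		return RESET_BUS;
	return RESET_NONE; /* nothing left to reset the function with */
}
```

For a device modeled with neither FLR nor PM reset, as the T4 is described
above, setting no_bus_reset leaves pick_reset_method() with RESET_NONE,
which is why the quirk matters for assignment scenarios.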

> > > ---
> > >  drivers/pci/quirks.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > > index 44cab813bf95..be86b93b9e38 100644
> > > --- a/drivers/pci/quirks.c
> > > +++ b/drivers/pci/quirks.c
> > > @@ -3618,7 +3618,7 @@ static void quirk_no_bus_reset(struct pci_dev *dev)
> > >   */
> > >  static void quirk_nvidia_no_bus_reset(struct pci_dev *dev)
> > >  {
> > > -	if ((dev->device & 0xffc0) == 0x2340)
> > > +	if ((dev->device & 0xffc0) == 0x2340 || dev->device == 0x1eb8)
> > >  		quirk_no_bus_reset(dev);
> > >  }
> > >  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
> > > -- 
> > > 2.34.3
> > >   
> 
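
The device-ID test in the quoted hunk can be checked in isolation. The
following is a minimal user-space sketch, assuming nothing beyond the
condition shown in the diff; the function name is hypothetical and is not
part of the kernel.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper mirroring the condition in the quoted hunk of
 * quirk_nvidia_no_bus_reset(); it is not a kernel function.
 */
bool nvidia_id_matches_no_bus_reset(uint16_t device)
{
	/* The 0xffc0 mask clears the low six bits, so 0x2340..0x237f match. */
	if ((device & 0xffc0) == 0x2340)
		return true;

	/* Exact-match ID added by the patch under discussion (T4). */
	return device == 0x1eb8;
}
```

With the patch applied, 0x1eb8 matches in addition to the 64 IDs in the
0x2340 range; reverting the patch restores the mask-only check.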



Thread overview: 13+ messages
2023-03-29 11:58 [RFC PATCH] PCI: avoid SBR for NVIDIA T4 Wu Zongyong
2023-03-29 17:05 ` Bjorn Helgaas
2023-03-30  2:10   ` Wu Zongyong
2023-03-30 15:49     ` Bjorn Helgaas
2023-03-31  2:11       ` Wu Zongyong
2023-04-03  4:02         ` Wu Zongyong
2023-03-30  2:41 ` Lukas Wunner
2023-04-10 12:34 ` [PATCH] PCI: Mark NVIDIA T4 GPUs to avoid bus reset Wu Zongyong
2023-04-26  8:13   ` Wu Zongyong
2023-08-09 23:05   ` Bjorn Helgaas
2023-09-08  2:50     ` Wu Zongyong
2023-09-08  3:40       ` Alex Williamson [this message]
2023-09-08 20:11         ` Bjorn Helgaas
