All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Michael S. Tsirkin" <mst@redhat.com>
To: Dragos Tatulea <dtatulea@nvidia.com>
Cc: "virtualization@lists.linux-foundation.org" 
	<virtualization@lists.linux-foundation.org>,
	"jasowang@redhat.com" <jasowang@redhat.com>,
	Saeed Mahameed <saeedm@nvidia.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"stable@vger.kernel.org" <stable@vger.kernel.org>,
	"xuanzhuo@linux.alibaba.com" <xuanzhuo@linux.alibaba.com>
Subject: Re: [PATCH] vdpa/mlx5: Fix crash on shutdown for when no ndev exists
Date: Mon, 31 Jul 2023 05:08:15 -0400	[thread overview]
Message-ID: <20230731050200-mutt-send-email-mst@kernel.org> (raw)
In-Reply-To: <b97484f15824c86f5cee4fe673794f17419bcb1b.camel@nvidia.com>

On Mon, Jul 31, 2023 at 07:15:31AM +0000, Dragos Tatulea wrote:
> On Thu, 2023-07-27 at 12:28 -0400, Michael S. Tsirkin wrote:
> > On Thu, Jul 27, 2023 at 04:02:16PM +0000, Dragos Tatulea wrote:
> > > On Wed, 2023-07-26 at 15:26 -0400, Michael S. Tsirkin wrote:
> > > > On Wed, Jul 26, 2023 at 10:07:38PM +0300, Dragos Tatulea wrote:
> > > > > The ndev was accessed on shutdown without a check if it actually exists.
> > > > > This triggered the crash pasted below. This patch simply adds a check
> > > > > before using ndev.
> > > > > 
> > > > >  BUG: kernel NULL pointer dereference, address: 0000000000000300
> > > > >  #PF: supervisor read access in kernel mode
> > > > >  #PF: error_code(0x0000) - not-present page
> > > > >  PGD 0 P4D 0
> > > > >  Oops: 0000 [#1] SMP
> > > > >  CPU: 0 PID: 1 Comm: systemd-shutdow Not tainted 6.5.0-
> > > > > rc2_for_upstream_min_debug_2023_07_17_15_05 #1
> > > > >  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-
> > > > > gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
> > > > >  RIP: 0010:mlx5v_shutdown+0xe/0x50 [mlx5_vdpa]
> > > > >  RSP: 0018:ffff8881003bfdc0 EFLAGS: 00010286
> > > > >  RAX: ffff888103befba0 RBX: ffff888109d28008 RCX: 0000000000000017
> > > > >  RDX: 0000000000000001 RSI: 0000000000000212 RDI: ffff888109d28000
> > > > >  RBP: 0000000000000000 R08: 0000000d3a3a3882 R09: 0000000000000001
> > > > >  R10: 0000000000000000 R11: 0000000000000000 R12: ffff888109d28000
> > > > >  R13: ffff888109d28080 R14: 00000000fee1dead R15: 0000000000000000
> > > > >  FS:  00007f4969e0be40(0000) GS:ffff88852c800000(0000)
> > > > > knlGS:0000000000000000
> > > > >  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > >  CR2: 0000000000000300 CR3: 00000001051cd006 CR4: 0000000000370eb0
> > > > >  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > > >  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > > >  Call Trace:
> > > > >   <TASK>
> > > > >   ? __die+0x20/0x60
> > > > >   ? page_fault_oops+0x14c/0x3c0
> > > > >   ? exc_page_fault+0x75/0x140
> > > > >   ? asm_exc_page_fault+0x22/0x30
> > > > >   ? mlx5v_shutdown+0xe/0x50 [mlx5_vdpa]
> > > > >   device_shutdown+0x13e/0x1e0
> > > > >   kernel_restart+0x36/0x90
> > > > >   __do_sys_reboot+0x141/0x210
> > > > >   ? vfs_writev+0xcd/0x140
> > > > >   ? handle_mm_fault+0x161/0x260
> > > > >   ? do_writev+0x6b/0x110
> > > > >   do_syscall_64+0x3d/0x90
> > > > >   entry_SYSCALL_64_after_hwframe+0x46/0xb0
> > > > >  RIP: 0033:0x7f496990fb56
> > > > >  RSP: 002b:00007fffc7bdde88 EFLAGS: 00000206 ORIG_RAX: 00000000000000a9
> > > > >  RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f496990fb56
> > > > >  RDX: 0000000001234567 RSI: 0000000028121969 RDI: fffffffffee1dead
> > > > >  RBP: 00007fffc7bde1d0 R08: 0000000000000000 R09: 0000000000000000
> > > > >  R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000
> > > > >  R13: 00007fffc7bddf10 R14: 0000000000000000 R15: 00007fffc7bde2b8
> > > > >   </TASK>
> > > > >  CR2: 0000000000000300
> > > > >  ---[ end trace 0000000000000000 ]---
> > > > > 
> > > > > Fixes: bc9a2b3e686e ("vdpa/mlx5: Support interrupt bypassing")
> > > > > Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
> > > > > ---
> > > > >  drivers/vdpa/mlx5/net/mlx5_vnet.c | 3 ++-
> > > > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c
> > > > > b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> > > > > index 9138ef2fb2c8..e2e7ebd71798 100644
> > > > > --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
> > > > > +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> > > > > @@ -3556,7 +3556,8 @@ static void mlx5v_shutdown(struct auxiliary_device
> > > > > *auxdev)
> > > > >         mgtdev = auxiliary_get_drvdata(auxdev);
> > > > >         ndev = mgtdev->ndev;
> > > > >  
> > > > > -       free_irqs(ndev);
> > > > > +       if (ndev)
> > > > > +               free_irqs(ndev);
> > > > >  }
> > > > >  
> > > > 
> > > > something I don't get:
> > > > irqs are allocated in mlx5_vdpa_dev_add
> > > > why are they not freed in mlx5_vdpa_dev_del?
> > > > 
> > > That is a good point. I will try to find out. I also don't get why free_irq
> > > is
> > > called in the vdpa dev .free op instead of mlx5_vdpa_dev_del. Maybe I can
> > > change
> > > that in a different refactoring.
> > 
> > as it is I have no idea whether e.g. ndev can change
> > between these two call sites. that would make the check
> > pointless.
> > 
> > > > this is what's creating all this mess.
> > > > 
> > > > 
> > > Not quite: mlx5_vdpa_dev_del (which is a .dev_del of for struct
> > > vdpa_mgmtdev_ops) doesn't get called on shutdown. At least that's what I
> > > see. Or
> > > am I missing something?
> > 
> > and why do we care whether irqs are freed on shutdown?
> > 
> Had to ask around a bit to find out the answer: there can be issues with kexec
> IRQ allocation on some platforms. It is documented here [0] for mlx5_core.
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/mellanox/mlx5/core/main.c#n2129
> 
> Thanks,
> Dragos

It's quite weird. 
	 * Some platforms requiring freeing the IRQ's in the shutdown
	 * flow. If they aren't freed they can't be allocated after
	 * kexec. There is no need to cleanup the mlx5_core software
	 * contexts.

but most drivers don't have a shutdown callback how do they work then?
do you know which platforms these are?

I don't really know much about why shutdown callback is even necessary.
I guess this is to detect shutdown and do a faster cleanup than
the slow, graceful removal, just cleaning hardware resources?


-- 
MST


WARNING: multiple messages have this Message-ID (diff)
From: "Michael S. Tsirkin" <mst@redhat.com>
To: Dragos Tatulea <dtatulea@nvidia.com>
Cc: "xuanzhuo@linux.alibaba.com" <xuanzhuo@linux.alibaba.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"stable@vger.kernel.org" <stable@vger.kernel.org>,
	"virtualization@lists.linux-foundation.org"
	<virtualization@lists.linux-foundation.org>,
	Saeed Mahameed <saeedm@nvidia.com>
Subject: Re: [PATCH] vdpa/mlx5: Fix crash on shutdown for when no ndev exists
Date: Mon, 31 Jul 2023 05:08:15 -0400	[thread overview]
Message-ID: <20230731050200-mutt-send-email-mst@kernel.org> (raw)
In-Reply-To: <b97484f15824c86f5cee4fe673794f17419bcb1b.camel@nvidia.com>

On Mon, Jul 31, 2023 at 07:15:31AM +0000, Dragos Tatulea wrote:
> On Thu, 2023-07-27 at 12:28 -0400, Michael S. Tsirkin wrote:
> > On Thu, Jul 27, 2023 at 04:02:16PM +0000, Dragos Tatulea wrote:
> > > On Wed, 2023-07-26 at 15:26 -0400, Michael S. Tsirkin wrote:
> > > > On Wed, Jul 26, 2023 at 10:07:38PM +0300, Dragos Tatulea wrote:
> > > > > The ndev was accessed on shutdown without a check if it actually exists.
> > > > > This triggered the crash pasted below. This patch simply adds a check
> > > > > before using ndev.
> > > > > 
> > > > >  BUG: kernel NULL pointer dereference, address: 0000000000000300
> > > > >  #PF: supervisor read access in kernel mode
> > > > >  #PF: error_code(0x0000) - not-present page
> > > > >  PGD 0 P4D 0
> > > > >  Oops: 0000 [#1] SMP
> > > > >  CPU: 0 PID: 1 Comm: systemd-shutdow Not tainted 6.5.0-
> > > > > rc2_for_upstream_min_debug_2023_07_17_15_05 #1
> > > > >  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-
> > > > > gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
> > > > >  RIP: 0010:mlx5v_shutdown+0xe/0x50 [mlx5_vdpa]
> > > > >  RSP: 0018:ffff8881003bfdc0 EFLAGS: 00010286
> > > > >  RAX: ffff888103befba0 RBX: ffff888109d28008 RCX: 0000000000000017
> > > > >  RDX: 0000000000000001 RSI: 0000000000000212 RDI: ffff888109d28000
> > > > >  RBP: 0000000000000000 R08: 0000000d3a3a3882 R09: 0000000000000001
> > > > >  R10: 0000000000000000 R11: 0000000000000000 R12: ffff888109d28000
> > > > >  R13: ffff888109d28080 R14: 00000000fee1dead R15: 0000000000000000
> > > > >  FS:  00007f4969e0be40(0000) GS:ffff88852c800000(0000)
> > > > > knlGS:0000000000000000
> > > > >  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > >  CR2: 0000000000000300 CR3: 00000001051cd006 CR4: 0000000000370eb0
> > > > >  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > > >  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > > >  Call Trace:
> > > > >   <TASK>
> > > > >   ? __die+0x20/0x60
> > > > >   ? page_fault_oops+0x14c/0x3c0
> > > > >   ? exc_page_fault+0x75/0x140
> > > > >   ? asm_exc_page_fault+0x22/0x30
> > > > >   ? mlx5v_shutdown+0xe/0x50 [mlx5_vdpa]
> > > > >   device_shutdown+0x13e/0x1e0
> > > > >   kernel_restart+0x36/0x90
> > > > >   __do_sys_reboot+0x141/0x210
> > > > >   ? vfs_writev+0xcd/0x140
> > > > >   ? handle_mm_fault+0x161/0x260
> > > > >   ? do_writev+0x6b/0x110
> > > > >   do_syscall_64+0x3d/0x90
> > > > >   entry_SYSCALL_64_after_hwframe+0x46/0xb0
> > > > >  RIP: 0033:0x7f496990fb56
> > > > >  RSP: 002b:00007fffc7bdde88 EFLAGS: 00000206 ORIG_RAX: 00000000000000a9
> > > > >  RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f496990fb56
> > > > >  RDX: 0000000001234567 RSI: 0000000028121969 RDI: fffffffffee1dead
> > > > >  RBP: 00007fffc7bde1d0 R08: 0000000000000000 R09: 0000000000000000
> > > > >  R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000
> > > > >  R13: 00007fffc7bddf10 R14: 0000000000000000 R15: 00007fffc7bde2b8
> > > > >   </TASK>
> > > > >  CR2: 0000000000000300
> > > > >  ---[ end trace 0000000000000000 ]---
> > > > > 
> > > > > Fixes: bc9a2b3e686e ("vdpa/mlx5: Support interrupt bypassing")
> > > > > Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
> > > > > ---
> > > > >  drivers/vdpa/mlx5/net/mlx5_vnet.c | 3 ++-
> > > > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c
> > > > > b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> > > > > index 9138ef2fb2c8..e2e7ebd71798 100644
> > > > > --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
> > > > > +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> > > > > @@ -3556,7 +3556,8 @@ static void mlx5v_shutdown(struct auxiliary_device
> > > > > *auxdev)
> > > > >         mgtdev = auxiliary_get_drvdata(auxdev);
> > > > >         ndev = mgtdev->ndev;
> > > > >  
> > > > > -       free_irqs(ndev);
> > > > > +       if (ndev)
> > > > > +               free_irqs(ndev);
> > > > >  }
> > > > >  
> > > > 
> > > > something I don't get:
> > > > irqs are allocated in mlx5_vdpa_dev_add
> > > > why are they not freed in mlx5_vdpa_dev_del?
> > > > 
> > > That is a good point. I will try to find out. I also don't get why free_irq
> > > is
> > > called in the vdpa dev .free op instead of mlx5_vdpa_dev_del. Maybe I can
> > > change
> > > that in a different refactoring.
> > 
> > as it is I have no idea whether e.g. ndev can change
> > between these two call sites. that would make the check
> > pointless.
> > 
> > > > this is what's creating all this mess.
> > > > 
> > > > 
> > > Not quite: mlx5_vdpa_dev_del (which is a .dev_del of for struct
> > > vdpa_mgmtdev_ops) doesn't get called on shutdown. At least that's what I
> > > see. Or
> > > am I missing something?
> > 
> > and why do we care whether irqs are freed on shutdown?
> > 
> Had to ask around a bit to find out the answer: there can be issues with kexec
> IRQ allocation on some platforms. It is documented here [0] for mlx5_core.
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/mellanox/mlx5/core/main.c#n2129
> 
> Thanks,
> Dragos

It's quite weird. 
	 * Some platforms requiring freeing the IRQ's in the shutdown
	 * flow. If they aren't freed they can't be allocated after
	 * kexec. There is no need to cleanup the mlx5_core software
	 * contexts.

but most drivers don't have a shutdown callback how do they work then?
do you know which platforms these are?

I don't really know much about why shutdown callback is even necessary.
I guess this is to detect shutdown and do a faster cleanup than
the slow, graceful removal, just cleaning hardware resources?


-- 
MST

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

  reply	other threads:[~2023-07-31  9:09 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-07-26 19:07 [PATCH] vdpa/mlx5: Fix crash on shutdown for when no ndev exists Dragos Tatulea
2023-07-26 19:07 ` Dragos Tatulea via Virtualization
2023-07-26 19:10 ` kernel test robot
2023-07-26 19:26 ` Michael S. Tsirkin
2023-07-26 19:26   ` Michael S. Tsirkin
2023-07-27 16:02   ` Dragos Tatulea
2023-07-27 16:02     ` Dragos Tatulea via Virtualization
2023-07-27 16:28     ` Michael S. Tsirkin
2023-07-27 16:28       ` Michael S. Tsirkin
2023-07-31  7:15       ` Dragos Tatulea
2023-07-31  7:15         ` Dragos Tatulea via Virtualization
2023-07-31  9:08         ` Michael S. Tsirkin [this message]
2023-07-31  9:08           ` Michael S. Tsirkin
2023-08-01  3:59           ` Jason Wang
2023-08-01  3:59             ` Jason Wang
2023-08-01  8:17             ` Dragos Tatulea
2023-08-01  8:17               ` Dragos Tatulea via Virtualization
2023-08-02  2:51               ` Jason Wang
2023-08-02  2:51                 ` Jason Wang
2023-08-02  7:56                 ` Dragos Tatulea
2023-08-02  7:56                   ` Dragos Tatulea via Virtualization
2023-08-03 15:02                   ` Dragos Tatulea
2023-08-03 15:02                     ` Dragos Tatulea via Virtualization
2023-08-03 15:12                     ` Michael S. Tsirkin
2023-08-03 15:12                       ` Michael S. Tsirkin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230731050200-mutt-send-email-mst@kernel.org \
    --to=mst@redhat.com \
    --cc=dtatulea@nvidia.com \
    --cc=jasowang@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=saeedm@nvidia.com \
    --cc=stable@vger.kernel.org \
    --cc=virtualization@lists.linux-foundation.org \
    --cc=xuanzhuo@linux.alibaba.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.