Re: [PATCH] vdpa/mlx5: Fix crash on shutdown for when no ndev exists

From: "Michael S. Tsirkin" <mst@redhat.com>
To: Dragos Tatulea <dtatulea@nvidia.com>
Cc: "virtualization@lists.linux-foundation.org" 
	<virtualization@lists.linux-foundation.org>,
	"jasowang@redhat.com" <jasowang@redhat.com>,
	Saeed Mahameed <saeedm@nvidia.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"stable@vger.kernel.org" <stable@vger.kernel.org>,
	"xuanzhuo@linux.alibaba.com" <xuanzhuo@linux.alibaba.com>
Subject: Re: [PATCH] vdpa/mlx5: Fix crash on shutdown for when no ndev exists
Date: Mon, 31 Jul 2023 05:08:15 -0400	[thread overview]
Message-ID: <20230731050200-mutt-send-email-mst@kernel.org> (raw)
In-Reply-To: <b97484f15824c86f5cee4fe673794f17419bcb1b.camel@nvidia.com>

On Mon, Jul 31, 2023 at 07:15:31AM +0000, Dragos Tatulea wrote:
> On Thu, 2023-07-27 at 12:28 -0400, Michael S. Tsirkin wrote:
> > On Thu, Jul 27, 2023 at 04:02:16PM +0000, Dragos Tatulea wrote:
> > > On Wed, 2023-07-26 at 15:26 -0400, Michael S. Tsirkin wrote:
> > > > On Wed, Jul 26, 2023 at 10:07:38PM +0300, Dragos Tatulea wrote:
> > > > > The ndev was accessed on shutdown without a check if it actually exists.
> > > > > This triggered the crash pasted below. This patch simply adds a check
> > > > > before using ndev.
> > > > > 
> > > > >  BUG: kernel NULL pointer dereference, address: 0000000000000300
> > > > >  #PF: supervisor read access in kernel mode
> > > > >  #PF: error_code(0x0000) - not-present page
> > > > >  PGD 0 P4D 0
> > > > >  Oops: 0000 [#1] SMP
> > > > >  CPU: 0 PID: 1 Comm: systemd-shutdow Not tainted 6.5.0-
> > > > > rc2_for_upstream_min_debug_2023_07_17_15_05 #1
> > > > >  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-
> > > > > gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
> > > > >  RIP: 0010:mlx5v_shutdown+0xe/0x50 [mlx5_vdpa]
> > > > >  RSP: 0018:ffff8881003bfdc0 EFLAGS: 00010286
> > > > >  RAX: ffff888103befba0 RBX: ffff888109d28008 RCX: 0000000000000017
> > > > >  RDX: 0000000000000001 RSI: 0000000000000212 RDI: ffff888109d28000
> > > > >  RBP: 0000000000000000 R08: 0000000d3a3a3882 R09: 0000000000000001
> > > > >  R10: 0000000000000000 R11: 0000000000000000 R12: ffff888109d28000
> > > > >  R13: ffff888109d28080 R14: 00000000fee1dead R15: 0000000000000000
> > > > >  FS:  00007f4969e0be40(0000) GS:ffff88852c800000(0000)
> > > > > knlGS:0000000000000000
> > > > >  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > >  CR2: 0000000000000300 CR3: 00000001051cd006 CR4: 0000000000370eb0
> > > > >  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > > >  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > > >  Call Trace:
> > > > >   <TASK>
> > > > >   ? __die+0x20/0x60
> > > > >   ? page_fault_oops+0x14c/0x3c0
> > > > >   ? exc_page_fault+0x75/0x140
> > > > >   ? asm_exc_page_fault+0x22/0x30
> > > > >   ? mlx5v_shutdown+0xe/0x50 [mlx5_vdpa]
> > > > >   device_shutdown+0x13e/0x1e0
> > > > >   kernel_restart+0x36/0x90
> > > > >   __do_sys_reboot+0x141/0x210
> > > > >   ? vfs_writev+0xcd/0x140
> > > > >   ? handle_mm_fault+0x161/0x260
> > > > >   ? do_writev+0x6b/0x110
> > > > >   do_syscall_64+0x3d/0x90
> > > > >   entry_SYSCALL_64_after_hwframe+0x46/0xb0
> > > > >  RIP: 0033:0x7f496990fb56
> > > > >  RSP: 002b:00007fffc7bdde88 EFLAGS: 00000206 ORIG_RAX: 00000000000000a9
> > > > >  RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f496990fb56
> > > > >  RDX: 0000000001234567 RSI: 0000000028121969 RDI: fffffffffee1dead
> > > > >  RBP: 00007fffc7bde1d0 R08: 0000000000000000 R09: 0000000000000000
> > > > >  R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000
> > > > >  R13: 00007fffc7bddf10 R14: 0000000000000000 R15: 00007fffc7bde2b8
> > > > >   </TASK>
> > > > >  CR2: 0000000000000300
> > > > >  ---[ end trace 0000000000000000 ]---
> > > > > 
> > > > > Fixes: bc9a2b3e686e ("vdpa/mlx5: Support interrupt bypassing")
> > > > > Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
> > > > > ---
> > > > >  drivers/vdpa/mlx5/net/mlx5_vnet.c | 3 ++-
> > > > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c
> > > > > b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> > > > > index 9138ef2fb2c8..e2e7ebd71798 100644
> > > > > --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
> > > > > +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> > > > > @@ -3556,7 +3556,8 @@ static void mlx5v_shutdown(struct auxiliary_device
> > > > > *auxdev)
> > > > >         mgtdev = auxiliary_get_drvdata(auxdev);
> > > > >         ndev = mgtdev->ndev;
> > > > >  
> > > > > -       free_irqs(ndev);
> > > > > +       if (ndev)
> > > > > +               free_irqs(ndev);
> > > > >  }
> > > > >  
> > > > 
> > > > something I don't get:
> > > > irqs are allocated in mlx5_vdpa_dev_add
> > > > why are they not freed in mlx5_vdpa_dev_del?
> > > > 
> > > That is a good point. I will try to find out. I also don't get why free_irq
> > > is
> > > called in the vdpa dev .free op instead of mlx5_vdpa_dev_del. Maybe I can
> > > change
> > > that in a different refactoring.
> > 
> > as it is I have no idea whether e.g. ndev can change
> > between these two call sites. that would make the check
> > pointless.
> > 
> > > > this is what's creating all this mess.
> > > > 
> > > > 
> > > Not quite: mlx5_vdpa_dev_del (which is a .dev_del of for struct
> > > vdpa_mgmtdev_ops) doesn't get called on shutdown. At least that's what I
> > > see. Or
> > > am I missing something?
> > 
> > and why do we care whether irqs are freed on shutdown?
> > 
> Had to ask around a bit to find out the answer: there can be issues with kexec
> IRQ allocation on some platforms. It is documented here [0] for mlx5_core.
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/mellanox/mlx5/core/main.c#n2129
> 
> Thanks,
> Dragos

It's quite weird. 
	 * Some platforms requiring freeing the IRQ's in the shutdown
	 * flow. If they aren't freed they can't be allocated after
	 * kexec. There is no need to cleanup the mlx5_core software
	 * contexts.

but most drivers don't have a shutdown callback how do they work then?
do you know which platforms these are?

I don't really know much about why shutdown callback is even necessary.
I guess this is to detect shutdown and do a faster cleanup than
the slow, graceful removal, just cleaning hardware resources?

-- 
MST