From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 381E2191
	for <netdev@vger.kernel.org>; Tue, 11 Jul 2023 00:38:59 +0000 (UTC)
Received: by smtp.kernel.org (Postfix) with ESMTPSA id 7B1B0C433C8;
	Tue, 11 Jul 2023 00:38:59 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org;
	s=k20201202; t=1689035939;
	bh=2Us+2Gjr0gIOj1mGTkYo2JYzunlMRBtsOjC254M84+w=;
	h=Date:From:To:Cc:Subject:In-Reply-To:References:From;
	b=lJwNlETulPGHNBEp/gFTISgJdr1I+W/mVUx63icsBGF9m4lO+kNGo76kCEGxGt+bK
	 VSYa24riJcFXQjuekCsO428wUPbD/1XmLa/x3zzAm+J8fqfzg42wdHwTF2W4f/bRiB
	 i3AthakBeUe4eWyEpTK74lGnG/wVmkcVRlnTRX5oSqx8LqjauKg4I8OBoC+aJvtFNp
	 TcsIgV43bEOfmdoB41V0LQfjuA5q5kvtnnk/uEZdxkeMgsTyoqf7jN0ldKUqJjv+/n
	 kUz2F78bUOI2JCNWeL2iwGOqcxBHXJ4/joOfCTtLCYxCVto84xnhmw72IBYTU8Ar29
	 zbvaEVBnKVHGQ==
Date: Mon, 10 Jul 2023 17:38:58 -0700
From: Jakub Kicinski <kuba@kernel.org>
To: Michael Chan <michael.chan@broadcom.com>
Cc: davem@davemloft.net, netdev@vger.kernel.org, edumazet@google.com,
 pabeni@redhat.com
Subject: Re: [PATCH net-next 0/3] eth: bnxt: handle invalid Tx completions
 more gracefully
Message-ID: <20230710173858.75bc590e@kernel.org>
In-Reply-To: <CACKFLikt=1U5fB2Xe=KfsvjfrXmgQuR2PH4iWCESWcpZBf-8Qg@mail.gmail.com>
References: <20230710205611.1198878-1-kuba@kernel.org>
	<CACKFLikt=1U5fB2Xe=KfsvjfrXmgQuR2PH4iWCESWcpZBf-8Qg@mail.gmail.com>
Precedence: bulk
X-Mailing-List: netdev@vger.kernel.org
List-Id: <netdev.vger.kernel.org>
List-Subscribe: <mailto:netdev+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:netdev+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit

On Mon, 10 Jul 2023 14:44:31 -0700 Michael Chan wrote:
> > bnxt trusts the events generated by the device which may lead to kernel
> > crashes. These are extremely rare but they do happen. For a while
> > I thought crashing may be intentional, because device reporting invalid
> > completions should never happen, and having a core dump could be useful
> > if it does. But in practice I haven't found any clues in the core dumps,
> > and panic_on_warn exists.  
> 
> Indeed, it was intentional to crash the kernel so that we could
> analyze the rings in the core dump.  Typically, we would find a bad
> completion in one of the rings and we would debug it with the hardware
> team during early chip testing.  Either the bug is fixed or some
> suitable workaround is implemented.  Ideally, this should never happen
> once the chip goes into production.

I was suspecting bad HW, but some new platforms seem to be hitting it,
too. Which now makes me suspect a PXE -> Linux hand-off problem?
Or multi-host?  Hard to tell..
Hopefully once it's not crashing it will be easier to do more analysis -
crashes within softirq during boot don't propagate too well into
monitoring systems :(
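
For illustration only, here is a minimal userspace sketch of what "not
crashing" on an invalid Tx completion could look like. It is not the bnxt
code and not the patchset under discussion; every name in it (fake_skb,
tx_ring, tx_complete, needs_reset, RING_SIZE) is hypothetical. The assumed
idea is that a completion pointing at an empty ring slot gets counted and
flagged for recovery instead of being dereferenced.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define RING_SIZE 8	/* hypothetical ring size, must be a power of two */

struct fake_skb { int len; };	/* stand-in for struct sk_buff */

struct tx_ring {
	struct fake_skb *slot[RING_SIZE];	/* NULL means nothing in flight */
	unsigned int cons;			/* next slot to be completed */
	unsigned int bad_completions;		/* rejected completion events */
	bool needs_reset;			/* ask a slow path to recover the ring */
};

/*
 * Process a completion event claiming nr_pkts packets were sent.
 * Returns how many packets were actually completed. Stops at the first
 * empty slot instead of dereferencing a NULL pointer.
 */
static int tx_complete(struct tx_ring *r, unsigned int nr_pkts)
{
	int done = 0;

	while (nr_pkts--) {
		unsigned int idx = r->cons & (RING_SIZE - 1);
		struct fake_skb *skb = r->slot[idx];

		if (!skb) {
			/* The device reported a slot we never filled. */
			r->bad_completions++;
			r->needs_reset = true;
			fprintf(stderr, "invalid Tx completion at index %u\n", idx);
			break;
		}

		r->slot[idx] = NULL;	/* a real driver would unmap and free here */
		r->cons++;
		done++;
	}

	return done;
}
```

With two packets queued and an event claiming three, the sketch completes
two, bumps the counter once and sets needs_reset, which is the kind of
signal that can reach monitoring where a softirq crash during boot may not.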

> I suppose in a large enough deployment, this NULL SKB crash can
> happen.  I will review your patchset later today.  Thanks.

Thanks!