Date: Mon, 10 Jul 2023 17:38:58 -0700
From: Jakub Kicinski
To: Michael Chan
Cc: davem@davemloft.net, netdev@vger.kernel.org, edumazet@google.com, pabeni@redhat.com
Subject: Re: [PATCH net-next 0/3] eth: bnxt: handle invalid Tx completions more gracefully
Message-ID: <20230710173858.75bc590e@kernel.org>
References: <20230710205611.1198878-1-kuba@kernel.org>
X-Mailing-List: netdev@vger.kernel.org

On Mon, 10 Jul 2023 14:44:31 -0700 Michael Chan wrote:
> > bnxt trusts the events generated by the device which may lead to kernel
> > crashes. These are extremely rare but they do happen. For a while
> > I thought crashing may be intentional, because device reporting invalid
> > completions should never happen, and having a core dump could be useful
> > if it does.
> > But in practice I haven't found any clues in the core dumps,
> > and panic_on_warn exists.
>
> Indeed, it was intentional to crash the kernel so that we could
> analyze the rings in the core dump. Typically, we would find a bad
> completion in one of the rings and we would debug it with the hardware
> team during early chip testing. Either the bug is fixed or some
> suitable workaround is implemented. Ideally, this should never happen
> once the chip goes into production.

I was suspecting bad HW, but some new platforms seem to be hitting it,
too. Which now makes me suspect a PXE -> Linux handoff problem? Or
multi-host? Hard to tell. Hopefully once it's not crashing it will be
easier to do more analysis - crashes within softirq during boot don't
propagate too well into monitoring systems :(

> I suppose in a large enough deployment, this NULL SKB crash can
> happen. I will review your patchset later today. Thanks.

Thanks!