Date: Wed, 14 Jun 2023 10:27:44 -0700
From: Jakub Kicinski
To: Martin Habets
Cc: Íñigo Huguet, ecree.xilinx@gmail.com, davem@davemloft.net, edumazet@google.com, pabeni@redhat.com, netdev@vger.kernel.org, linux-net-drivers@amd.com, Fei Liu
Subject: Re: [PATCH net] sfc: use budget for TX completions
Message-ID: <20230614102744.71c91f20@kernel.org>
References: <20230612144254.21039-1-ihuguet@redhat.com>
X-Mailing-List: netdev@vger.kernel.org

On Wed, 14 Jun 2023 09:03:05 +0100 Martin Habets wrote:
> On Mon, Jun 12, 2023 at 04:42:54PM +0200, Íñigo Huguet wrote:
> > When running workloads heavy unbalanced towards TX (high TX, low RX
> > traffic), sfc driver can retain the CPU during too long times. Although
> > in many cases this is not enough to be visible, it can affect
> > performance and system responsiveness.
> >
> > A way to reproduce it is to use a debug kernel and run some parallel
> > netperf TX tests. In some systems, this will lead to this message being
> > logged:
> > kernel:watchdog: BUG: soft lockup - CPU#12 stuck for 22s!
> >
> > The reason is that sfc driver doesn't account any NAPI budget for the TX
> > completion events work. With high-TX/low-RX traffic, this makes that the
> > CPU is held for long time for NAPI poll.
> >
> > Documentations says "drivers can process completions for any number of Tx
> > packets but should only process up to budget number of Rx packets".
> > However, many drivers do limit the amount of TX completions that they
> > process in a single NAPI poll.
>
> I think your work and what other drivers do shows that the documentation is
> no longer correct. I haven't checked when that was written, but maybe it
> was years ago when link speeds were lower.
> Clearly for drivers that support higher link speeds this is an issue, so we
> should update the documentation. Not sure what constitutes a high link speed,
> with current CPUs for me it's anything >= 50G.

The documentation is pretty recent. I haven't seen this lockup once
in production or testing. Do multiple queues complete on the same CPU
for SFC or something weird like that?
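
For readers following the thread, the budgeting question can be illustrated with a small userspace model. This is only a sketch under stated assumptions: it is not the sfc driver's code and it uses no kernel API. Every name in it (toy_event_type, toy_event_queue, toy_poll, count_tx) is hypothetical. It shows a poll routine that either charges TX completions against the budget or lets them through for free, which is the behavioural difference the patch description is about.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event types for a toy event queue; not the sfc driver's. */
enum toy_event_type { TOY_EV_RX, TOY_EV_TX_COMPLETION };

/* Hypothetical queue of pending events, oldest first. */
struct toy_event_queue {
	const enum toy_event_type *events; /* backing array of events */
	size_t count;                      /* total events in the array */
	size_t next;                       /* index of next unprocessed event */
};

/*
 * Process pending events until the queue is empty or `budget` units of
 * work have been charged.
 *
 * If count_tx is zero, only RX events are charged, so a TX-only backlog
 * is drained completely in a single call regardless of its length.
 * If count_tx is non-zero, TX completions are charged as well, so a
 * single call handles at most `budget` events.
 *
 * Returns the amount of work charged against the budget.
 */
static int toy_poll(struct toy_event_queue *q, int budget, int count_tx)
{
	int spent = 0;

	assert(budget > 0);

	while (q->next < q->count && spent < budget) {
		enum toy_event_type ev = q->events[q->next++];

		if (ev == TOY_EV_RX || count_tx)
			spent++;
	}
	return spent;
}
```

In this model, a backlog of 100 TX completions polled with a budget of 64 and count_tx set to 0 is consumed in one call that reports zero work done. With count_tx set to 1 the same backlog takes two calls (64, then 36), giving the caller a chance to yield the CPU in between.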