From: David Miller
Subject: Re: [net-next 03/10] ixgbe: Drop the TX work limit and instead just leave it to budget
Date: Mon, 22 Aug 2011 16:40:27 -0700 (PDT)
Message-ID: <20110822.164027.1830363266993513959.davem@davemloft.net>
References: <4E52920F.7060603@intel.com> <20110822.135644.683110224886588181.davem@davemloft.net> <4E52DEEF.40504@intel.com>
In-Reply-To: <4E52DEEF.40504@intel.com>
To: alexander.h.duyck@intel.com
Cc: bhutchings@solarflare.com, jeffrey.t.kirsher@intel.com, netdev@vger.kernel.org, gospo@redhat.com

From: Alexander Duyck
Date: Mon, 22 Aug 2011 15:57:51 -0700

> The problem was occurring even without large rings.  I was seeing
> issues with rings just 256 descriptors in size.

And the default in the ixgbe driver is 512 entries, which I think is
itself quite excessive.  Something like 128 is more in line with what
I'd call a sane default.

So the only side effect of your change is to decrease the TX quota to
64 (the default NAPI quota) from its current value of 512
(IXGBE_DEFAULT_TXD).

Talking about the existing code, I can't even see how the current
driver-private TX quota can trigger except in the most extreme cases.
This is because the quota is set to the same value as the TX ring
size.

> The problem seemed to be that the TX cleanup being a multiple of
> budget was allowing one CPU to overwhelm the other and the fact that
> the TX was essentially unbounded was just allowing the issue to
> feedback on itself.

I still don't understand what issue you could even be running into.

On each CPU we round-robin against all NAPI requestors for that CPU.
In your routing test setup, we should have one cpu doing the RX and a
different cpu doing the TX.

Therefore, if the TX cpu simply spins in a loop doing nothing but TX
reclaim work, it should not really matter.  And if you hit the TX
budget on the TX cpu, it's just going to come right back into the
ixgbe NAPI handler and do the TX reclaim processing not even a dozen
cycles later.

The only effect is to have us go through the whole function call
sequence and data-structure setup into local variables more often
than we would have before.

> In addition since the RX and TX workload was balanced it kept both
> locked into polling while the CPU was saturated instead of allowing
> the TX to become interrupt driven.  In addition since the TX was
> working on the same budget as the RX the number of SKBs freed up in
> the TX path would match the number consumed when being reallocated
> on the RX path.

So the only conclusion I can come to is that we're now executing what
are essentially wasted cpu cycles, and this takes us over the
threshold such that we poll more and take interrupts less.  And this
improves performance.

That's pretty unwise if you ask me; we should do something useful
with cpu cycles instead of wasting them merely to make us poll more.

> The problem seemed to be present as long as I allowed the TX budget
> to be a multiple of the RX budget.  The easiest way to keep things
> balanced and avoid allowing the TX from one CPU to overwhelm the RX
> on another was just to keep the budgets equal.
You're executing 10 or 20 cpu cycles after every 64 TX reclaims;
that's the only effect of these changes.  That's not even long enough
for a cache line to transfer between two cpus.
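
For reference, a minimal, hypothetical sketch of the NAPI poll
contract being argued about; this is not the actual ixgbe code, and
my_poll(), my_clean_tx(), my_clean_rx(), my_enable_irq() and
struct my_q_vector are invented names.  The point it illustrates:
TX reclaim is bounded by the shared "budget" instead of a
driver-private work limit tied to the ring size, the poll callback
returns the full budget while work remains (so the core keeps
polling it), and it calls napi_complete() when it finishes early so
the queue drops back to being interrupt driven.

	/*
	 * Hypothetical sketch of a budget-bounded NAPI poll routine.
	 * Not the real ixgbe code; helper names are made up.
	 */
	static int my_poll(struct napi_struct *napi, int budget)
	{
		struct my_q_vector *qv =
			container_of(napi, struct my_q_vector, napi);
		int tx_cleaned, rx_cleaned;

		/* Reclaim at most "budget" TX descriptors, not the
		 * whole ring. */
		tx_cleaned = my_clean_tx(qv, budget);

		/* Receive at most "budget" packets. */
		rx_cleaned = my_clean_rx(qv, budget);

		if (tx_cleaned >= budget || rx_cleaned >= budget)
			return budget;	/* more work: stay in polling mode */

		napi_complete(napi);	/* done: go back to interrupts */
		my_enable_irq(qv);
		return rx_cleaned;
	}

With that contract, hitting the TX budget just means the core calls
the poll routine again almost immediately, which is the "not even a
dozen cycles later" point above.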