From: Eric Dumazet
Subject: Re: [RFC] GRO scalability
Date: Fri, 05 Oct 2012 22:06:18 +0200
Message-ID: <1349467578.21172.178.camel@edumazet-glaptop>
References: <1348750130.5093.1227.camel@edumazet-glaptop>
 <1348769294.5093.1566.camel@edumazet-glaptop>
 <1348769990.5093.1584.camel@edumazet-glaptop>
 <1348841041.5093.2477.camel@edumazet-glaptop>
 <1349448747.21172.113.camel@edumazet-glaptop>
 <506F23F6.1060704@hp.com>
 <1349463634.21172.152.camel@edumazet-glaptop>
 <506F368F.3070403@hp.com>
Cc: Herbert Xu, David Miller, netdev, Jesse Gross
To: Rick Jones
In-Reply-To: <506F368F.3070403@hp.com>

On Fri, 2012-10-05 at 12:35 -0700, Rick Jones wrote:

> Just how much code path is there between NAPI and the socket?? (And I
> guess just how much combining are you hoping for?)

When GRO works correctly, you can save about 30% of CPU cycles, although
the exact gain depends on the workload... Doubling MAX_SKB_FRAGS
(allowing 32+1 MSS per GRO skb instead of 16+1) gives an improvement as
well...

> > Let's say we allow no more than 1ms of delay in GRO,
>
> OK. That means we can ignore HPC and FSI because they wouldn't tolerate
> that kind of added delay anyway. I'm not sure if that also then
> eliminates the networked storage types.

I used this 1ms delay as an example, but I never said it was a fixed
value ;)

Also remember one thing: this is the _max_ delay, hit only when your
napi handler is flooded. This almost never happens (tm).

> > this means we could have about 400 packets in the GRO queue (assuming
> > 1500 byte packets)
>
> How many flows are you going to have entering via that queue? And just
> how well "shuffled" will the segments of those flows be? That is what
> it all comes down to, right? How many (active) flows and how well
> shuffled they are. If the flows aren't well shuffled, you can get away
> with a smallish coalescing context. If they are perfectly shuffled and
> greater in number than your delay allowance, you get right back to
> square one, with all the overhead of GRO attempts and none of the
> benefit.

I am not sure what you mean by shuffled. We use a hash table to locate a
flow, but we also have an LRU list that keeps the packets ordered by the
time they entered the 'GRO unit'.

If napi completes, the whole LRU list content is flushed to the IP stack
(napi_gro_flush()).

If napi does not complete, we would only flush the 'too old' packets
found in the LRU.

Note: this selective flush can be called once per napi run, from
net_rx_action(). The extra cost of getting a somewhat precise timestamp
would be acceptable (one call to ktime_get() or get_cycles() every 64
packets). This timestamp could be stored in napi->timestamp and taken
once per n->poll(n, weight) call.
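
To make this more concrete, here is a rough, completely untested sketch
of what the selective flush could look like. None of these fields exist
today: napi->timestamp, napi->gro_lru_list, skb->gro_stamp and
skb->gro_lru_node are only there to illustrate the proposal, and
napi_gro_complete() simply stands for "hand the held skb to the IP
stack".

/* Untested sketch only. napi->timestamp, napi->gro_lru_list,
 * skb->gro_stamp and skb->gro_lru_node are hypothetical fields,
 * they do not exist in the current kernel.
 */

/* Refresh the napi timestamp once per n->poll(n, weight) call,
 * from net_rx_action(), instead of once per packet.
 */
static void napi_refresh_timestamp(struct napi_struct *napi)
{
	napi->timestamp = ktime_get();	/* or get_cycles() */
}

/* A held skb enters the 'GRO unit' : remember when, keep LRU order
 * (oldest packets at the head of the list).
 */
static void gro_lru_add(struct napi_struct *napi, struct sk_buff *skb)
{
	skb->gro_stamp = napi->timestamp;
	list_add_tail(&skb->gro_lru_node, &napi->gro_lru_list);
}

/* Selective flush, called once per napi run from net_rx_action() when
 * napi did not complete : push 'too old' packets to the IP stack.
 * Since the LRU is ordered by entry time, we can stop at the first
 * packet that is still young enough.
 */
static void napi_gro_flush_old(struct napi_struct *napi, long max_age_us)
{
	struct sk_buff *skb, *next;

	list_for_each_entry_safe(skb, next, &napi->gro_lru_list, gro_lru_node) {
		if (ktime_us_delta(napi->timestamp, skb->gro_stamp) < max_age_us)
			break;
		list_del(&skb->gro_lru_node);
		napi_gro_complete(skb);	/* hand it to the IP stack */
	}
}

With a 1ms budget the call from net_rx_action() would be
napi_gro_flush_old(napi, 1000), but again, 1ms is not meant to be a
fixed value.
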
> If the flow count is < 400, to allow a decent shot at a non-zero
> combining rate on well shuffled flows with the 400 packet limit, then
> that means each flow is >= 12.5 Mbit/s on average at 5 Gbit/s
> aggregated. And I think you then get two segments per flow aggregated
> at a time. Is that consistent with what you expect to be the
> characteristics of the flows entering via that queue?

If a packet cannot stay more than 1ms, then a flow sending fewer than
1000 packets per second will not benefit from GRO. So yes, 12.5 Mbit/s
would be the threshold.

By the way, when TCP timestamps are used and the hosts are Linux
machines with HZ=1000, current GRO cannot coalesce the packets of such
slow flows anyway, because their TCP options differ from one segment to
the next. (So it would not be useful to try a sojourn time bigger than
1ms.)
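
For reference, the reason the timestamps get in the way: GRO only
coalesces two segments when their TCP option bytes are identical, and
tcp_gro_receive() essentially compares the whole option area word by
word. The helper below is a simplified illustration of that check, not
the exact kernel code, and tcp_options_differ() is a made-up name:

/* Simplified view of the option check done in tcp_gro_receive().
 * th2 : TCP header of the skb currently held by GRO
 * th  : TCP header of the segment that just arrived
 * Any difference in the option bytes -- for instance a TSval that moved
 * by one tick between two segments of a slow flow -- yields a non zero
 * result, and the held skb is flushed instead of being coalesced.
 */
static u32 tcp_options_differ(const struct tcphdr *th2,
			      const struct tcphdr *th)
{
	unsigned int thlen = th->doff * 4;
	unsigned int i;
	u32 flush = 0;

	for (i = sizeof(*th); i < thlen; i += 4)
		flush |= *(const u32 *)((const u8 *)th + i) ^
			 *(const u32 *)((const u8 *)th2 + i);

	return flush;
}

With HZ=1000, the TSval of a flow slower than 1000 packets per second
changes on every segment, so those segments can never be merged no
matter how long they sit in the GRO unit.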