From mboxrd@z Thu Jan  1 00:00:00 1970
From: Bob Gilligan
Subject: Re: [PATCH 1/2] ipv4: Improve the scaling of the ARP cache for multicast destinations.
Date: Fri, 31 Aug 2012 12:21:28 -0700
Message-ID: <50410EB8.3040603@aristanetworks.com>
References: <50400B68.3060302@aristanetworks.com> <20120830.210628.365120808137655227.davem@davemloft.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Cc: netdev@vger.kernel.org
To: David Miller
Return-path:
Received: from mail-pb0-f46.google.com ([209.85.160.46]:36811 "EHLO mail-pb0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754678Ab2HaTVb (ORCPT ); Fri, 31 Aug 2012 15:21:31 -0400
Received: by pbbrr13 with SMTP id rr13so5328714pbb.19 for ; Fri, 31 Aug 2012 12:21:31 -0700 (PDT)
In-Reply-To: <20120830.210628.365120808137655227.davem@davemloft.net>
Sender: netdev-owner@vger.kernel.org
List-ID:

On 8/30/12 6:06 PM, David Miller wrote:
> From: Bob Gilligan
> Date: Thu, 30 Aug 2012 17:55:04 -0700
>
>> The mapping from multicast IPv4 address to MAC address can just as
>> easily be done at the time a packet is to be sent.  With this change,
>> we maintain one ARP cache entry for each interface that has at least
>> one multicast group member.  All routes to IPv4 multicast destinations
>> via a particular interface use the same ARP cache entry.  This entry
>> does not store the MAC address to use.  Instead, packets for multicast
>> destinations go to a new output function that maps the destination
>> IPv4 multicast address into the MAC address and forms the MAC header.
>
> Doing an ARP MC mapping on every packet is much more expensive than
> doing a copy of the hard header cache.
>
> I do not believe the memory consumption issue you use to justify this
> change is a real issue.
>
> If you are talking to that many multicast groups actively, you do want
> that many neighbour cache entries.
> This is not different from talking
> to nearly every IP address on a local /8 subnet.  You'll have a huge
> number of neighbour table entries in that case as well.
>
> If your the actual steady state number of active groups being spoken
> to is smaller, you can tune the neighbour cache thresholds to collect
> old less used entries more quickly.
>
> And this today is trivial, since routes no longer hold a reference
> to neighbour entries.  Therefore any neighbour entry whatsoever can
> be immediately reclaimed at any moment.

The scaling is multiplicative, effectively N-squared: in the worst case,
the number of neighbor cache entries required for your multicast
traffic is interfaces * groups.  100 interfaces and 100 groups could
generate 10,000 entries; 1,000 interfaces and 1,000 groups could
generate a million.

But the number of groups is hard to predict: it depends on the
applications in use and the multicast traffic they generate.  So it is
hard to come up with a "budget" for multicast entries in the neighbor
cache of a multicast router.  If you pick a gc_thresh3 that is less
than your working set, you'll end up thrashing the neighbor cache.  And
calls to neigh_forced_gc() are expensive: it performs a linear scan of
the entire neighbor cache.

Also, calls to neigh_forced_gc() triggered by a large number of
multicast entries will hurt the unicast entries sharing the neighbor
cache: it frees any unreferenced but resolved unicast entries, so any
subsequent packets for those destinations trigger a re-ARP.
Unnecessary re-ARPing is generally undesirable in a router.

The user who wants to avoid these problems is left with one
alternative: setting gc_thresh3 to a very large number based on a
worst-case estimate of the number of unicast plus multicast entries
required.  It seems simpler and more efficient to keep the multicast
entries out of the neighbor cache entirely.

Bob.
>
> I'm not fond of these patches, and adding yet more special cases to
> the neighbour layer, and therefore will not apply them.
>