From: Tom Herbert
Subject: Re: [PATCH v4 1/1] rps: core implementation
Date: Tue, 5 Jan 2010 23:56:41 -0800
To: Eric Dumazet
Cc: David Miller, Linux Netdev List
Message-ID: <65634d661001052356m494475ck5f6c47ab860fda35@mail.gmail.com>
In-Reply-To: <4B44258C.2050302@gmail.com>

Eric, thanks for the comments.

> Hmm, I cant see a race condition here, could you elaborate on this ?
> mask is local to this cpu, and we cannot re-enter a function that could
> change some bits under us (we are called from net_rx_action())
> If you believe there is a race condition, I suspect race is still there.
>
We're allowing bits in the map to be set in netif_rx outside of interrupt
context, and __smp_call_function_single needs to be called with interrupts
disabled.  I guess an alternative would be to copy the mask to a local
variable, then clear the mask and scan over the local copy.  Would there be
complaints about the stack space for a local cpumask_t variable?

>
> Did you tested with VLANS too ? (with/without hardware support)
>
No.  It looks like the vlan functions eventually call netif_receive_skb,
though... I suppose I could test that.

>>
>> Tom
>
> Excellent, but I suspect big win comes from using few NICS.
> (number_of(NICS) < num_online_cpus)
>
Yes, our primary motivation for developing this was a single NIC with a
single queue on a multicore system.

> (in the reverse case, possible contention on queue->csd)
>
Actually there should never be contention on that: only one cpu at a time
will access it, namely the one that successfully schedules napi on the
remote CPU from enqueue_to_backlog.

>>
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index 97873e3..7107b13 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -676,6 +676,29 @@ struct net_device_ops {
>>  };
>>
>>  /*
>> + * Structure for Receive Packet Steering.  Length of map and array of CPU IDs.
>> + */
>> +struct rps_map {
>> +        int len;
>> +        u16 map[0];
>> +};
>> +
>> +/*
>> + * Structure that contains the rps maps for various NAPI instances of a device.
>> + */
>> +struct dev_rps_maps {
>> +        int num_maps;
>> +        struct rcu_head rcu;
>> +        struct rps_map maps[0];
>> +};
>
> I feel uneasy about this, because of kmalloc() max size and rounding to
> power of two effects.

Other than some wasted memory, what is bad about that?
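To put rough numbers on that "wasted memory" point, here is a quick
back-of-the-envelope sketch (plain userspace C, not kernel code; the struct
layouts are copied from the patch above, struct rcu_head is approximated as
two pointers, and kmalloc is treated as simple power-of-two buckets):

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* Layouts copied from the patch; rcu_head approximated for sizing only. */
struct rps_map {
        int len;
        uint16_t map[];
};

struct dev_rps_maps {
        int num_maps;
        void *rcu[2];                /* stand-in for struct rcu_head */
        struct rps_map maps[];
};

int main(void)
{
        size_t max_rps_cpus = 256;   /* the MAX_RPS_CPUS cap from the patch */
        size_t map_size = sizeof(struct rps_map) +
                          max_rps_cpus * sizeof(uint16_t);
        int queues[] = { 1, 8, 16 }; /* example NAPI instance counts */

        for (size_t i = 0; i < sizeof(queues) / sizeof(queues[0]); i++) {
                size_t total = sizeof(struct dev_rps_maps) +
                               queues[i] * map_size;
                size_t bucket = 32;

                /* kmalloc rounds the request up to a power-of-two cache */
                while (bucket < total)
                        bucket *= 2;
                printf("%2d maps: need %5zu bytes, kmalloc object %5zu\n",
                       queues[i], total, bucket);
        }
        return 0;
}

With those assumptions that is roughly 540 bytes used out of a 1024-byte
object for a single-queue NIC, and about 8.3KB out of a 16KB object for a
device with 16 NAPI instances -- wasteful, but not obviously worse than that.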
> It also uses a single node in NUMA setups.
>
I suppose we should allocate from the device's NUMA node (a rough sketch of
that is at the bottom of this mail)... I'd really like to store the maps with
the napi instances themselves, and that would work well if the napi structs
were allocated on the NUMA node where the interrupt runs (I think this is in
Peter Waskiewicz's irq patch).  Unfortunately, that information is lost by
the time netif_receive_skb is called anyway.

>> +
>> +/* Bound number of CPUs that can be in an rps map */
>> +#define MAX_RPS_CPUS (num_possible_cpus() < 256 ? num_possible_cpus() : 256)
>> +
>> +/* Maximum size of RPS map (for allocation) */
>> +#define RPS_MAP_SIZE (sizeof(struct rps_map) + (MAX_RPS_CPUS * sizeof(u16)))
>> +
>> +/*
>>   *   The DEVICE structure.
>>   *   Actually, this whole structure is a big mistake.  It mixes I/O
>>   *   data with strictly "high-level" data, and it has to know about
>> @@ -861,6 +884,9 @@ struct net_device {
>>
>>        struct netdev_queue     rx_queue;
>>
>> +      struct dev_rps_maps     *dev_rps_maps;  /* Per-NAPI maps for
>> +                                                 receive packet steering */
>> +
>
> If you store rps_map pointer into napi itself, you could avoid this MAX_RPS_CPUS thing
> and really dynamically allocate the structure with the number of online cpus mentioned
> in the map.
>
> But yes, it makes store_rps_cpus() more complex :(
>
As I pointed out above, I would like to do that.  It would probably involve
adding a pointer to the napi_struct in the skb...

> This probably can be done later, this Version 4 of RPS looks very good, thanks !
> I am going to test it today on my dev machine before giving an Acked-by :)
>
Thanks!

> Reviewed-by: Eric Dumazet
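P.S. On the single-NUMA-node point above, this is the kind of thing I had in
mind -- a rough, untested sketch only (the helper name is made up; it just
reuses struct dev_rps_maps and RPS_MAP_SIZE from the patch):

/* Illustrative only, not part of the patch: allocate the per-device map
 * array on the NUMA node of the underlying device rather than on whatever
 * node store_rps_cpus() happens to run (-1 means no node preference).
 */
static struct dev_rps_maps *alloc_dev_rps_maps(struct net_device *dev,
                                               int num_maps)
{
        int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
        size_t size = sizeof(struct dev_rps_maps) + num_maps * RPS_MAP_SIZE;

        return kzalloc_node(size, GFP_KERNEL, node);
}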