From: Neil Horman
To: Ben Hutchings
Cc: Stephen Hemminger, netdev@vger.kernel.org, davem@davemloft.net,
	Thomas Gleixner
Subject: Re: net: Automatic IRQ siloing for network devices
Date: Sun, 17 Apr 2011 21:08:44 -0400
Message-ID: <20110418010844.GA4376@neilslaptop.think-freely.org>
In-Reply-To: <1303065539.5282.938.camel@localhost>
References: <1302898677-3833-1-git-send-email-nhorman@tuxdriver.com>
	<1302908069.2845.29.camel@bwh-desktop>
	<20110416015938.GB2200@neilslaptop.think-freely.org>
	<20110416091704.4fa62a50@nehalam>
	<20110417172010.GA3362@neilslaptop.think-freely.org>
	<1303065539.5282.938.camel@localhost>

On Sun, Apr 17, 2011 at 07:38:59PM +0100, Ben Hutchings wrote:
> On Sun, 2011-04-17 at 13:20 -0400, Neil Horman wrote:
> > On Sat, Apr 16, 2011 at 09:17:04AM -0700, Stephen Hemminger wrote:
> [...]
> > > My gut feeling is that:
> > > * kernel should default to a simple static sane irq policy without
> > >   user space. This is especially true for multi-queue devices where
> > >   the default puts all IRQs on one cpu.
> > >
> > That's not how it currently works, AFAICS. The default kernel policy is
> > currently that cpu affinity for any newly requested irq is all cpus. Any
> > restriction beyond that is the purview and doing of userspace
> > (irqbalance or manual affinity setting).
>
> Right. Though it may be reasonable for the kernel to use the hint as
> the initial affinity for a newly allocated IRQ (not sure quite how we
> determine that).
>
I understand what you're saying here, but I'm having a hard time
reconciling the two notions. As it stands, affinity_hint gets set by a
single function call in the kernel (irq_set_affinity_hint), which is called
by drivers that wish to guide irqbalance's behavior (currently only ixgbe
does this). The behaviors a driver is capable of guiding, however, are
either overly simple (ixgbe just tells irqbalance to place each irq on a
separate cpu, which irqbalance would do anyway) or overly complex (forcing
policy into the kernel, which I tried to do with this patch series, but
based on the responses I've gotten here, that seems undesirable).

I personally like the idea of using affinity_hint to export various
guidelines to drive irqbalance's behavior, but I think I'm in the minority
on that. Given the responses here, it almost seems like we should do away
with affinity_hint altogether (or perhaps just continue to ignore it), and
instead export data relevant to balancing in various other locations, and
have irqbalance use that info directly.

> [...]
> > > * irqbalance should not do the hacks it does to try and guess at
> > >   network traffic.
> > >
> > Well, I can certainly agree with that, but I'm not sure what that looks
> > like.
> >
> > I could envision something like:
> >
> > 1) Use irqbalance to do a one-time placement of interrupts, keeping a
> > simple (possibly sub-optimal) policy, perhaps something like new irqs
> > get assigned to the least loaded cpu within the NUMA node of the device
> > the irq is originating from.
> > 2) Add a udev event on the addition of new interrupts, to rerun
> > irqbalance
>
> Yes, making irqbalance more (or entirely) event-driven seems like a good
> thing.
>
Yeah, I can do that; it shouldn't be hard. Do we need to worry about bursty
interfaces (i.e., irqs that have high volume counts for periods of time
followed by periods of low activity)? A periodic irqbalance can rebalance
behavior like that, whereas a one-shot cannot. Can we just assume that the
one-shot rebalance would give a 'good enough' placement in that situation?

> > 3) Add some exported information to identify processes that are high
> > users of network traffic, and correlate that usage to a rxq/irq that
> > produces that information (possibly some per-task proc file)
> >
> > 4) Create/expand an additional user space daemon to monitor the highest
> > users of network traffic on various rxq/irqs (as identified in (3)) and
> > restrict those processes' execution to those cpus which are on the same
> > L2 cache as the irq itself. The cpuset cgroup could be useful in doing
> > this, perhaps.
>
> I just don't see that you're going to get processes associated with
> specific RX queues unless you make use of flow steering.
>
> The 128-entry flow hash indirection table is part of Microsoft's
> requirements for RSS, so most multiqueue hardware is going to let you do
> limited flow steering that way.
>
Yes, agreed. I had presumed that any feature like this would assume RFS was
available and enabled. If it wasn't, we'd have to re-implement some metric
gathering code along with code to correlate sockets to rx queues and irqs.
That would be 90% of the RFS code and this patch series :)

> > Actually, as I read back to myself, that actually sounds kind of good
> > to me. It keeps all the policy for this in user space, and minimizes
> > what we have to add to the kernel to make it happen (some process
> > information in /proc and another udev event). I'd like to get some
> > feedback before I start implementing this, but I think this could be
> > done. What do you think?
>
> I don't think it's a good idea to override the scheduler dynamically
> like this.
>
Why not? Not disagreeing here, but I'm curious as to why you think this is
bad. We already have several interfaces for doing this in user space
(cgroups and taskset come to mind). Nominally they are used directly by
sysadmins, and used sparingly for specific configurations. All I'm
suggesting is that we create a daemon to identify processes that would
benefit from running closer to the nics they are getting data from, and
restrict them to cpus that fit that benefit. If a sysadmin doesn't want
that behavior, they can stop the daemon, or change its configuration to
avoid including processes they don't want to move/restrict.

Thanks & Regards
Neil

> Ben.
>
> --
> Ben Hutchings, Senior Software Engineer, Solarflare
> Not speaking for my employer; that's the marketing department's job.
> They asked us to note that Solarflare product names are trademarked.
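
P.S. For reference, the in-kernel half of the hint mechanism discussed
above is just irq_set_affinity_hint() from linux/interrupt.h. A minimal
sketch of the sort of per-queue spreading ixgbe does (an illustration, not
ixgbe's actual code; the function and parameter names here are made up):

#include <linux/interrupt.h>
#include <linux/cpumask.h>

/* Spread one irq per rx queue across the online cpus, publishing each
 * assignment as a hint. irqbalance can read it back from
 * /proc/irq/<n>/affinity_hint and place the irq accordingly.
 */
static void example_set_queue_irq_hints(unsigned int *queue_irqs,
					int num_queues)
{
	int i, cpu = cpumask_first(cpu_online_mask);

	for (i = 0; i < num_queues; i++) {
		irq_set_affinity_hint(queue_irqs[i], get_cpu_mask(cpu));

		cpu = cpumask_next(cpu, cpu_online_mask);
		if (cpu >= nr_cpu_ids)
			cpu = cpumask_first(cpu_online_mask);
	}
}

The teardown path would need a matching irq_set_affinity_hint(irq, NULL)
before freeing each irq, as ixgbe does today.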
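
P.P.S. And for (4), the enforcement half of such a daemon needn't be more
than the sched_setaffinity() call taskset already makes. A rough sketch,
assuming the daemon has already identified the pid and built a cpu list
from something like the NIC's /sys/class/net/<dev>/device/local_cpulist
(the helper name and the pre-parsed cpu list are hypothetical):

#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>

/* Restrict an identified heavy network consumer to the cpus local to
 * the nic it receives from. Node-local cpus are a coarser stand-in for
 * "same L2 cache as the irq", but the mechanism is the same.
 */
static int pin_task_near_nic(pid_t pid, const int *cpus, int ncpus)
{
	cpu_set_t set;
	int i;

	CPU_ZERO(&set);
	for (i = 0; i < ncpus; i++)
		CPU_SET(cpus[i], &set);

	return sched_setaffinity(pid, sizeof(set), &set);
}

A cpuset cgroup would do the same job while persisting across forks, which
is probably what you'd actually want from a daemon like this.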