All of lore.kernel.org
 help / color / mirror / Atom feed
From: Arthur Kepner <akepner@sgi.com>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org, x86@kernel.org
Subject: Re: [PATCH] x86/irq: assign vectors from numa_node
Date: Thu, 16 Dec 2010 14:34:42 -0800	[thread overview]
Message-ID: <20101216223442.GE20481@sgi.com> (raw)
In-Reply-To: <alpine.LFD.2.00.1012101051290.2653@localhost6.localdomain6>

On Fri, Dec 10, 2010 at 11:55:12AM +0100, Thomas Gleixner wrote:
> On Fri, 3 Dec 2010, Arthur Kepner wrote:
> > 
> > Several drivers (e.g., mlx4_core) do something similar to:
> > 
> > 	err = pci_enable_msix(pdev, entries, num_possible_cpus());
> 
> Which is the main problem and this patch just addresses the
> symptoms. As Eric pointed out earlier, this needs to be solved
> differently.
> 
> There is absolutely no point assigning 4096 interrupts to a single
> node in the first place. 

I'm not arguing that it's a good idea for a driver to request so 
many resources. But some do. And what we do now is even worse than 
assigning all the interrupts to a single node.

> Especially not, if we end up using only a few
> of them in the end. And those you use are not necessarily on that very
> node.

OK. Eric mentioned in a related thread that vector allocation 
might be deferred until request_irq(). That sounds like a good 
idea. But unless we change the way initial vector assignment is 
done, it just defers the problem (assuming that request_irq() 
is done for every irq allocated in create_irq_nr()).

Using the numa_node of the device to do the initial vector 
assignment seems like a reasonable default choice, no? 

> 
> I understand, that you want to work around your problem at hand, but
> I'm pushing back on it, as it's a crappy solution which just ignores
> the underlying problem. You don't even mention it in your changelog
> that this is a work around and not a solution.
> 
> No, you just ignore that fact and the requests to look at the
> underlying problem.
> 
> > which takes us down this code path:
> > 
> > 	pci_enable_msix
> > 	native_setup_msi_irqs
> > 	create_irq_nr
> > 	__assign_irq_vector
> > 
> > __assign_irq_vector() preferentially uses vectors from low-numbered 
> > CPUs. On a system with a large number (>256) CPUs this can result in 
> > a CPU running out of vectors, and subsequent attempts to assign an 
> > interrupt to that CPU will fail.
> 
> I agree, that __assign_irq_vector() should honour the request for a
> node, but I don't agree, that we should magically spread stuff on
> whatever node we find, when the node ran out of vectors.
> 
> There is probably a reason, why you want an interrupt on a specific
> node and just silently pushing it somewhere else feels very wrong.
> 
> This needs to be solved from ground up with proper rules about failure
> modes and fallback decisions.
> 
> > +static int
> > +__assign_irq_vector_node(int irq, struct irq_cfg *cfg,
> > +			 const struct cpumask *mask, int node)
> > +{
> > +	int err = -EAGAIN;
> > +	int cpu, best_cpu = -1, min_vector_count = NR_VECTORS;
> > +
> > +	for_each_cpu_and(cpu, cpumask_of_node(node), mask) {
> > +		/* find the 'best' CPU to take this vector -
> > +		 * the one with the fewest assigned vectors is
> > +		 * considered 'best' */
> > +		int i, vector_count = 0;
> > +
> > +		if (!cpu_online(cpu))
> > +			continue;
> > +
> > +		for (i = FIRST_EXTERNAL_VECTOR + VECTOR_OFFSET_START;
> > +		     i < NR_VECTORS ; i++)
> > +			if (per_cpu(vector_irq, cpu)[i] != -1)
> > +				vector_count++;
> 
> Instead of having proper book keeping of vectors, we loop through nvec
> * ncpus to figure that out ? And of course we run that loop with
> interrupts disabled and vector lock held.
> 
> > +		if (vector_count < min_vector_count) {
> > +			min_vector_count = vector_count;
> > +			best_cpu = cpu;
> > +		}
> > +	}
> > +
> > +	if (best_cpu >= 0) {
> > +		cpumask_var_t tmp_mask;
> > +
> > +		if (!alloc_cpumask_var(&tmp_mask, GFP_ATOMIC))
> 
> Bah. I made create_irq_nr() atomic region small with lots of effort
> and got rid of all GFP_ATOMIC allocations.
> 
> > +			return -ENOMEM;
> > +
> > +		cpumask_clear(tmp_mask);
> > +		cpumask_set_cpu(best_cpu, tmp_mask);
> > +		err = __assign_irq_vector(irq, cfg, tmp_mask);
> > +
> > +		free_cpumask_var(tmp_mask);
> > +	}
> > +
> > +	return err;
> > +}
> > +
> >  int assign_irq_vector(int irq, struct irq_cfg *cfg, const struct cpumask *mask)
> >  {
> >  	int err;
> > @@ -1128,6 +1171,39 @@ int assign_irq_vector(int irq, struct irq_cfg *cfg, const struct cpumask *mask)
> >  	return err;
> >  }
> >  
> > +static int
> > +assign_irq_vector_node(int irq, struct irq_cfg *cfg,
> > +		       const struct cpumask *mask, int node)
> > +{
> > +	int err;
> > +	unsigned long flags;
> > +
> > +	if (node == NUMA_NO_NODE)
> > +		return assign_irq_vector(irq, cfg, mask);
> > +
> > +	raw_spin_lock_irqsave(&vector_lock, flags);
> > +	err = __assign_irq_vector_node(irq, cfg, mask, node);
> > +	raw_spin_unlock_irqrestore(&vector_lock, flags);
> > +
> > +	if (err != 0)
> > +		/* uh oh - try again w/o specifying a node */
> > +		return assign_irq_vector(irq, cfg, mask);
> > +	else {
> > +		/* and set the affinity mask so that only
> > +		 * CPUs on 'node' will be used */
> > +		struct irq_desc *desc = irq_to_desc(irq);
> > +		unsigned long flags;
> > +
> > +		raw_spin_lock_irqsave(&desc->lock, flags);
> > +		cpumask_and(desc->irq_data.affinity, cpu_online_mask,
> > +			    cpumask_of_node(node));
> 
> Which leaves us with an empty affinity mask, when the last cpu of that
> node just went offline before locking desc->lock. Brilliant.
> 
> > +		desc->status |= IRQ_AFFINITY_SET;
> > +		raw_spin_unlock_irqrestore(&desc->lock, flags);
> 
> Aside of that low level arch code is not supposed to fiddle in
> irq_desc at will excatly for such reasons. 
> 
> As you might have noticed, i'm working on removing access to irq_desc
> from random places and I spent quite some effort to clean up the whole
> irq mess. No way to put crap like this in again, so I can twist my
> brain around it next week how to clean it up.
> 
> Thanks,
> 
> 	tglx

-- 
Arthur

  reply	other threads:[~2010-12-16 22:34 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-12-03 20:53 [PATCH] x86/irq: assign vectors from numa_node Arthur Kepner
2010-12-10 10:55 ` Thomas Gleixner
2010-12-16 22:34   ` Arthur Kepner [this message]
2010-12-17  9:04     ` Thomas Gleixner
2010-12-22  0:40       ` Arthur Kepner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20101216223442.GE20481@sgi.com \
    --to=akepner@sgi.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=tglx@linutronix.de \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.