public inbox for linux-kernel@vger.kernel.org
From: Yu Chen <yu.c.chen@intel.com>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: x86@kernel.org, Ingo Molnar <mingo@redhat.com>,
	"H. Peter Anvin" <hpa@zytor.com>, Rui Zhang <rui.zhang@intel.com>,
	LKML <linux-kernel@vger.kernel.org>,
	"Rafael J. Wysocki" <rjw@rjwysocki.net>,
	Len Brown <lenb@kernel.org>,
	Dan Williams <dan.j.williams@intel.com>,
	Christoph Hellwig <hch@lst.de>,
	Peter Zijlstra <peterz@infradead.org>
Subject: Re: [PATCH 4/4][RFC v2] x86/apic: Spread the vectors by choosing the idlest CPU
Date: Wed, 6 Sep 2017 12:13:38 +0800	[thread overview]
Message-ID: <20170906041337.GC23250@localhost.localdomain> (raw)
In-Reply-To: <alpine.DEB.2.20.1709031642030.2351@nanos>

Thanks for looking at this.
(Please bear with my slow response, as I need to check the related
code before replying.)
On Sun, Sep 03, 2017 at 08:17:04PM +0200, Thomas Gleixner wrote:
> On Fri, 1 Sep 2017, Chen Yu wrote:
> 
> > This is the major logic to spread the vectors on different CPUs.
> > The main idea is to choose the 'idlest' CPU which has assigned
> > the least number of vectors as the candidate/hint for the vector
> > allocation domain, in the hope that the vector allocation domain
> > could leverage this hint to generate the corresponding cpumask.
> > 
> > One of the requirements to do this vector spreading work comes from the
> > following hibernation problem found on a 16-core server:
> > 
> > CPU 31 disable failed: CPU has 62 vectors assigned and there
> > are only 0 available.
> > 
> > 2 issues were found after investigation:
> > 
> > 1. The network driver has declared many vector resources via
> >    pci_enable_msix_range(); say, this driver might want to
> >    reserve 6 per logical CPU, so there would be 192 of them.
> > 2. Besides, most of the vectors reserved by this driver are assigned
> >    on CPU0 due to the current code strategy, so there would be
> >    insufficient slots left on CPU0 to receive any migrated IRQs
> >    during CPU offline.
> > 
> > In theory, after commit c5cb83bb337c ("genirq/cpuhotplug: Handle
> > managed IRQs on CPU hotplug"), issue 1 has been solved by the
> > managed interrupt mechanism for multi queue devices, which no longer
> > migrate these interrupts. It simply shuts them down and restarts
> > them when the CPU comes back online.
> > 
> > However, according to the implementation of this network driver,
> > the commit mentioned above does not fully fix issue 1.
> > Here's the framework of the network driver:
> > 
> > step 1. Reserve enough IRQ vectors and the corresponding IRQs.
> > step 2. If the network is activated, invoke request_irq() to
> >         register the handler.
> > step 3. Invoke set_affinity() to spread the IRQs onto different
> >         CPUs, thus to spread the vectors too.
> >
> > The problem is, if the network cable is not connected, steps 2
> > and 3 do not get invoked, so the IRQ vectors are not spread
> > across different CPUs and remain allocated on CPU0. As a result
> > the CPUs still hit the offline failure.
> 
> So the proper solution for this network driver is to switch to managed
> interrupts instead of trying to work around it in some other place. It's
> using the wrong mechanism - end of story.
> 
> Why are you insisting on implementing a band aid for this particular driver
> instead of fixing the underlying problem of that driver which requires to
> have 32 queues and interrupts open even if only a single CPU is online?
> 
I agree that the driver could be rewritten, but that might take some
time, so in the meantime I'm also looking at other possible
optimizations.
> > Previously there were some discussion in the thread [1] about the
> > vector spread, and here are the suggestion from Thomas:
> > 
> > Q1:
> >     Rewrite that allocation code from scratch, use per cpu bitmaps,
> >     so we can do fast search over masks and keep track of
> >     the vectors which are used/free per cpu.
> > A1:
> >     A per-cpu bitmap was once considered, but it might not be
> >     as fast as the solution proposed here. That is, if we introduce the
> >     per-cpu bitmap, we need to count the number of vectors assigned on
> >     each CPU with cpumask_weight() and sort the results to find the
> >     CPUs that have the least number of vectors assigned. By contrast,
> >     if we introduce one bitmap per vector count, we can get the
> >     'idlest' CPU at a cost of nearly O(1).
> 
> The weight accounting with the cpumasks is an orthogonal issue to the per
> cpu vector bitmaps.
> 
> Once you selected a CPU the current code still loops in circles instead of
> just searching a bitmap.
Yes, I agree.
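The bitmap search Thomas describes could look roughly like the
following userspace sketch. All names here (NR_VECTORS, vector_bitmap,
the helpers) are illustrative, not the kernel's actual data structures;
the point is only that a word-wise scan replaces the per-vector loop:

```c
#include <stdint.h>

/* Hypothetical sketch: one "used vectors" bitmap per CPU, searched
 * word by word instead of looping over every vector. */
#define NR_VECTORS    256
#define BITS_PER_WORD 64
#define NR_WORDS      (NR_VECTORS / BITS_PER_WORD)

static uint64_t vector_bitmap[NR_WORDS];   /* bit set = vector in use */

static void reserve_vector(int v)
{
    vector_bitmap[v / BITS_PER_WORD] |= 1ULL << (v % BITS_PER_WORD);
}

/* Return the lowest free vector, or -1 if none: at most NR_WORDS word
 * tests plus one count-trailing-zeros, instead of a 256-step loop. */
static int find_free_vector(void)
{
    for (int i = 0; i < NR_WORDS; i++) {
        uint64_t free = ~vector_bitmap[i];
        if (free)
            return i * BITS_PER_WORD + __builtin_ctzll(free);
    }
    return -1;
}
```

For example, after reserving vectors 0-31 (the trap range), the search
finds vector 32 in a single word test.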
> 
> And if the required affinity mask needs more than one CPU then this is
> still not giving you the right answer. You assign perhaps a vector to the
> least busy CPU, but the other CPUs which happen to be in the mask are going
> to be randomly selected.
Yes, for a multi-CPU cpumask, it might depend on how the vector
allocation domain behaves.
> 
> I'm not against making the vector allocation better, but certainly not by
> adding yet more duct tape to something which is well known as one of the
> dumbest search algorithms on the planet with a worst case of
> 
>        nvectors * nr_online_cpus * nr_online_cpus_in_affinity_mask
> 
> and your mechanism nests another loop of potentially NR_VECTORS into
> that. Which is pointless as the actual assignable vector space is
> smaller. While x86 has 256 vectors the assignable vector space is way
> smaller and non-contiguous:
> 
> Vectors   0- 31 are reserved for CPU traps and exceptions
> Vectors 239-255 are reserved by the kernel for IPIs, local timer etc.
> Vector	     32 can be reserved by the kernel for the reboot interrupt
> Vector	     96 can be reserved by the kernel for INT80
> 
> So the actually usable array size is between 205 and 207.
> 
> Aside of that the code is incorrect vs. the percpu accounting. Remove the
> limit in the update function and see what happens. While the limit is
> necessary in general, it should at least warn and yell whenever the
> accounting goes out of bounds. And if that happens, then the whole thing is
> completely useless simply because the numbers are just wrong.
> 
Ok.
> >     One scenario of this patch can be illustrated below:
> > 
> >     a) After initialization, vector_mask[0] is set with {CPU0...CPU31},
> >        other vector_mask[i] remain zero, which means that, CPU0...CPU31
> >        have zero vectors assigned.
> >     b) During driver probe, CPU0 is chosen to be assigned with one vector.
> >        Thus vector_mask[0] becomes {CPU1...CPU31}, and vector_mask[1] becomes
> >        {CPU0}.
> >     c) The next time the system tries to assign another vector, it searches from
> >        vector_mask[0]; because it is non-empty, CPU1 is chosen to place
> >        the vector. Thus vector_mask[0] becomes {CPU2...CPU31}, and vector_mask[1]
> >        becomes {CPU0, CPU1}.
> >     d) and so on, the vectors assignment will spread on different CPUs.
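The scenario in a)-d) above can be simulated in plain userspace C.
This is a hypothetical sketch, not the patch's actual code: with 32
CPUs a uint32_t stands in for a cpumask, vector_mask[n] holds the CPUs
that currently have exactly n vectors assigned, and the cap is
arbitrarily small:

```c
#include <stdint.h>

#define NR_CPUS     32
#define MAX_PER_CPU 8   /* small cap, just for the sketch */

/* vector_mask[n] = set of CPUs with exactly n vectors assigned */
static uint32_t vector_mask[MAX_PER_CPU + 1];

static void init_masks(void)
{
    vector_mask[0] = 0xffffffffu;  /* all CPUs start with 0 vectors */
    for (int n = 1; n <= MAX_PER_CPU; n++)
        vector_mask[n] = 0;
}

/* Pick a CPU from the lowest non-empty bucket (the "idlest" CPU)
 * and move it one bucket up; nearly O(1) in the common case. */
static int assign_one_vector(void)
{
    for (int n = 0; n < MAX_PER_CPU; n++) {
        if (vector_mask[n]) {
            int cpu = __builtin_ctz(vector_mask[n]);
            vector_mask[n]     &= ~(1u << cpu);
            vector_mask[n + 1] |=  (1u << cpu);
            return cpu;
        }
    }
    return -1;  /* every CPU is at the cap */
}
```

Successive calls return CPU0, CPU1, ... CPU31, then wrap around to
CPU0 once every CPU holds one vector, matching steps b)-d) above.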
> 
> So this works for your particular network driver scenario, which is the
> wrong example as this driver just needs to be reworked to not expose that
> issue.
> 
> The reality of affinity requirements is not that simple. It's very much
> device dependent and making it mandatory can have negative side effects.
> 
Yes, this patch just 'cures' a special case.
> > Q2:
> >     "respect the allocation domains"
> > A2:
> >     This solution provides a CPU as the param for the vector allocation domains,
> 
> Solution? This is a crude hack which "solves" one particular problem in the
> wrong way.
> 
> >     in the hope that the domains can leverage this hint to generate the cpumask
> 
> Hope is never a good engineering principle.
> 
> >     with fewer vectors assigned. But yes, the CPU we select upfront and feed
> >     into apic->vector_allocation_domain() might just be rejected because that
> >     CPU does not belong to the domain; anyway this should somehow spread the
> >     vectors, although I have not found a graceful solution to this.
> >     BTW, the vector allocation domain is default_vector_allocation_domain()
> >     in our case, thus it works.
> 
> Now try that with the logical delivery modes which allow multi-CPU
> assignments and the cluster mode thereof.
Yes, I find it a little hard to reason about the vector domain, as it
looks independent of the current vector allocation process.
> 
> > Q3:
> >     "set a preference for the node on which the interrupt was allocated"
> > A3:
> >     This can be achieved by further optimization in the following way:
> >     a) Also record the number of vectors used per node.
> >     b) Each time invoking __assign_irq_vector, adjust the input mask
> >        by cpumask_and(mask, mask, node_mask_with_least_vectors())
> 
> I'm not looking forward to that heuristics and the next loop inside the
> loops of loops.
> 
> So no, this is not going to happen. The current implementation sucks and
> needs a rewrite.
> 
> I already started to look into that, but started at the low level end of
> it. See the IDT cleanup series in tip:x86/apic. There is much more voodoo
> in the vector management code to clean up before we can actually talk about
> making the assignment algorithm smarter.
Ok, I'm pulling from tip:x86/apic.
> 
> I'll send out a separate mail raising a few points to discuss before we are
> actually starting to look into magic spreading functionality.
> 
For the vector calculation question you asked, I'll reply in that thread.
Thanks,
	Yu
> Thanks,
> 
> 	tglx


Thread overview: 21+ messages
2017-09-01  5:03 [PATCH 0/4][RFC v2] x86/irq: Spread vectors on different CPUs Chen Yu
2017-09-01  5:03 ` [PATCH 1/4][RFC v2] x86/apic: Extend the defination for vector_irq Chen Yu
2017-09-01  5:04 ` [PATCH 2/4][RFC v2] x86/apic: Record the number of vectors assigned on a CPU Chen Yu
2017-09-01  5:04 ` [PATCH 3/4] x86/apic: Introduce the per vector cpumask array Chen Yu
2017-09-01  5:04 ` [PATCH 4/4][RFC v2] x86/apic: Spread the vectors by choosing the idlest CPU Chen Yu
2017-09-03 18:17   ` Thomas Gleixner
2017-09-03 19:18     ` RFD: x86: Sanitize the vector allocator Thomas Gleixner
2017-09-05 22:57     ` [PATCH 4/4][RFC v2] x86/apic: Spread the vectors by choosing the idlest CPU Thomas Gleixner
2017-09-06  4:34       ` Yu Chen
2017-09-06  8:03         ` Thomas Gleixner
2017-09-07  2:52           ` Yu Chen
2017-09-07  5:54             ` Thomas Gleixner
2017-09-07  8:34               ` Yu Chen
2017-09-07  9:45                 ` Thomas Gleixner
2017-09-06  4:13     ` Yu Chen [this message]
2017-09-06  6:15       ` Christoph Hellwig
2017-09-06 17:46         ` Dan Williams
2017-09-07  2:57           ` Yu Chen
2017-09-07  5:59           ` Thomas Gleixner
2017-09-07  6:23             ` Dan Williams
2017-09-07  6:59               ` Thomas Gleixner
