public inbox for linux-kernel@vger.kernel.org
* Re: userspace irq balancer
@ 2003-05-21 18:28 Keith Mannthey
  2003-05-21 19:19 ` userspace irq balancer William Lee Irwin III
  2003-05-21 23:39 ` userspace irq balancer Keith Mannthey
  0 siblings, 2 replies; 6+ messages in thread
From: Keith Mannthey @ 2003-05-21 18:28 UTC
  To: Martin J. Bligh
  Cc: davem, habanero, haveblue, wli, arjanv, pbadari,
	linux-kernel@vger.kernel.org, gh, johnstul, jamesclv,
	Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 291 bytes --]

  Here is the patch to turn kirqd into a config option if it is really
needed.  I don't see why the noirqbalance functionality isn't enough for
now.  Is there anything currently keeping a userspace irq balancer from
working as 2.5 stands today?  It doesn't look like it to me.

Keith      
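
For context, the userspace side of this is small.  On a kernel booted
with noirqbalance, a balancer can steer an interrupt by writing a hex
cpu mask to /proc/irq/<irq>/smp_affinity.  The following is a minimal
sketch in C, not code from the patch; the helper name
write_irq_affinity and its proc_root parameter are assumptions made for
illustration.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write a cpu bitmask for one irq in the hex format that
 * /proc/irq/<irq>/smp_affinity accepts.  proc_root is normally "/proc";
 * it is a parameter only so the sketch can be pointed at a scratch tree.
 * Returns 0 on success, -1 on failure.
 */
int write_irq_affinity(const char *proc_root, unsigned int irq,
		       unsigned long mask)
{
	char path[256];
	FILE *f;
	int n;

	assert(proc_root != NULL);

	/* An empty mask would leave the irq with no cpu to go to. */
	if (mask == 0)
		return -1;

	n = snprintf(path, sizeof(path), "%s/irq/%u/smp_affinity",
		     proc_root, irq);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;

	/* Report failure if either the write or the close fails. */
	n = fprintf(f, "%lx\n", mask);
	if (fclose(f) != 0 || n < 0)
		return -1;
	return 0;
}
```

A balancer would call something like write_irq_affinity("/proc", irq,
mask) for each interrupt it wants to move.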


[-- Attachment #2: config-irq-2.5.68 --]
[-- Type: text/x-patch, Size: 1618 bytes --]

diff -urN linux-2.5.68/arch/i386/Kconfig linux-2.5.68-config-irq/arch/i386/Kconfig
--- linux-2.5.68/arch/i386/Kconfig	Sat Apr 19 19:48:52 2003
+++ linux-2.5.68-config-irq/arch/i386/Kconfig	Thu Apr 24 17:04:47 2003
@@ -758,6 +758,14 @@
 
 	  See <file:Documentation/mtrr.txt> for more information.
 
+config IRQBALANCE
+	bool "Enable kernel irq balancing"
+	depends on SMP
+	default y
+	help
+	  Saying yes (the default) lets the kernel do irq load balancing.
+	  Saying no keeps the kernel from doing irq load balancing.
+
 config HAVE_DEC_LOCK
 	bool
 	depends on (SMP || PREEMPT) && X86_CMPXCHG
diff -urN linux-2.5.68/arch/i386/kernel/io_apic.c linux-2.5.68-config-irq/arch/i386/kernel/io_apic.c
--- linux-2.5.68/arch/i386/kernel/io_apic.c	Sat Apr 19 19:49:09 2003
+++ linux-2.5.68-config-irq/arch/i386/kernel/io_apic.c	Thu Apr 24 17:05:30 2003
@@ -263,7 +263,7 @@
 	spin_unlock_irqrestore(&ioapic_lock, flags);
 }
 
-#if defined(CONFIG_SMP)
+#if defined(CONFIG_IRQBALANCE)
 # include <asm/processor.h>	/* kernel_thread() */
 # include <linux/kernel_stat.h>	/* kstat */
 # include <linux/slab.h>		/* kmalloc() */
@@ -654,8 +654,6 @@
 
 __setup("noirqbalance", irqbalance_disable);
 
-static void set_ioapic_affinity (unsigned int irq, unsigned long mask);
-
 static inline void move_irq(int irq)
 {
 	/* note - we hold the desc->lock */
@@ -667,9 +665,9 @@
 
 __initcall(balanced_irq_init);
 
-#else /* !SMP */
+#else /* !CONFIG_IRQBALANCE */
 static inline void move_irq(int irq) { }
-#endif /* defined(CONFIG_SMP) */
+#endif /* defined(CONFIG_IRQBALANCE) */
 
 
 /*

* Re: userspace irq balancer
  2003-05-21 18:28 userspace irq balancer Keith Mannthey
@ 2003-05-21 19:19 ` William Lee Irwin III
  2003-05-21 23:39 ` userspace irq balancer Keith Mannthey
  1 sibling, 0 replies; 6+ messages in thread
From: William Lee Irwin III @ 2003-05-21 19:19 UTC
  To: Keith Mannthey
  Cc: Martin J. Bligh, davem, habanero, haveblue, arjanv, pbadari,
	linux-kernel@vger.kernel.org, gh, johnstul, jamesclv,
	Andrew Morton

On Wed, May 21, 2003 at 11:28:41AM -0700, Keith Mannthey wrote:
>   Here is the patch to turn kirqd into a config option if it is really
> needed.  I don't see why the noirqbalance functionality isn't enough for
> now.  Is there anything currently keeping a userspace irq balancer from
> working as 2.5 stands today?  It doesn't look like it to me.
> Keith      

This will do, though my preference is to make the code actually
understand what DESTMOD means in IO-APIC RTE's and what DFR means
for local APIC's instead of the rather ridiculous workarounds for
not doing so currently present.

There are a couple of obstacles to doing this:

(1) There is no true mechanism for correlating IO-APIC's with the
	APIC buses corresponding to a given cluster for APIC. The
	assumption is largely global addressability a la xAPIC.
(2) DESTMOD is not a static property. Dynamically switching between
	logical and physical DESTMOD is fully possible and allows a
	somewhat greater variety of cpu sets to be handled on APIC.

I'd also like for there to be validity checking and explicit error
returns from the affinity setting API.

I'm not entirely happy with the genapic bits. Basically the APIC
is relatively well-standardized, and I'd rather the point-by-point
"this codepath must differ" abstraction be built atop such an APIC
manipulation "library" as it were. For instance:

(1) cpu wakeup via NMI is possible on ordinary machines; INIT merely
	cannot address cpus above the limit of APIC's physical
	addressing scheme (which is 4 bits), and so NMI wakeup is
	required for pre-xAPIC machines with > 16 cpus or machines
	where only logical interrupts are routed across bus boundaries
	(be they APIC buses or memory buses).
(2) clustered hierarchical DFR is usable on single APIC bus boxen and
	xAPIC boxen with provisos for cluster ID's being misrouted.
(3) IO-APIC RTE formats are not magical properties of the machine;
	there is just logical and physical DESTMOD and representability
	of target cpu sets in the logical format and physical format
	and the dependence of the logical format on the cpus' DFR's.

These somewhat obvious observations imply to me that common code should
be used to manipulate the local APIC and IO-APIC and the machine-
specific code should choose its preferred modes when calling it, not
provide a private implementation or magic values to stuff into various
registers that specialize the APIC handling to a particular mode.

OTOH I don't see much (if any) chance of any of this happening since
"just barely works" suffices for most people's purposes and the
moderately large amount of work required to do all this ends up with
approximately zero functional difference in the end.

Thanks.


-- wli

* Re: userspace irq balancer
  2003-05-21 18:28 userspace irq balancer Keith Mannthey
  2003-05-21 19:19 ` userspace irq balancer William Lee Irwin III
@ 2003-05-21 23:39 ` Keith Mannthey
  2003-05-21 23:42   ` userspace irq balancer David S. Miller
                     ` (2 more replies)
  1 sibling, 3 replies; 6+ messages in thread
From: Keith Mannthey @ 2003-05-21 23:39 UTC
  To: Zwane Mwaikambo
  Cc: Martin J. Bligh, davem, habanero, haveblue, wli, arjanv, pbadari,
	linux-kernel@vger.kernel.org, gh, johnstul, jamesclv,
	Andrew Morton

> You can build masks of capable clusters easily, even for NUMAQ

  Only kinda.  Boxes with Hyperthreaded cpus have an odd ordering
scheme.  The BIOS is free to assign apicids at will to any cpu.  It is
not forced to any certain scheme.  On some hyperthreaded boxes the 2nd
cpu is on the same apicid cluster even though the cpu numbers are far
apart.
  This makes building meaningful apicid masks (more than one cpu) a bit
tricky.  For example, a user would have to know that cpus 1,2,9,10 were
on the same cluster, not (1,2,3,4) as you would expect.  Since the BIOS
can do what it will, it is hard to build masks of capable clusters
easily in all situations.

Keith  


* Re: userspace irq balancer
  2003-05-21 23:39 ` userspace irq balancer Keith Mannthey
@ 2003-05-21 23:42   ` David S. Miller
  2003-05-22  0:14   ` userspace irq balancer William Lee Irwin III
  2003-05-22  8:17   ` userspace irq balancer Arjan van de Ven
  2 siblings, 0 replies; 6+ messages in thread
From: David S. Miller @ 2003-05-21 23:42 UTC
  To: kmannth
  Cc: zwane, mbligh, habanero, haveblue, wli, arjanv, pbadari,
	linux-kernel, gh, johnstul, jamesclv, akpm

   From: Keith Mannthey <kmannth@us.ibm.com>
   Date: 21 May 2003 16:39:29 -0700

     For example, a user would have to know that cpus 1,2,9,10 were
   on the same cluster, not (1,2,3,4) as you would expect.
   
Nothing prevents us from exporting these mappings to userspace.
Just like we can export a "possible mask" for each interrupt
source.
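
As a rough illustration of the consuming side: assuming a per-irq
"possible mask" were exported as hex text in the same style as
smp_affinity (no such export exists in this thread; the format is an
assumption), userspace could parse it as below.  parse_cpu_mask is a
hypothetical helper.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Parse a hex cpu mask such as "0f\n" into *mask.
 * Returns 0 on success, -1 on malformed or out-of-range input;
 * *mask is only written on success.
 */
int parse_cpu_mask(const char *text, unsigned long *mask)
{
	char *end;
	unsigned long v;

	assert(text != NULL && mask != NULL);

	/* strtoul() would accept whitespace and a sign; require a digit. */
	if (!isxdigit((unsigned char)text[0]))
		return -1;

	errno = 0;
	v = strtoul(text, &end, 16);
	if (end == text || errno == ERANGE)
		return -1;

	/* Allow only an optional trailing newline after the digits. */
	if (*end != '\0' && *end != '\n')
		return -1;

	*mask = v;
	return 0;
}
```

With the cpus from the quoted example, the string "606" decodes to cpus
1, 2, 9 and 10; a balancer would AND its candidate mask with the
possible mask before writing smp_affinity.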

* Re: userspace irq balancer
  2003-05-21 23:39 ` userspace irq balancer Keith Mannthey
  2003-05-21 23:42   ` userspace irq balancer David S. Miller
@ 2003-05-22  0:14   ` William Lee Irwin III
  2003-05-22  8:17   ` userspace irq balancer Arjan van de Ven
  2 siblings, 0 replies; 6+ messages in thread
From: William Lee Irwin III @ 2003-05-22  0:14 UTC
  To: Keith Mannthey
  Cc: Zwane Mwaikambo, Martin J. Bligh, davem, habanero, haveblue,
	arjanv, pbadari, linux-kernel@vger.kernel.org, gh, johnstul,
	jamesclv, Andrew Morton

On Wed, May 21, 2003 at 04:39:29PM -0700, Keith Mannthey wrote:
>   Only kinda.  Boxes with Hyperthreaded cpus have an odd ordering
> scheme.  The BIOS is free to assign apicids at will to any cpu.  It is
> not forced to any certain scheme.  On some hyperthreaded boxes the 2nd
> cpu is on the same apicid cluster even though the cpu numbers are far
> apart.
>   This makes building meaningful apicid masks (more than one cpu) a bit
> tricky.  For example, a user would have to know that cpus 1,2,9,10 were
> on the same cluster, not (1,2,3,4) as you would expect.  Since the BIOS
> can do what it will, it is hard to build masks of capable clusters
> easily in all situations.

APIC issues can be dealt with very, very simply.
(1) for each cpu, report the physical APIC ID
(2) for each cpu, report the logical APIC ID
	(or if using only physical IPI's whatever the BIOS left in the LDR)
(3) report the DFR setting used globally across the system
(4) for each IO-APIC, report where it's attached (bus and node)
(5) report the contents of each IO-APIC RTE
	(5a) report the destination (interpretation depends on DESTMOD)
	(5b) report DESTMOD as either logical or physical
	(5c) report what it's connected to (irq, possibly driver name)
(6) report the APIC revision(s) (to distinguish APIC from xAPIC)
(7) report the IO-APIC revision(s) (for completeness)

The cpus a given IO-APIC RTE can address with physical DESTMOD can then
be determined from the APIC revision, and the cpus a given IO-APIC RTE
can address with logical DESTMOD can then be determined from the APIC
revision and the DFR setting (global and immutable, though the register
is per-local-APIC; there's no good way to switch over, and no reason to).

The logical CPU number used to refer to CPUs by the kernel bears no
relation to APIC ID's apart from arithmetic schemes artificially
imposed by the implementation. Fully tabulating the APIC ID's for all
the CPUs as in (1) and (2) is sufficient information for userspace to
construct and invert the relation as required to determine APIC cluster
membership. It is also possible to directly export APIC clusters as
sysfs objects and enumerate the cpus, though (IMHO) it's best to merely
expose the information the kernel acts on as it stands now and let
userspace infer the rest.
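
A sketch of the inversion, assuming the kernel exported a per-cpu table
of logical APIC IDs and that the clustered DFR model is in use, so that
the high nibble of each ID names the cluster.  The table layout and the
name cluster_cpu_mask are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * logical_id[cpu] is the 8-bit logical APIC ID of kernel cpu number
 * `cpu`.  Under the clustered model the high nibble is the cluster ID.
 * Returns a bitmask of the kernel cpu numbers that sit in `cluster`.
 */
unsigned long cluster_cpu_mask(const uint8_t *logical_id, size_t ncpus,
			       uint8_t cluster)
{
	unsigned long mask = 0;
	size_t cpu;

	assert(logical_id != NULL);
	assert(ncpus <= sizeof(mask) * 8);	/* one bit per cpu must fit */

	for (cpu = 0; cpu < ncpus; cpu++)
		if ((logical_id[cpu] >> 4) == cluster)
			mask |= 1UL << cpu;

	return mask;
}
```

Given a made-up table in which cpus 1, 2, 9 and 10 carry IDs 0x11,
0x12, 0x14 and 0x18, asking for cluster 1 yields the mask 0x606, which
is the grouping a user could not have guessed from the cpu numbers
alone.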

In principle one could also export the ability to set IO-APIC RTE's
DESTMOD bits on the fly, given proper validity checks for
addressability and the like (I'm assuming one would rather barf than
deadlock the box even if some additional code were required). The one
box where it matters doesn't care to use irqbalance anyway, though.

Basically, spill your guts as to what you've got and let userspace
think about how to do the right thing with it.


-- wli

* Re: userspace irq balancer
  2003-05-21 23:39 ` userspace irq balancer Keith Mannthey
  2003-05-21 23:42   ` userspace irq balancer David S. Miller
  2003-05-22  0:14   ` userspace irq balancer William Lee Irwin III
@ 2003-05-22  8:17   ` Arjan van de Ven
  2 siblings, 0 replies; 6+ messages in thread
From: Arjan van de Ven @ 2003-05-22  8:17 UTC
  To: Keith Mannthey
  Cc: Zwane Mwaikambo, Martin J. Bligh, davem, habanero, haveblue, wli,
	arjanv, pbadari, linux-kernel@vger.kernel.org, gh, johnstul,
	jamesclv, Andrew Morton

On Wed, May 21, 2003 at 04:39:29PM -0700, Keith Mannthey wrote:
>   This makes building meaningful apicid masks (more than one cpu) a bit
> tricky.  For example, a user would have to know that cpus 1,2,9,10 were
> on the same cluster, not (1,2,3,4) as you would expect.  Since the BIOS
> can do what it will, it is hard to build masks of capable clusters
> easily in all situations.

With sysfs the kernel can export some topology info; iirc that was
desired anyway for other HPC applications?

