* RE: [PATCH][2.4] generic cluster APIC support for systems with more than 8 CPUs
@ 2003-01-06 18:58 Protasevich, Natalie
  2003-01-08 14:53 ` Alan Cox
  0 siblings, 1 reply; 24+ messages in thread
From: Protasevich, Natalie @ 2003-01-06 18:58 UTC (permalink / raw)
  To: 'Alan Cox', Protasevich, Natalie
  Cc: 'William Lee Irwin III', 'Christoph Hellwig',
	'James Cleverdon', 'Pallipadi, Venkatesh',
	'Linux Kernel', 'Martin Bligh',
	'John Stultz', 'Nakajima, Jun',
	'Mallick, Asit K', 'Saxena, Sunil',
	Van Maren, Kevin, 'Andi Kleen', 'Hubert Mantel'

>One thing I will say. Your code would be a hell of a lot saner for
>merging if you mapped the ISA/Legacy IRQ's as 0-15 (to software) and the
>PCI ones to 16+ like everyone else does. That would kill a _lot_ of
>ifdefs and the IRQ0 corner case

Alan,

You were right: my new IRQ override code (done the way you suggested) is
getting much smaller now.
I got it down to ... one line :-)!

I have to say that either the Linux code has improved greatly or our
numerous BIOS changes helped (one or the other, maybe both), but in earlier
days I couldn't boot the system with a generic SMP kernel past the first delay
calibration (off of the PIC). That's why I had to tinker with IRQ0 and
do the rest of the ugly IRQ transformations you noticed earlier. The APIC and XTPR
issues are still there (I will wait for Venkatesh's patch), but I am only
concentrating on interrupts this time. Now it only stumbles on the IO-APIC
setup, which I can fix with one line of code... Unfortunately, this line
cannot be justified without bringing in "knowledge of the platform".

I am working with the MP table for now; the ACPI case gives me the same
results, but I haven't looked at it yet.

The problem is that the current IRQ override code handles everything perfectly,
except that it cannot handle the PCI IRQ range being placed over the ISA range:

static int pin_2_irq(int idx, int apic, int pin)
{
	.....
        switch (mp_bus_id_to_type[bus])
        {
                case MP_BUS_ISA: /* ISA pin */
                case MP_BUS_EISA:
                case MP_BUS_MCA:
                {
                        irq = mp_irqs[idx].mpc_srcbusirq;
                        break;
                }
                case MP_BUS_PCI: /* PCI pin */
                {
                        /*
                         * PCI IRQs are mapped in order
                         */
                        i = irq = 0;
                        while (i < apic)
                                irq += nr_ioapic_registers[i++];
                        /* Here, it just takes the pin (0-16 in our case) and returns it as the IRQ: */
                        irq += pin;
                        /*
                         * Knowing the above, and the fact that our first IO-APIC
                         * has the ISA range, I just shift it off the ISA range.
                         * NBP - my line; it could be "if (irq < 16)" instead.
                         */
                        if (!apic) irq += 16;   /* <==== my line */
                        break;
                }
                default:
                {
	....
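A minimal sketch of the mapping Alan suggests, under stated assumptions: ISA IRQs keep 0-15, and every PCI pin is numbered consecutively across IO-APICs as pin_2_irq() does, then moved up by a fixed offset of 16. NR_LEGACY_IRQS, pci_pin_to_irq() and irq_is_legacy() are hypothetical names, and nr_regs[] stands in for nr_ioapic_registers[]. One thing it illustrates: shifting only the first IO-APIC's pins (the "if (!apic)" line) can collide with the numbering of a second IO-APIC that also carries PCI pins, whereas a uniform offset cannot.

```c
/* Hypothetical: IRQ numbers reserved for ISA/legacy devices. */
#define NR_LEGACY_IRQS 16

/*
 * Map (IO-APIC index, pin) of a PCI interrupt to an IRQ number, keeping
 * 0..15 free for ISA. nr_regs[i] plays the role of nr_ioapic_registers[i].
 */
int pci_pin_to_irq(int apic, int pin, const int *nr_regs)
{
	int irq = 0;
	int i;

	/* PCI IRQs are numbered in order: skip pins of lower IO-APICs. */
	for (i = 0; i < apic; i++)
		irq += nr_regs[i];

	/* Shift every PCI IRQ, not just the first IO-APIC's, off 0..15. */
	return irq + pin + NR_LEGACY_IRQS;
}

/* With that layout, an IRQ below 16 can only be an ISA interrupt. */
int irq_is_legacy(int irq)
{
	return irq >= 0 && irq < NR_LEGACY_IRQS;
}
```

With two 16-pin IO-APICs, pin 3 on the first becomes IRQ 19 and pin 0 on the second becomes IRQ 32, so "irq 3" is unambiguous: it always means the ISA line.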


The original code makes assumptions itself... but it is a question of how
generic I want to be in handling our case.
I guess I could:

1) place pin_2_irq, and the routine that fixes the ACPI case (which I haven't
found yet), in our sub-arch, making those routines platform-defined;
2) try to fit into the generic case, which would take something like changing
mp_irqs on a per-platform basis, or finding something that fixes every possible
case of this kind. For example, in the IA64 case the IRQ code is arranged
pretty smartly: they made a one-to-one correspondence between vectors and IRQs,
then set up the ISA range within 0x20-0x2f, with all others going from 0x30 on,
so the two never mix. (BTW, you mentioned the x86 case once, but to me
its IRQ code looked identical to the i386 case, unless I missed something.)
3) ??? - what would you recommend? (Everyone's comments are VERY
welcome!)
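The IA64-style layout described in 2) can be sketched roughly as follows. This is a simplified illustration, not the IA64 code: the linear ISA map, the upper vector bound and all names are assumptions. Because the IRQ number and the vector are the same thing, the ISA block (0x20-0x2f) and the dynamically assigned block (0x30 and up) cannot overlap.

```c
/* Hypothetical constants modeled on the layout described above. */
#define ISA_VECTOR_BASE   0x20	/* ISA IRQs 0-15 -> vectors 0x20-0x2f */
#define NR_ISA_IRQS       16
#define FIRST_DYN_VECTOR  0x30	/* all other sources start here */
#define LAST_DYN_VECTOR   0xef	/* assumed upper bound */

static int next_vector = FIRST_DYN_VECTOR;

/* ISA IRQs get a fixed vector below 0x30. */
int isa_irq_to_vector(int irq)
{
	if (irq < 0 || irq >= NR_ISA_IRQS)
		return -1;
	return ISA_VECTOR_BASE + irq;
}

/* Every other interrupt source gets the next free vector from 0x30 on. */
int assign_dynamic_vector(void)
{
	if (next_vector > LAST_DYN_VECTOR)
		return -1;	/* out of vectors */
	return next_vector++;
}
```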

This is a crucial issue for the ES7000, since everything else seems to fit in
the sub-arch.
Another one that I am worried about is XTPR; hopefully someone is looking at
its implementation...

Thanks,

--Natalie

-----Original Message-----
From: Alan Cox [mailto:alan@lxorguk.ukuu.org.uk]
Sent: Wednesday, December 25, 2002 2:42 PM
To: Protasevich, Natalie
Cc: 'William Lee Irwin III'; 'Christoph Hellwig'; 'James Cleverdon';
'Pallipadi, Venkatesh'; 'Linux Kernel'; 'Martin Bligh'; 'John Stultz';
'Nakajima, Jun'; 'Mallick, Asit K'; 'Saxena, Sunil'; Van Maren, Kevin;
'Andi Kleen'; 'Hubert Mantel'
Subject: RE: [PATCH][2.4] generic cluster APIC support for systems with
more than 8 CPUs


One thing I will say. Your code would be a hell of a lot saner for
merging if you mapped the ISA/Legacy IRQ's as 0-15 (to software) and the
PCI ones to 16+ like everyone else does. That would kill a _lot_ of
ifdefs and the IRQ0 corner case


* RE: [PATCH][2.4] generic cluster APIC support for systems with more than 8 CPUs
@ 2002-12-26  2:18 Van Maren, Kevin
  2002-12-27 23:38 ` Alan Cox
  0 siblings, 1 reply; 24+ messages in thread
From: Van Maren, Kevin @ 2002-12-26  2:18 UTC (permalink / raw)
  To: 'Alan Cox ', Protasevich, Natalie
  Cc: ''William Lee Irwin III' ',
	''Christoph Hellwig' ',
	''James Cleverdon' ',
	''Pallipadi, Venkatesh' ',
	''Linux Kernel' ',
	''Martin Bligh' ',
	''John Stultz' ',
	''Nakajima, Jun' ',
	''Mallick, Asit K' ',
	''Saxena, Sunil' ', Van Maren, Kevin,
	''Andi Kleen' ',
	''Hubert Mantel' '

> One thing I will say. Your code would be a hell of a lot saner for
> merging if you mapped the ISA/Legacy IRQ's as 0-15 (to software) and the
> PCI ones to 16+ like everyone else does. That would kill a _lot_ of
> ifdefs and the IRQ0 corner case

If you have a suggestion on how to do that, I am sure we would
all be grateful to hear it.

Note that the reason the code _exists_ is because the interrupt
lines are physically connected to different pins on the APIC
than they are in "normal" systems.  The legitimacy of that
decision is not up for debate at this point -- that is the way
the system was built, and Linux needs to deal with it in
order to run on it.

So the PCI interrupts are in the table at IRQs < 16 (because
it tells which pin is being used), which makes it difficult
to tell whether a PCI or an ISA interrupt is being requested
if you tell the code "irq 3": if ISA, you need to use pin f(X),
while if PCI, you use pin X.

ACPI should have the ISA redirection information, but as
Natalie was saying, drivers hard-code the ISA vectors without
checking the ACPI info.

I suppose it would be possible to detect the ES7000 and have
the kernel re-write the PCI vectors (say, add 16 to them all)
and then re-mangle them based on a "< 16" criterion.
But I don't believe that is a "clean" solution either
(and would break when the ACPI isa redirection table is
properly used).
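Kevin's "add 16, then re-mangle" idea, sketched under assumptions: the ISA-to-pin table stands in for the platform's f(X) and its contents are made up, and pci_pin_to_irq16() and irq_to_pin() are hypothetical helpers. It also makes his caveat visible: the ISA branch depends entirely on that table, so it would have to stay in agreement with the ACPI ISA redirection entries once those are honored.

```c
#define PCI_IRQ_OFFSET 16

/*
 * Hypothetical f(X): ISA IRQ X -> IO-APIC pin. The values are made up;
 * a real table would come from the platform or from ACPI.
 */
static const int isa_irq_to_pin[16] = {
	16, 17, 18, 19, 20, 21, 22, 23,
	24, 25, 26, 27, 28, 29, 30, 31
};

/* Rewrite step: a PCI interrupt on pin X is exposed as IRQ X + 16. */
int pci_pin_to_irq16(int pin)
{
	return pin + PCI_IRQ_OFFSET;
}

/* Re-mangle step: recover the pin using the "< 16" criterion. */
int irq_to_pin(int irq)
{
	if (irq < 0)
		return -1;
	if (irq < PCI_IRQ_OFFSET)
		return isa_irq_to_pin[irq];	/* ISA: pin f(X) */
	return irq - PCI_IRQ_OFFSET;		/* PCI: pin X */
}
```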

Anyway, this was the reason for the "severe irq override"
comment by Natalie.

Kevin

* RE: [PATCH][2.4] generic cluster APIC support for systems with more than 8 CPUs
@ 2002-12-26  1:14 Protasevich, Natalie
  2002-12-27 23:39 ` Alan Cox
  0 siblings, 1 reply; 24+ messages in thread
From: Protasevich, Natalie @ 2002-12-26  1:14 UTC (permalink / raw)
  To: 'Alan Cox', Protasevich, Natalie
  Cc: 'William Lee Irwin III', 'Christoph Hellwig',
	'James Cleverdon', 'Pallipadi, Venkatesh',
	'Linux Kernel', 'Martin Bligh',
	'John Stultz', 'Nakajima, Jun',
	'Mallick, Asit K', 'Saxena, Sunil',
	Van Maren, Kevin, 'Andi Kleen', 'Hubert Mantel'

>One thing I will say. Your code would be a hell of a lot saner for
>merging if you mapped the ISA/Legacy IRQ's as 0-15 (to software) and the
>PCI ones to 16+ like everyone else does. That would kill a _lot_ of
>ifdefs and the IRQ0 corner case

Alan, do you mean the scheme implemented in the IA64 tree? I was terribly short
of time, so I had to do something quick and dirty. IRQ0 was not nearly as
bad as the rest of the legacy drivers asking for "IRQ3", "IRQ4", etc. I
haven't looked into other archs' implementations - who else has done it? Has
there ever been a case similar to ours on other architectures?

Thanks,

--Natalie

* RE: [PATCH][2.4] generic cluster APIC support for systems with more than 8 CPUs
@ 2002-12-23  7:29 Kamble, Nitin A
  2002-12-23  7:52 ` Martin J. Bligh
  0 siblings, 1 reply; 24+ messages in thread
From: Kamble, Nitin A @ 2002-12-23  7:29 UTC (permalink / raw)
  To: Martin J. Bligh, William Lee Irwin III
  Cc: Protasevich, Natalie, Pallipadi, Venkatesh, Christoph Hellwig,
	James Cleverdon, Linux Kernel, John Stultz, Nakajima, Jun,
	Mallick, Asit K, Saxena, Sunil, Van Maren, Kevin, Andi Kleen,
	Hubert Mantel

	Martin, a couple of days back I posted a kernel IRQ distribution patch with some discussion. There we tried to do the same things you are interested in here: we made the interval flexible and longer, and removed the randomness from the algorithm.
	  Also, about fairness: the scheduler will not always be able to solve the fairness issues caused by interrupts. For example, at very high interrupt loads some of the CPUs may be 100% busy just servicing interrupts, and the scheduler cannot do anything about that. To get fairness, we need the interrupt distribution mechanism to move interrupts as required.
	  Maybe we can add some generic NUMA awareness to it, but I am not fully aware of how interrupt routing works on the various NUMA systems. If I can get this information, I can look into how we could have generic NUMA support in the new IRQ distribution code.
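A rough sketch of a distributor that is deterministic and only moves an interrupt when one CPU is clearly overloaded, in the spirit of the points above. It is not the posted patch: the per-IRQ load counts, the fixed array sizes and the threshold are all assumptions. It also refuses a move that would not shrink the gap, so a single dominant IRQ is not bounced back and forth.

```c
#define NR_CPUS_SKETCH 4	/* hypothetical machine size */
#define NR_IRQS_SKETCH 8

/*
 * irq_load[i]: interrupts counted on IRQ i during the last interval.
 * irq_cpu[i]:  CPU currently servicing IRQ i (updated in place).
 * Returns the IRQ that was moved, or -1 if everything stays put.
 */
int rebalance_once(const unsigned long irq_load[NR_IRQS_SKETCH],
		   int irq_cpu[NR_IRQS_SKETCH], unsigned long threshold)
{
	unsigned long cpu_load[NR_CPUS_SKETCH] = { 0 };
	unsigned long gap;
	int i, busiest = 0, idlest = 0, victim = -1;

	/* Sum the interrupt load each CPU is carrying. */
	for (i = 0; i < NR_IRQS_SKETCH; i++)
		cpu_load[irq_cpu[i]] += irq_load[i];

	/* Deterministic choice of the busiest and least busy CPUs. */
	for (i = 1; i < NR_CPUS_SKETCH; i++) {
		if (cpu_load[i] > cpu_load[busiest])
			busiest = i;
		if (cpu_load[i] < cpu_load[idlest])
			idlest = i;
	}

	/* Small imbalances are left alone: no movement for its own sake. */
	gap = cpu_load[busiest] - cpu_load[idlest];
	if (gap <= threshold)
		return -1;

	/* Pick the heaviest IRQ on the busiest CPU. */
	for (i = 0; i < NR_IRQS_SKETCH; i++)
		if (irq_cpu[i] == busiest &&
		    (victim < 0 || irq_load[i] > irq_load[victim]))
			victim = i;

	/* Only move it if that actually narrows the gap. */
	if (victim < 0 || irq_load[victim] >= gap)
		return -1;
	irq_cpu[victim] = idlest;
	return victim;
}
```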

Thanks,
Nitin

-----Original Message-----
From: Martin J. Bligh [mailto:mbligh@aracnet.com]
Sent: Sunday, December 22, 2002 9:21 AM
To: Pallipadi, Venkatesh; William Lee Irwin III; Protasevich, Natalie
Cc: Christoph Hellwig; James Cleverdon; Linux Kernel; John Stultz;
Nakajima, Jun; Mallick, Asit K; Saxena, Sunil; Van Maren, Kevin; Andi
Kleen; Hubert Mantel; Kamble, Nitin A
Subject: RE: [PATCH][2.4] generic cluster APIC support for systems with
more than 8 CPUs


> I actually meant interrupt distribution (rather than interrupt routing).
> AFAIK, interrupt distribution right now assumes flat logical setup and
> tries to distribute the interrupt. And is disabled in case of clustered
> APIC mode.  I was just thinking loud, about the changes interrupt
> distribution code should have for systems using clustered APIC/physical
> mode (NUMAQ and non-NUMAQ).

Oh, you mean irq_balance? I'm happy to leave that turned off on NUMA-Q
until it does something less random than it does now. Getting some sort
of affinity for interrupts over a longer period is much more interesting
than providing pretty numbers under /proc/interrupts. Giving each of
the frequently used interrupts their own local CPU to process it would
be cool, but I see no purpose in continually moving them around. If you're
concerned about fairness, that's a scheduler problem to account for and
deal with, IMHO.

The provided topology functions should be able to do node_to_cpumask
and cpu_to_node mappings once that's sorted out. Treat each node as a
separate "system" and balance within that.
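Martin's "treat each node as a separate system" suggestion, sketched with a static table standing in for the cpu_to_node()/node_to_cpumask() topology functions he mentions; the table, the counts and bind_irq_node_local() are hypothetical. Each IRQ is bound once to the least-used CPU on its own node and then left alone, which gives locality and affinity rather than continual movement.

```c
#define SKETCH_NR_CPUS 8

/* Hypothetical topology: CPUs 0-3 on node 0, CPUs 4-7 on node 1. */
static const int cpu_node[SKETCH_NR_CPUS] = { 0, 0, 0, 0, 1, 1, 1, 1 };

/* Number of IRQs already bound to each CPU. */
static int irqs_on_cpu[SKETCH_NR_CPUS];

/*
 * Bind an IRQ whose IO-APIC sits on 'node' to the least-used CPU of that
 * node. Returns the chosen CPU, or -1 if the node has no CPUs.
 */
int bind_irq_node_local(int node)
{
	int cpu, best = -1;

	for (cpu = 0; cpu < SKETCH_NR_CPUS; cpu++) {
		if (cpu_node[cpu] != node)
			continue;	/* never pick an off-node CPU */
		if (best < 0 || irqs_on_cpu[cpu] < irqs_on_cpu[best])
			best = cpu;
	}
	if (best >= 0)
		irqs_on_cpu[best]++;
	return best;
}
```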

M.

* RE: [PATCH][2.4] generic cluster APIC support for systems with more than 8 CPUs
@ 2002-12-22  6:19 Pallipadi, Venkatesh
  2002-12-22  6:39 ` William Lee Irwin III
                   ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Pallipadi, Venkatesh @ 2002-12-22  6:19 UTC (permalink / raw)
  To: Martin J. Bligh, William Lee Irwin III, Protasevich, Natalie
  Cc: Christoph Hellwig, James Cleverdon, Linux Kernel, John Stultz,
	Nakajima, Jun, Mallick, Asit K, Saxena, Sunil, Van Maren, Kevin,
	Andi Kleen, Hubert Mantel, Kamble, Nitin A



> -----Original Message-----
> From: Martin J. Bligh [mailto:mbligh@aracnet.com]
> > Yes, our feeling is that it is possible to handle all non-NUMAQ systems
> > pretty generically in terms of APIC setup and interrupt routing. We can
> > use either logical clustered or physical destination modes. But for NUMAQ
> > systems, interrupt routing has to know about the local nodes and have the
> > necessary logic to do the routing within the local node.
> 
> NUMA-Q doesn't have to know about the local nodes. I set it up to use
> physical delivery broadcast, which is a node-local broadcast ... gave
> me NUMA affinity for free. I could also use logical clustered (p3 style)
> addressing, and work out all the node locality, but I don't see the point.
> 

I actually meant interrupt distribution (rather than interrupt routing). AFAIK, interrupt distribution right now assumes a flat logical setup and tries to distribute the interrupts, and it is disabled in clustered APIC mode.
I was just thinking aloud about the changes the interrupt distribution code would need for systems using clustered APIC/physical mode (NUMAQ and non-NUMAQ).

Thanks,
-Venkatesh

* RE: [PATCH][2.4] generic cluster APIC support for systems with more than 8 CPUs
@ 2002-12-22  4:00 Pallipadi, Venkatesh
  2002-12-22  4:05 ` Martin J. Bligh
  0 siblings, 1 reply; 24+ messages in thread
From: Pallipadi, Venkatesh @ 2002-12-22  4:00 UTC (permalink / raw)
  To: William Lee Irwin III, Protasevich, Natalie
  Cc: Christoph Hellwig, James Cleverdon, Linux Kernel, Martin Bligh,
	John Stultz, Nakajima, Jun, Mallick, Asit K, Saxena, Sunil,
	Van Maren, Kevin, Andi Kleen, Hubert Mantel, Kamble, Nitin A



> -----Original Message-----
> From: William Lee Irwin III [mailto:wli@holomorphy.com]
> On Fri, Dec 20, 2002 at 04:57:28PM -0600, Protasevich, Natalie wrote:
> > There are only a few problems with porting the Linux kernel to the ES7000:
> > 	we use 8-bit APIC IDs - this makes us use APIC_LDR instead of
> > 	APIC_ID throughout the code;
> > 	we have special RTE destination values on the IO-APIC - the "if" in
> > 	the IO-APIC line programming code;
> > 	we introduce a severe IRQ override case - we remap ISA interrupts to a
> > 	different interrupt range (all the "i < 16" clauses).
> > Also, I usually have to add things like the XTPR mechanism for Fosters/Gallatins
> > and disable conventional IRQ balancing, since our IO-APIC doesn't work this
> > way... (All of the above is in the SuSE code base).
> 
> Venkatesh, do you think you can handle these generically? Aside from
> machine-specific configurations this all looks perfectly generic.
> 
> If it's publicly discussable, what's the difference wrt. the IO-APIC?
> IIRC NUMA-Q had a similar issue, where flat logical destinations were
> being programmed into the IO-APIC by the IRQ balancing code, but the
> NUMA-Q IO-APIC was programmed to accept physical destinations in the
> RTE's via the DESTMOD bit, using physical broadcast by default, and
> achieving node-locality as physical destinations may not refer to
> off-node cpus. There probably isn't an issue of node locality, but even
> if the IO-APIC's are programmed for logical DESTMOD it won't work with
> the flat logical gunk the original IRQ balance patch programmed up.
> 
> From 2.5.52 include/asm-i386/smp.h:
> 
> #ifdef CONFIG_CLUSTERED_APIC
>  #define INT_DELIVERY_MODE 0     /* physical delivery on LOCAL quad */
> #else
>  #define INT_DELIVERY_MODE 1     /* logical delivery broadcast to all procs */
> #endif
> 
> 
> From 2.5.52 arch/i386/mach-generic/mach_apic.h:
> 
> #ifdef CONFIG_SMP
>  #define TARGET_CPUS (clustered_apic_mode ? 0xf : cpu_online_map)
> #else
>  #define TARGET_CPUS 0x01
> #endif
> 
> And while setting up the RTE's in io_apic.c:
> 
>                 entry.delivery_mode = dest_LowestPrio;
>                 entry.dest_mode = INT_DELIVERY_MODE;
>                 entry.mask = 0;                         /* enable IRQ */
>                 entry.dest.logical.logical_dest = TARGET_CPUS;
> 
> ... which is rather blatant abuse of entry.dest.logical.logical_dest
> for the NUMA-Q case, but never mind that.
> 
> 
> On Fri, Dec 20, 2002 at 04:57:28PM -0600, Protasevich, Natalie wrote:
> > I worked with the SuSE tree which has clustered code (at the first glance)
> > close to the patch being discussed here.
> > The 2.5 tree gives us a benefit of the subarch that will accommodate
> > (hopefully) our special cases.
> > But I may need to add more hooks.
> 
> It'd be great to have the APIC interface general enough to handle all
> these machines.

Yes, our feeling is that it is possible to handle all non-NUMAQ systems pretty generically in terms of APIC setup and interrupt routing. We can use either logical clustered or physical destination modes.
But for NUMAQ systems, interrupt routing has to know about the local nodes and have the necessary logic to do the routing within the local node.
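For reference, a sketch of where the fields discussed in this thread sit in a 64-bit IO-APIC redirection table entry (bit positions per the 82093AA IO-APIC datasheet: vector in bits 0-7, delivery mode in 8-10, DESTMOD in 11, mask in 16, destination in 56-63). make_rte() is a hypothetical helper, not io_apic.c code. It shows that physical and logical modes share the same 8-bit destination field; only DESTMOD changes how it is interpreted, which is why flat logical values go wrong on an IO-APIC set up for physical or clustered destinations.

```c
#include <stdint.h>

/* Field positions in a 64-bit IO-APIC redirection table entry. */
#define RTE_DELIVERY_SHIFT 8	/* bits 8-10: delivery mode */
#define RTE_DESTMOD_BIT    11	/* bit 11: 0 = physical, 1 = logical */
#define RTE_DEST_SHIFT     56	/* bits 56-63: destination */

#define DELIVERY_FIXED       0
#define DELIVERY_LOWEST_PRIO 1

/*
 * Build an unmasked RTE (bit 16, the mask bit, is left clear). The same
 * 8-bit destination is an APIC ID when dest_logical == 0, and a logical
 * destination (flat bitmask, or cluster + member bits) when it is 1.
 */
uint64_t make_rte(uint8_t vector, unsigned delivery, int dest_logical,
		  uint8_t dest)
{
	uint64_t rte = vector;

	rte |= (uint64_t)(delivery & 0x7) << RTE_DELIVERY_SHIFT;
	rte |= (uint64_t)(dest_logical ? 1 : 0) << RTE_DESTMOD_BIT;
	rte |= (uint64_t)dest << RTE_DEST_SHIFT;
	return rte;
}
```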

Thanks,
-Venkatesh 

 

* RE: [PATCH][2.4] generic cluster APIC support for systems with more than 8 CPUs
@ 2002-12-20 22:57 Protasevich, Natalie
  2002-12-20 23:33 ` William Lee Irwin III
  2002-12-25 21:41 ` Alan Cox
  0 siblings, 2 replies; 24+ messages in thread
From: Protasevich, Natalie @ 2002-12-20 22:57 UTC (permalink / raw)
  To: 'William Lee Irwin III', 'Christoph Hellwig',
	'James Cleverdon', 'Pallipadi, Venkatesh',
	'Linux Kernel', 'Martin Bligh',
	'John Stultz', 'Nakajima, Jun',
	'Mallick, Asit K', 'Saxena, Sunil',
	Van Maren, Kevin
  Cc: 'Andi Kleen', 'Hubert Mantel'

> On Thu, Dec 19, 2002 at 06:04:55PM -0800, James Cleverdon wrote:
> >>> A generic patch should also support Unisys' new box, the ES7000 or
> >>> some such.
> 
> On Fri, Dec 20, 2002 at 08:00:50AM +0000, Christoph Hellwig wrote:
> >> That box needs more changes than just the apic setup.  Unfortunately
> >> unisys thinks they shouldn't send their patches to lkml, but when you see
> >> them e.g. in the suse tree it's a bit understandable that they don't want
> >> anyone to really see their mess :)

Briefly, our ES7000 boxes are non-NUMA, but use clustered APICs (logical
with Cascades, and physical with Gallatins/Fosters). Our code fits pretty much
within the clustered APIC code (when both physical and logical modes are
implemented). Even with the NUMA support that is forced on in the clustered
APIC case, we are usually OK as a single-node case.
There are only a few problems with porting the Linux kernel to the ES7000:
	we use 8-bit APIC IDs - this makes us use APIC_LDR instead of
	APIC_ID throughout the code;
	we have special RTE destination values on the IO-APIC - the "if" in
	the IO-APIC line programming code;
	we introduce a severe IRQ override case - we remap ISA interrupts to a
	different interrupt range (all the "i < 16" clauses).
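As an illustration of why the APIC ID width matters (an assumption about where such code can break, not a description of the ES7000 patch): both APIC_ID and APIC_LDR keep the ID in bits 24-31 of the register, and code that masks the ID to 4 bits, as code written for P6-era 4-bit IDs may do, silently drops the top nibble of an 8-bit ID. ID_4BIT, ID_8BIT and id_truncated() are hypothetical helpers.

```c
#include <stdint.h>

/* Hypothetical helpers: the ID lives in bits 24-31 of APIC_ID/APIC_LDR. */
#define ID_4BIT(reg) (((uint32_t)(reg) >> 24) & 0x0Fu)	/* 4-bit assumption */
#define ID_8BIT(reg) (((uint32_t)(reg) >> 24) & 0xFFu)	/* full 8-bit ID */

/* Returns 1 if reading the ID with a 4-bit mask would lose information. */
static int id_truncated(uint32_t reg)
{
	return ID_4BIT(reg) != ID_8BIT(reg);
}
```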

Also, I usually have to add things like XTPR mechanism for Fosters/Gallatins
and disable conventional IRQ balancing, since our IO-APIC doesn't work this
way... (All of the above is in the SuSE code base).

I worked with the SuSE tree, which has clustered code (at first glance)
close to the patch being discussed here.
The 2.5 tree gives us the benefit of the subarch, which will (hopefully)
accommodate our special cases.
But I may need to add more hooks.

>No need to sugar-coat anything :-)

>Natalie is the engineer who added support for the ES7000 to Linux.
>Fortunately she is in the cube next to me.

>She has sent the patches to SuSE/United Linux, and is in the final process
>of testing them on 2.5.5x before submitting them to LKML for comment.

> >> And btw, the box isn't that new, but three years ago or so when they first
> >> showed it on cebit they even refused to talk about linux due to their
> >> restrictive agreements with Microsoft..
>
> On Fri, Dec 20, 2002 at 03:24:01AM -0800, William Lee Irwin III wrote:
> > Kevin, you're the only lkml-posting contact point I know of within Unisys.
> > Is there any chance you could flag down some of the ia32 crew there for
> > some commentary on this stuff? (or do so yourself if you're in it)

I will be looking at the Intel patch submitted against 2.4 with support for
the ES7000 in mind. I am trying to get the ES7000 patch for 2.5.x out
sometime next week (my boss won't let me have a life until I get ES7000
support into 2.5 (:-<)). At the same time, we are very interested in any
clustered APIC patch that goes into the 2.5 tree (the sooner the better). Having
physical cluster support in 2.5 would greatly reduce the size of the diffs for
the ES7000.

>I mostly work on our 16-32p IA64 machines.  Natalie or someone else will
>have to comment on the clustered-apic code.

>I do know that a lot of the code for the ES7000 is optional, and only
>required to support value-added management functionality, which is
>especially useful if you are running more than one OS instance on the
>machine (it supports 8 fully-independent partitions).

>Also, as a clarification, our 32-processor systems are NOT NUMA: there
>is a full non-blocking crossbar to memory.  So clustered APIC support
>should not be dependent on NUMA.

>Kevin



Thread overview: 24+ messages
     [not found] <3FAD1088D4556046AEC48D80B47B478C1AEC75@usslc-exch-4.slc.unisys.com>
2002-12-22 20:41 ` [PATCH][2.4] generic cluster APIC support for systems with more than 8 CPUs Protasevich, Natalie
2002-12-22 20:52   ` Martin J. Bligh
2003-01-06 18:58 Protasevich, Natalie
2003-01-08 14:53 ` Alan Cox
  -- strict thread matches above, loose matches on Subject: below --
2002-12-26  2:18 Van Maren, Kevin
2002-12-27 23:38 ` Alan Cox
2002-12-26  1:14 Protasevich, Natalie
2002-12-27 23:39 ` Alan Cox
2002-12-23  7:29 Kamble, Nitin A
2002-12-23  7:52 ` Martin J. Bligh
2002-12-23  9:46   ` Zwane Mwaikambo
2002-12-23 15:30     ` Martin J. Bligh
2002-12-22  6:19 Pallipadi, Venkatesh
2002-12-22  6:39 ` William Lee Irwin III
2002-12-22 17:21 ` Martin J. Bligh
2002-12-22 17:23 ` Martin J. Bligh
2002-12-22  4:00 Pallipadi, Venkatesh
2002-12-22  4:05 ` Martin J. Bligh
2002-12-20 22:57 Protasevich, Natalie
2002-12-20 23:33 ` William Lee Irwin III
2002-12-25 21:41 ` Alan Cox
     [not found] <3FAD1088D4556046AEC48D80B47B478C0101F55D@usslc-exch-4.slc.unisys.com>
2002-12-20 15:46 ` Van Maren, Kevin
2002-12-20 16:30   ` Martin J. Bligh
2002-12-20 17:16   ` William Lee Irwin III
