public inbox for linux-kernel@vger.kernel.org
* [BUG] safe_smp_process_id() uses apicid which exceeds NR_CPUs in array
From: Doug Thompson @ 2006-06-12 22:38 UTC
  To: Andi Kleen, linux-kernel


With a 2.6.15 kernel running on a Tyan S4881 quad-processor board (factory BIOS)
with Opteron 254s, I received the following MCEs:

CPU 18: Machine Check Exception:                4 Bank 0: b601a00000000833
TSC 9c3799943459 ADDR 4eee07800

CPU 18: Machine Check Exception:                4 Bank 2: d000400000000863
TSC 9c3799943d01

CPU 18: Machine Check Exception:                4 Bank 4: d42dc00100000813
TSC 9c379994422d ADDR 4eee05708


The cause was later determined to be a bad memory stick, but what caught my
attention was the 'CPU 18' on a four-processor board. Running the same hardware
with 2.6.17-rc6 produced MCEs with 'CPU 2' messages instead.

I thought the problem was fixed, BUT...

Looking at safe_smp_processor_id() in arch/x86_64/kernel/smp.c of 2.6.17-rc6
(this function is called by the MCE handler code):

int safe_smp_processor_id(void)
{
        int apicid, i;

        if (disable_apic)
                return 0;

        apicid = hard_smp_processor_id();

----->  if (x86_cpu_to_apicid[apicid] == apicid)
                return apicid;

        for (i = 0; i < NR_CPUS; ++i) {
                if (x86_cpu_to_apicid[i] == apicid)
                        return i;
        }

        /* No entries in x86_cpu_to_apicid?  Either no MPS|ACPI,
         * or called too early.  Either way, we must be CPU 0. */
        if (x86_cpu_to_apicid[0] == BAD_APICID)
                return 0;

        return 0; /* Should not happen */
}

I noticed the:   if (x86_cpu_to_apicid[apicid] == apicid)
above.

NR_CPUS was 4 and apicid could be 16, 17, 18, or 19, so indexing
x86_cpu_to_apicid[] with apicid is definitely an out-of-bounds reference.
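
For illustration only, the mismatch between APIC IDs and table indices can be
sketched in user space. This is a minimal sketch, not kernel code: the names
cpu_to_apicid, table_read, apicid_is_valid_index and lookup_cpu are
hypothetical stand-ins, and the table contents are assumed from the boot log
(APIC IDs 16..19 for CPUs 0..3).

```c
#include <assert.h>

#define NR_CPUS 4  /* value reported in the message above */

/* Hypothetical stand-in for x86_cpu_to_apicid[], filled with the
 * APIC IDs listed in the boot log (an assumption for this sketch). */
static const unsigned char cpu_to_apicid[NR_CPUS] = { 16, 17, 18, 19 };

/* Checked read: aborts instead of silently reading past the table. */
static unsigned char table_read(unsigned idx)
{
	assert(idx < NR_CPUS);
	return cpu_to_apicid[idx];
}

/* Returns 1 if apicid may be used directly as a table index. */
static int apicid_is_valid_index(unsigned apicid)
{
	return apicid < NR_CPUS;
}

/* Slow path: search the table for the APIC ID.
 * Returns the CPU index, or -1 if the APIC ID is not present. */
static int lookup_cpu(unsigned apicid)
{
	unsigned i;

	for (i = 0; i < NR_CPUS; ++i) {
		if (table_read(i) == apicid)
			return (int)i;
	}
	return -1;
}
```

With this table, apicid_is_valid_index(18) is 0, so 18 must not be used as an
index, while lookup_cpu(18) returns 2 via the loop.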

doug thompson

A portion of boot.mesg follows:

SRAT: PXM 0 -> APIC 16 -> Node 0
SRAT: PXM 1 -> APIC 17 -> Node 1
SRAT: PXM 2 -> APIC 18 -> Node 2
SRAT: PXM 3 -> APIC 19 -> Node 3
SRAT: Node 0 PXM 0 0-a0000
SRAT: Node 0 PXM 0 0-d0000000
SRAT: Node 0 PXM 0 0-230000000
SRAT: Node 1 PXM 1 230000000-430000000
SRAT: Node 2 PXM 2 430000000-630000000
SRAT: Node 3 PXM 3 630000000-830000000
NUMA: Using 28 for the hash shift.
Bootmem setup node 0 0000000000000000-0000000230000000
Bootmem setup node 1 0000000230000000-0000000430000000
Bootmem setup node 2 0000000430000000-0000000630000000
Bootmem setup node 3 0000000630000000-0000000830000000
On node 0 totalpages: 2063996
  DMA zone: 2596 pages, LIFO batch:0
  DMA32 zone: 833240 pages, LIFO batch:31
  Normal zone: 1228160 pages, LIFO batch:31
On node 1 totalpages: 2068480
  Normal zone: 2068480 pages, LIFO batch:31
On node 2 totalpages: 2068480
  Normal zone: 2068480 pages, LIFO batch:31
On node 3 totalpages: 2068480
  Normal zone: 2068480 pages, LIFO batch:31
Nvidia board detected. Ignoring ACPI timer override.
ACPI: PM-Timer IO Port: 0x8008
ACPI: Local APIC address 0xfee00000
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x10] enabled)
Processor #16 15:5 APIC version 16
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x11] enabled)
Processor #17 15:5 APIC version 16
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x12] enabled)
Processor #18 15:5 APIC version 16
ACPI: LAPIC (acpi_id[0x03] lapic_id[0x13] enabled)
Processor #19 15:5 APIC version 16



* Re: [BUG] safe_smp_process_id() uses apicid which exceeds NR_CPUs in array
From: Andi Kleen @ 2006-06-13  4:03 UTC
  To: Doug Thompson; +Cc: linux-kernel


> 
> I noticed the:   if (x86_cpu_to_apicid[apicid] == apicid)
> above.

You're right - the fast check should either first test for
apicid >= NR_CPUS or just be removed, leaving the lookup to the loop.
I came up with this patch.

Thanks.

-Andi

Fix fast check in safe_smp_processor_id

The APIC ID returned by hard_smp_processor_id() can be greater than or
equal to NR_CPUS, and the fast check then reads past the end of the
x86_cpu_to_apicid[] array.

Add a bounds check to the fast path. If the APIC ID is out of range,
the slow loop below will catch it.

Bug pointed out by Doug Thompson
Signed-off-by: Andi Kleen <ak@suse.de>

Index: linux/arch/x86_64/kernel/smp.c
===================================================================
--- linux.orig/arch/x86_64/kernel/smp.c
+++ linux/arch/x86_64/kernel/smp.c
@@ -520,13 +520,13 @@ asmlinkage void smp_call_function_interr
 
 int safe_smp_processor_id(void)
 {
-	int apicid, i;
+	unsigned apicid, i;
 
 	if (disable_apic)
 		return 0;
 
 	apicid = hard_smp_processor_id();
-	if (x86_cpu_to_apicid[apicid] == apicid)
+	if (apicid < NR_CPUS && x86_cpu_to_apicid[apicid] == apicid)
 		return apicid;
 
 	for (i = 0; i < NR_CPUS; ++i) {
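
The effect of the patched fast check can be modelled in user space. The sketch
below is an illustration under assumptions, not the kernel function:
model_safe_cpu_id and model_self_check are hypothetical names, and the
hardware APIC ID and the table are passed in as parameters instead of coming
from hard_smp_processor_id() and x86_cpu_to_apicid[]. It also shows one
consequence of making apicid unsigned: a negative value converts to a very
large unsigned one and fails the bounds check.

```c
#include <assert.h>

#define NR_CPUS 4  /* value from the report; an assumption for this sketch */

/* Hypothetical user-space model of the patched lookup. */
static int model_safe_cpu_id(int hw_apicid, const unsigned char table[NR_CPUS])
{
	unsigned apicid = (unsigned)hw_apicid;
	unsigned i;

	/* Fast path: index the table only when apicid is in range. */
	if (apicid < NR_CPUS && table[apicid] == apicid)
		return (int)apicid;

	/* Slow path: search every slot for the APIC ID. */
	for (i = 0; i < NR_CPUS; ++i) {
		if (table[i] == apicid)
			return (int)i;
	}

	return 0; /* same fallback value as the original function */
}

/* Small self-check covering the three cases discussed in the thread. */
static void model_self_check(void)
{
	static const unsigned char identity[NR_CPUS] = { 0, 1, 2, 3 };
	static const unsigned char offset[NR_CPUS] = { 16, 17, 18, 19 };

	assert(model_safe_cpu_id(2, identity) == 2); /* fast path hits */
	assert(model_safe_cpu_id(18, offset) == 2);  /* loop finds index 2 */
	assert(model_safe_cpu_id(-1, offset) == 0);  /* out of range, fallback */
}
```

Dropping the fast check entirely, the other option mentioned above, would give
the same results in this model, since the loop alone covers every case.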
