* NUMA page allocation from next Node
@ 2010-10-26 16:27 Tharindu Rukshan Bamunuarachchi
[not found] ` <20101027213652.GA12345@sgi.com>
0 siblings, 1 reply; 14+ messages in thread
From: Tharindu Rukshan Bamunuarachchi @ 2010-10-26 16:27 UTC (permalink / raw)
To: linux-numa
Dear All,
Today, we experienced abnormal memory allocation behavior.
I do not know whether this is the expected behavior or due to misconfiguration.
I have a two-node NUMA system and a 100G tmpfs mount.
1. When "dd" was running freely (without CPU affinity), all memory pages
were allocated from NODE 0 first and then from NODE 1.
2. When "dd" was bound (using taskset) to a CPU core on NODE 1, all
memory pages were allocated from NODE 1, BUT the machine stopped
responding after NODE 1 was exhausted. No pages were ever allocated
from NODE 0.
Why can "dd" not allocate memory from NODE 0 when it is bound to a
NODE 1 CPU core?
Please help.
I am using SLES 11 with the 2.6.27 kernel.
__
Tharindu R Bamunuarachchi.
^ permalink raw reply [flat|nested] 14+ messages in thread

[parent not found: <20101027213652.GA12345@sgi.com>]
* Re: NUMA page allocation from next Node
  [not found] ` <20101027213652.GA12345@sgi.com>
@ 2010-10-29  2:05   ` Tharindu Rukshan Bamunuarachchi
  [not found]     ` <20101029033058.GB555@www.lurndal.org>
  ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Tharindu Rukshan Bamunuarachchi @ 2010-10-29 2:05 UTC (permalink / raw)
  To: Cliff Wickman

Finally I could isolate the issue further.
I tried the following kernels and hardware.
The issue is visible only with IBM + SLES 11.

1. SLES 11 + IBM HW        --> Issue is visible
2. SLES 11 + HP, Sun HW    --> Issue is not visible
3. 2.6.32 Vanilla + Any HW --> Issue is not visible
4. 2.6.36 Vanilla + Any HW --> Issue is not visible

HP has the same hardware as IBM; both are Nehalem. The Sun box is a
somewhat older Opteron.

Any thoughts ?
__
Tharindu R Bamunuarachchi.

On Thu, Oct 28, 2010 at 3:06 AM, Cliff Wickman <cpw@sgi.com> wrote:
> Hi Tharindu,
>
> On Tue, Oct 26, 2010 at 09:57:53PM +0530, Tharindu Rukshan Bamunuarachchi wrote:
>> Dear All,
>>
>> Today, we experienced abnormal memory allocation behavior.
>> I do not know whether this is the expected behavior or due to misconfiguration.
>>
>> I have two node NUMA system and 100G TMPFS mount.
>>
>> 1. When "dd" running freely (without CPU affinity) all memory pages
>> were allocated from NODE 0 and then from NODE 1.
>>
>> 2. When "dd" running bound (using taskset) to CPU core in NODE 1 ....
>> All memory pages were started to be allocated from NODE 1.
>> BUT machine stopped responding after exhausting NODE 1.
>> No memory pages were started to be allocated from NODE 0.
>>
>> Why "dd" cannot allocate memory from NODE 0 when it is running bound
>> to NODE 1 CPU core ?
>>
>> Please help.
>> I am using SLES 11 with 2.6.27 kernel.
>
> I'm no expert on the taskset command, but from what I can see, it
> just uses sched_setaffinity() to set cpu affinity. I don't see any
> set_mempolicy calls to affect memory affinity. So I see no reason
> for restricting memory allocation.
> You're not using some other placement mechanism in conjunction with
> taskset, are you? A cpuset for example?
>
> -Cliff
> --
> Cliff Wickman
> SGI
> cpw@sgi.com
> (651) 683-3824
>

^ permalink raw reply [flat|nested] 14+ messages in thread
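Cliff's point is easy to check from userspace: taskset only calls
sched_setaffinity(), so the task's memory policy and allowed node mask
should still cover both nodes. A minimal sketch of such a check
(the core number, tmpfs path, and file size below are illustrative
assumptions, not values taken from this thread):

    # run dd pinned to a core that sits on node 1, then inspect it
    taskset -c 6 dd if=/dev/zero of=/mnt/tmpfs/fill bs=1M count=40000 &
    pid=$!

    # CPU affinity is restricted to the chosen core ...
    taskset -p $pid

    # ... but memory placement should not be: both nodes listed,
    # and every mapping should show the "default" policy
    grep Mems_allowed /proc/$pid/status
    grep -c default /proc/$pid/numa_maps

    # per-node allocation counters (numa_hit / numa_miss / other_node)
    numastat
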
[parent not found: <20101029033058.GB555@www.lurndal.org>]
* Re: NUMA page allocation from next Node [not found] ` <20101029033058.GB555@www.lurndal.org> @ 2010-10-29 6:58 ` Tharindu Rukshan Bamunuarachchi 0 siblings, 0 replies; 14+ messages in thread From: Tharindu Rukshan Bamunuarachchi @ 2010-10-29 6:58 UTC (permalink / raw) To: Scott Lurndal; +Cc: Cliff Wickman, linux-numa I have gone through BIOS settings and only applicable setting was memory type : NUMA or Non-NUMA. (current value is NUMA) I have attached part of dmesg output. Is there any other tool or way to gather info ? DMESG ====== Initializing cgroup subsys cpuset Initializing cgroup subsys cpu Linux version 2.6.27.45-0.1-default (geeko@buildhost) (gcc version 4.3.2 [gcc-4_3-branch revision 141291] (SUSE Linux) ) #1 SMP 2010-02-22 16:49:47 +0100 Command line: root=/dev/disk/by-id/scsi-3600605b0023f45a01449ea30199cc9ae-part1 resume=/dev/disk/by-id/scsi-3600605b0023f45a01449ea30199cc9ae-part3 splash=silent crashkernel=256M-:128M@16M vga=0x314 KERNEL supported cpus: Intel GenuineIntel AMD AuthenticAMD Centaur CentaurHauls BIOS-provided physical RAM map: BIOS-e820: 0000000000000000 - 000000000009c400 (usable) BIOS-e820: 000000000009c400 - 00000000000a0000 (reserved) BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved) BIOS-e820: 0000000000100000 - 000000007d151000 (usable) BIOS-e820: 000000007d151000 - 000000007d215000 (reserved) BIOS-e820: 000000007d215000 - 000000007d854000 (usable) BIOS-e820: 000000007d854000 - 000000007d904000 (reserved) BIOS-e820: 000000007d904000 - 000000007f68f000 (usable) BIOS-e820: 000000007f68f000 - 000000007f6df000 (reserved) BIOS-e820: 000000007f6df000 - 000000007f7df000 (ACPI NVS) BIOS-e820: 000000007f7df000 - 000000007f7ff000 (ACPI data) BIOS-e820: 000000007f7ff000 - 000000007f800000 (usable) BIOS-e820: 000000007f800000 - 0000000090000000 (reserved) BIOS-e820: 00000000fc000000 - 00000000fd000000 (reserved) BIOS-e820: 00000000fed1c000 - 00000000fed20000 (reserved) BIOS-e820: 00000000ff800000 - 0000000100000000 (reserved) BIOS-e820: 0000000100000000 - 0000000c80000000 (usable) DMI 2.5 present. 
last_pfn = 0xc80000 max_arch_pfn = 0x100000000 x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106 last_pfn = 0x7f800 max_arch_pfn = 0x100000000 init_memory_mapping Using GB pages for direct mapping 0000000000 - 0040000000 page 1G 0040000000 - 007f800000 page 2M kernel direct mapping tables up to 7f800000 @ 8000-a000 last_map_addr: 7f800000 end: 7f800000 init_memory_mapping Using GB pages for direct mapping 0100000000 - 0c80000000 page 1G kernel direct mapping tables up to c80000000 @ 9000-a000 last_map_addr: c80000000 end: c80000000 RAMDISK: 37a03000 - 37fef962 ACPI: RSDP 000FDFD0, 0024 (r2 IBM ) ACPI: XSDT 7F7FE120, 0084 (r1 IBM THURLEY 0 1000013) ACPI: FACP 7F7FB000, 00F4 (r4 IBM THURLEY 0 IBM 1000013) ACPI: DSDT 7F7F8000, 2BF3 (r1 IBM THURLEY 3 IBM 1000013) ACPI: FACS 7F6EC000, 0040 ACPI: TCPA 7F7FD000, 0064 (r0 0 0) ACPI: APIC 7F7F7000, 011E (r2 IBM THURLEY 0 IBM 1000013) ACPI: MCFG 7F7F6000, 003C (r1 IBM THURLEY 1 IBM 1000013) ACPI: SLIC 7F7F5000, 0176 (r1 IBM THURLEY 0 IBM 1000013) ACPI: HPET 7F7F4000, 0038 (r1 IBM THURLEY 1 IBM 1000013) ACPI: SRAT 7F7F3000, 0168 (r2 IBM THURLEY 1 IBM 1000013) ACPI: SLIT 7F7F2000, 0030 (r1 IBM THURLEY 0 IBM 1000013) ACPI: SSDT 7F7F1000, 0183 (r2 IBM CPUSCOPE 4000 IBM 1000013) ACPI: SSDT 7F7F0000, 0699 (r2 IBM CPUWYVRN 4000 IBM 1000013) ACPI: ERST 7F7EF000, 0230 (r1 IBM THURLEY 1 IBM 1000013) ACPI: DMAR 7F7EE000, 00D8 (r1 IBM THURLEY 1 IBM 1000013) ACPI: Local APIC address 0xfee00000 SRAT: PXM 0 -> APIC 0 -> Node 0 SRAT: PXM 0 -> APIC 2 -> Node 0 SRAT: PXM 0 -> APIC 4 -> Node 0 SRAT: PXM 0 -> APIC 16 -> Node 0 SRAT: PXM 0 -> APIC 18 -> Node 0 SRAT: PXM 0 -> APIC 20 -> Node 0 SRAT: PXM 1 -> APIC 32 -> Node 1 SRAT: PXM 1 -> APIC 34 -> Node 1 SRAT: PXM 1 -> APIC 36 -> Node 1 SRAT: PXM 1 -> APIC 48 -> Node 1 SRAT: PXM 1 -> APIC 50 -> Node 1 SRAT: PXM 1 -> APIC 52 -> Node 1 SRAT: Node 0 PXM 0 0-80000000 SRAT: Node 0 PXM 0 100000000-680000000 SRAT: Node 1 PXM 1 680000000-c80000000 NUMA: Using 31 for the hash shift. 
Bootmem setup node 0 0000000000000000-0000000680000000 NODE_DATA [0000000000009000 - 0000000000020fff] bootmap [0000000000100000 - 00000000001cffff] pages d0 (7 early reservations) ==> bootmem [0000000000 - 0680000000] #0 [0000000000 - 0000001000] BIOS data page ==> [0000000000 - 0000001000] #1 [0000006000 - 0000008000] TRAMPOLINE ==> [0000006000 - 0000008000] #2 [0000200000 - 0000bcc8b8] TEXT DATA BSS ==> [0000200000 - 0000bcc8b8] #3 [0037a03000 - 0037fef962] RAMDISK ==> [0037a03000 - 0037fef962] #4 [000009c400 - 0000100000] BIOS reserved ==> [000009c400 - 0000100000] #5 [0000008000 - 0000009000] PGTABLE ==> [0000008000 - 0000009000] #6 [0000001000 - 0000001030] ACPI SLIT ==> [0000001000 - 0000001030] Bootmem setup node 1 0000000680000000-0000000c80000000 NODE_DATA [0000000680000000 - 0000000680017fff] bootmap [0000000680018000 - 00000006800d7fff] pages c0 (7 early reservations) ==> bootmem [0680000000 - 0c80000000] #0 [0000000000 - 0000001000] BIOS data page #1 [0000006000 - 0000008000] TRAMPOLINE #2 [0000200000 - 0000bcc8b8] TEXT DATA BSS #3 [0037a03000 - 0037fef962] RAMDISK #4 [000009c400 - 0000100000] BIOS reserved #5 [0000008000 - 0000009000] PGTABLE #6 [0000001000 - 0000001030] ACPI SLIT found SMP MP-table at [ffff88000009c540] 0009c540 Reserving 128MB of memory at 16MB for crashkernel (System RAM: 51200MB) [ffffe20000000000-ffffe200117fffff] PMD -> [ffff880028200000-ffff8800379fffff] on node 0 [ffffe20011800000-ffffe20019ffffff] PMD -> [ffff880038000000-ffff8800407fffff] on node 0 [ffffe2001a000000-ffffe20031ffffff] PMD -> [ffff880680200000-ffff8806981fffff] on node 1 Zone PFN ranges: DMA 0x00000000 -> 0x00001000 DMA32 0x00001000 -> 0x00100000 Normal 0x00100000 -> 0x00c80000 Movable zone start PFN for each node early_node_map[7] active PFN ranges 0: 0x00000000 -> 0x0000009c 0: 0x00000100 -> 0x0007d151 0: 0x0007d215 -> 0x0007d854 0: 0x0007d904 -> 0x0007f68f 0: 0x0007f7ff -> 0x0007f800 0: 0x00100000 -> 0x00680000 1: 0x00680000 -> 0x00c80000 On node 0 totalpages: 6288568 DMA zone: 1319 pages, LIFO batch:0 DMA32 zone: 501084 pages, LIFO batch:31 Normal zone: 5677056 pages, LIFO batch:31 On node 1 totalpages: 6291456 Normal zone: 6193152 pages, LIFO batch:31 ACPI: PM-Timer IO Port: 0x588 ACPI: Local APIC address 0xfee00000 ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled) ACPI: LAPIC (acpi_id[0x01] lapic_id[0x02] enabled) ACPI: LAPIC (acpi_id[0x02] lapic_id[0x04] enabled) ACPI: LAPIC (acpi_id[0x03] lapic_id[0x10] enabled) ACPI: LAPIC (acpi_id[0x04] lapic_id[0x12] enabled) ACPI: LAPIC (acpi_id[0x05] lapic_id[0x14] enabled) ACPI: LAPIC (acpi_id[0x06] lapic_id[0x20] enabled) ACPI: LAPIC (acpi_id[0x07] lapic_id[0x22] enabled) ACPI: LAPIC (acpi_id[0x08] lapic_id[0x24] enabled) ACPI: LAPIC (acpi_id[0x09] lapic_id[0x30] enabled) ACPI: LAPIC (acpi_id[0x0a] lapic_id[0x32] enabled) ACPI: LAPIC (acpi_id[0x0b] lapic_id[0x34] enabled) ACPI: LAPIC (acpi_id[0x0c] lapic_id[0x01] disabled) ACPI: LAPIC (acpi_id[0x0d] lapic_id[0x03] disabled) ACPI: LAPIC (acpi_id[0x0e] lapic_id[0x05] disabled) ACPI: LAPIC (acpi_id[0x0f] lapic_id[0x11] disabled) ACPI: LAPIC (acpi_id[0x10] lapic_id[0x13] disabled) ACPI: LAPIC (acpi_id[0x11] lapic_id[0x15] disabled) ACPI: LAPIC (acpi_id[0x12] lapic_id[0x21] disabled) ACPI: LAPIC (acpi_id[0x13] lapic_id[0x23] disabled) ACPI: LAPIC (acpi_id[0x14] lapic_id[0x25] disabled) ACPI: LAPIC (acpi_id[0x15] lapic_id[0x31] disabled) ACPI: LAPIC (acpi_id[0x16] lapic_id[0x33] disabled) ACPI: LAPIC (acpi_id[0x17] lapic_id[0x35] disabled) ACPI: LAPIC_NMI (acpi_id[0xff] dfl dfl lint[0x1]) 
ACPI: IOAPIC (id[0x08] address[0xfec00000] gsi_base[0]) IOAPIC[0]: apic_id 8, version 0, address 0xfec00000, GSI 0-23 ACPI: IOAPIC (id[0x09] address[0xfec80000] gsi_base[24]) IOAPIC[1]: apic_id 9, version 0, address 0xfec80000, GSI 24-47 ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl) ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level) ACPI: IRQ0 used by override. ACPI: IRQ2 used by override. ACPI: IRQ9 used by override. ACPI: HPET id: 0x8086a301 base: 0xfed00000 Using ACPI (MADT) for SMP configuration information SMP: Allowing 24 CPUs, 12 hotplug CPUs PM: Registered nosave memory: 000000000009c000 - 000000000009d000 PM: Registered nosave memory: 000000000009d000 - 00000000000a0000 PM: Registered nosave memory: 00000000000a0000 - 00000000000e0000 PM: Registered nosave memory: 00000000000e0000 - 0000000000100000 PM: Registered nosave memory: 000000007d151000 - 000000007d215000 PM: Registered nosave memory: 000000007d854000 - 000000007d904000 PM: Registered nosave memory: 000000007f68f000 - 000000007f6df000 PM: Registered nosave memory: 000000007f6df000 - 000000007f7df000 PM: Registered nosave memory: 000000007f7df000 - 000000007f7ff000 PM: Registered nosave memory: 000000007f800000 - 0000000090000000 PM: Registered nosave memory: 0000000090000000 - 00000000fc000000 PM: Registered nosave memory: 00000000fc000000 - 00000000fd000000 PM: Registered nosave memory: 00000000fd000000 - 00000000fed1c000 PM: Registered nosave memory: 00000000fed1c000 - 00000000fed20000 PM: Registered nosave memory: 00000000fed20000 - 00000000ff800000 PM: Registered nosave memory: 00000000ff800000 - 0000000100000000 Allocating PCI resources starting at 98000000 (gap: 90000000:6c000000) PERCPU: Allocating 61472 bytes of per cpu data NR_CPUS: 512, nr_cpu_ids: 24, nr_node_ids 2 Built 2 zonelists in Zone order, mobility grouping on. Total pages: 12372611 Policy zone: Normal __ Tharindu R Bamunuarachchi. On Fri, Oct 29, 2010 at 9:00 AM, Scott Lurndal <scott@lurndal.org> wrote: > On Fri, Oct 29, 2010 at 07:35:35AM +0530, Tharindu Rukshan Bamunuarachchi wrote: >> Finally I could isolate the issue further. >> I tried following kernels and hardware. >> Issue is visible only with IBM + SLES 11. > > Check your ACPI settings, make sure the SRAT and SLIT tables > are being provided by the BIOS to the kernel. > > scott > > ^ permalink raw reply [flat|nested] 14+ messages in thread
[parent not found: <AANLkTikDtKc7RdAWJagqCf7T0JKscfe0Hd0ojc8g7yYo@mail.gmail.com>]
* Re: NUMA page allocation from next Node
  [not found] ` <AANLkTikDtKc7RdAWJagqCf7T0JKscfe0Hd0ojc8g7yYo@mail.gmail.com>
@ 2010-10-29  7:06   ` Tharindu Rukshan Bamunuarachchi
  2010-10-29  8:49     ` Andi Kleen
  0 siblings, 1 reply; 14+ messages in thread
From: Tharindu Rukshan Bamunuarachchi @ 2010-10-29 7:06 UTC (permalink / raw)
  To: jiahua; +Cc: Cliff Wickman, linux-numa

What kind of NUMA setting should I look for ? I gathered the following
config values:

Speed               : Max Performance
LV-DIMM Power       : Low Power
Memory Channel Mode : Independent
Socket Interleave   : NUMA
Patrol Scrub        : Disabled
Demand Scrub        : Enabled
Turbo Mode          : Enabled
Turbo Boost         : Traditional
C1 Enhanced Mode    : Disabled
Report C2 OS        : Disabled
ACPI C-State        : C3
VT                  : Disabled
Cache Data Prefetch : Enabled
Data Reuse          : Enabled
QPI Link Speed      : Max
AEM PowerCapping    : Disabled

__
Tharindu R Bamunuarachchi.

On Fri, Oct 29, 2010 at 9:06 AM, Jiahua <jiahua@gmail.com> wrote:
> Did you check the BIOS NUMA settings?
>
> Jiahua
>
>
> On Thu, Oct 28, 2010 at 7:05 PM, Tharindu Rukshan Bamunuarachchi
> <btharindu@gmail.com> wrote:
>>
>> Finally I could isolate the issue further.
>> I tried following kernels and hardware.
>> Issue is visible only with IBM + SLES 11.
>>
>> 1. SLES 11 + IBM HW --> Issue is Visible
>> 2. SLES 11 + HP, Sun HW --> Issue is not Visible
>> 3. 2.6.32 Vanilla + Any HW --> Issue is not Visible
>> 4. 2.6.36 Vanilla + Any HW --> Issue is not Visible
>>
>> HP has same hardware as IBM. Both Nehalem. Sun is bit old Opteron.
>>
>> Any thoughts ?
>> __
>> Tharindu R Bamunuarachchi.
>>
>>
>>
>>
>> On Thu, Oct 28, 2010 at 3:06 AM, Cliff Wickman <cpw@sgi.com> wrote:
>> > Hi Tharindu,
>> >
>> > On Tue, Oct 26, 2010 at 09:57:53PM +0530, Tharindu Rukshan
>> > Bamunuarachchi wrote:
>> >> Dear All,
>> >>
>> >> Today, we experienced abnormal memory allocation behavior.
>> >> I do not know whether this is the expected behavior or due to
>> >> misconfiguration.
>> >>
>> >> I have two node NUMA system and 100G TMPFS mount.
>> >>
>> >> 1. When "dd" running freely (without CPU affinity) all memory pages
>> >> were allocated from NODE 0 and then from NODE 1.
>> >>
>> >> 2. When "dd" running bound (using taskset) to CPU core in NODE 1 ....
>> >> All memory pages were started to be allocated from NODE 1.
>> >> BUT machine stopped responding after exhausting NODE 1.
>> >> No memory pages were started to be allocated from NODE 0.
>> >>
>> >> Why "dd" cannot allocate memory from NODE 0 when it is running bound
>> >> to NODE 1 CPU core ?
>> >>
>> >> Please help.
>> >> I am using SLES 11 with 2.6.27 kernel.
>> >
>> > I'm no expert on the taskset command, but from what I can see, it
>> > just uses sched_setaffinity() to set cpu affinity. I don't see any
>> > set_mempolicy calls to affect memory affinity. So I see no reason
>> > for restricting memory allocation.
>> > You're not using some other placement mechanism in conjunction with
>> > taskset, are you? A cpuset for example?
>> >
>> > -Cliff
>> > --
>> > Cliff Wickman
>> > SGI
>> > cpw@sgi.com
>> > (651) 683-3824
>> >
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-numa" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: NUMA page allocation from next Node
  2010-10-29  7:06 ` Tharindu Rukshan Bamunuarachchi
@ 2010-10-29  8:49   ` Andi Kleen
  2010-10-29  9:16     ` Tharindu Rukshan Bamunuarachchi
  0 siblings, 1 reply; 14+ messages in thread
From: Andi Kleen @ 2010-10-29 8:49 UTC (permalink / raw)
  To: Tharindu Rukshan Bamunuarachchi; +Cc: jiahua, Cliff Wickman, linux-numa

On Fri, Oct 29, 2010 at 12:36:47PM +0530, Tharindu Rukshan Bamunuarachchi wrote:
> What kind of NUMA setting i should look for ? I gather following config values ?

Most likely one of the systems forces zone reclaim by having large
SLIT values and the other doesn't. You can configure zone reclaim
manually through sysctl.

You should not get a lockup in any case, though; that's some kind of
VM bug in 2.6.27. I would recommend reporting that to Novell.

-Andi

^ permalink raw reply [flat|nested] 14+ messages in thread
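The sysctl Andi refers to can be inspected and changed at runtime; a
minimal sketch (whether 0 is the right value depends on how expensive
remote-node access is for the workload):

    # see whether the kernel enabled zone reclaim at boot
    cat /proc/sys/vm/zone_reclaim_mode

    # turn it off for this boot so allocations fall back to the other node
    sysctl -w vm.zone_reclaim_mode=0

    # make it persistent across reboots
    echo "vm.zone_reclaim_mode = 0" >> /etc/sysctl.conf
    sysctl -p
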
* Re: NUMA page allocation from next Node
  2010-10-29  8:49 ` Andi Kleen
@ 2010-10-29  9:16   ` Tharindu Rukshan Bamunuarachchi
  0 siblings, 0 replies; 14+ messages in thread
From: Tharindu Rukshan Bamunuarachchi @ 2010-10-29 9:16 UTC (permalink / raw)
  To: Andi Kleen; +Cc: jiahua, Cliff Wickman, linux-numa

You mean changing node/zone reclaim behaviour through this ...

/proc/sys/vm/zone_reclaim_mode

On Fri, Oct 29, 2010 at 2:19 PM, Andi Kleen <andi@firstfloor.org> wrote:
> On Fri, Oct 29, 2010 at 12:36:47PM +0530, Tharindu Rukshan Bamunuarachchi wrote:
>> What kind of NUMA setting i should look for ? I gather following config values ?
>
> Most likely one of the systems forces zone reclaim by having large
> SLIT values and the other doesn't. You can configure zone reclaim
> manually through sysctl.
>
> You should not get a lockup though in any case, that's some kind of
> VM bug in 2.6.27. I would recommend reporting that to Novell.
>
> -Andi
>

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: NUMA page allocation from next Node
  2010-10-29  2:05 ` Tharindu Rukshan Bamunuarachchi
  [not found]   ` <20101029033058.GB555@www.lurndal.org>
  [not found]   ` <AANLkTikDtKc7RdAWJagqCf7T0JKscfe0Hd0ojc8g7yYo@mail.gmail.com>
@ 2010-10-29 19:52   ` Tim Pepper
  2010-10-29 20:30     ` Lee Schermerhorn
  2010-11-01 14:18     ` Tharindu Rukshan Bamunuarachchi
  2 siblings, 2 replies; 14+ messages in thread
From: Tim Pepper @ 2010-10-29 19:52 UTC (permalink / raw)
  To: Tharindu Rukshan Bamunuarachchi; +Cc: linux-numa

On Fri 29 Oct at 07:35:35 +0530 btharindu@gmail.com said:
> Finally I could isolate the issue further.
> I tried following kernels and hardware.
> Issue is visible only with IBM + SLES 11.
>
> 1. SLES 11 + IBM HW --> Issue is Visible
> 2. SLES 11 + HP, Sun HW --> Issue is not Visible
> 3. 2.6.32 Vanilla + Any HW --> Issue is not Visible
> 4. 2.6.36 Vanilla + Any HW --> Issue is not Visible

It would be interesting to see the output of "numactl --hardware" for each
of these scenarios.

--
Tim Pepper <lnxninja@linux.vnet.ibm.com>
IBM Linux Technology Center

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: NUMA page allocation from next Node
  2010-10-29 19:52 ` Tim Pepper
@ 2010-10-29 20:30   ` Lee Schermerhorn
  2010-11-01 13:55     ` Tharindu Rukshan Bamunuarachchi
  0 siblings, 1 reply; 14+ messages in thread
From: Lee Schermerhorn @ 2010-10-29 20:30 UTC (permalink / raw)
  To: Tim Pepper; +Cc: Tharindu Rukshan Bamunuarachchi, linux-numa

On Fri, 2010-10-29 at 12:52 -0700, Tim Pepper wrote:
> On Fri 29 Oct at 07:35:35 +0530 btharindu@gmail.com said:
> > Finally I could isolate the issue further.
> > I tried following kernels and hardware.
> > Issue is visible only with IBM + SLES 11.
> >
> > 1. SLES 11 + IBM HW --> Issue is Visible
> > 2. SLES 11 + HP, Sun HW --> Issue is not Visible
> > 3. 2.6.32 Vanilla + Any HW --> Issue is not Visible
> > 4. 2.6.36 Vanilla + Any HW --> Issue is not Visible
>
> It would be interesting to see the output of "numactl --hardware" for each
> of these scenarios.
>

Also, if you could add "mminit_loglevel=2" to the boot command line, and
grep for 'zonelist general'. The general zonelists for the Normal zones
will show the order of allocation for the two nodes. On a 2 node [AMD]
platform, I see:

xxx(lts)dmesg | grep 'zonelist general'
mminit::zonelist general 0:DMA = 0:DMA
mminit::zonelist general 0:DMA32 = 0:DMA32 0:DMA
mminit::zonelist general 0:Normal = 0:Normal 0:DMA32 0:DMA 1:Normal
mminit::zonelist general 1:Normal = 1:Normal 0:Normal 0:DMA32 0:DMA

so, node 0 Normal zone allocates from 0:Normal first, as expected, and
then falls back via DMA32, DMA [both on node 0], eventually to node 1
Normal. Node 1 starts locally and falls back to node 0 Normal and,
finally, the DMA zones.

You can also try:

cat /proc/zoneinfo | egrep '^Node|^ pages|^ +present'

and maybe "watch" that [watch(1)] while you run your tests.

And, just to be sure, you could suspend your dd job [^Z] and take a look
at its mempolicy and such via /proc/<pid>/status [Mems_allowed*] and
its /proc/<pid>/numa_maps. If you haven't changed anything you should
see both nodes in Mems_allowed[_list] and all of the policies in the
numa_maps should show 'default'.

Andi already mentioned zone_reclaim_mode. You'll want that set to '0'
if you want allocations to overflow/fall back to off-node without
attempting direct reclaim first. E.g., set vm.zone_reclaim_mode = 0 in
your /etc/sysctl.conf and reload via 'sysctl -p' if you want it to
stick.

Regards,
Lee

^ permalink raw reply [flat|nested] 14+ messages in thread
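Putting Lee's suggestions together, one possible way to run the whole
experiment end to end looks roughly like this (an illustrative sketch
only; the tmpfs mount point, file size, and core number are assumptions
about the test setup, not values from the thread):

    # boot once with mminit_loglevel=2 on the kernel command line,
    # then check the fallback order the kernel actually built
    dmesg | grep 'zonelist general'

    # watch per-node free/present pages while the test runs
    watch -n1 "egrep '^Node|^  pages|^ +present' /proc/zoneinfo"

    # in another shell: fill tmpfs from a core on node 1
    taskset -c 6 dd if=/dev/zero of=/mnt/tmpfs/fill bs=1M count=40000
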
* Re: NUMA page allocation from next Node
  2010-10-29 20:30 ` Lee Schermerhorn
@ 2010-11-01 13:55   ` Tharindu Rukshan Bamunuarachchi
  0 siblings, 0 replies; 14+ messages in thread
From: Tharindu Rukshan Bamunuarachchi @ 2010-11-01 13:55 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: Tim Pepper, linux-numa

Dear All,

On Sat, Oct 30, 2010 at 2:00 AM, Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
>
> Also, if you could add "mminit_loglevel=2" to the boot command line, and
> grep for 'zonelist general'. The general zonelists for the Normal zones
> will show the order of allocation for the two nodes. On a 2 node [AMD]
> platform, I see:
>

Output with "mminit_loglevel=2" ...

mminit::zonelist general 0:DMA = 0:DMA
mminit::zonelist general 0:DMA32 = 0:DMA32 0:DMA
mminit::zonelist general 0:Normal = 0:Normal 1:Normal 0:DMA32 0:DMA
mminit::zonelist general 1:Normal = 1:Normal 0:Normal 0:DMA32 0:DMA

>
> And, just to be sure, you could suspend your dd job [^Z] and take a look
> at its mempolicy and such via /proc/<pid>/status [Mems_allowed*] and
> its /proc/<pid>/numa_maps. If you haven't changed anything you should
> see both nodes in Mems_allowed[_list] and all of the policies in the
> numa_maps should show 'default'.
>

/proc/<PID>/numa_maps shows "default". Both nodes are shown in "Mems_allowed*":

Mems_allowed: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000003
Mems_allowed_list: 0-1

> Andi already mentioned zone_reclaim_mode. You'll want that set to '0'
> if you want allocations to overflow/fallback to off-node without
> attempting direct reclaim first. E.g., set vm.zone_reclaim_mode = 0 in
> your /etc/sysctl.conf and reload via 'sysctl -p' if you want it to
> stick.
>

I set zone_reclaim_mode to zero and it is working fine. :-)
"dd" can now allocate the remaining memory from the other node.

BTW, I have tried several vanilla kernels and the issue is not visible
after 2.6.31. Is there any way to identify the patch that fixed this in
the 2.6.31+ trees?

Thanks a lot for your support.
Tharindu.

^ permalink raw reply [flat|nested] 14+ messages in thread
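For the question about locating the change, the usual approach is to
search or bisect the kernel history between the two versions. A rough
sketch, assuming a local clone of the mainline tree and that the
behaviour can be reproduced on each test build (the file list and grep
pattern are only guesses at where such a change would live):

    # look for likely candidates first
    git log --oneline v2.6.27..v2.6.31 -- mm/page_alloc.c mm/vmscan.c | grep -i reclaim

    # or bisect; the behaviour *improves* between the tags, so treat the
    # "fixed" kernel as bad and the "broken" one as good to find the
    # commit that changed it
    git bisect start v2.6.31 v2.6.27
    # build and boot each kernel git checks out, rerun the dd test, then:
    git bisect bad     # if the fallback to node 0 now works
    git bisect good    # if node 1 still gets exhausted with no fallback
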
* Re: NUMA page allocation from next Node
  2010-10-29 19:52 ` Tim Pepper
  2010-10-29 20:30   ` Lee Schermerhorn
@ 2010-11-01 14:18   ` Tharindu Rukshan Bamunuarachchi
  2010-11-01 14:59     ` Lee Schermerhorn
  2010-11-01 17:59     ` Andi Kleen
  1 sibling, 2 replies; 14+ messages in thread
From: Tharindu Rukshan Bamunuarachchi @ 2010-11-01 14:18 UTC (permalink / raw)
  To: Tim Pepper; +Cc: linux-numa

Tim,

I found that the default value of "zone_reclaim_mode" is zero on the HP
machine, but it is one on the IBM machine.
Why is it set to 1 or 0 on different hardware ?

On Sat, Oct 30, 2010 at 1:22 AM, Tim Pepper <lnxninja@linux.vnet.ibm.com> wrote:
>
> It would be interesting to see the output of "numactl --hardware" for each
> of these scenarios.
>

1. SLES 11 + IBM HW
After consuming all memory on node 1, it shows the following:

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5
node 0 size: 24564 MB
node 0 free: 23025 MB
node 1 cpus: 6 7 8 9 10 11
node 1 size: 24576 MB
node 1 free: 16 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

2. SLES 11 + HP HW

available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6
node 0 size: 24565 MB
node 0 free: 19929 MB
node 1 cpus: 1 3 5 7
node 1 size: 24575 MB
node 1 free: 3043 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6
node 0 size: 24565 MB
node 0 free: 19912 MB
node 1 cpus: 1 3 5 7
node 1 size: 24575 MB
node 1 free: 335 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6
node 0 size: 24565 MB
node 0 free: 17066 MB
node 1 cpus: 1 3 5 7
node 1 size: 24575 MB
node 1 free: 16 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6
node 0 size: 24565 MB
node 0 free: 10468 MB
node 1 cpus: 1 3 5 7
node 1 size: 24575 MB
node 1 free: 16 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

> --
> Tim Pepper <lnxninja@linux.vnet.ibm.com>
> IBM Linux Technology Center
>

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: NUMA page allocation from next Node
  2010-11-01 14:18 ` Tharindu Rukshan Bamunuarachchi
@ 2010-11-01 14:59   ` Lee Schermerhorn
  2010-11-01 18:00     ` Andi Kleen
  0 siblings, 1 reply; 14+ messages in thread
From: Lee Schermerhorn @ 2010-11-01 14:59 UTC (permalink / raw)
  To: Tharindu Rukshan Bamunuarachchi; +Cc: Tim Pepper, linux-numa

On Mon, 2010-11-01 at 19:48 +0530, Tharindu Rukshan Bamunuarachchi wrote:
> Tim,
>
> I found that default value for "zone_reclaim_mode" is zero in HP
> machine. But It is one in IBM.
> Why does it set 1 or 0 in different hardware

Because the SLIT on the IBM platform has distances > 20. Looks like IBM
is populating the SLIT on those platforms with "real" values. The HP
bios is not supplying a slit, letting the remote distances default to
20. That is the threshold for setting zone_reclaim_mode. A patch was
submitted recently to bump the threshold to ~30. Now that vendors are
starting to populate the SLIT with values > 20, we've begun to see the
behavior that you experienced.

Regards,
Lee

>
> On Sat, Oct 30, 2010 at 1:22 AM, Tim Pepper <lnxninja@linux.vnet.ibm.com> wrote:
> >
> > It would be interesting to see the output of "numactl --hardware" for each
> > of these scenarios.
> >
>
> 1. SLES11 + IBM HW
> After consuming all memory in node1, it shows following ...
>
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 4 5
> node 0 size: 24564 MB
> node 0 free: 23025 MB
> node 1 cpus: 6 7 8 9 10 11
> node 1 size: 24576 MB
> node 1 free: 16 MB
> node distances:
> node   0   1
>   0:  10  21
>   1:  21  10
>
>
> 2. SLES 11 + HP
>
> available: 2 nodes (0-1)
> node 0 cpus: 0 2 4 6
> node 0 size: 24565 MB
> node 0 free: 19929 MB
> node 1 cpus: 1 3 5 7
> node 1 size: 24575 MB
> node 1 free: 3043 MB
> node distances:
> node   0   1
>   0:  10  20
>   1:  20  10
>
> available: 2 nodes (0-1)
> node 0 cpus: 0 2 4 6
> node 0 size: 24565 MB
> node 0 free: 19912 MB
> node 1 cpus: 1 3 5 7
> node 1 size: 24575 MB
> node 1 free: 335 MB
> node distances:
> node   0   1
>   0:  10  20
>   1:  20  10
>
> available: 2 nodes (0-1)
> node 0 cpus: 0 2 4 6
> node 0 size: 24565 MB
> node 0 free: 17066 MB
> node 1 cpus: 1 3 5 7
> node 1 size: 24575 MB
> node 1 free: 16 MB
> node distances:
> node   0   1
>   0:  10  20
>   1:  20  10
>
> available: 2 nodes (0-1)
> node 0 cpus: 0 2 4 6
> node 0 size: 24565 MB
> node 0 free: 10468 MB
> node 1 cpus: 1 3 5 7
> node 1 size: 24575 MB
> node 1 free: 16 MB
> node distances:
> node   0   1
>   0:  10  20
>   1:  20  10
>
>
> > --
> > Tim Pepper <lnxninja@linux.vnet.ibm.com>
> > IBM Linux Technology Center
> >
> --
> To unsubscribe from this list: send the line "unsubscribe linux-numa" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply [flat|nested] 14+ messages in thread
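Lee's explanation can be verified on a running system, since the kernel
exports the ACPI SLIT distances through sysfs. A small illustrative
check (node numbers assume a two-node box; the exact distance threshold
that flips zone_reclaim_mode on depends on the kernel version):

    # SLIT distances as the kernel sees them (first value is the local node)
    cat /sys/devices/system/node/node0/distance
    cat /sys/devices/system/node/node1/distance

    # what the allocator decided at boot based on those distances
    cat /proc/sys/vm/zone_reclaim_mode

    # numactl prints the same matrix under "node distances:"
    numactl --hardware | sed -n '/node distances/,$p'
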
* Re: NUMA page allocation from next Node
  2010-11-01 14:59 ` Lee Schermerhorn
@ 2010-11-01 18:00   ` Andi Kleen
  2010-11-02  0:49     ` Tharindu Rukshan Bamunuarachchi
  0 siblings, 1 reply; 14+ messages in thread
From: Andi Kleen @ 2010-11-01 18:00 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: Tharindu Rukshan Bamunuarachchi, Tim Pepper, linux-numa

> Because the SLIT on the IBM platform has distances > 20. Looks like IBM
> is populating the SLIT on those platforms with "real" values. The HP
> bios is not supplying a slit, letting the remote distances default to
> 20. That is the threshold for setting zone_reclaim_mode. A patch was
> submitted recently to bump the threshold to ~30. Now that vendors are
> starting to populate the SLIT with values > 20, we've begun to see the
> behavior that you experienced.

I think it's intentional by the vendors: they use it as a way to make
Linux behave like they want.

-Andi

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: NUMA page allocation from next Node
  2010-11-01 18:00 ` Andi Kleen
@ 2010-11-02  0:49   ` Tharindu Rukshan Bamunuarachchi
  0 siblings, 0 replies; 14+ messages in thread
From: Tharindu Rukshan Bamunuarachchi @ 2010-11-02 0:49 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Lee Schermerhorn, Tim Pepper, linux-numa

Andi/Lee/Tim/Scott/Cliff/Jiahua,

Thanks a lot for your valuable input and advice.
__
Tharindu R Bamunuarachchi.

On Mon, Nov 1, 2010 at 11:30 PM, Andi Kleen <andi@firstfloor.org> wrote:
>> Because the SLIT on the IBM platform has distances > 20. Looks like IBM
>> is populating the SLIT on those platforms with "real" values. The HP
>> bios is not supplying a slit, letting the remote distances default to
>> 20. That is the threshold for setting zone_reclaim_mode. A patch was
>> submitted recently to bump the threshold to ~30. Now that vendors are
>> starting to populate the SLIT with values > 20, we've begun to see the
>> behavior that you experienced.
>
> I think it's intentional by the vendors: they use it a way to make
> Linux behave like they want.
>
> -Andi
>

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: NUMA page allocation from next Node
  2010-11-01 14:18 ` Tharindu Rukshan Bamunuarachchi
  2010-11-01 14:59   ` Lee Schermerhorn
@ 2010-11-01 17:59   ` Andi Kleen
  1 sibling, 0 replies; 14+ messages in thread
From: Andi Kleen @ 2010-11-01 17:59 UTC (permalink / raw)
  To: Tharindu Rukshan Bamunuarachchi; +Cc: Tim Pepper, linux-numa

On Mon, Nov 01, 2010 at 07:48:08PM +0530, Tharindu Rukshan Bamunuarachchi wrote:
> Tim,
>
> I found that default value for "zone_reclaim_mode" is zero in HP
> machine. But It is one in IBM.
> Why does it set 1 or 0 in different hardware ?

See my earlier mail: it depends on the SLIT.

-Andi

^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, newest message: 2010-11-02 0:49 UTC

Thread overview: 14+ messages
2010-10-26 16:27 NUMA page allocation from next Node Tharindu Rukshan Bamunuarachchi
[not found] ` <20101027213652.GA12345@sgi.com>
2010-10-29 2:05 ` Tharindu Rukshan Bamunuarachchi
[not found] ` <20101029033058.GB555@www.lurndal.org>
2010-10-29 6:58 ` Tharindu Rukshan Bamunuarachchi
[not found] ` <AANLkTikDtKc7RdAWJagqCf7T0JKscfe0Hd0ojc8g7yYo@mail.gmail.com>
2010-10-29 7:06 ` Tharindu Rukshan Bamunuarachchi
2010-10-29 8:49 ` Andi Kleen
2010-10-29 9:16 ` Tharindu Rukshan Bamunuarachchi
2010-10-29 19:52 ` Tim Pepper
2010-10-29 20:30 ` Lee Schermerhorn
2010-11-01 13:55 ` Tharindu Rukshan Bamunuarachchi
2010-11-01 14:18 ` Tharindu Rukshan Bamunuarachchi
2010-11-01 14:59 ` Lee Schermerhorn
2010-11-01 18:00 ` Andi Kleen
2010-11-02 0:49 ` Tharindu Rukshan Bamunuarachchi
2010-11-01 17:59 ` Andi Kleen