* NUMA page allocation from next Node
@ 2010-10-26 16:27 Tharindu Rukshan Bamunuarachchi
[not found] ` <20101027213652.GA12345@sgi.com>
0 siblings, 1 reply; 14+ messages in thread
From: Tharindu Rukshan Bamunuarachchi @ 2010-10-26 16:27 UTC (permalink / raw)
To: linux-numa
Dear All,
Today, we experienced abnormal memory allocation behavior.
I do not know whether this is the expected behavior or due to misconfiguration.
I have a two-node NUMA system and a 100G tmpfs mount.
1. When "dd" was running freely (without CPU affinity), all memory pages
were allocated from NODE 0 first and then from NODE 1.
2. When "dd" was bound (using taskset) to a CPU core on NODE 1, all
memory pages were allocated from NODE 1, BUT the machine stopped
responding after NODE 1 was exhausted. No pages were ever allocated
from NODE 0.
Why can "dd" not allocate memory from NODE 0 when it is bound to a
NODE 1 CPU core?
Please help.
I am using SLES 11 with the 2.6.27 kernel.
__
Tharindu R Bamunuarachchi.
^ permalink raw reply [flat|nested] 14+ messages in thread

[parent not found: <20101027213652.GA12345@sgi.com>]
* Re: NUMA page allocation from next Node
  [not found] ` <20101027213652.GA12345@sgi.com>
@ 2010-10-29  2:05   ` Tharindu Rukshan Bamunuarachchi
  [not found]     ` <20101029033058.GB555@www.lurndal.org>
  ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Tharindu Rukshan Bamunuarachchi @ 2010-10-29 2:05 UTC (permalink / raw)
  To: Cliff Wickman

Finally I could isolate the issue further.
I tried the following kernels and hardware.
The issue is visible only with IBM + SLES 11.

1. SLES 11 + IBM HW        --> Issue is visible
2. SLES 11 + HP, Sun HW    --> Issue is not visible
3. 2.6.32 Vanilla + Any HW --> Issue is not visible
4. 2.6.36 Vanilla + Any HW --> Issue is not visible

HP has the same hardware as IBM; both are Nehalem. The Sun box is a
somewhat older Opteron.

Any thoughts ?
__
Tharindu R Bamunuarachchi.

On Thu, Oct 28, 2010 at 3:06 AM, Cliff Wickman <cpw@sgi.com> wrote:
> Hi Tharindu,
>
> On Tue, Oct 26, 2010 at 09:57:53PM +0530, Tharindu Rukshan Bamunuarachchi wrote:
>> Dear All,
>>
>> Today, we experienced abnormal memory allocation behavior.
>> I do not know whether this is the expected behavior or due to misconfiguration.
>>
>> I have two node NUMA system and 100G TMPFS mount.
>>
>> 1. When "dd" running freely (without CPU affinity) all memory pages
>> were allocated from NODE 0 and then from NODE 1.
>>
>> 2. When "dd" running bound (using taskset) to CPU core in NODE 1 ....
>> All memory pages were started to be allocated from NODE 1.
>> BUT machine stopped responding after exhausting NODE 1.
>> No memory pages were started to be allocated from NODE 0.
>>
>> Why "dd" cannot allocate memory from NODE 0 when it is running bound
>> to NODE 1 CPU core ?
>>
>> Please help.
>> I am using SLES 11 with 2.6.27 kernel.
>
> I'm no expert on the taskset command, but from what I can see, it
> just uses sched_setaffinity() to set cpu affinity. I don't see any
> set_mempolicy calls to affect memory affinity. So I see no reason
> for restricting memory allocation.
> You're not using some other placement mechanism in conjunction with
> taskset, are you? A cpuset for example?
>
> -Cliff
> --
> Cliff Wickman
> SGI
> cpw@sgi.com
> (651) 683-3824
>

^ permalink raw reply [flat|nested] 14+ messages in thread
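Cliff's point is easy to check from userspace: taskset only calls
sched_setaffinity(), so the task's memory policy and allowed node mask
should still cover both nodes. A minimal sketch of such a check
(the core number, tmpfs path, and file size below are illustrative
assumptions, not values taken from this thread):

    # run dd pinned to a core that sits on node 1, then inspect it
    taskset -c 6 dd if=/dev/zero of=/mnt/tmpfs/fill bs=1M count=40000 &
    pid=$!

    # CPU affinity is restricted to the chosen core ...
    taskset -p $pid

    # ... but memory placement should not be: both nodes listed,
    # and every mapping should show the "default" policy
    grep Mems_allowed /proc/$pid/status
    grep -c default /proc/$pid/numa_maps

    # per-node allocation counters (numa_hit / numa_miss / other_node)
    numastat
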
[parent not found: <20101029033058.GB555@www.lurndal.org>]
* Re: NUMA page allocation from next Node [not found] ` <20101029033058.GB555@www.lurndal.org> @ 2010-10-29 6:58 ` Tharindu Rukshan Bamunuarachchi 0 siblings, 0 replies; 14+ messages in thread From: Tharindu Rukshan Bamunuarachchi @ 2010-10-29 6:58 UTC (permalink / raw) To: Scott Lurndal; +Cc: Cliff Wickman, linux-numa I have gone through BIOS settings and only applicable setting was memory type : NUMA or Non-NUMA. (current value is NUMA) I have attached part of dmesg output. Is there any other tool or way to gather info ? DMESG ====== Initializing cgroup subsys cpuset Initializing cgroup subsys cpu Linux version 2.6.27.45-0.1-default (geeko@buildhost) (gcc version 4.3.2 [gcc-4_3-branch revision 141291] (SUSE Linux) ) #1 SMP 2010-02-22 16:49:47 +0100 Command line: root=/dev/disk/by-id/scsi-3600605b0023f45a01449ea30199cc9ae-part1 resume=/dev/disk/by-id/scsi-3600605b0023f45a01449ea30199cc9ae-part3 splash=silent crashkernel=256M-:128M@16M vga=0x314 KERNEL supported cpus: Intel GenuineIntel AMD AuthenticAMD Centaur CentaurHauls BIOS-provided physical RAM map: BIOS-e820: 0000000000000000 - 000000000009c400 (usable) BIOS-e820: 000000000009c400 - 00000000000a0000 (reserved) BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved) BIOS-e820: 0000000000100000 - 000000007d151000 (usable) BIOS-e820: 000000007d151000 - 000000007d215000 (reserved) BIOS-e820: 000000007d215000 - 000000007d854000 (usable) BIOS-e820: 000000007d854000 - 000000007d904000 (reserved) BIOS-e820: 000000007d904000 - 000000007f68f000 (usable) BIOS-e820: 000000007f68f000 - 000000007f6df000 (reserved) BIOS-e820: 000000007f6df000 - 000000007f7df000 (ACPI NVS) BIOS-e820: 000000007f7df000 - 000000007f7ff000 (ACPI data) BIOS-e820: 000000007f7ff000 - 000000007f800000 (usable) BIOS-e820: 000000007f800000 - 0000000090000000 (reserved) BIOS-e820: 00000000fc000000 - 00000000fd000000 (reserved) BIOS-e820: 00000000fed1c000 - 00000000fed20000 (reserved) BIOS-e820: 00000000ff800000 - 0000000100000000 (reserved) BIOS-e820: 0000000100000000 - 0000000c80000000 (usable) DMI 2.5 present. 
last_pfn = 0xc80000 max_arch_pfn = 0x100000000 x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106 last_pfn = 0x7f800 max_arch_pfn = 0x100000000 init_memory_mapping Using GB pages for direct mapping 0000000000 - 0040000000 page 1G 0040000000 - 007f800000 page 2M kernel direct mapping tables up to 7f800000 @ 8000-a000 last_map_addr: 7f800000 end: 7f800000 init_memory_mapping Using GB pages for direct mapping 0100000000 - 0c80000000 page 1G kernel direct mapping tables up to c80000000 @ 9000-a000 last_map_addr: c80000000 end: c80000000 RAMDISK: 37a03000 - 37fef962 ACPI: RSDP 000FDFD0, 0024 (r2 IBM ) ACPI: XSDT 7F7FE120, 0084 (r1 IBM THURLEY 0 1000013) ACPI: FACP 7F7FB000, 00F4 (r4 IBM THURLEY 0 IBM 1000013) ACPI: DSDT 7F7F8000, 2BF3 (r1 IBM THURLEY 3 IBM 1000013) ACPI: FACS 7F6EC000, 0040 ACPI: TCPA 7F7FD000, 0064 (r0 0 0) ACPI: APIC 7F7F7000, 011E (r2 IBM THURLEY 0 IBM 1000013) ACPI: MCFG 7F7F6000, 003C (r1 IBM THURLEY 1 IBM 1000013) ACPI: SLIC 7F7F5000, 0176 (r1 IBM THURLEY 0 IBM 1000013) ACPI: HPET 7F7F4000, 0038 (r1 IBM THURLEY 1 IBM 1000013) ACPI: SRAT 7F7F3000, 0168 (r2 IBM THURLEY 1 IBM 1000013) ACPI: SLIT 7F7F2000, 0030 (r1 IBM THURLEY 0 IBM 1000013) ACPI: SSDT 7F7F1000, 0183 (r2 IBM CPUSCOPE 4000 IBM 1000013) ACPI: SSDT 7F7F0000, 0699 (r2 IBM CPUWYVRN 4000 IBM 1000013) ACPI: ERST 7F7EF000, 0230 (r1 IBM THURLEY 1 IBM 1000013) ACPI: DMAR 7F7EE000, 00D8 (r1 IBM THURLEY 1 IBM 1000013) ACPI: Local APIC address 0xfee00000 SRAT: PXM 0 -> APIC 0 -> Node 0 SRAT: PXM 0 -> APIC 2 -> Node 0 SRAT: PXM 0 -> APIC 4 -> Node 0 SRAT: PXM 0 -> APIC 16 -> Node 0 SRAT: PXM 0 -> APIC 18 -> Node 0 SRAT: PXM 0 -> APIC 20 -> Node 0 SRAT: PXM 1 -> APIC 32 -> Node 1 SRAT: PXM 1 -> APIC 34 -> Node 1 SRAT: PXM 1 -> APIC 36 -> Node 1 SRAT: PXM 1 -> APIC 48 -> Node 1 SRAT: PXM 1 -> APIC 50 -> Node 1 SRAT: PXM 1 -> APIC 52 -> Node 1 SRAT: Node 0 PXM 0 0-80000000 SRAT: Node 0 PXM 0 100000000-680000000 SRAT: Node 1 PXM 1 680000000-c80000000 NUMA: Using 31 for the hash shift. 
Bootmem setup node 0 0000000000000000-0000000680000000 NODE_DATA [0000000000009000 - 0000000000020fff] bootmap [0000000000100000 - 00000000001cffff] pages d0 (7 early reservations) ==> bootmem [0000000000 - 0680000000] #0 [0000000000 - 0000001000] BIOS data page ==> [0000000000 - 0000001000] #1 [0000006000 - 0000008000] TRAMPOLINE ==> [0000006000 - 0000008000] #2 [0000200000 - 0000bcc8b8] TEXT DATA BSS ==> [0000200000 - 0000bcc8b8] #3 [0037a03000 - 0037fef962] RAMDISK ==> [0037a03000 - 0037fef962] #4 [000009c400 - 0000100000] BIOS reserved ==> [000009c400 - 0000100000] #5 [0000008000 - 0000009000] PGTABLE ==> [0000008000 - 0000009000] #6 [0000001000 - 0000001030] ACPI SLIT ==> [0000001000 - 0000001030] Bootmem setup node 1 0000000680000000-0000000c80000000 NODE_DATA [0000000680000000 - 0000000680017fff] bootmap [0000000680018000 - 00000006800d7fff] pages c0 (7 early reservations) ==> bootmem [0680000000 - 0c80000000] #0 [0000000000 - 0000001000] BIOS data page #1 [0000006000 - 0000008000] TRAMPOLINE #2 [0000200000 - 0000bcc8b8] TEXT DATA BSS #3 [0037a03000 - 0037fef962] RAMDISK #4 [000009c400 - 0000100000] BIOS reserved #5 [0000008000 - 0000009000] PGTABLE #6 [0000001000 - 0000001030] ACPI SLIT found SMP MP-table at [ffff88000009c540] 0009c540 Reserving 128MB of memory at 16MB for crashkernel (System RAM: 51200MB) [ffffe20000000000-ffffe200117fffff] PMD -> [ffff880028200000-ffff8800379fffff] on node 0 [ffffe20011800000-ffffe20019ffffff] PMD -> [ffff880038000000-ffff8800407fffff] on node 0 [ffffe2001a000000-ffffe20031ffffff] PMD -> [ffff880680200000-ffff8806981fffff] on node 1 Zone PFN ranges: DMA 0x00000000 -> 0x00001000 DMA32 0x00001000 -> 0x00100000 Normal 0x00100000 -> 0x00c80000 Movable zone start PFN for each node early_node_map[7] active PFN ranges 0: 0x00000000 -> 0x0000009c 0: 0x00000100 -> 0x0007d151 0: 0x0007d215 -> 0x0007d854 0: 0x0007d904 -> 0x0007f68f 0: 0x0007f7ff -> 0x0007f800 0: 0x00100000 -> 0x00680000 1: 0x00680000 -> 0x00c80000 On node 0 totalpages: 6288568 DMA zone: 1319 pages, LIFO batch:0 DMA32 zone: 501084 pages, LIFO batch:31 Normal zone: 5677056 pages, LIFO batch:31 On node 1 totalpages: 6291456 Normal zone: 6193152 pages, LIFO batch:31 ACPI: PM-Timer IO Port: 0x588 ACPI: Local APIC address 0xfee00000 ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled) ACPI: LAPIC (acpi_id[0x01] lapic_id[0x02] enabled) ACPI: LAPIC (acpi_id[0x02] lapic_id[0x04] enabled) ACPI: LAPIC (acpi_id[0x03] lapic_id[0x10] enabled) ACPI: LAPIC (acpi_id[0x04] lapic_id[0x12] enabled) ACPI: LAPIC (acpi_id[0x05] lapic_id[0x14] enabled) ACPI: LAPIC (acpi_id[0x06] lapic_id[0x20] enabled) ACPI: LAPIC (acpi_id[0x07] lapic_id[0x22] enabled) ACPI: LAPIC (acpi_id[0x08] lapic_id[0x24] enabled) ACPI: LAPIC (acpi_id[0x09] lapic_id[0x30] enabled) ACPI: LAPIC (acpi_id[0x0a] lapic_id[0x32] enabled) ACPI: LAPIC (acpi_id[0x0b] lapic_id[0x34] enabled) ACPI: LAPIC (acpi_id[0x0c] lapic_id[0x01] disabled) ACPI: LAPIC (acpi_id[0x0d] lapic_id[0x03] disabled) ACPI: LAPIC (acpi_id[0x0e] lapic_id[0x05] disabled) ACPI: LAPIC (acpi_id[0x0f] lapic_id[0x11] disabled) ACPI: LAPIC (acpi_id[0x10] lapic_id[0x13] disabled) ACPI: LAPIC (acpi_id[0x11] lapic_id[0x15] disabled) ACPI: LAPIC (acpi_id[0x12] lapic_id[0x21] disabled) ACPI: LAPIC (acpi_id[0x13] lapic_id[0x23] disabled) ACPI: LAPIC (acpi_id[0x14] lapic_id[0x25] disabled) ACPI: LAPIC (acpi_id[0x15] lapic_id[0x31] disabled) ACPI: LAPIC (acpi_id[0x16] lapic_id[0x33] disabled) ACPI: LAPIC (acpi_id[0x17] lapic_id[0x35] disabled) ACPI: LAPIC_NMI (acpi_id[0xff] dfl dfl lint[0x1]) 
ACPI: IOAPIC (id[0x08] address[0xfec00000] gsi_base[0]) IOAPIC[0]: apic_id 8, version 0, address 0xfec00000, GSI 0-23 ACPI: IOAPIC (id[0x09] address[0xfec80000] gsi_base[24]) IOAPIC[1]: apic_id 9, version 0, address 0xfec80000, GSI 24-47 ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl) ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level) ACPI: IRQ0 used by override. ACPI: IRQ2 used by override. ACPI: IRQ9 used by override. ACPI: HPET id: 0x8086a301 base: 0xfed00000 Using ACPI (MADT) for SMP configuration information SMP: Allowing 24 CPUs, 12 hotplug CPUs PM: Registered nosave memory: 000000000009c000 - 000000000009d000 PM: Registered nosave memory: 000000000009d000 - 00000000000a0000 PM: Registered nosave memory: 00000000000a0000 - 00000000000e0000 PM: Registered nosave memory: 00000000000e0000 - 0000000000100000 PM: Registered nosave memory: 000000007d151000 - 000000007d215000 PM: Registered nosave memory: 000000007d854000 - 000000007d904000 PM: Registered nosave memory: 000000007f68f000 - 000000007f6df000 PM: Registered nosave memory: 000000007f6df000 - 000000007f7df000 PM: Registered nosave memory: 000000007f7df000 - 000000007f7ff000 PM: Registered nosave memory: 000000007f800000 - 0000000090000000 PM: Registered nosave memory: 0000000090000000 - 00000000fc000000 PM: Registered nosave memory: 00000000fc000000 - 00000000fd000000 PM: Registered nosave memory: 00000000fd000000 - 00000000fed1c000 PM: Registered nosave memory: 00000000fed1c000 - 00000000fed20000 PM: Registered nosave memory: 00000000fed20000 - 00000000ff800000 PM: Registered nosave memory: 00000000ff800000 - 0000000100000000 Allocating PCI resources starting at 98000000 (gap: 90000000:6c000000) PERCPU: Allocating 61472 bytes of per cpu data NR_CPUS: 512, nr_cpu_ids: 24, nr_node_ids 2 Built 2 zonelists in Zone order, mobility grouping on. Total pages: 12372611 Policy zone: Normal __ Tharindu R Bamunuarachchi. On Fri, Oct 29, 2010 at 9:00 AM, Scott Lurndal <scott@lurndal.org> wrote: > On Fri, Oct 29, 2010 at 07:35:35AM +0530, Tharindu Rukshan Bamunuarachchi wrote: >> Finally I could isolate the issue further. >> I tried following kernels and hardware. >> Issue is visible only with IBM + SLES 11. > > Check your ACPI settings, make sure the SRAT and SLIT tables > are being provided by the BIOS to the kernel. > > scott > > ^ permalink raw reply [flat|nested] 14+ messages in thread
[parent not found: <AANLkTikDtKc7RdAWJagqCf7T0JKscfe0Hd0ojc8g7yYo@mail.gmail.com>]
* Re: NUMA page allocation from next Node
  [not found] ` <AANLkTikDtKc7RdAWJagqCf7T0JKscfe0Hd0ojc8g7yYo@mail.gmail.com>
@ 2010-10-29  7:06   ` Tharindu Rukshan Bamunuarachchi
  2010-10-29  8:49     ` Andi Kleen
  0 siblings, 1 reply; 14+ messages in thread
From: Tharindu Rukshan Bamunuarachchi @ 2010-10-29 7:06 UTC (permalink / raw)
  To: jiahua; +Cc: Cliff Wickman, linux-numa

What kind of NUMA setting should I look for ? I gathered the following
config values:

Speed               : Max Performance
LV-DIMM Power       : Low Power
Memory Channel Mode : Independent
Socket Interleave   : NUMA
Patrol Scrub        : Disabled
Demand Scrub        : Enabled
Turbo Mode          : Enabled
Turbo Boost         : Traditional
C1 Enhanced Mode    : Disabled
Report C2 OS        : Disabled
ACPI C-State        : C3
VT                  : Disabled
Cache Data Prefetch : Enabled
Data Reuse          : Enabled
QPI Link Speed      : Max
AEM PowerCapping    : Disabled

__
Tharindu R Bamunuarachchi.

On Fri, Oct 29, 2010 at 9:06 AM, Jiahua <jiahua@gmail.com> wrote:
> Did you check the BIOS NUMA settings?
>
> Jiahua
>
>
> On Thu, Oct 28, 2010 at 7:05 PM, Tharindu Rukshan Bamunuarachchi
> <btharindu@gmail.com> wrote:
>>
>> Finally I could isolate the issue further.
>> I tried following kernels and hardware.
>> Issue is visible only with IBM + SLES 11.
>>
>> 1. SLES 11 + IBM HW --> Issue is Visible
>> 2. SLES 11 + HP, Sun HW --> Issue is not Visible
>> 3. 2.6.32 Vanilla + Any HW --> Issue is not Visible
>> 4. 2.6.36 Vanilla + Any HW --> Issue is not Visible
>>
>> HP has same hardware as IBM. Both Nehalem. Sun is bit old Opteron.
>>
>> Any thoughts ?
>> __
>> Tharindu R Bamunuarachchi.
>>
>>
>>
>>
>> On Thu, Oct 28, 2010 at 3:06 AM, Cliff Wickman <cpw@sgi.com> wrote:
>> > Hi Tharindu,
>> >
>> > On Tue, Oct 26, 2010 at 09:57:53PM +0530, Tharindu Rukshan
>> > Bamunuarachchi wrote:
>> >> Dear All,
>> >>
>> >> Today, we experienced abnormal memory allocation behavior.
>> >> I do not know whether this is the expected behavior or due to
>> >> misconfiguration.
>> >>
>> >> I have two node NUMA system and 100G TMPFS mount.
>> >>
>> >> 1. When "dd" running freely (without CPU affinity) all memory pages
>> >> were allocated from NODE 0 and then from NODE 1.
>> >>
>> >> 2. When "dd" running bound (using taskset) to CPU core in NODE 1 ....
>> >> All memory pages were started to be allocated from NODE 1.
>> >> BUT machine stopped responding after exhausting NODE 1.
>> >> No memory pages were started to be allocated from NODE 0.
>> >>
>> >> Why "dd" cannot allocate memory from NODE 0 when it is running bound
>> >> to NODE 1 CPU core ?
>> >>
>> >> Please help.
>> >> I am using SLES 11 with 2.6.27 kernel.
>> >
>> > I'm no expert on the taskset command, but from what I can see, it
>> > just uses sched_setaffinity() to set cpu affinity. I don't see any
>> > set_mempolicy calls to affect memory affinity. So I see no reason
>> > for restricting memory allocation.
>> > You're not using some other placement mechanism in conjunction with
>> > taskset, are you? A cpuset for example?
>> >
>> > -Cliff
>> > --
>> > Cliff Wickman
>> > SGI
>> > cpw@sgi.com
>> > (651) 683-3824
>> >
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-numa" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: NUMA page allocation from next Node
  2010-10-29  7:06 ` Tharindu Rukshan Bamunuarachchi
@ 2010-10-29  8:49   ` Andi Kleen
  2010-10-29  9:16     ` Tharindu Rukshan Bamunuarachchi
  0 siblings, 1 reply; 14+ messages in thread
From: Andi Kleen @ 2010-10-29 8:49 UTC (permalink / raw)
  To: Tharindu Rukshan Bamunuarachchi; +Cc: jiahua, Cliff Wickman, linux-numa

On Fri, Oct 29, 2010 at 12:36:47PM +0530, Tharindu Rukshan Bamunuarachchi wrote:
> What kind of NUMA setting i should look for ? I gather following config values ?

Most likely one of the systems forces zone reclaim by having large
SLIT values and the other doesn't. You can configure zone reclaim
manually through sysctl.

You should not get a lockup in any case, though; that's some kind of
VM bug in 2.6.27. I would recommend reporting that to Novell.

-Andi

^ permalink raw reply [flat|nested] 14+ messages in thread
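The sysctl Andi refers to can be inspected and changed at runtime; a
minimal sketch (whether 0 is the right value depends on how expensive
remote-node access is for the workload):

    # see whether the kernel enabled zone reclaim at boot
    cat /proc/sys/vm/zone_reclaim_mode

    # turn it off for this boot so allocations fall back to the other node
    sysctl -w vm.zone_reclaim_mode=0

    # make it persistent across reboots
    echo "vm.zone_reclaim_mode = 0" >> /etc/sysctl.conf
    sysctl -p
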
* Re: NUMA page allocation from next Node
  2010-10-29  8:49 ` Andi Kleen
@ 2010-10-29  9:16   ` Tharindu Rukshan Bamunuarachchi
  0 siblings, 0 replies; 14+ messages in thread
From: Tharindu Rukshan Bamunuarachchi @ 2010-10-29 9:16 UTC (permalink / raw)
  To: Andi Kleen; +Cc: jiahua, Cliff Wickman, linux-numa

You mean changing node/zone reclaim behaviour through this ...

/proc/sys/vm/zone_reclaim_mode

On Fri, Oct 29, 2010 at 2:19 PM, Andi Kleen <andi@firstfloor.org> wrote:
> On Fri, Oct 29, 2010 at 12:36:47PM +0530, Tharindu Rukshan Bamunuarachchi wrote:
>> What kind of NUMA setting i should look for ? I gather following config values ?
>
> Most likely one of the systems forces zone reclaim by having large
> SLIT values and the other doesn't. You can configure zone reclaim
> manually through sysctl.
>
> You should not get a lockup though in any case, that's some kind of
> VM bug in 2.6.27. I would recommend reporting that to Novell.
>
> -Andi
>

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: NUMA page allocation from next Node
  2010-10-29  2:05 ` Tharindu Rukshan Bamunuarachchi
  [not found]   ` <20101029033058.GB555@www.lurndal.org>
  [not found]   ` <AANLkTikDtKc7RdAWJagqCf7T0JKscfe0Hd0ojc8g7yYo@mail.gmail.com>
@ 2010-10-29 19:52   ` Tim Pepper
  2010-10-29 20:30     ` Lee Schermerhorn
  2010-11-01 14:18     ` Tharindu Rukshan Bamunuarachchi
  2 siblings, 2 replies; 14+ messages in thread
From: Tim Pepper @ 2010-10-29 19:52 UTC (permalink / raw)
  To: Tharindu Rukshan Bamunuarachchi; +Cc: linux-numa

On Fri 29 Oct at 07:35:35 +0530 btharindu@gmail.com said:
> Finally I could isolate the issue further.
> I tried following kernels and hardware.
> Issue is visible only with IBM + SLES 11.
>
> 1. SLES 11 + IBM HW --> Issue is Visible
> 2. SLES 11 + HP, Sun HW --> Issue is not Visible
> 3. 2.6.32 Vanilla + Any HW --> Issue is not Visible
> 4. 2.6.36 Vanilla + Any HW --> Issue is not Visible

It would be interesting to see the output of "numactl --hardware" for each
of these scenarios.

--
Tim Pepper <lnxninja@linux.vnet.ibm.com>
IBM Linux Technology Center

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: NUMA page allocation from next Node
  2010-10-29 19:52 ` Tim Pepper
@ 2010-10-29 20:30   ` Lee Schermerhorn
  2010-11-01 13:55     ` Tharindu Rukshan Bamunuarachchi
  0 siblings, 1 reply; 14+ messages in thread
From: Lee Schermerhorn @ 2010-10-29 20:30 UTC (permalink / raw)
  To: Tim Pepper; +Cc: Tharindu Rukshan Bamunuarachchi, linux-numa

On Fri, 2010-10-29 at 12:52 -0700, Tim Pepper wrote:
> On Fri 29 Oct at 07:35:35 +0530 btharindu@gmail.com said:
> > Finally I could isolate the issue further.
> > I tried following kernels and hardware.
> > Issue is visible only with IBM + SLES 11.
> >
> > 1. SLES 11 + IBM HW --> Issue is Visible
> > 2. SLES 11 + HP, Sun HW --> Issue is not Visible
> > 3. 2.6.32 Vanilla + Any HW --> Issue is not Visible
> > 4. 2.6.36 Vanilla + Any HW --> Issue is not Visible
>
> It would be interesting to see the output of "numactl --hardware" for each
> of these scenarios.
>

Also, if you could add "mminit_loglevel=2" to the boot command line, and
grep for 'zonelist general'. The general zonelists for the Normal zones
will show the order of allocation for the two nodes. On a 2 node [AMD]
platform, I see:

xxx(lts)dmesg | grep 'zonelist general'
mminit::zonelist general 0:DMA = 0:DMA
mminit::zonelist general 0:DMA32 = 0:DMA32 0:DMA
mminit::zonelist general 0:Normal = 0:Normal 0:DMA32 0:DMA 1:Normal
mminit::zonelist general 1:Normal = 1:Normal 0:Normal 0:DMA32 0:DMA

so, node 0 Normal zone allocates from 0:Normal first, as expected, and
then falls back via DMA32, DMA [both on node 0], eventually to node 1
Normal. Node 1 starts locally and falls back to node 0 Normal and,
finally, the DMA zones.

You can also try:

cat /proc/zoneinfo | egrep '^Node|^ pages|^ +present'

and maybe "watch" that [watch(1)] while you run your tests.

And, just to be sure, you could suspend your dd job [^Z] and take a look
at its mempolicy and such via /proc/<pid>/status [Mems_allowed*] and
its /proc/<pid>/numa_maps. If you haven't changed anything you should
see both nodes in Mems_allowed[_list] and all of the policies in the
numa_maps should show 'default'.

Andi already mentioned zone_reclaim_mode. You'll want that set to '0'
if you want allocations to overflow/fall back to off-node without
attempting direct reclaim first. E.g., set vm.zone_reclaim_mode = 0 in
your /etc/sysctl.conf and reload via 'sysctl -p' if you want it to
stick.

Regards,
Lee

^ permalink raw reply [flat|nested] 14+ messages in thread
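Putting Lee's suggestions together, one possible way to run the whole
experiment end to end looks roughly like this (an illustrative sketch
only; the tmpfs mount point, file size, and core number are assumptions
about the test setup, not values from the thread):

    # boot once with mminit_loglevel=2 on the kernel command line,
    # then check the fallback order the kernel actually built
    dmesg | grep 'zonelist general'

    # watch per-node free/present pages while the test runs
    watch -n1 "egrep '^Node|^  pages|^ +present' /proc/zoneinfo"

    # in another shell: fill tmpfs from a core on node 1
    taskset -c 6 dd if=/dev/zero of=/mnt/tmpfs/fill bs=1M count=40000
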
* Re: NUMA page allocation from next Node
  2010-10-29 20:30 ` Lee Schermerhorn
@ 2010-11-01 13:55   ` Tharindu Rukshan Bamunuarachchi
  0 siblings, 0 replies; 14+ messages in thread
From: Tharindu Rukshan Bamunuarachchi @ 2010-11-01 13:55 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: Tim Pepper, linux-numa

Dear All,

On Sat, Oct 30, 2010 at 2:00 AM, Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
>
> Also, if you could add "mminit_loglevel=2" to the boot command line, and
> grep for 'zonelist general'. The general zonelists for the Normal zones
> will show the order of allocation for the two nodes. On a 2 node [AMD]
> platform, I see:
>

Output with "mminit_loglevel=2" ...

mminit::zonelist general 0:DMA = 0:DMA
mminit::zonelist general 0:DMA32 = 0:DMA32 0:DMA
mminit::zonelist general 0:Normal = 0:Normal 1:Normal 0:DMA32 0:DMA
mminit::zonelist general 1:Normal = 1:Normal 0:Normal 0:DMA32 0:DMA

>
> And, just to be sure, you could suspend your dd job [^Z] and take a look
> at its mempolicy and such via /proc/<pid>/status [Mems_allowed*] and
> its /proc/<pid>/numa_maps. If you haven't changed anything you should
> see both nodes in Mems_allowed[_list] and all of the policies in the
> numa_maps should show 'default'.
>

/proc/<PID>/numa_maps shows "default". Both nodes are shown in "Mems_allowed*":

Mems_allowed: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000003
Mems_allowed_list: 0-1

> Andi already mentioned zone_reclaim_mode. You'll want that set to '0'
> if you want allocations to overflow/fallback to off-node without
> attempting direct reclaim first. E.g., set vm.zone_reclaim_mode = 0 in
> your /etc/sysctl.conf and reload via 'sysctl -p' if you want it to
> stick.
>

I set zone_reclaim_mode to zero and it is working fine. :-)
"dd" can now allocate the remaining memory from the other node.

BTW, I have tried several vanilla kernels and the issue is not visible
after 2.6.31. Is there any way to identify the patch that fixed this in
the 2.6.31+ trees?

Thanks a lot for your support.
Tharindu.

^ permalink raw reply [flat|nested] 14+ messages in thread
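For the question about locating the change, the usual approach is to
search or bisect the kernel history between the two versions. A rough
sketch, assuming a local clone of the mainline tree and that the
behaviour can be reproduced on each test build (the file list and grep
pattern are only guesses at where such a change would live):

    # look for likely candidates first
    git log --oneline v2.6.27..v2.6.31 -- mm/page_alloc.c mm/vmscan.c | grep -i reclaim

    # or bisect; the behaviour *improves* between the tags, so treat the
    # "fixed" kernel as bad and the "broken" one as good to find the
    # commit that changed it
    git bisect start v2.6.31 v2.6.27
    # build and boot each kernel git checks out, rerun the dd test, then:
    git bisect bad     # if the fallback to node 0 now works
    git bisect good    # if node 1 still gets exhausted with no fallback
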
* Re: NUMA page allocation from next Node
  2010-10-29 19:52 ` Tim Pepper
  2010-10-29 20:30   ` Lee Schermerhorn
@ 2010-11-01 14:18   ` Tharindu Rukshan Bamunuarachchi
  2010-11-01 14:59     ` Lee Schermerhorn
  2010-11-01 17:59     ` Andi Kleen
  1 sibling, 2 replies; 14+ messages in thread
From: Tharindu Rukshan Bamunuarachchi @ 2010-11-01 14:18 UTC (permalink / raw)
  To: Tim Pepper; +Cc: linux-numa

Tim,

I found that the default value of "zone_reclaim_mode" is zero on the HP
machine, but it is one on the IBM machine.
Why is it set to 1 or 0 on different hardware ?

On Sat, Oct 30, 2010 at 1:22 AM, Tim Pepper <lnxninja@linux.vnet.ibm.com> wrote:
>
> It would be interesting to see the output of "numactl --hardware" for each
> of these scenarios.
>

1. SLES 11 + IBM HW
After consuming all memory on node 1, it shows the following:

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5
node 0 size: 24564 MB
node 0 free: 23025 MB
node 1 cpus: 6 7 8 9 10 11
node 1 size: 24576 MB
node 1 free: 16 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

2. SLES 11 + HP HW

available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6
node 0 size: 24565 MB
node 0 free: 19929 MB
node 1 cpus: 1 3 5 7
node 1 size: 24575 MB
node 1 free: 3043 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6
node 0 size: 24565 MB
node 0 free: 19912 MB
node 1 cpus: 1 3 5 7
node 1 size: 24575 MB
node 1 free: 335 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6
node 0 size: 24565 MB
node 0 free: 17066 MB
node 1 cpus: 1 3 5 7
node 1 size: 24575 MB
node 1 free: 16 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6
node 0 size: 24565 MB
node 0 free: 10468 MB
node 1 cpus: 1 3 5 7
node 1 size: 24575 MB
node 1 free: 16 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

> --
> Tim Pepper <lnxninja@linux.vnet.ibm.com>
> IBM Linux Technology Center
>

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: NUMA page allocation from next Node
  2010-11-01 14:18 ` Tharindu Rukshan Bamunuarachchi
@ 2010-11-01 14:59   ` Lee Schermerhorn
  2010-11-01 18:00     ` Andi Kleen
  0 siblings, 1 reply; 14+ messages in thread
From: Lee Schermerhorn @ 2010-11-01 14:59 UTC (permalink / raw)
  To: Tharindu Rukshan Bamunuarachchi; +Cc: Tim Pepper, linux-numa

On Mon, 2010-11-01 at 19:48 +0530, Tharindu Rukshan Bamunuarachchi wrote:
> Tim,
>
> I found that default value for "zone_reclaim_mode" is zero in HP
> machine. But It is one in IBM.
> Why does it set 1 or 0 in different hardware

Because the SLIT on the IBM platform has distances > 20. Looks like IBM
is populating the SLIT on those platforms with "real" values. The HP
bios is not supplying a slit, letting the remote distances default to
20. That is the threshold for setting zone_reclaim_mode. A patch was
submitted recently to bump the threshold to ~30. Now that vendors are
starting to populate the SLIT with values > 20, we've begun to see the
behavior that you experienced.

Regards,
Lee

>
> On Sat, Oct 30, 2010 at 1:22 AM, Tim Pepper <lnxninja@linux.vnet.ibm.com> wrote:
> >
> > It would be interesting to see the output of "numactl --hardware" for each
> > of these scenarios.
> >
>
> 1. SLES11 + IBM HW
> After consuming all memory in node1, it shows following ...
>
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 4 5
> node 0 size: 24564 MB
> node 0 free: 23025 MB
> node 1 cpus: 6 7 8 9 10 11
> node 1 size: 24576 MB
> node 1 free: 16 MB
> node distances:
> node   0   1
>   0:  10  21
>   1:  21  10
>
>
> 2. SLES 11 + HP
>
> available: 2 nodes (0-1)
> node 0 cpus: 0 2 4 6
> node 0 size: 24565 MB
> node 0 free: 19929 MB
> node 1 cpus: 1 3 5 7
> node 1 size: 24575 MB
> node 1 free: 3043 MB
> node distances:
> node   0   1
>   0:  10  20
>   1:  20  10
>
> available: 2 nodes (0-1)
> node 0 cpus: 0 2 4 6
> node 0 size: 24565 MB
> node 0 free: 19912 MB
> node 1 cpus: 1 3 5 7
> node 1 size: 24575 MB
> node 1 free: 335 MB
> node distances:
> node   0   1
>   0:  10  20
>   1:  20  10
>
> available: 2 nodes (0-1)
> node 0 cpus: 0 2 4 6
> node 0 size: 24565 MB
> node 0 free: 17066 MB
> node 1 cpus: 1 3 5 7
> node 1 size: 24575 MB
> node 1 free: 16 MB
> node distances:
> node   0   1
>   0:  10  20
>   1:  20  10
>
> available: 2 nodes (0-1)
> node 0 cpus: 0 2 4 6
> node 0 size: 24565 MB
> node 0 free: 10468 MB
> node 1 cpus: 1 3 5 7
> node 1 size: 24575 MB
> node 1 free: 16 MB
> node distances:
> node   0   1
>   0:  10  20
>   1:  20  10
>
>
> > --
> > Tim Pepper <lnxninja@linux.vnet.ibm.com>
> > IBM Linux Technology Center
> >
> --
> To unsubscribe from this list: send the line "unsubscribe linux-numa" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply [flat|nested] 14+ messages in thread
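Lee's explanation can be verified on a running system, since the kernel
exports the ACPI SLIT distances through sysfs. A small illustrative
check (node numbers assume a two-node box; the exact distance threshold
that flips zone_reclaim_mode on depends on the kernel version):

    # SLIT distances as the kernel sees them (first value is the local node)
    cat /sys/devices/system/node/node0/distance
    cat /sys/devices/system/node/node1/distance

    # what the allocator decided at boot based on those distances
    cat /proc/sys/vm/zone_reclaim_mode

    # numactl prints the same matrix under "node distances:"
    numactl --hardware | sed -n '/node distances/,$p'
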
* Re: NUMA page allocation from next Node
  2010-11-01 14:59 ` Lee Schermerhorn
@ 2010-11-01 18:00   ` Andi Kleen
  2010-11-02  0:49     ` Tharindu Rukshan Bamunuarachchi
  0 siblings, 1 reply; 14+ messages in thread
From: Andi Kleen @ 2010-11-01 18:00 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: Tharindu Rukshan Bamunuarachchi, Tim Pepper, linux-numa

> Because the SLIT on the IBM platform has distances > 20. Looks like IBM
> is populating the SLIT on those platforms with "real" values. The HP
> bios is not supplying a slit, letting the remote distances default to
> 20. That is the threshold for setting zone_reclaim_mode. A patch was
> submitted recently to bump the threshold to ~30. Now that vendors are
> starting to populate the SLIT with values > 20, we've begun to see the
> behavior that you experienced.

I think it's intentional by the vendors: they use it as a way to make
Linux behave like they want.

-Andi

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: NUMA page allocation from next Node
  2010-11-01 18:00 ` Andi Kleen
@ 2010-11-02  0:49   ` Tharindu Rukshan Bamunuarachchi
  0 siblings, 0 replies; 14+ messages in thread
From: Tharindu Rukshan Bamunuarachchi @ 2010-11-02 0:49 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Lee Schermerhorn, Tim Pepper, linux-numa

Andi/Lee/Tim/Scott/Cliff/Jiahua,

Thanks a lot for your valuable input and advice.
__
Tharindu R Bamunuarachchi.

On Mon, Nov 1, 2010 at 11:30 PM, Andi Kleen <andi@firstfloor.org> wrote:
>> Because the SLIT on the IBM platform has distances > 20. Looks like IBM
>> is populating the SLIT on those platforms with "real" values. The HP
>> bios is not supplying a slit, letting the remote distances default to
>> 20. That is the threshold for setting zone_reclaim_mode. A patch was
>> submitted recently to bump the threshold to ~30. Now that vendors are
>> starting to populate the SLIT with values > 20, we've begun to see the
>> behavior that you experienced.
>
> I think it's intentional by the vendors: they use it a way to make
> Linux behave like they want.
>
> -Andi
>

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: NUMA page allocation from next Node
  2010-11-01 14:18 ` Tharindu Rukshan Bamunuarachchi
  2010-11-01 14:59   ` Lee Schermerhorn
@ 2010-11-01 17:59   ` Andi Kleen
  1 sibling, 0 replies; 14+ messages in thread
From: Andi Kleen @ 2010-11-01 17:59 UTC (permalink / raw)
  To: Tharindu Rukshan Bamunuarachchi; +Cc: Tim Pepper, linux-numa

On Mon, Nov 01, 2010 at 07:48:08PM +0530, Tharindu Rukshan Bamunuarachchi wrote:
> Tim,
>
> I found that default value for "zone_reclaim_mode" is zero in HP
> machine. But It is one in IBM.
> Why does it set 1 or 0 in different hardware ?

See my earlier mail: it depends on the SLIT.

-Andi

^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, newest message: 2010-11-02 0:49 UTC

Thread overview: 14+ messages
2010-10-26 16:27 NUMA page allocation from next Node Tharindu Rukshan Bamunuarachchi
[not found] ` <20101027213652.GA12345@sgi.com>
2010-10-29 2:05 ` Tharindu Rukshan Bamunuarachchi
[not found] ` <20101029033058.GB555@www.lurndal.org>
2010-10-29 6:58 ` Tharindu Rukshan Bamunuarachchi
[not found] ` <AANLkTikDtKc7RdAWJagqCf7T0JKscfe0Hd0ojc8g7yYo@mail.gmail.com>
2010-10-29 7:06 ` Tharindu Rukshan Bamunuarachchi
2010-10-29 8:49 ` Andi Kleen
2010-10-29 9:16 ` Tharindu Rukshan Bamunuarachchi
2010-10-29 19:52 ` Tim Pepper
2010-10-29 20:30 ` Lee Schermerhorn
2010-11-01 13:55 ` Tharindu Rukshan Bamunuarachchi
2010-11-01 14:18 ` Tharindu Rukshan Bamunuarachchi
2010-11-01 14:59 ` Lee Schermerhorn
2010-11-01 18:00 ` Andi Kleen
2010-11-02 0:49 ` Tharindu Rukshan Bamunuarachchi
2010-11-01 17:59 ` Andi Kleen