* [Bugfix] sched: fix possible invalid memory access caused by CPU hot-addition
@ 2014-04-22 5:27 Jiang Liu
2014-04-22 8:15 ` Peter Zijlstra
0 siblings, 1 reply; 16+ messages in thread
From: Jiang Liu @ 2014-04-22 5:27 UTC (permalink / raw)
To: Andrew Morton, Ingo Molnar, Peter Zijlstra, Ingo Molnar
Cc: Jiang Liu, Rafael J . Wysocki, Tony Luck, linux-kernel
When calling kzalloc_node(size, flags, node), we should first check
whether node is onlined, otherwise it may cause invalid memory access
as below.
[ 3663.324476] BUG: unable to handle kernel paging request at 0000000000001f08
[ 3663.332348] IP: [<ffffffff81172219>] __alloc_pages_nodemask+0xb9/0x2d0
[ 3663.339719] PGD 82fe10067 PUD 82ebef067 PMD 0
[ 3663.344773] Oops: 0000 [#1] SMP
[ 3663.348455] Modules linked in: shpchp gpio_ich x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd microcode joydev sb_edac edac_core lpc_ich ipmi_si tpm_tis ipmi_msghandler ioatdma wmi acpi_pad mac_hid lp parport ixgbe isci mpt2sas dca ahci ptp libsas libahci raid_class pps_core scsi_transport_sas mdio hid_generic usbhid hid
[ 3663.394393] CPU: 61 PID: 2416 Comm: cron Tainted: G W 3.14.0-rc5+ #21
[ 3663.402643] Hardware name: Intel Corporation BRICKLAND/BRICKLAND, BIOS BRIVTIN1.86B.0047.F03.1403031049 03/03/2014
[ 3663.414299] task: ffff88082fe54b00 ti: ffff880845fba000 task.ti: ffff880845fba000
[ 3663.422741] RIP: 0010:[<ffffffff81172219>] [<ffffffff81172219>] __alloc_pages_nodemask+0xb9/0x2d0
[ 3663.432857] RSP: 0018:ffff880845fbbcd0 EFLAGS: 00010246
[ 3663.439265] RAX: 0000000000001f00 RBX: 0000000000000000 RCX: 0000000000000000
[ 3663.447291] RDX: 0000000000000000 RSI: 0000000000000a8d RDI: ffffffff81a8d950
[ 3663.455318] RBP: ffff880845fbbd58 R08: ffff880823293400 R09: 0000000000000001
[ 3663.463345] R10: 0000000000000001 R11: 0000000000000000 R12: 00000000002052d0
[ 3663.471363] R13: ffff880854c07600 R14: 0000000000000002 R15: 0000000000000000
[ 3663.479389] FS: 00007f2e8b99e800(0000) GS:ffff88105a400000(0000) knlGS:0000000000000000
[ 3663.488514] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3663.495018] CR2: 0000000000001f08 CR3: 00000008237b1000 CR4: 00000000001407e0
[ 3663.503476] Stack:
[ 3663.505757] ffffffff811bd74d ffff880854c01d98 ffff880854c01df0 ffff880854c01dd0
[ 3663.514167] 00000003208ca420 000000075a5d84d0 ffff88082fe54b00 ffffffff811bb35f
[ 3663.522567] ffff880854c07600 0000000000000003 0000000000001f00 ffff880845fbbd48
[ 3663.530976] Call Trace:
[ 3663.533753] [<ffffffff811bd74d>] ? deactivate_slab+0x41d/0x4f0
[ 3663.540421] [<ffffffff811bb35f>] ? new_slab+0x3f/0x2d0
[ 3663.546307] [<ffffffff811bb3c5>] new_slab+0xa5/0x2d0
[ 3663.552001] [<ffffffff81768c97>] __slab_alloc+0x35d/0x54a
[ 3663.558185] [<ffffffff810a4845>] ? local_clock+0x25/0x30
[ 3663.564686] [<ffffffff8177a34c>] ? __do_page_fault+0x4ec/0x5e0
[ 3663.571356] [<ffffffff810b0054>] ? alloc_fair_sched_group+0xc4/0x190
[ 3663.578609] [<ffffffff810c77f1>] ? __raw_spin_lock_init+0x21/0x60
[ 3663.585570] [<ffffffff811be476>] kmem_cache_alloc_node_trace+0xa6/0x1d0
[ 3663.593112] [<ffffffff810b0054>] ? alloc_fair_sched_group+0xc4/0x190
[ 3663.600363] [<ffffffff810b0054>] alloc_fair_sched_group+0xc4/0x190
[ 3663.607423] [<ffffffff810a359f>] sched_create_group+0x3f/0x80
[ 3663.613994] [<ffffffff810b611f>] sched_autogroup_create_attach+0x3f/0x1b0
[ 3663.621732] [<ffffffff8108258a>] sys_setsid+0xea/0x110
[ 3663.628020] [<ffffffff8177f42d>] system_call_fastpath+0x1a/0x1f
[ 3663.634780] Code: 00 44 89 e7 e8 b9 f8 f4 ff 41 f6 c4 10 74 18 31 d2 be 8d 0a 00 00 48 c7 c7 50 d9 a8 81 e8 70 6a f2 ff e8 db dd 5f 00 48 8b 45 c8 <48> 83 78 08 00 0f 84 b5 01 00 00 48 83 c0 08 44 89 75 c0 4d 89
[ 3663.657032] RIP [<ffffffff81172219>] __alloc_pages_nodemask+0xb9/0x2d0
[ 3663.664491] RSP <ffff880845fbbcd0>
[ 3663.668429] CR2: 0000000000001f08
[ 3663.672659] ---[ end trace df13f08ed9de18ad ]---
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
---
kernel/sched/fair.c | 12 +++++++-----
kernel/sched/rt.c | 11 +++++++----
2 files changed, 14 insertions(+), 9 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7e9bd0b1fa9e..1c093d213b0f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7485,7 +7485,7 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
{
struct cfs_rq *cfs_rq;
struct sched_entity *se;
- int i;
+ int i, nid;
tg->cfs_rq = kzalloc(sizeof(cfs_rq) * nr_cpu_ids, GFP_KERNEL);
if (!tg->cfs_rq)
@@ -7499,13 +7499,15 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
init_cfs_bandwidth(tg_cfs_bandwidth(tg));
for_each_possible_cpu(i) {
- cfs_rq = kzalloc_node(sizeof(struct cfs_rq),
- GFP_KERNEL, cpu_to_node(i));
+ nid = cpu_to_node(i);
+ if (nid != NUMA_NO_NODE && !node_online(nid))
+ nid = NUMA_NO_NODE;
+
+ cfs_rq = kzalloc_node(sizeof(struct cfs_rq), GFP_KERNEL, nid);
if (!cfs_rq)
goto err;
- se = kzalloc_node(sizeof(struct sched_entity),
- GFP_KERNEL, cpu_to_node(i));
+ se = kzalloc_node(sizeof(struct sched_entity), GFP_KERNEL, nid);
if (!se)
goto err_free_rq;
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index d8cdf1618551..5a5a32e1115c 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -161,7 +161,7 @@ int alloc_rt_sched_group(struct task_group *tg, struct task_group *parent)
{
struct rt_rq *rt_rq;
struct sched_rt_entity *rt_se;
- int i;
+ int i, nid;
tg->rt_rq = kzalloc(sizeof(rt_rq) * nr_cpu_ids, GFP_KERNEL);
if (!tg->rt_rq)
@@ -174,13 +174,16 @@ int alloc_rt_sched_group(struct task_group *tg, struct task_group *parent)
ktime_to_ns(def_rt_bandwidth.rt_period), 0);
for_each_possible_cpu(i) {
- rt_rq = kzalloc_node(sizeof(struct rt_rq),
- GFP_KERNEL, cpu_to_node(i));
+ nid = cpu_to_node(i);
+ if (nid != NUMA_NO_NODE && !node_online(nid))
+ nid = NUMA_NO_NODE;
+
+ rt_rq = kzalloc_node(sizeof(struct rt_rq), GFP_KERNEL, nid);
if (!rt_rq)
goto err;
rt_se = kzalloc_node(sizeof(struct sched_rt_entity),
- GFP_KERNEL, cpu_to_node(i));
+ GFP_KERNEL, nid);
if (!rt_se)
goto err_free_rq;
--
1.7.10.4
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [Bugfix] sched: fix possible invalid memory access caused by CPU hot-addition
2014-04-22 5:27 [Bugfix] sched: fix possible invalid memory access caused by CPU hot-addition Jiang Liu
@ 2014-04-22 8:15 ` Peter Zijlstra
2014-04-22 20:01 ` Andrew Morton
2014-04-23 2:45 ` Jiang Liu
0 siblings, 2 replies; 16+ messages in thread
From: Peter Zijlstra @ 2014-04-22 8:15 UTC (permalink / raw)
To: Jiang Liu
Cc: Andrew Morton, Ingo Molnar, Ingo Molnar, Rafael J . Wysocki,
Tony Luck, linux-kernel
On Tue, Apr 22, 2014 at 01:27:15PM +0800, Jiang Liu wrote:
> When calling kzalloc_node(size, flags, node), we should first check
> whether node is onlined, otherwise it may cause invalid memory access
> as below.
But this is only for memory less node crap, right?
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Bugfix] sched: fix possible invalid memory access caused by CPU hot-addition
2014-04-22 8:15 ` Peter Zijlstra
@ 2014-04-22 20:01 ` Andrew Morton
2014-04-22 20:04 ` Peter Zijlstra
2014-04-23 2:45 ` Jiang Liu
1 sibling, 1 reply; 16+ messages in thread
From: Andrew Morton @ 2014-04-22 20:01 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Jiang Liu, Ingo Molnar, Ingo Molnar, Rafael J . Wysocki,
Tony Luck, linux-kernel
On Tue, 22 Apr 2014 10:15:15 +0200 Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Apr 22, 2014 at 01:27:15PM +0800, Jiang Liu wrote:
> > When calling kzalloc_node(size, flags, node), we should first check
> > whether node is onlined, otherwise it may cause invalid memory access
> > as below.
>
> But this is only for memory less node crap, right?
um, why are memoryless nodes crap?
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Bugfix] sched: fix possible invalid memory access caused by CPU hot-addition
2014-04-22 20:01 ` Andrew Morton
@ 2014-04-22 20:04 ` Peter Zijlstra
2014-04-23 1:59 ` David Rientjes
2014-04-23 2:05 ` Mike Galbraith
0 siblings, 2 replies; 16+ messages in thread
From: Peter Zijlstra @ 2014-04-22 20:04 UTC (permalink / raw)
To: Andrew Morton
Cc: Jiang Liu, Ingo Molnar, Ingo Molnar, Rafael J . Wysocki,
Tony Luck, linux-kernel
On Tue, Apr 22, 2014 at 01:01:51PM -0700, Andrew Morton wrote:
> On Tue, 22 Apr 2014 10:15:15 +0200 Peter Zijlstra <peterz@infradead.org> wrote:
>
> > On Tue, Apr 22, 2014 at 01:27:15PM +0800, Jiang Liu wrote:
> > > When calling kzalloc_node(size, flags, node), we should first check
> > > whether node is onlined, otherwise it may cause invalid memory access
> > > as below.
> >
> > But this is only for memory less node crap, right?
>
> um, why are memoryless nodes crap?
Why wouldn't they be? Having CPUs with no local memory seems decidedly
suboptimal.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Bugfix] sched: fix possible invalid memory access caused by CPU hot-addition
2014-04-22 20:04 ` Peter Zijlstra
@ 2014-04-23 1:59 ` David Rientjes
2014-04-23 4:53 ` Jiang Liu
2014-04-23 2:05 ` Mike Galbraith
1 sibling, 1 reply; 16+ messages in thread
From: David Rientjes @ 2014-04-23 1:59 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Andrew Morton, Jiang Liu, Ingo Molnar, Ingo Molnar,
Rafael J . Wysocki, Tony Luck, linux-kernel
On Tue, 22 Apr 2014, Peter Zijlstra wrote:
> On Tue, Apr 22, 2014 at 01:01:51PM -0700, Andrew Morton wrote:
> > On Tue, 22 Apr 2014 10:15:15 +0200 Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > > On Tue, Apr 22, 2014 at 01:27:15PM +0800, Jiang Liu wrote:
> > > > When calling kzalloc_node(size, flags, node), we should first check
> > > > whether node is onlined, otherwise it may cause invalid memory access
> > > > as below.
> > >
> > > But this is only for memory less node crap, right?
> >
> > um, why are memoryless nodes crap?
>
> Why wouldn't they be? Having CPUs with no local memory seems decidedly
> suboptimal.
The quick fix for memoryless node issues is usually just do cpu_to_mem()
rather than cpu_to_node() in the caller. This assumes that the arch is
setup correctly to handle memoryless nodes with
CONFIG_HAVE_MEMORYLESS_NODES (and we've had problems recently with
memoryless nodes not being configured correctly on powerpc).
That type of a fix would probably be better handled in the slab allocator,
though, since kmalloc_node(nid) shouldn't crash just because nid is
memoryless, we should be doing local_memory_node(node) when allocating the
slab pages.
However, I don't think memoryless nodes are the problem here since Jiang
is testing for !node_online(nid) in his patch, so it's a problem with
cpu_to_node() pointing to an offline node. It makes sense for the page
allocator to crash in such a case, the node id is erroneous.
So either the cpu-to-node mapping is invalid or alloc_fair_sched_group()
is allocating memory for a cpu on an offline node. The
for_each_possible_cpu() looks suspicious. There's no guarantee that
local_memory_node(node) for an offline node will return anything with
affinity, so falling back to NUMA_NO_NODE looks appropriate in Jiang's
patch.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Bugfix] sched: fix possible invalid memory access caused by CPU hot-addition
2014-04-22 20:04 ` Peter Zijlstra
2014-04-23 1:59 ` David Rientjes
@ 2014-04-23 2:05 ` Mike Galbraith
1 sibling, 0 replies; 16+ messages in thread
From: Mike Galbraith @ 2014-04-23 2:05 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Andrew Morton, Jiang Liu, Ingo Molnar, Ingo Molnar,
Rafael J . Wysocki, Tony Luck, linux-kernel
On Tue, 2014-04-22 at 22:04 +0200, Peter Zijlstra wrote:
> On Tue, Apr 22, 2014 at 01:01:51PM -0700, Andrew Morton wrote:
> > On Tue, 22 Apr 2014 10:15:15 +0200 Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > > On Tue, Apr 22, 2014 at 01:27:15PM +0800, Jiang Liu wrote:
> > > > When calling kzalloc_node(size, flags, node), we should first check
> > > > whether node is onlined, otherwise it may cause invalid memory access
> > > > as below.
> > >
> > > But this is only for memory less node crap, right?
> >
> > um, why are memoryless nodes crap?
>
> Why wouldn't they be? Having CPUs with no local memory seems decidedly
> suboptimal.
This ain't exactly wonderful either, makes CPU domains to crawl over.
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179
node 0 size: 31723 MB
node 0 free: 27949 MB
node 1 cpus: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149
node 1 size: 32308 MB
node 1 free: 31033 MB
node 2 cpus:
node 2 size: 32768 MB
node 2 free: 16631 MB
node 3 cpus:
node 3 size: 32768 MB
node 3 free: 32640 MB
node 4 cpus:
node 4 size: 32768 MB
node 4 free: 32640 MB
node 5 cpus:
node 5 size: 32768 MB
node 5 free: 32638 MB
node distances:
node 0 1 2 3 4 5
0: 10 12 12 15 12 15
1: 12 10 15 12 15 15
2: 12 15 10 12 15 15
3: 15 12 12 10 15 12
4: 12 15 15 15 10 12
5: 15 15 15 12 12 10
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Bugfix] sched: fix possible invalid memory access caused by CPU hot-addition
2014-04-22 8:15 ` Peter Zijlstra
2014-04-22 20:01 ` Andrew Morton
@ 2014-04-23 2:45 ` Jiang Liu
2014-04-23 5:32 ` Peter Zijlstra
1 sibling, 1 reply; 16+ messages in thread
From: Jiang Liu @ 2014-04-23 2:45 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Andrew Morton, Ingo Molnar, Ingo Molnar, Rafael J . Wysocki,
Tony Luck, linux-kernel
Hi Peter,
It's not for memoryless node, but to solve a race window
in CPU hot-addition. The related CPU hot-addition flow is:
1) Handle CPU hot-addition event
1.a) gather platform specific information
1.b) associate hot-added CPU with a node
1.c) create CPU device
2) User online hot-added CPUs through sysfs:
2.a) cpu_up()
2.b) ->try_online_node()
2.c) ->hotadd_new_pgdat()
2.d) ->node_set_online()
So between 1.b and 2.c, kmalloc_node(nid) may cause invalid
memory access without the node_online(nid) check.
Best Regards!
Gerry
On 2014/4/22 16:15, Peter Zijlstra wrote:
> On Tue, Apr 22, 2014 at 01:27:15PM +0800, Jiang Liu wrote:
>> When calling kzalloc_node(size, flags, node), we should first check
>> whether node is onlined, otherwise it may cause invalid memory access
>> as below.
>
> But this is only for memory less node crap, right?
>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Bugfix] sched: fix possible invalid memory access caused by CPU hot-addition
2014-04-23 1:59 ` David Rientjes
@ 2014-04-23 4:53 ` Jiang Liu
0 siblings, 0 replies; 16+ messages in thread
From: Jiang Liu @ 2014-04-23 4:53 UTC (permalink / raw)
To: David Rientjes, Peter Zijlstra
Cc: Andrew Morton, Ingo Molnar, Ingo Molnar, Rafael J . Wysocki,
Tony Luck, linux-kernel
On 2014/4/23 9:59, David Rientjes wrote:
> On Tue, 22 Apr 2014, Peter Zijlstra wrote:
>
>> On Tue, Apr 22, 2014 at 01:01:51PM -0700, Andrew Morton wrote:
>>> On Tue, 22 Apr 2014 10:15:15 +0200 Peter Zijlstra <peterz@infradead.org> wrote:
>>>
>>>> On Tue, Apr 22, 2014 at 01:27:15PM +0800, Jiang Liu wrote:
>>>>> When calling kzalloc_node(size, flags, node), we should first check
>>>>> whether node is onlined, otherwise it may cause invalid memory access
>>>>> as below.
>>>>
>>>> But this is only for memory less node crap, right?
>>>
>>> um, why are memoryless nodes crap?
>>
>> Why wouldn't they be? Having CPUs with no local memory seems decidedly
>> suboptimal.
>
> The quick fix for memoryless node issues is usually just do cpu_to_mem()
> rather than cpu_to_node() in the caller. This assumes that the arch is
> setup correctly to handle memoryless nodes with
> CONFIG_HAVE_MEMORYLESS_NODES (and we've had problems recently with
> memoryless nodes not being configured correctly on powerpc).
>
> That type of a fix would probably be better handled in the slab allocator,
> though, since kmalloc_node(nid) shouldn't crash just because nid is
> memoryless, we should be doing local_memory_node(node) when allocating the
> slab pages.
>
> However, I don't think memoryless nodes are the problem here since Jiang
> is testing for !node_online(nid) in his patch, so it's a problem with
> cpu_to_node() pointing to an offline node. It makes sense for the page
> allocator to crash in such a case, the node id is erroneous.
>
> So either the cpu-to-node mapping is invalid or alloc_fair_sched_group()
> is allocating memory for a cpu on an offline node. The
> for_each_possible_cpu() looks suspicious. There's no guarantee that
> local_memory_node(node) for an offline node will return anything with
> affinity, so falling back to NUMA_NO_NODE looks appropriate in Jiang's
> patch.
Hi David,
That's the case, alloc_fair_sched_group() is trying to allocate
memory for CPU in offline node, which then access non-exist NODE_DATA.
Thanks!
Gerry
>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Bugfix] sched: fix possible invalid memory access caused by CPU hot-addition
2014-04-23 2:45 ` Jiang Liu
@ 2014-04-23 5:32 ` Peter Zijlstra
2014-04-23 5:45 ` Jiang Liu
2014-04-23 5:51 ` Peter Zijlstra
0 siblings, 2 replies; 16+ messages in thread
From: Peter Zijlstra @ 2014-04-23 5:32 UTC (permalink / raw)
To: Jiang Liu
Cc: Andrew Morton, Ingo Molnar, Ingo Molnar, Rafael J . Wysocki,
Tony Luck, linux-kernel
On Wed, Apr 23, 2014 at 10:45:13AM +0800, Jiang Liu wrote:
> Hi Peter,
> It's not for memoryless node, but to solve a race window
> in CPU hot-addition. The related CPU hot-addition flow is:
> 1) Handle CPU hot-addition event
> 1.a) gather platform specific information
> 1.b) associate hot-added CPU with a node
> 1.c) create CPU device
> 2) User online hot-added CPUs through sysfs:
> 2.a) cpu_up()
> 2.b) ->try_online_node()
> 2.c) ->hotadd_new_pgdat()
> 2.d) ->node_set_online()
>
> So between 1.b and 2.c, kmalloc_node(nid) may cause invalid
> memory access without the node_online(nid) check.
Any why was all this not in the Changelog?
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Bugfix] sched: fix possible invalid memory access caused by CPU hot-addition
2014-04-23 5:32 ` Peter Zijlstra
@ 2014-04-23 5:45 ` Jiang Liu
2014-04-23 5:51 ` Peter Zijlstra
1 sibling, 0 replies; 16+ messages in thread
From: Jiang Liu @ 2014-04-23 5:45 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Andrew Morton, Ingo Molnar, Ingo Molnar, Rafael J . Wysocki,
Tony Luck, linux-kernel
On 2014/4/23 13:32, Peter Zijlstra wrote:
> On Wed, Apr 23, 2014 at 10:45:13AM +0800, Jiang Liu wrote:
>> Hi Peter,
>> It's not for memoryless node, but to solve a race window
>> in CPU hot-addition. The related CPU hot-addition flow is:
>> 1) Handle CPU hot-addition event
>> 1.a) gather platform specific information
>> 1.b) associate hot-added CPU with a node
>> 1.c) create CPU device
>> 2) User online hot-added CPUs through sysfs:
>> 2.a) cpu_up()
>> 2.b) ->try_online_node()
>> 2.c) ->hotadd_new_pgdat()
>> 2.d) ->node_set_online()
>>
>> So between 1.b and 2.c, kmalloc_node(nid) may cause invalid
>> memory access without the node_online(nid) check.
>
> Any why was all this not in the Changelog?
Sorry, will add above message into changelog.
Thanks!
Gerry
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Bugfix] sched: fix possible invalid memory access caused by CPU hot-addition
2014-04-23 5:32 ` Peter Zijlstra
2014-04-23 5:45 ` Jiang Liu
@ 2014-04-23 5:51 ` Peter Zijlstra
2014-04-23 17:46 ` Luck, Tony
1 sibling, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2014-04-23 5:51 UTC (permalink / raw)
To: Jiang Liu
Cc: Andrew Morton, Ingo Molnar, Ingo Molnar, Rafael J . Wysocki,
Tony Luck, linux-kernel
On Wed, Apr 23, 2014 at 07:32:13AM +0200, Peter Zijlstra wrote:
> On Wed, Apr 23, 2014 at 10:45:13AM +0800, Jiang Liu wrote:
> > Hi Peter,
> > It's not for memoryless node, but to solve a race window
> > in CPU hot-addition. The related CPU hot-addition flow is:
> > 1) Handle CPU hot-addition event
> > 1.a) gather platform specific information
> > 1.b) associate hot-added CPU with a node
> > 1.c) create CPU device
> > 2) User online hot-added CPUs through sysfs:
> > 2.a) cpu_up()
> > 2.b) ->try_online_node()
> > 2.c) ->hotadd_new_pgdat()
> > 2.d) ->node_set_online()
> >
> > So between 1.b and 2.c, kmalloc_node(nid) may cause invalid
> > memory access without the node_online(nid) check.
>
> Any why was all this not in the Changelog?
Also, do explain what kind of hardware you needed to trigger this. This
code has been like this for a good while.
^ permalink raw reply [flat|nested] 16+ messages in thread
* RE: [Bugfix] sched: fix possible invalid memory access caused by CPU hot-addition
2014-04-23 5:51 ` Peter Zijlstra
@ 2014-04-23 17:46 ` Luck, Tony
2014-04-24 2:59 ` Jiang Liu
0 siblings, 1 reply; 16+ messages in thread
From: Luck, Tony @ 2014-04-23 17:46 UTC (permalink / raw)
To: Peter Zijlstra, Jiang Liu
Cc: Andrew Morton, Ingo Molnar, Ingo Molnar, Wysocki, Rafael J,
linux-kernel@vger.kernel.org
> > > 1) Handle CPU hot-addition event
> > > 1.a) gather platform specific information
> > > 1.b) associate hot-added CPU with a node
> > > 1.c) create CPU device
> > > 2) User online hot-added CPUs through sysfs:
> > > 2.a) cpu_up()
> > > 2.b) ->try_online_node()
> > > 2.c) ->hotadd_new_pgdat()
> > > 2.d) ->node_set_online()
> > >
> > > So between 1.b and 2.c, kmalloc_node(nid) may cause invalid
> > > memory access without the node_online(nid) check.
> >
> > Any why was all this not in the Changelog?
>
> Also, do explain what kind of hardware you needed to trigger this. This
> code has been like this for a good while.
With your proposed fix in place the allocations will succeed - but they
will be done from other nodes ... and this cpu will have to do a remote
NUMA access for the rest of time.
It would be better to switch the order above - add the memory first,
then add the cpus. Is that possible?
-Tony
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Bugfix] sched: fix possible invalid memory access caused by CPU hot-addition
2014-04-23 17:46 ` Luck, Tony
@ 2014-04-24 2:59 ` Jiang Liu
2014-04-24 7:47 ` Peter Zijlstra
0 siblings, 1 reply; 16+ messages in thread
From: Jiang Liu @ 2014-04-24 2:59 UTC (permalink / raw)
To: Luck, Tony, Peter Zijlstra
Cc: Andrew Morton, Ingo Molnar, Ingo Molnar, Wysocki, Rafael J,
linux-kernel@vger.kernel.org
On 2014/4/24 1:46, Luck, Tony wrote:
>>>> 1) Handle CPU hot-addition event
>>>> 1.a) gather platform specific information
>>>> 1.b) associate hot-added CPU with a node
>>>> 1.c) create CPU device
>>>> 2) User online hot-added CPUs through sysfs:
>>>> 2.a) cpu_up()
>>>> 2.b) ->try_online_node()
>>>> 2.c) ->hotadd_new_pgdat()
>>>> 2.d) ->node_set_online()
>>>>
>>>> So between 1.b and 2.c, kmalloc_node(nid) may cause invalid
>>>> memory access without the node_online(nid) check.
>>>
>>> Any why was all this not in the Changelog?
>>
>> Also, do explain what kind of hardware you needed to trigger this. This
>> code has been like this for a good while.
>
> With your proposed fix in place the allocations will succeed - but they
> will be done from other nodes ... and this cpu will have to do a remote
> NUMA access for the rest of time.
>
> It would be better to switch the order above - add the memory first,
> then add the cpus. Is that possible?
Hi Tony,
The BIOS always sends CPU hot-addition events before memory
hot-addition events, so it's hard to change the order.
And we couldn't completely solve this performance penalty because the
affected code tries to allocate memory for all possible
CPUs instead of onlined CPUs.
Best Regards!
Gerry
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Bugfix] sched: fix possible invalid memory access caused by CPU hot-addition
2014-04-24 2:59 ` Jiang Liu
@ 2014-04-24 7:47 ` Peter Zijlstra
2014-04-24 17:41 ` Luck, Tony
0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2014-04-24 7:47 UTC (permalink / raw)
To: Jiang Liu
Cc: Luck, Tony, Andrew Morton, Ingo Molnar, Ingo Molnar,
Wysocki, Rafael J, linux-kernel@vger.kernel.org
On Thu, Apr 24, 2014 at 10:59:45AM +0800, Jiang Liu wrote:
> On 2014/4/24 1:46, Luck, Tony wrote:
> >>>> 1) Handle CPU hot-addition event
> >>>> 1.a) gather platform specific information
> >>>> 1.b) associate hot-added CPU with a node
> >>>> 1.c) create CPU device
> >>>> 2) User online hot-added CPUs through sysfs:
> >>>> 2.a) cpu_up()
> >>>> 2.b) ->try_online_node()
> >>>> 2.c) ->hotadd_new_pgdat()
> >>>> 2.d) ->node_set_online()
> >>>>
> >>>> So between 1.b and 2.c, kmalloc_node(nid) may cause invalid
> >>>> memory access without the node_online(nid) check.
> >>>
> >>> Any why was all this not in the Changelog?
> >>
> >> Also, do explain what kind of hardware you needed to trigger this. This
> >> code has been like this for a good while.
> >
> > With your proposed fix in place the allocations will succeed - but they
> > will be done from other nodes ... and this cpu will have to do a remote
> > NUMA access for the rest of time.
> >
> > It would be better to switch the order above - add the memory first,
> > then add the cpus. Is that possible?
> Hi Tony,
> The BIOS always sends CPU hot-addition events before memory
> hot-addition events, so it's hard to change the order.
> And we couldn't completely solve this performance penalty because the
> affected code tries to allocate memory for all possible
> CPUs instead of onlined CPUs.
So the BIOS is fucked, news at 11, one would have hoped Intel would have
_some_ say in it, but alas. So how about instead you force memory online
when you online the first CPU, screw whatever the BIOS does or does not?
^ permalink raw reply [flat|nested] 16+ messages in thread
* RE: [Bugfix] sched: fix possible invalid memory access caused by CPU hot-addition
2014-04-24 7:47 ` Peter Zijlstra
@ 2014-04-24 17:41 ` Luck, Tony
2014-04-24 19:09 ` Peter Zijlstra
0 siblings, 1 reply; 16+ messages in thread
From: Luck, Tony @ 2014-04-24 17:41 UTC (permalink / raw)
To: Peter Zijlstra, Jiang Liu
Cc: Andrew Morton, Ingo Molnar, Ingo Molnar, Wysocki, Rafael J,
linux-kernel@vger.kernel.org
>> The BIOS always sends CPU hot-addition events before memory
>> hot-addition events, so it's hard to change the order.
>> And we couldn't completely solve this performance penalty because the
>> affected code tries to allocate memory for all possible
>> CPUs instead of onlined CPUs.
>
> So the BIOS is fucked, news at 11, one would have hoped Intel would have
> _some_ say in it, but alas. So how about instead you force memory online
> when you online the first CPU, screw whatever the BIOS does or does not?
Certainly an interesting implementation choice by the BIOS. The only logical
order to use to bring components of a modern cpu online is:
1) Memory - so we have a place to allocate structure needed for following steps
2) Cores - so we have a place to direct interrupts from next step
3) I/O
We should log a bug against the BIOS ... but systems are already shipping so we will
have to deal with this.
Either we use your existing patch - and systems with silly BIOS will work, but with a
small NUMA penalty for objects allocated remotely
or ... we implement some crazy queuing scheme ... where we delay bringing cores
online for a while to see whether more things like memory and I/O start showing
up too. We can't wait forever - people sometimes do configure systems with
memory-less nodes.
I think your existing solution is the better choice ... the penalties probably aren't
all that big ... so extensive workarounds for BIOS bugs seem like the wrong direction.
Maybe a one-time printk() so the user knows they have a buggy BIOS might help
provide back pressure to BIOS teams to do this right in the future? But it isn't
a bug for the memory-less node case.
-Tony
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Bugfix] sched: fix possible invalid memory access caused by CPU hot-addition
2014-04-24 17:41 ` Luck, Tony
@ 2014-04-24 19:09 ` Peter Zijlstra
0 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2014-04-24 19:09 UTC (permalink / raw)
To: Luck, Tony
Cc: Jiang Liu, Andrew Morton, Ingo Molnar, Ingo Molnar,
Wysocki, Rafael J, linux-kernel@vger.kernel.org
On Thu, Apr 24, 2014 at 05:41:20PM +0000, Luck, Tony wrote:
> >> The BIOS always sends CPU hot-addition events before memory
> >> hot-addition events, so it's hard to change the order.
> >> And we couldn't completely solve this performance penalty because the
> >> affected code tries to allocate memory for all possible
> >> CPUs instead of onlined CPUs.
> >
> > So the BIOS is fucked, news at 11, one would have hoped Intel would have
> > _some_ say in it, but alas. So how about instead you force memory online
> > when you online the first CPU, screw whatever the BIOS does or does not?
>
> Certainly an interesting implementation choice by the BIOS. The only logical
> order to use to bring components of a modern cpu online is:
>
> 1) Memory - so we have a place to allocate structure needed for following steps
> 2) Cores - so we have a place to direct interrupts from next step
> 3) I/O
>
> We should log a bug against the BIOS ... but systems are already shipping so we will
> have to deal with this.
Someone want to clue me in what systems these are so I can try and stay
the hell away from them?
> Either we use your existing patch - and systems with silly BIOS will work, but with a
> small NUMA penalty for objects allocated remotely
Depending on how this all is constructed, I can imagine the worst case
where we bring up a medium to large system (8+ nodes, non fully
connected etc) and we only have memory for the first node online from
booting. The cpu bringup could be concurrent/fast-enough to not have any
other memory online.
This would result in all cpus having their memory on the first node
(including per-cpu chunks I would imagine), that's entirely retarded.
We should really refuse to bring up CPUs and boot in reduced capacity
for such demented systems.
> or ... we implement some crazy queuing scheme ... where we delay bringing cores
> online for a while to see whether more things like memory and I/O start showing
> up too. We can't wait forever - people sometimes do configure systems with
> memory-less nodes.
Is there no distinction between the cases? I've really no idea how the
BIOS communicates this (and honestly no real desire to know), but it
would be best if we can kludge around this in the arch code and keep it
out of core code.
Did I already say that memory-less nodes are stupid? ;-)
> I think your existing solution is the better choice ... the penalties probably aren't
> all that big ... so extensive workarounds for BIOS bugs seem like the wrong direction.
Why can't we have the architecture code generate a memory add event on
the first cpu up of which there is no memory yet?
^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2014-04-24 19:09 UTC | newest]
Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-04-22 5:27 [Bugfix] sched: fix possible invalid memory access caused by CPU hot-addition Jiang Liu
2014-04-22 8:15 ` Peter Zijlstra
2014-04-22 20:01 ` Andrew Morton
2014-04-22 20:04 ` Peter Zijlstra
2014-04-23 1:59 ` David Rientjes
2014-04-23 4:53 ` Jiang Liu
2014-04-23 2:05 ` Mike Galbraith
2014-04-23 2:45 ` Jiang Liu
2014-04-23 5:32 ` Peter Zijlstra
2014-04-23 5:45 ` Jiang Liu
2014-04-23 5:51 ` Peter Zijlstra
2014-04-23 17:46 ` Luck, Tony
2014-04-24 2:59 ` Jiang Liu
2014-04-24 7:47 ` Peter Zijlstra
2014-04-24 17:41 ` Luck, Tony
2014-04-24 19:09 ` Peter Zijlstra
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox