All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2] sched/fair: Fix dereference NULL sched domain during select_idle_sibling
@ 2016-10-09  0:04 Wanpeng Li
  2016-10-10 10:25 ` Catalin Marinas
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Wanpeng Li @ 2016-10-09  0:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: Wanpeng Li, Ingo Molnar, Mike Galbraith, Peter Zijlstra,
	Thomas Gleixner

From: Wanpeng Li <wanpeng.li@hotmail.com>

Commit:

  10e2f1acd01 ("sched/core: Rewrite and improve select_idle_siblings()")

... improved select_idle_sibling() but also triggered a regression:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000078
IP: [<ffffffffb10cd332>] select_idle_sibling+0x1c2/0x4f0
PGD 0 
Oops: 0000 [#1] SMP
CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.8.0+ #16
RIP: 0010:[<ffffffffb10cd332>]  [<ffffffffb10cd332>] select_idle_sibling+0x1c2/0x4f0
Call Trace:
 <IRQ> 
  select_task_rq_fair+0x749/0x930
  ? select_task_rq_fair+0xb4/0x930
  ? __lock_is_held+0x54/0x70
  try_to_wake_up+0x19a/0x5b0
  default_wake_function+0x12/0x20
  autoremove_wake_function+0x12/0x40
  __wake_up_common+0x55/0x90
  __wake_up+0x39/0x50
  wake_up_klogd_work_func+0x40/0x60
  irq_work_run_list+0x57/0x80
  irq_work_run+0x2c/0x30
  smp_irq_work_interrupt+0x2e/0x40
  irq_work_interrupt+0x96/0xa0
 <EOI> 
  ? _raw_spin_unlock_irqrestore+0x45/0x80
  try_to_wake_up+0x4a/0x5b0
  wake_up_state+0x10/0x20
  __kthread_unpark+0x67/0x70
  kthread_unpark+0x22/0x30
  cpuhp_online_idle+0x3e/0x70
  cpu_startup_entry+0x6a/0x450
  start_secondary+0x154/0x180

This can be reproduced by running the ftrace test case of kselftest, the 
test case will hot-unplug the cpu and the cpu will attach to the NULL 
sched-domain during scheduler teardown. 

The step 2 for the rewrite select_idle_siblings():

| Step 2) tracks the average cost of the scan and compares this to the
| average idle time guestimate for the CPU doing the wakeup.

If the cpu which doing the wakeup is the going hot-unplug cpu, then NULL
sched domain will be dereferenced to acquire the average cost of the scan.

This patch fix it by failing the search of an idle CPU in the LLC process 
if this sched domain is NULL.

Cc: Ingo Molnar <mingo@kernel.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
---
v1 -> v2:
 * remove rcu_read_lock()

 kernel/sched/fair.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 543b2f2..cfcdaf4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5472,13 +5472,18 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
  */
 static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
 {
-	struct sched_domain *this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
-	u64 avg_idle = this_rq()->avg_idle;
-	u64 avg_cost = this_sd->avg_scan_cost;
+	struct sched_domain *this_sd;
+	u64 avg_cost, avg_idle = this_rq()->avg_idle;
 	u64 time, cost;
 	s64 delta;
 	int cpu, wrap;
 
+	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
+	if (!this_sd)
+		return -1;
+
+	avg_cost = this_sd->avg_scan_cost;
+
 	/*
 	 * Due to large variance we need a large fuzz factor; hackbench in
 	 * particularly is sensitive here.
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH v2] sched/fair: Fix dereference NULL sched domain during select_idle_sibling
  2016-10-09  0:04 [PATCH v2] sched/fair: Fix dereference NULL sched domain during select_idle_sibling Wanpeng Li
@ 2016-10-10 10:25 ` Catalin Marinas
  2016-10-11  9:52 ` [tip:sched/urgent] sched/fair: Fix sched domains NULL dereference in select_idle_sibling() tip-bot for Wanpeng Li
  2016-10-17 11:29 ` [PATCH v2] sched/fair: Fix dereference NULL sched domain during select_idle_sibling Petr Mladek
  2 siblings, 0 replies; 4+ messages in thread
From: Catalin Marinas @ 2016-10-10 10:25 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Linux Kernel Mailing List, Wanpeng Li, Ingo Molnar,
	Mike Galbraith, Peter Zijlstra, Thomas Gleixner

On 9 October 2016 at 01:04, Wanpeng Li <kernellwp@gmail.com> wrote:
> From: Wanpeng Li <wanpeng.li@hotmail.com>
>
> Commit:
>
>   10e2f1acd01 ("sched/core: Rewrite and improve select_idle_siblings()")
>
> ... improved select_idle_sibling() but also triggered a regression:
>
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000078
> IP: [<ffffffffb10cd332>] select_idle_sibling+0x1c2/0x4f0
> PGD 0
> Oops: 0000 [#1] SMP
> CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.8.0+ #16
> RIP: 0010:[<ffffffffb10cd332>]  [<ffffffffb10cd332>] select_idle_sibling+0x1c2/0x4f0
> Call Trace:
>  <IRQ>
>   select_task_rq_fair+0x749/0x930
>   ? select_task_rq_fair+0xb4/0x930
>   ? __lock_is_held+0x54/0x70
>   try_to_wake_up+0x19a/0x5b0
>   default_wake_function+0x12/0x20
>   autoremove_wake_function+0x12/0x40
>   __wake_up_common+0x55/0x90
>   __wake_up+0x39/0x50
>   wake_up_klogd_work_func+0x40/0x60
>   irq_work_run_list+0x57/0x80
>   irq_work_run+0x2c/0x30
>   smp_irq_work_interrupt+0x2e/0x40
>   irq_work_interrupt+0x96/0xa0
>  <EOI>
>   ? _raw_spin_unlock_irqrestore+0x45/0x80
>   try_to_wake_up+0x4a/0x5b0
>   wake_up_state+0x10/0x20
>   __kthread_unpark+0x67/0x70
>   kthread_unpark+0x22/0x30
>   cpuhp_online_idle+0x3e/0x70
>   cpu_startup_entry+0x6a/0x450
>   start_secondary+0x154/0x180
>
> This can be reproduced by running the ftrace test case of kselftest, the
> test case will hot-unplug the cpu and the cpu will attach to the NULL
> sched-domain during scheduler teardown.
>
> The step 2 for the rewrite select_idle_siblings():
>
> | Step 2) tracks the average cost of the scan and compares this to the
> | average idle time guestimate for the CPU doing the wakeup.
>
> If the cpu which doing the wakeup is the going hot-unplug cpu, then NULL
> sched domain will be dereferenced to acquire the average cost of the scan.
>
> This patch fix it by failing the search of an idle CPU in the LLC process
> if this sched domain is NULL.
>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Mike Galbraith <efault@gmx.de>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>

This patch fixes a similar panic on arm64 (triggered by the LTP
cpuhotplug tests). FWIW:

Tested-by: Catalin Marinas <catalin.marinas@arm.com>

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [tip:sched/urgent] sched/fair: Fix sched domains NULL dereference in select_idle_sibling()
  2016-10-09  0:04 [PATCH v2] sched/fair: Fix dereference NULL sched domain during select_idle_sibling Wanpeng Li
  2016-10-10 10:25 ` Catalin Marinas
@ 2016-10-11  9:52 ` tip-bot for Wanpeng Li
  2016-10-17 11:29 ` [PATCH v2] sched/fair: Fix dereference NULL sched domain during select_idle_sibling Petr Mladek
  2 siblings, 0 replies; 4+ messages in thread
From: tip-bot for Wanpeng Li @ 2016-10-11  9:52 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: wanpeng.li, linux-kernel, tglx, hpa, catalin.marinas, efault,
	torvalds, peterz, mingo

Commit-ID:  9cfb38a7ba5a9c27c1af8093fb1af4b699c0a441
Gitweb:     http://git.kernel.org/tip/9cfb38a7ba5a9c27c1af8093fb1af4b699c0a441
Author:     Wanpeng Li <wanpeng.li@hotmail.com>
AuthorDate: Sun, 9 Oct 2016 08:04:03 +0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 11 Oct 2016 10:40:06 +0200

sched/fair: Fix sched domains NULL dereference in select_idle_sibling()

Commit:

  10e2f1acd01 ("sched/core: Rewrite and improve select_idle_siblings()")

... improved select_idle_sibling(), but also triggered a regression (crash)
during CPU-hotplug:

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000078
  IP: [<ffffffffb10cd332>] select_idle_sibling+0x1c2/0x4f0
  Call Trace:
   <IRQ>
    select_task_rq_fair+0x749/0x930
    ? select_task_rq_fair+0xb4/0x930
    ? __lock_is_held+0x54/0x70
    try_to_wake_up+0x19a/0x5b0
    default_wake_function+0x12/0x20
    autoremove_wake_function+0x12/0x40
    __wake_up_common+0x55/0x90
    __wake_up+0x39/0x50
    wake_up_klogd_work_func+0x40/0x60
    irq_work_run_list+0x57/0x80
    irq_work_run+0x2c/0x30
    smp_irq_work_interrupt+0x2e/0x40
    irq_work_interrupt+0x96/0xa0
   <EOI>
    ? _raw_spin_unlock_irqrestore+0x45/0x80
    try_to_wake_up+0x4a/0x5b0
    wake_up_state+0x10/0x20
    __kthread_unpark+0x67/0x70
    kthread_unpark+0x22/0x30
    cpuhp_online_idle+0x3e/0x70
    cpu_startup_entry+0x6a/0x450
    start_secondary+0x154/0x180

This can be reproduced by running the ftrace test case of kselftest, the
test case will hot-unplug the CPU and the CPU will attach to the NULL
sched-domain during scheduler teardown.

The step 2 for the rewrite select_idle_siblings():

  | Step 2) tracks the average cost of the scan and compares this to the
  | average idle time guestimate for the CPU doing the wakeup.

If the CPU which doing the wakeup is the going hot-unplug CPU, then NULL
sched domain will be dereferenced to acquire the average cost of the scan.

This patch fix it by failing the search of an idle CPU in the LLC process
if this sched domain is NULL.

Tested-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1475971443-3187-1-git-send-email-wanpeng.li@hotmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 502e95a..8b03fb5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5471,13 +5471,18 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
  */
 static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
 {
-	struct sched_domain *this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
-	u64 avg_idle = this_rq()->avg_idle;
-	u64 avg_cost = this_sd->avg_scan_cost;
+	struct sched_domain *this_sd;
+	u64 avg_cost, avg_idle = this_rq()->avg_idle;
 	u64 time, cost;
 	s64 delta;
 	int cpu, wrap;
 
+	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
+	if (!this_sd)
+		return -1;
+
+	avg_cost = this_sd->avg_scan_cost;
+
 	/*
 	 * Due to large variance we need a large fuzz factor; hackbench in
 	 * particularly is sensitive here.

^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH v2] sched/fair: Fix dereference NULL sched domain during select_idle_sibling
  2016-10-09  0:04 [PATCH v2] sched/fair: Fix dereference NULL sched domain during select_idle_sibling Wanpeng Li
  2016-10-10 10:25 ` Catalin Marinas
  2016-10-11  9:52 ` [tip:sched/urgent] sched/fair: Fix sched domains NULL dereference in select_idle_sibling() tip-bot for Wanpeng Li
@ 2016-10-17 11:29 ` Petr Mladek
  2 siblings, 0 replies; 4+ messages in thread
From: Petr Mladek @ 2016-10-17 11:29 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: linux-kernel, Wanpeng Li, Ingo Molnar, Mike Galbraith,
	Peter Zijlstra, Thomas Gleixner

On Sun 2016-10-09 08:04:03, Wanpeng Li wrote:
> From: Wanpeng Li <wanpeng.li@hotmail.com>
> 
> Commit:
> 
>   10e2f1acd01 ("sched/core: Rewrite and improve select_idle_siblings()")
> 
> ... improved select_idle_sibling() but also triggered a regression:
> 
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000078
> IP: [<ffffffffb10cd332>] select_idle_sibling+0x1c2/0x4f0
> PGD 0 
> Oops: 0000 [#1] SMP
> CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.8.0+ #16
> RIP: 0010:[<ffffffffb10cd332>]  [<ffffffffb10cd332>] select_idle_sibling+0x1c2/0x4f0
> Call Trace:
>  <IRQ> 
>   select_task_rq_fair+0x749/0x930
>   ? select_task_rq_fair+0xb4/0x930
>   ? __lock_is_held+0x54/0x70
>   try_to_wake_up+0x19a/0x5b0
>   default_wake_function+0x12/0x20
>   autoremove_wake_function+0x12/0x40
>   __wake_up_common+0x55/0x90
>   __wake_up+0x39/0x50
>   wake_up_klogd_work_func+0x40/0x60
>   irq_work_run_list+0x57/0x80
>   irq_work_run+0x2c/0x30
>   smp_irq_work_interrupt+0x2e/0x40
>   irq_work_interrupt+0x96/0xa0
>  <EOI> 
>   ? _raw_spin_unlock_irqrestore+0x45/0x80
>   try_to_wake_up+0x4a/0x5b0
>   wake_up_state+0x10/0x20
>   __kthread_unpark+0x67/0x70
>   kthread_unpark+0x22/0x30
>   cpuhp_online_idle+0x3e/0x70
>   cpu_startup_entry+0x6a/0x450
>   start_secondary+0x154/0x180
> 
> This can be reproduced by running the ftrace test case of kselftest, the 
> test case will hot-unplug the cpu and the cpu will attach to the NULL 
> sched-domain during scheduler teardown. 
> 
> The step 2 for the rewrite select_idle_siblings():
> 
> | Step 2) tracks the average cost of the scan and compares this to the
> | average idle time guestimate for the CPU doing the wakeup.
> 
> If the cpu which doing the wakeup is the going hot-unplug cpu, then NULL
> sched domain will be dereferenced to acquire the average cost of the scan.
> 
> This patch fix it by failing the search of an idle CPU in the LLC process 
> if this sched domain is NULL.
> 
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Mike Galbraith <efault@gmx.de>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>

This patch helped me as well. I tested intel_powerclamp
conversion to the new CPU hotplug state machine. And this
problem was very easy to trigger.

Feel free to add

Tested-by: Petr Mladek <pmladek@suse.com>

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2016-10-17 11:29 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-10-09  0:04 [PATCH v2] sched/fair: Fix dereference NULL sched domain during select_idle_sibling Wanpeng Li
2016-10-10 10:25 ` Catalin Marinas
2016-10-11  9:52 ` [tip:sched/urgent] sched/fair: Fix sched domains NULL dereference in select_idle_sibling() tip-bot for Wanpeng Li
2016-10-17 11:29 ` [PATCH v2] sched/fair: Fix dereference NULL sched domain during select_idle_sibling Petr Mladek

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.