From: Peter Zijlstra <peterz@infradead.org>
To: Sachin Sant <sachinp@in.ibm.com>
Cc: Linux/PPC Development <linuxppc-dev@ozlabs.org>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	Ingo Molnar <mingo@elte.hu>,
	linux-next@vger.kernel.org,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>
Subject: Re: [Next] CPU Hotplug test failures on powerpc
Date: Mon, 14 Dec 2009 13:19:42 +0100
Message-ID: <1260793182.4165.223.camel@twins>
In-Reply-To: <4B261D7A.9040802@in.ibm.com>

On Mon, 2009-12-14 at 16:41 +0530, Sachin Sant wrote:
> Peter Zijlstra wrote:
> > On Fri, 2009-12-11 at 16:23 +0530, Sachin Sant wrote:
> >   
> >> While executing cpu_hotplug(from autotest) tests against latest
> >> next on a power6 box, the machine locks up. A soft reset shows
> >> the following trace
> >>
> >> cpu 0x0: Vector: 100 (System Reset) at [c00000000c9333d0]
> >>     pc: c0000000003433d8: .find_next_bit+0x54/0xc4
> >>     lr: c000000000342f10: .cpumask_next_and+0x4c/0x94
> >>     sp: c00000000c933650
> >>    msr: 8000000000089032
> >>   current = 0xc00000000c173840
> >>   paca    = 0xc000000000bc2600
> >>     pid   = 2602, comm = hotplug06.top.s
> >> enter ? for help
> >> [link register   ] c000000000342f10 .cpumask_next_and+0x4c/0x94
> >> [c00000000c933650] c0000000000e9f34 .cpuset_cpus_allowed_locked+0x38/0x74 (unreliable)
> >> [c00000000c9336e0] c000000000090074 .move_task_off_dead_cpu+0xc4/0x1ac
> >> [c00000000c9337a0] c0000000005e4e5c .migration_call+0x304/0x830
> >> [c00000000c933880] c0000000005e0880 .notifier_call_chain+0x68/0xe0
> >> [c00000000c933920] c00000000012a92c ._cpu_down+0x210/0x34c
> >> [c00000000c933a90] c00000000012aad8 .cpu_down+0x70/0xa8
> >> [c00000000c933b20] c000000000525940 .store_online+0x54/0x894
> >> [c00000000c933bb0] c000000000463430 .sysdev_store+0x3c/0x50
> >> [c00000000c933c20] c0000000001f8320 .sysfs_write_file+0x124/0x18c
> >> [c00000000c933ce0] c00000000017edac .vfs_write+0xd4/0x1fc
> >> [c00000000c933d80] c00000000017efdc .SyS_write+0x58/0xa0
> >> [c00000000c933e30] c0000000000085b4 syscall_exit+0x0/0x40
> >> --- Exception: c01 (System Call) at 00000fff9fa8a8f8
> >> SP (fffe7aef200) is in userspace
> >> 0:mon> e
> >> cpu 0x0: Vector: 100 (System Reset) at [c00000000c9333d0]
> >>     pc: c0000000003433d8: .find_next_bit+0x54/0xc4
> >>     lr: c000000000342f10: .cpumask_next_and+0x4c/0x94
> >>     sp: c00000000c933650
> >>    msr: 8000000000089032
> >>   current = 0xc00000000c173840
> >>   paca    = 0xc000000000bc2600
> >>     pid   = 2602, comm = hotplug06.top.s
> >>

OK so how do I read that above thing? What's a System Reset? Is that
like the x86 triple fault thing?

From what I can make of it, it's in move_task_off_dead_cpu(), right after
having called cpuset_cpus_allowed_locked(), doing that cpumask_any_and()
call.

static void move_task_off_dead_cpu(int dead_cpu, struct task_struct *p)
{
        int dest_cpu;
        const struct cpumask *nodemask = cpumask_of_node(cpu_to_node(dead_cpu));

again:
        /* Look for allowed, online CPU in same node. */
        for_each_cpu_and(dest_cpu, nodemask, cpu_active_mask)
                if (cpumask_test_cpu(dest_cpu, &p->cpus_allowed))
                        goto move;

        /* Any allowed, online CPU? */
        dest_cpu = cpumask_any_and(&p->cpus_allowed, cpu_active_mask);
        if (dest_cpu < nr_cpu_ids)
                goto move;

        /* No more Mr. Nice Guy. */
        if (dest_cpu >= nr_cpu_ids) {
                cpuset_cpus_allowed_locked(p, &p->cpus_allowed);
====>           dest_cpu = cpumask_any_and(cpu_active_mask, &p->cpus_allowed);

                /*
                 * Don't tell them about moving exiting tasks or
                 * kernel threads (both mm NULL), since they never
                 * leave kernel.
                 */
                if (p->mm && printk_ratelimit()) {
                        pr_info("process %d (%s) no longer affine to cpu%d\n",
                                task_pid_nr(p), p->comm, dead_cpu);
                }
        }

move:
        /* It can have affinity changed while we were choosing. */
        if (unlikely(!__migrate_task_irq(p, dead_cpu, dest_cpu)))
                goto again;
}

Both masks, p->cpus_allowed and cpu_active_mask, are stable: p won't go
away since we hold the tasklist_lock (in migrate_live_tasks()), and
cpu_active_mask is static storage, so WTH is it going funny on?
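
For reference, a minimal userspace sketch of what the selection order in
move_task_off_dead_cpu() computes, assuming all CPUs fit in one 64-bit
word. The model_* names and MODEL_NR_CPU_IDS are hypothetical stand-ins,
not kernel API. It only illustrates the return convention of
cpumask_any_and() (first common CPU, or a value >= nr_cpu_ids when the
intersection is empty); it does not explain why the real call spins.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for nr_cpu_ids; this model packs CPUs into one word. */
#define MODEL_NR_CPU_IDS 8u

/*
 * Model of cpumask_any_and(): return the lowest CPU number set in both
 * masks, or MODEL_NR_CPU_IDS if the intersection is empty.
 */
static unsigned int model_cpumask_any_and(uint64_t a, uint64_t b)
{
        uint64_t both = a & b;
        unsigned int cpu;

        assert(MODEL_NR_CPU_IDS <= 64);
        for (cpu = 0; cpu < MODEL_NR_CPU_IDS; cpu++)
                if (both & (UINT64_C(1) << cpu))
                        return cpu;
        return MODEL_NR_CPU_IDS;
}

/*
 * Model of the fallback order: an allowed, active CPU in the same node
 * first, then any allowed, active CPU, then widen the allowed mask
 * (cpuset_allowed stands in for what cpuset_cpus_allowed_locked()
 * would write) and try once more.
 */
static unsigned int model_pick_dest_cpu(uint64_t nodemask, uint64_t allowed,
                                        uint64_t active, uint64_t cpuset_allowed)
{
        unsigned int dest_cpu;

        dest_cpu = model_cpumask_any_and(nodemask & allowed, active);
        if (dest_cpu < MODEL_NR_CPU_IDS)
                return dest_cpu;

        dest_cpu = model_cpumask_any_and(allowed, active);
        if (dest_cpu < MODEL_NR_CPU_IDS)
                return dest_cpu;

        /* No more Mr. Nice Guy: fall back to the wider cpuset mask. */
        return model_cpumask_any_and(active, cpuset_allowed);
}
```

In this model the result can still be MODEL_NR_CPU_IDS after the last
step if the widened mask has no active CPU in it, so the caller has to
treat that value as "nothing found" rather than as a CPU number.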
