All of lore.kernel.org
 help / color / mirror / Atom feed
From: Darren Hart <dvhltc@us.ibm.com>
To: svaidy@linux.vnet.ibm.com
Cc: Stephen Rothwell <sfr@canb.auug.org.au>,
	Gautham R Shenoy <ego@in.ibm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	linuxppc-dev@ozlabs.org, Will Schmidt <willschm@us.ibm.com>,
	Paul Mackerras <paulus@samba.org>,
	Thomas Gleixner <tglx@linutronix.de>
Subject: Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries
Date: Fri, 06 Aug 2010 00:13:23 -0700	[thread overview]
Message-ID: <4C5BB613.6050002@us.ibm.com> (raw)
In-Reply-To: <20100806050936.GI3282@dirshya.in.ibm.com>

On 08/05/2010 10:09 PM, Vaidyanathan Srinivasan wrote:
> * Darren Hart<dvhltc@us.ibm.com>  [2010-08-05 19:19:00]:
> 
>> On 07/22/2010 10:09 PM, Benjamin Herrenschmidt wrote:
>>> On Thu, 2010-07-22 at 21:44 -0700, Darren Hart wrote:
>>>
>>>>    suggestion I updated the instrumentation to display the
>>>> local_save_flags and irqs_disabled_flags:
>>>
>>>> Jul 22 23:36:58 igoort1 kernel: local flags: 0, irqs disabled: 1
>>>> Jul 22 23:36:58 igoort1 kernel: before H_CEDE current->stack: c00000010e9e3ce0, pcnt: 1
>>>> Jul 22 23:36:58 igoort1 kernel: after H_CEDE current->stack: c00000010e9e3ce0, pcnt: 1
>>>>
>>>> I'm not sure if I'm reading that right, but I believe interrupts are
>>>> intended to be disabled here. If accomplished via the
>>>> spin_lock_irqsave() this would behave differently on RT. However, this
>>>> path disables the interrupts handled by xics, all but the IPIs anyway.
>>>> On RT I disabled the decrementer as well.
>>>>
>>>> Is it possible for RT to be receiving other interrupts here?
>>>
>>> Also you may want to call hard_irq_disable() to -really- disable
>>> interrupts ... since we do lazy-disable on powerpc
>>>
>>
>> A quick update and request for direction wrt the cede hcall, interrupt handling, and the stack pointer.
>>
>> I've added patches one at a time, eliminating bugs with preempt_rt
>> (tip/rt/head). With my current patchset I have no crashes with a single run of
>> random_online.sh (stress testing to hit the hang will sees is my todo for
>> tomorrow).
>>
>> Current patchset includes:
>> patches/0001-wms-fix01.patch # P7 lazy flushing thing
>>
>> # next four are sent to / queued for mainline
>> patches/powerpc-increase-pseries_cpu_die-delay.patch
>> patches/powerpc-enable-preemption-before-cpu_die.patch
>> patches/powerpc-silence-__cpu_up-under-normal-operation.patch
>> patches/powerpc-silence-xics_migrate_irqs_away-during-cpu-offline.patch
>>
>> # this one needs to be cleaned up and sent to mainline
>> patches/powerpc-wait-for-cpu-to-go-inactive.patch
>>
>> patches/powerpc-disable-decrementer-on-offline.patch
>> patches/powerpc-cpu_die-preempt-hack.patch # reset preempt_count to 0 after cede
>> patches/powerpc-cede-processor-inst.patch # instrumentation
>> patches/powerpc-hard_irq_disable.patch # hard_irq_disable before cede
>> patches/powerpc-pad-thread_info.patch
>>
>> I didn't include all the patches as the relevant bits are included below in
>> code form.
>>
>> With the instrumentation, it's clear the change to preempt_count() is targeted
>> (since I added padding before and after preempt_count in the thread_info
>> struct) and preempt_count is still changed. It is also clear that it isn't just
>> a stray decrement since I set preempt_count() to 4 prior to calling cede and it
>> still is 0xffffffff after the hcall. Also note that the stack pointer doesn't
>> change across the cede call and neither does any other part of the thread_info
>> structure.
>>
>> Adding hard_irq_disable() did seem to affect things somewhat. Rather than a
>> serialized list of before/after thread_info prints around cede, I see
>> several befores then several afters. But the preempt_count is still modified.
>>
>> The relevant code looks like this (from pseries_mach_cpu_die())
>>
>>                          hard_irq_disable(); /* this doesn't fix things... but
>> 			                       does change behavior a bit */
>>                          preempt_count() = 0x4;
>>                          asm("mr %0,1" : "=r" (sp));  /* stack pointer is in R1 */
>>                          printk("before cede: sp=%lx pcnt=%x\n", sp, preempt_count());
>>                          print_thread_info();
>>                          extended_cede_processor(cede_latency_hint);
>>                          asm("mr %0,1" : "=r" (sp));
>>                          printk("after cede: sp=%lx pcnt=%x\n", sp, preempt_count());
>>                          print_thread_info();
>>                          preempt_count() = 0;
>>
>>
>> With these patches applied, the output across cede looks like:
>>
>> before cede: sp=c000000000b57150 pcnt=4
>> *** current->thread_info ***
>> ti->task: c000000000aa1410
>> ti->exec_domain: c000000000a59958
>> ti->cpu: 0
>> ti->stomp_on_me: 57005 /* 0xDEAD - forgot to print in hex */
>> ti->preempt_count: 4
>> ti->stomp_on_me_too: 48879 /* 0xBEEF - forgot to print in hex */
>> ti->local_flags: 0
>> *** end thread_info ***
>> after cede: sp=c000000000b57150 pcnt=ffffffff
>> *** current->thread_info ***
>> ti->task: c000000000aa1410
>> ti->exec_domain: c000000000a59958
>> ti->cpu: 0
> 
> Is this print sufficient to prove that we did not jump CPU?

Probably not, but the following should be sufficient. Note that the
dump occurs on "CPU: 6" which matches the CPU noted in the ti->cpu
before and after the cede call.

*** current->thread_info ***
ti->task: c00000008e7b2240
ti->exec_domain: c000000000a59958
ti->cpu: 6
ti->stomp_on_me: 57005
ti->preempt_count: 4
ti->stomp_on_me_too: 48879
ti->local_flags: 0
*** end thread_info ***
------------[ cut here ]------------
Badness at kernel/sched.c:3698
NIP: c0000000006987e4 LR: c0000000006987c8 CTR: c00000000005c668
REGS: c00000010e713800 TRAP: 0700   Not tainted  (2.6.33-rt-dvhrt05-02358-g61223dd-dirty)
MSR: 8000000000021032 <ME,CE,IR,DR>  CR: 24000082  XER: 20000000
TASK = c00000008e7b2240[0] 'swapper' THREAD: c00000010e710000 CPU: 6
GPR00: 0000000000000000 c00000010e713a80 c000000000b54f88 0000000000000001 
GPR04: c00000010e713d08 ffffffffffffffff 00000000000000e0 800000000d049138 
GPR08: 0000000000000000 c000000000ca5420 0000000000000001 c000000000bc22e8 
GPR12: 0000000000000002 c000000000ba5080 0000000000000000 000000000f394b6c 
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 
GPR20: c000000000ba50c0 0000000000000004 0000000000000002 0000000000000000 
GPR24: 0000000000000004 c00000010e713cd0 c0000000009f1ecc c00000000088aaff 
GPR28: 0000000000000000 0000000000000001 c000000000ad7610 c00000010e713a80 
NIP [c0000000006987e4] .add_preempt_count+0x6c/0xe0
LR [c0000000006987c8] .add_preempt_count+0x50/0xe0
Call Trace:
[c00000010e713a80] [c00000010e713b30] 0xc00000010e713b30 (unreliable)
[c00000010e713b10] [c0000000000824b4] .vprintk+0xac/0x480
[c00000010e713c40] [c00000000069bcbc] .printk+0x48/0x5c
[c00000010e713cd0] [c00000000005c190] .pseries_mach_cpu_die+0x1c4/0x39c
[c00000010e713db0] [c00000000001e7e0] .cpu_die+0x4c/0x68
[c00000010e713e30] [c000000000017de0] .cpu_idle+0x1f8/0x220
[c00000010e713ed0] [c0000000006a17c4] .start_secondary+0x394/0x3d4
[c00000010e713f90] [c000000000008264] .start_secondary_prolog+0x10/0x14
Instruction dump:
78290464 80090018 2f800000 40fc002c 4bd089c1 60000000 2fa30000 419e006c 
e93e87e0 80090000 2f800000 409e005c <0fe00000> 48000054 e97e8398 78290464 
after cede: sp=c00000010e713cd0 pcnt=ffffffff
*** current->thread_info ***
ti->task: c00000008e7b2240
ti->exec_domain: c000000000a59958
ti->cpu: 6
ti->stomp_on_me: 57005
ti->preempt_count: fffffffe
ti->stomp_on_me_too: 48879
ti->local_flags: 0
*** end thread_info ***


> We agree that decrementer can possibly expire and we could have gone
> to the handler and come back just like we would do in an idle loop.

I disable the decrementer in the stop_cpu path... that may not be
sufficient for the pseries_mach_cpu_die() path, I'll have to experiment.

However, even if it did, it shouldn't be just changing the value,
especially not to something illegal.

> This will not happen always, but I could see a window of time during
> which this may happen.  But that should not change the preempt_count
> unconditionally to -1 with the same stack pointer.
> 
>> ti->stomp_on_me: 57005
>> ti->preempt_count: ffffffff
>> ti->stomp_on_me_too: 48879
>> ti->local_flags: 0
> 
> Is there a method to identify whether we had executed any of the
> interrupt handler while in this code?

I'd like to know as well. If anyone knows, please share. Otherwise
I'll be trying to sort that out tomorrow.

> 
>> *** end thread_info ***
>>
>> Are there any additional thoughts on what might be causing preempt_count to change?
>> I was thinking that cede might somehow force it to 0 (or perhaps one of the
>> preempt_count explicit value setters in irq.c) and then decrement it - but that dec
>> should trigger an error in the dec_preempt_count() as I have CONFIG_DEBUG_PREEMPT=y.
> 
> Decrementer is only the suspect, we have not yet proved that the dec
> handler is being executed.  Can somebody be using our stack?  The
> pointer is same.. but do we still own it?  I cannot think of how
> somebody else could be using this cpu's stack... but just a thought to
> explain this anomaly.
> 
> --Vaidy


Thanks for the thoughts,

-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

  reply	other threads:[~2010-08-06  7:13 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-07-22 18:24 [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries Darren Hart
2010-07-22 18:36 ` Darren Hart
2010-07-22 18:38 ` Thomas Gleixner
2010-08-10 22:36   ` Darren Hart
2010-07-22 22:25 ` Benjamin Herrenschmidt
2010-07-22 23:57   ` Darren Hart
2010-07-23  4:44     ` Darren Hart
2010-07-23  5:08       ` Vaidyanathan Srinivasan
2010-07-23  5:11         ` Benjamin Herrenschmidt
2010-07-23  7:07           ` Vaidyanathan Srinivasan
2010-08-05  4:45             ` Darren Hart
2010-08-05 11:06               ` Vaidyanathan Srinivasan
2010-08-05 12:26                 ` Thomas Gleixner
2010-07-23  5:09       ` Benjamin Herrenschmidt
2010-08-06  2:19         ` Darren Hart
2010-08-06  5:09           ` Vaidyanathan Srinivasan
2010-08-06  7:13             ` Darren Hart [this message]
2010-07-23 14:39     ` Will Schmidt
2010-08-04 13:44   ` Darren Hart
2010-08-19 15:58 ` Ankita Garg
2010-08-19 18:58   ` Will Schmidt
2010-08-23 22:20   ` Darren Hart
2010-08-31  7:12   ` Darren Hart
2010-09-01  5:54     ` Michael Ellerman
2010-09-01 15:10       ` Darren Hart
2010-09-01 18:47         ` Darren Hart
2010-09-01 19:59           ` Steven Rostedt
2010-09-01 20:42             ` Darren Hart
2010-09-02  1:02               ` Michael Neuling
2010-09-02  4:06                 ` Steven Rostedt
2010-09-02  6:04                   ` Darren Hart
2010-09-03 20:10                     ` Will Schmidt
2010-09-02 23:04                   ` Michael Neuling
2010-09-03  0:08                     ` Darren Hart
2010-09-02  3:46           ` Michael Neuling

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4C5BB613.6050002@us.ibm.com \
    --to=dvhltc@us.ibm.com \
    --cc=ego@in.ibm.com \
    --cc=linuxppc-dev@ozlabs.org \
    --cc=paulus@samba.org \
    --cc=rostedt@goodmis.org \
    --cc=sfr@canb.auug.org.au \
    --cc=svaidy@linux.vnet.ibm.com \
    --cc=tglx@linutronix.de \
    --cc=willschm@us.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.