linuxppc-dev.lists.ozlabs.org archive mirror
* PPC upstream kernel ignored DABR bug
@ 2007-11-26 22:02 Jan Kratochvil
  2007-11-27 22:35 ` Arnd Bergmann
  0 siblings, 1 reply; 28+ messages in thread
From: Jan Kratochvil @ 2007-11-26 22:02 UTC (permalink / raw)
  To: Paul Mackerras, linuxppc-dev; +Cc: Roland McGrath

Hi,

this testcase:
	http://people.redhat.com/jkratoch/dabr-lost.c

reproduces a PPC DABR kernel bug.  The variable `variable' should not get
modified as the thread modifying it should be caught by its DABR:

$ ./dabr-lost
TID 30914: DABR 0x10012a77 NIP 0x80f6ebb318
TID 30915: DABR 0x10012a77 NIP 0x80f6ebb318
TID 30916: DABR 0x10012a77 NIP 0x80f6ebb318
TID 30914: hitting the variable
TID 30915: hitting the variable
TID 30916: hitting the variable
variable found = 30916, caught TID = 30914
TID 30916: DABR 0x10012a77
Variable got modified by a thread which has DABR still set!

The `variable found =' line shows that the parent ptracer found that thread 30916
wrote the value into the variable, even though that thread already had DABR set.
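The DABR value printed by the testcase (0x10012a77 above) packs a doubleword-aligned
address and three flag bits into one register, and a store matches if it touches
the same 8-byte doubleword. A minimal C model of that decoding and match test,
assuming the low-bit layout used by the Linux powerpc headers (DR = 0x1, DW = 0x2,
BT = 0x4); the helper names are hypothetical, and the model ignores DABRX and the
translation bit entirely:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Low-order DABR flag bits (assumed to mirror asm/reg.h). */
#define DABR_DATA_READ   0x1u  /* match on loads */
#define DABR_DATA_WRITE  0x2u  /* match on stores */
#define DABR_TRANSLATION 0x4u  /* match only with data translation on */
#define DABR_FLAG_MASK   0x7u

/* Doubleword-aligned address that the DABR value watches. */
static uint64_t dabr_address(uint64_t dabr)
{
    return dabr & ~(uint64_t)DABR_FLAG_MASK;
}

/* Hypothetical match model: the access hits if it falls in the same
 * 8-byte doubleword and its type (load/store) is enabled in DABR.
 * DABRX and the BT bit are not modelled. */
static bool dabr_matches(uint64_t dabr, uint64_t ea, bool is_store)
{
    if ((ea & ~(uint64_t)DABR_FLAG_MASK) != dabr_address(dabr))
        return false;
    return is_store ? (dabr & DABR_DATA_WRITE) != 0
                    : (dabr & DABR_DATA_READ) != 0;
}
```

Under this model, 0x10012a77 watches the doubleword at 0x10012a70 for both loads
and stores, so a 4-byte store to 0x10012a74 should trap while one to 0x10012a78
should not.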

As the behavior depends on the current weather (it varies from run to run), I
expect scheduling matters here.

It is important that the target thread is in the `nanosleep' syscall.  If you
define WORKAROUND_SET_DABR_IN_SYSCALL in the testcase, the thread busyloops in
userland instead and the bug is no longer reproduced.

I reproduced it on a utrace-patched kernel on a dual-CPU Power5, and Roland
McGrath reported reproducing it on the vanilla upstream kernel on a Mac G5.



Regards,
Jan Kratochvil

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: PPC upstream kernel ignored DABR bug
  2007-11-26 22:02 PPC upstream kernel ignored DABR bug Jan Kratochvil
@ 2007-11-27 22:35 ` Arnd Bergmann
  2007-11-28  8:59   ` Jan Kratochvil
  2007-11-28 22:59   ` Geoff Levand
  0 siblings, 2 replies; 28+ messages in thread
From: Arnd Bergmann @ 2007-11-27 22:35 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Paul Mackerras, Jan Kratochvil, Roland McGrath

On Monday 26 November 2007, Jan Kratochvil wrote:
> Hi,
>
> this testcase:
>         http://people.redhat.com/jkratoch/dabr-lost.c
>
> reproduces a PPC DABR kernel bug.  The variable `variable' should not get
> modified as the thread modifying it should be caught by its DABR:
>
> $ ./dabr-lost
> TID 30914: DABR 0x10012a77 NIP 0x80f6ebb318
> TID 30915: DABR 0x10012a77 NIP 0x80f6ebb318
> TID 30916: DABR 0x10012a77 NIP 0x80f6ebb318
> TID 30914: hitting the variable
> TID 30915: hitting the variable
> TID 30916: hitting the variable
> variable found = 30916, caught TID = 30914
> TID 30916: DABR 0x10012a77
> Variable got modified by a thread which has DABR still set!
>

This sounds like a bug recently reported by Uli Weigand. BenH
said he'd take a look, but it probably fell under the table.
The problem found by Uli is that on certain processors (Cell/B.E.
in his case), the DABRX register needs to be set in order for
the DABR to take effect.

	Arnd <><

* Re: PPC upstream kernel ignored DABR bug
  2007-11-27 22:35 ` Arnd Bergmann
@ 2007-11-28  8:59   ` Jan Kratochvil
  2007-11-28 12:28     ` Arnd Bergmann
  2007-11-28 22:59   ` Geoff Levand
  1 sibling, 1 reply; 28+ messages in thread
From: Jan Kratochvil @ 2007-11-28  8:59 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: linuxppc-dev, Paul Mackerras, Roland McGrath

On Tue, 27 Nov 2007 23:35:36 +0100, Arnd Bergmann wrote:
> On Monday 26 November 2007, Jan Kratochvil wrote:
> > Hi,
> > 
> > this testcase:
> >         http://people.redhat.com/jkratoch/dabr-lost.c
> > 
> > reproduces a PPC DABR kernel bug.  The variable `variable' should not get
> > modified as the thread modifying it should be caught by its DABR:
> > 
> > $ ./dabr-lost
> > TID 30914: DABR 0x10012a77 NIP 0x80f6ebb318
> > TID 30915: DABR 0x10012a77 NIP 0x80f6ebb318
> > TID 30916: DABR 0x10012a77 NIP 0x80f6ebb318
> > TID 30914: hitting the variable
> > TID 30915: hitting the variable
> > TID 30916: hitting the variable
> > variable found = 30916, caught TID = 30914
> > TID 30916: DABR 0x10012a77
> > Variable got modified by a thread which has DABR still set!
> > 
> 
> This sounds like a bug recently reported by Uli Weigand. BenH
> said he'd take a look, but it probably fell under the table.
> The problem found by Uli is that on certain processors (Cell/B.E.
> in his case), the DABRX register needs to be set in order for
> the DABR to take effect.

Please be aware DABR works fine if the same code runs just 1 (always) or
2 (sometimes) threads.  It starts failing with too many threads running:

$ ./dabr-lost
TID 32725: DABR 0x1001279f NIP 0xfecf41c
TID 32726: DABR 0x1001279f NIP 0xfecf41c
TID 32725: hitting the variable
variable found = -1, caught TID = 32725
TID 32726: hitting the variable
variable found = -1, caught TID = 32726
The kernel bug did not get reproduced - increase THREADS.

As I did not find any code in that kernel touching DABRX its value should not
be dependent on the number of threads running.


Regards,
Lace

* Re: PPC upstream kernel ignored DABR bug
  2007-11-28  8:59   ` Jan Kratochvil
@ 2007-11-28 12:28     ` Arnd Bergmann
  2007-11-28 12:45       ` Jan Kratochvil
  0 siblings, 1 reply; 28+ messages in thread
From: Arnd Bergmann @ 2007-11-28 12:28 UTC (permalink / raw)
  To: Jan Kratochvil; +Cc: linuxppc-dev, Paul Mackerras, Roland McGrath

On Wednesday 28 November 2007, Jan Kratochvil wrote:
> Please be aware DABR works fine if the same code runs just 1 (always) or
> 2 (sometimes) threads.  It starts failing with too many threads running:
>
> $ ./dabr-lost
> TID 32725: DABR 0x1001279f NIP 0xfecf41c
> TID 32726: DABR 0x1001279f NIP 0xfecf41c
> TID 32725: hitting the variable
> variable found = -1, caught TID = 32725
> TID 32726: hitting the variable
> variable found = -1, caught TID = 32726
> The kernel bug did not get reproduced - increase THREADS.
>
> As I did not find any code in that kernel touching DABRX its value should not
> be dependent on the number of threads running.
>

Right, this is a different problem from the one reported by Uli.
From what I can tell, your problem is that you set the DABR only
in one thread, so the other threads don't see it. DABR is saved
in the thread_struct, so setting it in one thread doesn't have
an impact on any other thread.

	Arnd <><

* Re: PPC upstream kernel ignored DABR bug
  2007-11-28 12:28     ` Arnd Bergmann
@ 2007-11-28 12:45       ` Jan Kratochvil
  0 siblings, 0 replies; 28+ messages in thread
From: Jan Kratochvil @ 2007-11-28 12:45 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: linuxppc-dev, Paul Mackerras, Roland McGrath

On Wed, 28 Nov 2007 13:28:48 +0100, Arnd Bergmann wrote:
> On Wednesday 28 November 2007, Jan Kratochvil wrote:
> > Please be aware DABR works fine if the same code runs just 1 (always) or
> > 2 (sometimes) threads.  It starts failing with too many threads running:
> > 
> > $ ./dabr-lost
> > TID 32725: DABR 0x1001279f NIP 0xfecf41c
> > TID 32726: DABR 0x1001279f NIP 0xfecf41c
> > TID 32725: hitting the variable
> > variable found = -1, caught TID = 32725
> > TID 32726: hitting the variable
> > variable found = -1, caught TID = 32726
> > The kernel bug did not get reproduced - increase THREADS.
> > 
> > As I did not find any code in that kernel touching DABRX its value should not
> > be dependent on the number of threads running.
> > 
> 
> Right, this is a different problem from the one reported by Uli.
> From what I can tell, your problem is that you set the DABR only
> in one thread, so the other threads don't see it. DABR is saved
> in the thread_struct, so setting it in one thread doesn't have
> an impact on any other thread.

The testcase even prints out above:
	TID 32725: DABR 0x1001279f NIP 0xfecf41c
	TID 32726: DABR 0x1001279f NIP 0xfecf41c

showing that it wrote DABR in both threads and also successfully read it back
from each thread individually (addressed by its thread-specific TID).

for (threadi = 0; threadi < THREADS; threadi++)
    {
      pid_t tid = thread[threadi];

      setup (tid);
...
    }
static void setup (pid_t tid)
{
...
  l = ptrace (PTRACE_SET_DEBUGREG, tid, NULL, (void *) dabr);
...
}

Also, if I did not set DABR specifically for each thread, it would not work in
90% of cases with `THREADS == 2', and it would not work with `THREADS == 4' when
the threads are busylooping (and therefore not in a syscall):
	TID 596: DABR 0x100127a7 NIP 0x10000dbc
	TID 597: DABR 0x100127a7 NIP 0x10000db0
	TID 598: DABR 0x100127a7 NIP 0x10000dac
	TID 599: DABR 0x100127a7 NIP 0x10000dbc
	TID 596: hitting the variable
	variable found = -1, caught TID = 596
	TID 599: hitting the variable
	variable found = -1, caught TID = 599
	TID 597: hitting the variable
	variable found = -1, caught TID = 597
	TID 598: hitting the variable
	variable found = -1, caught TID = 598
	The kernel bug got workarounded by WORKAROUND_SET_DABR_IN_SYSCALL.

(I have now found that WORKAROUND_SET_DABR_IN_SYSCALL only reduces the probability
of the failure; it is not a 100% workaround for the problem in the testcase.)


There is some tricky kernel code around it but I did not try to debug it:

struct task_struct *__switch_to(struct task_struct *prev,
	struct task_struct *new)
{
...
	if (unlikely(__get_cpu_var(current_dabr) != new->thread.dabr)) {
		set_dabr(new->thread.dabr);
		__get_cpu_var(current_dabr) = new->thread.dabr;
	}
...
}
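To make the suspicion about this code concrete, here is a hypothetical user-space
model of the same lazy-update pattern: each CPU caches the value it last wrote, and
the register is only rewritten when the incoming thread's value differs from the
cache. All names are made up and no kernel API is used. It only shows that the
pattern stays correct as long as nothing writes the register without also updating
the per-CPU cache; whether any such path exists in the kernel is not established here.

```c
#include <assert.h>
#include <stdint.h>

/* Simulated CPU: the hardware register plus a per-CPU cache of the
 * last value written, like current_dabr in __switch_to(). */
struct model_cpu {
    uint64_t hw_dabr;      /* what the simulated register holds */
    uint64_t current_dabr; /* per-CPU cache of the last written value */
    unsigned writes;       /* number of simulated set_dabr() calls */
};

/* Simulated thread: per-thread value, like thread.dabr. */
struct model_thread {
    uint64_t dabr;
};

static void model_set_dabr(struct model_cpu *cpu, uint64_t value)
{
    cpu->hw_dabr = value;
    cpu->writes++;
}

/* Lazy update: skip the register write when the cache already matches. */
static void model_switch_to(struct model_cpu *cpu,
                            const struct model_thread *next)
{
    if (cpu->current_dabr != next->dabr) {
        model_set_dabr(cpu, next->dabr);
        cpu->current_dabr = next->dabr;
    }
}
```

If the simulated register is clobbered behind the cache's back, switching to a
thread with the cached value leaves the register stale, which would look exactly
like "DABR set but never matched" from user space.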



Regards,
Jan

* Re: PPC upstream kernel ignored DABR bug
  2007-11-27 22:35 ` Arnd Bergmann
  2007-11-28  8:59   ` Jan Kratochvil
@ 2007-11-28 22:59   ` Geoff Levand
  2007-11-29  0:13     ` Arnd Bergmann
  1 sibling, 1 reply; 28+ messages in thread
From: Geoff Levand @ 2007-11-28 22:59 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: linuxppc-dev, Paul Mackerras, Roland McGrath, Jan Kratochvil

Arnd Bergmann wrote:
> On Monday 26 November 2007, Jan Kratochvil wrote:
>> Hi,
>> 
>> this testcase:
>>         http://people.redhat.com/jkratoch/dabr-lost.c
>> 
>> reproduces a PPC DABR kernel bug.  The variable `variable' should not get
>> modified as the thread modifying it should be caught by its DABR:
>> 
>> $ ./dabr-lost
>> TID 30914: DABR 0x10012a77 NIP 0x80f6ebb318
>> TID 30915: DABR 0x10012a77 NIP 0x80f6ebb318
>> TID 30916: DABR 0x10012a77 NIP 0x80f6ebb318
>> TID 30914: hitting the variable
>> TID 30915: hitting the variable
>> TID 30916: hitting the variable
>> variable found = 30916, caught TID = 30914
>> TID 30916: DABR 0x10012a77
>> Variable got modified by a thread which has DABR still set!
>> 
> 
> This sounds like a bug recently reported by Uli Weigand. BenH
> said he'd take a look, but it probably fell under the table.
> The problem found by Uli is that on certain processors (Cell/B.E.
> in his case), the DABRX register needs to be set in order for
> the DABR to take effect.

Just as a note, the PS3's lv1_set_dabr(), which we used for
ppc_md.set_dabr, sets up both the DABRX and DABR registers.

-Geoff

* Re: PPC upstream kernel ignored DABR bug
  2007-11-28 22:59   ` Geoff Levand
@ 2007-11-29  0:13     ` Arnd Bergmann
  2008-03-10  0:53       ` Luis Machado
  0 siblings, 1 reply; 28+ messages in thread
From: Arnd Bergmann @ 2007-11-29  0:13 UTC (permalink / raw)
  To: Geoff Levand; +Cc: linuxppc-dev, Paul Mackerras, Roland McGrath, Jan Kratochvil

On Wednesday 28 November 2007 23:59:36 Geoff Levand wrote:
> > This sounds like a bug recently reported by Uli Weigand. BenH
> > said he'd take a look, but it probably fell under the table.
> > The problem found by Uli is that on certain processors (Cell/B.E.
> > in his case), the DABRX register needs to be set in order for
> > the DABR to take effect.
>
> Just as a note, the PS3's lv1_set_dabr(), which we used for
> ppc_md.set_dabr, sets up both the DABRX and DABR registers.

Yes, I know. I tried it on the PS3 first and couldn't reproduce
the bug he saw on the blade.

	Arnd <><

* Re: PPC upstream kernel ignored DABR bug
  2007-11-29  0:13     ` Arnd Bergmann
@ 2008-03-10  0:53       ` Luis Machado
  2008-03-10 14:01         ` Jens Osterkamp
  0 siblings, 1 reply; 28+ messages in thread
From: Luis Machado @ 2008-03-10  0:53 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: linuxppc-dev, Paul Mackerras, Roland McGrath, Jan Kratochvil

> Yes, I know. I tried it on the PS3 first and couldn't reproduce
> the bug he saw on the blade.

Arnd,

Do we have any news on this topic? 

I've seen this happening quite often within GDB when using hardware
watchpoints on a shared variable in a threaded (7+ threads) binary.
Sometimes the watchpoint won't trigger, even though the monitored
variable's value was modified.

Appreciate your feedback.

Best regards,

-- 
Luis Machado
LoP Toolchain
Software Engineer 
IBM Linux Technology Center

* Re: PPC upstream kernel ignored DABR bug
  2008-03-10  0:53       ` Luis Machado
@ 2008-03-10 14:01         ` Jens Osterkamp
  2008-03-10 15:13           ` Luis Machado
                             ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Jens Osterkamp @ 2008-03-10 14:01 UTC (permalink / raw)
  To: linuxppc-dev, luisgpm
  Cc: Jan Kratochvil, Paul Mackerras, Roland McGrath, Arnd Bergmann

On Monday 10 March 2008, Luis Machado wrote:
> > Yes, I know. I tried it on the PS3 first and couldn't reproduce
> > the bug he saw on the blade.
> 
> Arnd,
> 
> Do we have any news on this topic? 
> 
> I've seen this happening quite often within GDB when using hardware
> watchpoints on a shared variable in a threaded (7+ threads) binary.
> Sometimes the watchpoint won't trigger, even though the monitored
> variable's value was modified.

On the Blade DABRX had to be set additional to DABR. PS3 and Celleb
already did this. Uli Weigand found this back in November. I submitted
a patch for this which went into 2.6.25-rc4.
Can you please try again with rc4 ?

Regards,

Jens

IBM Deutschland Entwicklung GmbH
Chairman of the Supervisory Board: Martin Jetter
Management: Herbert Kircher
Registered office: Böblingen
Register court: Amtsgericht Stuttgart, HRB 243294

* Re: PPC upstream kernel ignored DABR bug
  2008-03-10 14:01         ` Jens Osterkamp
@ 2008-03-10 15:13           ` Luis Machado
  2008-03-10 19:19           ` Roland McGrath
  2008-03-12 17:51           ` Luis Machado
  2 siblings, 0 replies; 28+ messages in thread
From: Luis Machado @ 2008-03-10 15:13 UTC (permalink / raw)
  To: Jens Osterkamp
  Cc: linuxppc-dev, Paul Mackerras, Roland McGrath, Arnd Bergmann,
	Jan Kratochvil

> On the Blade DABRX had to be set additional to DABR. PS3 and Celleb
> already did this. Uli Weigand found this back in November. I submitted
> a patch for this which went into 2.6.25-rc4.
> Can you please try again with rc4 ?

I will try it and will post the results back.

Thanks Jens.

Regards,
-- 
Luis Machado
Software Engineer 
IBM Linux Technology Center

* Re: PPC upstream kernel ignored DABR bug
  2008-03-10 14:01         ` Jens Osterkamp
  2008-03-10 15:13           ` Luis Machado
@ 2008-03-10 19:19           ` Roland McGrath
  2008-03-10 19:36             ` Luis Machado
  2008-03-10 22:06             ` Segher Boessenkool
  2008-03-12 17:51           ` Luis Machado
  2 siblings, 2 replies; 28+ messages in thread
From: Roland McGrath @ 2008-03-10 19:19 UTC (permalink / raw)
  To: Jens Osterkamp
  Cc: linuxppc-dev, Jan Kratochvil, Paul Mackerras, Arnd Bergmann

> On the Blade DABRX had to be set additional to DABR. PS3 and Celleb
> already did this. Uli Weigand found this back in November. I submitted
> a patch for this which went into 2.6.25-rc4.
> Can you please try again with rc4 ?

This is not the problem.  This came up before and everyone seems to have
forgotten.  This bug has been reproduced on G5's, which do not have DABRX
as I understand it.


Thanks,
Roland

* Re: PPC upstream kernel ignored DABR bug
  2008-03-10 19:19           ` Roland McGrath
@ 2008-03-10 19:36             ` Luis Machado
  2008-03-10 19:50               ` Olof Johansson
  2008-03-10 22:06             ` Segher Boessenkool
  1 sibling, 1 reply; 28+ messages in thread
From: Luis Machado @ 2008-03-10 19:36 UTC (permalink / raw)
  To: Roland McGrath
  Cc: Jan Kratochvil, Paul Mackerras, Arnd Bergmann, linuxppc-dev

On Mon, 2008-03-10 at 12:19 -0700, Roland McGrath wrote:
> > On the Blade DABRX had to be set additional to DABR. PS3 and Celleb
> > already did this. Uli Weigand found this back in November. I submitted
> > a patch for this which went into 2.6.25-rc4.
> > Can you please try again with rc4 ?
> 
> This is not the problem.  This came up before and everyone seems to have
> forgotten.  This bug has been reproduced on G5's, which do not have DABRX
> as I understand it.

Yes, now that you mention it, I've been able to reproduce this on 970FX
blades, which I don't think have DABRX registers. I guess it's almost the
same CPU as the G5's.

Regards,

-- 
Luis Machado
Software Engineer 
IBM Linux Technology Center

* Re: PPC upstream kernel ignored DABR bug
  2008-03-10 19:36             ` Luis Machado
@ 2008-03-10 19:50               ` Olof Johansson
  2008-03-10 19:54                 ` Roland McGrath
  0 siblings, 1 reply; 28+ messages in thread
From: Olof Johansson @ 2008-03-10 19:50 UTC (permalink / raw)
  To: Luis Machado
  Cc: Paul Mackerras, Jan Kratochvil, Arnd Bergmann, Roland McGrath,
	linuxppc-dev

On Mon, Mar 10, 2008 at 04:36:37PM -0300, Luis Machado wrote:
> On Mon, 2008-03-10 at 12:19 -0700, Roland McGrath wrote:
> > > On the Blade DABRX had to be set additional to DABR. PS3 and Celleb
> > > already did this. Uli Weigand found this back in November. I submitted
> > > a patch for this which went into 2.6.25-rc4.
> > > Can you please try again with rc4 ?
> > 
> > This is not the problem.  This came up before and everyone seems to have
> > forgotten.  This bug has been reproduced on G5's, which do not have DABRX
> > as I understand it.
> 
> Yes, now that you mention it, I've been able to reproduce this on 970FX
> blades, which I don't think have DABRX registers. I guess it's almost the
> same CPU as the G5's.

Over its production run, what Apple called the G5 was actually three
different CPUs:

970
970FX
970MP

970 was only used in the very first models. 970MP was used in the last
(the models with PCI Express and up to 4 CPUs). 970FX was used in almost
everything else in between.


-Olof

* Re: PPC upstream kernel ignored DABR bug
  2008-03-10 19:50               ` Olof Johansson
@ 2008-03-10 19:54                 ` Roland McGrath
  0 siblings, 0 replies; 28+ messages in thread
From: Roland McGrath @ 2008-03-10 19:54 UTC (permalink / raw)
  To: Olof Johansson
  Cc: linuxppc-dev, Paul Mackerras, Jan Kratochvil, Arnd Bergmann

The G5 that I have says:

	cpu             : PPC970FX, altivec supported
	revision        : 3.0 (pvr 003c 0300)

and it does indeed reproduce this bug.

It is also strange for this to be the DABRX issue given the failure mode:
it works sometimes, but unreliably (as if the context switch sometimes
fails to install the value).


Thanks,
Roland

* Re: PPC upstream kernel ignored DABR bug
  2008-03-10 19:19           ` Roland McGrath
  2008-03-10 19:36             ` Luis Machado
@ 2008-03-10 22:06             ` Segher Boessenkool
  1 sibling, 0 replies; 28+ messages in thread
From: Segher Boessenkool @ 2008-03-10 22:06 UTC (permalink / raw)
  To: Roland McGrath
  Cc: linuxppc-dev, Jan Kratochvil, Paul Mackerras, Arnd Bergmann

>> On the Blade DABRX had to be set additional to DABR. PS3 and Celleb
>> already did this. Uli Weigand found this back in November. I submitted
>> a patch for this which went into 2.6.25-rc4.
>> Can you please try again with rc4 ?
>
> This is not the problem.  This came up before and everyone seems to have
> forgotten.  This bug has been reproduced on G5's, which do not have DABRX
> as I understand it.

970 (all versions) _does_ have a DABRX register.  Dunno if it has
the same register definition (I cannot find DABRX in the Cell docs).


Segher

* Re: PPC upstream kernel ignored DABR bug
  2008-03-10 14:01         ` Jens Osterkamp
  2008-03-10 15:13           ` Luis Machado
  2008-03-10 19:19           ` Roland McGrath
@ 2008-03-12 17:51           ` Luis Machado
  2008-03-12 22:30             ` Jens Osterkamp
  2 siblings, 1 reply; 28+ messages in thread
From: Luis Machado @ 2008-03-12 17:51 UTC (permalink / raw)
  To: Jens Osterkamp
  Cc: linuxppc-dev, Paul Mackerras, Roland McGrath, Arnd Bergmann,
	Jan Kratochvil

Hi,

> On the Blade DABRX had to be set additional to DABR. PS3 and Celleb
> already did this. Uli Weigand found this back in November. I submitted
> a patch for this which went into 2.6.25-rc4.
> Can you please try again with rc4 ?

> Regards,
> 
> Jens

Just to make sure, I tested the binary against the 2.6.25-rc4 kernel. It
still fails. So this is really an open bug for PPC.

-- 
Luis Machado
Software Engineer 
IBM Linux Technology Center

* Re: PPC upstream kernel ignored DABR bug
  2008-03-12 17:51           ` Luis Machado
@ 2008-03-12 22:30             ` Jens Osterkamp
  2008-03-13  1:47               ` Roland McGrath
  2008-03-13 13:13               ` Luis Machado
  0 siblings, 2 replies; 28+ messages in thread
From: Jens Osterkamp @ 2008-03-12 22:30 UTC (permalink / raw)
  To: luisgpm
  Cc: linuxppc-dev, Paul Mackerras, Roland McGrath, Arnd Bergmann,
	Jan Kratochvil


> Just to make sure, I tested the binary against the 2.6.25-rc4 kernel. It
> still fails. So this is really an open bug for PPC.

On a Cell- or 970-based machine ?

Regards,
	Jens

* Re: PPC upstream kernel ignored DABR bug
  2008-03-12 22:30             ` Jens Osterkamp
@ 2008-03-13  1:47               ` Roland McGrath
  2008-03-13 22:20                 ` Segher Boessenkool
  2008-03-26 20:57                 ` Josh Boyer
  2008-03-13 13:13               ` Luis Machado
  1 sibling, 2 replies; 28+ messages in thread
From: Roland McGrath @ 2008-03-13  1:47 UTC (permalink / raw)
  To: Jens Osterkamp
  Cc: linuxppc-dev, Jan Kratochvil, Paul Mackerras, Arnd Bergmann

AFAICT the DABRX register just has two global bits that enable paying
attention to the DABR register.  It only needs to be set once at boot time
(as the cell code does).  I don't see how missing that initialization could
ever have explained the behavior we see where DABR matches are intermittent.
If those DABRX bits weren't set then no DABR match would have happened.
(Apparently they are set before boot on an Apple G5.)

What we actually see is that DABR matches seem to be reliable when things
are slow, and get intermittent when there are enough threads with DABR set.

I searched the web trying to figure out what a DABRX register does so I
could just go try it myself rather than waiting another n months for powerpc
folks to forget about it again.  (I did try it, and 
	mtspr(SPRN_DABRX, DABRX_KERNEL | DABRX_USER);
makes no difference to the test on my machine, even done in set_dabr every
time we set SPRN_DABR.)

I happened across:

http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/79B6E24422AA101287256E93006C957E/$file/PowerPC_970FX_errata_DD3.X_V1.7.pdf

which is "IBM PowerPC 970FX RISC Microprocessor Errata List for DD3.X"
and contains "Erratum #8: DABRX register might not always be updated correctly":

	Projected Impact
	      The data address breakpoint function might not always work.
	Workaround
	      None.
	Status
	      A fix is not planned at this time for the PowerPC 970FX.

The only machine I have at home for testing powerpc is an Apple G5,
supplied to me by IBM.  It says:
	cpu             : PPC970FX, altivec supported
	revision        : 3.0 (pvr 003c 0300)
so I am guessing this document applies to the chips I have.  Since I can't
test on other chips myself, it is plausible from what I've seen that there
is no mysterious kernel problem and only this hardware problem.  The
description of the hardware problem would not make me think that it would
behave this way, but it is not very detailed or precise, or at least does
not seem so to a reader not expert on powerpc.
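The guess that the errata list applies comes down to reading the PVR shown in
/proc/cpuinfo: the upper 16 bits identify the processor (0x003c is the 970FX, as
the cpuinfo output above says) and the lower 16 bits the revision (0x0300, shown
as 3.0). A minimal sketch of that decoding; the 970 and 970MP constants are assumed
to match the Linux powerpc headers, and is_970fx_dd3() is a hypothetical helper,
not a kernel function:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* PVR version field values for the 970 family (assumed; verify
 * against asm/reg.h for your kernel). */
#define PVR_970    0x0039u
#define PVR_970FX  0x003Cu
#define PVR_970MP  0x0044u

/* Upper 16 bits: processor version. */
static unsigned pvr_version(uint32_t pvr)   { return pvr >> 16; }
/* Lower 16 bits: revision, printed as major.minor. */
static unsigned pvr_rev_major(uint32_t pvr) { return (pvr >> 8) & 0xffu; }
static unsigned pvr_rev_minor(uint32_t pvr) { return pvr & 0xffu; }

/* Hypothetical check: is this a 970FX at revision DD3.x, i.e. covered
 * by the "970FX Errata List for DD3.X" document? */
static bool is_970fx_dd3(uint32_t pvr)
{
    return pvr_version(pvr) == PVR_970FX && pvr_rev_major(pvr) == 3;
}
```

For the G5 above, the PVR 0x003c0300 decodes to version 0x003c, revision 3.0.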

So, uh, go IBM!

I'm in the minority in this conversation as someone not expert on powerpc,
and as someone not employed by IBM.  (I don't really mind finding public IBM
documents about powerpc on the web and telling IBM powerpc folks about them.
But, well.)

I don't know what I can do next to tell whether this processor erratum is in
fact what's happening in the test case.  If it is, I don't know if there
might be some arcane way to work around it despite "None" cited above.


Thanks,
Roland

* Re: PPC upstream kernel ignored DABR bug
  2008-03-12 22:30             ` Jens Osterkamp
  2008-03-13  1:47               ` Roland McGrath
@ 2008-03-13 13:13               ` Luis Machado
  1 sibling, 0 replies; 28+ messages in thread
From: Luis Machado @ 2008-03-13 13:13 UTC (permalink / raw)
  To: Jens Osterkamp
  Cc: linuxppc-dev, Paul Mackerras, Roland McGrath, Arnd Bergmann,
	Jan Kratochvil

On Wed, 2008-03-12 at 23:30 +0100, Jens Osterkamp wrote:
> > Just to make sure, I tested the binary against the 2.6.25-rc4 kernel. It
> > still fails. So this is really an open bug for PPC.
> 
> On a Cell- or 970-based machine ?
> 
> Regards,
> 	Jens

On a 970-based machine.

Regards,

-- 
Luis Machado
Software Engineer 
IBM Linux Technology Center

* Re: PPC upstream kernel ignored DABR bug
  2008-03-13  1:47               ` Roland McGrath
@ 2008-03-13 22:20                 ` Segher Boessenkool
  2008-03-13 22:42                   ` Roland McGrath
  2008-03-16 20:37                   ` Benjamin Herrenschmidt
  2008-03-26 20:57                 ` Josh Boyer
  1 sibling, 2 replies; 28+ messages in thread
From: Segher Boessenkool @ 2008-03-13 22:20 UTC (permalink / raw)
  To: Roland McGrath
  Cc: linuxppc-dev, Jan Kratochvil, Paul Mackerras, Arnd Bergmann

> AFAICT the DABRX register just has two global bits that enable paying
> attention to the DABR register.

It has four bits:

	01	match in user mode
	02	match in supervisor mode
	04	match in hypervisor mode
	08	ignore translation field in DABR

If the kernel can write to DABRX, it is running in hypervisor mode, so
it should set 07 instead of 03 (as it currently does) if it wants to
match in kernel mode; or 01, if it doesn't.

OTOH, the Apple version of the 970 is special (it has no separate
hypervisor mode); still, 07 should always work.
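Spelled out as a bitmask, the recommendation above is: 01 alone for user-only
matching, or 01|02|04 = 07 for user plus kernel matching when the kernel runs in
hypervisor mode; the current 03 lacks the hypervisor bit. A minimal sketch, assuming
a kernel running in hypervisor mode as described; the macro names are illustrative,
not taken from the kernel:

```c
#include <assert.h>

/* DABRX bits as listed above (illustrative names). */
#define DABRX_MATCH_USER        0x1u  /* match in user mode */
#define DABRX_MATCH_SUPERVISOR  0x2u  /* match in supervisor mode */
#define DABRX_MATCH_HYPERVISOR  0x4u  /* match in hypervisor mode */
#define DABRX_IGNORE_XLATE      0x8u  /* ignore translation field in DABR */

/* Value to program: user-only matching, or user + kernel matching,
 * where "kernel" needs both supervisor and hypervisor bits. */
static unsigned dabrx_value(int match_in_kernel)
{
    unsigned v = DABRX_MATCH_USER;
    if (match_in_kernel)
        v |= DABRX_MATCH_SUPERVISOR | DABRX_MATCH_HYPERVISOR;
    return v;
}
```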

> It only needs to be set once at boot time
> (as the cell code does).  I don't see how missing that initialization could
> ever have explained the behavior we see where DABR matches are intermittent.
> If those DABRX bits weren't set then no DABR match would have happened.
> (Apparently they are set before boot on an Apple G5.)

I don't see the Apple boot code initialising DABRX; maybe the bootup state
for DABRX is 07, dunno.  Either way, it would be good if the kernel set it
properly, esp. if it wants to enable or disable matches in the kernel itself.

> What we actually see is that DABR matches seem to be reliable when things
> are slow, and get intermittent when there are enough threads with DABR set.

> I happened across:
>
> http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/79B6E24422AA101287256E93006C957E/$file/PowerPC_970FX_errata_DD3.X_V1.7.pdf
>
> which is "IBM PowerPC 970FX RISC Microprocessor Errata List for DD3.X"
> and contains "Erratum #8: DABRX register might not always be updated  
> correctly":

> The only machine I have at home for testing powerpc is an Apple G5,
> supplied to me by IBM.  It says:
> 	cpu             : PPC970FX, altivec supported
> 	revision        : 3.0 (pvr 003c 0300)
> so I am guessing this document applies to the chips I have.

Indeed.

> Since I can't
> test on other chips myself, it is plausible from what I've seen that there
> is no mysterious kernel problem and only this hardware problem.  The
> description of the hardware problem would not make me think that it would
> behave this way, but it is not very detailed or precise, or at least does
> not seem so to a reader not expert on powerpc.

Since the 970 kernel never sets DABRX currently, #8 cannot explain
_intermittent_ problems: either it always works, or never does.

You could be happening upon #5, if the non-triggering data breakpoints
are with vector loads/stores in strange code.

> I don't know what I can do next to tell whether this processor erratum is in
> fact what's happening in the test case.  If it is, I don't know if there
> might be some arcane way to work around it despite "None" cited above.

It would help if you could give us the disassembly of some code where the
breakpoint did not trigger; say, that insn and the previous 20 or so insns.


Segher

* Re: PPC upstream kernel ignored DABR bug
  2008-03-13 22:20                 ` Segher Boessenkool
@ 2008-03-13 22:42                   ` Roland McGrath
  2008-03-14  2:11                     ` Segher Boessenkool
  2008-03-16 20:37                   ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 28+ messages in thread
From: Roland McGrath @ 2008-03-13 22:42 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: linuxppc-dev, Jan Kratochvil, Paul Mackerras, Arnd Bergmann

> Since the 970 kernel never sets DABRX currently, #8 cannot explain
> _intermittent_ problems: either it always works, or never does.

That's kind of what I thought, but I couldn't make enough sense of
the #8 text to be very sure.

> You could be happening upon #5, if the non-triggering data breakpoints
> are with vector loads/stores in strange code.

They are not.

> It would help if you could give us the disassembly of some code where the
> breakpoint did not trigger; say, that insn and the previous 20 or so insns.

The pointer to the test case was given here before.

http://sources.redhat.com/cgi-bin/cvsweb.cgi/~checkout~/tests/ptrace-tests/tests/ppc-dabr-race.c?cvsroot=systemtap

-m32	Dump of assembler code for function child_thread:
	0x10000950 <child_thread+0>:    stwu    r1,-32(r1)
	0x10000954 <child_thread+4>:    li      r3,207
	0x10000958 <child_thread+8>:    mflr    r0
	0x1000095c <child_thread+12>:   stw     r29,20(r1)
	0x10000960 <child_thread+16>:   stw     r0,36(r1)
	0x10000964 <child_thread+20>:   crclr   4*cr1+eq
	0x10000968 <child_thread+24>:   bl      0x10001680 <syscall>
	0x1000096c <child_thread+28>:   lis     r11,4097
	0x10000970 <child_thread+32>:   mr      r29,r3
	0x10000974 <child_thread+36>:   li      r3,1
	0x10000978 <child_thread+40>:   lwz     r9,7800(r11)
	0x1000097c <child_thread+44>:   addi    r9,r9,1
	0x10000980 <child_thread+48>:   stw     r9,7800(r11)
	0x10000984 <child_thread+52>:   bl      0x10001750 <sleep>
	0x10000988 <child_thread+56>:   lis     r9,4097
--->	0x1000098c <child_thread+60>:   stw     r29,7792(r9)
	0x10000990 <child_thread+64>:   bl      0x10001760 <pause>
	0x10000994 <child_thread+68>:   bl      0x10001760 <pause>
	0x10000998 <child_thread+72>:   b       0x10000990 <child_thread+64>
	End of assembler dump.

-m64	Dump of assembler code for function child_thread:
	0x0000000010000d10 <child_thread+0>:    mflr    r0
	0x0000000010000d14 <child_thread+4>:    std     r29,-24(r1)
	0x0000000010000d18 <child_thread+8>:    li      r3,207
	0x0000000010000d1c <child_thread+12>:   std     r0,16(r1)
	0x0000000010000d20 <child_thread+16>:   stdu    r1,-144(r1)
	0x0000000010000d24 <child_thread+20>:   bl      0x10000b68
	0x0000000010000d28 <child_thread+24>:   ld      r2,40(r1)
	0x0000000010000d2c <child_thread+28>:   ld      r11,-32696(r2)
	0x0000000010000d30 <child_thread+32>:   mr      r29,r3
	0x0000000010000d34 <child_thread+36>:   li      r3,1
	0x0000000010000d38 <child_thread+40>:   extsw   r29,r29
	0x0000000010000d3c <child_thread+44>:   lwz     r9,0(r11)
	0x0000000010000d40 <child_thread+48>:   addi    r9,r9,1
	0x0000000010000d44 <child_thread+52>:   clrldi  r9,r9,32
	0x0000000010000d48 <child_thread+56>:   stw     r9,0(r11)
	0x0000000010000d4c <child_thread+60>:   bl      0x10000a88
	0x0000000010000d50 <child_thread+64>:   ld      r2,40(r1)
	0x0000000010000d54 <child_thread+68>:   ld      r9,-32688(r2)
--->	0x0000000010000d58 <child_thread+72>:   std     r29,0(r9)
	0x0000000010000d5c <child_thread+76>:   nop
	0x0000000010000d60 <child_thread+80>:   bl      0x100009a8
	0x0000000010000d64 <child_thread+84>:   ld      r2,40(r1)
	0x0000000010000d68 <child_thread+88>:   b       0x10000d60 <child_thread+80>
	0x0000000010000d6c <child_thread+92>:   .long 0x0
	0x0000000010000d70 <child_thread+96>:   .long 0x1
	0x0000000010000d74 <child_thread+100>:  lwz     r0,0(r3)
	End of assembler dump.
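In the -m32 dump above, addresses are formed with `lis` (a 16-bit immediate placed in the upper halfword) plus the signed 16-bit displacement of the load or store, so they can be worked out by hand. `lis r9,4097` gives 0x10010000, so the marked `stw r29,7792(r9)` at child_thread+60 stores to 0x10011e70. The `lwz`/`stw` pair at +40/+48 touches 0x10011e78, which is a different doubleword. The sketch below does that arithmetic and the doubleword comparison a DABR performs. The function names are hypothetical, and the sketch models only the address match, not the read/write/translation qualifiers.

```c
#include <stdint.h>

/* EA of "lis rX,hi" followed by a D-form access "op rY,d(rX)", 32-bit mode. */
uint32_t lis_d_form_ea(uint16_t hi, int16_t d)
{
	uint32_t base = (uint32_t)hi << 16;  /* lis: hi into the upper halfword */
	return base + (uint32_t)(int32_t)d;  /* displacement is sign-extended */
}

/* A DABR matches on the doubleword: compare all but the low 3 bits. */
int dabr_addr_match(uint32_t dabr, uint32_t ea)
{
	return (dabr & ~UINT32_C(7)) == (ea & ~UINT32_C(7));
}
```

With a DABR of 0x10011e77 (a hypothetical value for this build), `dabr_addr_match` is true for the store at +60 and false for the counter access at +48.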


Thanks,
Roland

* Re: PPC upstream kernel ignored DABR bug
  2008-03-13 22:42                   ` Roland McGrath
@ 2008-03-14  2:11                     ` Segher Boessenkool
  2008-03-14  7:45                       ` Roland McGrath
  0 siblings, 1 reply; 28+ messages in thread
From: Segher Boessenkool @ 2008-03-14  2:11 UTC (permalink / raw)
  To: Roland McGrath
  Cc: linuxppc-dev, Jan Kratochvil, Paul Mackerras, Arnd Bergmann

> The pointer to the test case was given here before.

Oh, I missed that.  Anyway, I wanted to see the asm, and who knows,
with different compiler versions and all that.

> 	0x10000984 <child_thread+52>:   bl      0x10001750 <sleep>
> 	0x10000988 <child_thread+56>:   lis     r9,4097
> --->	0x1000098c <child_thread+60>:   stw     r29,7792(r9)

> 	0x0000000010000d4c <child_thread+60>:   bl      0x10000a88
> 	0x0000000010000d50 <child_thread+64>:   ld      r2,40(r1)
> 	0x0000000010000d54 <child_thread+68>:   ld      r9,-32688(r2)
> --->	0x0000000010000d58 <child_thread+72>:   std     r29,0(r9)

In both these cases, the storage access goes to LSU0, so you're
not hitting the errata.

I noticed set_dabr() doesn't do proper synchronisation insns, could
you try this patch?  I doubt it helps, but it changes the code to do
"the right thing".


diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 4846bf5..ee925f5 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -250,7 +250,9 @@ int set_dabr(unsigned long dabr)

 	/* XXX should we have a CPU_FTR_HAS_DABR ? */
 #if defined(CONFIG_PPC64) || defined(CONFIG_6xx)
+	asm("sync");
 	mtspr(SPRN_DABR, dabr);
+	asm("isync");
 #endif
 	return 0;
 }


(badly copy/pasted, please apply by hand.  Will send a real patch later ;-) )

If this doesn't help, and the failures stay intermittent, I don't think
there is a close-to-the-hardware problem here.


Segher

* Re: PPC upstream kernel ignored DABR bug
  2008-03-14  2:11                     ` Segher Boessenkool
@ 2008-03-14  7:45                       ` Roland McGrath
  2008-03-14  8:42                         ` Segher Boessenkool
  0 siblings, 1 reply; 28+ messages in thread
From: Roland McGrath @ 2008-03-14  7:45 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: linuxppc-dev, Jan Kratochvil, Paul Mackerras, Arnd Bergmann

> In both these cases, the storage access goes to LSU0, so you're
> not hitting the errata.

I'll take your word for it.

> If this doesn't help, and the failures stay intermittent, I don't think
> there is a close-to-the-hardware problem here.

I saw no effect from that change.  So now we're back to pure mystery, I guess.


Thanks,
Roland

* Re: PPC upstream kernel ignored DABR bug
  2008-03-14  7:45                       ` Roland McGrath
@ 2008-03-14  8:42                         ` Segher Boessenkool
  2008-03-16 20:38                           ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 28+ messages in thread
From: Segher Boessenkool @ 2008-03-14  8:42 UTC (permalink / raw)
  To: Roland McGrath
  Cc: linuxppc-dev, Jan Kratochvil, Paul Mackerras, Arnd Bergmann

>> If this doesn't help, and the failures stay intermittent, I don't think
>> there is a close-to-the-hardware problem here.
>
> I saw no effect from that change.  So now we're back to pure mystery, I guess.

Hey, we know something now: it's "just" a problem in the kernel :-)


Segher

* Re: PPC upstream kernel ignored DABR bug
  2008-03-13 22:20                 ` Segher Boessenkool
  2008-03-13 22:42                   ` Roland McGrath
@ 2008-03-16 20:37                   ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 28+ messages in thread
From: Benjamin Herrenschmidt @ 2008-03-16 20:37 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: linuxppc-dev, Paul Mackerras, Jan Kratochvil, Arnd Bergmann,
	Roland McGrath


> Since the 970 kernel never sets DABRX currently, #8 cannot explain
> _intermittent_ problems: either it always works, or never does.

Uh... could be the boot code setting it, the setting happening on LSU0
but not LSU1. No ?

Ben.

* Re: PPC upstream kernel ignored DABR bug
  2008-03-14  8:42                         ` Segher Boessenkool
@ 2008-03-16 20:38                           ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 28+ messages in thread
From: Benjamin Herrenschmidt @ 2008-03-16 20:38 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: linuxppc-dev, Paul Mackerras, Jan Kratochvil, Arnd Bergmann,
	Roland McGrath


On Fri, 2008-03-14 at 09:42 +0100, Segher Boessenkool wrote:
> > I saw no effect from that change.  So now we're back to pure mystery,
> > I guess.
> 
> Hey, we know something now: it's "just" a problem in the kernel :-)

We don't know that for sure. The DABR context switching code is trivial
enough...
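The switch logic described as trivial here can be modelled in a few lines. What follows is a minimal user-space model, not a copy of the kernel's `__switch_to`. It assumes each CPU remembers the last DABR value it wrote and issues the `mtspr` only when the incoming thread's value differs. All names (`struct thread_model`, `struct cpu_model`, `model_switch_to`, `hw_dabr`) are hypothetical.

```c
/* Per-thread state in this toy model. */
struct thread_model {
	unsigned long dabr;        /* value the debugger asked for, 0 = none */
};

/* Per-CPU state in this toy model. */
struct cpu_model {
	unsigned long hw_dabr;     /* stands in for the real DABR SPR */
	unsigned long cached_dabr; /* last value this CPU believes it wrote */
	unsigned long writes;      /* number of simulated mtspr writes */
};

/* Stand-in for set_dabr(): update the simulated register. */
static void model_set_dabr(struct cpu_model *cpu, unsigned long dabr)
{
	cpu->hw_dabr = dabr;
	cpu->writes++;
}

/* Switch cpu to thread next, skipping the write when the cache matches.
 * This is only correct while cached_dabr mirrors hw_dabr. */
void model_switch_to(struct cpu_model *cpu, const struct thread_model *next)
{
	if (cpu->cached_dabr != next->dabr) {
		model_set_dabr(cpu, next->dabr);
		cpu->cached_dabr = next->dabr;
	}
}
```

In this model, if something rewrites the register without updating the cache, later switches to a thread with the same cached value will not restore it.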

Ben.

* Re: PPC upstream kernel ignored DABR bug
  2008-03-13  1:47               ` Roland McGrath
  2008-03-13 22:20                 ` Segher Boessenkool
@ 2008-03-26 20:57                 ` Josh Boyer
  2008-03-27  1:47                   ` Josh Boyer
  1 sibling, 1 reply; 28+ messages in thread
From: Josh Boyer @ 2008-03-26 20:57 UTC (permalink / raw)
  To: Roland McGrath
  Cc: Arnd Bergmann, Jan Kratochvil, linuxppc-dev, Paul Mackerras

On Wed, 12 Mar 2008 18:47:45 -0700 (PDT)
Roland McGrath <roland@redhat.com> wrote:
 
> The only machine I have at home for testing powerpc is an Apple G5,
> supplied to me by IBM.  It says:
> 	cpu             : PPC970FX, altivec supported
> 	revision        : 3.0 (pvr 003c 0300)
> so I am guessing this document applies to the chips I have.  Since I can't
> test on other chips myself, it is plausible from what I've seen that there
> is no mysterious kernel problem and only this hardware problem.  The
> description of the hardware problem would not make me think that it would
> behave this way, but it is not very detailed or precise, or at least does
> not seem so to a reader not expert on powerpc.

I ran the testcase on my older G5 today with:

cpu             : PPC970, altivec supported
revision        : 2.2 (pvr 0039 0202)

and it also failed after a few iterations.  This was with
2.6.25-0.121.rc5.git4.fc9 as the kernel, which is fairly close to
mainline.  At the least, this doesn't seem to be 970FX related.  I'll try
building a vanilla 2.6.25-rc7 later this evening to see if that makes a
difference.
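The `revision` field in /proc/cpuinfo is decoded from the low halfword of the PVR, with the major number in its high byte and the minor in its low byte. The high halfword identifies the processor. That split is how the two machines in this thread can be told apart: `pvr 003c 0300` is a 970FX at revision 3.0, and `pvr 0039 0202` is a 970 at 2.2. Below is a small sketch of that decoding for these chips. The struct and function names are hypothetical.

```c
#include <stdint.h>

struct pvr_fields {
	uint16_t version; /* processor family, e.g. 0x0039 (970), 0x003c (970FX) */
	uint8_t major;    /* revision major: high byte of the low halfword */
	uint8_t minor;    /* revision minor: low byte of the low halfword */
};

/* Split a 32-bit PVR into family and major.minor revision. */
struct pvr_fields decode_pvr(uint32_t pvr)
{
	struct pvr_fields f;

	f.version = (uint16_t)(pvr >> 16);
	f.major = (uint8_t)((pvr >> 8) & 0xff);
	f.minor = (uint8_t)(pvr & 0xff);
	return f;
}
```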

josh

* Re: PPC upstream kernel ignored DABR bug
  2008-03-26 20:57                 ` Josh Boyer
@ 2008-03-27  1:47                   ` Josh Boyer
  0 siblings, 0 replies; 28+ messages in thread
From: Josh Boyer @ 2008-03-27  1:47 UTC (permalink / raw)
  To: Josh Boyer
  Cc: Arnd Bergmann, linuxppc-dev, Paul Mackerras, Jan Kratochvil,
	Roland McGrath

On Wed, 26 Mar 2008 15:57:32 -0500
Josh Boyer <jwboyer@linux.vnet.ibm.com> wrote:

> On Wed, 12 Mar 2008 18:47:45 -0700 (PDT)
> Roland McGrath <roland@redhat.com> wrote:
> 
> > The only machine I have at home for testing powerpc is an Apple G5,
> > supplied to me by IBM.  It says:
> > 	cpu             : PPC970FX, altivec supported
> > 	revision        : 3.0 (pvr 003c 0300)
> > so I am guessing this document applies to the chips I have.  Since I can't
> > test on other chips myself, it is plausible from what I've seen that there
> > is no mysterious kernel problem and only this hardware problem.  The
> > description of the hardware problem would not make me think that it would
> > behave this way, but it is not very detailed or precise, or at least does
> > not seem so to a reader not expert on powerpc.
> 
> I ran the testcase on my older G5 today with:
> 
> cpu             : PPC970, altivec supported
> revision        : 2.2 (pvr 0039 0202)
> 
> and it also failed after a few iterations.  This was with
> 2.6.25-0.121.rc5.git4.fc9 as the kernel, which is fairly close to
> mainline.  At the least, this doesn't seem to be 970FX related.  I'll try
> building a vanilla 2.6.25-rc7 later this evening to see if that makes a
> difference.

Still failed with a -vanilla build of 2.6.25-rc7.

josh

end of thread, other threads:[~2008-03-27  1:49 UTC | newest]

Thread overview: 28+ messages
2007-11-26 22:02 PPC upstream kernel ignored DABR bug Jan Kratochvil
2007-11-27 22:35 ` Arnd Bergmann
2007-11-28  8:59   ` Jan Kratochvil
2007-11-28 12:28     ` Arnd Bergmann
2007-11-28 12:45       ` Jan Kratochvil
2007-11-28 22:59   ` Geoff Levand
2007-11-29  0:13     ` Arnd Bergmann
2008-03-10  0:53       ` Luis Machado
2008-03-10 14:01         ` Jens Osterkamp
2008-03-10 15:13           ` Luis Machado
2008-03-10 19:19           ` Roland McGrath
2008-03-10 19:36             ` Luis Machado
2008-03-10 19:50               ` Olof Johansson
2008-03-10 19:54                 ` Roland McGrath
2008-03-10 22:06             ` Segher Boessenkool
2008-03-12 17:51           ` Luis Machado
2008-03-12 22:30             ` Jens Osterkamp
2008-03-13  1:47               ` Roland McGrath
2008-03-13 22:20                 ` Segher Boessenkool
2008-03-13 22:42                   ` Roland McGrath
2008-03-14  2:11                     ` Segher Boessenkool
2008-03-14  7:45                       ` Roland McGrath
2008-03-14  8:42                         ` Segher Boessenkool
2008-03-16 20:38                           ` Benjamin Herrenschmidt
2008-03-16 20:37                   ` Benjamin Herrenschmidt
2008-03-26 20:57                 ` Josh Boyer
2008-03-27  1:47                   ` Josh Boyer
2008-03-13 13:13               ` Luis Machado
