* [Linux-ia64] SIGILL errors in strncpu (NAT consumption)
@ 2002-04-02  3:11 Jack Steiner
  2002-04-02  3:46 ` Jack Steiner
                   ` (12 more replies)
  0 siblings, 13 replies; 14+ messages in thread
From: Jack Steiner @ 2002-04-02  3:11 UTC (permalink / raw)
  To: linux-ia64

Has anyone seen random SIGILL failures in the strncpy
function in glibc-2.2.4-19.3?

The failure is caused by a NAT consumption fault in the 
code sequence shown below. 

I'm still analyzing the failure, but it _appears_ that the failure 
occurs:
	- if a VHPT fault occurs at <strncpy+450>
	- then a NAT consumption occurs at <strncpy+560>

(preliminary analysis - it may be more complicated than this)

In the failing case, neither source nor destination crosses or is
near a page boundary. Source address is in reg 1, dest is on the stack in
reg 4. Length is 25 bytes.

If no one else has seen this failure, I'll gather more information
about it & try to create a simple failing test case.


We are running 2.4.17 with B0 stepping Itanium.

Note:
	rotating registers/predicates
	speculative loads


	....
	<strncpy+416>:       [MIB] (p16) ld8.s r32=[r20],8
	<strncpy+417>:             (p18) chk.s.i r34,0x20000000001f8c90 <strncpy+944>
	<strncpy+418>:                   nop.b 0x0
	<strncpy+432>:       [MII] (p18) mov r31=r34
	<strncpy+433>:             (p18) czx1.r r24=r34;;
	<strncpy+434>:             (p18) cmp.eq p0,p7=8,r24
	<strncpy+448>:       [MFB] (p18) adds r21=-8,r21
	<strncpy+449>:                   nop.f 0x0
	<strncpy+450>:             (p07) br.cond.dpnt.few 0x20000000001f8b40 <strncpy+608>

	<strncpy+464>:       [MBB] (p18) st8 [r18]=r34,8		<<<--------- if VHPT occurs here

	<strncpy+465>:                   nop.b 0x0
	<strncpy+466>:                   br.ctop.dptk.few 0x20000000001f8a80 <strncpy+416>;;

	<strncpy+480>:       [MFB]       chk.s.m r33,0x20000000001f8cb0 <strncpy+976>
	<strncpy+481>:                   nop.f 0x0
	<strncpy+482>:                   nop.b 0x0
	<strncpy+496>:       [MFB]       mov r31=r33
	<strncpy+497>:                   nop.f 0x0
	<strncpy+498>:                   nop.b 0x0
	<strncpy+512>:       [MIB]       cmp.eq p5,p6=r21,r0
	<strncpy+513>:                   adds r21=-1,r21
	<strncpy+514>:             (p05) br.cond.dptk.few 0x20000000001f8bf0 <strncpy+784>;;

	<strncpy+528>:       [MFI]       nop.m 0x0
	<strncpy+529>:                   nop.f 0x0
	<strncpy+530>:                   mov.i ar.lc=r21

	<strncpy+544>:       [MII]       nop.m 0x0
	<strncpy+545>:             (p06) extr.u r27=r31,0,8
	<strncpy+546>:             (p06) shr.u r31=r31,8;;

	<strncpy+560>:       [MIB]       st1 [r18]=r27,1     <<<<<<<<<<<<<<<<<<<< fails here
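
To make the failure mode in the listing easier to follow, here is a minimal C
sketch of how a NaT bit travels from a speculative load to a faulting store.
It is a toy model, not glibc or kernel code: the type gr_t and the helpers
spec_load8, chk_s, gr_mov and store8 are names invented for this
illustration, and a NULL address stands in for "the load would have faulted".
Whether a faulting ld8.s is actually deferred depends on processor and OS
configuration; the sketch simply assumes it is.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of a general register: 64 data bits plus a NaT bit. */
typedef struct {
    uint64_t value;
    bool nat;
} gr_t;

/* Models ld8.s: a load that would fault (here: NULL address) does not
 * trap; it sets the NaT bit of the target register instead. */
static gr_t spec_load8(const uint64_t *addr)
{
    gr_t r = { 0, false };
    if (addr == NULL)
        r.nat = true;
    else
        r.value = *addr;
    return r;
}

/* Models chk.s: returns true when control should go to recovery code. */
static bool chk_s(gr_t r)
{
    return r.nat;
}

/* Models mov/extr.u/shr.u: ordinary ALU operations copy the NaT bit
 * from source to destination. */
static gr_t gr_mov(gr_t src)
{
    return src;
}

/* Models st1/st8: a non-speculative store of a register whose NaT bit
 * is set raises a NaT consumption fault (returned here as false). */
static bool store8(uint64_t *dst, gr_t src)
{
    if (src.nat)
        return false;
    *dst = src.value;
    return true;
}
```

In terms of the listing, a NaT that is not repaired by the recovery code
reached from chk.s is copied by "mov r31=r33" and "extr.u r27=r31,0,8" and
only becomes visible when "st1 [r18]=r27,1" consumes it.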





-- 
Thanks

Jack Steiner (steiner@sgi.com)          651-683-5302
Principal Engineer                      Core OS/Strategic Software Org
SGI - Silicon Graphics, Inc.            Eagan, MN




* Re: [Linux-ia64] SIGILL errors in strncpu (NAT consumption)
  2002-04-02  3:11 [Linux-ia64] SIGILL errors in strncpu (NAT consumption) Jack Steiner
@ 2002-04-02  3:46 ` Jack Steiner
  2002-04-03 21:29 ` Erich Focht
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Jack Steiner @ 2002-04-02  3:46 UTC (permalink / raw)
  To: linux-ia64

> 
> On Mon, 1 Apr 2002 21:11:48 -0600 (CST), 
> Jack Steiner <steiner@sgi.com> wrote:
> >Has anyone seen random SIGILL failures in the strncpy
> >function in glibc-2.2.4-19.3?
> >We are running 2.4.17 with B0 stepping Itanium.
> 
> B0 or C0?
> 

> B0 or C0?
 
Whoops. Wrong version. I dont have the steppings chart at home but:

        family     : Itanium
        model      : 0
        revision   : 6
        cpu MHz    : 799.942992



-- 
Thanks

Jack Steiner    (651-683-5302)   (vnet 233-5302)      steiner@sgi.com




* Re: [Linux-ia64] SIGILL errors in strncpu (NAT consumption)
  2002-04-02  3:11 [Linux-ia64] SIGILL errors in strncpu (NAT consumption) Jack Steiner
  2002-04-02  3:46 ` Jack Steiner
@ 2002-04-03 21:29 ` Erich Focht
  2002-04-03 21:43 ` Jack Steiner
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Erich Focht @ 2002-04-03 21:29 UTC (permalink / raw)
  To: linux-ia64

> Has anyone seen random SIGILL failures in the strncpy
> function in glibc-2.2.4-19.3?
>
> The failure is caused by a NAT consumption fault in the 
> code sequence shown below. 

I've seen NaT consumption faults with an ISV application in a loop
involving speculative loads, too. It was very hard to trace back (occurred
at different simulation times in a huge case) and disappeared after we
rewrote the loop and used the latest Intel Fortran compiler. It occurred
with both B3 and C0 CPUs under 2.4.7 and 2.4.17. I don't have a testcase
for this, sorry.

Regards,
Erich





* Re: [Linux-ia64] SIGILL errors in strncpu (NAT consumption)
  2002-04-02  3:11 [Linux-ia64] SIGILL errors in strncpu (NAT consumption) Jack Steiner
  2002-04-02  3:46 ` Jack Steiner
  2002-04-03 21:29 ` Erich Focht
@ 2002-04-03 21:43 ` Jack Steiner
  2002-04-03 22:10 ` David Mosberger
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Jack Steiner @ 2002-04-03 21:43 UTC (permalink / raw)
  To: linux-ia64

I isolated the strncpy problem to a simple test program. It fails
with the new glibc-2.2.4-19.3 within a few seconds.

Works fine with older versions of glibc.




David Mosberger took a look at the strncpy code & spotted
the error:

From David:
>> I took a closer look and there seem to be several bugs in the routine:
>> 
>>  (1) I don't think it's safe to do:
>> 
>>                 chk.s r[MEMLAT], .recovery3
>>                 mov value = r[MEMLAT]
>> 
>>      in the same cycle.  In the patch below, I fixed this by adding a
>>      stop bit, but obviously it would be better to avoid that (either
>>      by re-ordering the code or by adding a pipeline stage).
>> 
>>  (2) stop bit was missing after br.cloop.dptk
>> 
>>  (3) off-by-one error in .recovery4 code: the destination should be
>>      r[MEMLAT-1], not r[MEMLAT]
>> 
>>  (4) I believe the address calculation in .recovery3 and .recovery4 may
>>      also be off by 8; this is just based on eye-balling the code though,
>>      so I may be wrong
>> 
>> Hope this helps,
>> 
>>         --david
>> 


---- 
Test case - run ~12 copies of this in parallel.

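#include <stdio.h>
#include <stdlib.h>     /* exit */
#include <signal.h>
#include <string.h>
#include <time.h>
#include <unistd.h>     /* getpid */

char *dest, *src;

/* Report the failing process and the buffers involved, then exit. */
void
sigill_handler(int sig)
{
        (void)sig;
        fprintf(stderr,"SIGILL: pid %d, dest 0x%lx, src 0x%lx\n",
                (int)getpid(), (unsigned long)dest, (unsigned long)src);
        exit(1);
}

int
main(void) {
  time_t temp1;
  char buffer[1024];

  signal(SIGILL, sigill_handler);
  
  time(&temp1);
  src = ctime(&temp1);  /* 25-character string, same length as the failing case */

  dest = buffer;

  printf("%lu\n", (unsigned long)strlen(src));

  /* Copy the same short string forever until the SIGILL shows up. */
  while(1)
      strncpy(buffer,src,strlen(src));
}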
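
One possible way to run several copies in parallel is a small launcher such
as the sketch below. It is not part of the original test case: run_copies is
a hypothetical helper and the limit of 64 children is arbitrary. It starts
the requested number of copies of a program, waits for the first one to
terminate, kills the rest and returns that first wait status.

```c
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_COPIES 64  /* arbitrary upper bound for this sketch */

/* Start `copies` instances of argv[0]. Returns the wait status of the
 * first child to terminate, or -1 on error. Other children are killed. */
static int run_copies(char *const argv[], int copies)
{
    pid_t pids[MAX_COPIES];
    int started = 0;
    int status = -1;

    if (copies < 1 || copies > MAX_COPIES)
        return -1;

    for (int i = 0; i < copies; i++) {
        pid_t pid = fork();
        if (pid < 0)
            break;                 /* keep whatever was started so far */
        if (pid == 0) {
            execv(argv[0], argv);
            _exit(127);            /* only reached if execv failed */
        }
        pids[started++] = pid;
    }
    if (started == 0)
        return -1;

    /* Block until the first child terminates (for the test case, when
     * its SIGILL handler calls exit(1)). */
    pid_t done = wait(&status);

    /* Stop the remaining copies and reap them. */
    for (int i = 0; i < started; i++)
        if (pids[i] != done)
            kill(pids[i], SIGKILL);
    while (wait(NULL) > 0)
        ;

    return done < 0 ? -1 : status;
}
```

Assuming the test case was built as ./strncpy_test (a name chosen here for
illustration), a caller would pass an argv of { "./strncpy_test", NULL } with
copies set to 12 and then inspect the result with WIFEXITED and WEXITSTATUS;
an exit status of 1 corresponds to the SIGILL handler having run.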


-- 
Thanks

Jack Steiner    (651-683-5302)   (vnet 233-5302)      steiner@sgi.com




* Re: [Linux-ia64] SIGILL errors in strncpu (NAT consumption)
  2002-04-02  3:11 [Linux-ia64] SIGILL errors in strncpu (NAT consumption) Jack Steiner
                   ` (2 preceding siblings ...)
  2002-04-03 21:43 ` Jack Steiner
@ 2002-04-03 22:10 ` David Mosberger
  2002-04-04  8:36 ` Francois-Xavier Kowalski
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: David Mosberger @ 2002-04-03 22:10 UTC (permalink / raw)
  To: linux-ia64

>>>>> On Wed, 3 Apr 2002 23:29:58 +0200 (MEST), Erich Focht <efocht@ess.nec.de> said:

  Erich> I've seen NaT consumption faults with an ISV application in a
  Erich> loop involving speculative loads, too. It was very hard to
  Erich> trace back (occurred at different simulation times in a huge
  Erich> case) and disappeared after we rewrote the loop and used the
  Erich> latest Intel Fortran compiler. It occurred with both B3 and C0
  Erich> CPUs under 2.4.7 and 2.4.17. I don't have a testcase for
  Erich> this, sorry.

It's due to a glibc bug that was introduced last August when strncpy()
was rewritten.  I sent a bug report (and preliminary patch) to the
author and am waiting to hear back.

	--david



* Re: [Linux-ia64] SIGILL errors in strncpu (NAT consumption)
  2002-04-02  3:11 [Linux-ia64] SIGILL errors in strncpu (NAT consumption) Jack Steiner
                   ` (3 preceding siblings ...)
  2002-04-03 22:10 ` David Mosberger
@ 2002-04-04  8:36 ` Francois-Xavier Kowalski
  2002-04-04 10:29 ` Hideki Yamamoto
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Francois-Xavier Kowalski @ 2002-04-04  8:36 UTC (permalink / raw)
  To: linux-ia64

David Mosberger wrote:

>>>>>>On Wed, 3 Apr 2002 23:29:58 +0200 (MEST), Erich Focht <efocht@ess.nec.de> said:
>>>>>>
>
>  Erich> I've seen NaT consumption faults with an ISV application in a
>  Erich> loop involving speculative loads, too. It was very hard to
>  Erich> trace back (occurred at different simulation times in a huge
>  Erich> case) and disappeared after we rewrote the loop and used the
>  Erich> latest Intel Fortran compiler. It occurred with both B3 and C0
>  Erich> CPUs under 2.4.7 and 2.4.17. I don't have a testcase for
>  Erich> this, sorry.
>
>It's due to a glibc bug that was introduced last August when strncpy()
>was rewritten.  I sent a bug report (and preliminary patch) to the
>author and am waiting to hear back.
>

Do you have the bug-report ID on GNATS? I am not able to find it in the
database to know whether it is being worked on by the maintainer.

FiX

-- 
Francois-Xavier "FiX" KOWALSKI






* Re: [Linux-ia64] SIGILL errors in strncpu (NAT consumption)
  2002-04-02  3:11 [Linux-ia64] SIGILL errors in strncpu (NAT consumption) Jack Steiner
                   ` (4 preceding siblings ...)
  2002-04-04  8:36 ` Francois-Xavier Kowalski
@ 2002-04-04 10:29 ` Hideki Yamamoto
  2002-04-04 15:54 ` David Mosberger
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Hideki Yamamoto @ 2002-04-04 10:29 UTC (permalink / raw)
  To: linux-ia64

 Hi there,

 I have a favor to ask David.
 If possible, please send me the patch with which you resolved
 this problem. I have looked for the patch in all the
 linux-ia64 email but could not find it.

 Thanks.

> David Mosberger took a look at the strncpy code & spotted
> the error:
> 
> From David:
> >> I took a closer look and there seem to be several bugs in the routine:
> >> 
> >>  (1) I don't think it's safe to do:
> >> 
> >>                 chk.s r[MEMLAT], .recovery3
> >>                 mov value = r[MEMLAT]
> >> 
> >>      in the same cycle.  In the patch below, I fixed this by adding a
> >>      stop bit, but obviously it would be better to avoid that (either
> >>      by re-ordering the code or by adding a pipeline stage).
> >> 
> >>  (2) stop bit was missing after br.cloop.dptk
> >> 
> >>  (3) off-by-one error in .recovery4 code: the destination should be
> >>      r[MEMLAT-1], not r[MEMLAT]
> >> 
> >>  (4) I believe the address calculation in .recovery3 and .recovery4 may
> >>      also be off by 8; this is just based on eye-balling the code though,
> >>      so I may be wrong
> >> 
> >> Hope this helps,
> >> 
> >>         --david
> >> 
> 
> 
> ---- 
> Test case - run ~12 copies of this in parallel.
> 
> #include <stdio.h>
> #include <signal.h>
> #include <string.h>
> #include <time.h>
> 
> char *dest, *src;
> 
> void
> sigill_handler(int sig)
> {
>         fprintf(stderr,"SIGILL: pid %d, dest 0x%lx, src 0x%lx\n",
>                 getpid(), (long)dest, (long)src);
>         exit(1);
> }
> 
> int
> main() {
>   time_t temp1;
>   char *p, buffer[1024];
> 
>   signal(SIGILL, sigill_handler);
>   
>   time(&temp1);
>   src = ctime(&temp1);
> 
>   dest = buffer;
> 
>   printf("%d\n", strlen(src));
> 
>   while(1)
>       strncpy(buffer,src,strlen(src));
> }
> 
> 
> -- 
> Thanks
> 
> Jack Steiner    (651-683-5302)   (vnet 233-5302)      steiner@sgi.com
> 
> 



* Re: [Linux-ia64] SIGILL errors in strncpu (NAT consumption)
  2002-04-02  3:11 [Linux-ia64] SIGILL errors in strncpu (NAT consumption) Jack Steiner
                   ` (5 preceding siblings ...)
  2002-04-04 10:29 ` Hideki Yamamoto
@ 2002-04-04 15:54 ` David Mosberger
  2002-04-04 18:44 ` David Mosberger
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: David Mosberger @ 2002-04-04 15:54 UTC (permalink / raw)
  To: linux-ia64

>>>>> On Thu, 04 Apr 2002 19:29:46 +0900, "Hideki Yamamoto" <hideki@hpc.bs1.fc.nec.co.jp> said:

  Hideki>  I have a favor to ask David.  If possible, please send me
  Hideki> the patch with which you resolved this problem. I have looked
  Hideki> for the patch in all the linux-ia64 email but could not find
  Hideki> it.

Jack just sent the patch.  Caveat: it has not been tested much.  You
may be better off reverting to the pre-August 2001 version of
strncpy() if stability is what you want.

	--david



* Re: [Linux-ia64] SIGILL errors in strncpu (NAT consumption)
  2002-04-02  3:11 [Linux-ia64] SIGILL errors in strncpu (NAT consumption) Jack Steiner
                   ` (6 preceding siblings ...)
  2002-04-04 15:54 ` David Mosberger
@ 2002-04-04 18:44 ` David Mosberger
  2002-04-04 19:27 ` Erich Focht
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: David Mosberger @ 2002-04-04 18:44 UTC (permalink / raw)
  To: linux-ia64

>>>>> On Thu, 04 Apr 2002 10:36:20 +0200, Francois-Xavier Kowalski <francois-xavier_kowalski@hp.com> said:

  Francois-Xavier> Do you have the bug-report ID on GNATS? I am not
  Francois-Xavier> able to find it in the database to know whether it
  Francois-Xavier> is being worked on by the maintainer.

No, I don't.  I just reported it to Jakub Jelinek.

	--david



* Re: [Linux-ia64] SIGILL errors in strncpu (NAT consumption)
  2002-04-02  3:11 [Linux-ia64] SIGILL errors in strncpu (NAT consumption) Jack Steiner
                   ` (7 preceding siblings ...)
  2002-04-04 18:44 ` David Mosberger
@ 2002-04-04 19:27 ` Erich Focht
  2002-04-04 19:31 ` David Mosberger
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Erich Focht @ 2002-04-04 19:27 UTC (permalink / raw)
  To: linux-ia64

On Wed, 3 Apr 2002, David Mosberger wrote:

> It's due to a glibc bug that was introduced last August when strncpy()
> was rewritten.  I sent a bug report (and preliminary patch) to the
> author and am waiting to hear back.

The error I've seen didn't have anything to do with strncpy(). The loop
where the strange SIGILL and "NaT consumption" came from was:

      DO 502 IB=1,NBCUT0
      if(IB.GT.NBNCST.AND.IB.LE.NBNCEN) GO TO 502
      IP1=LCU(1,IB)
      IP2=LCU(2,IB)
      IF(LQ(1,IP1).LT.-NBC.OR.LQ(1,IP2).LT.-NBC) GO TO 502
      IDP1=NDIR(ICU(1,IB))
      LQ(IDP1,IP1)=LCB(1,IB)
      IDP2=NDIR(ICU(2,IB))
      LQ(IDP2,IP2)=LCB(2,IB)
  502 CONTINUE

You shouldn't blame me for the first IF condition, it's a third party
(ISV) code. The assembler code produced by the Fortran compiler looked
correct.

What change did you make for strncpy()? Did it somehow produce a NaT
somewhere where it could influence a Fortran program? I'd like to
understand whether the problem comes from a strange combination of
instructions or somehow propagates from glibc. Splitting the loop and
eliminating the first IF helped in this case, so it's improbable that
strncpy() is related to this.

Thanks,
best regards,
Erich





* Re: [Linux-ia64] SIGILL errors in strncpu (NAT consumption)
  2002-04-02  3:11 [Linux-ia64] SIGILL errors in strncpu (NAT consumption) Jack Steiner
                   ` (8 preceding siblings ...)
  2002-04-04 19:27 ` Erich Focht
@ 2002-04-04 19:31 ` David Mosberger
  2002-04-04 21:26 ` Erich Focht
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: David Mosberger @ 2002-04-04 19:31 UTC (permalink / raw)
  To: linux-ia64

>>>>> On Thu, 4 Apr 2002 21:27:58 +0200 (MEST), Erich Focht <focht@ess.nec.de> said:

  Erich> What change did you make for strncpy()? Did it somehow
  Erich> produce a NaT somewhere where it could influence a Fortran
  Erich> program? I'd like to understand whether the problem comes
  Erich> from a strange combination of instructions or somehow
  Erich> propagates from glibc.

The glibc routine simply was buggy.  Garbage in, garbage out, no
surprise here.  I suspect your Fortran problem is something entirely
different.

	--david



* Re: [Linux-ia64] SIGILL errors in strncpu (NAT consumption)
  2002-04-02  3:11 [Linux-ia64] SIGILL errors in strncpu (NAT consumption) Jack Steiner
                   ` (9 preceding siblings ...)
  2002-04-04 19:31 ` David Mosberger
@ 2002-04-04 21:26 ` Erich Focht
  2002-04-05  3:44 ` Hideki Yamamoto
  2002-04-05 21:27 ` David Mosberger
  12 siblings, 0 replies; 14+ messages in thread
From: Erich Focht @ 2002-04-04 21:26 UTC (permalink / raw)
  To: linux-ia64

On Thu, 4 Apr 2002, David Mosberger wrote:

> The glibc routine simply was buggy.  Garbage in, garbage out, no
> surprise here.  I suspect your Fortran problem is something entirely
> different.

Thanks, I should have read the previous messages in this thread; the
question was answered before... That is the disadvantage of reading the
digest form of the mailing list traffic.

Regards,
Erich




* Re: [Linux-ia64] SIGILL errors in strncpu (NAT consumption)
  2002-04-02  3:11 [Linux-ia64] SIGILL errors in strncpu (NAT consumption) Jack Steiner
                   ` (10 preceding siblings ...)
  2002-04-04 21:26 ` Erich Focht
@ 2002-04-05  3:44 ` Hideki Yamamoto
  2002-04-05 21:27 ` David Mosberger
  12 siblings, 0 replies; 14+ messages in thread
From: Hideki Yamamoto @ 2002-04-05  3:44 UTC (permalink / raw)
  To: linux-ia64

 Hi Jack-san and David-san.

 Thank you for sending the patch.
 Yes, I will take care to do it.

At Thu, 4 Apr 2002 07:54:57 -0800,
David Mosberger wrote:
> 
> >>>>> On Thu, 04 Apr 2002 19:29:46 +0900, "Hideki Yamamoto" <hideki@hpc.bs1.fc.nec.co.jp> said:
> 
>   Hideki>  I have a favor to ask David.  If possible, please send me
>   Hideki> the patch with which you resolved this problem. I have looked
>   Hideki> for the patch in all the linux-ia64 email but could not find
>   Hideki> it.
> 
> Jack just sent the patch.  Caveat: it has not been tested much.  You
> may be better off reverting to the pre-August 2001 version of
> strncpy() if stability is what you want.
> 
> 	--david
> 



* Re: [Linux-ia64] SIGILL errors in strncpu (NAT consumption)
  2002-04-02  3:11 [Linux-ia64] SIGILL errors in strncpu (NAT consumption) Jack Steiner
                   ` (11 preceding siblings ...)
  2002-04-05  3:44 ` Hideki Yamamoto
@ 2002-04-05 21:27 ` David Mosberger
  12 siblings, 0 replies; 14+ messages in thread
From: David Mosberger @ 2002-04-05 21:27 UTC (permalink / raw)
  To: linux-ia64

Just to avoid confusing others, I'd like to clarify this point:

  >> (1) I don't think it's safe to do:
  >>
  >> chk.s r[MEMLAT], .recovery3
  >> mov value = r[MEMLAT]
  >>
  >> in the same cycle.

This was a brain fart.  It is of course legal to do this as there is
no dependency violation (both instructions only read r[MEMLAT]) and
hence the effect of the instructions is as if they had been executed
sequentially.  Thanks to Jim Hull for setting me straight.
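
The dependency argument can be sketched as a small check in C. This is a
deliberately simplified model, assuming one source and one destination
register per instruction and ignoring the architecture's special cases; op_t,
NO_REG and group_has_violation are names made up for the illustration. Within
one instruction group it flags read-after-write and write-after-write on the
same register, and accepts read-after-read and write-after-read.

```c
#include <stdbool.h>
#include <stddef.h>

#define NO_REG (-1)  /* marks "no register" in either field */

/* One instruction, reduced to the registers it touches. */
typedef struct {
    int dst;  /* register written, or NO_REG */
    int src;  /* register read, or NO_REG */
} op_t;

/* Returns true if a later instruction in the group reads or writes a
 * register that an earlier instruction in the same group writes. */
static bool group_has_violation(const op_t *ops, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (ops[i].dst == NO_REG)
            continue;  /* pure readers cannot create a hazard */
        for (size_t j = i + 1; j < n; j++) {
            if (ops[j].src == ops[i].dst || ops[j].dst == ops[i].dst)
                return true;
        }
    }
    return false;
}
```

With this model, chk.s (reads r[MEMLAT], writes nothing) followed by mov
(reads r[MEMLAT], writes value) passes, because no register written by the
first instruction is touched by the second. A pair such as "mov r31=r34"
followed by "extr.u r27=r31,0,8" would be flagged, which is the kind of case
that needs a stop bit between the two instructions.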

The rest of the patch should still apply.

	--david

