* [Linux-ia64] SIGILL errors in strncpu (NAT consumption)
@ 2002-04-02 3:11 Jack Steiner
2002-04-02 3:46 ` Jack Steiner
` (12 more replies)
0 siblings, 13 replies; 14+ messages in thread
From: Jack Steiner @ 2002-04-02 3:11 UTC (permalink / raw)
To: linux-ia64
Has anyone seen random SIGILL failures in the strncpy
function in glibc-2.2.4-19.3?
The failure is caused by a NAT consumption fault in the
code sequence shown below.
I'm still analyzing the failure, but it _appears_ that the failure
occurs:
- if a VHPT fault occurs at <strncpy+450>
- then a NAT consumption occurs at <strncpy+560>
(preliminary analysis - it may be more complicated than this)
In the failing case, neither source nor destination crosses or is
near a page boundary. The source address is in reg 1; the destination is on the stack, in
reg 4. The length is 25 bytes.
If no one else has seen this failure, I'll gather more information
about it & try to create a simple failing test case.
We are running 2.4.17 with B0 stepping Itanium.
Note: the code below uses rotating registers/predicates and
speculative loads.
....
<strncpy+416>: [MIB] (p16) ld8.s r32=[r20],8
<strncpy+417>: (p18) chk.s.i r34,0x20000000001f8c90 <strncpy+944>
<strncpy+418>: nop.b 0x0
<strncpy+432>: [MII] (p18) mov r31=r34
<strncpy+433>: (p18) czx1.r r24=r34;;
<strncpy+434>: (p18) cmp.eq p0,p7=8,r24
<strncpy+448>: [MFB] (p18) adds r21=-8,r21
<strncpy+449>: nop.f 0x0
<strncpy+450>: (p07) br.cond.dpnt.few 0x20000000001f8b40 <strncpy+608>
<strncpy+464>: [MBB] (p18) st8 [r18]=r34,8 <<<--------- if VHPT occurs here
<strncpy+465>: nop.b 0x0
<strncpy+466>: br.ctop.dptk.few 0x20000000001f8a80 <strncpy+416>;;
<strncpy+480>: [MFB] chk.s.m r33,0x20000000001f8cb0 <strncpy+976>
<strncpy+481>: nop.f 0x0
<strncpy+482>: nop.b 0x0
<strncpy+496>: [MFB] mov r31=r33
<strncpy+497>: nop.f 0x0
<strncpy+498>: nop.b 0x0
<strncpy+512>: [MIB] cmp.eq p5,p6=r21,r0
<strncpy+513>: adds r21=-1,r21
<strncpy+514>: (p05) br.cond.dptk.few 0x20000000001f8bf0 <strncpy+784>;;
<strncpy+528>: [MFI] nop.m 0x0
<strncpy+529>: nop.f 0x0
<strncpy+530>: mov.i ar.lc=r21
<strncpy+544>: [MII] nop.m 0x0
<strncpy+545>: (p06) extr.u r27=r31,0,8
<strncpy+546>: (p06) shr.u r31=r31,8;;
<strncpy+560>: [MIB] st1 [r18]=r27,1 <<<<<<<<<<<<<<<<<<<< fails here
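
For readers unfamiliar with NaT consumption, here is a minimal C sketch of the mechanism the listing above relies on. It is a toy model, not the glibc code: the type `gr_t` and the functions `spec_load`, `chk_s`, `mov_reg`, `extract_low_byte` and `store_byte` are hypothetical names invented for illustration, and a NULL pointer stands in for any address whose speculative load would be deferred.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of an IA-64 general register: 64 data bits plus a NaT bit. */
typedef struct {
    uint64_t value;
    bool nat;
} gr_t;

typedef enum { ST_OK, ST_NAT_CONSUMPTION } store_status;

/* Like ld8.s: a load that would fault sets NaT instead of trapping.
 * Assumption: NULL stands in for "this access would have faulted". */
static gr_t spec_load(const uint64_t *addr)
{
    gr_t r = { 0, false };
    if (addr == NULL)
        r.nat = true;
    else
        r.value = *addr;
    return r;
}

/* Like chk.s: returns true when control should go to recovery code. */
static bool chk_s(gr_t r)
{
    return r.nat;
}

/* Like mov: the NaT bit travels with the value. */
static gr_t mov_reg(gr_t src)
{
    return src;
}

/* Like extr.u dst=src,0,8: the result inherits the source's NaT bit. */
static gr_t extract_low_byte(gr_t src)
{
    gr_t r = { src.value & 0xff, src.nat };
    return r;
}

/* Like st1: storing a register whose NaT bit is set is a NaT
 * consumption fault; otherwise the low byte is written. */
static store_status store_byte(uint8_t *dst, gr_t src)
{
    assert(dst != NULL);
    if (src.nat)
        return ST_NAT_CONSUMPTION;
    *dst = (uint8_t)src.value;
    return ST_OK;
}
```

In this model, a value that passes `chk_s` can be moved, extracted and stored safely, while a NaT that slips past the check (or is reloaded into the wrong register by recovery code) only surfaces later, at the store. That would be consistent with the fault being reported at the st1 rather than at the load.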
--
Thanks
Jack Steiner (steiner@sgi.com) 651-683-5302
Principal Engineer Core OS/Strategic Software Org
SGI - Silicon Graphics, Inc. Eagan, MN
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [Linux-ia64] SIGILL errors in strncpu (NAT consumption)
2002-04-02 3:11 [Linux-ia64] SIGILL errors in strncpu (NAT consumption) Jack Steiner
@ 2002-04-02 3:46 ` Jack Steiner
2002-04-03 21:29 ` Erich Focht
` (11 subsequent siblings)
12 siblings, 0 replies; 14+ messages in thread
From: Jack Steiner @ 2002-04-02 3:46 UTC (permalink / raw)
To: linux-ia64
>
> On Mon, 1 Apr 2002 21:11:48 -0600 (CST),
> Jack Steiner <steiner@sgi.com> wrote:
> >Has anyone seen random SIGILL failures in the strncpy
> >function in glibc-2.2.4-19.3?
> >We are running 2.4.17 with B0 stepping Itanium.
>
> B0 or C0?
Whoops. Wrong version. I don't have the steppings chart at home, but:
family : Itanium
model : 0
revision : 6
cpu MHz : 799.942992
--
Thanks
Jack Steiner (651-683-5302) (vnet 233-5302) steiner@sgi.com
* Re: [Linux-ia64] SIGILL errors in strncpu (NAT consumption)
2002-04-02 3:11 [Linux-ia64] SIGILL errors in strncpu (NAT consumption) Jack Steiner
2002-04-02 3:46 ` Jack Steiner
@ 2002-04-03 21:29 ` Erich Focht
2002-04-03 21:43 ` Jack Steiner
` (10 subsequent siblings)
12 siblings, 0 replies; 14+ messages in thread
From: Erich Focht @ 2002-04-03 21:29 UTC (permalink / raw)
To: linux-ia64
> Has anyone seen random SIGILL failures in the strncpy
> function in glibc-2.2.4-19.3?
>
> The failure is caused by a NAT consumption fault in the
> code sequence shown below.
I've seen NaT consumption faults with an ISV application in a loop
involving speculative loads, too. It was very hard to trace back (it occurred
at different simulation times in a huge case) and disappeared after we
rewrote the loop and used the latest Intel Fortran compiler. It occurred
with both B3 and C0 CPUs under 2.4.7 and 2.4.17. I don't have a test case
for this, sorry.
Regards,
Erich
* Re: [Linux-ia64] SIGILL errors in strncpu (NAT consumption)
2002-04-02 3:11 [Linux-ia64] SIGILL errors in strncpu (NAT consumption) Jack Steiner
2002-04-02 3:46 ` Jack Steiner
2002-04-03 21:29 ` Erich Focht
@ 2002-04-03 21:43 ` Jack Steiner
2002-04-03 22:10 ` David Mosberger
` (9 subsequent siblings)
12 siblings, 0 replies; 14+ messages in thread
From: Jack Steiner @ 2002-04-03 21:43 UTC (permalink / raw)
To: linux-ia64
I isolated the strncpy problem to a simple test program. It fails
with the new glibc-2.2.4-19.3 within a few seconds.
Works fine with older versions of glibc.
David Mosberger took a look at the strncpy code & spotted
the error:
From David:
>> I took a closer look and there seem to be several bugs in the routine:
>>
>> (1) I don't think it's safe to do:
>>
>> chk.s r[MEMLAT], .recovery3
>> mov value = r[MEMLAT]
>>
>> in the same cycle. In the patch below, I fixed this by adding a
>> stop bit, but obviously it would be better to avoid that (either
>> by re-ordering the code or by adding a pipeline stage).
>>
>> (2) stop bit was missing after br.cloop.dptk
>>
>> (3) off-by-one error in .recovery4 code: the destination should be
>> r[MEMLAT-1], not r[MEMLAT]
>>
>> (4) I believe the address calculation in .recovery3 and .recovery4 may
>> also be off by 8; this is just based on eyeballing the code though,
>> so I may be wrong
>>
>> Hope this helps,
>>
>> --david
>>
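
Items (3) and (4) depend on how rotating registers rename values from one loop iteration to the next. The following is a small C sketch of that renaming, under stated assumptions: `rot_file`, `rot_write`, `rot_read` and `rotate` are hypothetical names, the region size `NROT` is arbitrary, and logical index 0 plays the role of r[0] in the glibc source. It models only the renaming, not the loop itself.

```c
#include <assert.h>
#include <stdint.h>

#define NROT 8  /* hypothetical size of the rotating region */

/* Toy rotating register file: logical names map onto physical
 * slots through a rotating register base (rrb). */
typedef struct {
    uint64_t phys[NROT];
    unsigned rrb;
} rot_file;

/* Translate a logical register index into a physical slot. */
static unsigned phys_index(const rot_file *f, unsigned logical)
{
    assert(logical < NROT);
    return (logical + f->rrb) % NROT;
}

static void rot_write(rot_file *f, unsigned logical, uint64_t v)
{
    f->phys[phys_index(f, logical)] = v;
}

static uint64_t rot_read(const rot_file *f, unsigned logical)
{
    return f->phys[phys_index(f, logical)];
}

/* One rotation, as performed by a taken br.ctop: the base is
 * decremented, so a value written as r[k] is afterwards read as r[k+1]. */
static void rotate(rot_file *f)
{
    f->rrb = (f->rrb + NROT - 1) % NROT;
}
```

A value loaded as r[0] is therefore named r[1] one rotation later and r[2] after two. Recovery code has to reload into the name the value carries at the point where the corresponding chk.s executed, which is why writing r[MEMLAT] where r[MEMLAT-1] is meant leaves the register that is actually consumed untouched.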
----
Test case - run ~12 copies of this in parallel.
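#include <signal.h>
#include <stdio.h>
#include <stdlib.h>	/* exit() */
#include <string.h>
#include <time.h>
#include <unistd.h>	/* getpid() */

char *dest, *src;

/* Report the failing process and the copy operands, then quit. */
void
sigill_handler(int sig)
{
	(void)sig;
	fprintf(stderr, "SIGILL: pid %d, dest 0x%lx, src 0x%lx\n",
		(int)getpid(), (long)dest, (long)src);
	exit(1);
}

int
main(void)
{
	time_t temp1;
	char buffer[1024];

	signal(SIGILL, sigill_handler);

	time(&temp1);
	src = ctime(&temp1);	/* short, fixed-format date string */
	dest = buffer;

	printf("%zu\n", strlen(src));

	/* Copy the same string forever; only SIGILL ends the loop. */
	while (1)
		strncpy(buffer, src, strlen(src));
}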
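
Running several copies in parallel can also be done from a small launcher. The sketch below is one way to do it and is not part of the original test case: `count_failed_copies`, `bounded_strncpy_worker` and `sigill_worker` are hypothetical names, the worker is bounded (unlike the endless loop above) so that the launcher terminates, and `sigill_worker` merely raises SIGILL to stand in for a failing copy.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Fork `n` children that each run `worker`; return how many of the
 * started children failed (killed by a signal or nonzero exit). */
static int count_failed_copies(int n, int (*worker)(void))
{
    int started = 0, failed = 0;

    assert(n >= 0 && worker != NULL);
    for (int i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid < 0)
            break;              /* could not start more children */
        if (pid == 0)
            _exit(worker());    /* child: run the test and report */
        started++;
    }
    for (int i = 0; i < started; i++) {
        int status;
        if (wait(&status) < 0)
            break;
        if (WIFSIGNALED(status) ||
            (WIFEXITED(status) && WEXITSTATUS(status) != 0))
            failed++;
    }
    return failed;
}

/* Bounded variant of the strncpy loop; returns 0 on success. */
static int bounded_strncpy_worker(void)
{
    char buffer[1024];
    time_t now = time(NULL);
    const char *s = ctime(&now);
    size_t len;

    if (s == NULL)
        return 1;
    len = strlen(s);
    for (long i = 0; i < 100000; i++) {
        strncpy(buffer, s, len);
        if (memcmp(buffer, s, len) != 0)
            return 1;           /* copy was corrupted */
    }
    return 0;
}

/* Stand-in for a failing copy: dies from SIGILL (default action). */
static int sigill_worker(void)
{
    raise(SIGILL);
    return 0;
}
```

Calling `count_failed_copies(12, bounded_strncpy_worker)` would approximate the "~12 copies" run; on a healthy library it is expected to return 0, and any nonzero result identifies how many copies died or detected a bad copy.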
--
Thanks
Jack Steiner (651-683-5302) (vnet 233-5302) steiner@sgi.com
* Re: [Linux-ia64] SIGILL errors in strncpu (NAT consumption)
2002-04-02 3:11 [Linux-ia64] SIGILL errors in strncpu (NAT consumption) Jack Steiner
` (2 preceding siblings ...)
2002-04-03 21:43 ` Jack Steiner
@ 2002-04-03 22:10 ` David Mosberger
2002-04-04 8:36 ` Francois-Xavier Kowalski
` (8 subsequent siblings)
12 siblings, 0 replies; 14+ messages in thread
From: David Mosberger @ 2002-04-03 22:10 UTC (permalink / raw)
To: linux-ia64
>>>>> On Wed, 3 Apr 2002 23:29:58 +0200 (MEST), Erich Focht <efocht@ess.nec.de> said:
Erich> I've seen NaT consumption faults with an ISV application in a
Erich> loop involving speculative loads, too. It was very hard to
Erich> trace back (occurred at different simulation times in a huge
Erich> case) and disappeared after we rewrote the loop and used the
Erich> latest Intel Fortran compiler. It occurred with both B3 and C0
Erich> CPUs under 2.4.7 and 2.4.17. I don't have a testcase for
Erich> this, sorry.
It's due to a glibc bug that was introduced last August when strncpy()
was rewritten. I sent a bug report (and preliminary patch) to the
author and am waiting to hear back.
--david
* Re: [Linux-ia64] SIGILL errors in strncpu (NAT consumption)
2002-04-02 3:11 [Linux-ia64] SIGILL errors in strncpu (NAT consumption) Jack Steiner
` (3 preceding siblings ...)
2002-04-03 22:10 ` David Mosberger
@ 2002-04-04 8:36 ` Francois-Xavier Kowalski
2002-04-04 10:29 ` Hideki Yamamoto
` (7 subsequent siblings)
12 siblings, 0 replies; 14+ messages in thread
From: Francois-Xavier Kowalski @ 2002-04-04 8:36 UTC (permalink / raw)
To: linux-ia64
David Mosberger wrote:
>>>>>>On Wed, 3 Apr 2002 23:29:58 +0200 (MEST), Erich Focht <efocht@ess.nec.de> said:
>>>>>>
>
> Erich> I've seen NaT consumption faults with an ISV application in a
> Erich> loop involving speculative loads, too. It was very hard to
> Erich> trace back (occurred at different simulation times in a huge
> Erich> case) and disappeared after we rewrote the loop and used the
> Erich> latest Intel Fortran compiler. It occurred with both B3 and C0
> Erich> CPUs under 2.4.7 and 2.4.17. I don't have a testcase for
> Erich> this, sorry.
>
>It's due to a glibc bug that was introduced last August when strncpy()
>was rewritten. I sent a bug report (and preliminary patch) to the
>author and am waiting to hear back.
>
Do you have the bug-report ID on GNATS? I am not able to find it in the
database to know whether the maintainer is working on it.
FiX
--
Francois-Xavier "FiX" KOWALSKI
* Re: [Linux-ia64] SIGILL errors in strncpu (NAT consumption)
2002-04-02 3:11 [Linux-ia64] SIGILL errors in strncpu (NAT consumption) Jack Steiner
` (4 preceding siblings ...)
2002-04-04 8:36 ` Francois-Xavier Kowalski
@ 2002-04-04 10:29 ` Hideki Yamamoto
2002-04-04 15:54 ` David Mosberger
` (6 subsequent siblings)
12 siblings, 0 replies; 14+ messages in thread
From: Hideki Yamamoto @ 2002-04-04 10:29 UTC (permalink / raw)
To: linux-ia64
Hi there,
I have a favor to ask of David.
If possible, please send me the patch with which you resolved
this problem. I have searched all of the linux-ia64
email for the patch, but I could not find it.
Thanks.
> David Mosberger took a look at the strncpy code & spotted
> the error:
>
> From David:
> >> I took a closer look and there seem to be several bugs in the routine:
> >>
> >> (1) I don't think it's safe to do:
> >>
> >> chk.s r[MEMLAT], .recovery3
> >> mov value = r[MEMLAT]
> >>
> >> in the same cycle. In the patch below, I fixed this by adding a
> >> stop bit, but obviously it would be better to avoid that (either
> >> by re-ordering the code or by adding a pipeline stage).
> >>
> >> (2) stop bit was missing after br.cloop.dptk
> >>
> >> (3) off-by-one error in .recovery4 code: the destination should be
> >> r[MEMLAT-1], not r[MEMLAT]
> >>
> >> (4) I believe the address calculation in .recovery3 and .recovery4 may
> >> also be off by 8; this is just based on eyeballing the code though,
> >> so I may be wrong
> >>
> >> Hope this helps,
> >>
> >> --david
> >>
>
>
> ----
> Test case - run ~12 copies of this in parallel.
>
> #include <stdio.h>
> #include <signal.h>
> #include <string.h>
> #include <time.h>
>
> char *dest, *src;
>
> void
> sigill_handler(int sig)
> {
> fprintf(stderr,"SIGILL: pid %d, dest 0x%lx, src 0x%lx\n",
> getpid(), (long)dest, (long)src);
> exit(1);
> }
>
> int
> main() {
> time_t temp1;
> char *p, buffer[1024];
>
> signal(SIGILL, sigill_handler);
>
> time(&temp1);
> src = ctime(&temp1);
>
> dest = buffer;
>
> printf("%d\n", strlen(src));
>
> while(1)
> strncpy(buffer,src,strlen(src));
> }
>
>
> --
> Thanks
>
> Jack Steiner (651-683-5302) (vnet 233-5302) steiner@sgi.com
>
* Re: [Linux-ia64] SIGILL errors in strncpu (NAT consumption)
2002-04-02 3:11 [Linux-ia64] SIGILL errors in strncpu (NAT consumption) Jack Steiner
` (5 preceding siblings ...)
2002-04-04 10:29 ` Hideki Yamamoto
@ 2002-04-04 15:54 ` David Mosberger
2002-04-04 18:44 ` David Mosberger
` (5 subsequent siblings)
12 siblings, 0 replies; 14+ messages in thread
From: David Mosberger @ 2002-04-04 15:54 UTC (permalink / raw)
To: linux-ia64
>>>>> On Thu, 04 Apr 2002 19:29:46 +0900, "Hideki Yamamoto" <hideki@hpc.bs1.fc.nec.co.jp> said:
Hideki> I have a favor to ask David. If possible, please give me
Hideki> the patch you had resolved this problem. I have been looking
Hideki> for the patch in all email(linux-ia64) but I could not find
Hideki> this patch.
Jack just sent the patch. Caveat: it has not been tested much. You
may be better off reverting to the pre-August 2001 version of
strncpy() if stability is what you want.
--david
* Re: [Linux-ia64] SIGILL errors in strncpu (NAT consumption)
2002-04-02 3:11 [Linux-ia64] SIGILL errors in strncpu (NAT consumption) Jack Steiner
` (6 preceding siblings ...)
2002-04-04 15:54 ` David Mosberger
@ 2002-04-04 18:44 ` David Mosberger
2002-04-04 19:27 ` Erich Focht
` (4 subsequent siblings)
12 siblings, 0 replies; 14+ messages in thread
From: David Mosberger @ 2002-04-04 18:44 UTC (permalink / raw)
To: linux-ia64
>>>>> On Thu, 04 Apr 2002 10:36:20 +0200, Francois-Xavier Kowalski <francois-xavier_kowalski@hp.com> said:
Francois-Xavier> Do you have the bug-report ID on GNATS? I am not
Francois-Xavier> able to find it in the database to known if it is
Francois-Xavier> being worked-out by the maintainer.
No, I don't. I just reported it to Jakub Jelinek.
--david
* Re: [Linux-ia64] SIGILL errors in strncpu (NAT consumption)
2002-04-02 3:11 [Linux-ia64] SIGILL errors in strncpu (NAT consumption) Jack Steiner
` (7 preceding siblings ...)
2002-04-04 18:44 ` David Mosberger
@ 2002-04-04 19:27 ` Erich Focht
2002-04-04 19:31 ` David Mosberger
` (3 subsequent siblings)
12 siblings, 0 replies; 14+ messages in thread
From: Erich Focht @ 2002-04-04 19:27 UTC (permalink / raw)
To: linux-ia64
On Wed, 3 Apr 2002, David Mosberger wrote:
> It's due to a glibc bug that was introduced last August when strncpy()
> was rewritten. I sent a bug report (and preliminary patch) to the
> author and am waiting to hear back.
The error I've seen didn't have anything to do with strncpy(). The loop
that the strange SIGILL and "NaT consumption" came from was:
      DO 502 IB=1,NBCUT0
      IF(IB.GT.NBNCST.AND.IB.LE.NBNCEN) GO TO 502
      IP1=LCU(1,IB)
      IP2=LCU(2,IB)
      IF(LQ(1,IP1).LT.-NBC.OR.LQ(1,IP2).LT.-NBC) GO TO 502
      IDP1=NDIR(ICU(1,IB))
      LQ(IDP1,IP1)=LCB(1,IB)
      IDP2=NDIR(ICU(2,IB))
      LQ(IDP2,IP2)=LCB(2,IB)
  502 CONTINUE
You shouldn't blame me for the first IF condition; it's third-party
(ISV) code. The assembler code produced by the Fortran compiler looked
correct.
What change did you make for strncpy()? Did it somehow produce a NaT
somewhere where it could influence a Fortran program? I'd like to
understand whether the problem comes from a strange combination of
instructions or somehow propagates from glibc. Splitting the loop and
eliminating the first IF helped in this case, so it's improbable that
strncpy() is related to this.
Thanks,
best regards,
Erich
* Re: [Linux-ia64] SIGILL errors in strncpu (NAT consumption)
2002-04-02 3:11 [Linux-ia64] SIGILL errors in strncpu (NAT consumption) Jack Steiner
` (8 preceding siblings ...)
2002-04-04 19:27 ` Erich Focht
@ 2002-04-04 19:31 ` David Mosberger
2002-04-04 21:26 ` Erich Focht
` (2 subsequent siblings)
12 siblings, 0 replies; 14+ messages in thread
From: David Mosberger @ 2002-04-04 19:31 UTC (permalink / raw)
To: linux-ia64
>>>>> On Thu, 4 Apr 2002 21:27:58 +0200 (MEST), Erich Focht <focht@ess.nec.de> said:
Erich> What change did you make for strncpy()? Did it somehow
Erich> produce a NaT somewhere where it could influence a Fortran
Erich> program? I'd like to understand whether the problem comes
Erich> from a strange combination of instructions or somehow
Erich> propagates from glibc.
The glibc routine simply was buggy. Garbage in, garbage out, no
surprise here. I suspect your Fortran problem is something entirely
different.
--david
* Re: [Linux-ia64] SIGILL errors in strncpu (NAT consumption)
2002-04-02 3:11 [Linux-ia64] SIGILL errors in strncpu (NAT consumption) Jack Steiner
` (9 preceding siblings ...)
2002-04-04 19:31 ` David Mosberger
@ 2002-04-04 21:26 ` Erich Focht
2002-04-05 3:44 ` Hideki Yamamoto
2002-04-05 21:27 ` David Mosberger
12 siblings, 0 replies; 14+ messages in thread
From: Erich Focht @ 2002-04-04 21:26 UTC (permalink / raw)
To: linux-ia64
On Thu, 4 Apr 2002, David Mosberger wrote:
> The glibc routine simply was buggy. Garbage in, garbage out, no
> surprise here. I suspect your Fortran problem is something entirely
> different.
Thanks, I should have read the previous messages in this thread; the
question was answered before... That is the disadvantage of reading the digest
form of the mailing list traffic.
Regards,
Erich
* Re: [Linux-ia64] SIGILL errors in strncpu (NAT consumption)
2002-04-02 3:11 [Linux-ia64] SIGILL errors in strncpu (NAT consumption) Jack Steiner
` (10 preceding siblings ...)
2002-04-04 21:26 ` Erich Focht
@ 2002-04-05 3:44 ` Hideki Yamamoto
2002-04-05 21:27 ` David Mosberger
12 siblings, 0 replies; 14+ messages in thread
From: Hideki Yamamoto @ 2002-04-05 3:44 UTC (permalink / raw)
To: linux-ia64
Hi Jack-san and David-san.
Thank you for sending the patch.
Yes, I will take care to do it.
At Thu, 4 Apr 2002 07:54:57 -0800,
David Mosberger wrote:
>
> >>>>> On Thu, 04 Apr 2002 19:29:46 +0900, "Hideki Yamamoto" <hideki@hpc.bs1.fc.nec.co.jp> said:
>
> Hideki> I have a favor to ask David. If possible, please give me
> Hideki> the patch you had resolved this problem. I have been looking
> Hideki> for the patch in all email(linux-ia64) but I could not find
> Hideki> this patch.
>
> Jack just sent the patch. Caveat: it has not been tested much. You
> may be better off reverting to the pre-August 2001 version of
> strncpy() if stability is what you want.
>
> --david
>
* Re: [Linux-ia64] SIGILL errors in strncpu (NAT consumption)
2002-04-02 3:11 [Linux-ia64] SIGILL errors in strncpu (NAT consumption) Jack Steiner
` (11 preceding siblings ...)
2002-04-05 3:44 ` Hideki Yamamoto
@ 2002-04-05 21:27 ` David Mosberger
12 siblings, 0 replies; 14+ messages in thread
From: David Mosberger @ 2002-04-05 21:27 UTC (permalink / raw)
To: linux-ia64
Just to avoid confusing others, I'd like to clarify this point:
>> (1) I don't think it's safe to do:
>>
>> chk.s r[MEMLAT], .recovery3
>> mov value = r[MEMLAT]
>>
>> in the same cycle.
This was a brain fart. It is of course legal to do this as there is
no dependency violation (both instructions only read r[MEMLAT]) and
hence the effect of the instructions is as if they had been executed
sequentially. Thanks to Jim Hull for setting me straight.
The rest of the patch should still apply.
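
The rule behind this correction can be sketched in C. This is a deliberately simplified check, assuming each instruction has at most one destination and two source registers and ignoring predicates and the architecture's special cases; `insn`, `NO_REG`, `conflicts` and `group_is_legal` are hypothetical names. It flags only read-after-write and write-after-write pairs inside one instruction group, which is why two instructions that both merely read the same register pass.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NO_REG (-1)  /* marks an unused operand slot */

/* Simplified instruction: one destination, up to two sources. */
typedef struct {
    int dst;
    int src[2];
} insn;

/* True if `later` reads (RAW) or writes (WAW) the register that
 * `earlier` writes. Read-after-read and write-after-read pass. */
static bool conflicts(const insn *earlier, const insn *later)
{
    if (earlier->dst == NO_REG)
        return false;
    if (later->dst == earlier->dst)
        return true;
    for (size_t i = 0; i < 2; i++)
        if (later->src[i] == earlier->dst)
            return true;
    return false;
}

/* True if no pair of instructions in the group conflicts. */
static bool group_is_legal(const insn *group, size_t n)
{
    assert(group != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (conflicts(&group[i], &group[j]))
                return false;
    return true;
}
```

With register numbers borrowed from the disassembly earlier in the thread as stand-ins, a chk.s on r34 followed by mov r31=r34 passes the check, as does extr.u r27=r31 followed by shr.u r31=r31,8 (a write after a read). A mov r31=r33 followed in the same group by extr.u r27=r31 would not pass, and that case does need a stop bit.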
--david