Linux PARISC architecture development
* [parisc-linux] The problem on the PA8800 is all in the data-cache.
From: Carlos O'Donell @ 2006-07-22 17:50 UTC (permalink / raw)
  To: parisc-linux, James Bottomley, Grant Grundler

James,

After spending ~4 hours of debugging yesterday evening, between
Thibaut, Dave, and myself, we firmly believe the PA8800 problems are
data cache issues.

Let me describe the test environment, the test, and the results/conclusions:

1. Thibaut's magnum, a PA8800: 1 CPU enabled, 2 cores, L1/L2 enabled, etc.
2. Kernel was 2.6.17, with a revision identifier that is not in my notes.
3. A statically compiled sshd, with *everything* disabled.
    This required LIBS="-ldl" and LD_FLAGS="-static" to achieve.

We copied the statically compiled sshd to the PA8800 and loaded it in
gdb. We passed the parameters "-D -p 2222", used "set follow-fork-mode
child" in gdb, and started the process.

From an external box we initiated an ssh connection to the remote
PA8800 and waited for gdb to catch the SIGSEGV in the sshd child. We
did this over, and over, and over to look for patterns.
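
For anyone who wants to repeat the client side of this without typing
ssh by hand each time, here is a minimal sketch. It is not what we ran;
the host name, the BatchMode/ConnectTimeout options and the helper
names are assumptions made for illustration. Only port 2222 comes from
the setup above.

```python
import subprocess


def build_ssh_command(host, port=2222):
    """Return the argv for one non-interactive connection attempt.

    The remote command "true" just makes ssh exit as soon as the
    session is up. BatchMode stops ssh from prompting for a password.
    """
    return [
        "ssh",
        "-p", str(port),
        "-o", "BatchMode=yes",
        "-o", "ConnectTimeout=10",
        host,
        "true",
    ]


def hammer(host, attempts, port=2222):
    """Connect `attempts` times and return the list of ssh exit codes."""
    codes = []
    for _ in range(attempts):
        result = subprocess.run(
            build_ssh_command(host, port),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        codes.append(result.returncode)
    return codes
```

Calling hammer("pa8800.example", 20) (placeholder host name) would make
twenty attempts in a row. This only generates the connections; the gdb
session on the PA8800 still has to be driven by hand.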

Pattern 1:

Using strace, we looked at the syscalls, and determined that the child
sshd process *always* dies after an fd socket read.

Pattern 2:

The set of registers involved is small, roughly r4, r19, r3, r21, r28.
These registers are primarily used by GCC to reference local data on
the stack. r3 is the frame marker and was frequently involved in the
faults.

Pattern 3:

The functions that fail all deal with allocating and touching new
memory. Deaths are primarily in malloc, xmalloc, memset,
packet_read_seqnr, buffer_put_bignum2_ret. In fact we died more often
than not in malloc.
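
The access pattern described here (allocate fresh memory, write to all
of it, read it back) can be sketched as below. This is an illustration
only, not a reproducer we have tried: the function name, the 0xA5 fill
byte and the sizes are all made up, and a Python interpreter does plenty
of allocation of its own, so it is a long way from the C paths in sshd.

```python
def touch_and_verify(size, rounds, pattern=0xA5):
    """Allocate, fill and read back `size` bytes, `rounds` times.

    Returns the number of bytes that did not read back as written.
    On a healthy machine this is expected to be 0.
    """
    expected = bytes([pattern]) * size
    mismatches = 0
    for _ in range(rounds):
        buf = bytearray(size)   # fresh, zero-filled allocation
        buf[:] = expected       # touch every byte with a known value
        if buf != expected:
            # Count the individual bytes that differ.
            mismatches += sum(1 for a, b in zip(buf, expected) if a != b)
    return mismatches
```

Something like touch_and_verify(1 << 20, 1000) would cycle through a
megabyte a thousand times; any non-zero return value means a byte was
read back differently from how it was written.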

Results:

Initially we thought it was an icache issue, then we realized that
PLABELs are just data, and when we removed the PLABELs from the
equation (complete static compile) we stopped seeing invalid insns. We
believe the truth here is that the PLABEL data is corrupted, and thus
r19 and the ip are bogus, so the failure appears to be icache related.
In truth it was only corrupted PLABELs.

With a fully static sshd, the PLABELs are not present, and the faults
are *all* memory loads and stores to the stack.

Conclusions:

a) We think it is not an icache issue, but in fact a dcache issue.

Often it appears as if a register was corrupted, but the truth is that
the ldw loaded bogus data into a register.

b) One time, on a later comparison in gdb, the register and the data in
memory did not match. I stress that we only saw this situation once.

c) We have often seen the failure with the frame marker on a cacheline
boundary, for example 0xc0278100 (256-byte aligned).
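
To check whether a given frame marker address sits on such a boundary,
a small helper like the following would do. It is a sketch: the function
names are made up, and the line size is left as a parameter because
only the 256-byte figure appears above; any other size is an assumption.

```python
def offset_in_line(addr, line_size):
    """Return the byte offset of `addr` within its cache line."""
    if line_size <= 0 or line_size & (line_size - 1):
        raise ValueError("line_size must be a positive power of two")
    return addr & (line_size - 1)


def is_aligned(addr, line_size):
    """True if `addr` is the first byte of a `line_size`-byte line."""
    return offset_in_line(addr, line_size) == 0
```

With the address from the example, is_aligned(0xc0278100, 256) is true,
while is_aligned(0xc0278100, 512) is false because the low bits are
0x100.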

It is my hope that these patterns will prompt someone to devise a
plan for fixing this. If you have any questions about our methods, or
about reproducing this, you can easily talk to Thibaut and we can
probably set up access to the test sshd binary.

Grant expressed worry that "Pattern 1" was indicative of a dma sync
problem with the network socket read.

Cheers,
Carlos.
