* [parisc-linux] The problem on the PA8800 is all in the data-cache.
@ 2006-07-22 17:50 Carlos O'Donell
2006-07-23 1:01 ` Thibaut VARENE
2006-07-24 2:33 ` James Bottomley
0 siblings, 2 replies; 23+ messages in thread
From: Carlos O'Donell @ 2006-07-22 17:50 UTC (permalink / raw)
To: parisc-linux, James Bottomley, Grant Grundler
James,
After spending ~4 hours of debugging yesterday evening, between
Thibaut, Dave, and myself, we firmly believe the PA8800 problems are
data cache issues.
Let me describe the test environment, the test, and the results/conclusions:
1. Thibaut's magnum, PA8800. 1 cpu enabled, 2 cores, L1/L2 enabled. etc.
2. Kernel was 2.6.17, with a revision identifier that is not in my notes.
3. A statically compiled sshd, with *everything* disabled.
This required LIBS="-ldl" and LD_FLAGS="-static" to achieve.
We copied the statically compiled sshd to the PA8800, loaded sshd via
gdb. Passed the following parameters "-D -p 2222", in gdb we used "set
follow-fork-mode child", and started the process.
From an external box we initiated an ssh connection to the remote
PA8800 and waited for gdb to catch the SIGSEGV in the sshd child. We
did this over, and over, and over to look for patterns.
Pattern 1:
Using strace, we looked at the syscalls, and determined that the child
sshd process *always* dies after an fd socket read.
Pattern 2:
The set of registers involved is small, roughly r4, r19, r3, r21, r28.
These registers are primarily used by GCC to reference local data on
the stack. r3 is the frame marker and was frequently involved in the
faults.
Pattern 3:
Called functions that fail deal with allocating and touching new
memory. Deaths are primarily in malloc, xmalloc, memset,
packet_read_seqnr, buffer_put_bignum2_ret. In fact, we died more often
than not in malloc.
Results:
Initially we thought it was an icache issue, then we realized that
PLABELs are just data, and when we removed the PLABELs from the
equation (complete static compile) we stopped seeing invalid insns. We
believe the truth here is that the PLABEL data is corrupted, and thus
r19 and the ip are bogus, so the failure appears to be icache related.
In truth it was only corrupted PLABELs.
With a fully static sshd, the PLABELs are not present, and the faults
are *all* memory loads and stores to the stack.
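To make the "PLABELs are just data" point concrete, here is a minimal Python model of an indirect call through a function descriptor. It is only a sketch under assumptions: the flag bit, the two-word descriptor layout (target address, then the value loaded into r19), and every address in it are illustrative and are not taken from the failing sshd.

```python
import struct

PLABEL_FLAG = 0x2  # assumption: bit that marks a function pointer as a PLABEL


def make_memory():
    """Build a tiny fake data memory holding one function descriptor."""
    mem = bytearray(64)
    # Hypothetical descriptor at offset 16: target address, then r19 value.
    struct.pack_into(">II", mem, 16, 0x00012340, 0x000A0000)
    return mem


def resolve(mem, fptr):
    """Return (target, r19) for an indirect call through fptr."""
    if fptr & PLABEL_FLAG:
        desc = fptr & ~0x3  # strip the low flag bits to get the descriptor
        return struct.unpack_from(">II", mem, desc)
    return fptr, None  # plain code address, r19 left alone


mem = make_memory()
good = resolve(mem, 16 | PLABEL_FLAG)

mem[17] ^= 0xFF  # one corrupted data byte inside the descriptor
bad = resolve(mem, 16 | PLABEL_FLAG)
```

After the single-byte corruption the resolved branch target is bogus even though no instruction bytes were touched, which is why such a fault can look like an icache problem.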
Conclusions:
a) We think it is not an icache issue, but in fact a dcache issue.
Often it appears as if a register was corrupted, but the truth is that
the ldw loaded bogus data into a register.
b) One time, on a later comparison in gdb, the register and the data in
memory did not match. I stress that we only saw this situation once.
c) We have often seen the failure with the frame marker on a cacheline
boundary, for example 0xc0278100 (i.e. 256-byte aligned).
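The alignment of the quoted frame marker address can be checked mechanically. The sketch below is only illustrative: the candidate boundary sizes are assumptions chosen for the comparison, not PA8800 cache line specifications.

```python
def is_aligned(addr: int, boundary: int) -> bool:
    """Return True if addr is a multiple of boundary (a power of two)."""
    if boundary <= 0 or boundary & (boundary - 1):
        raise ValueError("boundary must be a positive power of two")
    return addr & (boundary - 1) == 0


FRAME_MARKER = 0xC0278100  # address quoted in conclusion c)

# Candidate boundaries are assumptions for illustration only.
alignment = {size: is_aligned(FRAME_MARKER, size) for size in (64, 128, 256, 512)}
```

Running the same check over every faulting frame marker collected from gdb would show whether the boundary pattern holds more often than chance.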
It is my hope that these patterns will trigger someone to devise a
plan for fixing this. If you have any questions about our methods, or
reproducing this, you can easily talk to Thibaut and we can probably
set up access to the test sshd binary.
Grant expressed worry that "Pattern 1" was indicative of a dma sync
problem with the network socket read.
Cheers,
Carlos.
_______________________________________________
parisc-linux mailing list
parisc-linux@lists.parisc-linux.org
http://lists.parisc-linux.org/mailman/listinfo/parisc-linux
* Re: [parisc-linux] The problem on the PA8800 is all in the data-cache.
2006-07-22 17:50 [parisc-linux] The problem on the PA8800 is all in the data-cache Carlos O'Donell
@ 2006-07-23 1:01 ` Thibaut VARENE
2006-07-23 16:28 ` Michael S. Zick
2006-07-24 2:33 ` James Bottomley
1 sibling, 1 reply; 23+ messages in thread
From: Thibaut VARENE @ 2006-07-23 1:01 UTC (permalink / raw)
To: Carlos O'Donell; +Cc: James Bottomley, parisc-linux
Carlos,
I'm observing something totally crazy right now. On the very same
machine we hacked yesterday, exact same setup (same kernel, same
binary, same everything):
I /can't/ start our static sshd anymore. It dies right after a sysctl (!):
898 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4000b000
898 getpid() = 898
898 rt_sigaction(SIGRTMIN, {0x40c51eba, [], 0}, NULL, 8) = 0
898 rt_sigaction(SIGRT_1, {0x40c51ec2, [TRAP], 0}, NULL, 8) = 0
898 rt_sigaction(SIGRT_2, {0x40c51ea2, [], 0}, NULL, 8) = 0
898 rt_sigprocmask(SIG_BLOCK, [RTMIN], NULL, 8) = 0
898 rt_sigprocmask(SIG_UNBLOCK, [RT_1], NULL, 8) = 0
898 _sysctl({{CTL_KERN, KERN_VERSION}, 2, 0xc00e67d4, 36, (nil), 0}) = 0
898 --- SIGSEGV (Segmentation fault) @ 0 (0) ---
898 +++ killed by SIGSEGV +++
As you can see, at that point it hasn't even spawned a child yet. gdb
isn't much help, as the backtrace gives few clues:
(gdb) symbol-file /home/varenet/openssh-4.3p2/sshd
Reading symbols from /home/varenet/openssh-4.3p2/sshd...done.
(gdb) set follow-fork-mode child
(gdb) set args -D -p 2222
(gdb) run
Starting program: /usr/local/test/sbin/sshd -D -p 2222
Program received signal SIGSEGV, Segmentation fault.
0x00000000 in ?? ()
(gdb) bt
#0 0x00000000 in ?? ()
#1 0x40c3a9a0 in ?? ()
Previous frame identical to this frame (corrupt stack?)
Finally, dmesg shows:
do_page_fault() pid=898 command='sshd' type=6 address=0x00000003
YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
PSW: 00000000000001101111111100001011 Not tainted
r00-03 0000000000000000 0000000040c5233c 0000000040c3a9a3 00000000c00e67d4
r04-07 0000000040c5233c 0000000040c51a80 000000004069e830 0000000000000200
r08-11 0000000040c513d4 00000000000000e1 0000000080000001 0000000090000000
r12-15 00000000d0000000 000000000021fd70 00000000001a5800 00000000001a5800
r16-19 000000000004c8c0 00000000c00de698 000000000004c8c0 0000000040c5233c
r20-23 0000000000000000 0000000000000053 0000000000000000 00000000c00e66c8
r24-27 00000000c00e67d4 0000000040c401b2 00000000c00e67d7 00000000001db060
r28-31 0000000040c51a80 0000000000000213 00000000c00e6a40 0000000040c3a9a3
sr0-3 00000000001fa800 0000000000000000 0000000000000000 00000000001fa800
sr4-7 00000000001fa800 00000000001fa800 00000000001fa800 00000000001fa800
The cool thing is that I can /consistently/ reproduce this.
I'm leaving the box powered up, not touching anything until we get a
chance to investigate this a bit more.
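A small sketch for reading the addresses in the dump above, assuming the PA-RISC convention that the low two bits of an instruction address carry the privilege level (3 being user mode). Under that assumption, address=0x00000003 would be a user-mode branch to address 0, and r2/r31 (0x40c3a9a3) would match the 0x40c3a9a0 that gdb shows for frame #1. This is an interpretation of the dump, not a confirmed diagnosis.

```python
PRIV_MASK = 0x3  # assumption: low two bits hold the privilege level


def split_pa_address(addr: int):
    """Split an instruction address into (word-aligned address, privilege)."""
    return addr & ~PRIV_MASK, addr & PRIV_MASK


fault_addr = split_pa_address(0x00000003)   # address= from do_page_fault()
return_ptr = split_pa_address(0x40C3A9A3)   # r2 and r31 in the dump
```

If that reading is right, the process jumped through a zero pointer that was loaded as data, which fits the corrupted-load pattern described earlier in the thread.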
Aside from that, I /really/ believe that the fact that we can trigger the
bug that easily with some network applications isn't a coincidence.
Grant's hint of a dma sync problem shouldn't be overlooked. The
"make" failures could also be I/O related...
HTH
T-Bone
PS: I tried "ssh localhost" with the 'normal' sshd (/usr/sbin/sshd) as
I told you earlier today; it dies as expected with pretty much the
same tombstones as those we saw yesterday.
I haven't investigated much further at this point.
Note: you and James can access that machine (provided you remember
your password and that your ssh key on mkhppa02.esiee.fr is valid). If
there's a problem with either of these, it's easy to fix. If jda wants an
account, that's also easy.
--
Thibaut VARENE
http://www.parisc-linux.org/~varenet/
* Re: [parisc-linux] The problem on the PA8800 is all in the data-cache.
2006-07-23 1:01 ` Thibaut VARENE
@ 2006-07-23 16:28 ` Michael S. Zick
2006-07-23 22:03 ` Thibaut VARENE
0 siblings, 1 reply; 23+ messages in thread
From: Michael S. Zick @ 2006-07-23 16:28 UTC (permalink / raw)
To: parisc-linux; +Cc: Thibaut VARENE
On Sat July 22 2006 20:01, Thibaut VARENE wrote:
> Carlos,
>
> I'm observing something totally crazy right now. On the very same
> machine we hacked yesterday, exact same setup (same kernel, same
> binary, same everything):
>
> i /can't/ start our static sshd anymore. It dies right after a sysctl (!):
>
> I'm leaving the box powered up, not touching anything until we get a
> chance to investigate this a bit more.
>
Is it still possible to capture the kernel's internal state by
copying /proc/kcore to a file somewhere?
Mike
* Re: [parisc-linux] The problem on the PA8800 is all in the data-cache.
2006-07-23 16:28 ` Michael S. Zick
@ 2006-07-23 22:03 ` Thibaut VARENE
2006-07-24 1:40 ` Kyle McMartin
0 siblings, 1 reply; 23+ messages in thread
From: Thibaut VARENE @ 2006-07-23 22:03 UTC (permalink / raw)
To: Michael S. Zick; +Cc: parisc-linux
On 7/23/06, Michael S. Zick <mszick@morethan.org> wrote:
> On Sat July 22 2006 20:01, Thibaut VARENE wrote:
> > Carlos,
> >
> > I'm observing something totally crazy right now. On the very same
> > machine we hacked yesterday, exact same setup (same kernel, same
> > binary, same everything):
> >
> > i /can't/ start our static sshd anymore. It dies right after a sysctl (!):
> >
> > I'm leaving the box powered up, not touching anything until we get a
> > chance to investigate this a bit more.
> >
>
> Is it still possible to capture the kernel's internal state by
> copying /proc/kcore to a file somewhere?
That box hit the "kill your fs" bug I've seen on ppc and ia64
(upcoming report mail to be posted soon) so it's basically dead at
this point and I'm having a hard time reinstalling it (thanks to pa8800
being /such/ a hassle to install remotely).
T-Bone
--
Thibaut VARENE
http://www.parisc-linux.org/~varenet/
* Re: [parisc-linux] The problem on the PA8800 is all in the data-cache.
2006-07-23 22:03 ` Thibaut VARENE
@ 2006-07-24 1:40 ` Kyle McMartin
2006-07-24 2:39 ` Thibaut VARENE
0 siblings, 1 reply; 23+ messages in thread
From: Kyle McMartin @ 2006-07-24 1:40 UTC (permalink / raw)
To: Thibaut VARENE; +Cc: parisc-linux
On Sun, Jul 23, 2006 at 06:03:47PM -0400, Thibaut VARENE wrote:
> That box hit the "kill your fs" bug I've seen on ppc and ia64
> (upcoming report mail to be posted soon) so it's basically dead at
> that point and I'm having hard time reinstalling it (thanks to pa8800
> being /such/ a hassle to install remotely).
>
Does it have two disks? Maybe we should keep one disk as 'known good'
backup on ioz and magnum, and if they crap out, BO ALT and dd it over
the scrogged disk? Just a thought which might save some pain in the
future considering they are crash n' bash machines.
* Re: [parisc-linux] The problem on the PA8800 is all in the data-cache.
2006-07-24 1:40 ` Kyle McMartin
@ 2006-07-24 2:39 ` Thibaut VARENE
0 siblings, 0 replies; 23+ messages in thread
From: Thibaut VARENE @ 2006-07-24 2:39 UTC (permalink / raw)
To: Kyle McMartin; +Cc: parisc-linux
On 7/23/06, Kyle McMartin <kyle@mcmartin.ca> wrote:
> On Sun, Jul 23, 2006 at 06:03:47PM -0400, Thibaut VARENE wrote:
> > That box hit the "kill your fs" bug I've seen on ppc and ia64
> > (upcoming report mail to be posted soon) so it's basically dead at
> > that point and I'm having hard time reinstalling it (thanks to pa8800
> > being /such/ a hassle to install remotely).
> >
>
> Does it have two disks? Maybe we should keep one disk as 'known good'
> backup on ioz and magnum, and if they crap out, BO ALT and dd it over
> the scrogged disk? Just a thought which might save some pain in the
> future considering they are crash n' bash machines.
It /had/ two disks and one good backup. Until I ran (to my own shame
and that of my offspring) '# rm -rf /mnt *' on the backup one...
T-Bone
--
Thibaut VARENE
http://www.parisc-linux.org/~varenet/
* Re: [parisc-linux] The problem on the PA8800 is all in the data-cache.
2006-07-22 17:50 [parisc-linux] The problem on the PA8800 is all in the data-cache Carlos O'Donell
2006-07-23 1:01 ` Thibaut VARENE
@ 2006-07-24 2:33 ` James Bottomley
2006-07-24 2:54 ` Thibaut VARENE
2006-07-24 14:51 ` James Bottomley
1 sibling, 2 replies; 23+ messages in thread
From: James Bottomley @ 2006-07-24 2:33 UTC (permalink / raw)
To: Carlos O'Donell; +Cc: parisc-linux
On Sat, 2006-07-22 at 13:50 -0400, Carlos O'Donell wrote:
> After spending ~4 hours of debugging yesterday evening, between
> Thibaut, Dave, and myself, we firmly believe the PA8800 problems are
> data cache issues.
Thanks very much for spending the time to do this!
> a) We think it is not an icache issue, but in fact a dcache issue.
>
> Often it appears as if a register was corrupted, but the truth is that
> the ldw loaded bogus data into a register.
OK, I'll buy this provisionally ... D cache incoherence is far more
difficult to explain than D/I incoherence, but I'll try.
> b) One time, on a later comparison in gdb, the register and data in
> memory did not equal. I stress that we only saw this situation once.
OK, as long as it was a register read, this must have been D
incoherence.
> c) We have often seen the failure with the frame marker on a cacheline
> boundary, for example 0xc0278100 (e.g. 256 bytes).
>
> It is my hope that these patterns will trigger someone to devise a
> plan for fixing this. If you have any questions about our methods, or
> reproducing this, you can easily talk to Thibaut and we can probably
> setup access to the test sshd binary.
OK .. let me try to think of this one. To me, the pattern indicates
errors in newly faulted memory (either from stack growth or touching
malloced areas which are mmapped in glibc). So, here's the only current
theory I can come up with: Aggressive prefetching is causing us
problems in faulting. Theoretically, it looks like the culprit should
be anonymous pages (because that's what stack and malloc areas
are---they're not file backed). However, I tend to discount this
because the only way data gets into anonymous memory is when the user
(or the linker running on behalf of the user) puts it there. Thus,
there should be no user coherence issues with data in anonymous memory.
> Grant expressed worry that "Pattern 1" was indicative of a dma sync
> problem with the network socket read.
I'm still dubious about this one ... even if we agree it's a D cache
issue, it's definitely a D cache issue affecting program execution (i.e.
function pointers or call indirection). The data coming out of the
network pipe for ssh never finds its way into the execution stream,
which means it's unlikely to affect these areas. Additionally, ssh has
message integrity checks which fail noisily (i.e. the network data is
verified against a secure hash before it's used). So, if we had
incoherent data from the pipe, I would expect to see periodic MIC
failures, which we don't see.
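To illustrate why corrupted network data should fail noisily, here is a rough sketch of a message integrity check built on Python's standard hmac module. It is not OpenSSH's code; the key, sequence number, and payload are made up for the example, and HMAC-SHA1 over the sequence number followed by the packet is assumed as the MAC construction.

```python
import hashlib
import hmac


def mic(key: bytes, seq: int, payload: bytes) -> bytes:
    """Compute a MAC over the 32-bit sequence number and the payload."""
    return hmac.new(key, seq.to_bytes(4, "big") + payload, hashlib.sha1).digest()


def verify(key: bytes, seq: int, payload: bytes, tag: bytes) -> bool:
    """Return True only if the payload still matches its tag."""
    return hmac.compare_digest(mic(key, seq, payload), tag)


key = b"k" * 20                      # hypothetical session MAC key
payload = b"data read from the socket"
tag = mic(key, 7, payload)

# Simulate one byte arriving stale or incoherent.
corrupted = bytearray(payload)
corrupted[3] ^= 0x01

intact_ok = verify(key, 7, payload, tag)
corrupted_ok = verify(key, 7, bytes(corrupted), tag)
```

A single flipped bit in the received data makes verification fail, so incoherent socket data would be expected to show up as dropped connections rather than as silent corruption.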
I'm also coming to the conclusion that the aggressive prefetch theory
isn't entirely accurate. It fits the I/D incoherency theory because we
get I cache prefetches on D cache TLB entries (because of the combined
I/D tlb) then we only flush the D cache because we don't expect I data
there (all our data regions currently seem to be executable as well).
However, for D cache incoherence alone, it seems implausible because we
have to have a tlb entry to move in across a page boundary, and every
tlb entry can only be inserted from a pte (and for every pte we will
flush the cache on object destruction).
One of the suspicious monsters in our code is the PAGE_FLUSH setting,
which allows tlb re-insertion after linux thinks it has been cleared in
violation of the linux tlb philosophy. However, those mappings are
supposed to be "flush only" and, since the algorithm had a hole in it, I
thought I fixed it not to need PAGE_FLUSH entries (even though we keep
them around). Regardless, tmpalias flushing completely eliminates any
window we have in this regard, and, as pa8800 still doesn't work, I
think I have to conclude it's not even this.
So, the final thing we're left with is a missed or elided flush
somewhere in the linux code, which is going to be extremely hard to
find.
James
* Re: [parisc-linux] The problem on the PA8800 is all in the data-cache.
2006-07-24 2:33 ` James Bottomley
@ 2006-07-24 2:54 ` Thibaut VARENE
2006-07-24 3:32 ` Matthew Wilcox
[not found] ` <1153711459.1235.13.camel@mulgrave.il.steeleye.com>
2006-07-24 14:51 ` James Bottomley
1 sibling, 2 replies; 23+ messages in thread
From: Thibaut VARENE @ 2006-07-24 2:54 UTC (permalink / raw)
To: James Bottomley; +Cc: parisc-linux
On 7/23/06, James Bottomley <James.Bottomley@steeleye.com> wrote:
> Carlos wrote:
> > Grant expressed worry that "Pattern 1" was indicative of a dma sync
> > problem with the network socket read.
>
> I'm still dubious about this one ... even if we agree it's a D cache
> issue, it's definitely a D cache issue affecting program execution (i.e.
> function pointers or call indirection). The data coming out of the
> network pipe for ssh never finds its way into the execution stream,
> which means it's unlikely to affect these areas. Additionally, ssh has
> message integrity checks which fail noisily (i.e. the network data is
> verified against a secure hash before it's used). So, if we had
> incoherent data from the pipe, I would expect to see periodic MIC
> failures, which we don't see.
Actually, on some occasions, the sshd would kill the incoming connection
with "bad packet length" and "invalid hash packet" and all sorts of
nasty error messages. And we made sure that these messages
were sent by the _server_, not the _client_...
My take is that we see that bug so much more on pa8800 because of its
huge cache and thus because we hit cache much more often than on all
other machines...
Still investigating this, I'm about to bring my rp3440 back online ;)
HTH
T-Bone
--
Thibaut VARENE
http://www.parisc-linux.org/~varenet/
* Re: [parisc-linux] The problem on the PA8800 is all in the data-cache.
2006-07-24 2:54 ` Thibaut VARENE
@ 2006-07-24 3:32 ` Matthew Wilcox
2006-07-24 4:15 ` Thibaut VARENE
2006-07-24 14:58 ` John David Anglin
[not found] ` <1153711459.1235.13.camel@mulgrave.il.steeleye.com>
1 sibling, 2 replies; 23+ messages in thread
From: Matthew Wilcox @ 2006-07-24 3:32 UTC (permalink / raw)
To: Thibaut VARENE; +Cc: James Bottomley, parisc-linux
On Sun, Jul 23, 2006 at 10:54:38PM -0400, Thibaut VARENE wrote:
> My take is that we see that bug so much more on pa8800 because of its
> huge cache and thus because we hit cache much more often than on all
> other machines...
I don't think so. pa8800 has less cache per core than pa8700. The L2
cache is ignorable for the purposes of this scenario, since it's
transparent to software.
* Re: [parisc-linux] The problem on the PA8800 is all in the data-cache.
2006-07-24 3:32 ` Matthew Wilcox
@ 2006-07-24 4:15 ` Thibaut VARENE
[not found] ` <1153750204.1235.18.camel@mulgrave.il.steeleye.com>
2006-07-24 14:58 ` John David Anglin
1 sibling, 1 reply; 23+ messages in thread
From: Thibaut VARENE @ 2006-07-24 4:15 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: James Bottomley, parisc-linux
On 7/23/06, Matthew Wilcox <matthew@wil.cx> wrote:
> On Sun, Jul 23, 2006 at 10:54:38PM -0400, Thibaut VARENE wrote:
> > My take is that we see that bug so much more on pa8800 because of its
> > huge cache and thus because we hit cache much more often than on all
> > other machines...
>
> I don't think so. pa8800 has less cache per core than pa8700. The L2
> cache is ignorable for the purposes of this scenario, since it's
> transparent to software.
I don't think so :)
On B180, 1MB cache addon is "transparent" and the kernel doesn't see
it at all (/proc/cpuinfo doesn't even show it).
On pa8800, the kernel sees L2 (/proc/cpuinfo shows it) and computes
the flush routines based on L2 size, so I really don't understand how
it is transparent...
Which is why I'm making the point that either our cache flush
computations are wrong on pa8800 (and ggg says they aren't), or the L2
is /not/ transparent and what I said previously wrt huge cache size
should make some sense, shouldn't it?
HTH
T-Bone
--
Thibaut VARENE
http://www.parisc-linux.org/~varenet/
* Re: [parisc-linux] The problem on the PA8800 is all in the data-cache.
2006-07-24 3:32 ` Matthew Wilcox
2006-07-24 4:15 ` Thibaut VARENE
@ 2006-07-24 14:58 ` John David Anglin
1 sibling, 0 replies; 23+ messages in thread
From: John David Anglin @ 2006-07-24 14:58 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: James.Bottomley, parisc-linux, T-Bone
> On Sun, Jul 23, 2006 at 10:54:38PM -0400, Thibaut VARENE wrote:
> > My take is that we see that bug so much more on pa8800 because of its
> > huge cache and thus because we hit cache much more often than on all
> > other machines...
>
> I don't think so. pa8800 has less cache per core than pa8700. The L2
> cache is ignorable for the purposes of this scenario, since it's
> transparent to software.
It is visible. You see it in the size returned for the D cache by
PDC_CACHE.
Dave
--
J. David Anglin dave.anglin@nrc-cnrc.gc.ca
National Research Council of Canada (613) 990-0752 (FAX: 952-6602)
[parent not found: <1153711459.1235.13.camel@mulgrave.il.steeleye.com>]
* Re: [parisc-linux] The problem on the PA8800 is all in the data-cache.
[not found] ` <1153711459.1235.13.camel@mulgrave.il.steeleye.com>
@ 2006-07-24 4:26 ` Thibaut VARENE
2006-07-24 4:31 ` Thibaut VARENE
0 siblings, 1 reply; 23+ messages in thread
From: Thibaut VARENE @ 2006-07-24 4:26 UTC (permalink / raw)
To: James Bottomley; +Cc: parisc-linux
On 7/23/06, James Bottomley <James.Bottomley@steeleye.com> wrote:
> On Sun, 2006-07-23 at 22:54 -0400, Thibaut VARENE wrote:
> > Actually on some occasion, the sshd would kill the incoming connection
> > with "bad packet length" and "invalid hash packet" and all sorts of
> > various nasty error messages. And we made sure that these messages
> > were sent by the _server_, not the _client_...
> >
> > My take is that we see that bug so much more on pa8800 because of its
> > huge cache and thus because we hit cache much more often than on all
> > other machines...
> >
> > Still investigating this, i'm about to bring back online my rp3440 ;)
>
> I can get a stable ssh connection to a pa8800 ... once established, I
> don't ever see them close for MIC problems.
I'm totally amazed. Carlos and I could never get a single established
connection...
> To get one going, you just start several remote ssh's at once, so this
> would tend to indicate that it's some type of timing issue connected to
> the fork.
Well, I can consistently crash a remote telnet session if I output too
much data in the terminal (e.g. running dmesg or ls -lR). I doubt this
is fork related...
Another point worth mentioning is that sshd is very likely mlocking
stuff so that it doesn't get swapped. Actually, it's been a while
since I last looked at the sshd source, but I'd bet it keeps mostly
everything in RAM, and given the size of the cache on pa8800, blah
blah see my other mail, I'm exhausted ;)
HTH
T-Bone
--
Thibaut VARENE
http://www.parisc-linux.org/~varenet/
* Re: [parisc-linux] The problem on the PA8800 is all in the data-cache.
2006-07-24 4:26 ` Thibaut VARENE
@ 2006-07-24 4:31 ` Thibaut VARENE
0 siblings, 0 replies; 23+ messages in thread
From: Thibaut VARENE @ 2006-07-24 4:31 UTC (permalink / raw)
To: James Bottomley; +Cc: parisc-linux
On 7/24/06, Thibaut VARENE <T-Bone@parisc-linux.org> wrote:
> On 7/23/06, James Bottomley <James.Bottomley@steeleye.com> wrote:
> > On Sun, 2006-07-23 at 22:54 -0400, Thibaut VARENE wrote:
> > > Actually on some occasion, the sshd would kill the incoming connection
> > > with "bad packet length" and "invalid hash packet" and all sorts of
> > > various nasty error messages. And we made sure that these messages
> > > were sent by the _server_, not the _client_...
> > >
> > > My take is that we see that bug so much more on pa8800 because of its
> > > huge cache and thus because we hit cache much more often than on all
> > > other machines...
> > >
> > > Still investigating this, i'm about to bring back online my rp3440 ;)
> >
> > I can get a stable ssh connection to a pa8800 ... once established, I
> > don't ever see them close for MIC problems.
>
> I'm totally amazed. Carlos and I could never get a single established
> connection...
Another quick dump: yesterday, Carlos and I experienced a totally
different behaviour from our statically linked sshd. It wouldn't even
start... Grant, on the other hand, first kept being bounced off ioz and
then suddenly started to consistently get passwd prompts, but couldn't
log in... Dunno if that rings any bells.
Carlos will probably have more to say about what we went through
yesterday; he took notes ;)
HTH
T-Bone
--
Thibaut VARENE
http://www.parisc-linux.org/~varenet/
* Re: [parisc-linux] The problem on the PA8800 is all in the data-cache.
2006-07-24 2:33 ` James Bottomley
2006-07-24 2:54 ` Thibaut VARENE
@ 2006-07-24 14:51 ` James Bottomley
1 sibling, 0 replies; 23+ messages in thread
From: James Bottomley @ 2006-07-24 14:51 UTC (permalink / raw)
To: Carlos O'Donell; +Cc: parisc-linux
On Sun, 2006-07-23 at 22:33 -0400, James Bottomley wrote:
> > Grant expressed worry that "Pattern 1" was indicative of a dma sync
> > problem with the network socket read.
>
> I'm still dubious about this one ... even if we agree it's a D cache
> issue, it's definitely a D cache issue affecting program execution (i.e.
> function pointers or call indirection). The data coming out of the
> network pipe for ssh never finds its way into the execution stream,
> which means it's unlikely to affect these areas. Additionally, ssh has
> message integrity checks which fail noisily (i.e. the network data is
> verified against a secure hash before it's used). So, if we had
> incoherent data from the pipe, I would expect to see periodic MIC
> failures, which we don't see.
Let me back up on this one. I still don't think it's a DMA sync issue.
However, it could be a different D incoherency issue. Because the linux
kernel operates with kernel to user aliases (i.e. the user address of a
page is rarely congruent to the kernel address of a page) it is possible
to generate D incoherency by missing a flush when a kernel page is
reclaimed (i.e. freed).
The scenario that resonates nicely with all this has to do with the
skbuff allocation and copying. Because the network read path isn't zero
copy, we do intermediate copies into skbuff areas before eventually
sending the data to the user socket. The idea is that the skbuff is
freed and then reallocated to the user process in the fault (this gives
us the necessary same physical index). If the kernel address of the
skbuff were accidentally congruent to the fault address, we'd actually
see the skbuff data instead of the underlying page data if it weren't
flushed. The problem, as usual, is that this isn't pa8800 specific ...
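The congruence condition in this scenario can be sketched as a comparison of cache "colours". The snippet assumes a 4 KB page size and a 4 MB alias boundary (the value parisc Linux uses for SHMLBA, as far as I recall); the kernel addresses are hypothetical, and only 0x4000b000 is taken from the strace output earlier in the thread.

```python
PAGE_SIZE = 4096                   # assumption: 4 KB pages
ALIAS_BOUNDARY = 4 * 1024 * 1024   # assumption: 4 MB alias boundary (SHMLBA)


def cache_colour(vaddr: int) -> int:
    """Page index of vaddr within its alias-boundary window."""
    return (vaddr % ALIAS_BOUNDARY) // PAGE_SIZE


def congruent(a: int, b: int) -> bool:
    """True if two virtual addresses land on the same cache colour."""
    return cache_colour(a) == cache_colour(b)


user_addr = 0x4000B000         # mmap result from the strace output
kernel_same = 0x1040B000       # hypothetical kernel address, same colour
kernel_other = 0x1040C000      # hypothetical kernel address, next colour

hit = congruent(user_addr, kernel_same)
miss = congruent(user_addr, kernel_other)
```

Only in the congruent case would stale skbuff lines left in the cache be visible through the user mapping, which is the accidental coincidence the scenario depends on.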
James