From: Petr Tesarik <ptesarik@suse.cz>
To: Tony Luck <tony.luck@gmail.com>
Cc: "linux-ia64@vger.kernel.org" <linux-ia64@vger.kernel.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: Serious problem with ticket spinlocks on ia64
Date: Mon, 6 Sep 2010 16:47:01 +0200
Message-ID: <201009061647.02345.ptesarik@suse.cz>
In-Reply-To: <AANLkTikZRiQ5jCpgp42LxtXb_S52sSUy7KP_=vZ3z4Tf@mail.gmail.com>
On Friday 03 of September 2010 17:50:53 Tony Luck wrote:
> On Fri, Sep 3, 2010 at 7:52 AM, Petr Tesarik <ptesarik@suse.cz> wrote:
> > Anyway, if a global TLB flush is necessary to trigger the bug, it would
> > also explain why we couldn't reproduce it in user-space.
>
> Perhaps ... I did explore the TLB in one variant of my user mode test I
> added a pointer-chasing routine that looked at enough pages to clear out
> the TLB. Not quite the same as a flush - but close. It didn't help at all.
Hi Tony,
I experimented a lot with the code, trying to find a solution, but all in
vain. I also tried adding a "dep %0=0,%0,15,2" instruction to the cmpxchg4
loop in __ticket_spin_lock, but it still failed when the lock value wrapped
around to zero (but now the high word was not even touched).
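For reference, the immediate form of the ia64 dep instruction deposits a run
of zeros (or ones) into a register, so "dep %0=0,%0,15,2" clears the two bits
at positions 15 and 16 and leaves the rest alone. A minimal C sketch of the
same operation follows; the names PAD_POS, PAD_LEN, PAD_MASK and
clear_pad_bits are made up for illustration and do not come from the kernel
source.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; only the position (15) and the length (2) are
 * taken from the instruction quoted above. */
#define PAD_POS  15u
#define PAD_LEN  2u
#define PAD_MASK (((UINT32_C(1) << PAD_LEN) - 1u) << PAD_POS)

static_assert(PAD_MASK == UINT32_C(0x18000), "mask covers bits 15 and 16");

/* C equivalent of "dep r=0,r,15,2": zero bits 15..16, keep all others. */
static inline uint32_t clear_pad_bits(uint32_t v)
{
	return v & ~PAD_MASK;
}
```

For example, a carry out of bits 0..14 (0x7fff + 1 = 0x8000) lands in bit 15,
and clear_pad_bits() discards it, giving 0.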
Replacing the "st2.rel" instruction with a similar cmpxchg4 loop in
__ticket_spin_unlock did not help either (so we no longer have two accesses
with different sizes).
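The unlock variant described above can be sketched in portable C11 atomics.
This is an illustration only, not the code from the test kernel: it assumes
that the now-serving counter occupies the bits above the two padding bits
(SERVE_SHIFT = 17, inferred from the dep position rather than stated in this
message), and unlock_with_cas is a hypothetical name. The point is that the
release becomes a full 32-bit compare-and-swap, the same access size the lock
path uses, instead of a 2-byte store.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* ASSUMED layout, for illustration only: the "now serving" counter
 * starts at bit SERVE_SHIFT of the 32-bit lock word. */
#define SERVE_SHIFT 17u
#define SERVE_INC   (UINT32_C(1) << SERVE_SHIFT)

static_assert(SERVE_SHIFT < 32u, "shift must fit the 32-bit lock word");

/* Release the lock by bumping the now-serving field with a 4-byte
 * compare-and-swap loop rather than a halfword store. */
static inline void unlock_with_cas(_Atomic uint32_t *lock)
{
	uint32_t old = atomic_load_explicit(lock, memory_order_relaxed);
	uint32_t desired;

	do {
		/* Unsigned arithmetic wraps, so the counter rolls over cleanly. */
		desired = old + SERVE_INC;
		/* On failure, 'old' is refreshed with the current lock word. */
	} while (!atomic_compare_exchange_weak_explicit(lock, &old, desired,
							memory_order_release,
							memory_order_relaxed));
}
```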
What I've seen quite often lately is that the spinlock value is read as "0" by
the ld4.acq in __ticket_spin_lock(), then as "1" by ld4.acq inside the debug
fault handler, and then as "0" again by the "cmpxchg4" instruction, i.e. the
spin lock was actually acquired correctly, but the debug code triggered a
panic. This made me think that I had an error in my debug code, so I tried
running that test kernel without the probe, just waiting to see whether the
kernel hangs. It did hang within 10 minutes (with 6 parallel test case loops
and a module load/unload loop on another terminal) and produced a crash dump
that was very similar to all the others.
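To make the three reads concrete, here is a small C sketch that decodes a
lock word. The layout is an assumption that is not stated in this message:
the next-ticket counter in bits 0..14, two padding bits, and the now-serving
counter in bits 17..31. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* ASSUMED layout: next ticket in bits 0..14, padding in bits 15..16,
 * now serving in bits 17..31. */
#define TICKET_SHIFT 17u
#define TICKET_MASK  ((UINT32_C(1) << 15) - 1u)

static_assert(TICKET_MASK == UINT32_C(0x7fff), "15-bit ticket fields");

static inline uint32_t next_ticket(uint32_t v)
{
	return v & TICKET_MASK;
}

static inline uint32_t now_serving(uint32_t v)
{
	return (v >> TICKET_SHIFT) & TICKET_MASK;
}

/* The lock is free when nobody holds a ticket that is still waiting. */
static inline bool lock_is_free(uint32_t v)
{
	return next_ticket(v) == now_serving(v);
}
```

Under that assumption a word of 0 decodes as free and a word of 1 as held by
ticket 0. Neither a lock (which bumps the low field) nor an unlock (which
bumps the high field) turns a 1 back into a 0 in one step.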
To sum it up:
1. The ld4.acq and fetchadd.acq instructions fail to give us a coherent view
of the spinlock memory location.
2. So far, the problem has been observed only after the spinlock value changes
to zero.
3. It cannot be a random memory scribble, because I employed the DBR registers
to catch all writes to that memory location.
4. We haven't been able to reproduce the problem in user-space.
Frankly, I think that the processor does not follow the IPF specification,
hence it is a CPU bug.
But let's be extremely cautious here and re-read the specification once more,
very carefully. We can still miss some writes to the siglock memory location:
1. if the same physical address is accessible with another virtual address
2. if the siglock location is written by a non-mandatory RSE-spill
Option 2 seems extremely unlikely to me. Option 1 is more plausible, but
given that I never saw any siglock corruption with a value other than zero,
it still sounds less likely than a pure CPU bug.
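Option 1 can be illustrated in user space. The sketch below maps the same
page of a temporary file at two virtual addresses; a store through one
mapping is visible through the other, so a breakpoint that matches only the
second virtual address would not fire on that store (assuming, as option 1
does, that the breakpoint matches virtual addresses). This is only an
analogy: read_through_alias is a hypothetical helper, and kernel mappings are
set up quite differently.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map one page of a temporary file twice, store 'value' through the
 * first mapping and return what the second mapping reads back.
 * Returns -1 if the setup fails. */
static int read_through_alias(int value)
{
	int seen = -1;
	FILE *f = tmpfile();

	if (!f)
		return -1;

	int fd = fileno(f);
	long page = sysconf(_SC_PAGESIZE);

	if (page > 0 && ftruncate(fd, page) == 0) {
		void *pa = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
				MAP_SHARED, fd, 0);
		void *pb = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
				MAP_SHARED, fd, 0);

		if (pa != MAP_FAILED && pb != MAP_FAILED) {
			assert(pa != pb);		/* two distinct virtual addresses */
			*(volatile int *)pa = value;	/* write via the first alias */
			seen = *(volatile int *)pb;	/* read via the second alias */
		}
		if (pa != MAP_FAILED)
			munmap(pa, (size_t)page);
		if (pb != MAP_FAILED)
			munmap(pb, (size_t)page);
	}
	fclose(f);
	return seen;
}
```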
Tony, could you please ask around in Intel if there is any way to debug the
CPU that would help us spot the real cause?
Petr Tesarik