From: Don Zickus <dzickus@redhat.com>
To: ZAK Magnus <zakmagnus@google.com>
Cc: linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2] Track hard and soft "short lockups" or "stalls."
Date: Thu, 21 Jul 2011 10:51:12 -0400 [thread overview]
Message-ID: <20110721145112.GA24016@redhat.com> (raw)
In-Reply-To: <CAAuSN90cv8iz+EMaVP5QQREs8sQC+m4u55nMLOFXKuFXxpX8Cw@mail.gmail.com>
On Wed, Jul 20, 2011 at 02:15:02PM -0700, ZAK Magnus wrote:
> On Wed, Jul 20, 2011 at 2:07 PM, Don Zickus <dzickus@redhat.com> wrote:
> > On Wed, Jul 20, 2011 at 12:41:39PM -0700, ZAK Magnus wrote:
> >> Are the stack traces very different? I don't understand in what sense
> >> it's confusing.
> >
> > The fact that there are 3 of them telling me the samething. Most people
> > look at the first stack trace to figure out what is going on and will just
> > notice the warning. They might completely miss the HARDLOCKUP message on
> > the third stack trace down.
> >
> > It just looked odd when I ran it the first time. I feel like I would
> > constantly be trying to educate people on why we do it like that.
> Oh, okay. So, maybe the stall warnings should say something like, "a
> lockup might happen soon?" Would that help? I don't know.
Perhaps just changing the wording might help with the confusion. After
reading below it might be hard to restrict the dump_stack to only the
highest stall (as opposed to printing each new worst hardstall as it
increments to a hardlockup).
>
> >> I don't think that exact patch would work (wouldn't it cause
> >> update_hardstall to only ever be called with 0 as its first argument?)
> >> but I hope I still understand what you're saying. You're saying stalls
> >> should only be recorded once they're finished, right? I don't know if
> >> this is the best approach. If we wait until interrupts stop being
> >> missed, it means the code could have exited whatever section caused
> >> the stall to begin with. Maybe your data indicates otherwise, but I
> >> would think this means the stack trace would not really be
> >
> > Crap. good point.
> Which part, exactly?
The part about losing the stalled section when the code resets the missed
interrupts back to 0.
>
> >> informative. It's one thing to know a stall occurs, but its occurrence
> >> is generally reflective of a bug or a suboptimal section, so it would
> >> be good to know where that is in order to try and fix it.
> >>
> >> For soft stalls, I think the same is true. Also, since the soft lockup
> >> system just relies on checking a timestamp compared to now, it can't
> >> know how long a stall was after it has already finished. The hard
> >> system only knows because it keeps a running count of the number of
> >> failed checks. An additional timestamp could be introduced and the
> >> difference between the two retroactively checked in order to reproduce
> >> this, but the stack trace issue would still apply. Also, while not
> >> hugely complex, the change would be more significant than the sort
> >> your patch presents.
> >>
> >> The bottom line is that I think catching a stall in progress is the
> >> most informative thing to do, and I don't understand the downsides of
> >> doing so. Could you please explain them?
> >>
> >> On another note, I'm working on a patch on top of this one which would
> >> change the hard lockup system to be more like the soft lockup system.
> >> It would use a timestamp as well, so it can have a more exact read on
> >> how long the timer has been delayed. This adds resolution and gets rid
> >> of that problem where it can only report missed = 3 or 4. Any
> >> preliminary comments? Or should I just put the patch up before
> >> discussing it?
> >
> > That might work. I would have to see the patch. What clock would you use
> > to read the time? I don't think you can use 'now' if interrupts are
> > disabled.
> Okay, I will send it when it seems ready. For the timestamp, I was
> just using the get_timestamp function that's defined in the file,
> which calls cpu_clock(). Is there a better way?
Oh right. I forgot Peter created some abstraction called cpu_clock().
That should work I suppose, it's NMI safe according to various code
comments. I just wonder if the clock gets updated if interrupts are
accidentally disabled for a long time. Though TSC doesn't care on x86
about interrupts.
Cheers,
Don
prev parent reply other threads:[~2011-07-21 14:51 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2011-07-15 20:11 [PATCH v2] Track hard and soft "short lockups" or "stalls." Alex Neronskiy
2011-07-18 12:28 ` Don Zickus
[not found] ` <CAAuSN9106qmYF27oRrfUBtqwOmSQgDJWwv3iz_NmTTuYNEymHA@mail.gmail.com>
2011-07-20 15:41 ` Don Zickus
2011-07-20 19:41 ` ZAK Magnus
2011-07-20 21:07 ` Don Zickus
2011-07-20 21:15 ` ZAK Magnus
2011-07-21 14:51 ` Don Zickus [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20110721145112.GA24016@redhat.com \
--to=dzickus@redhat.com \
--cc=linux-kernel@vger.kernel.org \
--cc=zakmagnus@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox