From: Don Zickus <dzickus@redhat.com>
To: ZAK Magnus <zakmagnus@google.com>
Cc: linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2] Track hard and soft "short lockups" or "stalls."
Date: Thu, 21 Jul 2011 10:51:12 -0400 [thread overview]
Message-ID: <20110721145112.GA24016@redhat.com> (raw)
In-Reply-To: <CAAuSN90cv8iz+EMaVP5QQREs8sQC+m4u55nMLOFXKuFXxpX8Cw@mail.gmail.com>
On Wed, Jul 20, 2011 at 02:15:02PM -0700, ZAK Magnus wrote:
> On Wed, Jul 20, 2011 at 2:07 PM, Don Zickus <dzickus@redhat.com> wrote:
> > On Wed, Jul 20, 2011 at 12:41:39PM -0700, ZAK Magnus wrote:
> >> Are the stack traces very different? I don't understand in what sense
> >> it's confusing.
> >
> > The fact that there are 3 of them telling me the samething. Most people
> > look at the first stack trace to figure out what is going on and will just
> > notice the warning. They might completely miss the HARDLOCKUP message on
> > the third stack trace down.
> >
> > It just looked odd when I ran it the first time. I feel like I would
> > constantly be trying to educate people on why we do it like that.
> Oh, okay. So, maybe the stall warnings should say something like, "a
> lockup might happen soon?" Would that help? I don't know.
Perhaps just changing the wording might help with the confusion. After
reading below it might be hard to restrict the dump_stack to only the
highest stall (as opposed to printing each new worst hardstall as it
increments to a hardlockup).
>
> >> I don't think that exact patch would work (wouldn't it cause
> >> update_hardstall to only ever be called with 0 as its first argument?)
> >> but I hope I still understand what you're saying. You're saying stalls
> >> should only be recorded once they're finished, right? I don't know if
> >> this is the best approach. If we wait until interrupts stop being
> >> missed, it means the code could have exited whatever section caused
> >> the stall to begin with. Maybe your data indicates otherwise, but I
> >> would think this means the stack trace would not really be
> >
> > Crap. good point.
> Which part, exactly?
The part about losing the stalled section when the code resets the missed
interrupts back to 0.
>
> >> informative. It's one thing to know a stall occurs, but its occurrence
> >> is generally reflective of a bug or a suboptimal section, so it would
> >> be good to know where that is in order to try and fix it.
> >>
> >> For soft stalls, I think the same is true. Also, since the soft lockup
> >> system just relies on checking a timestamp compared to now, it can't
> >> know how long a stall was after it has already finished. The hard
> >> system only knows because it keeps a running count of the number of
> >> failed checks. An additional timestamp could be introduced and the
> >> difference between the two retroactively checked in order to reproduce
> >> this, but the stack trace issue would still apply. Also, while not
> >> hugely complex, the change would be more significant than the sort
> >> your patch presents.
> >>
> >> The bottom line is that I think catching a stall in progress is the
> >> most informative thing to do, and I don't understand the downsides of
> >> doing so. Could you please explain them?
> >>
> >> On another note, I'm working on a patch on top of this one which would
> >> change the hard lockup system to be more like the soft lockup system.
> >> It would use a timestamp as well, so it can have a more exact read on
> >> how long the timer has been delayed. This adds resolution and gets rid
> >> of that problem where it can only report missed = 3 or 4. Any
> >> preliminary comments? Or should I just put the patch up before
> >> discussing it?
> >
> > That might work. I would have to see the patch. What clock would you use
> > to read the time? I don't think you can use 'now' if interrupts are
> > disabled.
> Okay, I will send it when it seems ready. For the timestamp, I was
> just using the get_timestamp function that's defined in the file,
> which calls cpu_clock(). Is there a better way?
Oh right. I forgot Peter created some abstraction called cpu_clock().
That should work I suppose, it's NMI safe according to various code
comments. I just wonder if the clock gets updated if interrupts are
accidentally disabled for a long time. Though TSC doesn't care on x86
about interrupts.
Cheers,
Don
prev parent reply other threads:[~2011-07-21 14:51 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2011-07-15 20:11 [PATCH v2] Track hard and soft "short lockups" or "stalls." Alex Neronskiy
2011-07-18 12:28 ` Don Zickus
[not found] ` <CAAuSN9106qmYF27oRrfUBtqwOmSQgDJWwv3iz_NmTTuYNEymHA@mail.gmail.com>
2011-07-20 15:41 ` Don Zickus
2011-07-20 19:41 ` ZAK Magnus
2011-07-20 21:07 ` Don Zickus
2011-07-20 21:15 ` ZAK Magnus
2011-07-21 14:51 ` Don Zickus [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20110721145112.GA24016@redhat.com \
--to=dzickus@redhat.com \
--cc=linux-kernel@vger.kernel.org \
--cc=zakmagnus@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.