Re: [PATCH v2] Track hard and soft "short lockups" or "stalls."

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

From: Don Zickus <dzickus@redhat.com>
To: ZAK Magnus <zakmagnus@google.com>
Cc: linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2] Track hard and soft "short lockups" or "stalls."
Date: Thu, 21 Jul 2011 10:51:12 -0400	[thread overview]
Message-ID: <20110721145112.GA24016@redhat.com> (raw)
In-Reply-To: <CAAuSN90cv8iz+EMaVP5QQREs8sQC+m4u55nMLOFXKuFXxpX8Cw@mail.gmail.com>

On Wed, Jul 20, 2011 at 02:15:02PM -0700, ZAK Magnus wrote:
> On Wed, Jul 20, 2011 at 2:07 PM, Don Zickus <dzickus@redhat.com> wrote:
> > On Wed, Jul 20, 2011 at 12:41:39PM -0700, ZAK Magnus wrote:
> >> Are the stack traces very different? I don't understand in what sense
> >> it's confusing.
> >
> > The fact that there are 3 of them telling me the samething.  Most people
> > look at the first stack trace to figure out what is going on and will just
> > notice the warning.  They might completely miss the HARDLOCKUP message on
> > the third stack trace down.
> >
> > It just looked odd when I ran it the first time.  I feel like I would
> > constantly be trying to educate people on why we do it like that.
> Oh, okay. So, maybe the stall warnings should say something like, "a
> lockup might happen soon?" Would that help? I don't know.

Perhaps just changing the wording might help with the confusion.  After
reading below it might be hard to restrict the dump_stack to only the
highest stall (as opposed to printing each new worst hardstall as it
increments to a hardlockup).

> 
> >> I don't think that exact patch would work (wouldn't it cause
> >> update_hardstall to only ever be called with 0 as its first argument?)
> >> but I hope I still understand what you're saying. You're saying stalls
> >> should only be recorded once they're finished, right? I don't know if
> >> this is the best approach. If we wait until interrupts stop being
> >> missed, it means the code could have exited whatever section caused
> >> the stall to begin with. Maybe your data indicates otherwise, but I
> >> would think this means the stack trace would not really be
> >
> > Crap.  good point.
> Which part, exactly?

The part about losing the stalled section when the code resets the missed
interrupts back to 0.

> 
> >> informative. It's one thing to know a stall occurs, but its occurrence
> >> is generally reflective of a bug or a suboptimal section, so it would
> >> be good to know where that is in order to try and fix it.
> >>
> >> For soft stalls, I think the same is true. Also, since the soft lockup
> >> system just relies on checking a timestamp compared to now, it can't
> >> know how long a stall was after it has already finished. The hard
> >> system only knows because it keeps a running count of the number of
> >> failed checks. An additional timestamp could be introduced and the
> >> difference between the two retroactively checked in order to reproduce
> >> this, but the stack trace issue would still apply. Also, while not
> >> hugely complex, the change would be more significant than the sort
> >> your patch presents.
> >>
> >> The bottom line is that I think catching a stall in progress is the
> >> most informative thing to do, and I don't understand the downsides of
> >> doing so. Could you please explain them?
> >>
> >> On another note, I'm working on a patch on top of this one which would
> >> change the hard lockup system to be more like the soft lockup system.
> >> It would use a timestamp as well, so it can have a more exact read on
> >> how long the timer has been delayed. This adds resolution and gets rid
> >> of that problem where it can only report missed = 3 or 4. Any
> >> preliminary comments? Or should I just put the patch up before
> >> discussing it?
> >
> > That might work.  I would have to see the patch.  What clock would you use
> > to read the time?  I don't think you can use 'now' if interrupts are
> > disabled.
> Okay, I will send it when it seems ready. For the timestamp, I was
> just using the get_timestamp function that's defined in the file,
> which calls cpu_clock(). Is there a better way?

Oh right.  I forgot Peter created some abstraction called cpu_clock().
That should work I suppose, it's NMI safe according to various code
comments.  I just wonder if the clock gets updated if interrupts are
accidentally disabled for a long time.  Though TSC doesn't care on x86
about interrupts.

Cheers,
Don

     prev parent reply	other threads:[~2011-07-21 14:51 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-07-15 20:11 [PATCH v2] Track hard and soft "short lockups" or "stalls." Alex Neronskiy
2011-07-18 12:28 ` Don Zickus
     [not found]   ` <CAAuSN9106qmYF27oRrfUBtqwOmSQgDJWwv3iz_NmTTuYNEymHA@mail.gmail.com>
2011-07-20 15:41     ` Don Zickus
2011-07-20 19:41       ` ZAK Magnus
2011-07-20 21:07         ` Don Zickus
2011-07-20 21:15           ` ZAK Magnus
2011-07-21 14:51             ` Don Zickus [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20110721145112.GA24016@redhat.com \
    --to=dzickus@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=zakmagnus@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox