Re: Possible sandybridge livelock issue

From: Ingo Molnar <mingo@elte.hu>
To: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Andi Kleen <andi@firstfloor.org>,
	Christoph Lameter <cl@linux.com>,
	x86@kernel.org, linux-mm <linux-mm@kvack.org>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	Mel Gorman <mgorman@suse.de>
Subject: Re: Possible sandybridge livelock issue
Date: Mon, 16 May 2011 08:52:20 +0200
Message-ID: <20110516065220.GB24836@elte.hu>
In-Reply-To: <1305312552.2611.66.camel@mulgrave.site>

* James Bottomley <James.Bottomley@HansenPartnership.com> wrote:

> > Can you figure out better what the kswapd is doing?
> 
> We have ... it was the thread in the first email.  We don't need a fix for 
> the kswapd issue, what we're warning about is a potential sandybridge 
> problem.
> 
> The facts are that only sandybridge systems livelocked in the kswapd problem 
> ... no other systems could reproduce it, although they did see heavy CPU time 
> accumulate to kswapd.  And this is with a gang of mm people trying to 
> reproduce the problem on non-sandybridge systems.
> 
> On the sandybridge systems that livelocked, it was sometimes possible to 
> release the lock by pushing kswapd off the cpu it was hogging.

It's not uncommon at all to see certain races (or even livelocks) only with the 
latest and greatest CPUs.

I have a first-gen CPU system that, when i got it a couple of years ago, 
triggered something like a dozen Linux kernel races and bugs that were 
theoretically possible on all other CPUs but had not been reported on any 
other Linux system up to that point, *ever* - and some of those bugs were many 
years old.

> If you think the theory about why this happend to be wrong, fine ... come up 
> with another one.  The facts are as above and only sandybridge systems seem 
> to be affected.

I can see at least four other plausible hypotheses, all matching the facts as 
you laid them out:

 - it could be a bug/race in the kswapd code.

 - it could be that the race window needs a certain level of instruction 
   parallelism - which occurs with a higher likelihood on Sandybridge.

 - it could be that Sandybridge CPUs keep dirty cachelines owned a bit longer 
   than other CPUs, making an existing livelock bug in the kernel code easier 
   to trigger.

 - it could be a hardware bug: cacheline ownership might not be arbitrated 
   between nodes/cpus/cores fairly (enough), so that a specific CPU can 
   monopolize a cacheline for a very long time if only it keeps modifying it 
   in an aggressive enough kswapd loop.

Note, since each of these hypotheses has a specific non-zero chance of being 
the objective truth, your hypothesis might in the end turn out to be the right 
one and might turn into a proven scientific theory: CPU and scheduler bugs do 
happen after all.

The other hypotheses i outlined have non-zero chances as well: kswapd bugs do 
happen too, and various CPU timing differences do tend to occur.

But above you seem to be confused about how supporting facts and hypotheses 
relate to each other: you seem to imply that because your facts support your 
hypothesis the ball is somehow on the other side. As things stand now we 
clearly need more facts, to exclude more of the many possibilities.

So i wanted to clear up these basics of science first, before any of us wastes 
too much time on writing mails and such. Oh ... never mind ;-)

Thanks,

	Ingo


Thread overview: 7+ messages
2011-05-13 16:12 Possible sandybridge livelock issue James Bottomley
2011-05-13 16:36 ` Andi Kleen
2011-05-13 17:08   ` Christoph Lameter
2011-05-13 18:23     ` Andi Kleen
2011-05-13 18:49       ` James Bottomley
2011-05-16  6:52         ` Ingo Molnar [this message]
2011-05-16  6:29 ` Ingo Molnar
