linux-acpi.vger.kernel.org archive mirror
From: Andi Kleen <andi@firstfloor.org>
To: Ingo Molnar <mingo@elte.hu>
Cc: Andi Kleen <andi@firstfloor.org>,
	Huang Ying <ying.huang@intel.com>, Len Brown <lenb@kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-acpi@vger.kernel.org" <linux-acpi@vger.kernel.org>,
	Borislav Petkov <petkovbb@googlemail.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	"H. Peter Anvin" <hpa@zytor.com>, Don Zickus <dzickus@redhat.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Mauro Carvalho Chehab <mchehab@redhat.com>,
	Arjan van de Ven <arjan@infradead.org>
Subject: Re: [NAK] Re: [PATCH -v2 9/9] ACPI, APEI, Generic Hardware Error Source POLL/IRQ/NMI notification type support
Date: Mon, 25 Oct 2010 14:37:53 +0200
Message-ID: <20101025123753.GB17622@basil.fritz.box>
In-Reply-To: <20101025111530.GA27659@elte.hu>

On Mon, Oct 25, 2010 at 01:15:30PM +0200, Ingo Molnar wrote:

> > > > einj.c: it's about the 3rd separate 'error injection' concept that got 
> > > > introduced ...
> > > 
> > > EINJ is a true platform feature, not just software feature. We need to support 
> > > it to debug various hardware error features.
> > 
> > Also having multiple error injecting interfaces is a good thing.
> 
> It's never a good thing to have separate, vendor dependent interfaces for what to 
> the user is basically the same conceptual thing!

Perhaps a simple example (simplified; in practice there are more
complications) makes it clearer:

The memory error handler takes different actions depending on the
state of the page the error happens on.

To get reasonable coverage of the recovery code you need to present it with 
pages in different states (locked, clean, dirty, under IO, etc.).

Now it turns out this is very hard to do if you just inject the error
at the hardware level, because there are lots of races and problems
ensuring the page is still in the expected state when the error hits.

So one of the solutions hwpoison used for this was another injector
that works at the process level. At the process level you can get
pages into different states and reasonably cleanly inject the right 
error with the right context. This is essentially error
injection at the VM level.
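
As a rough illustration of why injecting above the hardware level
helps coverage, here is a toy user-space model. It is not the hwpoison
code: the state names, the outcome names and every function below are
made up for the sketch. The point it shows is only that a software
injector can build the page in the wanted state and inject in the same
step, so each branch of a state-dependent handler gets hit reliably.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical page states, loosely following the ones named above. */
enum page_state {
    PAGE_CLEAN,
    PAGE_DIRTY,
    PAGE_LOCKED,
    PAGE_UNDER_IO,
    PAGE_STATE_COUNT
};

/* Hypothetical outcomes; these are not the kernel's result codes. */
enum recovery { RECOVERED, DELAYED, FAILED };

struct toy_page {
    enum page_state state;
    int poisoned;
};

/* Toy handler: the action taken depends on the page state. */
static enum recovery toy_memory_failure(struct toy_page *p)
{
    p->poisoned = 1;
    switch (p->state) {
    case PAGE_CLEAN:
        return RECOVERED;  /* assumed: contents can be re-read */
    case PAGE_DIRTY:
        return FAILED;     /* assumed: the only copy is lost */
    case PAGE_LOCKED:
    case PAGE_UNDER_IO:
        return DELAYED;    /* assumed: someone else holds the page */
    default:
        return FAILED;
    }
}

/*
 * Software injector: build the page in the wanted state and inject
 * in the same step, so the state cannot change in between.
 */
static enum recovery inject_in_state(enum page_state s)
{
    struct toy_page p = { s, 0 };
    enum recovery r = toy_memory_failure(&p);

    assert(p.poisoned);
    return r;
}

/* Drive the handler once per state; returns the states covered. */
static size_t cover_all_states(enum recovery out[PAGE_STATE_COUNT])
{
    size_t n = 0;

    for (int s = 0; s < PAGE_STATE_COUNT; s++) {
        out[s] = inject_in_state((enum page_state)s);
        n++;
    }
    return n;
}
```

Calling cover_all_states() with an array of PAGE_STATE_COUNT entries
exercises every branch once. A hardware-level injector cannot promise
that, because the page state may change between setup and the error.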

Again this is simplified; for coverage we actually needed multiple
injectors that work at different entry points, for example
to make sure buffered file system pages are correctly handled too.

Now that's great, but we still need other injectors that work 
on other levels, otherwise the parts that talk to the hardware
are not covered at all.

But you cannot test all the paths of that code with a single
hardware injector either.  So there's another one that can fake different
contexts at the software level and provides reasonable
coverage of this code.

But then you still haven't tested the whole hardware to software
error path. Now yes, you could use an EDAC-like ECC bit injector
(which BTW doesn't really need a kernel driver, we did it just
fine using shell scripts before). But that also only tests
one path and not the other possibilities, and also only
works on specific hardware in specific modes with very careful
setup.

But that's just one type of error for one system.  So you need other 
interfaces for other hardware, for other errors, and so on.

In some cases you also need to talk to the BIOS to do this injection
for various reasons; that is where ACPI comes in (and all these 
acronyms you seem to object to).

Also it's not enough to do a single error injection once
on some system. You need a repeatable regression test that
ensures the error handling stays operable for kernels as
they evolve. This requires that the error injection is reasonably 
portable. 

For this I tried to have a "software only" injector for
nearly everything just to make sure the code can always be
tested. Unfortunately the software injectors, especially
the ones aiming at larger coverage, also have quite different
interface requirements than hardware interfaces.
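
A minimal sketch of how a portable regression run could cope with
that, assuming each injector comes with a probe that says whether it
works on the current machine. Everything here (struct injector, the
probe and inject callbacks, the two table entries) is hypothetical
and only models the idea; it is not an existing kernel or test suite
interface. The software-only entry is always available, so the
handling code gets tested everywhere, while hardware or firmware
backed entries are skipped where the platform lacks them.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical injector description used only for this sketch. */
struct injector {
    const char *name;
    int (*available)(void);  /* nonzero if usable on this machine */
    int (*inject)(void);     /* 0 if the error was handled as expected */
};

/* Stand-ins for real probes and real injection routines. */
static int probe_always(void) { return 1; }
static int probe_absent(void) { return 0; }
static int inject_ok(void) { return 0; }

static const struct injector toy_injectors[] = {
    { "software-only",     probe_always, inject_ok },
    { "firmware-assisted", probe_absent, inject_ok },
};

struct summary {
    int ran;
    int skipped;
    int failed;
};

/* Run every available injector, skip the rest, count failures. */
static struct summary run_regression(const struct injector *inj, size_t n)
{
    struct summary s = { 0, 0, 0 };

    assert(inj != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!inj[i].available()) {
            s.skipped++;
            continue;
        }
        s.ran++;
        if (inj[i].inject() != 0)
            s.failed++;
    }
    return s;
}
```

Passing toy_injectors and its element count to run_regression() runs
the software-only entry and records the firmware-assisted one as
skipped. A skip is not a pass: the hardware path still needs its own
run on a machine that has it.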

But again you still need to test the full hardware too,
otherwise we don't know if the error handling is really 
working on a real system in practice.

Error injection is just messy. There is no single general
solution that works for everything and solves all problems, but you
really need a pragmatic approach for every subsystem. 

In the end it means you end up with lots of different injectors,
all tied to some specific problem.

Would it be nice if there was a single great injector that covers
everything? Yes.
Is it realistic? No.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

