From: Ben Widawsky <ben@bwidawsk.net>
To: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: intel-gfx@lists.freedesktop.org
Subject: Re: [RFC] algorithm for handling bad cachelines
Date: Wed, 28 Mar 2012 11:04:21 -0700
Message-ID: <20120328110421.41b9f38a@bwidawsk.net>
In-Reply-To: <20120328102652.2a930985@jbarnes-desktop>

On Wed, 28 Mar 2012 10:26:52 -0700
Jesse Barnes <jbarnes@virtuousgeek.org> wrote:

> On Tue, 27 Mar 2012 07:19:43 -0700
> Ben Widawsky <ben@bwidawsk.net> wrote:
> 
> > I wanted to run this by folks before I start doing any actual work.
> > 
> > This is primarily for GPGPU, or perhaps *really* accurate rendering
> > requirements.
> > 
> > IVB+ has an interrupt to tell us when a cacheline seems to be going
> > bad. There is also a mechanism to remap the bad cachelines. The
> > implementation details aren't quite clear to me yet, but I'd like to
> > enable this feature for userspace.
> > 
> > Here is my current plan, but it involves filesystem access, so it's
> > probably going to get a lot of flames.
> > 
> > 1. Handle cache line going bad interrupt.
> > <After n number of these interrupts to the same line,>
> > 2. send a uevent
> > 2.5 reset the GPU (docs tell us to)
> > <On module load>
> > 3. Read  a module parameter with a path in the filesystem
> > of the list of bad lines. It's not clear to me yet exactly what I
> > need to store, but it should be a relatively simple list.
> > 4. Parse list on driver load, and handle as necessary.
> > 5. goto 1.
> > 
> > Probably the biggest unanswered question is exactly when in the HW
> > loading do we have to finish remapping. If it can happen at any time
> > while the card is running, I don't need the filesystem stuff, but I
> > believe I need to remap the lines quite early in the device
> > bootstrap.
> > 
> > The only alternative I have is a huge comma separated string for a
> > module parameter, but I kind of like reading the file better.
> > 
> > Any feedback is highly appreciated. I couldn't really find much
> > precedent for doing this in other drivers, so pointers to similar
> > things would also be highly welcome.
> 
> I think the main thing here is to make sure we handle the L3 parity
> check interrupts.  I don't think "lines going bad" will be very common
> in practice (maybe if you really abuse your CPU by putting it in the
> freezer and then into an oven or something), so having a fancy
> interface for it probably isn't too important.

I'd like to clarify my statement a bit. Thanks for pointing this out.
I've yet to enable this interrupt and see what happens, so this is all
speculative based on that document:

It seems like more lines may be susceptible to parity errors via cosmic
rays. Since learning this information requires a GPU reset (to Andi's
point, which I'll address separately), I think it's desirable for users
to be able to choose early on whether or not the lines ever get
enabled. For the graphics case, however, learning the information every
time is totally acceptable, and as you say below is probably a nop.
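
To make the counting step of the plan quoted above concrete, here is a
rough userspace simulation in Python. It is only a sketch of steps 1
through 2.5 (count interrupts per line, then emit an event and reset
once a line crosses the threshold), not driver code. The class name,
the "L3_PARITY_ERROR" event label, and the threshold value are all
made up for illustration; the real interrupt and uevent details are
still unknown to me.

```python
from collections import Counter


class ParityErrorTracker:
    """Simulates per-cacheline interrupt counting (hypothetical names)."""

    def __init__(self, threshold):
        self.threshold = threshold  # the "n" from the plan; value is an assumption
        self.counts = Counter()     # interrupts seen per cacheline
        self.bad_lines = set()      # lines already reported as bad
        self.events = []            # stand-in for uevents sent to userspace
        self.resets = 0             # stand-in for GPU resets

    def handle_interrupt(self, line):
        """Record one interrupt; return True if the line was just declared bad."""
        self.counts[line] += 1
        if line in self.bad_lines or self.counts[line] < self.threshold:
            return False
        self.bad_lines.add(line)
        self.events.append(("L3_PARITY_ERROR", line))  # hypothetical event label
        self.resets += 1
        return True
```

With a threshold of 3, the first two interrupts on a line are only
counted, the third reports the line and triggers a reset, and later
interrupts on the same line are ignored.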

> 
> Also, the behavior should be configurable.  For the vast majority of
> users, an L3 parity interrupt is no big deal, and resetting the GPU
> when we see one is more than we want.  So it should probably be off by
> default and be controlled with a module parameter and maybe a config
> option.
> 
> As for the interface for feeding in bad lines (useful for testing if
> nothing else), I'd prefer an ioctl over a module parameter.  Some as
> yet uncoded userspace service can load a set based on previously
> collected uevents at boot time before any real GPU stuff runs...

It's still not clear to me if we can remap after the GPU is fully up
and running and handling the IOCTLs. If so, I believe an ioctl is
definitely the consensus.
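
As a rough illustration of the boot-time flow you describe, here is a
small Python simulation. It assumes the conservative case, where
remapping has to finish before any GPU work starts; that is exactly
the part I have not confirmed. FakeDevice and remap_lines() are
hypothetical stand-ins for the device and the not-yet-designed ioctl,
and the (name, line) event records are an invented format.

```python
def lines_from_events(events):
    """Return the sorted, de-duplicated cacheline ids from (name, line) records."""
    return sorted({line for _name, line in events})


class FakeDevice:
    """Hypothetical stand-in for the GPU; not a real driver interface."""

    def __init__(self):
        self.remapped = []
        self.running = False

    def remap_lines(self, lines):
        # Stand-in for the proposed ioctl. Rejecting late calls models the
        # unconfirmed assumption that remapping must precede GPU work.
        if self.running:
            raise RuntimeError("remap attempted after GPU work started")
        self.remapped.extend(lines)

    def start(self):
        self.running = True
```

A boot-time service would call lines_from_events() on its saved
records, pass the result to remap_lines(), and only then let anything
else touch the device. If it turns out remapping can happen at any
time, the ordering check simply goes away.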

> 
> Reading a file at load time is definitely a non-starter; we have no
> idea whether the filesystem will be available to the module based on
> mount points, hiding, chroots, etc.
> 
