From: Ben Widawsky
Subject: Re: [RFC] algorithm for handling bad cachelines
Date: Wed, 28 Mar 2012 14:15:27 -0700
Message-ID: <20120328141527.55deb6aa@bwidawsk.net>
References: <20120327071943.061bba40@bwidawsk.net>
To: Andi Kleen
Cc: intel-gfx@lists.freedesktop.org
List-Id: intel-gfx@lists.freedesktop.org

On Wed, 28 Mar 2012 02:59:26 -0700
Andi Kleen wrote:

> Ben Widawsky writes:
> >
> > 1. Handle cache line going bad interrupt.
> >
>
> Never use global n without timeout for corrected errors, you would
> need a leaky bucket with a suitable timeout.

As I understand electrons (which is not very well) parity errors happen
all the time and are transparently corrected by our HW. So I suppose 'n'
is still interesting information, but your point is noted. It is
probably better to let userspace decide that n value. Take this with a
grain of salt because the number of interrupts we get is speculative as
I haven't actually tried to enable this.

>
> > 2. send a uevent
> > 2.5 reset the GPU (docs tell us to)
> >
>
> Persistent lists on disk usually suffer from all kinds of problems,
> e.g. you need to detect when the board or CPU has changed.
> Also when the problem is temporary you do not really want
> to save such information permanent.
>
> Usually it's better to rediscover such state each time and handle
> it again. Then you also don't need the uevent or complicated
> user interfaces.
It seems nice to have the information stored in non-volatile storage. It
doesn't have to be used by the user, but assuming they want to load the
option to actually detect these events, it's probably also beneficial to
give the known bad cachelines up front, since a GPU reset is required
once one is detected. The reset both takes time and may do more damage
(that is based on past experience/products only, and I hope IVB can
magnificently recover from our bad GPU programming).

>
> > Any feedback is highly appreciated. I couldn't really find much
> > precedent for doing this in other drivers, so pointers to similar
> > things would also be highly welcome.
>
> http://mcelog.org
>
> -Andi

Thanks.
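
For reference, the leaky bucket suggested above for corrected errors
could look roughly like the sketch below. This is plain, untested C and
not taken from any driver or from mcelog; the names (struct
leaky_bucket, lb_init, lb_event) and the capacity/leak parameters are
made up for illustration, and timestamps are assumed to come from a
monotonic clock.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical leaky bucket for corrected-error events (illustration only). */
struct leaky_bucket {
	uint32_t level;     /* events currently held in the bucket */
	uint32_t capacity;  /* the threshold "n"; exceeding it reports */
	uint64_t leak_ns;   /* time for one event to drain, must be > 0 */
	uint64_t last_ns;   /* time up to which draining is accounted */
};

static void lb_init(struct leaky_bucket *lb, uint32_t capacity,
		    uint64_t leak_ns, uint64_t now_ns)
{
	lb->level = 0;
	lb->capacity = capacity;
	lb->leak_ns = leak_ns;
	lb->last_ns = now_ns;
}

/* Account one event at now_ns; returns true when the bucket overflows. */
static bool lb_event(struct leaky_bucket *lb, uint64_t now_ns)
{
	/* Whole events drained since the last accounting point. */
	uint64_t leaked = (now_ns - lb->last_ns) / lb->leak_ns;

	if (leaked >= lb->level) {
		/* Bucket fully drained; restart the clock from now. */
		lb->level = 0;
		lb->last_ns = now_ns;
	} else {
		/* Partial drain; keep the fractional remainder of time. */
		lb->level -= (uint32_t)leaked;
		lb->last_ns += leaked * lb->leak_ns;
	}

	lb->level++;
	if (lb->level > lb->capacity) {
		lb->level = 0;  /* start over after reporting */
		return true;
	}
	return false;
}
```

Here capacity plays the role of 'n': a slow background rate of corrected
errors drains away on its own, and only a burst that outruns the leak
rate trips the threshold. Both values could be left for userspace to
set.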