* Software based ECC ?
From: roland @ 2007-08-10 21:16 UTC
To: linux-kernel

Hello!

Since ECC (in the RAM/memory sense) is a widespread hardware technology
in server/enterprise computing for protecting against memory failure, I
wonder:

Can't this be done in software, too?

I didn't find a reference on this list, but I found an interesting paper
I'd like to share:

http://pdos.csail.mit.edu/papers/softecc:ddopson-meng/softecc_ddopson-meng.pdf

"SoftECC : A System for Software Memory Integrity Checking"

Is it possible to implement something like this within the Linux virtual
memory subsystem?

If it can be done, wouldn't this be a great feature?

regards
Roland K.
system engineer

* Re: Software based ECC ?
From: Alan Cox @ 2007-08-10 22:21 UTC
To: roland; +Cc: linux-kernel

On Fri, 10 Aug 2007 23:16:45 +0200 "roland" <devzero@web.de> wrote:

> Hello!
>
> Since ECC (in the RAM/memory sense) is a widespread hardware technology
> in server/enterprise computing for protecting against memory failure, I
> wonder:
>
> Can't this be done in software, too?

Only one way to find out. If it interests you, have a go at it.

* Re: Software based ECC ?
From: Valdis.Kletnieks @ 2007-08-11 6:11 UTC
To: roland; +Cc: linux-kernel

On Fri, 10 Aug 2007 23:16:45 +0200, roland said:

> http://pdos.csail.mit.edu/papers/softecc:ddopson-meng/softecc_ddopson-meng.pdf
>
> "SoftECC : A System for Software Memory Integrity Checking"
>
> Is it possible to implement something like this within the Linux virtual
> memory subsystem?

Anything that can be simulated with a Turing machine is *possible*. The
question is how many rocket boosters the pig needs for takeoff.

Hint: The thesis talks about why he didn't implement it for Linux.

> If it can be done, wouldn't this be a great feature?

Read section 5.2 of that thesis, particularly this quote from 5.2.2:

"For random word writes, this implies that SoftECC will need an order of
magnitude more compute time than the user-mode code"

Basically, for every single memory page that gets dirtied, we then have to
re-checksum the page (blowing away cache lines in the process). If you want
to get a feel for it, find the kernel code that recognizes that a page is
dirtied, and just add a few lines there:

	/* page: pointer to the first 32-bit word of the dirtied page */
	int foo = 0, i;
	for (i = 0; i < 1024; i++) {	// adjust for non-4K pages
		foo ^= *(page + i);
	}

and see how much your system crawls.

Personally, I'd recommend just shelling out the bucks for hardware ECC if
the reliability matters.

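For experimenting outside the kernel, the loop in the message above can be written as a small user-space C function. This is only a sketch under stated assumptions: 4 KiB pages, a page viewed as 1024 32-bit words, and a hypothetical helper name `page_xor_checksum`. It is not kernel code and not the scheme from the SoftECC thesis; a plain XOR checksum detects only some corruptions and corrects none.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_BYTES 4096u                            /* assumption: 4 KiB pages */
#define PAGE_WORDS (PAGE_BYTES / sizeof(uint32_t))  /* 1024 32-bit words */

/* Hypothetical helper: XOR every 32-bit word of one page into a single
 * checksum word. It has to read the whole page, which is the part that
 * pushes other data out of the cache. */
static uint32_t page_xor_checksum(const uint32_t *page)
{
    uint32_t sum = 0;
    size_t i;

    assert(page != NULL);
    for (i = 0; i < PAGE_WORDS; i++)
        sum ^= page[i];
    return sum;
}
```

Calling this once per dirtied page and timing a write-heavy workload gives a rough feel for the cost described in the message, although a real implementation would also have to trap the writes in the first place.
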
* Re: Software based ECC ?
From: Folkert van Heusden @ 2007-08-12 16:51 UTC
To: Valdis.Kletnieks; +Cc: roland, linux-kernel

> > http://pdos.csail.mit.edu/papers/softecc:ddopson-meng/softecc_ddopson-meng.pdf
> > "SoftECC : A System for Software Memory Integrity Checking"
>
> Personally, I'd recommend just shelling out the bucks for hardware ECC if
> the reliability matters.

A question and an idea:

Q: Is ECC guaranteed to detect all bitflips?

Idea: What about a multicore system (3 or more cores) that runs the same
processes on 2 cores, with a third core verifying that they both do the
same? I think it is not only RAM that can become faulty.

Folkert van Heusden

-- 
MultiTail is a flexible tool for monitoring logfiles and commands,
with filters, colors, merging, different views, etc.
http://www.vanheusden.com/multitail/
----------------------------------------------------------------------
Phone: +31-6-41278122, PGP-key: 1F28D8AE, www.vanheusden.com

* Re: Software based ECC ?
From: Jan Engelhardt @ 2007-08-12 17:07 UTC
To: Folkert van Heusden; +Cc: Valdis.Kletnieks, roland, linux-kernel

On Aug 12 2007 18:51, Folkert van Heusden wrote:
>
>> > http://pdos.csail.mit.edu/papers/softecc:ddopson-meng/softecc_ddopson-meng.pdf
>> > "SoftECC : A System for Software Memory Integrity Checking"
>>
>> Personally, I'd recommend just shelling out the bucks for hardware ECC if
>> the reliability matters.
>
>a question and an idea: Q: is ecc guaranteed to detect all bitflips?
>
>Idea: what about a multicore system (3 or more) that runs the same
>processes on 2 cores and a third core verifying that they both do the
>same? As I think it is not only ram that can become faulty.

Indeed. BOINC (SETI@home), for example, has to consider this. Hence they
recalculate each work unit at least three times and then compare the
results.

What makes this different from ECC is that the checksum is not calculated
on every memory operation, but at the end of a larger block of operations.
Of course this may mean that an error can propagate for a while, but the
total walltime (including recomputation) is lower. :)

	Jan
-- 

* Re: Software based ECC ?
From: chibiryuu @ 2007-08-12 19:05 UTC
To: Folkert van Heusden; +Cc: Valdis.Kletnieks, roland, linux-kernel

On 8/12/07, Folkert van Heusden <folkert@vanheusden.com> wrote:
> > > http://pdos.csail.mit.edu/papers/softecc:ddopson-meng/softecc_ddopson-meng.pdf
> > > "SoftECC : A System for Software Memory Integrity Checking"
> >
> > Personally, I'd recommend just shelling out the bucks for hardware ECC if
> > the reliability matters.
>
> a question and an idea: Q: is ecc guaranteed to detect all bitflips?
>
> Idea: what about a multicore system (3 or more) that runs the same
> processes on 2 cores and a third core verifying that they both do the
> same? As I think it is not only ram that can become faulty.

Such hardware does exist -- for example, Stratus sells systems that
run the same OS on two separate boards in lockstep, with a voter to
determine what action to take if they ever diverge.

* Re: Software based ECC ?
From: Valdis.Kletnieks @ 2007-08-13 3:09 UTC
To: Folkert van Heusden; +Cc: roland, linux-kernel

On Sun, 12 Aug 2007 18:51:31 +0200, Folkert van Heusden said:

> a question and an idea: Q: is ecc guaranteed to detect all bitflips?

It depends on the exact ECC function the hardware implements. Usually it
provides protection such as: "Correct all 1-bit errors. Detect all 2-bit
errors, and most 3-bit and higher, but not correct them". (Of course,
"correct all 1 or 2 bit and detect all 3 bit" can be done, it just takes
more bits of ECC.)

> Idea: what about a multicore system (3 or more) that runs the same
> processes on 2 cores and a third core verifying that they both do the
> same? As I think it is not only ram that can become faulty.

This is actually done for high-reliability systems (Google for "tell me
twice" and "tell me three times"). The problem is that it takes a lot of
extra hardware. The G5 and later IBM Z-series mainframe chipsets (not to be
confused with the PowerPC G5) implemented dual computation units and a
comparator that signals a 'Machine Check' condition if the two CPUs don't
end up in the exact same state. As an added bonus, at the end of each
instruction where both *do* compare good, it latches out the *entire* state
of the CPU. On a mismatch it then does the following:

1) Retry the instruction on the same CPU - if it compares correctly, keep
going and flag a "soft" error.

2) If it still fails, read out the last "known good" status latch, load it
into a spare CPU, fire it up, and flag the failing one as bad.

http://www.research.ibm.com/journal/rd/435/spainhower.pdf
http://www.research.ibm.com/journal/rd/435/mueller.pdf

These guys have forgotten more about designing highly reliable systems than
most of us will ever know. ;)

Needless to say, not everybody is willing to pay the costs of the hardware
overhead of this approach.

* Re: Software based ECC ?
From: Bodo Eggert @ 2007-08-21 18:44 UTC
To: Folkert van Heusden, Valdis.Kletnieks, roland, linux-kernel

Folkert van Heusden <folkert@vanheusden.com> wrote:

>> > http://pdos.csail.mit.edu/papers/softecc:ddopson-meng/softecc_ddopson-meng.pdf
>> > "SoftECC : A System for Software Memory Integrity Checking"
>>
>> Personally, I'd recommend just shelling out the bucks for hardware ECC if
>> the reliability matters.
>
> a question and an idea: Q: is ecc guaranteed to detect all bitflips?

It's guaranteed not to.

Having n extra bits, you can at best detect n-bit flips and correct
n/2-bit flips (provided you use an optimal code).

These extra bits can flip, too, so if you have m >= 1 data bits and any
finite number n of extra bits, it's possible to have an undetectable
(n+1)-bit flip.
-- 
If you can't remember, then the claymore IS pointed at you.

Eat this, spammer: W@2fzbe.7eggert.dyndns.org k@-.7eggert.dyndns.org
 Rtc@Ytzq.7eggert.dyndns.org 9qKSiPo@ZesTis.7eggert.dyndns.org

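The smallest instance of this argument is n = 1: a single even-parity bit over a data byte. The sketch below uses hypothetical helper names (`parity_bit`, `parity_check`) and shows that any one flip is caught, while two flips, whether both in the data or one in the data and one in the check bit, pass unnoticed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: even parity over the 8 data bits. This is the
 * single extra check bit, i.e. n = 1. */
static unsigned parity_bit(uint8_t data)
{
    unsigned x = data;

    x ^= x >> 4;
    x ^= x >> 2;
    x ^= x >> 1;
    return x & 1u;
}

/* True if the stored word (data byte plus check bit) looks consistent. */
static bool parity_check(uint8_t data, unsigned check)
{
    assert(check <= 1u);
    return parity_bit(data) == check;
}
```

Flipping one data bit makes `parity_check` fail, but flipping two data bits, or one data bit together with the check bit, yields another word that the code considers valid.
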
* Re: Software based ECC ?
From: linux-os (Dick Johnson) @ 2007-08-21 20:17 UTC
To: Bodo Eggert; +Cc: Folkert van Heusden, Valdis.Kletnieks, roland, linux-kernel

On Tue, 21 Aug 2007, Bodo Eggert wrote:

> Folkert van Heusden <folkert@vanheusden.com> wrote:
>
>>>> http://pdos.csail.mit.edu/papers/softecc:ddopson-meng/softecc_ddopson-meng.pdf
>>>> "SoftECC : A System for Software Memory Integrity Checking"
>>>
>>> Personally, I'd recommend just shelling out the bucks for hardware ECC if
>>> the reliability matters.
>>
>> a question and an idea: Q: is ecc guaranteed to detect all bitflips?
>
> It's guaranteed not to.
>
> Having n extra bits, you can at best detect n-bit flips and correct
> n/2-bit flips (provided you use an optimal code).
>
> These extra bits can flip, too, so if you have m >= 1 data bits and any
> finite number n of extra bits, it's possible to have an undetectable
> (n+1)-bit flip.
> --
> If you can't remember, then the claymore IS pointed at you.
>

Of course, common ECC codes detect and correct single-bit errors. When
used in memory, the bits of a word are never physically adjacent, so a
cosmic ray or other stray particle that upsets several bits usually
upsets bits in different words, and they remain correctable.

The MIT paper is noticeably deficient in its ability to do anything
useful. It proposes checking things at 100 Hz intervals and trapping
each memory access, as though these things happen only once in a while,
and, of course, assumes that the code doing the checking will never be
corrupted. Further, it ignores the cache(s).

Cheers,
Dick Johnson
Penguin : Linux version 2.6.22.1 on an i686 machine (5588.29 BogoMips).
My book : http://www.AbominableFirebug.com/

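The remark about bits of a word not being adjacent can be made concrete with a toy interleaving scheme. Everything here is an assumption for illustration: a row of 32 physical cells shared by 4 words of 8 bits each, with cell c belonging to word c % 4, and a hypothetical helper `flip_cell`. Real memory layouts differ.

```c
#include <assert.h>
#include <stdint.h>

#define INTERLEAVE 4u   /* assumption: 4 words share one physical row */
#define WORD_BITS  8u   /* assumption: toy 8-bit words */

/* Hypothetical helper: flip one physical cell of the row. Neighbouring
 * cells belong to different words, so cell c maps to word c % INTERLEAVE
 * and to bit c / INTERLEAVE within that word. */
static void flip_cell(uint8_t words[INTERLEAVE], unsigned cell)
{
    assert(cell < INTERLEAVE * WORD_BITS);
    words[cell % INTERLEAVE] ^= (uint8_t)(1u << (cell / INTERLEAVE));
}
```

With this mapping, a burst that upsets up to four adjacent cells leaves at most one flipped bit in each word, which a single-error-correcting code can still repair word by word.
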