From mboxrd@z Thu Jan 1 00:00:00 1970
From: Oliver Schinagl
Subject: Re: OT: silent data corruption reading from hard drives
Date: Thu, 16 Aug 2012 09:30:51 +0200
Message-ID: <502CA1AB.8030706@schinagl.nl>
References: <50191ADE.10809@fastmail.fm>
 <5019D052.4000409@hardwarefreak.com>
 <20120802091917.32fb0ca5@natsu>
 <501A316C.3020305@hardwarefreak.com>
 <20120802140634.6f729d07@natsu>
 <1343926790.28421.140661109896017.1A222237@webmail.messagingengine.com>
 <501AB9D8.1030404@turmel.org>
 <1343932322.14845.140661109934445.709DABBB@webmail.messagingengine.com>
 <501BD3E6.6050004@turmel.org>
 <20524.6904.183888.747899@tree.ty.sabi.co.UK>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <20524.6904.183888.747899@tree.ty.sabi.co.UK>
Sender: linux-raid-owner@vger.kernel.org
To: Peter Grandi
Cc: Linux RAID
List-Id: linux-raid.ids

On 15-08-12 23:55, Peter Grandi wrote:
> [ ... ]
>
>> In my opinion, any corruption noticed in a non-ECC system is
>> most likely due to the RAM.
> That's pretty common, but many disk drive models also have bugs,
> and most hw RAID host adapters have many (terrible) bugs.
>
>> You really need to run memtest86 on your system, preferably
>> for 24 hours or more.
> Even that is not conclusive. Some "memory" errors are due to
> activity/noise spikes on the PCI/PCIe bus due to hw bugs or
> poorly electrically designed cards.
>
>>>> Hard drives write extensive ECC payloads to catch
>>>> corruptions there; SATA and SAS protocols have CRC checks on
>>>> every frame transferred;
> A warning to the masses: USB mass storage is weak as to this and
> in particular as to error recovery, and most USB chipsets
> (especially USB-drive ones, but also motherboard ones) are
> massively buggy.
>
>>>> the PCIe bus uses CRC checks on each lane, with low-level
>>>> encoding very similar to SATA. Even modern processors are
>>>> using PCIe-style encoded [ ... ]
>> [ ... ] machine handling data you really care about
> ... should have end-to-end verification, that is the data itself
> should be checksummed at least to detect corruption. For example
> by putting it into checksummed containers (even just ZIP without
> compression).
>
>> should have ECC ram.
> Oh yes, and any machine should have ECC RAM as the cost is
> really modest. Unfortunately the usual evil marketers like to
> segment artificially the market into cheap stuff without ECC and
> premium stuff with ECC, and will not put ECC into cheap stuff to
> avoid tempting business customers to buy it instead of the
> premium stuff.

I agree that all machines should have ECC RAM (though some people still
think it's not worth it). Last time I checked on Newegg, ECC prices were
not that much higher. Both of my servers run happily with ECC RAM.

As for data corruption, I've been there too and know that it simply
happens. Yes, I had shitty IDE drives on a shitty 'rocketraid 404'
controller, but that's no excuse to simply assume all data will always
be right.

Maybe a few years from now we'll have some 'open cores' for properly
designed, almost bug-free hardware :)
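
For what it's worth, the "ZIP without compression" idea quoted above could look roughly like the sketch below. It assumes Python's standard zipfile module; the helper names pack_stored and first_corrupt_member are made up for illustration and are not from anyone's post. ZIP stores a CRC-32 per member, so this only detects corruption, it does not repair it.

```python
import io
import zipfile


def pack_stored(files):
    """Pack {name: bytes} into an uncompressed (stored) ZIP archive.

    Hypothetical helper: ZIP records a CRC-32 for every member, which is
    what gives the container its corruption detection.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return buf.getvalue()


def first_corrupt_member(archive):
    """Return the name of the first member failing its CRC check, or None."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        # testzip() reads every member and compares it with the stored CRC-32.
        return zf.testzip()
```

To check an archive later, read it back from disk and pass the bytes to first_corrupt_member(); None means every member still matches its checksum. The same check is available from the command line as "unzip -t archive.zip".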