From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pl0-f54.google.com ([209.85.160.54]:37711 "EHLO mail-pl0-f54.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752108AbeA0Rmo (ORCPT ); Sat, 27 Jan 2018 12:42:44 -0500
Received: by mail-pl0-f54.google.com with SMTP id ay8so889170plb.4 for ; Sat, 27 Jan 2018 09:42:44 -0800 (PST)
MIME-Version: 1.0
In-Reply-To:
References: <20180122212250.GY3807@carfax.org.uk>
From: Claes Fransson
Date: Sat, 27 Jan 2018 18:42:42 +0100
Message-ID:
Subject: Re: bad key ordering - repairable?
To: Btrfs BTRFS
Content-Type: text/plain; charset="UTF-8"
Sender: linux-btrfs-owner@vger.kernel.org
List-ID:

2018-01-27 18:32 GMT+01:00 Claes Fransson :
> > Duncan Wed, 24 Jan 2018 15:18:25 -0800
> > Claes Fransson posted on Wed, 24 Jan 2018 20:44:33 +0100 as excerpted:
> > > So, I have now some results from the PassMark Memtest86! I let the
> > default automatic tests run for about 19 hours and 16 passes. It
> > reported zero "Errors", but 4 lines of "[Note] RAM may be vulnerable to
> > high frequency row hammer bit flips". If I understand it correctly,
> > it means that some errors were detected when the RAM was tested at
> > higher rates than guaranteed accurate by the vendors.
> > >From Wikipedia:
> >> Row hammer (also written as rowhammer) is an unintended side effect in
>> dynamic random-access memory (DRAM) that causes memory cells to leak
>> their charges and interact electrically between themselves, possibly
>> altering the contents of nearby memory rows that were not addressed in
>> the original memory access. This circumvention of the isolation between
>> DRAM memory cells results from the high cell density in modern DRAM, and
>> can be triggered by specially crafted memory access patterns that rapidly
>> activate the same memory rows numerous times.[1][2][3]
>> >> The row hammer effect has been used in some privilege escalation computer
>> security exploits.
>> >> https://en.wikipedia.org/wiki/Row_hammer
>> >> So it has nothing to do with (generic) testing the RAM at higher rates
>> than guaranteed by the vendors, but rather, with deliberate rapid
>> repeated access (at normal clock rates) of the same cell rows in order
>> to trigger a bitflip in nearby memory cells that could not normally be
>> accessed due to process separation and insufficient privileges.
> > Well, I was thinking of the specific error message by memtest86. According to the PassMark website, https://www.memtest86.com/troubleshooting.htm, "Why am I only getting errors during Test 13 Hammer Test?", second paragraph. Thanks for the Wikipedia explanation though.
> >> IOW, it's unlikely to be accidentally tripped, and thus is exceedingly
>> unlikely to be relevant here, unless you're being hacked, of course.
> > Okay, thanks for your conclusion.
> >> > That said, and entirely unrelated to rowhammer, I know one of the
> problems of memory test false-negatives from experience.
> > In my case, I was even running ECC RAM. But the memory I had purchased
> (back in the day when memory was far more expensive and sub-GB memory was
> the norm) was cheap, and as it happened, marked as stable at slightly
> higher clock rates than it actually was. But I couldn't afford more (or
> I'd have procured less dodgy RAM in the first place) and had little
> recourse but to live with it for awhile. A year or so later there was a
> BIOS update that added better memory clocking control, and I was able to
> declock the RAM slightly from its rating (IIRC to PC-3000 level, it was
> PC3200 rated, this was DDR1 era), after which it was /entirely/ stable,
> even after reducing some of the wait-state settings somewhat to try to
> claw back some of what I lost due to the underclocking.
> > I run gentoo, and nearly all of my problems occurred when I was doing
> updates, building packages at 100% CPU with multiple cores accessing the
> same RAM.
FWIW, the most frequent /detected/ problem was bunzip checksum
> errors as it decompressed and verified the data in memory (before writing
> out)... that would move or go away if I tried again. Occasionally I'd
> get machine-check errors (MCEs), but not frequently, and the ECC RAM
> subsystem /never/ reported errors.
> My filesystem went readonly just after I did some updating of a lot of packages (I think it was thousands of packages :) ), so massive disk-IO for me, but possibly also some CPU and RAM usage...
> >> But the memory tests gave that memory an all-clear.
> > >>> The problem with the memory tests in this case is that they tend to work
>>> on an otherwise unloaded system, and test the retention of the memory
>>> cells, /not/ so much the speed and reliability at which they are accessed
>>> under fully loaded system stress -- and how could they when memory speed
>>> is normally set by the BIOS and not something the memory tester has
>>> access to?
>>> >>> But my memory problems weren't with the memory cells themselves -- they
>>> retained their data just fine and indeed it was ECC RAM so would have
>>> triggered ECC errors if they didn't -- but with the precision timing of
>>> memory IO -- it wasn't quite up to the specs it claimed to support and
>>> would occasionally produce in-transit errors (the ECC would have detected
>>> and possibly corrected errors in storage), and the memory testers simply
>>> didn't test that like a fully loaded system doing unpacks of sources and
>>> builds from them did.
>>> >>> As mentioned, once I got a BIOS update that let me declock the RAM a bit,
>>> everything was fine, and it remained fine when I did upgrade the RAM some
>>> years later, after prices had fallen, as well.
> > Thanks for telling, but unfortunately I do not have any setting to change the clocking of the RAM on my laptop when booting into the BIOS-settings menus.
Claes
> >> (The system was first-gen AMD Opteron, on a server-grade Tyan board, that
>> I ran from purchase in late 2003 for over eight years, maxing out the
>> >> pair of CPUs to dual-core Opteron 290s and the RAM to 8 gigs, over time,
>> until the board finally died in 2012 due to burst capacitors. Which
>> reminds me, I'm still running the replacement, a Gigabyte with an fx6100
>> overclocked a bit to 3.9 GHz and 16 gig RAM, and it's now nearing six
>> years old, so I suppose I better start planning for the next upgrade...
>> I've spent that six years upgrading to big-screen TVs as monitors, with a
>> 65inch/165cm 4K as my primary now and a 48inch/122cm as a secondary to
>> put youtube or whatever on fullscreen, and to now my second generation of
>> ssds, a pair of 1 TB samsung evos, but this reminds me that at nearing
>> six years old the main system's aging too, so I better start thinking of
>> replacing it again...)
>> >> --
>> Duncan - List replies preferred. No HTML msgs.
>> "Every nonfree program has a lord, a master --
>> and if you use the program, he is your master." Richard Stallman
>> >> --