Date: Mon, 20 Jul 2015 13:08:15 +0000 (UTC)
From: Gim Leong Chin
Subject: Re: XFS File system in trouble
To: Leslie Rhorer
Cc: xfs@oss.sgi.com
List-Id: XFS Filesystem from SGI

Hi Leslie,

My two cents here: it appears you are using an AMD FX CPU on an ASUS Sabertooth motherboard?

I would strongly suggest you use unbuffered ECC DIMMs in your system.  Mcelog will warn of ECC errors in your DIMMs.  ECC will correct single-bit errors and at least detect multi-bit errors.
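For anyone wanting to see whether correctable ECC events are already being recorded, a minimal sketch (assuming the mcelog package is available; flags and log paths vary by distribution, so treat this as illustrative rather than exact):

```shell
# Hedged sketch: look for logged machine-check / ECC events via mcelog.
# Assumes the mcelog package; on daemonized setups, --client queries the
# running daemon instead of reading /dev/mcelog directly.
if command -v mcelog >/dev/null 2>&1; then
    msg=$(mcelog --client 2>/dev/null || echo "mcelog daemon not reachable; check /var/log/mcelog")
else
    msg="mcelog not installed on this host"
fi
echo "${msg:-no machine-check events logged}"
```

A daemonized mcelog can also be configured with triggers that fire when correctable errors pass a threshold, which is how the DIMM replacement described below would normally get scheduled.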
I had AMD Opteron servers with registered ECC DIMMs that logged continuous correctable ECC errors while running HPC jobs for up to one month without any crashes, until I could schedule downtime for DIMM replacement.  The errors will be flagged either in the BMC (service processor) or in mcelog.

All my PCs and workstations, at the workplace and at home, with consumer AMD Athlon 64 and AMD Phenom II CPUs, had unbuffered ECC DIMMs on ASUS motherboards.  I never had any memory errors, and I know that if there are memory errors I will be notified.

Chin Gim Leong


From: Leslie Rhorer
To: Martin Papik
Cc: xfs@oss.sgi.com
Sent: Monday, 20 July 2015, 16:35
Subject: Re: XFS File system in trouble

On 7/20/2015 3:05 AM, Martin Papik wrote:
>
> Since you've already found one HW related fault, would you consider
> booting into memtest for a couple of passes just to be on the safe
> side.

    I did that after confirming the one stick of memory was bad.  Twice.  I
got over 20,000 errors on the bad stick, and 0 on the good one.  I also
swapped the locations on the motherboard, and the bad stick still failed
while the good one passed 100%.

> And did you by any chance look at SMART if applicable and
> possibly running a test on the drives.

    Yes. SMART found no errors, but think about it.  Every time tar tries
to create a directory when untarring that file in that location, the
file system croaks when it tries to create a directory.
Not when reading,
and not when writing, other than when it creates a directory.  When I
create the directory manually, the process quits failing at that point
and fails later on during a different directory create.  The array
remains intact when reading, and dmesg shows no drive errors.  I've
re-synced the array, which reads every byte on all 8 drives without a
single mismatch - several times.  To my knowledge, no read has ever
failed except after the filesystem goes offline.  I thought reads were
failing during the CRC checks, but that was a red herring.

> Another test I sometimes do
> when I'm unsure about disks is "cat /dev/sda > /dev/null" (i.e. a
> whole disk read test)

    echo repair > /sys/block/md0/md/sync_action reads not one drive, but
every byte on all 8 drives.

> and see (dmesg) if any errors show up, unless

    Nary a one, and no mismatches.

> you're willing to run badblocks in a read-write nondestructive mode.
> In my experience the read test or badblocks can be run simultaneously
> with smartctl -t long. But as a start I'd look at smartctl --all
> /dev/sd? and see if there are any bad signs. I hope this helps. Good luck
>
> On 07/20/2015 10:41 AM, Leslie Rhorer wrote:
>> On 7/19/2015 6:27 PM, Dave Chinner wrote:
>>> On Sat, Jul 18, 2015 at 08:02:50PM -0500, Leslie Rhorer wrote:
>>>>
>>>> I found the problem with md5sum (and probably nfs, as well).
>>>> One of the memory modules in the server was bad.  The problem
>>>> with XFS persists.  Every time tar tried to create the
>>>> directory:
>>>
>>> Now you need to run xfs_repair.
>>
>> I do that every time the array implodes.  It makes no difference.
>> It never mentions cleaning the structure tar says needs cleaning,
>> and the next time I run tar on that file, the filesystem craters.
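The drive- and array-level checks discussed in this thread can be strung together into one pass; a rough sketch (assuming smartmontools is installed, the array is named md0, and you have root for the sysfs write - note "check" is the read-only variant of the "repair" action mentioned above):

```shell
# Hedged sketch of the checks discussed above; needs root for the sysfs write.

# 1. SMART overall health verdict for each whole disk (assumes smartmontools).
for dev in /dev/sd[a-z]; do
    [ -b "$dev" ] && smartctl --health "$dev"
done

# 2. Kick off an md consistency check; unlike "repair", "check" reads every
#    sector on all member drives without rewriting parity.
md=/sys/block/md0/md
if [ -d "$md" ]; then
    echo check > "$md/sync_action"
    # Once /proc/mdstat shows the check has finished, a nonzero count here
    # means inconsistent stripes were found.
    mismatches=$(cat "$md/mismatch_cnt")
else
    mismatches="n/a (no md0 array on this host)"
fi
echo "mismatch count: $mismatches"
```

As noted above, the SMART long self-test (`smartctl -t long`) runs inside the drive, so it can proceed concurrently with the md check's read load.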
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs