public inbox for linux-xfs@vger.kernel.org
From: Gim Leong Chin <chingimleong@yahoo.com.sg>
To: Leslie Rhorer <lrhorer@mygrande.net>
Cc: "xfs@oss.sgi.com" <xfs@oss.sgi.com>
Subject: Re: XFS File system in trouble
Date: Mon, 20 Jul 2015 13:08:15 +0000 (UTC)
Message-ID: <1469853784.545263.1437397695535.JavaMail.yahoo@mail.yahoo.com>
In-Reply-To: <55ACB2BD.6050601@mygrande.net>



Hi Leslie,
My two cents: it appears you are using an AMD FX CPU on an ASUS Sabertooth motherboard?
I would strongly suggest using unbuffered ECC DIMMs in your system.  mcelog will warn of ECC errors in your DIMMs.  ECC corrects single-bit errors and at least detects multi-bit errors.
I had AMD Opteron servers with registered ECC DIMMs that logged continuous correctable ECC errors while running HPC jobs for up to a month without any crashes, until I could schedule downtime for DIMM replacement.  The errors are flagged either in the BMC (service processor) or by mcelog.
All my PCs and workstations at work and at home, with consumer AMD Athlon 64 and AMD Phenom II CPUs, had unbuffered ECC DIMMs on ASUS motherboards.  I never had any memory errors, and I know that if there are memory errors I will be notified.
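
For a quick look at ECC activity without mcelog or a BMC, the kernel's EDAC subsystem also keeps per-memory-controller counters in sysfs. The sketch below is only an illustration and assumes an EDAC driver is loaded for the memory controller; the function name `ecc_counts` and its optional directory argument are made up for this example and are not part of any tool mentioned in this thread.

```shell
#!/bin/sh
# Sum corrected (ce_count) and uncorrected (ue_count) ECC error counters
# across all memory controllers that EDAC exposes.
# Assumption: an EDAC driver is loaded; with none, both totals stay 0.
# The optional first argument overrides the sysfs directory (for testing).
ecc_counts() {
    root="${1:-/sys/devices/system/edac/mc}"
    ce=0
    ue=0
    for mc in "$root"/mc*; do
        # Skip the unexpanded glob and controllers without readable counters.
        [ -r "$mc/ce_count" ] && [ -r "$mc/ue_count" ] || continue
        ce=$((ce + $(cat "$mc/ce_count")))
        ue=$((ue + $(cat "$mc/ue_count")))
    done
    echo "corrected=$ce uncorrected=$ue"
}
```

Calling `ecc_counts` with no argument reads the live counters. A corrected count that keeps climbing points at a DIMM to schedule for replacement, while any uncorrected count is more urgent.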

Chin Gim Leong

From: Leslie Rhorer <lrhorer@mygrande.net>
To: Martin Papik <mp6058@gmail.com>
Cc: xfs@oss.sgi.com
Sent: Monday, 20 July 2015, 16:35
Subject: Re: XFS File system in trouble

On 7/20/2015 3:05 AM, Martin Papik wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA512
>
>
> Since you've already found one HW related fault, would you consider
> booting into memtest for a couple of passes just to be on the safe
> side.

    I did that after confirming the one stick of memory was bad.  Twice.  I 
got over 20,000 errors on the bad stick, and 0 on the good one.  I also 
swapped the locations on the motherboard, and the bad stick still failed 
while the good one passed 100%.

> And did you by any chance look at SMART if applicable and
> possibly running a test on the drives.

    Yes. SMART found no errors, but think about it.  Every time tar 
untars that file in that location, the file system croaks when it tries 
to create a directory.  Not when reading, and not when writing other 
than when it creates a directory.  When I create the directory manually, 
the process quits failing at that point and fails later on during a 
different directory create.  The array 
remains intact when reading, and dmesg shows no drive errors.  I've 
re-synced the array, which reads every byte on all 8 drives without a 
single mismatch - several times.  To my knowledge, no read has ever 
failed except after the filesystem goes offline.  I thought reads were 
failing during the CRC checks, but that was a red herring.

> Another test I sometimes do
> when I'm unsure about disks is "cat /dev/sda > /dev/null" (i.e. a
> whole disk read test)

echo repair > /sys/block/md0/md/sync_action reads not one drive, but 
every byte on all 8 drives.
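
A minimal sketch of how the result of such a pass can be read back from the md sysfs files. The function name `md_scrub_status` and its optional directory argument are hypothetical, introduced only for this illustration; `sync_action` and `mismatch_cnt` are the standard md attributes, and `md0` is the array named above.

```shell
#!/bin/sh
# Report the state of an md scrub by reading sync_action and mismatch_cnt.
# The optional first argument overrides the md sysfs directory (for testing).
md_scrub_status() {
    md="${1:-/sys/block/md0/md}"
    [ -r "$md/sync_action" ] && [ -r "$md/mismatch_cnt" ] || {
        echo "no md attributes under $md" >&2
        return 1
    }
    action=$(cat "$md/sync_action")
    mismatches=$(cat "$md/mismatch_cnt")
    if [ "$action" != "idle" ]; then
        # A check, repair, resync or recover pass is still running.
        echo "scrub in progress ($action)"
    elif [ "$mismatches" -eq 0 ]; then
        echo "clean"
    else
        echo "mismatches=$mismatches"
    fi
}
```

Writing `check` instead of `repair` to `sync_action` starts a read-only pass that counts mismatches without rewriting anything; once the action returns to `idle`, `md_scrub_status` shows whether the count stayed at zero.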

> and see (dmesg) if any errors show up, unless

    'Nary one, and no mismatches.



> you're willing to run badblocks in a read-write nondestructive mode.
> In my experience the read test or badblocks can be run simultaneously
> with smartctl -t long. But as a start I'd look at smartctl --all
> /dev/sd? and see if there are any bad signs. I hope this helps. Good luck
>
>
> On 07/20/2015 10:41 AM, Leslie Rhorer wrote:
>> On 7/19/2015 6:27 PM, Dave Chinner wrote:
>>> On Sat, Jul 18, 2015 at 08:02:50PM -0500, Leslie Rhorer wrote:
>>>>
>>>> I found the problem with md5sum (and probably nfs, as well).
>>>> One of the memory modules in the server was bad.  The problem
>>>> with XFS persists.  Every time tar tried to create the
>>>> directory:
>>>
>>> Now you need to run xfs_repair.
>>
>> I do that every time the array implodes.  It makes no difference.
>> It never mentions cleaning the structure tar says needs cleaning,
>> and the next time I run tar on that file, the filesystem craters.
>>
>> _______________________________________________ xfs mailing list
>> xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1
>
> iQIcBAEBCgAGBQJVrKuzAAoJELsEaSRwbVYrdjoP/3n1W9YtcpdiDoylp6tDYcjF
> vEVz7IWLv2cOky8Lp+0WAZ4Z0WMhcutFzT571H1Vc+jT/UgO25pQHa3yLYTboPuZ
> +tBidVUycs7ZIr9QCZFs2uPQ/7YstamB+F7paCTMKtOJJr5CZLiYX4iyJ9sFmWVY
> UFPAIhyoqD5CFgoaAkwCmk50kNiT0aPM7egizIUVEt14cWuxZxMN0NIJ5b0WJfAk
> qtNQjstVI/xYDgsImm2ZAm19SfOG9ltm2G9zafRr6lR6rRtXjtZX8zEg0l/o9XUw
> OifghjoSup8OCzvX6+4+Soj/3mCKZv4rkBm3exf4YzfQ9eVG6Ktele2rLIs1sl3O
> hUrZUNEl8hYGJeb5gBHFV/TLWDMMwNde/6JiBVy0V8EbDF1lvR4jYpUwThOE0jyL
> ZbzZe4N/B0qvB1OpLDkHrMVm9NPtDkfXdTtM2kRmo5955xtkK09yHF/v64kz7IKc
> 2rM5pOwTR6HWE8RF2j9UujgPjw6nEUuY01TvIMGYzMfkJTI+sVjeDQfwnPG8tzIa
> x4uLa4vTrBD5IaICjAmQiY69qqmt5Vg42G4latZVTYQLelvWQ774mXZfgfT/GtbT
> RKzVwvYowWr/EBhtp7ix/1rWANTFiX0lxOPnRmUFvu8UJnyZhR0/EYbJYy1+jTt7
> O7hZMfAayQBsnVcSK1JC
> =3Ubd
> -----END PGP SIGNATURE-----
>

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


  

