raid1 has failing disks, but smart is clear

linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* raid1 has failing disks, but smart is clear
@ 2016-07-06 22:14 Corey Coughlin
  2016-07-06 22:59 ` Tomasz Kusmierz
  0 siblings, 1 reply; 13+ messages in thread
From: Corey Coughlin @ 2016-07-06 22:14 UTC (permalink / raw)
  To: Btrfs BTRFS

Hi all,
     Hoping you all can help, have a strange problem, think I know 
what's going on, but could use some verification.  I set up a raid1 type 
btrfs filesystem on an Ubuntu 16.04 system, here's what it looks like:

btrfs fi show
Label: none  uuid: 597ee185-36ac-4b68-8961-d4adc13f95d4
     Total devices 10 FS bytes used 3.42TiB
     devid    1 size 1.82TiB used 1.18TiB path /dev/sdd
     devid    2 size 698.64GiB used 47.00GiB path /dev/sdk
     devid    3 size 931.51GiB used 280.03GiB path /dev/sdm
     devid    4 size 931.51GiB used 280.00GiB path /dev/sdl
     devid    5 size 1.82TiB used 1.17TiB path /dev/sdi
     devid    6 size 1.82TiB used 823.03GiB path /dev/sdj
     devid    7 size 698.64GiB used 47.00GiB path /dev/sdg
     devid    8 size 1.82TiB used 1.18TiB path /dev/sda
     devid    9 size 1.82TiB used 1.18TiB path /dev/sdb
     devid   10 size 1.36TiB used 745.03GiB path /dev/sdh

I added a couple disks, and then ran a balance operation, and that took 
about 3 days to finish.  When it did finish, tried a scrub and got this 
message:

scrub status for 597ee185-36ac-4b68-8961-d4adc13f95d4
     scrub started at Sun Jun 26 18:19:28 2016 and was aborted after 
01:16:35
     total bytes scrubbed: 926.45GiB with 18849935 errors
     error details: read=18849935
     corrected errors: 5860, uncorrectable errors: 18844075, unverified 
errors: 0

So that seems bad.  Took a look at the devices and a few of them have 
errors:
...
[/dev/sdi].generation_errs 0
[/dev/sdj].write_io_errs   289436740
[/dev/sdj].read_io_errs    289492820
[/dev/sdj].flush_io_errs   12411
[/dev/sdj].corruption_errs 0
[/dev/sdj].generation_errs 0
[/dev/sdg].write_io_errs   0
...
[/dev/sda].generation_errs 0
[/dev/sdb].write_io_errs   3490143
[/dev/sdb].read_io_errs    111
[/dev/sdb].flush_io_errs   268
[/dev/sdb].corruption_errs 0
[/dev/sdb].generation_errs 0
[/dev/sdh].write_io_errs   5839
[/dev/sdh].read_io_errs    2188
[/dev/sdh].flush_io_errs   11
[/dev/sdh].corruption_errs 1
[/dev/sdh].generation_errs 16373

So I checked the smart data for those disks, they seem perfect, no 
reallocated sectors, no problems.  But one thing I did notice is that 
they are all WD Green drives.  So I'm guessing that if they power down 
and get reassigned to a new /dev/sd* letter, that could lead to data 
corruption.  I used idle3ctl to turn off the shut down mode on all the 
green drives in the system, but I'm having trouble getting the 
filesystem working without the errors.  I tried a 'check --repair' 
command on it, and it seems to find a lot of verification errors, but it 
doesn't look like things are getting fixed.  But I have all the data on 
it backed up on another system, so I can recreate this if I need to.  
But here's what I want to know:

1.  Am I correct about the issues with the WD Green drives, if they 
change mounts during disk operations, will that corrupt data?
2.  If that is the case:
     a.) Is there any way I can stop the /dev/sd* mount points from 
changing?  Or can I set up the filesystem using UUIDs or something more 
solid?  I googled about it, but found conflicting info
     b.) Or, is there something else changing my drive devices?  I have 
most of drives on an LSI SAS 9201-16i card, is there something I need to 
do to make them fixed?
     c.) Or, is there a script or something I can use to figure out if 
the disks will change mounts?
     d.) Or, if I wipe everything and rebuild, will the disks with the 
idle3ctl fix work now?

Regardless of whether or not it's a WD Green drive issue, should I just 
wipefs all the disks and rebuild it?  Is there any way to recover this?  
Thanks for any help!


     ------- Corey

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: raid1 has failing disks, but smart is clear
  2016-07-06 22:14 raid1 has failing disks, but smart is clear Corey Coughlin
@ 2016-07-06 22:59 ` Tomasz Kusmierz
  2016-07-07  6:40   ` Corey Coughlin
  2016-07-07 11:58   ` Austin S. Hemmelgarn
  0 siblings, 2 replies; 13+ messages in thread
From: Tomasz Kusmierz @ 2016-07-06 22:59 UTC (permalink / raw)
  To: Corey Coughlin; +Cc: Btrfs BTRFS


> On 6 Jul 2016, at 23:14, Corey Coughlin <corey.coughlin.cc3@gmail.com> wrote:
> 
> Hi all,
>    Hoping you all can help, have a strange problem, think I know what's going on, but could use some verification.  I set up a raid1 type btrfs filesystem on an Ubuntu 16.04 system, here's what it looks like:
> 
> btrfs fi show
> Label: none  uuid: 597ee185-36ac-4b68-8961-d4adc13f95d4
>    Total devices 10 FS bytes used 3.42TiB
>    devid    1 size 1.82TiB used 1.18TiB path /dev/sdd
>    devid    2 size 698.64GiB used 47.00GiB path /dev/sdk
>    devid    3 size 931.51GiB used 280.03GiB path /dev/sdm
>    devid    4 size 931.51GiB used 280.00GiB path /dev/sdl
>    devid    5 size 1.82TiB used 1.17TiB path /dev/sdi
>    devid    6 size 1.82TiB used 823.03GiB path /dev/sdj
>    devid    7 size 698.64GiB used 47.00GiB path /dev/sdg
>    devid    8 size 1.82TiB used 1.18TiB path /dev/sda
>    devid    9 size 1.82TiB used 1.18TiB path /dev/sdb
>    devid   10 size 1.36TiB used 745.03GiB path /dev/sdh
> 
> I added a couple disks, and then ran a balance operation, and that took about 3 days to finish.  When it did finish, tried a scrub and got this message:
> 
> scrub status for 597ee185-36ac-4b68-8961-d4adc13f95d4
>    scrub started at Sun Jun 26 18:19:28 2016 and was aborted after 01:16:35
>    total bytes scrubbed: 926.45GiB with 18849935 errors
>    error details: read=18849935
>    corrected errors: 5860, uncorrectable errors: 18844075, unverified errors: 0
> 
> So that seems bad.  Took a look at the devices and a few of them have errors:
> ...
> [/dev/sdi].generation_errs 0
> [/dev/sdj].write_io_errs   289436740
> [/dev/sdj].read_io_errs    289492820
> [/dev/sdj].flush_io_errs   12411
> [/dev/sdj].corruption_errs 0
> [/dev/sdj].generation_errs 0
> [/dev/sdg].write_io_errs   0
> ...
> [/dev/sda].generation_errs 0
> [/dev/sdb].write_io_errs   3490143
> [/dev/sdb].read_io_errs    111
> [/dev/sdb].flush_io_errs   268
> [/dev/sdb].corruption_errs 0
> [/dev/sdb].generation_errs 0
> [/dev/sdh].write_io_errs   5839
> [/dev/sdh].read_io_errs    2188
> [/dev/sdh].flush_io_errs   11
> [/dev/sdh].corruption_errs 1
> [/dev/sdh].generation_errs 16373
> 
> So I checked the smart data for those disks, they seem perfect, no reallocated sectors, no problems.  But one thing I did notice is that they are all WD Green drives.  So I'm guessing that if they power down and get reassigned to a new /dev/sd* letter, that could lead to data corruption.  I used idle3ctl to turn off the shut down mode on all the green drives in the system, but I'm having trouble getting the filesystem working without the errors.  I tried a 'check --repair' command on it, and it seems to find a lot of verification errors, but it doesn't look like things are getting fixed.
>  But I have all the data on it backed up on another system, so I can recreate this if I need to.  But here's what I want to know:
> 
> 1.  Am I correct about the issues with the WD Green drives, if they change mounts during disk operations, will that corrupt data?
I just wanted to chip in with WD Green drives. I have a RAID10 running on 6x2TB of those, actually had for ~3 years. If disk goes down for spin down, and you try to access something - kernel & FS & whole system will wait for drive to re-spin and everything works OK. I’ve never had a drive being reassigned to different /dev/sdX due to spin down / up. 
2 years ago I was having a corruption due to not using ECC ram on my system and one of RAM modules started producing errors that were never caught up by CPU / MoBo. Long story short, guy here managed to point me to the right direction and I started shifting my data to hopefully new and not corrupted FS … but I was sceptical of similar issue that you have described AND I did raid1 and while mounted I did shift disk from one SATA port to another and FS managed to pick up the disk in new location and did not even blinked (as far as I remember there was syslog entry to say that disk vanished and then that disk was added)

Last word, you got plenty of errors in your SMART for transfer related stuff, please be advised that this may mean:
- faulty cable
- faulty mono controller
- faulty drive controller
- bad RAM - yes, mother board CAN use your ram for storing data and transfer related stuff … specially chapter ones. 

> 2.  If that is the case:
>    a.) Is there any way I can stop the /dev/sd* mount points from changing?  Or can I set up the filesystem using UUIDs or something more solid?  I googled about it, but found conflicting info
Don’t get it the wrong way but I’m personally surprised that anybody still uses mount points rather than UUID. Devices change from boot to boot for a lot of people and most of distros moved to uuid (2 years ago ? even the swap is mounted via UUID now)

>    b.) Or, is there something else changing my drive devices?  I have most of drives on an LSI SAS 9201-16i card, is there something I need to do to make them fixed?
I’ll let more senior data storage experts to speak up but most of the time people frowned on me for mentioning anything different than north bridge / Intel raid card / super micro / 3 ware . 

(And yes I did found the hard way they were right:
- marvel controller on my mobo randomly writes garbage to your drives 
- adapted PCI express card was switching of all the drives mid flight while pretending “it’s OK” resulting in very peculiar data losses in the middle of big files)

>    c.) Or, is there a script or something I can use to figure out if the disks will change mounts?
>    d.) Or, if I wipe everything and rebuild, will the disks with the idle3ctl fix work now?
> 
> Regardless of whether or not it's a WD Green drive issue, should I just wipefs all the disks and rebuild it?  Is there any way to recover this?  Thanks for any help!

IF you remotelly care about the data that you have (I think you should if you came here), I would suggest a good exercise: 
- unplug all the drives you use for this file system and stop toying with it because you may loose more data (I did because I thought I knew better)
- get your self 2 new drives
- find my thread from ~2 years ago on this mailing list (might be different email address)
- try to locate Chris Mason reply with a script “my old friend”
- run this script on you system for couple of DAYS and you will see whenever you have any corruption creeping in
- if corruptions are creeping in, change a component in your system (controller / RAM / mobo / CPU / PSU) and repeat exercise (best to organise your self access to some spare parts / extra machine.
- when all is good, make and FS out of those 2 new drives, and try to rescue data from OLD FS !
- unplug new FS and put it one the shelf
- try to fix old FS … this will be a FUN and very educating exercise … 

> 
>    ------- Corey
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: raid1 has failing disks, but smart is clear
  2016-07-06 22:59 ` Tomasz Kusmierz
@ 2016-07-07  6:40   ` Corey Coughlin
  2016-07-08  1:24     ` Duncan
  2016-07-09  5:40     ` Andrei Borzenkov
  2016-07-07 11:58   ` Austin S. Hemmelgarn
  1 sibling, 2 replies; 13+ messages in thread
From: Corey Coughlin @ 2016-07-07  6:40 UTC (permalink / raw)
  To: Tomasz Kusmierz; +Cc: Btrfs BTRFS

Hi Tomasz,
     Thanks for the response!  I should clear some things up, though.

On 07/06/2016 03:59 PM, Tomasz Kusmierz wrote:
>> On 6 Jul 2016, at 23:14, Corey Coughlin <corey.coughlin.cc3@gmail.com> wrote:
>>
>> Hi all,
>>     Hoping you all can help, have a strange problem, think I know what's going on, but could use some verification.  I set up a raid1 type btrfs filesystem on an Ubuntu 16.04 system, here's what it looks like:
>>
>> btrfs fi show
>> Label: none  uuid: 597ee185-36ac-4b68-8961-d4adc13f95d4
>>     Total devices 10 FS bytes used 3.42TiB
>>     devid    1 size 1.82TiB used 1.18TiB path /dev/sdd
>>     devid    2 size 698.64GiB used 47.00GiB path /dev/sdk
>>     devid    3 size 931.51GiB used 280.03GiB path /dev/sdm
>>     devid    4 size 931.51GiB used 280.00GiB path /dev/sdl
>>     devid    5 size 1.82TiB used 1.17TiB path /dev/sdi
>>     devid    6 size 1.82TiB used 823.03GiB path /dev/sdj
>>     devid    7 size 698.64GiB used 47.00GiB path /dev/sdg
>>     devid    8 size 1.82TiB used 1.18TiB path /dev/sda
>>     devid    9 size 1.82TiB used 1.18TiB path /dev/sdb
>>     devid   10 size 1.36TiB used 745.03GiB path /dev/sdh
Now when I say that the drives mount points change, I'm not saying they 
change when I reboot.  They change while the system is running.  For 
instance, here's the fi show after I ran a "check --repair" run this 
afternoon:

btrfs fi show
Label: none  uuid: 597ee185-36ac-4b68-8961-d4adc13f95d4
     Total devices 10 FS bytes used 3.42TiB
     devid    1 size 1.82TiB used 1.18TiB path /dev/sdd
     devid    2 size 698.64GiB used 47.00GiB path /dev/sdk
     devid    3 size 931.51GiB used 280.03GiB path /dev/sdm
     devid    4 size 931.51GiB used 280.00GiB path /dev/sdl
     devid    5 size 1.82TiB used 1.17TiB path /dev/sdi
     devid    6 size 1.82TiB used 823.03GiB path /dev/sds
     devid    7 size 698.64GiB used 47.00GiB path /dev/sdg
     devid    8 size 1.82TiB used 1.18TiB path /dev/sda
     devid    9 size 1.82TiB used 1.18TiB path /dev/sdb
     devid   10 size 1.36TiB used 745.03GiB path /dev/sdh

Notice that /dev/sdj in the previous run changed to /dev/sds.  There was 
no reboot, the mount just changed.  I don't know why that is happening, 
but it seems like the majority of the errors are on that drive.  But 
given that I've fixed the start/stop issue on that disk, it probably 
isn't a WD Green issue.

>>
>> I added a couple disks, and then ran a balance operation, and that took about 3 days to finish.  When it did finish, tried a scrub and got this message:
>>
>> scrub status for 597ee185-36ac-4b68-8961-d4adc13f95d4
>>     scrub started at Sun Jun 26 18:19:28 2016 and was aborted after 01:16:35
>>     total bytes scrubbed: 926.45GiB with 18849935 errors
>>     error details: read=18849935
>>     corrected errors: 5860, uncorrectable errors: 18844075, unverified errors: 0
>>
>> So that seems bad.  Took a look at the devices and a few of them have errors:
>> ...
>> [/dev/sdi].generation_errs 0
>> [/dev/sdj].write_io_errs   289436740
>> [/dev/sdj].read_io_errs    289492820
>> [/dev/sdj].flush_io_errs   12411
>> [/dev/sdj].corruption_errs 0
>> [/dev/sdj].generation_errs 0
>> [/dev/sdg].write_io_errs   0
>> ...
>> [/dev/sda].generation_errs 0
>> [/dev/sdb].write_io_errs   3490143
>> [/dev/sdb].read_io_errs    111
>> [/dev/sdb].flush_io_errs   268
>> [/dev/sdb].corruption_errs 0
>> [/dev/sdb].generation_errs 0
>> [/dev/sdh].write_io_errs   5839
>> [/dev/sdh].read_io_errs    2188
>> [/dev/sdh].flush_io_errs   11
>> [/dev/sdh].corruption_errs 1
>> [/dev/sdh].generation_errs 16373
>>
>> So I checked the smart data for those disks, they seem perfect, no reallocated sectors, no problems.  But one thing I did notice is that they are all WD Green drives.  So I'm guessing that if they power down and get reassigned to a new /dev/sd* letter, that could lead to data corruption.  I used idle3ctl to turn off the shut down mode on all the green drives in the system, but I'm having trouble getting the filesystem working without the errors.  I tried a 'check --repair' command on it, and it seems to find a lot of verification errors, but it doesn't look like things are getting fixed.
>>   But I have all the data on it backed up on another system, so I can recreate this if I need to.  But here's what I want to know:
>>
>> 1.  Am I correct about the issues with the WD Green drives, if they change mounts during disk operations, will that corrupt data?
> I just wanted to chip in with WD Green drives. I have a RAID10 running on 6x2TB of those, actually had for ~3 years. If disk goes down for spin down, and you try to access something - kernel & FS & whole system will wait for drive to re-spin and everything works OK. I’ve never had a drive being reassigned to different /dev/sdX due to spin down / up.
> 2 years ago I was having a corruption due to not using ECC ram on my system and one of RAM modules started producing errors that were never caught up by CPU / MoBo. Long story short, guy here managed to point me to the right direction and I started shifting my data to hopefully new and not corrupted FS … but I was sceptical of similar issue that you have described AND I did raid1 and while mounted I did shift disk from one SATA port to another and FS managed to pick up the disk in new location and did not even blinked (as far as I remember there was syslog entry to say that disk vanished and then that disk was added)
>
> Last word, you got plenty of errors in your SMART for transfer related stuff, please be advised that this may mean:
> - faulty cable
> - faulty mono controller
> - faulty drive controller
> - bad RAM - yes, mother board CAN use your ram for storing data and transfer related stuff … specially chapter ones.
OK, I'll see if I can narrow things down to a faulty component or 
memory.  It could definitely be the drive controller or sas/sata cable.
>> 2.  If that is the case:
>>     a.) Is there any way I can stop the /dev/sd* mount points from changing?  Or can I set up the filesystem using UUIDs or something more solid?  I googled about it, but found conflicting info
> Don’t get it the wrong way but I’m personally surprised that anybody still uses mount points rather than UUID. Devices change from boot to boot for a lot of people and most of distros moved to uuid (2 years ago ? even the swap is mounted via UUID now)
Well yeah, if I was mounting all the disks to different mount points, I 
would definitely use UUIDs to get them mounted.  But I haven't seen any 
way to set up a "mkfs.btrfs" command to use UUID or anything else for 
individual drives.  Am I missing something?  I've been doing a lot of 
googling.

>
>>     b.) Or, is there something else changing my drive devices?  I have most of drives on an LSI SAS 9201-16i card, is there something I need to do to make them fixed?
> I’ll let more senior data storage experts to speak up but most of the time people frowned on me for mentioning anything different than north bridge / Intel raid card / super micro / 3 ware .
>
> (And yes I did found the hard way they were right:
> - marvel controller on my mobo randomly writes garbage to your drives
> - adapted PCI express card was switching of all the drives mid flight while pretending “it’s OK” resulting in very peculiar data losses in the middle of big files)
Hmm.... good to know.  Might see if I can find a more reliable sata 
controller card.
>
>>     c.) Or, is there a script or something I can use to figure out if the disks will change mounts?
>>     d.) Or, if I wipe everything and rebuild, will the disks with the idle3ctl fix work now?
>>
>> Regardless of whether or not it's a WD Green drive issue, should I just wipefs all the disks and rebuild it?  Is there any way to recover this?  Thanks for any help!
> IF you remotelly care about the data that you have (I think you should if you came here), I would suggest a good exercise:
> - unplug all the drives you use for this file system and stop toying with it because you may loose more data (I did because I thought I knew better)
> - get your self 2 new drives
> - find my thread from ~2 years ago on this mailing list (might be different email address)
> - try to locate Chris Mason reply with a script “my old friend”
> - run this script on you system for couple of DAYS and you will see whenever you have any corruption creeping in
> - if corruptions are creeping in, change a component in your system (controller / RAM / mobo / CPU / PSU) and repeat exercise (best to organise your self access to some spare parts / extra machine.
> - when all is good, make and FS out of those 2 new drives, and try to rescue data from OLD FS !
> - unplug new FS and put it one the shelf
> - try to fix old FS … this will be a FUN and very educating exercise …
I'm not worried about the data, it's all backed up.  But ok, I can get a 
couple of drives and start testing stuff.  I think I found the script 
you meant (stress.sh I'm hoping) so I can get that running. But as far 
as the old fs goes, I can just wipe that and start fresh.  Thanks for 
all the tips!

     ------- Corey

>
>>     ------- Corey
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: raid1 has failing disks, but smart is clear
  2016-07-07  6:40   ` Corey Coughlin
@ 2016-07-08  1:24     ` Duncan
  2016-07-08  4:51       ` Corey Coughlin
  2016-07-09  5:51       ` Andrei Borzenkov
  2016-07-09  5:40     ` Andrei Borzenkov
  1 sibling, 2 replies; 13+ messages in thread
From: Duncan @ 2016-07-08  1:24 UTC (permalink / raw)
  To: linux-btrfs

Corey Coughlin posted on Wed, 06 Jul 2016 23:40:30 -0700 as excerpted:

> Well yeah, if I was mounting all the disks to different mount points, I
> would definitely use UUIDs to get them mounted.  But I haven't seen any
> way to set up a "mkfs.btrfs" command to use UUID or anything else for
> individual drives.  Am I missing something?  I've been doing a lot of
> googling.

FWIW, you can use the /dev/disk/by-*/* symlinks (as normally setup by 
udev) to reference various devices.

Of course because the identifiers behind by-uuid and by-label are per-
filesystem, those will normally only identify the one device of a multi-
device filesystem, but the by-id links ID on device serials and partition 
number, and if you are using GPT partitioning, you have by-partuuid and 
(if you set them when setting up the partitions) by-partlabel as well.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: raid1 has failing disks, but smart is clear
  2016-07-08  1:24     ` Duncan
@ 2016-07-08  4:51       ` Corey Coughlin
  2016-07-09  5:51       ` Andrei Borzenkov
  1 sibling, 0 replies; 13+ messages in thread
From: Corey Coughlin @ 2016-07-08  4:51 UTC (permalink / raw)
  To: Duncan, linux-btrfs

Hi Duncan,
     Thanks for the info!  I've seen that done in the fstab, but it 
didn't work for me the last time I tried it on the command line. Worth a 
shot!

     ------ Corey

On 07/07/2016 06:24 PM, Duncan wrote:
> Corey Coughlin posted on Wed, 06 Jul 2016 23:40:30 -0700 as excerpted:
>
>> Well yeah, if I was mounting all the disks to different mount points, I
>> would definitely use UUIDs to get them mounted.  But I haven't seen any
>> way to set up a "mkfs.btrfs" command to use UUID or anything else for
>> individual drives.  Am I missing something?  I've been doing a lot of
>> googling.
> FWIW, you can use the /dev/disk/by-*/* symlinks (as normally setup by
> udev) to reference various devices.
>
> Of course because the identifiers behind by-uuid and by-label are per-
> filesystem, those will normally only identify the one device of a multi-
> device filesystem, but the by-id links ID on device serials and partition
> number, and if you are using GPT partitioning, you have by-partuuid and
> (if you set them when setting up the partitions) by-partlabel as well.
>


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: raid1 has failing disks, but smart is clear
  2016-07-08  1:24     ` Duncan
  2016-07-08  4:51       ` Corey Coughlin
@ 2016-07-09  5:51       ` Andrei Borzenkov
  1 sibling, 0 replies; 13+ messages in thread
From: Andrei Borzenkov @ 2016-07-09  5:51 UTC (permalink / raw)
  To: Duncan, linux-btrfs

08.07.2016 04:24, Duncan пишет:
> Corey Coughlin posted on Wed, 06 Jul 2016 23:40:30 -0700 as excerpted:
> 
>> Well yeah, if I was mounting all the disks to different mount points, I
>> would definitely use UUIDs to get them mounted.  But I haven't seen any
>> way to set up a "mkfs.btrfs" command to use UUID or anything else for
>> individual drives.  Am I missing something?  I've been doing a lot of
>> googling.
> 
> FWIW, you can use the /dev/disk/by-*/* symlinks (as normally setup by 
> udev) to reference various devices.
> 

Current udev ships rule that calls equivalent of "btrfs device ready
$dev", where $dev is the canonical kernel device name. btrfs kernel
driver will update internal list of device names when it gets this
ioctl, which means that unless you explicitly pass full list of
/dev/disk/by-*/* during mount you will see those kernel names.

And even then as soon as device for some reason dis- and re-appears (as
is apparently the case here) it will be renamed back by udev when "add"
event is seen.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: raid1 has failing disks, but smart is clear
  2016-07-07  6:40   ` Corey Coughlin
  2016-07-08  1:24     ` Duncan
@ 2016-07-09  5:40     ` Andrei Borzenkov
  2016-07-12  4:50       ` Corey Coughlin
  1 sibling, 1 reply; 13+ messages in thread
From: Andrei Borzenkov @ 2016-07-09  5:40 UTC (permalink / raw)
  To: Corey Coughlin, Tomasz Kusmierz; +Cc: Btrfs BTRFS

07.07.2016 09:40, Corey Coughlin пишет:
> Hi Tomasz,
>     Thanks for the response!  I should clear some things up, though.
> 
> On 07/06/2016 03:59 PM, Tomasz Kusmierz wrote:
>>> On 6 Jul 2016, at 23:14, Corey Coughlin
>>> <corey.coughlin.cc3@gmail.com> wrote:
>>>
>>> Hi all,
>>>     Hoping you all can help, have a strange problem, think I know
>>> what's going on, but could use some verification.  I set up a raid1
>>> type btrfs filesystem on an Ubuntu 16.04 system, here's what it looks
>>> like:
>>>
>>> btrfs fi show
>>> Label: none  uuid: 597ee185-36ac-4b68-8961-d4adc13f95d4
>>>     Total devices 10 FS bytes used 3.42TiB
>>>     devid    1 size 1.82TiB used 1.18TiB path /dev/sdd
>>>     devid    2 size 698.64GiB used 47.00GiB path /dev/sdk
>>>     devid    3 size 931.51GiB used 280.03GiB path /dev/sdm
>>>     devid    4 size 931.51GiB used 280.00GiB path /dev/sdl
>>>     devid    5 size 1.82TiB used 1.17TiB path /dev/sdi
>>>     devid    6 size 1.82TiB used 823.03GiB path /dev/sdj
>>>     devid    7 size 698.64GiB used 47.00GiB path /dev/sdg
>>>     devid    8 size 1.82TiB used 1.18TiB path /dev/sda
>>>     devid    9 size 1.82TiB used 1.18TiB path /dev/sdb
>>>     devid   10 size 1.36TiB used 745.03GiB path /dev/sdh
> Now when I say that the drives mount points change, I'm not saying they
> change when I reboot.  They change while the system is running.  For
> instance, here's the fi show after I ran a "check --repair" run this
> afternoon:
> 
> btrfs fi show
> Label: none  uuid: 597ee185-36ac-4b68-8961-d4adc13f95d4
>     Total devices 10 FS bytes used 3.42TiB
>     devid    1 size 1.82TiB used 1.18TiB path /dev/sdd
>     devid    2 size 698.64GiB used 47.00GiB path /dev/sdk
>     devid    3 size 931.51GiB used 280.03GiB path /dev/sdm
>     devid    4 size 931.51GiB used 280.00GiB path /dev/sdl
>     devid    5 size 1.82TiB used 1.17TiB path /dev/sdi
>     devid    6 size 1.82TiB used 823.03GiB path /dev/sds
>     devid    7 size 698.64GiB used 47.00GiB path /dev/sdg
>     devid    8 size 1.82TiB used 1.18TiB path /dev/sda
>     devid    9 size 1.82TiB used 1.18TiB path /dev/sdb
>     devid   10 size 1.36TiB used 745.03GiB path /dev/sdh
> 
> Notice that /dev/sdj in the previous run changed to /dev/sds.  There was
> no reboot, the mount just changed.  I don't know why that is happening,
> but it seems like the majority of the errors are on that drive.  But
> given that I've fixed the start/stop issue on that disk, it probably
> isn't a WD Green issue.

It's not "mount point", it is just device names. Do not make it sound
more confusing than it already is :)

This implies that disks drop off and reappear. Do you have "dmesg" or
log (/var/log/syslog or /var/log/messages or journalctl) for the same
period of time?


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: raid1 has failing disks, but smart is clear
  2016-07-09  5:40     ` Andrei Borzenkov
@ 2016-07-12  4:50       ` Corey Coughlin
  0 siblings, 0 replies; 13+ messages in thread
From: Corey Coughlin @ 2016-07-12  4:50 UTC (permalink / raw)
  To: Andrei Borzenkov, Tomasz Kusmierz; +Cc: Btrfs BTRFS

Hi Andrei,
     Thanks for the info, sorry about the improper terminology.  In 
better news, I discovered that one disk wasn't getting recognized by the 
OS at certain outputs of one of my mini-sas to sata cables, so I got a 
new one, and the disk works fine on that.  So I'm that's at least one 
bad cable, I still have to check the other 3, though.

     ------ Corey

On 07/08/2016 10:40 PM, Andrei Borzenkov wrote:
> 07.07.2016 09:40, Corey Coughlin пишет:
>> Hi Tomasz,
>>      Thanks for the response!  I should clear some things up, though.
>>
>> On 07/06/2016 03:59 PM, Tomasz Kusmierz wrote:
>>>> On 6 Jul 2016, at 23:14, Corey Coughlin
>>>> <corey.coughlin.cc3@gmail.com> wrote:
>>>>
>>>> Hi all,
>>>>      Hoping you all can help, have a strange problem, think I know
>>>> what's going on, but could use some verification.  I set up a raid1
>>>> type btrfs filesystem on an Ubuntu 16.04 system, here's what it looks
>>>> like:
>>>>
>>>> btrfs fi show
>>>> Label: none  uuid: 597ee185-36ac-4b68-8961-d4adc13f95d4
>>>>      Total devices 10 FS bytes used 3.42TiB
>>>>      devid    1 size 1.82TiB used 1.18TiB path /dev/sdd
>>>>      devid    2 size 698.64GiB used 47.00GiB path /dev/sdk
>>>>      devid    3 size 931.51GiB used 280.03GiB path /dev/sdm
>>>>      devid    4 size 931.51GiB used 280.00GiB path /dev/sdl
>>>>      devid    5 size 1.82TiB used 1.17TiB path /dev/sdi
>>>>      devid    6 size 1.82TiB used 823.03GiB path /dev/sdj
>>>>      devid    7 size 698.64GiB used 47.00GiB path /dev/sdg
>>>>      devid    8 size 1.82TiB used 1.18TiB path /dev/sda
>>>>      devid    9 size 1.82TiB used 1.18TiB path /dev/sdb
>>>>      devid   10 size 1.36TiB used 745.03GiB path /dev/sdh
>> Now when I say that the drives mount points change, I'm not saying they
>> change when I reboot.  They change while the system is running.  For
>> instance, here's the fi show after I ran a "check --repair" run this
>> afternoon:
>>
>> btrfs fi show
>> Label: none  uuid: 597ee185-36ac-4b68-8961-d4adc13f95d4
>>      Total devices 10 FS bytes used 3.42TiB
>>      devid    1 size 1.82TiB used 1.18TiB path /dev/sdd
>>      devid    2 size 698.64GiB used 47.00GiB path /dev/sdk
>>      devid    3 size 931.51GiB used 280.03GiB path /dev/sdm
>>      devid    4 size 931.51GiB used 280.00GiB path /dev/sdl
>>      devid    5 size 1.82TiB used 1.17TiB path /dev/sdi
>>      devid    6 size 1.82TiB used 823.03GiB path /dev/sds
>>      devid    7 size 698.64GiB used 47.00GiB path /dev/sdg
>>      devid    8 size 1.82TiB used 1.18TiB path /dev/sda
>>      devid    9 size 1.82TiB used 1.18TiB path /dev/sdb
>>      devid   10 size 1.36TiB used 745.03GiB path /dev/sdh
>>
>> Notice that /dev/sdj in the previous run changed to /dev/sds.  There was
>> no reboot, the mount just changed.  I don't know why that is happening,
>> but it seems like the majority of the errors are on that drive.  But
>> given that I've fixed the start/stop issue on that disk, it probably
>> isn't a WD Green issue.
> It's not "mount point", it is just device names. Do not make it sound
> more confusing than it already is :)
>
> This implies that disks drop off and reappear. Do you have "dmesg" or
> log (/var/log/syslog or /var/log/messages or journalctl) for the same
> period of time?
>


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: raid1 has failing disks, but smart is clear
  2016-07-06 22:59 ` Tomasz Kusmierz
  2016-07-07  6:40   ` Corey Coughlin
@ 2016-07-07 11:58   ` Austin S. Hemmelgarn
  2016-07-08  4:50     ` Corey Coughlin
  1 sibling, 1 reply; 13+ messages in thread
From: Austin S. Hemmelgarn @ 2016-07-07 11:58 UTC (permalink / raw)
  To: Tomasz Kusmierz, Corey Coughlin; +Cc: Btrfs BTRFS

On 2016-07-06 18:59, Tomasz Kusmierz wrote:
>
>> On 6 Jul 2016, at 23:14, Corey Coughlin <corey.coughlin.cc3@gmail.com> wrote:
>>
>> Hi all,
>>    Hoping you all can help, have a strange problem, think I know what's going on, but could use some verification.  I set up a raid1 type btrfs filesystem on an Ubuntu 16.04 system, here's what it looks like:
>>
>> btrfs fi show
>> Label: none  uuid: 597ee185-36ac-4b68-8961-d4adc13f95d4
>>    Total devices 10 FS bytes used 3.42TiB
>>    devid    1 size 1.82TiB used 1.18TiB path /dev/sdd
>>    devid    2 size 698.64GiB used 47.00GiB path /dev/sdk
>>    devid    3 size 931.51GiB used 280.03GiB path /dev/sdm
>>    devid    4 size 931.51GiB used 280.00GiB path /dev/sdl
>>    devid    5 size 1.82TiB used 1.17TiB path /dev/sdi
>>    devid    6 size 1.82TiB used 823.03GiB path /dev/sdj
>>    devid    7 size 698.64GiB used 47.00GiB path /dev/sdg
>>    devid    8 size 1.82TiB used 1.18TiB path /dev/sda
>>    devid    9 size 1.82TiB used 1.18TiB path /dev/sdb
>>    devid   10 size 1.36TiB used 745.03GiB path /dev/sdh
>>
>> I added a couple disks, and then ran a balance operation, and that took about 3 days to finish.  When it did finish, tried a scrub and got this message:
>>
>> scrub status for 597ee185-36ac-4b68-8961-d4adc13f95d4
>>    scrub started at Sun Jun 26 18:19:28 2016 and was aborted after 01:16:35
>>    total bytes scrubbed: 926.45GiB with 18849935 errors
>>    error details: read=18849935
>>    corrected errors: 5860, uncorrectable errors: 18844075, unverified errors: 0
>>
>> So that seems bad.  Took a look at the devices and a few of them have errors:
>> ...
>> [/dev/sdi].generation_errs 0
>> [/dev/sdj].write_io_errs   289436740
>> [/dev/sdj].read_io_errs    289492820
>> [/dev/sdj].flush_io_errs   12411
>> [/dev/sdj].corruption_errs 0
>> [/dev/sdj].generation_errs 0
>> [/dev/sdg].write_io_errs   0
>> ...
>> [/dev/sda].generation_errs 0
>> [/dev/sdb].write_io_errs   3490143
>> [/dev/sdb].read_io_errs    111
>> [/dev/sdb].flush_io_errs   268
>> [/dev/sdb].corruption_errs 0
>> [/dev/sdb].generation_errs 0
>> [/dev/sdh].write_io_errs   5839
>> [/dev/sdh].read_io_errs    2188
>> [/dev/sdh].flush_io_errs   11
>> [/dev/sdh].corruption_errs 1
>> [/dev/sdh].generation_errs 16373
>>
>> So I checked the smart data for those disks, they seem perfect, no reallocated sectors, no problems.  But one thing I did notice is that they are all WD Green drives.  So I'm guessing that if they power down and get reassigned to a new /dev/sd* letter, that could lead to data corruption.  I used idle3ctl to turn off the shut down mode on all the green drives in the system, but I'm having trouble getting the filesystem working without the errors.  I tried a 'check --repair' command on it, and it seems to find a lot of verification errors, but it doesn't look like things are getting fixed.
>>  But I have all the data on it backed up on another system, so I can recreate this if I need to.  But here's what I want to know:
>>
>> 1.  Am I correct about the issues with the WD Green drives, if they change mounts during disk operations, will that corrupt data?
> I just wanted to chip in with WD Green drives. I have a RAID10 running on 6x2TB of those, actually had for ~3 years. If disk goes down for spin down, and you try to access something - kernel & FS & whole system will wait for drive to re-spin and everything works OK. I’ve never had a drive being reassigned to different /dev/sdX due to spin down / up.
> 2 years ago I was having a corruption due to not using ECC ram on my system and one of RAM modules started producing errors that were never caught up by CPU / MoBo. Long story short, guy here managed to point me to the right direction and I started shifting my data to hopefully new and not corrupted FS … but I was sceptical of similar issue that you have described AND I did raid1 and while mounted I did shift disk from one SATA port to another and FS managed to pick up the disk in new location and did not even blinked (as far as I remember there was syslog entry to say that disk vanished and then that disk was added)
>
> Last word, you got plenty of errors in your SMART for transfer related stuff, please be advised that this may mean:
> - faulty cable
> - faulty mono controller
> - faulty drive controller
> - bad RAM - yes, mother board CAN use your ram for storing data and transfer related stuff … specially chapter ones.
It's worth pointing out that the most likely point here for data 
corruption assuming the cable and controllers are OK is during the DMA 
transfer from system RAM to the drive controller.  Even when dealing 
with really good HBA's that have an on-board NVRAM cache, you still have 
to copy the data out of system RAM at some point, and that's usually 
when the corruption occurs if the problem is with the RAM, CPU or MB.
>
>> 2.  If that is the case:
>>    a.) Is there any way I can stop the /dev/sd* mount points from changing?  Or can I set up the filesystem using UUIDs or something more solid?  I googled about it, but found conflicting info
> Don’t get it the wrong way but I’m personally surprised that anybody still uses mount points rather than UUID. Devices change from boot to boot for a lot of people and most of distros moved to uuid (2 years ago ? even the swap is mounted via UUID now)
>
Providing there are no changes in hardware configuration, device nodes 
(which is what /dev/sd* are, not mount points) remain relatively stable 
across boot.  My personal recommendation would be to mount by label, not 
UUID, and give your filesystems relatively unique names (in my case, I 
have a short identifier to ID the system the FS is for, followed by a 
tag identifying where in the hierarchy it gets mounted).  Mounting by 
UUID works, until you have to recreate the FS from scratch and restore a 
backup, because then you have a different UUID.
>>    b.) Or, is there something else changing my drive devices?  I have most of drives on an LSI SAS 9201-16i card, is there something I need to do to make them fixed?
> I’ll let more senior data storage experts to speak up but most of the time people frowned on me for mentioning anything different than north bridge / Intel raid card / super micro / 3 ware .
>
> (And yes I did found the hard way they were right:
> - marvel controller on my mobo randomly writes garbage to your drives
> - adapted PCI express card was switching of all the drives mid flight while pretending “it’s OK” resulting in very peculiar data losses in the middle of big file.
Based on personal experience:
1. LSI Logic, Super Micro, Intel, and 3 Ware all generally have very 
high quality HBA's (both RAID and non-RAID)
2. HighPoint and ASMedia (ASUS's internal semi-conductor branch) are 
decent if you get recent cards.
3. Marvell, Adaptec, and Areca are hit or miss, some work great, some 
are horrible.
4. I have never had a JMicron based card that worked reliably.
5. If it's in a server and branded by the manufacturer of that server, 
it's probably a good adapter.
>
>>    c.) Or, is there a script or something I can use to figure out if the disks will change mounts?
>>    d.) Or, if I wipe everything and rebuild, will the disks with the idle3ctl fix work now?
>>
>> Regardless of whether or not it's a WD Green drive issue, should I just wipefs all the disks and rebuild it?  Is there any way to recover this?  Thanks for any help!
>
> IF you remotelly care about the data that you have (I think you should if you came here), I would suggest a good exercise:
> - unplug all the drives you use for this file system and stop toying with it because you may loose more data (I did because I thought I knew better)
> - get your self 2 new drives
> - find my thread from ~2 years ago on this mailing list (might be different email address)
> - try to locate Chris Mason reply with a script “my old friend”
> - run this script on you system for couple of DAYS and you will see whenever you have any corruption creeping in
> - if corruptions are creeping in, change a component in your system (controller / RAM / mobo / CPU / PSU) and repeat exercise (best to organise your self access to some spare parts / extra machine.
> - when all is good, make and FS out of those 2 new drives, and try to rescue data from OLD FS !
> - unplug new FS and put it one the shelf
> - try to fix old FS … this will be a FUN and very educating exercise …
FWIW, based on what's been said, I'm almost certain it's a hardware 
issue, and would give better than 50/50 odds that it's bad RAM or a bad 
storage controller.  In general, here's the order I'd swap things in if 
you have spares:
1. RAM (If you find that RAM is the issue, try each individual module 
separately to figure out which one is bad, if they all appear bad, it's 
either a power issue or something is wrong with your MB).
2. Storage Controller
3. PSU (if you have a multi-meter and a bit of wire, or a PSU tester, 
you can check this without needing a spare).
4. MB
5. CPU

Other things to consider regarding power:
1. If you're using a UPS, make sure it lists either 'True sine wave 
output' or 'APFC' support, many cheap ones produce a quantized sine wave 
output which causes issues with many modern computer PSU's.
2. If you're not using a UPS, you may have issues with what's 
colloquially known as 'dirty' power.  In signal theory terms, this means 
that you've got a noise source mixed in with the 50/60Hz sine wave 
that's supposed to be present on line power.  This is actually a 
surprisingly common issue in many parts of the world because of the 
conditions of the transmission lines and the power plant itself. 
Somewhat ironically, one of the most reliable ways to deal with this is 
to get a UPS (and if you do, make sure and look for one that meets what 
I said above)
Both issues can manifest almost identically to having bad RAM or a bad 
PSU, but they're often expensive to fix, and in the second case, not 
easy to test for.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: raid1 has failing disks, but smart is clear
  2016-07-07 11:58   ` Austin S. Hemmelgarn
@ 2016-07-08  4:50     ` Corey Coughlin
  2016-07-08 11:14       ` Tomasz Kusmierz
  0 siblings, 1 reply; 13+ messages in thread
From: Corey Coughlin @ 2016-07-08  4:50 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, Tomasz Kusmierz; +Cc: Btrfs BTRFS

Hi Austin,
     Thanks for the reply!  I'll go inline for more:

On 07/07/2016 04:58 AM, Austin S. Hemmelgarn wrote:
> On 2016-07-06 18:59, Tomasz Kusmierz wrote:
>>
>>> On 6 Jul 2016, at 23:14, Corey Coughlin 
>>> <corey.coughlin.cc3@gmail.com> wrote:
>>>
>>> Hi all,
>>>    Hoping you all can help, have a strange problem, think I know 
>>> what's going on, but could use some verification.  I set up a raid1 
>>> type btrfs filesystem on an Ubuntu 16.04 system, here's what it 
>>> looks like:
>>>
>>> btrfs fi show
>>> Label: none  uuid: 597ee185-36ac-4b68-8961-d4adc13f95d4
>>>    Total devices 10 FS bytes used 3.42TiB
>>>    devid    1 size 1.82TiB used 1.18TiB path /dev/sdd
>>>    devid    2 size 698.64GiB used 47.00GiB path /dev/sdk
>>>    devid    3 size 931.51GiB used 280.03GiB path /dev/sdm
>>>    devid    4 size 931.51GiB used 280.00GiB path /dev/sdl
>>>    devid    5 size 1.82TiB used 1.17TiB path /dev/sdi
>>>    devid    6 size 1.82TiB used 823.03GiB path /dev/sdj
>>>    devid    7 size 698.64GiB used 47.00GiB path /dev/sdg
>>>    devid    8 size 1.82TiB used 1.18TiB path /dev/sda
>>>    devid    9 size 1.82TiB used 1.18TiB path /dev/sdb
>>>    devid   10 size 1.36TiB used 745.03GiB path /dev/sdh
>>>
>>> I added a couple disks, and then ran a balance operation, and that 
>>> took about 3 days to finish.  When it did finish, tried a scrub and 
>>> got this message:
>>>
>>> scrub status for 597ee185-36ac-4b68-8961-d4adc13f95d4
>>>    scrub started at Sun Jun 26 18:19:28 2016 and was aborted after 
>>> 01:16:35
>>>    total bytes scrubbed: 926.45GiB with 18849935 errors
>>>    error details: read=18849935
>>>    corrected errors: 5860, uncorrectable errors: 18844075, 
>>> unverified errors: 0
>>>
>>> So that seems bad.  Took a look at the devices and a few of them 
>>> have errors:
>>> ...
>>> [/dev/sdi].generation_errs 0
>>> [/dev/sdj].write_io_errs   289436740
>>> [/dev/sdj].read_io_errs    289492820
>>> [/dev/sdj].flush_io_errs   12411
>>> [/dev/sdj].corruption_errs 0
>>> [/dev/sdj].generation_errs 0
>>> [/dev/sdg].write_io_errs   0
>>> ...
>>> [/dev/sda].generation_errs 0
>>> [/dev/sdb].write_io_errs   3490143
>>> [/dev/sdb].read_io_errs    111
>>> [/dev/sdb].flush_io_errs   268
>>> [/dev/sdb].corruption_errs 0
>>> [/dev/sdb].generation_errs 0
>>> [/dev/sdh].write_io_errs   5839
>>> [/dev/sdh].read_io_errs    2188
>>> [/dev/sdh].flush_io_errs   11
>>> [/dev/sdh].corruption_errs 1
>>> [/dev/sdh].generation_errs 16373
>>>
>>> So I checked the smart data for those disks, they seem perfect, no 
>>> reallocated sectors, no problems.  But one thing I did notice is 
>>> that they are all WD Green drives.  So I'm guessing that if they 
>>> power down and get reassigned to a new /dev/sd* letter, that could 
>>> lead to data corruption.  I used idle3ctl to turn off the shut down 
>>> mode on all the green drives in the system, but I'm having trouble 
>>> getting the filesystem working without the errors.  I tried a 'check 
>>> --repair' command on it, and it seems to find a lot of verification 
>>> errors, but it doesn't look like things are getting fixed.
>>>  But I have all the data on it backed up on another system, so I can 
>>> recreate this if I need to.  But here's what I want to know:
>>>
>>> 1.  Am I correct about the issues with the WD Green drives, if they 
>>> change mounts during disk operations, will that corrupt data?
>> I just wanted to chip in with WD Green drives. I have a RAID10 
>> running on 6x2TB of those, actually had for ~3 years. If disk goes 
>> down for spin down, and you try to access something - kernel & FS & 
>> whole system will wait for drive to re-spin and everything works OK. 
>> I’ve never had a drive being reassigned to different /dev/sdX due to 
>> spin down / up.
>> 2 years ago I was having a corruption due to not using ECC ram on my 
>> system and one of RAM modules started producing errors that were 
>> never caught up by CPU / MoBo. Long story short, guy here managed to 
>> point me to the right direction and I started shifting my data to 
>> hopefully new and not corrupted FS … but I was sceptical of similar 
>> issue that you have described AND I did raid1 and while mounted I did 
>> shift disk from one SATA port to another and FS managed to pick up 
>> the disk in new location and did not even blinked (as far as I 
>> remember there was syslog entry to say that disk vanished and then 
>> that disk was added)
>>
>> Last word, you got plenty of errors in your SMART for transfer 
>> related stuff, please be advised that this may mean:
>> - faulty cable
>> - faulty mono controller
>> - faulty drive controller
>> - bad RAM - yes, mother board CAN use your ram for storing data and 
>> transfer related stuff … specially chapter ones.
> It's worth pointing out that the most likely point here for data 
> corruption assuming the cable and controllers are OK is during the DMA 
> transfer from system RAM to the drive controller.  Even when dealing 
> with really good HBA's that have an on-board NVRAM cache, you still 
> have to copy the data out of system RAM at some point, and that's 
> usually when the corruption occurs if the problem is with the RAM, CPU 
> or MB.

Well, I was able to run memtest on the system last night, that passed 
with flying colors, so I'm now leaning toward the problem being in the 
sas card.  But I'll have to run some more tests.

>>
>>> 2.  If that is the case:
>>>    a.) Is there any way I can stop the /dev/sd* mount points from 
>>> changing?  Or can I set up the filesystem using UUIDs or something 
>>> more solid?  I googled about it, but found conflicting info
>> Don’t get it the wrong way but I’m personally surprised that anybody 
>> still uses mount points rather than UUID. Devices change from boot to 
>> boot for a lot of people and most of distros moved to uuid (2 years 
>> ago ? even the swap is mounted via UUID now)
>>
> Providing there are no changes in hardware configuration, device nodes 
> (which is what /dev/sd* are, not mount points) remain relatively 
> stable across boot.  My personal recommendation would be to mount by 
> label, not UUID, and give your filesystems relatively unique names (in 
> my case, I have a short identifier to ID the system the FS is for, 
> followed by a tag identifying where in the hierarchy it gets 
> mounted).  Mounting by UUID works, until you have to recreate the FS 
> from scratch and restore a backup, because then you have a different 
> UUID.

I got another email describing that in more detail, using 
/dev/disk/by-*, so I'll give that a try later.  But my problem does seem 
to be that the mounts are changing while the system is up.  For 
instance, I ran a "check --repair" on the filesystem shown above, and 
/dev/sdj changed to /dev/sds afterwards.  Which leads me to think it's a 
controller card problem, but I'll have to check it to be sure.


>>>    b.) Or, is there something else changing my drive devices?  I 
>>> have most of drives on an LSI SAS 9201-16i card, is there something 
>>> I need to do to make them fixed?
>> I’ll let more senior data storage experts to speak up but most of the 
>> time people frowned on me for mentioning anything different than 
>> north bridge / Intel raid card / super micro / 3 ware .
>>
>> (And yes I did found the hard way they were right:
>> - marvel controller on my mobo randomly writes garbage to your drives
>> - adapted PCI express card was switching of all the drives mid flight 
>> while pretending “it’s OK” resulting in very peculiar data losses in 
>> the middle of big file.
> Based on personal experience:
> 1. LSI Logic, Super Micro, Intel, and 3 Ware all generally have very 
> high quality HBA's (both RAID and non-RAID)
> 2. HighPoint and ASMedia (ASUS's internal semi-conductor branch) are 
> decent if you get recent cards.
> 3. Marvell, Adaptec, and Areca are hit or miss, some work great, some 
> are horrible.
> 4. I have never had a JMicron based card that worked reliably.
> 5. If it's in a server and branded by the manufacturer of that server, 
> it's probably a good adapter.

Thanks for the list, definitely worth looking into if my current LSI 
card is the problem.


>>
>>>    c.) Or, is there a script or something I can use to figure out if 
>>> the disks will change mounts?
>>>    d.) Or, if I wipe everything and rebuild, will the disks with the 
>>> idle3ctl fix work now?
>>>
>>> Regardless of whether or not it's a WD Green drive issue, should I 
>>> just wipefs all the disks and rebuild it?  Is there any way to 
>>> recover this?  Thanks for any help!
>>
>> IF you remotelly care about the data that you have (I think you 
>> should if you came here), I would suggest a good exercise:
>> - unplug all the drives you use for this file system and stop toying 
>> with it because you may loose more data (I did because I thought I 
>> knew better)
>> - get your self 2 new drives
>> - find my thread from ~2 years ago on this mailing list (might be 
>> different email address)
>> - try to locate Chris Mason reply with a script “my old friend”
>> - run this script on you system for couple of DAYS and you will see 
>> whenever you have any corruption creeping in
>> - if corruptions are creeping in, change a component in your system 
>> (controller / RAM / mobo / CPU / PSU) and repeat exercise (best to 
>> organise your self access to some spare parts / extra machine.
>> - when all is good, make and FS out of those 2 new drives, and try to 
>> rescue data from OLD FS !
>> - unplug new FS and put it one the shelf
>> - try to fix old FS … this will be a FUN and very educating exercise …
> FWIW, based on what's been said, I'm almost certain it's a hardware 
> issue, and would give better than 50/50 odds that it's bad RAM or a 
> bad storage controller.  In general, here's the order I'd swap things 
> in if you have spares:
> 1. RAM (If you find that RAM is the issue, try each individual module 
> separately to figure out which one is bad, if they all appear bad, 
> it's either a power issue or something is wrong with your MB).
> 2. Storage Controller
> 3. PSU (if you have a multi-meter and a bit of wire, or a PSU tester, 
> you can check this without needing a spare).
> 4. MB
> 5. CPU

As I said, the ram looks reasonably good at this point.  I'm guessing 
it's a storage controller issue, I'll set up some test arrays on the 
board controller vs the added controller and see how that goes.  The psu 
is a good idea to check, but it's a fairly high power system so I keep 
an eye on that pretty closely.  If it's the MB, that's probably the best 
case scenario, I have a few old boards sitting around so finding a 
replacement will be cheap.  Might be harder to find a cpu problem, it's 
a dual xeon board, so if one is bad and the other is good, could be 
tricky to figure out.


>
> Other things to consider regarding power:
> 1. If you're using a UPS, make sure it lists either 'True sine wave 
> output' or 'APFC' support, many cheap ones produce a quantized sine 
> wave output which causes issues with many modern computer PSU's.
> 2. If you're not using a UPS, you may have issues with what's 
> colloquially known as 'dirty' power.  In signal theory terms, this 
> means that you've got a noise source mixed in with the 50/60Hz sine 
> wave that's supposed to be present on line power.  This is actually a 
> surprisingly common issue in many parts of the world because of the 
> conditions of the transmission lines and the power plant itself. 
> Somewhat ironically, one of the most reliable ways to deal with this 
> is to get a UPS (and if you do, make sure and look for one that meets 
> what I said above)
> Both issues can manifest almost identically to having bad RAM or a bad 
> PSU, but they're often expensive to fix, and in the second case, not 
> easy to test for.

Interesting.  I have 4 machines that I use for folding or DVR duty, 
they're on 24-7 and pretty stable.  I suppose btrfs might be super 
sensitive, but that seems unlikely.  Something to think about, though.  
And thanks again for all the help!

     ------ Corey

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: raid1 has failing disks, but smart is clear
  2016-07-08  4:50     ` Corey Coughlin
@ 2016-07-08 11:14       ` Tomasz Kusmierz
  2016-07-08 12:14         ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 13+ messages in thread
From: Tomasz Kusmierz @ 2016-07-08 11:14 UTC (permalink / raw)
  To: Corey Coughlin; +Cc: Austin S. Hemmelgarn, Btrfs BTRFS

>
> Well, I was able to run memtest on the system last night, that passed with
> flying colors, so I'm now leaning toward the problem being in the sas card.
> But I'll have to run some more tests.
>

Seriously use the "stres.sh" for couple of days, When I was running
memtest it was running continuously for 3 days without the error, day
of stres.sh and errors started showing up.
Be VERY careful with trusting any sort of that tool, modern CPU's lye
to you continuously !!!
1. You may think that you've wrote best on the planet code that
bypasses a CPU cache, but in reality since CPU's are multicore you can
end up with overzealous MPMD traping you inside of you cache memory
and all you resting will do is write a page (trapped in cache) read it
from cache (coherency mechanism, not the mis/hit one) will trap you
inside of L3 so you have no clue you don't touch the ram, then CPU
will just dump your page to RAM and "job done"
2. Since coherency problems and real problems with non blocking on
mpmd you can have a DMA controller sucking pages out your own cache,
due to ram being marked as dirty and CPU will try to spare the time
and accelerate the operation to push DMA straigh out of L3 to
somewhere else (mentioning that sine some testers use crazy way of
forcing your ram access via DMA to somewhere and back to force droping
out of L3)
3. This one is actually funny: some testers didn't claim the pages to
the process so for some reason pages that the were using were not
showing up as used / dirty etc so all the testing was done 32kB of L1
... tests were fast thou :)

srters.sh will test operation of the whole system !!! it shifts a lot
of data so disks are engaged, CPU keeps pumping out CRC32 all the time
so it's busy, RAM gets hit nicely as well due to high DMA.

When come to think about it, if your device points change during
operation of the system it might be an LSI card dying -> reinitialize
-> rediscovering drives -> drives show up in different point. On my
system I can hot swap sata and it will come up with different dev even
thou it was connected to same place on the controller.

I think, most important - I presume you run nonECC ?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: raid1 has failing disks, but smart is clear
  2016-07-08 11:14       ` Tomasz Kusmierz
@ 2016-07-08 12:14         ` Austin S. Hemmelgarn
  2016-07-09  5:13           ` Corey Coughlin
  0 siblings, 1 reply; 13+ messages in thread
From: Austin S. Hemmelgarn @ 2016-07-08 12:14 UTC (permalink / raw)
  To: Tomasz Kusmierz, Corey Coughlin; +Cc: Btrfs BTRFS

On 2016-07-08 07:14, Tomasz Kusmierz wrote:
>>
>> Well, I was able to run memtest on the system last night, that passed with
>> flying colors, so I'm now leaning toward the problem being in the sas card.
>> But I'll have to run some more tests.
>>
>
> Seriously use the "stres.sh" for couple of days, When I was running
> memtest it was running continuously for 3 days without the error, day
> of stres.sh and errors started showing up.
> Be VERY careful with trusting any sort of that tool, modern CPU's lye
> to you continuously !!!
> 1. You may think that you've wrote best on the planet code that
> bypasses a CPU cache, but in reality since CPU's are multicore you can
> end up with overzealous MPMD traping you inside of you cache memory
> and all you resting will do is write a page (trapped in cache) read it
> from cache (coherency mechanism, not the mis/hit one) will trap you
> inside of L3 so you have no clue you don't touch the ram, then CPU
> will just dump your page to RAM and "job done"
> 2. Since coherency problems and real problems with non blocking on
> mpmd you can have a DMA controller sucking pages out your own cache,
> due to ram being marked as dirty and CPU will try to spare the time
> and accelerate the operation to push DMA straigh out of L3 to
> somewhere else (mentioning that sine some testers use crazy way of
> forcing your ram access via DMA to somewhere and back to force droping
> out of L3)
> 3. This one is actually funny: some testers didn't claim the pages to
> the process so for some reason pages that the were using were not
> showing up as used / dirty etc so all the testing was done 32kB of L1
> ... tests were fast thou :)
>
> srters.sh will test operation of the whole system !!! it shifts a lot
> of data so disks are engaged, CPU keeps pumping out CRC32 all the time
> so it's busy, RAM gets hit nicely as well due to high DMA.
Agreed, never just trust memtest86 or memtest86+.

FWIW< here's the routine I go through to test new RAM:
1. Run regular memtest86 for at least 3 full cycles in full SMP mode (F2 
while starting up to force SMP).  On some systems this may hang, but 
that's an issue in the BIOS's setup of the CPU and MC, not the RAM, and 
is generally not indicative of a system which will have issues.
2. Run regular memtest86 for at least 3 full cycles in regular UP mode 
(the default on most non-NUMA hardware).
3. Repeat 1 and 2 with memtest86+.  It's diverged enough from regular 
memtest86 that it's functionally a separate tool, and I've seen RAM that 
passes one but not the other on multiple occasions before.
4. Boot SystemRescueCD, download a copy of the Linux sources, and run as 
many allmodconfig builds in parallel as I have CPU's, each with a number 
of make jobs equal to the twice number of CPU's (so each CPU ends up 
running at least two threads).  This forces enough context switching to 
completely trash even the L3 cache on almost any modern processor, which 
means it forces things out to RAM.  It won't hit all your RAM, but I've 
found it to be a relatively reliable way to verify the memory bus and 
the memory controller work properly.
5. Still from SystemRescueCD, use a tool called memtester (essentially 
memtest86, but run from userspace) to check the RAM.
6. Still from SystemRescueCD, use sha1sum to compute SHA-1 hashes of all 
the disks in the system, using at least 8 instances of sha1sum per CPU 
core, and make sure that all the sums for a disk match.
7. Do 6 again, but using cat to compute the sum of a concatenation of 
all the disks in the system (so the individual commands end up being 
`cat /dev/sd? | sha1sum`).  This will rapidly use all available memory 
on the system and keep it in use for quite a while.
8. If I'm using my home server system, I also have a special virtual 
runlevel set up where I spin up 4 times as many VM's as I have CPU cores 
(so on my current 8 core system, I spin up 32), all assigned a part of 
the RAM not used by the host (which I shrink to the minimum useable size 
of about 500MB), all running steps 1-3 in parallel.

It may also be worth mentioning that I've seen very poorly behaved HBA's 
that produce symptoms that look like bad RAM, including issues not 
related to the disks themselves, yet show no issues when regular memory 
testing is run.
>
> When come to think about it, if your device points change during
> operation of the system it might be an LSI card dying -> reinitialize
> -> rediscovering drives -> drives show up in different point. On my
> system I can hot swap sata and it will come up with different dev even
> thou it was connected to same place on the controller.
Barring a few odd controllers I've seen which support hot-plug but not 
hot-remove, that shouldn't happen unless the device is in use, and in 
that case it only happens because of the existing open references to the 
device being held by whatever is using it.
>
> I think, most important - I presume you run nonECC ?
And if not, how well shielded is your system?  You can often get by with 
non-ECC RAM if you have good EMI shielding and reboot regularly.  Most 
servers actually do have good EMI shielding, and many pre-built desktops 
do, but a lot of DIY systems don't (especially if ti's a gaming case, 
the poly-carbonate windows many of them have in the side panel are a 
_huge_ hole in the EMI shielding).

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: raid1 has failing disks, but smart is clear
  2016-07-08 12:14         ` Austin S. Hemmelgarn
@ 2016-07-09  5:13           ` Corey Coughlin
  0 siblings, 0 replies; 13+ messages in thread
From: Corey Coughlin @ 2016-07-09  5:13 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, Tomasz Kusmierz; +Cc: Btrfs BTRFS

Hi all,
     One thing I may not have made clear, this wasn't a system that has 
been running for a month and just became up corrupt out of nowhere, the 
corruption showed up the first time I tried to run a filesystem balance, 
basically the day after I set up the filesystem and copied files over.  
I was hoping to get it stable and then add some more disks, but since it 
wasn't stable right off the top, I'm assuming the problem is bigger than 
some bad memory. I ran the stress.sh on two disks connected to ports on 
the motherboard, that seemed to work fine.  And I'm using a pair of WD 
green drives, in case there's an issue with those.  I did order some WD 
red NAS drives, hoping they arrive soon.  I'm running the stress now 
with them connected to the SAS card, trying to only let them run for a 
day to see if something bad happens, it's a 4 port card so if there's a 
problem with a specific port or cable, it could take me a while to find 
it.  I'm hoping it shows up in a somewhat obvious way.  But thanks for 
all the help, the stress.sh runs give me a clear way to try to debug 
this, thanks again for that tip.

     ------- Corey

On 07/08/2016 05:14 AM, Austin S. Hemmelgarn wrote:
> On 2016-07-08 07:14, Tomasz Kusmierz wrote:
>>>
>>> Well, I was able to run memtest on the system last night, that 
>>> passed with
>>> flying colors, so I'm now leaning toward the problem being in the 
>>> sas card.
>>> But I'll have to run some more tests.
>>>
>>
>> Seriously use the "stres.sh" for couple of days, When I was running
>> memtest it was running continuously for 3 days without the error, day
>> of stres.sh and errors started showing up.
>> Be VERY careful with trusting any sort of that tool, modern CPU's lye
>> to you continuously !!!
>> 1. You may think that you've wrote best on the planet code that
>> bypasses a CPU cache, but in reality since CPU's are multicore you can
>> end up with overzealous MPMD traping you inside of you cache memory
>> and all you resting will do is write a page (trapped in cache) read it
>> from cache (coherency mechanism, not the mis/hit one) will trap you
>> inside of L3 so you have no clue you don't touch the ram, then CPU
>> will just dump your page to RAM and "job done"
>> 2. Since coherency problems and real problems with non blocking on
>> mpmd you can have a DMA controller sucking pages out your own cache,
>> due to ram being marked as dirty and CPU will try to spare the time
>> and accelerate the operation to push DMA straigh out of L3 to
>> somewhere else (mentioning that sine some testers use crazy way of
>> forcing your ram access via DMA to somewhere and back to force droping
>> out of L3)
>> 3. This one is actually funny: some testers didn't claim the pages to
>> the process so for some reason pages that the were using were not
>> showing up as used / dirty etc so all the testing was done 32kB of L1
>> ... tests were fast thou :)
>>
>> srters.sh will test operation of the whole system !!! it shifts a lot
>> of data so disks are engaged, CPU keeps pumping out CRC32 all the time
>> so it's busy, RAM gets hit nicely as well due to high DMA.
> Agreed, never just trust memtest86 or memtest86+.
>
> FWIW< here's the routine I go through to test new RAM:
> 1. Run regular memtest86 for at least 3 full cycles in full SMP mode 
> (F2 while starting up to force SMP).  On some systems this may hang, 
> but that's an issue in the BIOS's setup of the CPU and MC, not the 
> RAM, and is generally not indicative of a system which will have issues.
> 2. Run regular memtest86 for at least 3 full cycles in regular UP mode 
> (the default on most non-NUMA hardware).
> 3. Repeat 1 and 2 with memtest86+.  It's diverged enough from regular 
> memtest86 that it's functionally a separate tool, and I've seen RAM 
> that passes one but not the other on multiple occasions before.
> 4. Boot SystemRescueCD, download a copy of the Linux sources, and run 
> as many allmodconfig builds in parallel as I have CPU's, each with a 
> number of make jobs equal to the twice number of CPU's (so each CPU 
> ends up running at least two threads).  This forces enough context 
> switching to completely trash even the L3 cache on almost any modern 
> processor, which means it forces things out to RAM.  It won't hit all 
> your RAM, but I've found it to be a relatively reliable way to verify 
> the memory bus and the memory controller work properly.
> 5. Still from SystemRescueCD, use a tool called memtester (essentially 
> memtest86, but run from userspace) to check the RAM.
> 6. Still from SystemRescueCD, use sha1sum to compute SHA-1 hashes of 
> all the disks in the system, using at least 8 instances of sha1sum per 
> CPU core, and make sure that all the sums for a disk match.
> 7. Do 6 again, but using cat to compute the sum of a concatenation of 
> all the disks in the system (so the individual commands end up being 
> `cat /dev/sd? | sha1sum`).  This will rapidly use all available memory 
> on the system and keep it in use for quite a while.
> 8. If I'm using my home server system, I also have a special virtual 
> runlevel set up where I spin up 4 times as many VM's as I have CPU 
> cores (so on my current 8 core system, I spin up 32), all assigned a 
> part of the RAM not used by the host (which I shrink to the minimum 
> useable size of about 500MB), all running steps 1-3 in parallel.
>
> It may also be worth mentioning that I've seen very poorly behaved 
> HBA's that produce symptoms that look like bad RAM, including issues 
> not related to the disks themselves, yet show no issues when regular 
> memory testing is run.
>>
>> When come to think about it, if your device points change during
>> operation of the system it might be an LSI card dying -> reinitialize
>> -> rediscovering drives -> drives show up in different point. On my
>> system I can hot swap sata and it will come up with different dev even
>> thou it was connected to same place on the controller.
> Barring a few odd controllers I've seen which support hot-plug but not 
> hot-remove, that shouldn't happen unless the device is in use, and in 
> that case it only happens because of the existing open references to 
> the device being held by whatever is using it.
>>
>> I think, most important - I presume you run nonECC ?
> And if not, how well shielded is your system?  You can often get by 
> with non-ECC RAM if you have good EMI shielding and reboot regularly.  
> Most servers actually do have good EMI shielding, and many pre-built 
> desktops do, but a lot of DIY systems don't (especially if ti's a 
> gaming case, the poly-carbonate windows many of them have in the side 
> panel are a _huge_ hole in the EMI shielding).


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2016-07-12  4:50 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-07-06 22:14 raid1 has failing disks, but smart is clear Corey Coughlin
2016-07-06 22:59 ` Tomasz Kusmierz
2016-07-07  6:40   ` Corey Coughlin
2016-07-08  1:24     ` Duncan
2016-07-08  4:51       ` Corey Coughlin
2016-07-09  5:51       ` Andrei Borzenkov
2016-07-09  5:40     ` Andrei Borzenkov
2016-07-12  4:50       ` Corey Coughlin
2016-07-07 11:58   ` Austin S. Hemmelgarn
2016-07-08  4:50     ` Corey Coughlin
2016-07-08 11:14       ` Tomasz Kusmierz
2016-07-08 12:14         ` Austin S. Hemmelgarn
2016-07-09  5:13           ` Corey Coughlin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).