Linux RAID subsystem development


* Need Help with crashed RAID5 (that was rebuilding and then had SATA error on another drive)
From: Ben Kamen @ 2016-08-22 21:51 UTC (permalink / raw)
  To: linux-raid

Hey all. I'm looking at the RAID Wiki and need some help.

First Info:

I have a RAID5 with 4 members, /dev/sd[cdef]1. Last night, sdc1
reported a SMART error recommending drive replacement (after I had
watched sector errors pile up for about a week).

No problem: I shut down the drive, pulled it, and replaced it with a
cold spare. Started the rebuild (around midnight CDT).

At 5:43am, I got this message:

This is an automatically generated mail message from mdadm
running on quantum

A Fail event had been detected on md device /dev/md127.

It could be related to component device /dev/sde1.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [raid1] [raid6] [raid5] [raid4]
md0 : active raid1 sda2[0] sdb2[2]
      511988 blocks super 1.0 [2/2] [UU]

md127 : active raid5 sdc1[4] sdf1[6] sde1[1](F) sdd1[5]
      2930276352 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/2] [U_U_]
      [===========>.........]  recovery = 55.9% (546131076/976758784) finish=381.6min speed=18805K/sec
      bitmap: 4/8 pages [16KB], 65536KB chunk

md1 : active raid1 sda3[0] sdb3[2]
      239489916 blocks super 1.1 [2/2] [UU]
      bitmap: 2/2 pages [8KB], 65536KB chunk

md10 : active raid1 sda1[0] sdb1[2]
      4193272 blocks super 1.1 [2/2] [UU]

unused devices: <none>

/dev/md127  is the one with issues.
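
As a rough illustration of how to read that stanza, here is a minimal Python sketch that pulls out the failed member, the slot counts and the recovery arithmetic. The helper name `summarize_stanza` is hypothetical, and the parsing assumes the layout shown in the excerpt above (progress in 1 KiB blocks, speed in K/sec); it is not a general /proc/mdstat parser.

```python
import re

# One stanza copied from the /proc/mdstat excerpt above.
MDSTAT_STANZA = """\
md127 : active raid5 sdc1[4] sdf1[6] sde1[1](F) sdd1[5]
      2930276352 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/2] [U_U_]
      [===========>.........]  recovery = 55.9% (546131076/976758784)
finish=381.6min speed=18805K/sec
"""


def summarize_stanza(stanza):
    """Summarize one array stanza from /proc/mdstat (hypothetical helper)."""
    name = re.match(r"(\w+) :", stanza).group(1)
    # Members look like "sde1[1]" or "sde1[1](F)"; "(F)" marks a faulty member.
    members = re.findall(r"(\w+)\[\d+\](\(F\))?", stanza)
    failed = [dev for dev, flag in members if flag]
    # "[4/2]" means 4 member slots, of which 2 are currently in sync.
    slots, in_sync = map(int, re.search(r"\[(\d+)/(\d+)\]", stanza).groups())
    # Recovery progress "(done/total)" is in 1 KiB blocks; speed is in K/sec.
    done, total = map(int, re.search(r"\((\d+)/(\d+)\)", stanza).groups())
    speed = int(re.search(r"speed=(\d+)K/sec", stanza).group(1))
    return {
        "name": name,
        "failed": failed,
        "degraded": in_sync < slots,
        "percent_done": 100.0 * done / total,
        "minutes_left": (total - done) / speed / 60.0,
    }


summary = summarize_stanza(MDSTAT_STANZA)
print(summary)
```

The computed remaining time, (976758784 - 546131076) / 18805 / 60, comes out close to the 381.6 minutes that the kernel reported.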

It looks like the SATA controller had issues. I couldn't see sde - so
I rebooted. (scold me later.)

All the drives are available. SMARTCTL tells me /dev/sde is happy as
can be (has a few bad sectors and is slated for replacement next, but
smart says drive is healthy).

I looked at the raid Wiki - and saved the mdadm --examine info. Of the
active members, the event count is off by 25 for happy vs unhappy
members.
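
For comparing event counts across members, a small sketch like the following may help. It assumes the text layout that `mdadm --examine` prints for several devices (a `/dev/...:` header line followed by an `Events :` line); the function name `event_counts` and the trimmed sample are illustrative only, and the sample values are not from this array.

```python
import re


def event_counts(examine_text):
    """Return {device: event count} from concatenated `mdadm --examine` output."""
    counts = {}
    device = None
    for line in examine_text.splitlines():
        header = re.match(r"^(/dev/\S+):\s*$", line)
        if header:
            device = header.group(1)
            continue
        events = re.match(r"^\s*Events\s*:\s*(\d+)", line)
        if events and device is not None:
            counts[device] = int(events.group(1))
    return counts


# Trimmed, illustrative sample; real output has many more fields per device.
SAMPLE = """\
/dev/sdb1:
          Magic : a92b4efc
         Events : 858
/dev/sdc1:
          Magic : a92b4efc
         Events : 833
"""

counts = event_counts(SAMPLE)
spread = max(counts.values()) - min(counts.values())
print(counts, "spread:", spread)
```

Saving the raw `--examine` output first and running the comparison on the saved file keeps the check read-only.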

But forcing the assembly claims

mdadm --assemble --force /dev/md127 /dev/sd[cdef]1
mdadm: /dev/sdc1 is busy - skipping
mdadm: /dev/sdd1 is busy - skipping
mdadm: /dev/sdf1 is busy - skipping
mdadm: Found some drive for an array that is already active: /dev/md/:BigRAID
mdadm: giving up.

So before I mess up ANYTHING else...

What should I be doing?

(Should I be stopping the RAID? Right now it seems like it's running.)

Thanks,

   -Ben

^ permalink raw reply

* Re: [PATCH v2 0/6] Add AVX512 optimized gen_syndrome, xor_syndrome and recovery functions
From: Shaohua Li @ 2016-08-22 16:38 UTC (permalink / raw)
  To: Gayatri Kammela; +Cc: linux-raid, linux-kernel, ravi.v.shankar
In-Reply-To: <1471050204-26361-1-git-send-email-gayatri.kammela@intel.com>

On Fri, Aug 12, 2016 at 06:03:18PM -0700, Gayatri Kammela wrote:
> This is the version 2 patch series for adding AVX512 optimized gen_syndrome,
> xor_syndrome and recovery functions.
> 
> Optimization of RAID6 using AVX512 instructions should improve
> RAID6 performance. These patches have been tested, and an improvement
> in performance was observed.
> 
> Changes since v1:
> 1) Added xor_syndrome functions to avx512 optimized raid6.
> 
> Gayatri Kammela (6):
>   lib/raid6: Add AVX512 optimized gen_syndrome functions
>   lib/raid6: Add AVX512 optimized recovery functions
>   lib/raid6/test/Makefile: Add avx512 gen_syndrome and recovery
>     functions
>   lib/raid6: Add AVX512 optimized xor_syndrome functions
>   (DO NOT APPLY) lib/raid6: Add unroll by 8 to AVX512 optimized
>     gen_syndrome functions
>   (DO NOT APPLY) lib/raid6: Add unroll by 8 to AVX512 optimized    
>     xor_syndrome functions.
> 
>  arch/x86/Makefile        |   5 +-
>  include/linux/raid/pq.h  |   5 +
>  lib/raid6/Makefile       |   2 +-
>  lib/raid6/algos.c        |  13 +
>  lib/raid6/avx512.c       | 972 +++++++++++++++++++++++++++++++++++++++++++++++
>  lib/raid6/recov_avx512.c | 388 +++++++++++++++++++
>  lib/raid6/test/Makefile  |   5 +-
>  lib/raid6/x86.h          |  10 +
>  8 files changed, 1396 insertions(+), 4 deletions(-)
>  create mode 100644 lib/raid6/avx512.c
>  create mode 100644 lib/raid6/recov_avx512.c
Hi,

I applied the first 4 patches, but I'm not familiar with the new instructions
and don't have the hardware to test them, so I have only done limited checking.

Thanks,
Shaohua

^ permalink raw reply

* Re: [RFC] Some fixes to allow for more than 128 md devices.
From: Robert LeBlanc @ 2016-08-22 16:03 UTC (permalink / raw)
  To: linux-raid; +Cc: dm-devel, Robert LeBlanc
In-Reply-To: <1471659031-30688-1-git-send-email-robert@leblancnet.us>

Apparently, the mdadm source on git-kernel.org (commit 13db17bd)
already has the fixes to properly create the device nodes, but I still
have the unexpected failure opening /dev/md1048574.
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Fri, Aug 19, 2016 at 8:10 PM, Robert LeBlanc <robert@leblancnet.us> wrote:
> I'm stuck and need some help getting this across the finish line. This
> is in no way complete, but it helps show what I'm working on.
>
> When we added more than 128 md devices, we started getting failures.
> Looking through the code, it seems that the minor dev number was being
> stored in an int, causing overflow and wreaking havoc on everything.
> I finally got the mknod in mdadm to correctly make the dev node with
> minors up to 1048574 as expected in the mdadm code. However, I can
> only create md devices up to 511. Trying to create an md higher than
> that fails because the device can't be read/opened. strace reports:
> open("/dev/.tmp.md.15341:9:1048574", O_RDWR|O_EXCL|O_DIRECT) = -1 ENXIO
> (No such device or address)
> while Python reports:
> IOError: [Errno 6] No such device or address: '/dev/.tmp.md.3279:9:512'
>
> A corresponding node is not created in /sys/block/md* for mds over 511.
>
> I believe that there may be a bug in the kernel code that is now being
> hit. After looking through the kernel code, I can't seem to find where
> this might be. Please help me by either pointing me to the source
> location where the problem might be or fixing it based on these
> patches I've worked on so far. I'm using 4.7.0 currently.
>
> I'm using this for testing:
> ./mdadm --create /dev/md1048574 --assume-clean --verbose --level=1 \
> --raid-devices=2 /dev/loop0 missing
>
> Yes, we have a real need for more than 128 and 512 md devices.
>
> Please include me in any replies as I'm not on the ML.
>
> Thank you.
>
> Robert LeBlanc (1):
>   Some fixes to allow for more than 128 md devices.
>
>  Manage.c |  5 +++--
>  lib.c    |  2 +-
>  mdadm.h  |  6 +++---
>  util.c   | 25 +++++++++++++------------
>  4 files changed, 20 insertions(+), 18 deletions(-)
>
> --
> 2.8.1
>

^ permalink raw reply

* Re: Assistance Reviewing Proposed Recovery Measures
From: Phil Turmel @ 2016-08-22 12:09 UTC (permalink / raw)
  To: Chris Maxwell, linux-raid
In-Reply-To: <alpine.LRH.2.11.1608191717050.23388@bluenose.cs.dal.ca>

Good morning Chris,

Very good report, btw.

On 08/19/2016 04:18 PM, Chris Maxwell wrote:

[trim /]

> The machine has a 3ware Hardware RAID controller which is showing sdb
> and sdc as disks.  (Unit 0 and Unit 1).
> Unit 0 (sdb) is made up of
> Phy 0: WD WCAW35791262
> Phy 1:  Seagate 9QJ7N744
> 
> Unit 1(sdc) is made up of
> Phy 2: Seagate 9QJ7F3PJ and
> Phy 3: Seagate 9QJ7R3Y1
> 
> These are then combined into mirror md0 made of sdb1 and sdc1
> This is the physical volume for LVM VG lvm-raid, which then has LV inside:
> lvmdata1 and gokcen

The models of the disks would be useful information, too.  Your dmesg
indicates unit 3 is very slow to report UREs, which means it's probably a
desktop drive, not a raid drive.  I don't have much hardware raid
experience, but I do know that smartctl won't report properly on devices
connected to hardware raid without additional options on the command
line.  You need to do this to get a smartctl -x report on all of these
devices.
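
A minimal sketch of what those additional options can look like for a 3ware controller, using smartctl's `-d 3ware,N` device type. The controller node `/dev/twa0` is an assumption (other 3ware series use `/dev/twe0` or `/dev/twl0`), as is the idea that ports 0 to 3 correspond to Phy 0 to 3 in the report; the helper name is hypothetical.

```python
def smartctl_3ware_commands(ports, controller="/dev/twa0"):
    """Build one `smartctl -x` argv per physical port behind a 3ware card.

    `controller` is an assumed device node; adjust it for the installed card.
    """
    return [
        ["smartctl", "-x", "-d", f"3ware,{port}", controller]
        for port in ports
    ]


commands = smartctl_3ware_commands(range(4))
for argv in commands:
    # Printed rather than executed, so the sketch has no side effects.
    print(" ".join(argv))
```

Each argv list can be handed to `subprocess.run` as root once the controller node has been confirmed.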

It is unclear from your description if the Phy0/1 pair are mirrored
themselves or striped.  Same with Phy2/3.  Do you have a net four copies
of your data or a net two copies of your data?

> ==========================================================================
> Figure 2: mdadm --examine of /dev/sdb1 and sdc1:
> 
> # mdadm --examine /dev/sd[bc]1 >> raid.status.latest
> 
> /dev/sdb1:
>           Magic : a92b4efc
>         Version : 0.90.00
>            UUID : fdd98007:78663948:0760cb1c:ce437c35
>   Creation Time : Mon Oct 18 10:54:29 2010
>      Raid Level : raid1
>   Used Dev Size : 976551040 (931.31 GiB 999.99 GB)
>      Array Size : 976551040 (931.31 GiB 999.99 GB)
>    Raid Devices : 2
>   Total Devices : 2
> Preferred Minor : 0
> 
>     Update Time : Thu Aug  4 12:11:38 2016
>           State : clean
>  Active Devices : 1
> Working Devices : 2
>  Failed Devices : 0
>   Spare Devices : 1
>        Checksum : d3a40069 - correct
>          Events : 858
> 
> 
>       Number   Major   Minor   RaidDevice State
> this     2       8       17        2      spare   /dev/sdb1
> 
>    0     0       0        0        0      removed
>    1     1       8       33        1      active sync   /dev/sdc1
>    2     2       8       17        2      spare   /dev/sdb1
> /dev/sdc1:
>           Magic : a92b4efc
>         Version : 0.90.00
>            UUID : fdd98007:78663948:0760cb1c:ce437c35
>   Creation Time : Mon Oct 18 10:54:29 2010
>      Raid Level : raid1
>   Used Dev Size : 976551040 (931.31 GiB 999.99 GB)
>      Array Size : 976551040 (931.31 GiB 999.99 GB)
>    Raid Devices : 2
>   Total Devices : 2
> Preferred Minor : 0
> 
>     Update Time : Thu Aug  4 12:11:38 2016
>           State : clean
>  Active Devices : 1
> Working Devices : 2
>  Failed Devices : 0
>   Spare Devices : 1
>        Checksum : d3a4007d - correct
>          Events : 858
> 
> 
>       Number   Major   Minor   RaidDevice State
> this     1       8       33        1      active sync   /dev/sdc1
> 
>    0     0       0        0        0      removed
>    1     1       8       33        1      active sync   /dev/sdc1
>    2     2       8       17        2      spare   /dev/sdb1

I suspect that your array has scattered UREs on desktop drives.  The
hardware raid isn't kicking the drives out after 30 seconds like
software raid would (see reference threads below), but instead allows
the problem to persist.

If you have a 4-way mirror, plugging these drives directly into a mobo
w/ the driver timeout work-around might be the best way to safely
recover your data.  The 3ware card certainly isn't behaving the way I
would predict, which means my advice isn't valid with it in the mix.  If
a hardware raid expert pipes up with alternatives, that would be helpful.
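
For reference, the driver timeout work-around is commonly described in threads like those listed below as raising the kernel's per-device command timeout through sysfs. The sketch below assumes that reading; the 180-second value is the figure usually quoted on the list and is an assumption here, and the function name is hypothetical. It needs root when pointed at the real /sys.

```python
from pathlib import Path


def set_driver_timeout(disk, seconds=180, sysfs_root="/sys"):
    """Write the SCSI command timeout for one disk, e.g. disk="sdb".

    `seconds=180` is an assumed value, not taken from this thread.
    `sysfs_root` exists so the sketch can be tried against a scratch
    directory instead of the live /sys tree.
    """
    path = Path(sysfs_root) / "block" / disk / "device" / "timeout"
    path.write_text(f"{seconds}\n")
    return path
```

The setting does not persist across reboots, so it would have to be reapplied at boot for every array member.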

Meanwhile, please supply the smartctl -x reports.  Just paste them in
your reply w/ line wrap disabled.

Phil

Readings for timeout mismatch issues:  (whole threads if possible)

http://marc.info/?l=linux-raid&m=139050322510249&w=2
http://marc.info/?l=linux-raid&m=135863964624202&w=2
http://marc.info/?l=linux-raid&m=135811522817345&w=1
http://marc.info/?l=linux-raid&m=133761065622164&w=2
http://marc.info/?l=linux-raid&m=132477199207506
http://marc.info/?l=linux-raid&m=133665797115876&w=2
http://marc.info/?l=linux-raid&m=142487508806844&w=3
http://marc.info/?l=linux-raid&m=144535576302583&w=2

^ permalink raw reply

* [PATCH -next] md-cluster: fix error return code in join()
From: Wei Yongjun @ 2016-08-21 14:42 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Wei Yongjun, linux-raid

From: Wei Yongjun <weiyongjun1@huawei.com>

Fix to return error code -ENOMEM from the lockres_init() error
handling case instead of 0, as done elsewhere in this function.

Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
---
 drivers/md/md-cluster.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
index 333f0cf..2b13117 100644
--- a/drivers/md/md-cluster.c
+++ b/drivers/md/md-cluster.c
@@ -874,8 +874,10 @@ static int join(struct mddev *mddev, int nodes)
 		goto err;
 	}
 	cinfo->ack_lockres = lockres_init(mddev, "ack", ack_bast, 0);
-	if (!cinfo->ack_lockres)
+	if (!cinfo->ack_lockres) {
+		ret = -ENOMEM;
 		goto err;
+	}
 	/* get sync CR lock on ACK. */
 	if (dlm_lock_sync(cinfo->ack_lockres, DLM_LOCK_CR))
 		pr_err("md-cluster: failed to get a sync CR lock on ACK!(%d)\n",
@@ -889,8 +891,10 @@ static int join(struct mddev *mddev, int nodes)
 	pr_info("md-cluster: Joined cluster %s slot %d\n", str, cinfo->slot_number);
 	snprintf(str, 64, "bitmap%04d", cinfo->slot_number - 1);
 	cinfo->bitmap_lockres = lockres_init(mddev, str, NULL, 1);
-	if (!cinfo->bitmap_lockres)
+	if (!cinfo->bitmap_lockres) {
+		ret = -ENOMEM;
 		goto err;
+	}
 	if (dlm_lock_sync(cinfo->bitmap_lockres, DLM_LOCK_PW)) {
 		pr_err("Failed to get bitmap lock\n");
 		ret = -EINVAL;
@@ -898,8 +902,10 @@ static int join(struct mddev *mddev, int nodes)
 	}
 
 	cinfo->resync_lockres = lockres_init(mddev, "resync", NULL, 0);
-	if (!cinfo->resync_lockres)
+	if (!cinfo->resync_lockres) {
+		ret = -ENOMEM;
 		goto err;
+	}
 
 	return 0;
 err:


^ permalink raw reply related

* Re: Rewrite md raid1 member
From: Wols Lists @ 2016-08-20 10:44 UTC (permalink / raw)
  To: Chris Dunlop, Chris Murphy; +Cc: Brad Campbell, Linux-RAID
In-Reply-To: <20160820014336.GA30642@onthe.net.au>

On 20/08/16 02:43, Chris Dunlop wrote:
> Then again, I guess in the end what I'd really like is to be able to
> flag a particular disk to md for "write repair", and tell md to repair.
> Then md would read data from unflagged disks to write to the flagged
> disk (that could work for parity raids as well as mirrors).

I had that idea. I'm probably better at understanding and documenting
things, hence my interest in the raid wiki, but I'm looking at this
exact thing as a project for my first foray into kernel programming. Is
that wise? :-)

Basically, do a stripe integrity check, and optionally rewrite it? I
don't know to what extent linux raid actually implements a lot of interesting
theoretical abilities, and if I can document it, I can then identify
holes and try and fill them. Especially when you're trying to recover a
broken array, the more options you have, the better ...

Unfortunately the raid wiki admin is MIA at the moment, and I really
want to hack that as a learning exercise before I start messing about
with kernel code.

Cheers,
Wol

^ permalink raw reply

* Re: Assistance Reviewing Proposed Recovery Measures
From: Wols Lists @ 2016-08-20 10:16 UTC (permalink / raw)
  To: Chris Maxwell, linux-raid
In-Reply-To: <alpine.LRH.2.11.1608191717050.23388@bluenose.cs.dal.ca>

On 19/08/16 21:18, Chris Maxwell wrote:
> However, it also says
> 
>> Recreating should be considered a *last* resort, only to be used when
>> everything else fails. People getting this wrong is one of the primary
>> reasons people lose data. It is very commonly used way too early in
>> the fault finding process. You have been warned! It's better to send
>> an email to the linux-raid mailing list with detailed information
>> (mdadm --examine from all component drives plus log entries from when
>> the failure happened, including mdadm and kernel version) and ask for
>> advice than to try to use --create --assume-clean and getting it wrong.
> 
> 
> 
> So before I messed things up even worse, I figured I would consult the
> experts to see what my next step should be.
> 
Download Phil Turmel's lsdrv. It should be easy to find with google. If
your default Python is 3, just edit the shebang line to make it run.

SAVE THE HARD COPY IN A SAFE PLACE!!!

If anything does go wrong (Sarah's comments about wiping superblocks etc
etc unnerve me) this *should* have all the information you need to
recreate the drives and recover what's recoverable.

(Oh - and the linux raid wiki is a bit of a cobweb site at the moment
:-( so take anything on it with a pinch of salt - it could well be out
of date.)

Also download the latest mdadm from Neil Brown's site. Most distros are
a bit out of date, and the bug fixes might matter.

Cheers,
Wol

^ permalink raw reply

* Re: Rewrite md raid1 member
From: NeilBrown @ 2016-08-20  6:52 UTC (permalink / raw)
  To: Chris Dunlop; +Cc: Wols Lists, Brad Campbell, linux-raid
In-Reply-To: <20160820015708.GB30642@onthe.net.au>

On Sat, Aug 20 2016, Chris Dunlop wrote:

> Hi Neil,
>
> Nice work on the Bus1 article!

Thanks :-)

>
> On Sat, Aug 20, 2016 at 07:26:27AM +1000, NeilBrown wrote:
>> On Fri, Aug 19 2016, Chris Dunlop wrote:
>>> In my case, I want it to write everything.
>>>
>>> If I do my 'dd' to write everything as previously described, with the window
>>> of opportunity for stale data to end up on the written disk, one option
>>> would be to run a scrub / repair to check the data is the same - but if I'm
>>> unlucky with my dd and the data isn't the same for some sector[s], I want to
>>> ensure the correct data is copied over the stale data and not the other way
>>> around, e.g. to specify "in the event of a mismatch, use the data from sda
>>> and overwrite the data on sdb".
>>>
>>> Unfortunately I don't know how that can be done.
>>>
>>> Does anyone know?
>> 
>> If it is the second device in the array (as listed by mdadm --detail)
>> then you can stop the array and re-assemble with --update=resync.
>
> That's nearly there - except in this specific case it's my root filesystem
> so I can't stop the array without booting into a recovery disk etc. Of
> course I could do that, but the point of the exercise is to see if it can
> be done live, safely.

Well... you could
  cd /sys/block/mdXX/md
  echo frozen > sync_action
  echo 0 > resync_start
  echo idle > sync_action

that should start a resync on a live array.
Still, that only works for a non-first device in RAID1.
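
The same three writes can be scripted; the sketch below mirrors the echo sequence above in order. The `write` hook and the function name are hypothetical additions so the sequence can be dry-run without touching a live array.

```python
from pathlib import Path

# Same order as the echo commands: freeze, rewind the resync point, release.
RESYNC_STEPS = [
    ("sync_action", "frozen"),
    ("resync_start", "0"),
    ("sync_action", "idle"),
]


def restart_resync(md_dir, write=None):
    """Apply the three sysfs writes under md_dir, e.g. /sys/block/mdXX/md."""
    md_dir = Path(md_dir)
    if write is None:
        # Default: really write to the sysfs attribute (needs root).
        def write(path, value):
            path.write_text(value + "\n")
    for name, value in RESYNC_STEPS:
        write(md_dir / name, value)
```

Passing a recording function as `write` shows exactly what would be written before doing it for real.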

>
>> If it is the first device I can only suggest that you
>> fail the device and add it again:
>> 
>>  mdadm /dev/mdXX --fail /dev/sdYY
>>  mdadm /dev/mdXX --remove /dev/sdYY
>>  mdadm /dev/mdXX --add /dev/sdYY
>> 
>> If the "good" drive fails during the rewrite it might be a little bit
>> fiddley getting the array working again, but all the data will certainly
>> be there on the device you are re-writing, so you won't lose anything.
>
> OK, that sounds good. What would the process be if the good drive fails,
> either completely, or a few specific sectors?

If you think there is a serious risk of that happening, then it's best
to skip this option.
You would need to boot from a rescue disk and re-create the array using
just the working device - and make sure the same data-offset and size
are used.  Certainly possible, but not at all straightforward.

Another thing you could do, particularly if you know what region of the
device needs to be over-written, is to write sector numbers to
suspend_lo and suspend_hi.  This will suspend all IO through the
/dev/mdXX device to that range of array sectors.
Then you could read from/write to the raw device with dd or whatever.

raid6check.c does this on a raid6 to correct errors that can be detected
with the raid6 syndrome, even while the array is online.  A similar
thing could be done to allow individual blocks to be rewritten.
Care is needed to map between array addresses and device addresses.
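
For the RAID1 case discussed in this thread, the address mapping is the simple one: a member holds array sector N at device sector N plus that member's data offset. The sketch below assumes exactly that (it does not apply to striped or parity levels), takes the data offset as an input to be read from `mdadm --examine` (0.90 metadata keeps data at the start of the device), and uses hypothetical helper names and device paths.

```python
def raid1_device_range(array_start, array_end, data_offset=0):
    """Map a half-open array sector range onto a RAID1 member device.

    All values are 512-byte sectors. RAID1 only: every member stores
    array sector N at device sector N + data_offset.
    """
    if array_start < 0 or array_end <= array_start:
        raise ValueError("need 0 <= array_start < array_end")
    return array_start + data_offset, array_end + data_offset


def dd_copy_command(src, dst, dev_start, dev_end):
    """Build a dd argv copying device sectors [dev_start, dev_end) in place."""
    count = dev_end - dev_start
    return [
        "dd", f"if={src}", f"of={dst}", "bs=512",
        f"skip={dev_start}", f"seek={dev_start}", f"count={count}",
    ]


lo, hi = raid1_device_range(1000, 1008, data_offset=2048)
# /dev/sda2 and /dev/sdb2 are placeholder member names.
print(" ".join(dd_copy_command("/dev/sda2", "/dev/sdb2", lo, hi)))
```

This only builds the command; the range would still need to be suspended through suspend_lo and suspend_hi before anything is copied.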

NeilBrown

^ permalink raw reply

* Re: Assistance Reviewing Proposed Recovery Measures
From: Sarah Newman @ 2016-08-20  2:48 UTC (permalink / raw)
  To: Chris Maxwell; +Cc: linux-raid
In-Reply-To: <alpine.LRH.2.11.1608191717050.23388@bluenose.cs.dal.ca>

On 08/19/2016 01:18 PM, Chris Maxwell wrote:
<snip>

> https://raid.wiki.kernel.org/index.php/RAID_Recovery suggest one can create rather than assemble a sw raid. (Not sure whether to call it md0 or md1)
> mdadm --create /dev/md0 --metadata=0.90 --raid-devices=2 --level=raid1 /dev/sdb1 missing
> 
> However, it also says
> 
>> Recreating should be considered a *last* resort, only to be used when everything else fails. People getting this wrong is one of the primary reasons
>> people lose data. It is very commonly used way too early in the fault finding process. You have been warned! It's better to send an email to the
>> linux-raid mailing list with detailed information (mdadm --examine from all component drives plus log entries from when the failure happened,
>> including mdadm and kernel version) and ask for advice than to try to use --create --assume-clean and getting it wrong.
> 
> 
> 
> So before I messed things up even worse, I figured I would consult the experts to see what my next step should be.

I am not an expert, but 0.90 metadata is stored at the end of the device, and the md device is a simple mirror. I don't know why the underlying device couldn't be used directly.

I think this should work, depending on your hardware raid:

* verify exact size of md device by looking at size of /dev/md0 block device assembled from failing /dev/sdc1
* copy partition layout from /dev/sdc using sfdisk -d
* pull both sdb1 and sdc1 drives (mark which drive goes where)
* put in 2 new drives to replace failing sdc
* initialize new drives in hardware raid
* observe no existing partition table on new device
* create partition table on new device using sfdisk
* create a new md device /dev/md1 on the new /dev/sd?1, verifying size of /dev/md1 block device is same or bigger than original /dev/md0
* add good drives from original sdb back
* make an overlay of the first partition of the good device
https://raid.wiki.kernel.org/index.php/Recovering_a_failed_software_RAID#Making_the_harddisks_read-only_using_an_overlay_file
* zero the metadata on the overlay with mdadm --zero-superblock /path/to/overlay
* dd data from /path/to/overlay onto /dev/md1. This will run out of room if the new md device is exactly the same size as the old one, but should copy all the actual data.
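
The size checks in the steps above can be sketched numerically. With 0.90 metadata the superblock sits in the last 64 KiB-aligned 64 KiB block of the component, so the usable data area is the component size rounded down to 64 KiB, minus 64 KiB. The constant below is stated as I recall it from the kernel's md_p.h; the function name is hypothetical and the partition size in the example is assumed, since the thread only gives the array size.

```python
MD_RESERVED_SECTORS = 128  # 64 KiB expressed in 512-byte sectors


def md090_data_sectors(component_sectors):
    """Usable data sectors on a component carrying a 0.90 superblock."""
    aligned = component_sectors & ~(MD_RESERVED_SECTORS - 1)
    return aligned - MD_RESERVED_SECTORS


# "Used Dev Size : 976551040" is in KiB, i.e. 1953102080 sectors.
array_sectors = 976551040 * 2

# Assumed partition size: the smallest one that yields this array size.
partition_sectors = array_sectors + MD_RESERVED_SECTORS
print(md090_data_sectors(partition_sectors) == array_sectors)

# The partition is always at least 64 KiB larger than the array on it,
# which is why a dd of the whole partition overruns an equal-sized md device.
print(partition_sectors - md090_data_sectors(partition_sectors))
```

Comparing `blockdev --getsz` of the old and new md devices before copying gives the same answer without any arithmetic.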

--Sarah

^ permalink raw reply

* [RFC] Some fixes to allow for more than 128 md devices.
From: Robert LeBlanc @ 2016-08-20  2:10 UTC (permalink / raw)
  To: linux-raid; +Cc: dm-devel, robert
In-Reply-To: <1471659031-30688-1-git-send-email-robert@leblancnet.us>

---
 Manage.c |  5 +++--
 lib.c    |  2 +-
 mdadm.h  |  6 +++---
 util.c   | 25 +++++++++++++------------
 4 files changed, 20 insertions(+), 18 deletions(-)

diff --git a/Manage.c b/Manage.c
index 30adc99..51d0eae 100644
--- a/Manage.c
+++ b/Manage.c
@@ -1380,7 +1380,8 @@ int Manage_subdevs(char *devname, int fd,
 	for (dv = devlist; dv; dv = dv->next) {
 		unsigned long rdev = 0; /* device to add/remove etc */
 		int rv;
-		int mj,mn;
+		int mj;
+		unsigned long mn;
 
 		raid_slot = -1;
 		if (dv->disposition == 'c') {
@@ -1486,7 +1487,7 @@ int Manage_subdevs(char *devname, int fd,
 			if (sysfd >= 0) {
 				char dn[20];
 				if (sysfs_fd_get_str(sysfd, dn, 20) > 0 &&
-				    sscanf(dn, "%d:%d", &mj,&mn) == 2) {
+				    sscanf(dn, "%d:%lu", &mj,&mn) == 2) {
 					rdev = makedev(mj,mn);
 					found = 1;
 				}
diff --git a/lib.c b/lib.c
index 621edf3..e0df6fd 100644
--- a/lib.c
+++ b/lib.c
@@ -99,7 +99,7 @@ char *fd2kname(int fd)
 	return NULL;
 }
 
-char *devid2devnm(int devid)
+char *devid2devnm(unsigned long devid)
 {
 	char path[30];
 	char link[200];
diff --git a/mdadm.h b/mdadm.h
index d209488..7de6589 100755
--- a/mdadm.h
+++ b/mdadm.h
@@ -1348,7 +1348,7 @@ extern int check_partitions(int fd, char *dname,
 			    unsigned long long size);
 
 extern int get_mdp_major(void);
-extern int get_maj_min(char *dev, int *major, int *minor);
+extern int get_maj_min(char *dev, int *major, unsigned long *minor);
 extern int dev_open(char *dev, int flags);
 extern int open_dev(char *devnm);
 extern void reopen_mddev(int mdfd);
@@ -1438,8 +1438,8 @@ extern char *find_free_devnm(int use_partitions);
 
 extern void put_md_name(char *name);
 extern char *devid2kname(int devid);
-extern char *devid2devnm(int devid);
-extern int devnm2devid(char *devnm);
+extern char *devid2devnm(unsigned long devid);
+extern unsigned long devnm2devid(char *devnm);
 extern char *get_md_name(char *devnm);
 
 extern char DefaultConfFile[];
diff --git a/util.c b/util.c
index 2bcb81f..939cf52 100644
--- a/util.c
+++ b/util.c
@@ -928,7 +928,7 @@ int get_data_disks(int level, int layout, int raid_disks)
 	return data_disks;
 }
 
-int devnm2devid(char *devnm)
+unsigned long devnm2devid(char *devnm)
 {
 	/* First look in /sys/block/$DEVNM/dev for %d:%d
 	 * If that fails, try parsing out a number
@@ -936,7 +936,8 @@ int devnm2devid(char *devnm)
 	char path[100];
 	char *ep;
 	int fd;
-	int mjr,mnr;
+	int mjr;
+	unsigned long mnr;
 
 	sprintf(path, "/sys/block/%s/dev", devnm);
 	fd = open(path, O_RDONLY);
@@ -946,18 +947,18 @@ int devnm2devid(char *devnm)
 		close(fd);
 		if (n > 0)
 			buf[n] = 0;
-		if (n > 0 && sscanf(buf, "%d:%d\n", &mjr, &mnr) == 2)
+		if (n > 0 && sscanf(buf, "%d:%lu\n", &mjr, &mnr) == 2)
 			return makedev(mjr, mnr);
 	}
 	if (strncmp(devnm, "md_d", 4) == 0 &&
 	    isdigit(devnm[4]) &&
-	    (mnr = strtoul(devnm+4, &ep, 10)) >= 0 &&
+	    (mnr = strtoul(devnm+4, &ep, 10)) &&
 	    ep > devnm && *ep == 0)
 		return makedev(get_mdp_major(), mnr << MdpMinorShift);
 
 	if (strncmp(devnm, "md", 2) == 0 &&
 	    isdigit(devnm[2]) &&
-	    (mnr = strtoul(devnm+2, &ep, 10)) >= 0 &&
+	    (mnr = strtoul(devnm+2, &ep, 10)) &&
 	    ep > devnm && *ep == 0)
 		return makedev(MD_MAJOR, mnr);
 
@@ -1020,12 +1021,12 @@ void put_md_name(char *name)
 }
 #endif /* !defined(MDASSEMBLE) || defined(MDASSEMBLE) && defined(MDASSEMBLE_AUTO) */
 
-int get_maj_min(char *dev, int *major, int *minor)
+int get_maj_min(char *dev, int *major, unsigned long *minor)
 {
 	char *e;
 	*major = strtoul(dev, &e, 0);
 	return (e > dev && *e == ':' && e[1] &&
-		(*minor = strtoul(e+1, &e, 0)) >= 0 &&
+		(*minor = strtoul(e+1, &e, 0)) &&
 		*e == 0);
 }
 
@@ -1037,13 +1038,13 @@ int dev_open(char *dev, int flags)
 	int fd = -1;
 	char devname[32];
 	int major;
-	int minor;
+	unsigned long minor;
 
 	if (!dev) return -1;
 	flags |= O_DIRECT;
 
 	if (get_maj_min(dev, &major, &minor)) {
-		snprintf(devname, sizeof(devname), "/dev/.tmp.md.%d:%d:%d",
+		snprintf(devname, sizeof(devname), "/dev/.tmp.md.%d:%d:%lu",
 			 (int)getpid(), major, minor);
 		if (mknod(devname, S_IFBLK|0600, makedev(major, minor)) == 0) {
 			fd = open(devname, flags);
@@ -1051,7 +1052,7 @@ int dev_open(char *dev, int flags)
 		}
 		if (fd < 0) {
 			/* Try /tmp as /dev appear to be read-only */
-			snprintf(devname, sizeof(devname), "/tmp/.tmp.md.%d:%d:%d",
+			snprintf(devname, sizeof(devname), "/tmp/.tmp.md.%d:%d:%lu",
 				 (int)getpid(), major, minor);
 			if (mknod(devname, S_IFBLK|0600, makedev(major, minor)) == 0) {
 				fd = open(devname, flags);
@@ -1069,7 +1070,7 @@ int open_dev_flags(char *devnm, int flags)
 	char buf[20];
 
 	devid = devnm2devid(devnm);
-	sprintf(buf, "%d:%d", major(devid), minor(devid));
+	sprintf(buf, "%d:%lu", major(devid), (unsigned long)minor(devid));
 	return dev_open(buf, flags);
 }
 
@@ -1083,7 +1084,7 @@ int open_dev_excl(char *devnm)
 	char buf[20];
 	int i;
 	int flags = O_RDWR;
-	int devid = devnm2devid(devnm);
+	unsigned long devid = devnm2devid(devnm);
 	long delay = 1000;
 
 	sprintf(buf, "%d:%d", major(devid), minor(devid));
-- 
2.8.1
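
One detail that may be worth checking in the diff above: where the old code tested `(x = strtoul(...)) >= 0`, the new code uses the bare assignment as the condition, which appears to make a minor of 0 (for example "9:0" or "md0") fail the test. The Python sketch below shows the intended parse of a "major:minor" string with minor 0 accepted; it is an illustration of the expected behaviour, not mdadm's code, and unlike the C version it accepts decimal only.

```python
import re


def get_maj_min(dev):
    """Parse "major:minor" into a tuple, or return None if malformed.

    Decimal only (the C version uses strtoul with base 0). A minor of 0
    is treated as valid.
    """
    match = re.fullmatch(r"([0-9]+):([0-9]+)", dev)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


print(get_maj_min("9:1048574"))
print(get_maj_min("9:0"))
print(get_maj_min("9:"))
```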


^ permalink raw reply related

* [RFC] Some fixes to allow for more than 128 md devices.
From: Robert LeBlanc @ 2016-08-20  2:10 UTC (permalink / raw)
  To: linux-raid; +Cc: dm-devel, robert

I'm stuck and need some help getting this across the finish line. This
is in no way complete, but it helps show what I'm working on.

When we added more than 128 md devices, we started getting failures.
Looking through the code, it seems that the minor dev number was being
stored in an int, causing overflow and wreaking havoc on everything.
I finally got the mknod in mdadm to correctly make the dev node with
minors up to 1048574 as expected in the mdadm code. However, I can
only create md devices up to 511. Trying to create an md higher than
that fails because the device can't be read/opened. strace reports:
open("/dev/.tmp.md.15341:9:1048574", O_RDWR|O_EXCL|O_DIRECT) = -1 ENXIO
(No such device or address)
while Python reports:
IOError: [Errno 6] No such device or address: '/dev/.tmp.md.3279:9:512'

A corresponding node is not created in /sys/block/md* for mds over 511.

I believe that there may be a bug in the kernel code that is now being
hit. After looking through the kernel code, I can't seem to find where
this might be. Please help me by either pointing me to the source
location where the problem might be or fixing it based on these
patches I've worked on so far. I'm using 4.7.0 currently.

I'm using this for testing:
./mdadm --create /dev/md1048574 --assume-clean --verbose --level=1 \
--raid-devices=2 /dev/loop0 missing

Yes, we have a real need for more than 128 and 512 md devices.

Please include me in any replies as I'm not on the ML.

Thank you.

Robert LeBlanc (1):
  Some fixes to allow for more than 128 md devices.

 Manage.c |  5 +++--
 lib.c    |  2 +-
 mdadm.h  |  6 +++---
 util.c   | 25 +++++++++++++------------
 4 files changed, 20 insertions(+), 18 deletions(-)

--
2.8.1

^ permalink raw reply

* Re: Rewrite md raid1 member
From: Chris Dunlop @ 2016-08-20  1:57 UTC (permalink / raw)
  To: NeilBrown; +Cc: Wols Lists, Brad Campbell, linux-raid
In-Reply-To: <87oa4oh2m4.fsf@notabene.neil.brown.name>

Hi Neil,

Nice work on the Bus1 article!

On Sat, Aug 20, 2016 at 07:26:27AM +1000, NeilBrown wrote:
> On Fri, Aug 19 2016, Chris Dunlop wrote:
>> In my case, I want it to write everything.
>>
>> If I do my 'dd' to write everything as previously described, with the window
>> of opportunity for stale data to end up on the written disk, one option
>> would be to run a scrub / repair to check the data is the same - but if I'm
>> unlucky with my dd and the data isn't the same for some sector[s], I want to
>> ensure the correct data is copied over the stale data and not the other way
>> around, e.g. to specify "in the event of a mismatch, use the data from sda
>> and overwrite the data on sdb".
>>
>> Unfortunately I don't know how that can be done.
>>
>> Does anyone know?
> 
> If it is the second device in the array (as listed by mdadm --detail)
> then you can stop the array and re-assemble with --update=resync.

That's nearly there - except in this specific case it's my root filesystem
so I can't stop the array without booting into a recovery disk etc. Of
course I could do that, but the point of the exercise is to see if it can
be done live, safely.

> If it is the first device I can only suggest that you
> fail the device and add it again:
> 
>  mdadm /dev/mdXX --fail /dev/sdYY
>  mdadm /dev/mdXX --remove /dev/sdYY
>  mdadm /dev/mdXX --add /dev/sdYY
> 
> If the "good" drive fails during the rewrite it might be a little bit
> fiddley getting the array working again, but all the data will certainly
> be there on the device you are re-writing, so you won't lose anything.

OK, that sounds good. What would the process be if the good drive fails,
either completely, or a few specific sectors?

Thanks,

Chris

* Re: Rewrite md raid1 member
From: Chris Dunlop @ 2016-08-20  1:43 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Wols Lists, Brad Campbell, Linux-RAID
In-Reply-To: <CAJCQCtQs+z6rso4pc7CgRSia4tM5KK=T8ckh93X9HiDHytkcrw@mail.gmail.com>

On Fri, Aug 19, 2016 at 10:10:23AM -0600, Chris Murphy wrote:
> On Fri, Aug 19, 2016 at 6:46 AM, Chris Dunlop <chris@onthe.net.au> wrote:
>> On Fri, Aug 19, 2016 at 12:52:21PM +0100, Wols Lists wrote:
>>> On 18/08/16 05:01, Chris Dunlop wrote:
>>>> I'm interested to see if there's a way of essentially doing the above on a
>>>> live system, assuming there's appropriate care taken to not trash any
>>>> existing data (including superblocks).
>>>>
>>>> I.e. is it *theoretically* possible to write the same data back to the whole
>>>> disk safely. E.g. using 'dd' from/to the same disk is almost there, but, as
>>>> described, there's a window of opportunity where you could get stale data on
>>>> the disk and a raid repair could then copy that stale data to the good disk.
[snip]
>> If I do my 'dd' to write everything as previously described, with the window
>> of opportunity for stale data to end up on the written disk, one option
>> would be to run a scrub / repair to check the data is the same - but if I'm
>> unlucky with my dd and the data isn't the same for some sector[s], I want to
>> ensure the correct data is copied over the stale data and not the other way
>> around, e.g. to specify "in the event of a mismatch, use the data from sda
>> and overwrite the data on sdb".
>>
>> Unfortunately I don't know how that can be done.
>>
>> Does anyone know?
> 
> Basically you want what Btrfs balance does, except simpler: rather
> than relocating extents into new allocation groups, you just want to
> read and rewrite everything as it is.

Sorry, I'm not familiar with btrfs at that level.

> You definitely can't do this with dd when md + mounted file system,
> that's inevitably going to result in the file system making changes
> after this operation has done a read, and therefore its write will
> clobber the file system's modifications. It'll be data loss at a
> minimum, and if it's file system metadata, it'll be worse in that
> it'll make the file system inconsistent.

I'm not convinced it's "inevitable" given the window between reading and
writing can be relatively small, and the filesystem would have to write
to those specific sectors during that window. But, yes, that's the
issue, there's certainly a chance of it happening.

> Further it's a problem overwriting good data, not accounting for the
> possibility of a crash or power failure.  You'd really want this
> operation to be CoW, so that the good data is effectively duplicated
> somewhere else and only once that operation is on stable media would
> it be pointed to, and the original data turned to free space.

It's raid-1, so I have good data at all times, on the disk I'm not
dd'ing to (sda). The problem is there may be stale data on the disk dd'ed
to (sdb) due to the window of opportunity described previously, i.e. dd
reads data A from sda:X (sector X), the system writes data B to md0:X
(i.e. to both sda:X and sdb:X), then dd writes stale data A to sdb:X,
putting the disks out of sync.
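
To make that window concrete, here is a minimal Python sketch of the
interleaving. It is a toy model only, not md code: the two dicts standing
in for sda and sdb, and the helper names md_write, dd_copy_sector and
mismatched_sectors, are made up for illustration.

```python
# Toy model of the dd race on a two-disk mirror (not real md behaviour).
# sda and sdb are dicts mapping sector number -> data.

def md_write(sda, sdb, sector, data):
    """A normal array write: md sends the data to both mirrors."""
    sda[sector] = data
    sdb[sector] = data

def dd_copy_sector(sda, sdb, sector, concurrent_write=None):
    """Copy one sector from sda to sdb, as 'dd' would.

    concurrent_write, if given, is data the filesystem writes to the
    array between dd's read and dd's write (the problem window).
    """
    buf = sda[sector]                      # dd reads data A from sda:X
    if concurrent_write is not None:
        md_write(sda, sdb, sector, concurrent_write)  # B goes to md0:X
    sdb[sector] = buf                      # dd writes (maybe stale) A to sdb:X

def mismatched_sectors(sda, sdb):
    """Sectors a scrub would report as differing between the mirrors."""
    return sorted(s for s in sda if sda[s] != sdb[s])

X = 7
sda = {X: "A"}
sdb = {X: "A"}

dd_copy_sector(sda, sdb, X)                        # quiet system: harmless
in_sync = mismatched_sectors(sda, sdb)

dd_copy_sector(sda, sdb, X, concurrent_write="B")  # write lands in the window
out_of_sync = mismatched_sectors(sda, sdb)

print(in_sync, out_of_sync, sda[X], sdb[X])
```

After the second copy sda holds B and sdb holds the stale A, which is
exactly the mismatch a later repair would have to resolve in favour of sda.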

In fact, the stale data problem is a larger problem than I first
thought: it's not only an issue when doing a repair (i.e. how to tell md
to use the data on the "good" disk in the event of discrepancies), but
also whilst the dd is underway: if you happen to issue a read to a
sector which has good data on one disk but stale data on the other, I
don't know if there's a way to ensure md reads the data on the "good"
disk.

So, in fact, I guess the facility I'm looking for is a "write only"
flag for that disk, until a repair can be done (assuming the repair also
honours the "write only" flag).

Oh hey, from linux/Documentation/md.txt:

  state
    A file recording the current state of the device in the array
    which can be a comma separated list of
      ...
      writemostly - device will only be subject to read requests if
                    there are no other options. This applies only to
                    raid1 arrays.

I think that's *almost* exactly what I need, but to be safe I think I
really want something like:

  writeonly - no reads will be issued to this drive. If reads can't
              be satisfied from other drives, the array will be failed.

Then again, I guess in the end what I'd really like is to be able to
flag a particular disk to md for "write repair", and tell md to repair.
Then md would read data from unflagged disks to write to the flagged
disk (that could work for parity raids as well as mirrors).

This has the advantage, like "mdadm --replace", that you retain
redundancy at all times whilst still writing to the entire disk. The
advantage over "mdadm --replace" would be that you don't require another
disk.

But, in the absence of sufficient time and kernel knowledge to add
"write repair" to md myself, I'm interested to see if it can be done at
the user level.

> I'm not really understanding the use case of why you'd want to do
> this. At a fundamental level it sounds like you don't trust the
> devices the data resides on. If that's true, then there are related
> concerns that aren't mitigated by this rewrite feature alone.

My immediate use case is to try to clear the "pending sector" count by
writing to every sector on the disk. The pending sector count indicates
"something" went wrong at some point: it could be a permanent error
(e.g. disk surface is dodgy) or a soft error (e.g. a power supply droop
during a write). I.e. it may or may not indicate the disk itself is
going bad. If the count clears (either by confirming the sector is
good, or reallocating if the sector is really rubbish), I have a
confirmed good disk and life goes on. If something turns up during the
write attempt, I know the disk is bad and I can schedule a replacement.

As stated at the beginning, I know the safest way to do this is to add
in another disk, do a 'mdadm --replace', and then remove the suspect
disk and play with it as much as I like.

As a matter of interest I'm looking to see if there's a safe way of
doing it whilst the disk is online and live. Safe, that is, in that the
data is as safe as it would be on a normally functioning array, *if*
everything is done correctly.

So it's a "hey, it would be good if this can be done" issue rather than
a "help me, I'm afraid I might lose some data!" problem.

Cheers,

Chris

* [PATCH 2/2] r5cache: remove journal support
From: Song Liu @ 2016-08-19 22:34 UTC (permalink / raw)
  To: linux-raid; +Cc: shli, Song Liu
In-Reply-To: <1471646042-2366825-1-git-send-email-songliubraving@fb.com>

In the current r5cache, when the journal device is broken, the raid
array is forced into read-only mode. There is no way to remove the
"journal feature" and thus make the array read-write without a
journal.

This patch provides a sysfs entry, r5c_cache_mode, that can be used
to remove the journal feature.

r5c_cache_mode has 4 different values:
* no-cache;
* write-through (write journal only);
* write-back (w/ write cache feature, which will be added soon);
* broken-cache (journal missing or Faulty)

By writing to r5c_cache_mode, the array can transition from
broken-cache to no-cache, which removes the journal feature from the
array.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/raid5-cache.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/md/raid5.c       |  5 ++++
 drivers/md/raid5.h       |  6 +++++
 3 files changed, 75 insertions(+)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 5504ce2..508d470 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -40,6 +40,16 @@
  */
 #define R5L_POOL_SIZE	4
 
+enum r5c_cache_mode {
+	R5C_MODE_NO_CACHE = 0,
+	R5C_MODE_WRITE_THROUGH = 1,
+	R5C_MODE_WRITE_BACK = 2,
+	R5C_MODE_BROKEN_CACHE = 3,
+};
+
+static char *r5c_cache_mode_str[] = {"no-cache", "write-through",
+				     "write-back", "broken-cache"};
+
 struct r5l_log {
 	struct md_rdev *rdev;
 
@@ -97,6 +107,8 @@ struct r5l_log {
 
 	bool need_cache_flush;
 	bool in_teardown;
+
+	enum r5c_cache_mode cache_mode;
 };
 
 /*
@@ -1193,6 +1205,56 @@ ioerr:
 	return ret;
 }
 
+ssize_t r5c_show_cache_mode(struct mddev *mddev, char *page)
+{
+	struct r5conf *conf = mddev->private;
+	int val = 0;
+	int ret = 0;
+
+	if (conf->log)
+		val = conf->log->cache_mode;
+	else if (test_bit(MD_HAS_JOURNAL, &mddev->flags))
+		val = R5C_MODE_BROKEN_CACHE;
+	ret += snprintf(page, PAGE_SIZE - ret, "%d: %s\n",
+			val, r5c_cache_mode_str[val]);
+	return ret;
+}
+
+ssize_t r5c_store_cache_mode(struct mddev *mddev, const char *page, size_t len)
+{
+	struct r5conf *conf = mddev->private;
+	struct r5l_log *log = conf->log;
+	int val;
+
+	if (kstrtoint(page, 10, &val))
+		return -EINVAL;
+	if (!log && val != R5C_MODE_NO_CACHE)
+		return -EINVAL;
+	/* currently only support write through (write journal) */
+	if (val < R5C_MODE_NO_CACHE || val > R5C_MODE_WRITE_THROUGH)
+		return -EINVAL;
+	if (val == R5C_MODE_NO_CACHE) {
+		if (conf->log &&
+		    !test_bit(Faulty, &log->rdev->flags)) {
+			pr_err("md/raid:%s: journal device is in use, cannot remove it\n",
+			       mdname(mddev));
+			return -EINVAL;
+		}
+	}
+
+	spin_lock_irq(&conf->device_lock);
+	if (log)
+		conf->log->cache_mode = val;
+	if (val == R5C_MODE_NO_CACHE) {
+		clear_bit(MD_HAS_JOURNAL, &mddev->flags);
+		set_bit(MD_UPDATE_SB_FLAGS, &mddev->flags);
+	}
+	spin_unlock_irq(&conf->device_lock);
+	pr_info("%s: setting r5c cache mode to %d: %s\n",
+		       mdname(mddev), val, r5c_cache_mode_str[val]);
+	return len;
+}
+
 int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
 {
 	struct request_queue *q = bdev_get_queue(rdev->bdev);
@@ -1246,6 +1308,8 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
 	INIT_LIST_HEAD(&log->no_space_stripes);
 	spin_lock_init(&log->no_space_stripes_lock);
 
+	log->cache_mode = R5C_MODE_WRITE_THROUGH;
+
 	if (r5l_load_log(log))
 		goto error;
 
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 2119e09..665d853 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -6230,6 +6230,10 @@ raid5_group_thread_cnt = __ATTR(group_thread_cnt, S_IRUGO | S_IWUSR,
 				raid5_show_group_thread_cnt,
 				raid5_store_group_thread_cnt);
 
+static struct md_sysfs_entry
+r5c_cache_mode = __ATTR(r5c_cache_mode, S_IRUGO | S_IWUSR,
+			r5c_show_cache_mode, r5c_store_cache_mode);
+
 static struct attribute *raid5_attrs[] =  {
 	&raid5_stripecache_size.attr,
 	&raid5_stripecache_active.attr,
@@ -6237,6 +6241,7 @@ static struct attribute *raid5_attrs[] =  {
 	&raid5_group_thread_cnt.attr,
 	&raid5_skip_copy.attr,
 	&raid5_rmw_level.attr,
+	&r5c_cache_mode.attr,
 	NULL,
 };
 static struct attribute_group raid5_attrs_group = {
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 517d4b6..ace9675 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -635,4 +635,10 @@ extern void r5l_stripe_write_finished(struct stripe_head *sh);
 extern int r5l_handle_flush_request(struct r5l_log *log, struct bio *bio);
 extern void r5l_quiesce(struct r5l_log *log, int state);
 extern bool r5l_log_disk_error(struct r5conf *conf);
+
+
+extern ssize_t r5c_show_cache_mode(struct mddev *mddev, char *page);
+extern ssize_t
+r5c_store_cache_mode(struct mddev *mddev, const char *page, size_t len);
+
 #endif
-- 
2.8.0.rc2
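
The acceptance checks in r5c_store_cache_mode() above can be restated as a
small Python model. This is only one reading of the patch as posted, not
kernel code: the function name store_cache_mode and the arguments has_log
(conf->log is set) and journal_faulty (the Faulty bit on the journal rdev)
are names invented for this sketch.

```python
# Python model of the checks in r5c_store_cache_mode() from the patch.
# Mode numbers follow enum r5c_cache_mode in the diff.
NO_CACHE, WRITE_THROUGH, WRITE_BACK, BROKEN_CACHE = range(4)
MODE_STR = ["no-cache", "write-through", "write-back", "broken-cache"]

def store_cache_mode(val, has_log, journal_faulty):
    """Return True if writing `val` is accepted, False where the patch
    returns -EINVAL."""
    # Without a log, the only mode that can be requested is no-cache.
    if not has_log and val != NO_CACHE:
        return False
    # Only no-cache and write-through are supported for now.
    if val < NO_CACHE or val > WRITE_THROUGH:
        return False
    # A journal that is present and not Faulty is in use and stays.
    if val == NO_CACHE and has_log and not journal_faulty:
        return False
    return True

for val in range(4):
    for has_log in (False, True):
        ok = store_cache_mode(val, has_log, journal_faulty=False)
        print(MODE_STR[val], "log" if has_log else "no log", ok)
```

Under this reading, an array assembled without its journal (no log) accepts
a write of 0, which is the broken-cache to no-cache transition described in
the commit message.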


* [PATCH 1/2] r5cache: set MD_JOURNAL_CLEAN correctly
From: Song Liu @ 2016-08-19 22:34 UTC (permalink / raw)
  To: linux-raid; +Cc: shli, Song Liu

Currently, the code sets MD_JOURNAL_CLEAN when the array has
MD_FEATURE_JOURNAL and the recovery_cp is MaxSector. The array
will be MD_JOURNAL_CLEAN even if the journal device is missing.

With this patch, MD_JOURNAL_CLEAN is only set when the journal
device is present.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/md.c    |  5 +----
 drivers/md/raid5.c | 13 ++++++++-----
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 51bc35f..46917f7 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1604,11 +1604,8 @@ static int super_1_validate(struct mddev *mddev, struct md_rdev *rdev)
 			mddev->new_chunk_sectors = mddev->chunk_sectors;
 		}
 
-		if (le32_to_cpu(sb->feature_map) & MD_FEATURE_JOURNAL) {
+		if (le32_to_cpu(sb->feature_map) & MD_FEATURE_JOURNAL)
 			set_bit(MD_HAS_JOURNAL, &mddev->flags);
-			if (mddev->recovery_cp == MaxSector)
-				set_bit(MD_JOURNAL_CLEAN, &mddev->flags);
-		}
 	} else if (mddev->pers == NULL) {
 		/* Insist of good event counter while assembling, except for
 		 * spares (which don't need an event count) */
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 4f8f524..2119e09 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -6840,11 +6840,14 @@ static int raid5_run(struct mddev *mddev)
 	if (IS_ERR(conf))
 		return PTR_ERR(conf);
 
-	if (test_bit(MD_HAS_JOURNAL, &mddev->flags) && !journal_dev) {
-		printk(KERN_ERR "md/raid:%s: journal disk is missing, force array readonly\n",
-		       mdname(mddev));
-		mddev->ro = 1;
-		set_disk_ro(mddev->gendisk, 1);
+	if (test_bit(MD_HAS_JOURNAL, &mddev->flags)) {
+		if (!journal_dev) {
+			pr_err("md/raid:%s: journal disk is missing, force array readonly\n",
+			       mdname(mddev));
+			mddev->ro = 1;
+			set_disk_ro(mddev->gendisk, 1);
+		} else if (mddev->recovery_cp == MaxSector)
+			set_bit(MD_JOURNAL_CLEAN, &mddev->flags);
 	}
 
 	conf->min_offset_diff = min_offset_diff;
-- 
2.8.0.rc2
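
Read as a truth table, the change amounts to the following. This is a
Python restatement of the two hunks above, not kernel code; the function
and argument names are invented here, and recovery_done stands for
"mddev->recovery_cp == MaxSector".

```python
from itertools import product

def journal_clean_before(feature_journal, recovery_done, journal_dev_present):
    # Old super_1_validate(): the journal device is never consulted.
    return feature_journal and recovery_done

def journal_clean_after(feature_journal, recovery_done, journal_dev_present):
    # New raid5_run(): the flag is set only on the branch where a
    # journal device was found.
    return feature_journal and journal_dev_present and recovery_done

# List the input combinations where the two versions disagree.
changed = [
    combo
    for combo in product((False, True), repeat=3)
    if journal_clean_before(*combo) != journal_clean_after(*combo)
]
print(changed)
```

The only combination that changes is the one the commit message describes:
the journal feature is set, recovery is complete, and the journal device is
missing.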


* Re: Rewrite md raid1 member
From: NeilBrown @ 2016-08-19 21:26 UTC (permalink / raw)
  To: Chris Dunlop, Wols Lists; +Cc: Brad Campbell, linux-raid
In-Reply-To: <20160819124615.GA23247@onthe.net.au>

On Fri, Aug 19 2016, Chris Dunlop wrote:

>
> In my case, I want it to write everything.
>
> If I do my 'dd' to write everything as previously described, with the window
> of opportunity for stale data to end up on the written disk, one option
> would be to run a scrub / repair to check the data is the same - but if I'm
> unlucky with my dd and the data isn't the same for some sector[s], I want to
> ensure the correct data is copied over the stale data and not the other way
> around, e.g. to specify "in the event of a mismatch, use the data from sda
> and overwrite the data on sdb".
>
> Unfortunately I don't know how that can be done.
>
> Does anyone know?

If it is the second device in the array (as listed by mdadm --detail)
then you can stop the array and re-assemble with --update=resync.

If it is the first device I can only suggest that you
fail the device and add it again:

 mdadm /dev/mdXX --fail /dev/sdYY
 mdadm /dev/mdXX --remove /dev/sdYY
 mdadm /dev/mdXX --add /dev/sdYY

If the "good" drive fails during the rewrite it might be a little bit
fiddley getting the array working again, but all the data will certainly
be there on the device you are re-writing, so you won't lose anything.

NeilBrown

* Assistance Reviewing Proposed Recovery Measures
From: Chris Maxwell @ 2016-08-19 20:18 UTC (permalink / raw)
  To: linux-raid

I am looking for some advice rather than blundering around making things worse, as 
recommended by https://raid.wiki.kernel.org/index.php/RAID_Recovery

I have been asked to step in on a situation with a failing disk(s) in a RAID setup.
I cannot really explain why it was set up in the manner below, but one would have 
expected it would mean the system is far more recoverable than it is proving to be.

As far as I can tell, I have a hardware RAID array running on a 3Ware controller, 
that has 2 mirrors, each with 2 disks.  Then on top of this is a Linux Software 
RAID disk, and then LVM built on top of that.  This is laid out again in slightly 
more detail below.  (However, it appears to me that the hardware RAID would be a 
red herring, at least as a recovery mechanism, since even the remaining non-failed 
disk in sdc is getting errors.  So let us consider this as a pure software Linux RAID issue.)

There was an entry in the log that one side of md0 was failing.   Unfortunately, the sysadmin 
called it wrong as to which side was failing, and removed sdb1 (the good disk) from the md0 instead of sdc1 (the bad disk).
So how do we come back from that?  sdb1 seems fine, I just need to figure out how to 
make it so sdb1 is the only device in the md0 mirror, and then reconstruct the 
LVM structure.

I started with “copy the data off the running disk (the surviving sdc1 mirror)”, but am getting errors, so 
I figured it would be better to get it from the good mirror copy.

Generally, s/w mirror and LVM just work, so I have not really delved closely into 
what should and should not work, and the header blocks on the various devices involved.
Therefore, I don’t really know whether it would be strange to just use sdb1 as an LVM PV disk and 
build up my LVM structure from that.

I have a bash_history file to list out all the commands that were run (Figure 3).

Versions:
     * Linux private2 2.6.18-308.1.1.el5 #1 SMP Wed Mar 7 04:16:51 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
     * mdadm - v2.6.9 - 10th March 2009

But I figured that something along these lines is what is needed:
mdadm  --assemble --force /dev/md0 /dev/sdb1 /dev/sdc1
mdadm  --assemble --force /dev/md1 /dev/sdb1 missing

However, this fails.

https://raid.wiki.kernel.org/index.php/RAID_Recovery suggests one can create rather than assemble a sw raid. 
(Not sure whether to call it md0 or md1)
mdadm --create /dev/md0 --metadata=0.90 --raid-devices=2 --level=raid1 /dev/sdb1 missing

However, it also says

> Recreating should be considered a *last* resort, only to be used when everything else fails. People getting this wrong is one of the primary reasons people lose data. It is very commonly used way too early in the fault finding process. You have been warned! It's better to send an email to the linux-raid mailing list with detailed information (mdadm --examine from all component drives plus log entries from when the failure happened, including mdadm and kernel version) and ask for advice than to try to use --create --assume-clean and getting it wrong.

So before I messed things up even worse, I figured I would consult the experts to see what my next 
step should be.

==========================================================================
Figure 1: Detailed description of the storage structure:

I have determined that the disk layout is as follows:
/dev/sda = a SSD

The machine has a 3ware Hardware RAID controller which is showing sdb and sdc as disks.  (Unit 0 and Unit 1).
Unit 0 (sdb) is made up of
Phy 0: WD WCAW35791262
Phy 1:  Seagate 9QJ7N744

Unit 1(sdc) is made up of
Phy 2: Seagate 9QJ7F3PJ and
Phy 3: Seagate 9QJ7R3Y1

These are then combined into mirror md0 made of sdb1 and sdc1
This is the physical volume for LVM VG lvm-raid, which then has LV inside:
lvmdata1 and gokcen

==========================================================================
Figure 2: mdadm --examine of /dev/sdb1 and sdc1:

# mdadm --examine /dev/sd[bc]1 >> raid.status.latest

/dev/sdb1:
           Magic : a92b4efc
         Version : 0.90.00
            UUID : fdd98007:78663948:0760cb1c:ce437c35
   Creation Time : Mon Oct 18 10:54:29 2010
      Raid Level : raid1
   Used Dev Size : 976551040 (931.31 GiB 999.99 GB)
      Array Size : 976551040 (931.31 GiB 999.99 GB)
    Raid Devices : 2
   Total Devices : 2
Preferred Minor : 0

     Update Time : Thu Aug  4 12:11:38 2016
           State : clean
  Active Devices : 1
Working Devices : 2
  Failed Devices : 0
   Spare Devices : 1
        Checksum : d3a40069 - correct
          Events : 858

       Number   Major   Minor   RaidDevice State
this     2       8       17        2      spare   /dev/sdb1

    0     0       0        0        0      removed
    1     1       8       33        1      active sync   /dev/sdc1
    2     2       8       17        2      spare   /dev/sdb1
/dev/sdc1:
           Magic : a92b4efc
         Version : 0.90.00
            UUID : fdd98007:78663948:0760cb1c:ce437c35
   Creation Time : Mon Oct 18 10:54:29 2010
      Raid Level : raid1
   Used Dev Size : 976551040 (931.31 GiB 999.99 GB)
      Array Size : 976551040 (931.31 GiB 999.99 GB)
    Raid Devices : 2
   Total Devices : 2
Preferred Minor : 0

     Update Time : Thu Aug  4 12:11:38 2016
           State : clean
  Active Devices : 1
Working Devices : 2
  Failed Devices : 0
   Spare Devices : 1
        Checksum : d3a4007d - correct
          Events : 858

       Number   Major   Minor   RaidDevice State
this     1       8       33        1      active sync   /dev/sdc1

    0     0       0        0        0      removed
    1     1       8       33        1      active sync   /dev/sdc1
    2     2       8       17        2      spare   /dev/sdb1
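
For comparing the two superblocks at a glance, a small Python sketch like
the one below pulls out the event count and the "this" role per device.
It is written only against the 0.90-metadata layout shown in Figure 2; the
function name parse_examine and the trimmed SAMPLE text are made up for the
example, and other mdadm versions or metadata formats may print different
fields.

```python
import re

def parse_examine(text):
    """Parse `mdadm --examine` output laid out as in Figure 2 into
    {device: {"events": int, "role": str}}."""
    result = {}
    current = None
    for line in text.splitlines():
        header = re.match(r"^(/dev/\S+):\s*$", line)
        if header:
            current = header.group(1)
            result[current] = {}
            continue
        if current is None:
            continue
        events = re.match(r"^\s*Events\s*:\s*(\d+)", line)
        if events:
            result[current]["events"] = int(events.group(1))
        # "this  <number> <major> <minor> <raiddevice>  <state>  <device>"
        this = re.match(
            r"^this\s+\d+\s+\d+\s+\d+\s+\d+\s+(.*?)\s+/dev/\S+\s*$", line
        )
        if this:
            result[current]["role"] = this.group(1)
    return result

# Lines copied from Figure 2, trimmed to the fields being parsed.
SAMPLE = """\
/dev/sdb1:
          Events : 858
this     2       8       17        2      spare   /dev/sdb1
/dev/sdc1:
          Events : 858
this     1       8       33        1      active sync   /dev/sdc1
"""

info = parse_examine(SAMPLE)
for dev, fields in sorted(info.items()):
    print(dev, fields["events"], fields["role"])
```

On the output above, both members report the same event count (858), but
sdb1 records itself as a spare while sdc1 is the only active member.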

=====================================================================
Figure 3: bash_history of what I have done:

This is the history of the commands I have tried, but it does not really say what the 
reported results were:

mdadm list
mdadm /dev/md0 status
mdadm --detail /dev/md0
df -k
mount -o ro /dev/md0 /mnt/data
pvdisplay
vgdisplay
pvdisplay
vgdisplay
lvdisplay
pvdisplay
lvdisplay
umount /mnt/data
mount -o ro /dev/lvm-raid/gokcen /mnt/data
df -k
mdadm --detail /dev/md0
fdisk -l /dev/sdb
blkid /dev/sdb
lsscsi 
dmesg | less
lsscsi 
mdadm /dev/md0  --add /dev/sdb1
mdadm --detail /dev/md0
umount /mnt/data
mdadm /dev/md0 --fail /dev/sbd1
mdadm /dev/md0 --fail /dev/sdb1
mdadm --detail /dev/md0
mdadm /dev/md0 --remove /dev/sdb1
pvdisplay
fdisk -l /dev/sdb
dd if=/dev/sdb1 bs=1024 count=10 | less
pvdisplay
less /proc/mdstat 
mdadm /dev/md0 --assume-clean --re-add /dev/sdb1
mdadm /dev/md0  --re-add --assume-clean /dev/sdb1
mdadm --stop /dev/md0
vgdisplay
vgdisplay -v
vgdisplay -v | less
df -k
mkdir RaidFail
cd RaidFail/
mdadm --examine /dev/sd[bc]1 >> raid.status
less raid.status 
dd if=/dev/sdb1 bs=1024 count=10 | less
dd if=/dev/sdb1 bs=1024 count=1 | less
dd if=/dev/sdb1 bs=1024 count=1 of=sdb1.1024
dd if=/dev/sdc1 bs=1024 count=1 of=sdc1.1024 &
less sdc1.1024 
mdadm --detail /dev/md0
dmesg | less
mount -o ro /mnt/data
ls /mnt/data
df -k
df -h
mdadm --examine /dev/sd[bc]1 | egrep 'Event|dev/sd'
mdadm --assemble --force /dev/md0 /dev/sdb1 /dev/sdc1
lvdisplay | less
less raid.status 
lsscsi 
dmidecode | grep sdb
dmidecode | grep disk
dmidecode | grep -i disk
dmidecode 
dmidecode  | less
lsscsi 
dmidecode  | less
lsscsi 
lsscsi -l
lshw -class disk | less
hdparm -I /dev/sdb | less
hdparm -I /dev/sd 
hdparm -I /dev/sd*
hdparm -I /dev/sd? 2>&1 | less
smartctl -i /dev/sdb | less
smartctl -i -d 3ware,0 /dev/sdb | less
smartctl -i -d 3ware,1 /dev/sdb | less
lshw -class disk | less
df -k
umount /mnt/data
shutdown -h 10:36 "Try to reassemble RAID mirror /dev/data"
mdadm --detail /dev/md0
mdadm /dev/md0 --stop
mdadm /dev/md0 --assemble --force /dev/sdb1 /dev/sdc1
mdadm  --assemble --force /dev/md0 /dev/sdb1 /dev/sdc1
mdadm --stop /dev/md0
df -k
mdadm  --re-add  /dev/md0 /dev/sdb1
cat /proc/mdstat 
cat /etc/fstab
mkdir /mnt/data /mnt/gokcen
mount /mnt/data
lvdisplay
lvchange -an /dev/lvm-raid/gokcen 
lvchange -an /dev/lvm-raid/lvmdata1 
df -k
cd /mnt
cd data/
ls
tar -czf /dev/null .
mount
umount /mnt/data
cd 
umount /mnt/data
mount
lvchange -an /dev/lvm-raid/lvmdata1 
vgdisplay 
vgchange -an lvm-raid
vgdisplay 
man vgchange
dmsetup ls
pvdisplay
mdadm --stop /dev/md0
man mdadm
mdadm --examine /dev/sdb1
mdadm --examine /dev/sdc1
mdadm  --assemble --force /dev/md0 /dev/sdb1 /dev/sdc1
mdadm --examine /dev/sdb1
mdadm --examine /dev/sdc1
mdadm --examine /dev/sdb1
mdadm  --assemble --force /dev/md0 /dev/sdb1 
mdadm --stop /dev/md0
mdadm  --assemble --force /dev/md0 /dev/sdb1 
mdadm --detail /dev/md0
mdadm --examine /dev/sdb1
mdadm --remove /dev/md0 /dev/sdb1
man mdadm 
man mdadm 
less /etc/mdadm.conf 
vi /etc/mdadm.conf 
cd /etc/
mv -iv mdadm.conf mdadm.conf.old
cp mdadm.conf.old mdadm.conf
vi mdadm.conf
vi mdadm.conf
mdadm --run /dev/md0
vi mdadm.conf
mdadm --run /dev/md0
mdadm --detail /dev/md0
mdadm  --assemble --force /dev/md0 /dev/sdb1 
mdadm --examine /dev/sdb1
mdadm --examine /dev/sdc1
mdadm --examine /dev/sdb1
mdadm  --assemble --force /dev/md1 /dev/sdb1 
mdadm  --assemble --force /dev/md1 /dev/sdb1 missing
cat /proc/mdstat 
man pvscan
man pvcreate
man pvdisplay
pvdisplay /dev/sdb1
pvdisplay /dev/sdc1
ls -l /dev/md0
dmraid --raid_devices
pvdisplay 
mdadm  --assemble --force --run /dev/md1 /dev/sdb1 
mdadm  --assemble --force --run /dev/md0 /dev/sdb1 
mdadm --examine /dev/sd[bc]1 >> raid.status.latest
cat /proc/mdstat 
mdadm --detail /dev/md0
/sbin/mdadm --detail /dev/md0

==================================================
Figure 4: Sampling of messages from /var/log/messages

Jul 24 04:22:01 private2 kernel: md: syncing RAID array md0
Jul 24 04:22:01 private2 kernel: md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
Jul 24 04:22:01 private2 kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reconstruction.
Jul 24 04:22:01 private2 kernel: md: using 128k window, over a total of 976551040 blocks.
Jul 24 04:26:05 private2 kernel: 3w-9xxx: scsi6: ERROR: (0x03:0x0202): Drive ECC error:port=2.
Jul 24 04:26:05 private2 kernel: sd 6:0:1:0: Unhandled sense code
Jul 24 04:26:05 private2 kernel: sd 6:0:1:0: SCSI error: return code = 0x08000004
Jul 24 04:26:05 private2 kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
Jul 24 04:26:05 private2 kernel: sdc: Current: sense key: Medium Error
Jul 24 04:26:05 private2 kernel:     Add. Sense: Unrecovered read error
Jul 24 04:26:05 private2 kernel: 
Jul 24 04:26:31 private2 kernel: 3w-9xxx: scsi6: AEN: ERROR (0x04:0x0009): Drive timeout detected:port=2.
Jul 24 04:26:35 private2 kernel: 3w-9xxx: scsi6: AEN: ERROR (0x04:0x0009): Drive timeout detected:port=2.
Jul 24 04:26:51 private2 kernel: 3w-9xxx: scsi6: ERROR: (0x03:0x0204): Drive timeout:port=2.
Jul 24 04:26:51 private2 kernel: sd 6:0:1:0: Unhandled sense code
Jul 24 04:26:51 private2 kernel: sd 6:0:1:0: SCSI error: return code = 0x08000004
Jul 24 04:26:51 private2 kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
Jul 24 04:26:51 private2 kernel: sdc: Current: sense key: Hardware Error
Jul 24 04:26:51 private2 kernel:     Add. Sense: Logical unit communication time-out
[….]
Jul 24 05:34:31 private2 kernel: 3w-9xxx: scsi6: ERROR: (0x03:0x0202): Drive ECC error:port=3.
Jul 24 05:34:31 private2 kernel: sd 6:0:1:0: Unhandled sense code
Jul 24 05:34:31 private2 kernel: sd 6:0:1:0: SCSI error: return code = 0x08000004
Jul 24 05:34:31 private2 kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
Jul 24 05:34:31 private2 kernel: sdc: Current: sense key: Medium Error
Jul 24 05:34:31 private2 kernel:     Add. Sense: Unrecovered read error
Jul 24 05:34:31 private2 kernel: 
Jul 24 05:36:17 private2 kernel: sd 6:0:1:0: WARNING: (0x06:0x002C): Command (0x28) timed out, resetting card.
Jul 24 05:37:18 private2 kernel: 3w-9xxx: scsi6: ERROR: (0x06:0x001F): Microcontroller not ready during reset sequence.
Jul 24 05:38:20 private2 kernel: 3w-9xxx: scsi6: ERROR: (0x06:0x001F): Microcontroller not ready during reset sequence.
Jul 24 05:38:20 private2 kernel: 3w-9xxx: scsi6: ERROR: (0x06:0x002B): Controller reset failed during scsi host reset.
Jul 24 05:38:20 private2 kernel: sd 6:0:1:0: scsi: Device offlined - not ready after error recovery
Jul 24 05:38:20 private2 last message repeated 17 times
Jul 24 05:38:20 private2 kernel: sd 6:0:0:0: scsi: Device offlined - not ready after error recovery
Jul 24 05:38:20 private2 kernel: sd 6:0:1:0: scsi: Device offlined - not ready after error recovery
Jul 24 05:38:20 private2 kernel: sd 6:0:0:0: scsi: Device offlined - not ready after error recovery
Jul 24 05:38:20 private2 kernel: sd 6:0:1:0: scsi: Device offlined - not ready after error recovery
[….]
Jul 24 05:38:20 private2 kernel: sd 6:0:1:0: rejecting I/O to offline device
Jul 24 05:38:20 private2 kernel: sd 6:0:0:0: rejecting I/O to offline device
Jul 24 05:38:20 private2 last message repeated 4 times
Jul 24 05:38:20 private2 kernel: sd 6:0:1:0: rejecting I/O to offline device
Jul 24 05:38:20 private2 last message repeated 13 times
Jul 24 05:38:20 private2 kernel: sd 6:0:0:0: rejecting I/O to offline device
Jul 24 05:38:20 private2 last message repeated 5 times
Jul 24 05:38:20 private2 kernel: RAID1 conf printout:
Jul 24 05:38:20 private2 kernel:  --- wd:1 rd:2
Jul 24 05:38:20 private2 kernel:  disk 0, wo:0, o:1, dev:sdb1
Jul 24 05:38:20 private2 kernel:  disk 1, wo:1, o:0, dev:sdc1
Jul 24 05:38:20 private2 kernel: RAID1 conf printout:
Jul 24 05:38:20 private2 kernel:  --- wd:1 rd:2
Jul 24 05:38:20 private2 kernel:  disk 0, wo:0, o:1, dev:sdb1
Jul 25 10:10:46 private2 kernel: sd 6:0:0:0: rejecting I/O to offline device
Jul 25 10:10:46 private2 last message repeated 3 times
Jul 25 10:10:46 private2 kernel: Aborting journal on device dm-2.
Jul 25 10:10:46 private2 kernel: sd 6:0:0:0: rejecting I/O to offline device
Jul 25 10:10:46 private2 kernel: Buffer I/O error on device dm-2, logical block 1545
Jul 25 10:10:46 private2 kernel: lost page write due to I/O error on dm-2
Jul 25 10:10:46 private2 kernel: sd 6:0:0:0: rejecting I/O to offline device
Jul 25 10:10:46 private2 last message repeated 2 times
Jul 25 10:10:47 private2 kernel: ext3_abort called.
[….]
Jul 28 17:51:28 private2 kernel: sd 6:0:0:0: rejecting I/O to offline device
Jul 28 17:51:28 private2 kernel: EXT3-fs error (device dm-2): ext3_get_inode_loc: unable to read inode block - inode=2457601, block=4915202
Jul 28 17:51:28 private2 kernel: sd 6:0:0:0: rejecting I/O to offline device
Jul 28 17:51:28 private2 kernel: EXT3-fs error (device dm-2): ext3_get_inode_loc: unable to read inode block - inode=2457606, block=4915202
Jul 28 17:51:28 private2 kernel: sd 6:0:0:0: rejecting I/O to offline device
Jul 28 17:51:28 private2 kernel: EXT3-fs error (device dm-2): ext3_get_inode_loc: unable to read inode block - inode=2457604, block=4915202
[….]

Aug  4 11:42:06 private2 smartd[6062]: Problem creating device name scan list 
Aug  4 11:42:06 private2 smartd[6062]: Device: /dev/sda, opened 
Aug  4 11:42:06 private2 smartd[6062]: Device /dev/sda: using '-d sat' for ATA disk behind SAT layer. 
Aug  4 11:42:06 private2 smartd[6062]: Device: /dev/sda, opened 
Aug  4 11:42:06 private2 smartd[6062]: Device: /dev/sda, not found in smartd database. 
Aug  4 11:42:06 private2 smartd[6062]: Device: /dev/sda, can't monitor Current Pending Sector count - no Attribute 197 
Aug  4 11:42:06 private2 smartd[6062]: Device: /dev/sda, can't monitor Offline Uncorrectable Sector count  - no Attribute 198 
Aug  4 11:42:06 private2 smartd[6062]: Device: /dev/sda, is SMART capable. Adding to "monitor" list. 
Aug  4 11:42:06 private2 smartd[6062]: Device: /dev/sdb, opened 
Aug  4 11:42:06 private2 smartd[6062]: Device /dev/sdb, please try adding '-d 3ware,N' 
Aug  4 11:42:06 private2 smartd[6062]: Device /dev/sdb, you may need to replace /dev/sdb with /dev/twaN or /dev/tweN 
Aug  4 11:42:06 private2 smartd[6062]: Device: /dev/sdc, opened 
Aug  4 11:42:06 private2 smartd[6062]: Device /dev/sdc, please try adding '-d 3ware,N' 
Aug  4 11:42:06 private2 smartd[6062]: Device /dev/sdc, you may need to replace /dev/sdc with /dev/twaN or /dev/tweN 
Aug  4 11:42:06 private2 smartd[6062]: Monitoring 0 ATA and 1 SCSI devices 
Aug  4 11:42:06 private2 smartd[6064]: smartd has fork()ed into background mode. New PID=6064. 
Aug  4 11:42:07 private2 avahi-daemon[6035]: Server startup complete. Host name is private2.local. Local service cookie is 3477185208.
Aug  4 11:43:55 private2 kernel: 3w-9xxx: scsi6: AEN: ERROR (0x04:0x0026): Drive ECC error reported:port=3, unit=1.
Aug  4 11:43:55 private2 kernel: 3w-9xxx: scsi6: AEN: ERROR (0x04:0x002D): Source drive error occurred:port=3, unit=1.
Aug  4 11:43:55 private2 kernel: 3w-9xxx: scsi6: AEN: ERROR (0x04:0x0004): Rebuild failed:unit=1.
Aug  4 11:43:55 private2 kernel: 3w-9xxx: scsi6: AEN: INFO (0x04:0x003B): Rebuild paused:unit=1.
Aug  4 11:54:07 private2 kernel: md: md0 still in use.
Aug  4 11:55:34 private2 kernel: md: bind<sdb1>
Aug  4 11:55:34 private2 kernel: RAID1 conf printout:
Aug  4 11:55:34 private2 kernel:  --- wd:1 rd:2
Aug  4 11:55:34 private2 kernel:  disk 0, wo:1, o:1, dev:sdb1
Aug  4 11:55:34 private2 kernel:  disk 1, wo:0, o:1, dev:sdc1
Aug  4 11:55:34 private2 kernel: md: syncing RAID array md0
Aug  4 11:55:34 private2 kernel: md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
Aug  4 11:55:34 private2 kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reconstruction.
Aug  4 11:55:34 private2 kernel: md: using 128k window, over a total of 976551040 blocks.
Aug  4 11:56:37 private2 kernel: sd 6:0:1:0: WARNING: (0x06:0x002C): Command (0x28) timed out, resetting card.
Aug  4 11:56:51 private2 kernel: 3w-9xxx: scsi6: AEN: ERROR (0x04:0x000A): Drive error detected:unit=1, port=3.
Aug  4 11:56:51 private2 kernel: 3w-9xxx: scsi6: AEN: INFO (0x04:0x005E): Cache synchronization completed:unit=0.
Aug  4 11:56:51 private2 kernel: 3w-9xxx: scsi6: AEN: INFO (0x04:0x005E): Cache synchronization completed:unit=1.
Aug  4 11:57:10 private2 kernel: 3w-9xxx: scsi6: ERROR: (0x03:0x0202): Drive ECC error:port=3.
Aug  4 11:57:10 private2 kernel: sd 6:0:1:0: Unhandled sense code
Aug  4 11:57:10 private2 kernel: sd 6:0:1:0: SCSI error: return code = 0x08000004
Aug  4 11:57:10 private2 kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
[…]
Aug  4 11:58:40 private2 kernel: raid1: sdc: unrecoverable I/O read error for block 388480
Aug  4 11:58:40 private2 kernel: md: md0: sync done.
Aug  4 11:58:41 private2 kernel: RAID1 conf printout:
Aug  4 11:58:41 private2 kernel:  --- wd:1 rd:2
Aug  4 11:58:41 private2 kernel:  disk 0, wo:1, o:1, dev:sdb1
Aug  4 11:58:41 private2 kernel:  disk 1, wo:0, o:1, dev:sdc1
Aug  4 11:58:41 private2 kernel: RAID1 conf printout:
Aug  4 11:58:41 private2 kernel:  --- wd:1 rd:2
Aug  4 11:58:41 private2 kernel:  disk 1, wo:0, o:1, dev:sdc1
Aug  4 11:58:41 private2 kernel: kjournald starting.  Commit interval 5 seconds
Aug  4 11:58:41 private2 kernel: EXT3 FS on dm-2, internal journal
Aug  4 11:58:41 private2 kernel: EXT3-fs: mounted filesystem with ordered data mode.
Aug  4 12:03:15 private2 kernel: 3w-9xxx: scsi6: ERROR: (0x03:0x0202): Drive ECC error:port=3.
Aug  4 12:03:15 private2 kernel: sd 6:0:1:0: Unhandled sense code
Aug  4 12:03:15 private2 kernel: sd 6:0:1:0: SCSI error: return code = 0x08000004
Aug  4 12:03:15 private2 kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
Aug  4 12:03:15 private2 kernel: sdc: Current: sense key: Medium Error
Aug  4 12:03:15 private2 kernel:     Add. Sense: Unrecovered read error
Aug  4 12:03:15 private2 kernel: 
Aug  4 12:03:15 private2 kernel: EXT3-fs error (device dm-2): ext3_get_inode_loc: unable to read inode block - inode=71516957, block=143032346
Aug  4 12:03:15 private2 kernel: Aborting journal on device dm-2.
Aug  4 12:03:15 private2 kernel: ext3_abort called.
Aug  4 12:03:15 private2 kernel: EXT3-fs error (device dm-2): ext3_journal_start_sb: Detected aborted journal
Aug  4 12:03:15 private2 kernel: Remounting filesystem read-only
Aug  4 12:03:18 private2 kernel: 3w-9xxx: scsi6: ERROR: (0x03:0x0202): Drive ECC error:port=3.
Aug  4 12:03:18 private2 kernel: sd 6:0:1:0: Unhandled sense code
Aug  4 12:03:18 private2 kernel: sd 6:0:1:0: SCSI error: return code = 0x08000004
Aug  4 12:03:18 private2 kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
Aug  4 12:03:18 private2 kernel: sdc: Current: sense key: Medium Error
Aug  4 12:03:18 private2 kernel:     Add. Sense: Unrecovered read error
Aug  4 12:03:18 private2 kernel: 
Aug  4 12:03:18 private2 kernel: EXT3-fs error (device dm-2): ext3_get_inode_loc: unable to read inode block - inode=71516956, block=143032346
Aug  4 12:03:46 private2 kernel: ext3_abort called.
Aug  4 12:03:46 private2 kernel: EXT3-fs error (device dm-2): ext3_put_super: Couldn't clean up the journal
Aug  4 12:06:17 private2 kernel: md: md0 stopped.
Aug  4 12:06:17 private2 kernel: md: unbind<sdb1>
Aug  4 12:06:17 private2 kernel: md: export_rdev(sdb1)
Aug  4 12:06:17 private2 kernel: md: unbind<sdc1>
Aug  4 12:06:17 private2 kernel: md: export_rdev(sdc1)
Aug  4 12:07:59 private2 kernel: md: md0 stopped.
Aug  4 12:07:59 private2 kernel: md: bind<sdb1>
Aug  4 12:07:59 private2 kernel: md: bind<sdc1>
Aug  4 12:07:59 private2 kernel: raid1: raid set md0 active with 1 out of 2 mirrors
Aug  4 12:07:59 private2 kernel: RAID1 conf printout:
Aug  4 12:07:59 private2 kernel:  --- wd:1 rd:2
Aug  4 12:07:59 private2 kernel:  disk 0, wo:1, o:1, dev:sdb1
Aug  4 12:07:59 private2 kernel:  disk 1, wo:0, o:1, dev:sdc1
Aug  4 12:07:59 private2 kernel: md: syncing RAID array md0
Aug  4 12:07:59 private2 kernel: md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
Aug  4 12:07:59 private2 kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reconstruction.
Aug  4 12:07:59 private2 kernel: md: using 128k window, over a total of 976551040 blocks.
Aug  4 12:09:58 private2 kernel: sd 6:0:1:0: WARNING: (0x06:0x002C): Command (0x28) timed out, resetting card.
Aug  4 12:10:07 private2 kernel: INFO: task md0_resync:6417 blocked for more than 120 seconds.
Aug  4 12:10:07 private2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug  4 12:10:07 private2 kernel: md0_resync    D ffffffff801563dc     0  6417    465                6416 (L-TLB)
Aug  4 12:10:07 private2 kernel:  ffff81065b64bca0 0000000000000046 0000000000000000 ffff81065b677ea0
Aug  4 12:10:07 private2 kernel:  0000000000000001 000000000000000a ffff8106853c9820 ffff81068525f7e0
Aug  4 12:10:07 private2 kernel:  000001c990bde7ec 00000000000040f4 ffff8106853c9a08 000000048008d299
Aug  4 12:10:07 private2 kernel: Call Trace:
Aug  4 12:10:07 private2 kernel:  [<ffffffff883864e9>] :raid1:raise_barrier+0x12c/0x164
Aug  4 12:10:07 private2 kernel:  [<ffffffff8008ee74>] default_wake_function+0x0/0xe
Aug  4 12:10:07 private2 kernel:  [<ffffffff883877fb>] :raid1:sync_request+0x17a/0x50d
Aug  4 12:10:07 private2 kernel:  [<ffffffff801563dc>] __next_cpu+0x19/0x28
Aug  4 12:10:07 private2 kernel:  [<ffffffff8021f649>] is_mddev_idle+0xa7/0x102
Aug  4 12:10:07 private2 kernel:  [<ffffffff80223104>] md_do_sync+0x464/0x84b
Aug  4 12:10:07 private2 kernel:  [<ffffffff800a3290>] keventd_create_kthread+0x0/0xc4
Aug  4 12:10:07 private2 kernel:  [<ffffffff80222c8a>] md_thread+0xf8/0x10e
Aug  4 12:10:07 private2 kernel:  [<ffffffff80222b92>] md_thread+0x0/0x10e
Aug  4 12:10:07 private2 kernel:  [<ffffffff8003264c>] kthread+0xfe/0x132
Aug  4 12:10:07 private2 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
Aug  4 12:10:07 private2 kernel:  [<ffffffff800a3290>] keventd_create_kthread+0x0/0xc4
Aug  4 12:10:07 private2 kernel:  [<ffffffff8003254e>] kthread+0x0/0x132
Aug  4 12:10:07 private2 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11
Aug  4 12:10:07 private2 kernel: 
Aug  4 12:10:12 private2 kernel: 3w-9xxx: scsi6: AEN: INFO (0x04:0x000B): Rebuild started:unit=1.
Aug  4 12:10:12 private2 kernel: 3w-9xxx: scsi6: AEN: ERROR (0x04:0x0009): Drive timeout detected:port=3.
Aug  4 12:10:12 private2 kernel: 3w-9xxx: scsi6: AEN: ERROR (0x04:0x000A): Drive error detected:unit=1, port=3.
Aug  4 12:10:12 private2 kernel: 3w-9xxx: scsi6: AEN: ERROR (0x04:0x0009): Drive timeout detected:port=3.
Aug  4 12:10:12 private2 kernel: 3w-9xxx: scsi6: AEN: ERROR (0x04:0x0026): Drive ECC error reported:port=3, unit=1.
Aug  4 12:10:12 private2 kernel: 3w-9xxx: scsi6: AEN: ERROR (0x04:0x002D): Source drive error occurred:port=3, unit=1.
Aug  4 12:10:12 private2 kernel: 3w-9xxx: scsi6: AEN: ERROR (0x04:0x0004): Rebuild failed:unit=1.
Aug  4 12:10:13 private2 kernel: 3w-9xxx: scsi6: AEN: INFO (0x04:0x003B): Rebuild paused:unit=1.
Aug  4 12:10:13 private2 kernel: 3w-9xxx: scsi6: AEN: ERROR (0x04:0x0009): Drive timeout detected:port=3.
Aug  4 12:10:13 private2 kernel: 3w-9xxx: scsi6: AEN: ERROR (0x04:0x0009): Drive timeout detected:port=3.
Aug  4 12:10:27 private2 kernel: 3w-9xxx: scsi6: ERROR: (0x03:0x0202): Drive ECC error:port=3.
Aug  4 12:10:27 private2 kernel: sd 6:0:1:0: Unhandled sense code
Aug  4 12:10:27 private2 kernel: sd 6:0:1:0: SCSI error: return code = 0x08000004
Aug  4 12:10:27 private2 kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
Aug  4 12:10:27 private2 kernel: sdc: Current: sense key: Medium Error
Aug  4 12:10:27 private2 kernel:     Add. Sense: Unrecovered read error

[…]
Aug  4 12:11:28 private2 kernel: raid1: sdc: unrecoverable I/O read error for block 390272
Aug  4 12:11:31 private2 kernel: 3w-9xxx: scsi6: ERROR: (0x03:0x0202): Drive ECC error:port=3.
Aug  4 12:11:31 private2 kernel: sd 6:0:1:0: Unhandled sense code
Aug  4 12:11:31 private2 kernel: sd 6:0:1:0: SCSI error: return code = 0x08000004
Aug  4 12:11:31 private2 kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
Aug  4 12:11:31 private2 kernel: sdc: Current: sense key: Medium Error
Aug  4 12:11:31 private2 kernel:     Add. Sense: Unrecovered read error
Aug  4 12:11:31 private2 kernel: 
Aug  4 12:11:31 private2 kernel: raid1: sdc: unrecoverable I/O read error for block 390144
Aug  4 12:11:34 private2 kernel: 3w-9xxx: scsi6: ERROR: (0x03:0x0202): Drive ECC error:port=3.
Aug  4 12:11:34 private2 kernel: sd 6:0:1:0: Unhandled sense code
Aug  4 12:11:34 private2 kernel: sd 6:0:1:0: SCSI error: return code = 0x08000004
Aug  4 12:11:34 private2 kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
Aug  4 12:11:34 private2 kernel: sdc: Current: sense key: Medium Error
Aug  4 12:11:34 private2 kernel:     Add. Sense: Unrecovered read error
Aug  4 12:11:34 private2 kernel: 
Aug  4 12:11:34 private2 kernel: raid1: sdc: unrecoverable I/O read error for block 390400
Aug  4 12:11:38 private2 kernel: 3w-9xxx: scsi6: ERROR: (0x03:0x0202): Drive ECC error:port=3.
Aug  4 12:11:38 private2 kernel: sd 6:0:1:0: Unhandled sense code
Aug  4 12:11:38 private2 kernel: sd 6:0:1:0: SCSI error: return code = 0x08000004
Aug  4 12:11:38 private2 kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
Aug  4 12:11:38 private2 kernel: sdc: Current: sense key: Medium Error
Aug  4 12:11:38 private2 kernel:     Add. Sense: Unrecovered read error
Aug  4 12:11:38 private2 kernel: 
Aug  4 12:11:38 private2 kernel: raid1: sdc: unrecoverable I/O read error for block 392064
Aug  4 12:11:38 private2 kernel: RAID1 conf printout:
Aug  4 12:11:38 private2 kernel:  --- wd:1 rd:2
Aug  4 12:11:38 private2 kernel:  disk 0, wo:1, o:1, dev:sdb1
Aug  4 12:11:38 private2 kernel:  disk 1, wo:0, o:1, dev:sdc1
Aug  4 12:11:38 private2 kernel: RAID1 conf printout:
Aug  4 12:11:38 private2 kernel:  --- wd:1 rd:2
Aug  4 12:11:38 private2 kernel:  disk 1, wo:0, o:1, dev:sdc1
Aug  4 12:12:15 private2 kernel: md: md0 stopped.
Aug  4 12:12:15 private2 kernel: md: unbind<sdc1>
Aug  4 12:12:15 private2 kernel: md: export_rdev(sdc1)
Aug  4 12:12:15 private2 kernel: md: unbind<sdb1>
Aug  4 12:12:15 private2 kernel: md: export_rdev(sdb1)
Aug  4 12:12:19 private2 kernel: md: md0 stopped.
Aug  4 12:19:15 private2 kernel: md: md0 stopped.
Aug  4 12:24:57 private2 kernel: md: md1 stopped.
Aug  4 12:34:03 private2 kernel: md: md1 stopped.
Aug  4 13:06:17 private2 kernel: md: md1 stopped.
Aug  4 13:06:25 private2 kernel: md: md0 stopped.
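
The read failures in a log like the one above can be condensed into a list of affected blocks. This is only a sketch: summarise_read_errors and its file argument are hypothetical names, and it assumes the "unrecoverable I/O read error for block N" wording shown in these messages.

```shell
#!/bin/sh
# Sketch: list each block that md reported as unreadable, with a hit count.
# $1 is the path to a syslog-style file (hypothetical argument).
summarise_read_errors() {
    awk '/unrecoverable I\/O read error for block/ { count[$NF]++ }
         END { for (b in count) print b, count[b] }' "$1" | sort -n
}
```

Calling summarise_read_errors /var/log/messages (the path is an assumption) would print one "block count" pair per line, sorted by block number.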

   Chris Maxwell
   Unix SysAdmin, Faculty of Computer Science, Dalhousie University
   Halifax, Nova Scotia, Canada
   (902) 494-1369 / chris.maxwell@dal.ca / FAX: (902) 492-1517


* Re: Rewrite md raid1 member
From: Chris Murphy @ 2016-08-19 16:10 UTC (permalink / raw)
  To: Chris Dunlop; +Cc: Wols Lists, Brad Campbell, Linux-RAID
In-Reply-To: <20160819124615.GA23247@onthe.net.au>

On Fri, Aug 19, 2016 at 6:46 AM, Chris Dunlop <chris@onthe.net.au> wrote:
> On Fri, Aug 19, 2016 at 12:52:21PM +0100, Wols Lists wrote:
>> On 18/08/16 05:01, Chris Dunlop wrote:
>>> I'm interested to see if there's a way of essentially doing the above on a
>>> live system, assuming there's appropriate care taken to not trash any
>>> existing data (including superblocks).
>>>
>>> I.e. is it *theoretically* possible to write the same data back to the whole
>>> disk safely. E.g. using 'dd' from/to the same disk is almost there, but, as
>>> described, there's a window of opportunity where you could get stale data on
>>> the disk and a raid repair could then copy that stale data to the good disk.
>>
>> There is something called "scrub". My superficial knowledge of raid
>> doesn't let me know what it is, but as far as I can make out it forces a
>> whole-disk-write or somesuch. Explicitly to flush out such problems. If
>> someone else can tell you how to scrub your disks, I'd try that.
>
> A scrub will read the RAID members to check that both sides match (raid 1,
> 10), or that the parity is correct (raid 4,5,6).
>
> To initiate a scrub of md0:
>
> echo repair > /sys/block/md0/md/sync_action
>
> You can watch it using /proc/mdstat, e.g.:
>
> watch cat /proc/mdstat
>
> It won't write anything if it doesn't detect any errors.
>
> In my case, I want it to write everything.
>
> If I do my 'dd' to write everything as previously described, with the window
> of opportunity for stale data to end up on the written disk, one option
> would be to run a scrub / repair to check the data is the same - but if I'm
> unlucky with my dd and the data isn't the same for some sector[s], I want to
> ensure the correct data is copied over the stale data and not the other way
> around, e.g. to specify "in the event of a mismatch, use the data from sda
> and overwrite the data on sdb".
>
> Unfortunately I don't know how that can be done.
>
> Does anyone know?

Basically you want what Btrfs balance does, except simpler: rather
than relocating extents into new allocation groups, you just want to
read and rewrite everything as it is.

You definitely can't do this with dd on md with a mounted file system.
The file system will inevitably make changes after dd has done its read,
and dd's write will then clobber those modifications. It'll be data loss
at a minimum, and if it's file system metadata, it'll be worse in that
it'll make the file system inconsistent. Overwriting good data in place is
a further problem, since it doesn't account for the possibility of a crash
or power failure. You'd really want this operation to be CoW, so that
the good data is effectively duplicated somewhere else and only once
that copy is on stable media would it be pointed to, and the
original data turned to free space.
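
The read-then-write race described above can be illustrated with a toy sketch. Everything here is an assumption made for illustration: an ordinary temporary file stands in for one disk block, the steps are forced into the worst-case order by hand, and lost_update_demo is a hypothetical name.

```shell
#!/bin/sh
# Illustration only: a plain file plays the role of a single disk block.
lost_update_demo() {
    block=$(mktemp)
    printf 'old' > "$block"          # contents before anything happens
    copy=$(cat "$block")             # rewrite tool reads the block
    printf 'new' > "$block"          # file system updates the block meanwhile
    printf '%s' "$copy" > "$block"   # rewrite tool writes back what it read
    cat "$block"                     # the file system's update is gone
    rm -f "$block"
}
```

A real rewrite would have to hold off writers between the read and the write of each block, which a plain dd cannot do.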

I'm not really understanding the use case of why you'd want to do
this. At a fundamental level it sounds like you don't trust the
devices the data resides on. If that's true, then there are related
concerns that aren't mitigated by this rewrite feature alone.

-- 
Chris Murphy


* Re: URE, link resets, user hostile defaults
From: Chris Murphy @ 2016-08-19 15:30 UTC (permalink / raw)
  To: Pasi Kärkkäinen
  Cc: Zygo Blaxell, Chris Murphy, Hannes Reinecke, Linux-RAID
In-Reply-To: <20160704214304.GZ13212@reaktio.net>

On Mon, Jul 4, 2016 at 3:43 PM, Pasi Kärkkäinen <pasik@iki.fi> wrote:
> On Wed, Jun 29, 2016 at 08:17:51AM -0400, Zygo Blaxell wrote:
>> On Tue, Jun 28, 2016 at 11:33:36AM -0600, Chris Murphy wrote:
>> > On Tue, Jun 28, 2016 at 12:33 AM, Hannes Reinecke <hare@suse.de> wrote:
>> > > Can you post a message log detailing this problem?
>> >
>> > Just over the weekend Phil Turmel posted an email with a bunch of back
>> > reading on the subject of timeout mismatches for someone to read. I've
>> > lost track of how many user emails he's replied to, discovering this
>> > common misconfiguration, and get it straightened out and more often
>> > than not helping the user recover data that otherwise would have been
>> > lost *because* of hard link resetting instead of explicit read errors.
>>
>> OK, but the two links you provided are not examples of these.
>>
>
> Here's one of the threads where Phil explains the issue:
>
> http://marc.info/?l=linux-raid&m=133665797115876&w=2
>
> quote:
>
>
> "A very common report I see on this mailing list is people who have lost arrays
> where the drives all appear to be healthy.
> Given the large size of today's hard drives, even healthy drives will occasionally
> have an unrecoverable read error.
>
> When this happens in a raid array with a desktop drive without SCTERC,
> the driver times out and reports an error to MD.  MD proceeds to
> reconstruct the missing data and tries to write it back to the bad
> sector.  However, that drive is still trying to read the bad sector and
> ignores the controller.  The write is immediately rejected.  BOOM!  The
> *write* error ejects that member from the array.  And you are now
> degraded.
>
> If you don't notice the degraded array right away, you probably won't
> notice until a URE on another drive pops up.  Once that happens, you
> can't complete a resync to revive the array.
>
> Running a "check" or "repair" on an array without TLER will have the
> opposite of the intended effect: any URE will kick a drive out instead
> of fixing it.
>
> In the same scenario with an enterprise drive, or a drive with SCTERC
> turned on, the drive read times out before the controller driver, the
> controller never resets the link to the drive, and the followup write
> succeeds.  (The sector is either successfully corrected in place, or
> it is relocated by the drive.)  No BOOM."


The more I think about this, the more I think the default command timer
for SATA and USB drives just needs to change. It is really the simplest
solution to the problem. Parsing for device SCT ERC support and then
enabling it is risky, because drive firmware bugs may get in the way. And
it's an open question whether the setting persists on all drives after
suspend (to RAM or disk).

A further problem: if SCT ERC is enabled by default and the user
wants to disable it for some reason, they might not be able to do this
simply from user space with smartctl -l scterc. I've encountered drives
that accept only one state change; changing it back to disabled causes
the device to "crash" and vanish off the SATA bus. Clearly a firmware bug.
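
As a rough sketch of the comparison at stake (not a definitive check): the drive's error recovery limit should be shorter than the kernel command timer. The function name timeout_state and its result words are made up for illustration. It assumes the ERC value is given in deciseconds, as smartctl -l scterc takes it, with 0 standing for disabled or unsupported, and the command timer in seconds, as found in /sys/block/<dev>/device/timeout.

```shell
#!/bin/sh
# Sketch: compare a drive's SCT ERC read limit with the kernel command timer.
# $1: ERC read limit in deciseconds (0 = disabled or unsupported, by assumption)
# $2: kernel command timer in seconds
timeout_state() {
    erc_ds=$1
    timer_s=$2
    if [ "$erc_ds" -eq 0 ]; then
        echo "unbounded"     # drive may retry for longer than the timer allows
    elif [ "$erc_ds" -lt $((timer_s * 10)) ]; then
        echo "ok"            # drive reports the error before the timer fires
    else
        echo "mismatch"      # timer fires first and the link gets reset
    fi
}
```

For example, timeout_state 70 30 covers a drive set to 7.0 seconds against the usual 30 second timer. Where ERC cannot be enabled, the other side can be raised instead, e.g. by writing a larger value such as 180 to /sys/block/<dev>/device/timeout.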



-- 
Chris Murphy


* Re: Rewrite md raid1 member
From: Chris Dunlop @ 2016-08-19 12:46 UTC (permalink / raw)
  To: Wols Lists; +Cc: Brad Campbell, linux-raid
In-Reply-To: <57B6F2F5.8090800@youngman.org.uk>

On Fri, Aug 19, 2016 at 12:52:21PM +0100, Wols Lists wrote:
> On 18/08/16 05:01, Chris Dunlop wrote:
>> I'm interested to see if there's a way of essentially doing the above on a
>> live system, assuming there's appropriate care taken to not trash any
>> existing data (including superblocks).
>> 
>> I.e. is it *theoretically* possible to write the same data back to the whole
>> disk safely. E.g. using 'dd' from/to the same disk is almost there, but, as
>> described, there's a window of opportunity where you could get stale data on
>> the disk and a raid repair could then copy that stale data to the good disk.
> 
> There is something called "scrub". My superficial knowledge of raid
> doesn't let me know what it is, but as far as I can make out it forces a
> whole-disk-write or somesuch. Explicitly to flush out such problems. If
> someone else can tell you how to scrub your disks, I'd try that.

A scrub will read the RAID members to check that both sides match (raid 1,
10), or that the parity is correct (raid 4,5,6).

To initiate a scrub of md0:

echo repair > /sys/block/md0/md/sync_action

You can watch it using /proc/mdstat, e.g.:

watch cat /proc/mdstat

It won't write anything if it doesn't detect any errors.
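
A minimal sketch of the two scrub modes, under stated assumptions: writing "check" to sync_action only reads and counts mismatches, "repair" also rewrites blocks that differ, and the result is left in mismatch_cnt next to sync_action. The function names and the MD_SYSFS override are hypothetical, added so the sketch can be pointed at a scratch directory rather than a live array.

```shell
#!/bin/sh
# Sketch only: MD_SYSFS is an assumed override, defaulting to md0's sysfs dir.
MD_SYSFS=${MD_SYSFS:-/sys/block/md0/md}

# Start a scrub: "check" is read-only, "repair" also rewrites mismatches.
start_scrub() {
    mode=$1
    case "$mode" in
        check|repair) printf '%s\n' "$mode" > "$MD_SYSFS/sync_action" ;;
        *) echo "usage: start_scrub check|repair" >&2; return 2 ;;
    esac
}

# Print the mismatch count recorded by the most recent scrub.
scrub_mismatches() {
    cat "$MD_SYSFS/mismatch_cnt"
}
```

Running start_scrub check first and then scrub_mismatches once /proc/mdstat shows the pass has finished tells you whether a repair would write anything at all.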

In my case, I want it to write everything.

If I do my 'dd' to write everything as previously described, with the window
of opportunity for stale data to end up on the written disk, one option
would be to run a scrub / repair to check the data is the same - but if I'm
unlucky with my dd and the data isn't the same for some sector[s], I want to
ensure the correct data is copied over the stale data and not the other way
around, e.g. to specify "in the event of a mismatch, use the data from sda
and overwrite the data on sdb".

Unfortunately I don't know how that can be done.

Does anyone know?

Cheers,

Chris


* Re: URE, link resets, user hostile defaults
From: Phil Turmel @ 2016-08-19 12:36 UTC (permalink / raw)
  To: Pasi Kärkkäinen, Zygo Blaxell
  Cc: Chris Murphy, Hannes Reinecke, linux-raid
In-Reply-To: <20160819100010.GA5195@reaktio.net>

On 08/19/2016 06:00 AM, Pasi Kärkkäinen wrote:
> 
> ping
> 
> Let's not forget this thread :)

Not forgotten.  Just busy working to pay the bills... :-(



* Re: Rewrite md raid1 member
From: Wols Lists @ 2016-08-19 11:52 UTC (permalink / raw)
  To: Chris Dunlop, Brad Campbell; +Cc: linux-raid
In-Reply-To: <20160818040151.GA21256@onthe.net.au>

On 18/08/16 05:01, Chris Dunlop wrote:
> I'm interested to see if there's a way of essentially doing the above on a
> live system, assuming there's appropriate care taken to not trash any
> existing data (including superblocks).
> 
> I.e. is it *theoretically* possible to write the same data back to the whole
> disk safely. E.g. using 'dd' from/to the same disk is almost there, but, as
> described, there's a window of opportunity where you could get stale data on
> the disk and a raid repair could then copy that stale data to the good disk.

There is something called "scrub". My superficial knowledge of raid
doesn't let me know what it is, but as far as I can make out it forces a
whole-disk-write or somesuch. Explicitly to flush out such problems. If
someone else can tell you how to scrub your disks, I'd try that.

It's especially recommended, I think, for people with desktop drives in
their array because it flushes out pending problems, which with desktop
drives typically remove the "R" from "raid".

Cheers,
Wol


* Re: URE, link resets, user hostile defaults
From: Pasi Kärkkäinen @ 2016-08-19 10:00 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Chris Murphy, Hannes Reinecke, linux-raid
In-Reply-To: <20160704214304.GZ13212@reaktio.net>


ping

Let's not forget this thread :)


-- Pasi

On Tue, Jul 05, 2016 at 12:43:04AM +0300, Pasi Kärkkäinen wrote:
> On Wed, Jun 29, 2016 at 08:17:51AM -0400, Zygo Blaxell wrote:
> > On Tue, Jun 28, 2016 at 11:33:36AM -0600, Chris Murphy wrote:
> > > On Tue, Jun 28, 2016 at 12:33 AM, Hannes Reinecke <hare@suse.de> wrote:
> > > > Can you post a message log detailing this problem?
> > >
> > > Just over the weekend Phil Turmel posted an email with a bunch of back
> > > reading on the subject of timeout mismatches for someone to read. I've
> > > lost track of how many user emails he's replied to, discovering this
> > > common misconfiguration, and get it straightened out and more often
> > > than not helping the user recover data that otherwise would have been
> > > lost *because* of hard link resetting instead of explicit read errors.
> > 
> > OK, but the two links you provided are not examples of these.
> > 
> 
> Here's one of the threads where Phil explains the issue:
> 
> http://marc.info/?l=linux-raid&m=133665797115876&w=2
> 
> quote:
> 
> 
> "A very common report I see on this mailing list is people who have lost arrays 
> where the drives all appear to be healthy.  
> Given the large size of today's hard drives, even healthy drives will occasionally 
> have an unrecoverable read error.
> 
> When this happens in a raid array with a desktop drive without SCTERC,
> the driver times out and reports an error to MD.  MD proceeds to
> reconstruct the missing data and tries to write it back to the bad
> sector.  However, that drive is still trying to read the bad sector and
> ignores the controller.  The write is immediately rejected.  BOOM!  The
> *write* error ejects that member from the array.  And you are now
> degraded.
> 
> If you don't notice the degraded array right away, you probably won't
> notice until a URE on another drive pops up.  Once that happens, you
> can't complete a resync to revive the array.
> 
> Running a "check" or "repair" on an array without TLER will have the
> opposite of the intended effect: any URE will kick a drive out instead
> of fixing it.
> 
> In the same scenario with an enterprise drive, or a drive with SCTERC
> turned on, the drive read times out before the controller driver, the
> controller never resets the link to the drive, and the followup write
> succeeds.  (The sector is either successfully corrected in place, or
> it is relocated by the drive.)  No BOOM."
> 
> 
> 
> -- Pasi
> 



* Re: Adding journal to existing raid5 arrary
From: Song Liu @ 2016-08-18 16:03 UTC (permalink / raw)
  To: Maarten van Malland; +Cc: linux-raid@vger.kernel.org
In-Reply-To: <CAC8wJ3Fjj=ANPpL8MxN_gmcs721TjkMUOooeEh_giAONFx4q-Q@mail.gmail.com>

Yes, we do plan to add removal as well. 
 
Thanks,
Song

>> On 8/18/16, 12:51 AM, "Maarten van Malland" <maartenvanmalland@gmail.com> wrote:

    Well that explains that then :-). Is it also planned to support
    removal of the journal device? I can imagine that there are cases
    (such as performance issues) that you would like to revert to the old
    situation...
    
    On Wed, Aug 17, 2016 at 11:46 PM, Song Liu <songliubraving@fb.com> wrote:
    > Currently, --add-journal does not support adding a journal to an existing array.
    > We plan to add that soon.
    >
    > Song
    >
    


