Linux RAID subsystem development
* Re: MDADM grow /dev/md0 - chunk size
From: NeilBrown @ 2017-01-22 22:52 UTC
  To: J. Cassidy, linux-raid
In-Reply-To: <8ef3e9f2d526ebba88b94a0e6f09fdef.webmail@mx1bln1.prossl.de>

On Sun, Jan 15 2017, J. Cassidy wrote:

> Hello all/Neil,
>
>
>
>
> I am trying to change the chunk size on a RAID 0 (two SSD) from 512K to 64K.
>
> I am running Debian Stretch with a 4.10 kernel.
>
> MDADM version is 4.0 (GIT).
>
> This is the command string being issued -
>
> mdadm --grow -c 64 --backup-file=/zz/backup.file /dev/md0
>
> or
>
> mdadm --grow -c 64  /dev/md0
>
> both of the abovementioned commands produce this message -
>
>
> "mdadm: /dev/md0: could not set level to raid4"
>
>
> A snippet from dmesg -
> .
> .
> md/raid:md0: cannot takeover raid0 with more than one zone.
> md: md0: raid4 would not accept array

Your two partitions that form the RAID0 array are different sizes.
This causes raid0 to create 2 zones, one which covers all of the smaller
partition and an equal portion of the larger partition, and one which
covers the remainder of the larger partition.

raid4 does not have a similar concept of zones, so it is not possible to
convert the raid0 into a degraded raid4.
raid0 does not support chunk-size changes (or any changes) directly.
These are performed by transforming the RAID0 to RAID4 and having the
raid4 module perform the change.

The consequence of all this is that: sorry, you cannot change the chunk
size of the array.
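
To make the zone layout concrete, here is a minimal Python sketch of how
zones fall out of the member sizes.  It is an illustration only, not the
kernel's algorithm: the name raid0_zones and the example sizes are
assumptions, and the real code also rounds sizes to the chunk size.

```python
def raid0_zones(sizes):
    """Return a list of (member indexes, sectors per member) zones.

    sizes: usable size of each member in sectors (hypothetical values).
    Each zone stripes over every member that still has unused space.
    """
    zones = []
    used = 0  # sectors already consumed on every member still in use
    for limit in sorted(set(sizes)):
        members = [i for i, size in enumerate(sizes) if size > used]
        zones.append((members, limit - used))
        used = limit
    return zones


# Two members of different sizes give two zones: one striped over both
# members, one covering only the tail of the larger member.
print(raid0_zones([100, 150]))
# Equal-sized members give a single zone.
print(raid0_zones([100, 100]))
```

Only the single-zone case is eligible for the raid0-to-raid4 takeover,
which is what the dmesg line above is reporting.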

And... please don't send nag emails so soon - it was barely more than
24 hours after the original.  This just comes across as rude and
impatient.  People have other commitments.
My rule of thumb is to wait at least a week before resending - and then
resend the full text of the original.  Your nag email was not only too
soon, but contained no detail and so was useless.

NeilBrown


> .
> .
>
> My MDADM setup -
>
>
> mdadm --detail /dev/md0
> /dev/md0:
>         Version : 1.2
>   Creation Time : Sat Jan 14 16:51:54 2017
>      Raid Level : raid0
>      Array Size : 497783808 (474.72 GiB 509.73 GB)
>    Raid Devices : 2
>   Total Devices : 2
>     Persistence : Superblock is persistent
>
>     Update Time : Sat Jan 14 16:51:54 2017
>           State : clean
>  Active Devices : 2
> Working Devices : 2
>  Failed Devices : 0
>   Spare Devices : 0
>
>      Chunk Size : 512K
>
>            Name : Pezenas:0  (local to host Pezenas)
>            UUID : 77cd6f4e:f98bf2b0:862948df:12da38fa
>          Events : 0
>
>     Number   Major   Minor   RaidDevice State
>        0     259        4        0      active sync   /dev/nvme0n1p2
>        1     259        2        1      active sync   /dev/nvme1n1p1
>
>
> I recall doing something similar a few years ago and it worked, though not using
> NVMe drives.
>
>
> Any help/pointers much appreciated.
>
>
>
>
> Regards,
>
>
>
> John


* Re: Input/Output error reading from a clean raid
From: John Stoffel @ 2017-01-23  0:18 UTC
  To: Salatiel Filho; +Cc: linux-raid
In-Reply-To: <CAGmni9p7T5VQsSGQrwkpYgV=H4B-pjXtg8TxnuWst-FQSJh_fA@mail.gmail.com>


Salatiel> I am trying to recover a few files from my backup. The
Salatiel> backup is on a raid 5 + ext4.  There are several files where
Salatiel> I get an I/O error. The raid appears to be clean and fsck shows
Salatiel> no errors. Any idea what it could be?

Salatiel> md1 : active raid5 sdd1[0] sdg1[4] sdf1[2] sde1[1]
Salatiel>       3220829184 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
Salatiel>       bitmap: 1/8 pages [4KB], 65536KB chunk

It would help if you could post the error(s) you're getting, along
with any output from dmesg during that time.  Have you done a full
scan of the disk looking for errors?  You might just have silent
read errors in your array.  So as root do:

   # echo check >>/sys/block/md??/md/sync_action

where md?? is the name of your md array you want to check.  You can
get the name from:

   cat /proc/mdstat

and of course it would help to post that info as well if you want more
help.
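
If you would rather script this, the following is a rough Python sketch
that pulls the array name, members and [UU..] status out of /proc/mdstat
text and builds the sysfs path used above.  The function names and the
regular expressions are my own assumptions and only cover the simple
"active" layout shown in this thread.

```python
import re


def parse_mdstat(text):
    """Return {array: {"state", "level", "members", "status"}} from mdstat text."""
    arrays = {}
    current = None
    for line in text.splitlines():
        head = re.match(r"^(md\S*) : (\S+) (\S+) (.*)$", line)
        if head:
            current = head.group(1)
            arrays[current] = {
                "state": head.group(2),
                "level": head.group(3),
                "members": re.findall(r"(\w+)\[\d+\]", head.group(4)),
                "status": None,
            }
            continue
        # e.g. "... algorithm 2 [4/4] [UUUU]"; "_" marks a missing member
        status = re.search(r"\[(\d+)/(\d+)\] \[([U_]+)\]", line)
        if status and current:
            arrays[current]["status"] = status.group(3)
    return arrays


def sync_action_path(name):
    """sysfs file that accepts "check" for the named array."""
    return "/sys/block/%s/md/sync_action" % name


# Sample taken from the report quoted above.
SAMPLE = (
    "md1 : active raid5 sdd1[0] sdg1[4] sdf1[2] sde1[1]\n"
    "      3220829184 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]\n"
    "      bitmap: 1/8 pages [4KB], 65536KB chunk\n"
)

for name, info in parse_mdstat(SAMPLE).items():
    print(name, info["level"], info["members"], info["status"])
    print("write 'check' to", sync_action_path(name))
```

On a live system you would pass in the contents of /proc/mdstat instead
of SAMPLE; writing to the sysfs file still needs root.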

John


* Re: Soft-Raid 0 Performance | Transfer two Data-Streams (CPU+FPGA) to the same Soft-Raid
From: Coly Li @ 2017-01-23  8:40 UTC
  To: Eric Schwarz; +Cc: linux-raid
In-Reply-To: <0e04b53af8bdf4f39a08b31ea4499626@sw-optimization.com>

On 2017/1/17 11:00 PM, Eric Schwarz wrote:
> Hello mailing list,
> 
> I have got two questions:
> 
> 1.) I have set-up a softraid (raid level 0) with mdadm using two M.2
> modules. For one module the throughput is ~350MB/s (no mdadm) for two
> modules the throughput is ~500MB/s which is less than factor 1,5 of the
> throughput of a single drive. The filesystem used is ext4. Is there
> someone having some values for comparison? For me the throughput gain
> seems to be too little. The test was done using a HP Z840 workstation.
> 

Hi Eric,

Could you attach your testing script as well? Then we can have a look.



> 2.) We want to configure a softraid (raid level 0) with mdadm which can
> be used from within Linux but also it should be possible to write data
> to the raid w/ DMA directly from the FPGA which is also connected to the
> PCIe bus as a slave as well as the M.2 modules. How can that be achieved
> using existing kernel infrastructure?
> 

I don't know of any existing kernel code for this. Maybe other people can
provide useful information.

Coly



* Re: Soft-Raid 0 Performance | Transfer two Data-Streams (CPU+FPGA) to the same Soft-Raid
From: NeilBrown @ 2017-01-23  9:56 UTC
  To: Eric Schwarz, linux-raid
In-Reply-To: <0e04b53af8bdf4f39a08b31ea4499626@sw-optimization.com>

On Wed, Jan 18 2017, Eric Schwarz wrote:

> Hello mailing list,
>
> I have got two questions:
>
> 1.) I have set-up a softraid (raid level 0) with mdadm using two M.2 
> modules. For one module the throughput is ~350MB/s (no mdadm) for two 
> modules the throughput is ~500MB/s which is less than factor 1,5 of the 
> throughput of a single drive. The filesystem used is ext4. Is there 
> someone having some values for comparison? For me the throughput gain 
> seems to be too little. The test was done using a HP Z840 workstation.

Can you try a range of chunk sizes and see what difference it makes?
RAID0 simply remaps each request to the appropriate target device/offset,
imposing very little direct overhead.  There may be an indirect overhead
as a small chunk size causes large requests to be chopped up into smaller
requests first.
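
As a rough illustration of that remapping and chopping, here is a small
Python sketch for a single-zone RAID0.  The function names are made up
for this example and it ignores data offsets and multiple zones.

```python
def map_sector(sector, chunk_sectors, n_devices):
    """Map an array sector to (device index, sector on that device)."""
    chunk, offset = divmod(sector, chunk_sectors)
    stripe, dev = divmod(chunk, n_devices)
    return dev, stripe * chunk_sectors + offset


def split_request(start, length, chunk_sectors, n_devices):
    """Split a request at chunk boundaries into (device, sector, length) pieces."""
    pieces = []
    while length > 0:
        room = chunk_sectors - (start % chunk_sectors)  # sectors left in this chunk
        n = min(room, length)
        dev, dev_sector = map_sector(start, chunk_sectors, n_devices)
        pieces.append((dev, dev_sector, n))
        start += n
        length -= n
    return pieces


# A 1 MiB request (2048 sectors of 512 bytes) on a two-device array:
print(len(split_request(0, 2048, 128, 2)))   # 64K chunks
print(len(split_request(0, 2048, 1024, 2)))  # 512K chunks
```

With 64K chunks the request is split into 16 pieces, with 512K chunks
into 2, which is the indirect cost worth measuring across chunk sizes.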

>
> 2.) We want to configure a softraid (raid level 0) with mdadm which can 
> be used from within Linux but also it should be possible to write data 
> to the raid w/ DMA directly from the FPGA which is also connected to the 
> PCIe bus as a slave as well as the M.2 modules. How can that be achieved 
> using existing kernel infrastructure?

What sort of help do you want from the existing kernel infrastructure?
Presumably your FPGA must already understand RAID0 striping, so you just
need to tell the FPGA what devices, offsets, chunk sizes to use.
I suspect it would be easiest to have a user-space program query the
array and feed data to your FPGA, but as you provide no detail for how
the FPGA would be configured, I'm only guessing.

NeilBrown


>
> Many thanks for helpful replies
> Eric
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: raid0 vs. mkfs
From: Coly Li @ 2017-01-23 12:26 UTC
  To: Avi Kivity; +Cc: linux-raid
In-Reply-To: <06411e37-ee35-c8d9-a578-4a9cd2fbb0d9@scylladb.com>

On 2017/1/23 2:01 AM, Avi Kivity wrote:
> Hello,
> 
> 
> On 11/27/2016 05:24 PM, Avi Kivity wrote:
>> mkfs /dev/md0 can take a very long time, if /dev/md0 is a very large
>> disk that supports TRIM/DISCARD (erase whichever is inappropriate). 
>> That is because mkfs issues a TRIM/DISCARD (erase whichever is
>> inappropriate) for the entire partition. As far as I can tell, md
>> converts the large TRIM/DISCARD (erase whichever is inappropriate)
>> into a large number of TRIM/DISCARD (erase whichever is inappropriate)
>> requests, one per chunk-size worth of disk, and issues them to the
>> RAID components individually.
>>
>>
>> It seems to me that md can convert the large TRIM/DISCARD (erase
>> whichever is inappropriate) request it gets into one TRIM/DISCARD
>> (erase whichever is inappropriate) per RAID component, converting an
>> O(disk size / chunk size) operation into an O(number of RAID
>> components) operation, which is much faster.
>>
>>
>> I observed this with mkfs.xfs on a RAID0 of four 3TB NVMe devices,
>> with the operation taking about a quarter of an hour, continuously
>> pushing half-megabyte TRIM/DISCARD (erase whichever is inappropriate)
>> requests to the disk. Linux 4.1.12.
>>
> 
> Did anyone pick this up by any chance?  The only thing I could find is
> more people complaining about the same issue.

Hi Avi,

I proposed a proof-of-concept patch; Shaohua and Neil provided review
comments and suggested that I simplify it.

As you may have noticed, I sent out a patch on Dec 9, 2016, in this email
thread. That patch works, but it is not the final version to be accepted
upstream. I quote the performance numbers here:

“On a 4x3TB NVMe raid0, formatted with mkfs.xfs, the current upstream
kernel spends 306 seconds and the patched kernel spends 15 seconds. I see
the average request size increase from 1 chunk (1024 sectors) to 2048
chunks (2097152 sectors).”

I call this the RFC v2 patch, and attach it again in this email.
I am now working on a new version based on the suggestions from Shaohua
and Neil, but any testing feedback on the RFC v2 patch is welcome. It
works, it is just too complex to review.

Thanks.

Coly

[-- Attachment #2: raid0_handle_large_discard_bio.patch --]
[-- Type: text/plain, Size: 11145 bytes --]

Subject: [RFC v2] optimization for large size DISCARD bio by per-device bios 

This is a very early prototype, still needs more block layer code
modification to make it work.

Current upstream raid0_make_request() only handles a TRIM/DISCARD bio one
chunk at a time, which means that a large raid0 device built from SSDs
calls generic_make_request() millions of times for the split bios. This
patch tries to combine small bios into one large bio if they are on the
same real device and contiguous on that device, then sends the combined
large bio to the underlying device with a single call to
generic_make_request().

For example, when mkfs.xfs trims a raid0 device built with 4 x 3TB NVMe
SSDs, current upstream raid0_make_request() calls
generic_make_request() 5.7 million times; with this patch only 4 calls
to generic_make_request() are required.

This patch won't work in the real world, because in block/blk-lib.c:
__blkdev_issue_discard() the original large bio will be split into
smaller ones by the restriction of discard_granularity.

If some day SSDs support a whole-device-sized discard_granularity, it
will be very interesting then...

The basic idea is as follows. Suppose raid0_make_request() receives a
large discard bio, for example one that requests to discard chunks 1
to 24 on a raid0 device built from 4 SSDs. This large discard bio will
be split and sent to each SSD in the following layout,
	SSD1: C1,C5,C9,C13,C17,C21
	SSD2: C2,C6,C10,C14,C18,C22
	SSD3: C3,C7,C11,C15,C19,C23
	SSD4: C4,C8,C12,C16,C20,C24
Current raid0 code will call generic_make_request() 24 times, once for
each split bio. But it is possible to calculate the final layout of
each split bio, so we can combine all the bios into four large per-SSD
bios, like this,
	bio1 (on SSD1): C{1,5,9,13,17,21}
	bio2 (on SSD2): C{2,6,10,14,18,22}
	bio3 (on SSD3): C{3,7,11,15,19,23}
	bio4 (on SSD4): C{4,8,12,16,20,24}
Now we only need to call generic_make_request() 4 times.
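
The grouping in the layout above can be reproduced with a few lines of
Python. This is only an illustration of the idea for a single zone with
1-based chunk numbers; the function name is hypothetical and it is not
part of the patch.

```python
def per_device_discards(first_chunk, last_chunk, n_devices):
    """Group 1-based chunk numbers by the (1-based) SSD they land on."""
    groups = {dev: [] for dev in range(1, n_devices + 1)}
    for chunk in range(first_chunk, last_chunk + 1):
        dev = (chunk - 1) % n_devices + 1  # round-robin striping
        groups[dev].append(chunk)
    return groups


# Chunks 1..24 on 4 SSDs: 24 chunk-sized bios collapse into 4 groups.
for dev, chunks in per_device_discards(1, 24, 4).items():
    print("SSD%d: %s" % (dev, chunks))
```

Each group is contiguous on its own SSD, because consecutive stripes
place their chunks back to back on every member, so one bio per SSD can
cover the whole group.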

The code is not simple, I need more time to write text to explain how
it works. Currently you can treat it as a proof of concept.

Changelogs
v1, Initial prototype.
v2, Major changes include,
    - rename functions, now handle_discard_bio() takes care of
      chunk-size DISCARD bios and the single-disk situation, large DISCARD
      bios are handled in handle_large_discard_bio().
    - Set max_discard_sectors to raid0 device size.
    - Fix several bugs found in basic testing.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/raid0.c | 267 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 266 insertions(+), 1 deletion(-)

diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index 258986a..c7afe0c 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -378,7 +378,7 @@ static int raid0_run(struct mddev *mddev)
 
 		blk_queue_max_hw_sectors(mddev->queue, mddev->chunk_sectors);
 		blk_queue_max_write_same_sectors(mddev->queue, mddev->chunk_sectors);
-		blk_queue_max_discard_sectors(mddev->queue, mddev->chunk_sectors);
+		blk_queue_max_discard_sectors(mddev->queue, raid0_size(mddev, 0, 0));
 
 		blk_queue_io_min(mddev->queue, mddev->chunk_sectors << 9);
 		blk_queue_io_opt(mddev->queue,
@@ -452,6 +452,266 @@ static inline int is_io_in_chunk_boundary(struct mddev *mddev,
 	}
 }
 
+
+struct bio_record {
+	sector_t	bi_sector;
+	unsigned long	sectors;
+	struct md_rdev	*rdev;
+};
+
+static void handle_large_discard_bio(struct mddev *mddev, struct bio *bio)
+{
+	struct bio_record *recs = NULL;
+	struct bio *split;
+	struct r0conf *conf = mddev->private;
+	sector_t sectors, sector;
+	struct strip_zone *first_zone;
+	int zone_idx;
+	sector_t zone_start, zone_end;
+	int nr_strip_zones = conf->nr_strip_zones;
+	int disks;
+	int first_rdev_idx = -1, rdev_idx;
+	struct md_rdev *first_rdev;
+	unsigned int chunk_sects = mddev->chunk_sectors;
+
+	sector = bio->bi_iter.bi_sector;
+	first_zone = find_zone(conf, &sector);
+	first_rdev = map_sector(mddev, first_zone, sector, &sector);
+
+	/* bio is large enough to be split, allocate recs firstly */
+	disks = mddev->raid_disks;
+	recs = kcalloc(disks, sizeof(struct bio_record), GFP_NOIO);
+	if (recs == NULL) {
+		printk(KERN_ERR "md/raid0:%s: failed to allocate memory " \
+				"for bio_record", mdname(mddev));
+		bio->bi_error = -ENOMEM;
+		bio_endio(bio);
+		return;
+	}
+
+	zone_idx = first_zone - conf->strip_zone;
+	for (rdev_idx = 0; rdev_idx < first_zone->nb_dev; rdev_idx++) {
+		struct md_rdev *rdev;
+
+		rdev = conf->devlist[zone_idx * disks + rdev_idx];
+		recs[rdev_idx].rdev = rdev;
+		if (rdev == first_rdev)
+			first_rdev_idx = rdev_idx;
+	}
+
+	sectors = chunk_sects -
+		(likely(is_power_of_2(chunk_sects))
+		? (sector & (chunk_sects - 1))
+		: sector_div(sector, chunk_sects));
+		sector = bio->bi_iter.bi_sector;
+
+	recs[first_rdev_idx].bi_sector = sector + first_zone->dev_start;
+	recs[first_rdev_idx].sectors = sectors;
+
+	/* recs[first_rdev_idx] is initialized with 'sectors', we need to
+	 * handle the remaining sectors, which are stored in 'sectors' too.
+	 */
+	sectors = bio_sectors(bio) - sectors;
+
+	/* bio may not be chunk size aligned, so the split bio on the first
+	 * rdev may not be chunk size aligned either. But the remaining split
+	 * bios on the remaining rdevs must be chunk size aligned, and aligned
+	 * to the rounded-down chunk number.
+	 */
+	zone_end = first_zone->zone_end;
+	rdev_idx = first_rdev_idx + 1;
+	sector = likely(is_power_of_2(chunk_sects))
+		 ? sector & (~(chunk_sects - 1))
+		 : chunk_sects * (sector/chunk_sects);
+
+	while (rdev_idx < first_zone->nb_dev) {
+		if (recs[rdev_idx].sectors == 0) {
+			recs[rdev_idx].bi_sector = sector + first_zone->dev_start;
+			if (sectors <= chunk_sects) {
+				recs[rdev_idx].sectors = sectors;
+				goto issue;
+			}
+			recs[rdev_idx].sectors = chunk_sects;
+			sectors -= chunk_sects;
+		}
+		rdev_idx++;
+	}
+
+	sector += chunk_sects;
+	zone_start = sector + first_zone->dev_start;
+	if (zone_start == zone_end) {
+		zone_idx++;
+		if (zone_idx == nr_strip_zones) {
+			if (sectors != 0)
+				printk(KERN_INFO "bio size exceeds raid0 " \
+					"capability, ignore extra " \
+					"TRIM/DISCARD range.\n");
+			goto issue;
+		}
+		zone_start = conf->strip_zone[zone_idx].dev_start;
+	}
+
+	while (zone_idx < nr_strip_zones) {
+		int rdevs_in_zone = conf->strip_zone[zone_idx].nb_dev;
+		int chunks_per_rdev, rested_chunks, rested_sectors;
+		sector_t zone_sectors, grow_sectors;
+		int add_rested_sectors = 0;
+
+		zone_end = conf->strip_zone[zone_idx].zone_end;
+		zone_sectors = zone_end - zone_start;
+		chunks_per_rdev = sectors;
+		rested_sectors =
+			sector_div(chunks_per_rdev, chunk_sects * rdevs_in_zone);
+		rested_chunks = rested_sectors;
+		rested_sectors = sector_div(rested_chunks, chunk_sects);
+
+		if ((chunks_per_rdev * chunk_sects) > zone_sectors)
+			chunks_per_rdev = zone_sectors/chunk_sects;
+
+		/* rested_chunks and rested_sectors go into next zone, we won't
+		 * handle them in this zone. Set them to 0.
+		 */
+		if ((chunks_per_rdev * chunk_sects) == zone_sectors &&
+		    (rested_chunks != 0 || rested_sectors != 0)) {
+			if (rested_chunks != 0)
+				rested_chunks = 0;
+			if (rested_sectors != 0)
+				rested_sectors = 0;
+		}
+
+		if (rested_chunks == 0 && rested_sectors != 0)
+			add_rested_sectors ++;
+
+		for (rdev_idx = 0; rdev_idx < rdevs_in_zone; rdev_idx++) {
+			/* if .sectors is not initialized (== 0), it indicates
+			 * .bi_sector is not initialized either. We initialize
+			 * .bi_sector first, then set .sectors by
+			 * grow_sectors.
+			 */
+			if (recs[rdev_idx].sectors == 0)
+				recs[rdev_idx].bi_sector = zone_start;
+			grow_sectors = chunks_per_rdev * chunk_sects;
+			if (rested_chunks) {
+				grow_sectors += chunk_sects;
+				rested_chunks--;
+				if (rested_chunks == 0 &&
+				    rested_sectors != 0) {
+					recs[rdev_idx].sectors += grow_sectors;
+					sectors -= grow_sectors;
+					add_rested_sectors ++;
+					continue;
+				}
+			}
+
+			/* if add_rested_sectors != 0, it indicates
+			 * rested_sectors != 0
+			 */
+			if (add_rested_sectors == 1) {
+				grow_sectors += rested_sectors;
+				add_rested_sectors ++;
+			}
+			recs[rdev_idx].sectors += grow_sectors;
+			sectors -= grow_sectors;
+			if (sectors == 0)
+				break;
+		}
+
+		if (sectors == 0)
+			break;
+		zone_start = zone_end;
+		zone_idx++;
+		if (zone_idx < nr_strip_zones)
+			BUG_ON(zone_start != conf->strip_zone[zone_idx].dev_start);
+	}
+
+
+issue:
+	/* recs contains the re-ordered requests layout, now we can
+	 * chain split bios from recs
+	 */
+	for (rdev_idx = 0; rdev_idx < disks; rdev_idx++) {
+		if (rdev_idx == first_rdev_idx ||
+		    recs[rdev_idx].sectors == 0)
+			continue;
+		split = bio_split(bio,
+				  recs[rdev_idx].sectors,
+				  GFP_NOIO,
+				  fs_bio_set);
+		if (split == NULL)
+			break;
+		bio_chain(split, bio);
+		BUG_ON(split->bi_iter.bi_size != recs[rdev_idx].sectors << 9);
+		split->bi_bdev = recs[rdev_idx].rdev->bdev;
+		split->bi_iter.bi_sector = recs[rdev_idx].bi_sector +
+				recs[rdev_idx].rdev->data_offset;
+
+		if (unlikely(!blk_queue_discard(
+				bdev_get_queue(split->bi_bdev))))
+			/* Just ignore it */
+			bio_endio(split);
+		else
+			generic_make_request(split);
+	}
+	BUG_ON(bio->bi_iter.bi_size != recs[first_rdev_idx].sectors << 9);
+	bio->bi_iter.bi_sector = recs[first_rdev_idx].bi_sector +
+					recs[first_rdev_idx].rdev->data_offset;
+	bio->bi_bdev = recs[first_rdev_idx].rdev->bdev;
+
+	if (unlikely(!blk_queue_discard(bdev_get_queue(bio->bi_bdev))))
+		/* Just ignore it */
+		bio_endio(bio);
+	else
+		generic_make_request(bio);
+
+	kfree(recs);
+}
+
+static void handle_discard_bio(struct mddev *mddev, struct bio *bio)
+{
+	struct r0conf *conf = mddev->private;
+	unsigned int chunk_sects = mddev->chunk_sectors;
+	sector_t sector, sectors;
+	struct md_rdev *rdev;
+	struct strip_zone *zone;
+
+	sector = bio->bi_iter.bi_sector;
+	zone = find_zone(conf, &sector);
+	rdev = map_sector(mddev, zone, sector, &sector);
+	bio->bi_bdev = rdev->bdev;
+	sectors = chunk_sects -
+		(likely(is_power_of_2(chunk_sects))
+		 ? (sector & (chunk_sects - 1))
+		 : sector_div(sector, chunk_sects));
+
+	if (unlikely(sectors >= bio_sectors(bio))) {
+		bio->bi_iter.bi_sector = sector + zone->dev_start +
+					 rdev->data_offset;
+		goto single_bio;
+	}
+
+	if (unlikely(zone->nb_dev == 1)) {
+		sectors = conf->strip_zone[0].zone_end -
+			  sector;
+		if (bio_sectors(bio) > sectors)
+			bio->bi_iter.bi_size = sectors << 9;
+		bio->bi_iter.bi_sector = sector + rdev->data_offset;
+		goto single_bio;
+	}
+
+	handle_large_discard_bio(mddev, bio);
+	return;
+
+single_bio:
+	if (unlikely(!blk_queue_discard(bdev_get_queue(bio->bi_bdev))))
+		/* Just ignore it */
+		bio_endio(bio);
+	else
+		generic_make_request(bio);
+
+	return;
+}
+
+
 static void raid0_make_request(struct mddev *mddev, struct bio *bio)
 {
 	struct strip_zone *zone;
@@ -463,6 +723,11 @@ static void raid0_make_request(struct mddev *mddev, struct bio *bio)
 		return;
 	}
 
+	if (unlikely(bio_op(bio) == REQ_OP_DISCARD)) {
+		handle_discard_bio(mddev, bio);
+		return;
+	}
+
 	do {
 		sector_t sector = bio->bi_iter.bi_sector;
 		unsigned chunk_sects = mddev->chunk_sectors;


* Re: Input/Output error reading from a clean raid
From: Salatiel Filho @ 2017-01-23 14:02 UTC
  To: linux-raid
In-Reply-To: <20170123010334.GA7546@metamorpher.de>

OK, I have run echo check >>/sys/block/md1/md/sync_action, and here is the output of
mdadm --examine-badblocks /dev/sdd1 /dev/sdg1 /dev/sdf1 /dev/sde1:

Bad-blocks on /dev/sdd1:
          1515723072 for 512 sectors
          1515723584 for 512 sectors
          1515724096 for 512 sectors
          1515724608 for 512 sectors
          1515725120 for 512 sectors
          1515725632 for 512 sectors
          1515726144 for 512 sectors
          1515726656 for 512 sectors
          1515727168 for 512 sectors
          1515727680 for 512 sectors
          1515728192 for 512 sectors
          1515728704 for 512 sectors
          1515729216 for 512 sectors
          1515729728 for 512 sectors
          1515730240 for 512 sectors
          1515730752 for 512 sectors
          1515731264 for 512 sectors
          1515731776 for 512 sectors
          1515732288 for 512 sectors
          1515732800 for 512 sectors
          1515733312 for 512 sectors
          1515733824 for 512 sectors
          1515734336 for 512 sectors
          1515734848 for 512 sectors
          1515735360 for 512 sectors
          1515735872 for 512 sectors
          1515736384 for 512 sectors
          1515736896 for 512 sectors
          1515737408 for 512 sectors
          1515737920 for 512 sectors
          1515738432 for 512 sectors
          1515738944 for 512 sectors
          1515739456 for 512 sectors
          1515739968 for 512 sectors
          1515740480 for 512 sectors
          1515740992 for 512 sectors
          1515741504 for 512 sectors
          1515742016 for 192 sectors
          1515743712 for 512 sectors
          1515744224 for 512 sectors
          1515744736 for 512 sectors
          1515745248 for 512 sectors
          1515745760 for 512 sectors
          1515746272 for 512 sectors
          1515746784 for 512 sectors
          1515747296 for 512 sectors
          1515747808 for 512 sectors
          1515748320 for 512 sectors
          1515749072 for 304 sectors
          1515750400 for 512 sectors
          1515750912 for 512 sectors
          1515751424 for 512 sectors
          1515751936 for 512 sectors
          1515752448 for 512 sectors
          1515752960 for 512 sectors
          1515753472 for 512 sectors
          1515753984 for 512 sectors
          1515754496 for 232 sectors
Bad-blocks list is empty in /dev/sdg1
Bad-blocks list is empty in /dev/sdf1
Bad-blocks on /dev/sde1:
          1515723072 for 512 sectors
          1515723584 for 512 sectors
          1515724096 for 512 sectors
          1515724608 for 512 sectors
          1515725120 for 512 sectors
          1515725632 for 512 sectors
          1515726144 for 512 sectors
          1515726656 for 512 sectors
          1515727168 for 512 sectors
          1515727680 for 512 sectors
          1515728192 for 512 sectors
          1515728704 for 512 sectors
          1515729216 for 512 sectors
          1515729728 for 512 sectors
          1515730240 for 512 sectors
          1515730752 for 512 sectors
          1515731264 for 512 sectors
          1515731776 for 512 sectors
          1515732288 for 512 sectors
          1515732800 for 512 sectors
          1515733312 for 512 sectors
          1515733824 for 512 sectors
          1515734336 for 512 sectors
          1515734848 for 512 sectors
          1515735360 for 512 sectors
          1515735872 for 512 sectors
          1515736384 for 512 sectors
          1515736896 for 512 sectors
          1515737408 for 512 sectors
          1515737920 for 512 sectors
          1515738432 for 512 sectors
          1515738944 for 512 sectors
          1515739456 for 512 sectors
          1515739968 for 512 sectors
          1515740480 for 512 sectors
          1515740992 for 512 sectors
          1515741504 for 512 sectors
          1515742016 for 192 sectors
          1515743712 for 512 sectors
          1515744224 for 512 sectors
          1515744736 for 512 sectors
          1515745248 for 512 sectors
          1515745760 for 512 sectors
          1515746272 for 512 sectors
          1515746784 for 512 sectors
          1515747296 for 512 sectors
          1515747808 for 512 sectors
          1515748320 for 512 sectors
          1515749072 for 304 sectors
          1515750400 for 512 sectors
          1515750912 for 512 sectors
          1515751424 for 512 sectors
          1515751936 for 512 sectors
          1515752448 for 512 sectors
          1515752960 for 512 sectors
          1515753472 for 512 sectors
          1515753984 for 512 sectors
          1515754496 for 232 sectors
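
A list this long is easier to read once adjacent entries are merged. The
following Python sketch parses output in the format shown above and
collapses it into contiguous extents; the function names are my own and
the parsing assumes exactly this text layout.

```python
import re


def parse_badblocks(text):
    """Parse --examine-badblocks text into {device: [(start, length), ...]}."""
    result = {}
    device = None
    for line in text.splitlines():
        header = re.match(r"Bad-blocks on (\S+):", line)
        if header:
            device = header.group(1)
            result[device] = []
            continue
        empty = re.match(r"Bad-blocks list is empty in (\S+)", line)
        if empty:
            result[empty.group(1)] = []
            device = None
            continue
        entry = re.match(r"\s*(\d+) for (\d+) sectors", line)
        if entry and device:
            result[device].append((int(entry.group(1)), int(entry.group(2))))
    return result


def merge_extents(entries):
    """Merge adjacent or overlapping (start, length) entries into (start, end)."""
    merged = []
    for start, length in sorted(entries):
        end = start + length
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# A short excerpt of the output above.
SAMPLE = (
    "Bad-blocks on /dev/sdd1:\n"
    "          1515741504 for 512 sectors\n"
    "          1515742016 for 192 sectors\n"
    "          1515743712 for 512 sectors\n"
    "Bad-blocks list is empty in /dev/sdg1\n"
)

for dev, entries in parse_badblocks(SAMPLE).items():
    print(dev, merge_extents(entries))
```

By my reading, the full list above collapses to four extents per device,
and they are the same on sdd1 and sde1.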
[]'s
Salatiel


On Sun, Jan 22, 2017 at 10:03 PM, Andreas Klauer
<Andreas.Klauer@metamorpher.de> wrote:
> On Sun, Jan 22, 2017 at 11:08:40AM -0300, Salatiel Filho wrote:
>> Any ideas what could it be ?
>
> mdadm --examine-badblocks
>
> Regards
> Andreas Klauer


* Re: Input/Output error reading from a clean raid
From: Salatiel Filho @ 2017-01-23 14:42 UTC
  To: John Stoffel; +Cc: linux-raid
In-Reply-To: <22661.19403.314279.130591@quad.stoffel.home>

The output of the command is:

# dd if=Fotos.zip of=/dev/null
dd: error reading ‘Fotos.zip’: Input/output error
328704+0 records in
328704+0 records out
168296448 bytes (168 MB) copied, 0.127723 s, 1.3 GB/s

or

# cp Fotos.zip /tmp/
cp: error reading ‘Fotos.zip’: Input/output error
cp: failed to extend ‘/tmp/Fotos.zip’: Input/output error


There is nothing in dmesg after running those commands.

[]'s
Salatiel




* Re: Input/Output error reading from a clean raid
From: John Stoffel @ 2017-01-23 16:12 UTC
  To: Salatiel Filho; +Cc: John Stoffel, linux-raid
In-Reply-To: <CAGmni9qwyHo+G8e4APF99Ai-N_d5WfOhkT0xLoq-8tVkTOjonA@mail.gmail.com>


Salatiel> The output of the command is:
Salatiel> # dd if=Fotos.zip of=/dev/null
Salatiel> dd: error reading ‘Fotos.zip’: Input/output error
Salatiel> 328704+0 records in
Salatiel> 328704+0 records out
Salatiel> 168296448 bytes (168 MB) copied, 0.127723 s, 1.3 GB/s

Salatiel> or

Salatiel> # cp Fotos.zip /tmp/
Salatiel> cp: error reading ‘Fotos.zip’: Input/output error
Salatiel> cp: failed to extend ‘/tmp/Fotos.zip’: Input/output error

Can you do an 'unzip -l Fotos.zip' and get anything back?  It looks like
the first 168 MB might be ok... so you might get something back.

You might also want to try starting a dd from record 328705
(or even a couple of records farther) to see if you can get anything
else from there.
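
To show what "read past the error" means, here is a minimal Python sketch
of the idea: read block by block and zero-fill whatever cannot be read.
The names salvage and read_block are hypothetical, and this is far
simpler than what ddrescue does (no retries, no map file).

```python
def salvage(read_block, n_blocks, block_size=512):
    """Read blocks 0..n_blocks-1, zero-filling any block that raises OSError.

    read_block(i) is a caller-supplied function returning the bytes of
    block i or raising OSError (for example a wrapper around os.pread()).
    Returns (data, bad) where bad lists the unreadable block numbers.
    """
    out = bytearray()
    bad = []
    for i in range(n_blocks):
        try:
            block = read_block(i)
        except OSError:
            block = b"\0" * block_size  # placeholder for the lost block
            bad.append(i)
        out += block
    return bytes(out), bad


# dd stopped after 328704 records of 512 bytes, i.e. at this byte offset:
print(328704 * 512)


# Simulated reader where block 2 is unreadable.
def fake_reader(i):
    if i == 2:
        raise OSError("simulated I/O error")
    return bytes([i]) * 512


data, bad = salvage(fake_reader, 4)
print(len(data), bad)
```

The zero-filled holes keep later data at its original offset, which
matters for a zip file because its directory sits at the end.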

In this case, the tool 'ddrescue' might be your answer, since it is
designed to handle errors like this and continue reading past errors.
It might, or might not, let you get more of your data back.  On debian based
systems you should be able to just do:

	apt-get install gddrescue

or just do:

   apt-cache search ddrescue

On Red Hat or Fedora you could do:

   dnf search ddrescue

too.

Did you run the "echo check > ..." command at all?  What did it say in
the output of: cat /proc/mdstat  when you did this?  

Salatiel> There is nothing on dmesg after running those commands;

You might be out of luck.  This is one reason why I like A) mirroring
my data and B) saving multiple copies to multiple locations.  Storage
is cheap these days.

Though I admit I'm not perfect either.

Please get us more information so we can try to help more.

Also, have you unmounted the filesystem and done an 'fsck -y /dev/...'
on it as well?  You might want to do a more in-depth check of the
filesystem to see if there's any corruption somewhere.

Also, going to the end of the file, and seeking backwards and reading
off blocks might help you recover more of the zip file.
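
The reason the tail is worth a try: a zip archive keeps its central
directory at the end of the file, so if the last blocks are readable the
file listing may survive even when the middle does not.  A minimal sketch
of walking backwards, assuming 512-byte blocks; readable_tail() is a
hypothetical helper name, not an existing tool:

```python
import os


def readable_tail(path, block=512):
    """Return the byte offset at which the readable tail of path begins.

    Walks backwards one block at a time from the end of the file and stops
    at the first block that fails to read.  A result of 0 means every block
    was readable; a result equal to the file size means even the last block
    failed.
    """
    size = os.path.getsize(path)
    fd = os.open(path, os.O_RDONLY)
    try:
        end = size
        while end > 0:
            start = ((end - 1) // block) * block  # block-aligned start
            try:
                data = os.pread(fd, end - start, start)
            except OSError:
                break  # first unreadable block, counting from the end
            if len(data) != end - start:
                break
            end = start
        return end
    finally:
        os.close(fd)
```

Everything from the returned offset to the end of the file can then be
copied out with dd using a matching skip= value.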



Salatiel> On Sun, Jan 22, 2017 at 9:18 PM, John Stoffel <john@stoffel.org> wrote:
>> 
Salatiel> I am trying to recover a few files from my backup. The
Salatiel> backup is on a raid 5 + ext4.  There are several files where
Salatiel> i get I/O error. The raid appears to be clean and fsck shows
Salatiel> no errors. Any ideas what could it be ?
>> 
Salatiel> md1 : active raid5 sdd1[0] sdg1[4] sdf1[2] sde1[1]
Salatiel> 3220829184 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
Salatiel> bitmap: 1/8 pages [4KB], 65536KB chunk
>> 
>> It would help if you could post the error(s) you're getting, along
>> with any output from dmesg during that time.  Have you done a full
>> scan of the disk looking for errors?  You might just have silent
>> read errors in your array.  So as root do:
>> 
>> # echo check >>/sys/block/md??/md/sync_action
>> 
>> where md?? is the name of your md array you want to check.  You can
>> get the name from:
>> 
>> cat /proc/mdstat
>> 
>> and of course it would help to post that info as well if you want more
>> help.
>> 
>> John


* Re: Input/Output error reading from a clean raid
From: John Stoffel @ 2017-01-23 17:07 UTC (permalink / raw)
  To: Salatiel Filho; +Cc: linux-raid
In-Reply-To: <CAGmni9pazRfia++-s5Hn4FT+ubVzEhsnrQrKC3cSLja53-SizQ@mail.gmail.com>

>>>>> "Salatiel" == Salatiel Filho <salatiel.filho@gmail.com> writes:

Salatiel> Ok, i have run echo check >>/sys/block/md1/md/sync_action,
Salatiel> and now the output of mdadm mdadm --examine-badblocks
Salatiel> /dev/sdd1 /dev/sdg1 /dev/sdf1 /dev/sde1


Salatiel> Bad-blocks on /dev/sdd1:
Salatiel>           1515723072 for 512 sectors
Salatiel>           1515723584 for 512 sectors
Salatiel>           1515724096 for 512 sectors
Salatiel>           1515724608 for 512 sectors

You have bad disks in your array.  First thing off is that I would go
buy replacements and then use 'ddrescue' to copy the data from the old
disks to new disks.  Then I'd try to assemble the NEW disks only into
an array, and then I'd fsck the filesystem(s).

You're going to lose data, no doubt about it.  You're now in the mode
where you're trying to save as much as you can as quickly as possible.

Personally, I'd be setting up a RAID6 array for your new setup.  Then
I would also be setting up weekly checks of the raid array as well.

You're going to lose data no matter what.  So get new disks and start
copying what you can.

John


* Re: Input/Output error reading from a clean raid
From: Wols Lists @ 2017-01-23 17:23 UTC (permalink / raw)
  To: John Stoffel, Salatiel Filho; +Cc: linux-raid
In-Reply-To: <22662.14390.551624.557189@quad.stoffel.home>

On 23/01/17 17:07, John Stoffel wrote:
> You have bad disks in your array.  First thing off is that I would go
> buy replacements and then use 'ddrescue' to copy the data from the old
> disks to new disks.  Then I'd try to assemble the NEW disks only into
> an array, and then I'd fsck the filesystem(s).
> 
> You're going to lose data, no doubt about it.  You're now in the mode
> where you're trying to save as much as you can as quickly as possible.
> 
> Personally, I'd be setting up a RAID6 array for your new setup.  Then
> I would also be setting up weekly checks of the raid array as well.
> 
> You're going to lose data no matter what.  So get new disks and start
> copying what you can.

Go read the raid wiki. https://raid.wiki.kernel.org/index.php/Linux_Raid

Especially replacing a failed drive
https://raid.wiki.kernel.org/index.php/Replacing_a_failed_drive

And please, can you get the ddrescue error log that the page mentions and
email me a copy? If you've got some Perl, Python or shell skills, maybe
you could even write the script it mentions (which is described in a bit
more detail in programming projects
https://raid.wiki.kernel.org/index.php/Programming_projects). Otherwise
I'll try to write it myself. It might be a good way of learning Python :-)
but at the moment I think I'm learning by jumping in out of my depth, so
we'll see how far I get :-)

Cheers,
Wol




* Re: Input/Output error reading from a clean raid
From: Andreas Klauer @ 2017-01-23 17:34 UTC (permalink / raw)
  To: Salatiel Filho; +Cc: linux-raid
In-Reply-To: <CAGmni9pazRfia++-s5Hn4FT+ubVzEhsnrQrKC3cSLja53-SizQ@mail.gmail.com>

On Mon, Jan 23, 2017 at 11:02:24AM -0300, Salatiel Filho wrote:
> mdadm mdadm --examine-badblocks /dev/sdd1 /dev/sdg1 /dev/sdf1  /dev/sde1
> 
> Bad-blocks on /dev/sdd1:
>           1515723072 for 512 sectors
> Bad-blocks on /dev/sde1:
>           1515723072 for 512 sectors

md believes you have bad blocks in identical places so it won't return 
whatever data is in these blocks. Thus you get read errors even if there 
is no bad block on the disk itself. Those bad block entries can be caused 
by cable or controller flukes, making temporary problems permanent...

Personally I disable the bad block list everywhere.

You can search this list for old messages regarding --examine-badblocks, 
this problem came up several times. Clearing the mdadm bad block list is 
worth a try. There's an undocumented option, update=force-no-bbl or such.
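
To see which entries are recorded on more than one member, the
--examine-badblocks output can be cross-checked with a few lines of Python.
This is only a sketch: it assumes the output format shown above, it matches
entries that are exactly identical (same start, same length) rather than
computing partial overlaps, and the function names are made up. The sample
text is pieced together from the excerpts quoted in this thread.

```python
import re
from collections import defaultdict

HEADER = re.compile(r"Bad-blocks on (\S+):")
ENTRY = re.compile(r"(\d+) for (\d+) sectors")


def parse_badblocks(text):
    """Map each device to its list of (start_sector, length) entries."""
    table = defaultdict(list)
    device = None
    for line in text.splitlines():
        header = HEADER.search(line)
        if header:
            device = header.group(1)
            continue
        entry = ENTRY.search(line)
        if entry and device is not None:
            table[device].append((int(entry.group(1)), int(entry.group(2))))
    return dict(table)


def shared_entries(table, min_devices=2):
    """Return entries that appear on at least min_devices devices."""
    seen = defaultdict(list)
    for device in sorted(table):
        for entry in set(table[device]):
            seen[entry].append(device)
    return {e: devs for e, devs in seen.items() if len(devs) >= min_devices}


SAMPLE = """\
Bad-blocks on /dev/sdd1:
          1515723072 for 512 sectors
          1515723584 for 512 sectors
Bad-blocks on /dev/sde1:
          1515723072 for 512 sectors
"""

shared = shared_entries(parse_badblocks(SAMPLE))
for (start, length), devices in sorted(shared.items()):
    print(f"{start} for {length} sectors: {', '.join(devices)}")
```

With single parity, any entry that shows up on two members at once is a
region md will refuse to reconstruct, which fits the read errors above.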

Regards
Andreas Klauer


* Re: performance of raid5 on fast devices
From: Jake Yao @ 2017-01-23 22:20 UTC (permalink / raw)
  To: Coly Li; +Cc: Heinz Mauelshagen, Roman Mamedov, linux-raid
In-Reply-To: <a7cf3ce7-688d-819c-60ab-a9b53e14d869@suse.de>

I ran tests with multiple I/O threads, but it does not look like that
affects the overall performance.

This run used 8 I/O threads:

[global]
ioengine=libaio
iodepth=64
bs=192k
direct=1
thread=1
time_based=1
runtime=20
numjobs=8
loops=1
group_reporting=1
rwmixread=70
rwmixwrite=30
exitall
#
# end of global
#
[nvme_md_write]
rw=write
filename=/dev/md127
runtime=20

[nvme_drv_write]
rw=write
filename=/dev/nvme1n1p2
runtime=20

I got the following for the NVMe-based raid5 and the single drive:

md thrd-cnt 0: write: io=27992MB, bw=1397.5MB/s, iops=7452, runt= 20031msec
md thrd-cnt 1: write: io=43065MB, bw=2148.6MB/s, iops=11458, runt= 20044msec
md thrd-cnt 2: write: io=43209MB, bw=2155.9MB/s, iops=11497, runt= 20043msec
md thrd-cnt 3: write: io=43163MB, bw=2153.9MB/s, iops=11487, runt= 20040msec
md thrd-cnt 4: write: io=43316MB, bw=2163.2MB/s, iops=11536, runt= 20024msec
md thrd-cnt 5: write: io=43390MB, bw=2164.7MB/s, iops=11544, runt= 20045msec
md thrd-cnt 6: write: io=43295MB, bw=2160.2MB/s, iops=11521, runt= 20042msec
single drive: write: io=36004MB, bw=1795.4MB/s, iops=9575, runt= 20054msec

It also shows little effect on the SSD-based raid5 and single
drive. Same fio config as above, just changing the corresponding
device filenames. The results are as follows:

md thrd-cnt 0: write: io=13646MB, bw=696242KB/s, iops=3626, runt= 20070msec
md thrd-cnt 1: write: io=24519MB, bw=1221.5MB/s, iops=6514, runt= 20074msec
md thrd-cnt 2: write: io=24780MB, bw=1234.9MB/s, iops=6585, runt= 20068msec
md thrd-cnt 3: write: io=24890MB, bw=1240.2MB/s, iops=6613, runt= 20072msec
md thrd-cnt 4: write: io=24937MB, bw=1242.5MB/s, iops=6626, runt= 20071msec
md thrd-cnt 5: write: io=24948MB, bw=1242.9MB/s, iops=6628, runt= 20073msec
md thrd-cnt 6: write: io=24701MB, bw=1230.1MB/s, iops=6564, runt= 20068msec
single drive: write: io=8389.4MB, bw=428184KB/s, iops=2230, runt= 20063msec

In the ssd case, raid5 array is 3x better than a single drive.
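
To put numbers on that comparison, here is a quick calculation in Python
using the figures above. It assumes fio's KB/s and MB/s differ by a factor
of 1024, which fits the data (2230 iops * 192 KB = 428160 KB/s, close to
the reported 428184 KB/s). I picked the thrd-cnt 5 rows as the md figure
since they are the highest in both tables.

```python
def mb_per_s(value, unit):
    """Normalise a fio bandwidth figure to MB/s (1 MB = 1024 KB here)."""
    if unit == "MB/s":
        return value
    if unit == "KB/s":
        return value / 1024
    raise ValueError(f"unexpected unit: {unit}")


# NVMe: md with thrd-cnt 5 versus the single drive
nvme_ratio = mb_per_s(2164.7, "MB/s") / mb_per_s(1795.4, "MB/s")

# SSD: md with thrd-cnt 5 versus the single drive
ssd_ratio = mb_per_s(1242.9, "MB/s") / mb_per_s(428184, "KB/s")

# NVMe md: gain from going from thrd-cnt 0 to thrd-cnt 1
thread_gain = mb_per_s(2148.6, "MB/s") / mb_per_s(1397.5, "MB/s")

print(f"nvme raid5 vs single drive: {nvme_ratio:.2f}x")
print(f"ssd raid5 vs single drive:  {ssd_ratio:.2f}x")
print(f"nvme thrd-cnt 0 -> 1:       {thread_gain:.2f}x")
```

So the SSD array is close to 3x a single drive, while the NVMe array is
only about 1.2x, and nearly all of the gain comes from the first worker
thread.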

On Fri, Jan 20, 2017 at 9:58 AM, Coly Li <colyli@suse.de> wrote:
> On 2017/1/19 上午3:25, Jake Yao wrote:
>> It is interesting. I do not see the similar behavior with the change
>> of group_thread_cnt.
>>
>> The raid5 I have is following:
>>
>> md125 : active raid5 nvme0n1p1[0] nvme2n1p1[2] nvme1n1p1[1] nvme3n1p1[4]
>>       943325184 blocks super 1.2 level 5, 32k chunk, algorithm 2 [4/4] [UUUU]
>>       bitmap: 0/3 pages [0KB], 65536KB chunk
>>
>> /dev/md125:
>>         Version : 1.2
>>   Creation Time : Thu Dec 15 20:11:46 2016
>>      Raid Level : raid5
>>      Array Size : 943325184 (899.63 GiB 965.96 GB)
>>   Used Dev Size : 314441728 (299.88 GiB 321.99 GB)
>>    Raid Devices : 4
>>   Total Devices : 4
>>     Persistence : Superblock is persistent
>>
>>   Intent Bitmap : Internal
>>
>>     Update Time : Wed Jan 18 16:24:52 2017
>>           State : clean
>>  Active Devices : 4
>> Working Devices : 4
>>  Failed Devices : 0
>>   Spare Devices : 0
>>
>>          Layout : left-symmetric
>>      Chunk Size : 32K
>>
>>            Name : localhost:nvme  (local to host localhost)
>>            UUID : 477a94af:79f5a10a:0d513dc6:7f5e670d
>>          Events : 108
>>
>>     Number   Major   Minor   RaidDevice State
>>        0     259        6        0      active sync   /dev/nvme0n1p1
>>        1     259        8        1      active sync   /dev/nvme1n1p1
>>        2     259        9        2      active sync   /dev/nvme2n1p1
>>        4     259        1        3      active sync   /dev/nvme3n1p1
>>
>> The fio config is:
>>
>> [global]
>> ioengine=libaio
>> iodepth=64
>> bs=96K
>> direct=1
>> thread=1
>> time_based=1
>> runtime=20
>> numjobs=1
>
> You only have 1 I/O thread, bottle neck is here. Have a try with numjobs=8.
>
>> loops=1
>> group_reporting=1
>> exitall
> [snip]
>
> Coly


* [PATCH v2 1/3] md/r5cache: flush data only stripes in r5l_recovery_log()
From: Song Liu @ 2017-01-24  1:12 UTC (permalink / raw)
  To: linux-raid
  Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuan,
	liuyun01, Song Liu, Jes.Sorensen

For safer operation, all arrays start in write-through mode.
However, if recovery found data-only stripes before the shutdown
(from previous write-back mode), it is not safe to run the array
in write-through mode. To solve this problem, we flush all data-only
stripes in r5l_recovery_log(). This logic is implemented in
r5c_recovery_flush_data_only_stripes():

1. enable write back cache
2. flush all stripes
3. wake up conf->mddev->thread
4. wait for all stripes to get flushed (reuse wait_for_quiescent)
5. disable write back cache

The wait in 4 will be woken up in release_inactive_stripe_list()
when conf->active_stripes reaches 0.

It is safe to wake up mddev->thread here because all the resources
required for the thread have been initialized.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/md.c          |  5 +++++
 drivers/md/raid5-cache.c | 56 ++++++++++++++++++++++++++++++++++--------------
 2 files changed, 45 insertions(+), 16 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 0abb147..85ac984 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5333,6 +5333,11 @@ int md_run(struct mddev *mddev)
 	if (start_readonly && mddev->ro == 0)
 		mddev->ro = 2; /* read-only, but switch on first write */
 
+	/*
+	 * NOTE: some pers->run(), for example r5l_recovery_log(), wakes
+	 * up mddev->thread. It is important to initialize critical
+	 * resources for mddev->thread BEFORE calling pers->run().
+	 */
 	err = pers->run(mddev);
 	if (err)
 		pr_warn("md: pers->run() failed ...\n");
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 3da5e2a..00d2838 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -2102,7 +2102,7 @@ static int
 r5c_recovery_rewrite_data_only_stripes(struct r5l_log *log,
 				       struct r5l_recovery_ctx *ctx)
 {
-	struct stripe_head *sh, *next;
+	struct stripe_head *sh;
 	struct mddev *mddev = log->rdev->mddev;
 	struct page *page;
 	sector_t next_checkpoint = MaxSector;
@@ -2116,7 +2116,7 @@ r5c_recovery_rewrite_data_only_stripes(struct r5l_log *log,
 
 	WARN_ON(list_empty(&ctx->cached_list));
 
-	list_for_each_entry_safe(sh, next, &ctx->cached_list, lru) {
+	list_for_each_entry(sh, &ctx->cached_list, lru) {
 		struct r5l_meta_block *mb;
 		int i;
 		int offset;
@@ -2166,14 +2166,39 @@ r5c_recovery_rewrite_data_only_stripes(struct r5l_log *log,
 		ctx->pos = write_pos;
 		ctx->seq += 1;
 		next_checkpoint = sh->log_start;
-		list_del_init(&sh->lru);
-		raid5_release_stripe(sh);
 	}
 	log->next_checkpoint = next_checkpoint;
 	__free_page(page);
 	return 0;
 }
 
+static void r5c_recovery_flush_data_only_stripes(struct r5l_log *log,
+						 struct r5l_recovery_ctx *ctx)
+{
+	struct mddev *mddev = log->rdev->mddev;
+	struct r5conf *conf = mddev->private;
+	struct stripe_head *sh, *next;
+
+	if (ctx->data_only_stripes == 0)
+		return;
+
+	log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_BACK;
+
+	list_for_each_entry_safe(sh, next, &ctx->cached_list, lru) {
+		r5c_make_stripe_write_out(sh);
+		set_bit(STRIPE_HANDLE, &sh->state);
+		list_del_init(&sh->lru);
+		raid5_release_stripe(sh);
+	}
+
+	md_wakeup_thread(conf->mddev->thread);
+	/* reuse conf->wait_for_quiescent in recovery */
+	wait_event(conf->wait_for_quiescent,
+		   atomic_read(&conf->active_stripes) == 0);
+
+	log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_THROUGH;
+}
+
 static int r5l_recovery_log(struct r5l_log *log)
 {
 	struct mddev *mddev = log->rdev->mddev;
@@ -2200,32 +2225,31 @@ static int r5l_recovery_log(struct r5l_log *log)
 	pos = ctx.pos;
 	ctx.seq += 10000;
 
-	if (ctx.data_only_stripes == 0) {
-		log->next_checkpoint = ctx.pos;
-		r5l_log_write_empty_meta_block(log, ctx.pos, ctx.seq++);
-		ctx.pos = r5l_ring_add(log, ctx.pos, BLOCK_SECTORS);
-	}
 
 	if ((ctx.data_only_stripes == 0) && (ctx.data_parity_stripes == 0))
 		pr_debug("md/raid:%s: starting from clean shutdown\n",
 			 mdname(mddev));
-	else {
+	else
 		pr_debug("md/raid:%s: recovering %d data-only stripes and %d data-parity stripes\n",
 			 mdname(mddev), ctx.data_only_stripes,
 			 ctx.data_parity_stripes);
 
-		if (ctx.data_only_stripes > 0)
-			if (r5c_recovery_rewrite_data_only_stripes(log, &ctx)) {
-				pr_err("md/raid:%s: failed to rewrite stripes to journal\n",
-				       mdname(mddev));
-				return -EIO;
-			}
+	if (ctx.data_only_stripes == 0) {
+		log->next_checkpoint = ctx.pos;
+		r5l_log_write_empty_meta_block(log, ctx.pos, ctx.seq++);
+		ctx.pos = r5l_ring_add(log, ctx.pos, BLOCK_SECTORS);
+	} else if (r5c_recovery_rewrite_data_only_stripes(log, &ctx)) {
+		pr_err("md/raid:%s: failed to rewrite stripes to journal\n",
+		       mdname(mddev));
+		return -EIO;
 	}
 
 	log->log_start = ctx.pos;
 	log->seq = ctx.seq;
 	log->last_checkpoint = pos;
 	r5l_write_super(log, pos);
+
+	r5c_recovery_flush_data_only_stripes(log, &ctx);
 	return 0;
 }
 
-- 
2.9.3



* [PATCH v2 2/3] md/r5cache: shift complex rmw from read path to write path
From: Song Liu @ 2017-01-24  1:12 UTC (permalink / raw)
  To: linux-raid
  Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuan,
	liuyun01, Song Liu, Jes.Sorensen
In-Reply-To: <20170124011259.3351506-1-songliubraving@fb.com>

Write back cache requires a complex RMW mechanism, where old data is
read into dev->orig_page for prexor, and then xor is done with
dev->page. This logic is already implemented in the write path.

However, the current read path is not aware of this requirement. When
the array is optimal, the RMW is not required, as the data are
read from raid disks. However, when the target stripe is degraded,
complex RMW is required to generate the right data.

To keep read path as clean as possible, we handle read path by
flushing degraded, in-journal stripes before processing reads to
missing dev.

Specifically, when there are read requests to a degraded stripe
with data in journal, handle_stripe_fill() calls
r5c_make_stripe_write_out() and exits. Then handle_stripe_dirtying()
will do the complex RMW and flush the stripe to RAID disks. After
that, read requests are handled.

There is one more corner case, when there is a non-overwrite bio for
the missing (or out of sync) dev. handle_stripe_dirtying() will not
be able to process the non-overwrite bios without constructing the
data in handle_stripe_fill(). This is fixed by delaying non-overwrite
bios in handle_stripe_dirtying(). So handle_stripe_fill() works on
these bios after the stripe is flushed to raid disks.
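
As an illustration only (a Python model, not kernel code), the condition
for delaying a towrite can be written as a predicate over three inputs.
The parameter names are stand-ins for the R5_OVERWRITE and R5_Insync dev
flags and for s->injournal; enumerating every combination shows that
exactly one case is delayed.

```python
from itertools import product


def delay_towrite(overwrite, insync, injournal):
    """Model of the rule: delay a non-overwrite to a dev that is not in
    sync when the stripe has data in the journal."""
    return (not overwrite) and (not insync) and injournal > 0


# Every combination of (overwrite, insync, injournal) that gets delayed.
delayed = [
    combo
    for combo in product([False, True], [False, True], [0, 1])
    if delay_towrite(*combo)
]
print(delayed)
```

The only delayed combination is a partial write to a missing or
out-of-sync dev while the journal holds data for the stripe.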

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/raid5.c | 48 ++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 44 insertions(+), 4 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index f060ad6..ad8f24c 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2934,6 +2934,30 @@ sector_t raid5_compute_blocknr(struct stripe_head *sh, int i, int previous)
 	return r_sector;
 }
 
+/*
+ * There are cases where we want handle_stripe_dirtying() and
+ * schedule_reconstruction() to delay towrite to some dev of a stripe.
+ *
+ * This function checks whether we want to delay the towrite. Specifically,
+ * we delay the towrite when:
+ *
+ *   1. degraded stripe has a non-overwrite to the missing dev, AND this
+ *      stripe has data in journal (for other devices).
+ *
+ *      In this case, when reading data for the non-overwrite dev, it is
+ *      necessary to handle complex rmw of write back cache (prexor with
+ *      orig_page, and xor with page). To keep read path simple, we would
+ *      like to flush data in journal to RAID disks first, so complex rmw
+ *      is handled in the write path (handle_stripe_dirtying).
+ *
+ */
+static inline bool delay_towrite(struct r5dev *dev,
+				   struct stripe_head_state *s)
+{
+	return !test_bit(R5_OVERWRITE, &dev->flags) &&
+		!test_bit(R5_Insync, &dev->flags) && s->injournal;
+}
+
 static void
 schedule_reconstruction(struct stripe_head *sh, struct stripe_head_state *s,
 			 int rcw, int expand)
@@ -2954,7 +2978,7 @@ schedule_reconstruction(struct stripe_head *sh, struct stripe_head_state *s,
 		for (i = disks; i--; ) {
 			struct r5dev *dev = &sh->dev[i];
 
-			if (dev->towrite) {
+			if (dev->towrite && !delay_towrite(dev, s)) {
 				set_bit(R5_LOCKED, &dev->flags);
 				set_bit(R5_Wantdrain, &dev->flags);
 				if (!expand)
@@ -3531,10 +3555,25 @@ static void handle_stripe_fill(struct stripe_head *sh,
 	 * midst of changing due to a write
 	 */
 	if (!test_bit(STRIPE_COMPUTE_RUN, &sh->state) && !sh->check_state &&
-	    !sh->reconstruct_state)
+	    !sh->reconstruct_state) {
+
+		/* for degraded stripe with data in journal, do not handle
+		 * read requests yet, instead, flush the stripe to raid
+		 * disks first, this avoids handling complex rmw of write
+		 * back cache (prexor with orig_page, and then xor with
+		 * page) in the read path
+		 */
+		if (s->injournal && s->failed) {
+			if (test_bit(STRIPE_R5C_CACHING, &sh->state))
+				r5c_make_stripe_write_out(sh);
+			goto out;
+		}
+
 		for (i = disks; i--; )
 			if (fetch_block(sh, s, i, disks))
 				break;
+	}
+out:
 	set_bit(STRIPE_HANDLE, &sh->state);
 }
 
@@ -3690,7 +3729,8 @@ static int handle_stripe_dirtying(struct r5conf *conf,
 	} else for (i = disks; i--; ) {
 		/* would I have to read this buffer for read_modify_write */
 		struct r5dev *dev = &sh->dev[i];
-		if ((dev->towrite || i == sh->pd_idx || i == sh->qd_idx ||
+		if (((dev->towrite && !delay_towrite(dev, s)) ||
+		     i == sh->pd_idx || i == sh->qd_idx ||
 		     test_bit(R5_InJournal, &dev->flags)) &&
 		    !test_bit(R5_LOCKED, &dev->flags) &&
 		    !(uptodate_for_rmw(dev) ||
@@ -3754,7 +3794,7 @@ static int handle_stripe_dirtying(struct r5conf *conf,
 
 		for (i = disks; i--; ) {
 			struct r5dev *dev = &sh->dev[i];
-			if ((dev->towrite ||
+			if (((dev->towrite && !delay_towrite(dev, s)) ||
 			     i == sh->pd_idx || i == sh->qd_idx ||
 			     test_bit(R5_InJournal, &dev->flags)) &&
 			    !test_bit(R5_LOCKED, &dev->flags) &&
-- 
2.9.3



* [PATCH v2 3/3] md/r5cache: disable write back for degraded array
From: Song Liu @ 2017-01-24  1:12 UTC (permalink / raw)
  To: linux-raid
  Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuan,
	liuyun01, Song Liu, Jes.Sorensen
In-Reply-To: <20170124011259.3351506-1-songliubraving@fb.com>

write-back cache in degraded mode introduces corner cases to the array.
Although we try to cover all these corner cases, it is safer to just
disable write-back cache when the array is in degraded mode.

In this patch, we disable writeback cache for degraded mode:
1. On device failure, if the array enters degraded mode, raid5_error()
   will submit async job r5c_disable_writeback_async to disable
   writeback;
2. In r5c_journal_mode_store(), it is invalid to enable writeback in
   degraded mode;
3. In r5c_try_caching_write(), stripes with s->failed>0 will be handled
   in write-through mode.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/raid5-cache.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
 drivers/md/raid5.c       |  3 ++-
 drivers/md/raid5.h       |  2 ++
 3 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 00d2838..55f1a37 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -164,6 +164,9 @@ struct r5l_log {
 	/* to submit async io_units, to fulfill ordering of flush */
 	struct work_struct deferred_io_work;
 
+	/* to disable write back during degraded mode */
+	struct work_struct disable_writeback_work;
+
 	/* to for chunk_aligned_read in writeback mode, details below */
 	spinlock_t tree_lock;
 	struct radix_tree_root big_stripe_tree;
@@ -653,6 +656,20 @@ static void r5l_submit_io_async(struct work_struct *work)
 		r5l_do_submit_io(log, io);
 }
 
+static void r5c_disable_writeback_async(struct work_struct *work)
+{
+	struct r5l_log *log = container_of(work, struct r5l_log,
+					   disable_writeback_work);
+	struct mddev *mddev = log->rdev->mddev;
+	struct r5conf *conf = mddev->private;
+
+	pr_crit("md/raid:%s: Disabling writeback cache for degraded array.\n",
+		mdname(mddev));
+	mddev_suspend(mddev);
+	conf->log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_THROUGH;
+	mddev_resume(mddev);
+}
+
 static void r5l_submit_current_io(struct r5l_log *log)
 {
 	struct r5l_io_unit *io = log->current_io;
@@ -2311,6 +2328,9 @@ static ssize_t r5c_journal_mode_store(struct mddev *mddev,
 	    val > R5C_JOURNAL_MODE_WRITE_BACK)
 		return -EINVAL;
 
+	if (calc_degraded(conf) > 0 && val == R5C_JOURNAL_MODE_WRITE_BACK)
+		return -EINVAL;
+
 	mddev_suspend(mddev);
 	conf->log->r5c_journal_mode = val;
 	mddev_resume(mddev);
@@ -2369,6 +2389,16 @@ int r5c_try_caching_write(struct r5conf *conf,
 		set_bit(STRIPE_R5C_CACHING, &sh->state);
 	}
 
+	/*
+	 * When run in degraded mode, array is set to write-through mode.
+	 * This check helps drain pending write safely in the transition to
+	 * write-through mode.
+	 */
+	if (s->failed) {
+		r5c_make_stripe_write_out(sh);
+		return -EAGAIN;
+	}
+
 	for (i = disks; i--; ) {
 		dev = &sh->dev[i];
 		/* if non-overwrite, use writing-out phase */
@@ -2713,6 +2743,19 @@ static int r5l_load_log(struct r5l_log *log)
 	return ret;
 }
 
+void r5c_update_on_rdev_error(struct mddev *mddev)
+{
+	struct r5conf *conf = mddev->private;
+	struct r5l_log *log = conf->log;
+
+	if (!log)
+		return;
+
+	if (calc_degraded(conf) > 0 &&
+	    conf->log->r5c_journal_mode == R5C_JOURNAL_MODE_WRITE_BACK)
+		schedule_work(&log->disable_writeback_work);
+}
+
 int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
 {
 	struct request_queue *q = bdev_get_queue(rdev->bdev);
@@ -2788,6 +2831,7 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
 	spin_lock_init(&log->no_space_stripes_lock);
 
 	INIT_WORK(&log->deferred_io_work, r5l_submit_io_async);
+	INIT_WORK(&log->disable_writeback_work, r5c_disable_writeback_async);
 
 	log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_THROUGH;
 	INIT_LIST_HEAD(&log->stripe_in_journal_list);
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index ad8f24c..f8223e5 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -556,7 +556,7 @@ static struct stripe_head *__find_stripe(struct r5conf *conf, sector_t sector,
  * of the two sections, and some non-in_sync devices may
  * be insync in the section most affected by failed devices.
  */
-static int calc_degraded(struct r5conf *conf)
+int calc_degraded(struct r5conf *conf)
 {
 	int degraded, degraded2;
 	int i;
@@ -2606,6 +2606,7 @@ static void raid5_error(struct mddev *mddev, struct md_rdev *rdev)
 		bdevname(rdev->bdev, b),
 		mdname(mddev),
 		conf->raid_disks - mddev->degraded);
+	r5c_update_on_rdev_error(mddev);
 }
 
 /*
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 8ae498c..36f28d1 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -762,6 +762,7 @@ extern sector_t raid5_compute_sector(struct r5conf *conf, sector_t r_sector,
 extern struct stripe_head *
 raid5_get_active_stripe(struct r5conf *conf, sector_t sector,
 			int previous, int noblock, int noquiesce);
+extern int calc_degraded(struct r5conf *conf);
 extern int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev);
 extern void r5l_exit_log(struct r5l_log *log);
 extern int r5l_write_stripe(struct r5l_log *log, struct stripe_head *head_sh);
@@ -791,4 +792,5 @@ extern void r5c_check_stripe_cache_usage(struct r5conf *conf);
 extern void r5c_check_cached_full_stripe(struct r5conf *conf);
 extern struct md_sysfs_entry r5c_journal_mode;
 extern bool r5c_big_stripe_cached(struct r5conf *conf, sector_t sect);
+extern void r5c_update_on_rdev_error(struct mddev *mddev);
 #endif
-- 
2.9.3



* Re: performance of raid5 on fast devices
From: Coly Li @ 2017-01-24  7:11 UTC (permalink / raw)
  To: Jake Yao; +Cc: Heinz Mauelshagen, Roman Mamedov, linux-raid
In-Reply-To: <CA+Dh761AoRa3g3O4hztTACe5M1sbNRz3_hQQKgikupxLNLVQ8g@mail.gmail.com>

Hi Jake,

Hmm, is the hardware powerful enough ? When I did similar testing, I
used a machine with 2x10 core XEON CPU, and 80GB memory.
And could you please try bs=64K? I got a good performance number with
64KB blocksize.

And could you have a look at top out put, are all the CPUs 100%
utilized, or still idle on some CPUs ?

Coly

On 2017/1/24 上午6:20, Jake Yao wrote:
> I run tests with multiple IO threads, but it looks like it does not
> affect the overall performance.
> 
> In this run with 8 io threads,
> 
> [global]
> ioengine=libaio
> iodepth=64
> bs=192k
> direct=1
> thread=1
> time_based=1
> runtime=20
> numjobs=8
> loops=1
> group_reporting=1
> rwmixread=70
> rwmixwrite=30
> exitall
> #
> # end of global
> #
> [nvme_md_write]
> rw=write
> filename=/dev/md127
> runtime=20
> 
> [nvme_drv_write]
> rw=write
> filename=/dev/nvme1n1p2
> runtime=20
> 
> I got following for nvme based raid5 and single drive:
> 
> md thrd-cnt 0: write: io=27992MB, bw=1397.5MB/s, iops=7452, runt= 20031msec
> md thrd-cnt 1: write: io=43065MB, bw=2148.6MB/s, iops=11458, runt= 20044msec
> md thrd-cnt 2: write: io=43209MB, bw=2155.9MB/s, iops=11497, runt= 20043msec
> md thrd-cnt 3: write: io=43163MB, bw=2153.9MB/s, iops=11487, runt= 20040msec
> md thrd-cnt 4: write: io=43316MB, bw=2163.2MB/s, iops=11536, runt= 20024msec
> md thrd-cnt 5: write: io=43390MB, bw=2164.7MB/s, iops=11544, runt= 20045msec
> md thrd-cnt 6: write: io=43295MB, bw=2160.2MB/s, iops=11521, runt= 20042msec
> single drive: write: io=36004MB, bw=1795.4MB/s, iops=9575, runt= 20054msec
> 
> It also does not show little effect on ssd based raid5 and single
> drive. Same fio config as above, just changing the corresponding
> device filenames. The result is following:
> 
> md thrd-cnt 0: write: io=13646MB, bw=696242KB/s, iops=3626, runt= 20070msec
> md thrd-cnt 1: write: io=24519MB, bw=1221.5MB/s, iops=6514, runt= 20074msec
> md thrd-cnt 2: write: io=24780MB, bw=1234.9MB/s, iops=6585, runt= 20068msec
> md thrd-cnt 3: write: io=24890MB, bw=1240.2MB/s, iops=6613, runt= 20072msec
> md thrd-cnt 4: write: io=24937MB, bw=1242.5MB/s, iops=6626, runt= 20071msec
> md thrd-cnt 5: write: io=24948MB, bw=1242.9MB/s, iops=6628, runt= 20073msec
> md thrd-cnt 6: write: io=24701MB, bw=1230.1MB/s, iops=6564, runt= 20068msec
> single drive: write: io=8389.4MB, bw=428184KB/s, iops=2230, runt= 20063msec
> 
> In the ssd case, raid5 array is 3x better than a single drive.
> 
> On Fri, Jan 20, 2017 at 9:58 AM, Coly Li <colyli@suse.de> wrote:
>> On 2017/1/19 上午3:25, Jake Yao wrote:
>>> It is interesting. I do not see the similar behavior with the change
>>> of group_thread_cnt.
>>>
>>> The raid5 I have is following:
>>>
>>> md125 : active raid5 nvme0n1p1[0] nvme2n1p1[2] nvme1n1p1[1] nvme3n1p1[4]
>>>       943325184 blocks super 1.2 level 5, 32k chunk, algorithm 2 [4/4] [UUUU]
>>>       bitmap: 0/3 pages [0KB], 65536KB chunk
>>>
>>> /dev/md125:
>>>         Version : 1.2
>>>   Creation Time : Thu Dec 15 20:11:46 2016
>>>      Raid Level : raid5
>>>      Array Size : 943325184 (899.63 GiB 965.96 GB)
>>>   Used Dev Size : 314441728 (299.88 GiB 321.99 GB)
>>>    Raid Devices : 4
>>>   Total Devices : 4
>>>     Persistence : Superblock is persistent
>>>
>>>   Intent Bitmap : Internal
>>>
>>>     Update Time : Wed Jan 18 16:24:52 2017
>>>           State : clean
>>>  Active Devices : 4
>>> Working Devices : 4
>>>  Failed Devices : 0
>>>   Spare Devices : 0
>>>
>>>          Layout : left-symmetric
>>>      Chunk Size : 32K
>>>
>>>            Name : localhost:nvme  (local to host localhost)
>>>            UUID : 477a94af:79f5a10a:0d513dc6:7f5e670d
>>>          Events : 108
>>>
>>>     Number   Major   Minor   RaidDevice State
>>>        0     259        6        0      active sync   /dev/nvme0n1p1
>>>        1     259        8        1      active sync   /dev/nvme1n1p1
>>>        2     259        9        2      active sync   /dev/nvme2n1p1
>>>        4     259        1        3      active sync   /dev/nvme3n1p1
>>>
>>> The fio config is:
>>>
>>> [global]
>>> ioengine=libaio
>>> iodepth=64
>>> bs=96K
>>> direct=1
>>> thread=1
>>> time_based=1
>>> runtime=20
>>> numjobs=1
>>
>> You only have 1 I/O thread, bottle neck is here. Have a try with numjobs=8.
>>
>>> loops=1
>>> group_reporting=1
>>> exitall
>> [snip]
>>
>> Coly



* [PATCH 0/2] Bad block notification
From: Tomasz Majchrzak @ 2017-01-24 12:03 UTC (permalink / raw)
  To: linux-raid; +Cc: shli, Jes.Sorensen, jes.sorensen, Tomasz Majchrzak

At the moment there is no way to be notified that bad blocks have been found on
a disk. It is only possible to check manually with 'mdadm --examine-badblocks'.
A user might not be aware of a bad block for a long time. If another disk
in the array fails, data is lost.

These patches add a new event to the kernel and mdadm in order to send a
notification on the first bad block on a disk. I have chosen to do it only for
the first bad block as I think it's a sufficient indication that the drive
requires replacement.
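
The intended behaviour can be modelled roughly as below. This is a Python
sketch of the idea, not the Monitor.c code: it assumes the monitor keeps
the previous disk state bitmask per device and raises the event only when
the bad block bit goes from clear to set. The bit number 11 is the
MD_DISK_BB_PRESENT value from patch 1/2; the function names and the dict
layout are invented for the example.

```python
MD_DISK_BB_PRESENT = 11  # bit number defined in md_p.h by patch 1/2


def has_bad_blocks(state):
    """True if the disk state bitmask has the bad block bit set."""
    return bool(state & (1 << MD_DISK_BB_PRESENT))


def newly_flagged(previous, current):
    """Return devices whose bad block bit went from clear to set.

    previous and current map a device name to its state bitmask; a device
    missing from previous is treated as having had no bad blocks.
    """
    return sorted(
        dev
        for dev, state in current.items()
        if has_bad_blocks(state) and not has_bad_blocks(previous.get(dev, 0))
    )
```

Because only the transition is reported, a disk that keeps accumulating
bad blocks produces one event rather than one per monitoring pass.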

Tomasz Majchrzak (1):
  md: add bad block flag to disk state

 drivers/md/md.c                | 2 ++
 include/uapi/linux/raid/md_p.h | 1 +
 2 files changed, 3 insertions(+)

Tomasz Majchrzak (1):
  Monitor: add new event BadBlocks

 Monitor.c  | 14 +++++++++-----
 md_p.h     |  1 + 
 mdadm.8.in | 10 ++++++++--
 3 files changed, 18 insertions(+), 7 deletions(-)

-- 
1.8.3.1



* [PATCH 1/2] md: add bad block flag to disk state
From: Tomasz Majchrzak @ 2017-01-24 12:03 UTC (permalink / raw)
  To: linux-raid; +Cc: shli, Jes.Sorensen, jes.sorensen, Tomasz Majchrzak
In-Reply-To: <1485259419-2308-1-git-send-email-tomasz.majchrzak@intel.com>

Add a new flag to report that bad blocks are present on a disk. It will
allow userspace to notify the user of the problem.

Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
---
 drivers/md/md.c                | 2 ++
 include/uapi/linux/raid/md_p.h | 1 +
 2 files changed, 3 insertions(+)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 0abb147..1a807ec 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -6034,6 +6034,8 @@ static int get_disk_info(struct mddev *mddev, void __user * arg)
 			info.state |= (1<<MD_DISK_WRITEMOSTLY);
 		if (test_bit(FailFast, &rdev->flags))
 			info.state |= (1<<MD_DISK_FAILFAST);
+		if (rdev->badblocks.count)
+			info.state |= (1<<MD_DISK_BB_PRESENT);
 	} else {
 		info.major = info.minor = 0;
 		info.raid_disk = -1;
diff --git a/include/uapi/linux/raid/md_p.h b/include/uapi/linux/raid/md_p.h
index 9930f3e..b151e93 100644
--- a/include/uapi/linux/raid/md_p.h
+++ b/include/uapi/linux/raid/md_p.h
@@ -93,6 +93,7 @@
 				   * read requests will only be sent here in
 				   * dire need
 				   */
+#define MD_DISK_BB_PRESENT	11 /* disk has bad blocks */
 #define MD_DISK_JOURNAL		18 /* disk is used as the write journal in RAID-5/6 */
 
 #define MD_DISK_ROLE_SPARE	0xffff
-- 
1.8.3.1
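
For readers who want to see the effect of the new flag outside the kernel,
here is a minimal userspace model of how the state word could be assembled.
It is only a sketch: struct fake_rdev and fake_disk_state() are hypothetical
stand-ins, not kernel interfaces. The two bit numbers are taken from the md_p.h
hunks in this series.

```c
#include <stdbool.h>

#define MD_DISK_FAILFAST   10	/* value shown in the md_p.h context */
#define MD_DISK_BB_PRESENT 11	/* value proposed by this patch */

/* Hypothetical, simplified stand-in for the kernel's per-device data. */
struct fake_rdev {
	bool failfast;
	int badblock_count;
};

/* Build a state word the way the patched get_disk_info() hunk does. */
static int fake_disk_state(const struct fake_rdev *rdev)
{
	int state = 0;

	if (rdev->failfast)
		state |= 1 << MD_DISK_FAILFAST;
	if (rdev->badblock_count)
		state |= 1 << MD_DISK_BB_PRESENT;
	return state;
}
```

Any non-zero bad block count sets the bit, so userspace learns that bad blocks
exist but not how many there are.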


^ permalink raw reply related

* [PATCH 2/2] Monitor: add new event BadBlocks
From: Tomasz Majchrzak @ 2017-01-24 12:03 UTC (permalink / raw)
  To: linux-raid; +Cc: shli, Jes.Sorensen, jes.sorensen, Tomasz Majchrzak
In-Reply-To: <1485259419-2308-1-git-send-email-tomasz.majchrzak@intel.com>

Add new event BadBlocks to notify the user when bad blocks are found on a
disk. Send an email (if configured) and write it to syslog as a warning.

Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
---
 Monitor.c  | 14 +++++++++-----
 md_p.h     |  1 +
 mdadm.8.in | 10 ++++++++--
 3 files changed, 18 insertions(+), 7 deletions(-)

diff --git a/Monitor.c b/Monitor.c
index 802a9d9..cb92a23 100644
--- a/Monitor.c
+++ b/Monitor.c
@@ -363,10 +363,11 @@ static void alert(char *event, char *dev, char *disc, struct alert_info *info)
 		}
 	}
 	if (info->mailaddr &&
-	    (strncmp(event, "Fail", 4)==0 ||
-	     strncmp(event, "Test", 4)==0 ||
-	     strncmp(event, "Spares", 6)==0 ||
-	     strncmp(event, "Degrade", 7)==0)) {
+	    (strncmp(event, "Fail", 4) == 0 ||
+	     strncmp(event, "Test", 4) == 0 ||
+	     strncmp(event, "Spares", 6) == 0 ||
+	     strncmp(event, "Degrade", 7) == 0 ||
+	     strncmp(event, "BadBlocks", 9) == 0)) {
 		FILE *mp = popen(Sendmail, "w");
 		if (mp) {
 			FILE *mdstat;
@@ -422,7 +423,8 @@ static void alert(char *event, char *dev, char *disc, struct alert_info *info)
 		/* Good to know about, but are not failures: */
 		else if (strncmp(event, "Rebuild", 7)==0 ||
 			 strncmp(event, "MoveSpare", 9)==0 ||
-			 strncmp(event, "Spares", 6) != 0)
+			 strncmp(event, "Spares", 6) != 0 ||
+			 strncmp(event, "BadBlocks", 9) != 0)
 			priority = LOG_WARNING;
 		/* Everything else: */
 		else
@@ -668,6 +670,8 @@ static int check_array(struct state *st, struct mdstat_ent *mdstat,
 				alert("FailSpare", dev, dv, ainfo);
 			else if ((newstate&change)&(1<<MD_DISK_SYNC))
 				alert("SpareActive", dev, dv, ainfo);
+			else if ((newstate&change)&(1<<MD_DISK_BB_PRESENT))
+				alert("BadBlocks", dev, dv, ainfo);
 		}
 		st->devstate[i] = newstate;
 		st->devid[i] = makedev(disc.major, disc.minor);
diff --git a/md_p.h b/md_p.h
index dc9fec1..39884be 100644
--- a/md_p.h
+++ b/md_p.h
@@ -90,6 +90,7 @@
 				   * dire need
 				   */
 #define	MD_DISK_FAILFAST	10 /* Fewer retries, more failures */
+#define MD_DISK_BB_PRESENT	11 /* disk has bad blocks */
 
 #define MD_DISK_REPLACEMENT	17
 #define MD_DISK_JOURNAL		18 /* disk is used as the write journal in RAID-5/6 */
diff --git a/mdadm.8.in b/mdadm.8.in
index 1e4f91d..7b89a8a 100644
--- a/mdadm.8.in
+++ b/mdadm.8.in
@@ -2552,6 +2552,10 @@ message.
 (syslog priority: Warning)
 
 .TP
+.B BadBlocks
+Bad blocks have been found on the device. (syslog priority: Warning)
+
+.TP
 .B TestMessage
 An array was found at startup, and the
 .B \-\-test
@@ -2563,7 +2567,8 @@ Only
 .B Fail,
 .B FailSpare,
 .B DegradedArray,
-.B SparesMissing
+.B SparesMissing,
+.B BadBlocks
 and
 .B TestMessage
 cause Email to be sent.  All events cause the program to be run.
@@ -2575,8 +2580,9 @@ Each event has an associated array device (e.g.
 and possibly a second device.  For
 .BR Fail ,
 .BR FailSpare ,
+.BR SpareActive
 and
-.B SpareActive
+.BR BadBlocks
 the second device is the relevant component device.
 For
 .B MoveSpare
-- 
1.8.3.1
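
The mail filter in alert() works on event name prefixes, which is why the
short list of strncmp() calls covers the longer event names in the man page.
The sketch below restates that check as a standalone function.
event_sends_mail() is a hypothetical name used only for this illustration,
and the real code additionally requires a mail address to be configured.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper mirroring the prefix checks in alert(): an event
 * triggers mail when its name starts with one of these prefixes.
 */
static bool event_sends_mail(const char *event)
{
	static const char *const prefixes[] = {
		"Fail", "Test", "Spares", "Degrade", "BadBlocks",
	};
	size_t i;

	for (i = 0; i < sizeof(prefixes) / sizeof(prefixes[0]); i++) {
		if (strncmp(event, prefixes[i], strlen(prefixes[i])) == 0)
			return true;
	}
	return false;
}
```

Under this model "FailSpare", "DegradedArray", "SparesMissing", "TestMessage"
and "BadBlocks" all match, while "SpareActive" does not, because it differs
from the "Spares" prefix in its sixth character.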


^ permalink raw reply related

* [PATCH 1/1] imsm: fix missing error message during migration
From: Pawel Baldysiak @ 2017-01-24 13:29 UTC (permalink / raw)
  To: jes.sorensen; +Cc: linux-raid, Pawel Baldysiak

If the user tries to migrate from raid0 to raid5 and there is no spare
drive to perform the migration, mdadm exits with an error code but
prints no error message.

Print an error instead of a debug message when this condition occurs,
so the user is informed why the requested migration did not start.

Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
---
 super-intel.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/super-intel.c b/super-intel.c
index 433bb6d..d5e9517 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -10718,7 +10718,7 @@ static int imsm_create_metadata_update_for_migration(
 			free(u);
 			sysfs_free(spares);
 			update_memory_size = 0;
-			dprintf("error: cannot get spare device for requested migration");
+			pr_err("cannot get spare device for requested migration\n");
 			return 0;
 		}
 		sysfs_free(spares);
-- 
2.9.3
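
The difference between the two kinds of message can be sketched in plain C.
This is not mdadm's implementation: debug_enabled, debug_msg() and
error_msg() are hypothetical names, and the assumption is simply that debug
output is dropped in a normal build while error output is always written.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical switch standing in for a debug build. */
static int debug_enabled = 0;

/* Debug-style message: silently dropped unless debugging is enabled. */
static int debug_msg(FILE *out, const char *fmt, ...)
{
	va_list ap;
	int n;

	if (!debug_enabled)
		return 0;
	va_start(ap, fmt);
	n = vfprintf(out, fmt, ap);
	va_end(ap);
	return n;
}

/* Error-style message: always written, prefixed with the program name. */
static int error_msg(FILE *out, const char *fmt, ...)
{
	va_list ap;
	int n;

	n = fprintf(out, "mdadm: ");
	va_start(ap, fmt);
	n += vfprintf(out, fmt, ap);
	va_end(ap);
	return n;
}
```

With debugging off, debug_msg() writes nothing, so a failure reported only
through it leaves the user with a bare exit code.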


^ permalink raw reply related

* Re: [PATCH v2 3/3] md/r5cache: disable write back for degraded array
From: Shaohua Li @ 2017-01-24 17:56 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-raid, neilb, shli, kernel-team, dan.j.williams, hch,
	liuzhengyuan, liuyun01, Jes.Sorensen
In-Reply-To: <20170124011259.3351506-3-songliubraving@fb.com>

On Mon, Jan 23, 2017 at 05:12:59PM -0800, Song Liu wrote:
> write-back cache in degraded mode introduces corner cases to the array.
> Although we try to cover all these corner cases, it is safer to just
> disable write-back cache when the array is in degraded mode.
> 
> In this patch, we disable writeback cache for degraded mode:
> 1. On device failure, if the array enters degraded mode, raid5_error()
>    will submit async job r5c_disable_writeback_async to disable
>    writeback;
> 2. In r5c_journal_mode_store(), it is invalid to enable writeback in
>    degraded mode;
> 3. In r5c_try_caching_write(), stripes with s->failed>0 will be handled
>    in write-through mode.

Applied the first two. I have some comments about this one; please see below.
 
> Signed-off-by: Song Liu <songliubraving@fb.com>
> ---
>  drivers/md/raid5-cache.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
>  drivers/md/raid5.c       |  3 ++-
>  drivers/md/raid5.h       |  2 ++
>  3 files changed, 48 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index 00d2838..55f1a37 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -164,6 +164,9 @@ struct r5l_log {
>  	/* to submit async io_units, to fulfill ordering of flush */
>  	struct work_struct deferred_io_work;
>  
> +	/* to disable write back during in degraded mode */
> +	struct work_struct disable_writeback_work;
> +
>  	/* to for chunk_aligned_read in writeback mode, details below */
>  	spinlock_t tree_lock;
>  	struct radix_tree_root big_stripe_tree;
> @@ -653,6 +656,20 @@ static void r5l_submit_io_async(struct work_struct *work)
>  		r5l_do_submit_io(log, io);
>  }
>  
> +static void r5c_disable_writeback_async(struct work_struct *work)
> +{
> +	struct r5l_log *log = container_of(work, struct r5l_log,
> +					   disable_writeback_work);
> +	struct mddev *mddev = log->rdev->mddev;
> +	struct r5conf *conf = mddev->private;
> +
> +	pr_crit("md/raid:%s: Disabling writeback cache for degraded array.\n",
> +		mdname(mddev));

Does this need to be pr_crit? This isn't an error, so I think pr_info is more
appropriate.

> +	mddev_suspend(mddev);
> +	conf->log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_THROUGH;
> +	mddev_resume(mddev);
> +}
> +
>  static void r5l_submit_current_io(struct r5l_log *log)
>  {
>  	struct r5l_io_unit *io = log->current_io;
> @@ -2311,6 +2328,9 @@ static ssize_t r5c_journal_mode_store(struct mddev *mddev,
>  	    val > R5C_JOURNAL_MODE_WRITE_BACK)
>  		return -EINVAL;
>  
> +	if (calc_degraded(conf) > 0 && val == R5C_JOURNAL_MODE_WRITE_BACK)
> +		return -EINVAL;
> +
>  	mddev_suspend(mddev);
>  	conf->log->r5c_journal_mode = val;
>  	mddev_resume(mddev);
> @@ -2369,6 +2389,16 @@ int r5c_try_caching_write(struct r5conf *conf,
>  		set_bit(STRIPE_R5C_CACHING, &sh->state);
>  	}
>  
> +	/*
> +	 * When run in degraded mode, array is set to write-through mode.
> +	 * This check helps drain pending write safely in the transition to
> +	 * write-through mode.
> +	 */
> +	if (s->failed) {
> +		r5c_make_stripe_write_out(sh);
> +		return -EAGAIN;
> +	}
> +
>  	for (i = disks; i--; ) {
>  		dev = &sh->dev[i];
>  		/* if non-overwrite, use writing-out phase */
> @@ -2713,6 +2743,19 @@ static int r5l_load_log(struct r5l_log *log)
>  	return ret;
>  }
>  
> +void r5c_update_on_rdev_error(struct mddev *mddev)
> +{
> +	struct r5conf *conf = mddev->private;
> +	struct r5l_log *log = conf->log;
> +
> +	if (!log)
> +		return;
> +
> +	if (calc_degraded(conf) > 0 &&
> +	    conf->log->r5c_journal_mode == R5C_JOURNAL_MODE_WRITE_BACK)
> +		schedule_work(&log->disable_writeback_work);
> +}
> +
>  int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
>  {
>  	struct request_queue *q = bdev_get_queue(rdev->bdev);
> @@ -2788,6 +2831,7 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
>  	spin_lock_init(&log->no_space_stripes_lock);
>  
>  	INIT_WORK(&log->deferred_io_work, r5l_submit_io_async);
> +	INIT_WORK(&log->disable_writeback_work, r5c_disable_writeback_async);

In teardown, we need to make sure the work has finished, so please flush the
work at that time.
 
>  	log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_THROUGH;
>  	INIT_LIST_HEAD(&log->stripe_in_journal_list);
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index ad8f24c..f8223e5 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -556,7 +556,7 @@ static struct stripe_head *__find_stripe(struct r5conf *conf, sector_t sector,
>   * of the two sections, and some non-in_sync devices may
>   * be insync in the section most affected by failed devices.
>   */
> -static int calc_degraded(struct r5conf *conf)
> +int calc_degraded(struct r5conf *conf)

Since this one is now exported to other files, let's rename it to
raid5_calc_degraded.

Thanks,
Shaohua

^ permalink raw reply

* [PATCH v3] md/r5cache: disable write back for degraded array
From: Song Liu @ 2017-01-24 18:45 UTC (permalink / raw)
  To: linux-raid
  Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuan,
	liuyun01, Song Liu, jsorensen

write-back cache in degraded mode introduces corner cases to the array.
Although we try to cover all these corner cases, it is safer to just
disable write-back cache when the array is in degraded mode.

In this patch, we disable writeback cache for degraded mode:
1. On device failure, if the array enters degraded mode, raid5_error()
   will submit async job r5c_disable_writeback_async to disable
   writeback;
2. In r5c_journal_mode_store(), it is invalid to enable writeback in
   degraded mode;
3. In r5c_try_caching_write(), stripes with s->failed>0 will be handled
   in write-through mode.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/raid5-cache.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 drivers/md/raid5.c       | 15 ++++++++-------
 drivers/md/raid5.h       |  2 ++
 3 files changed, 56 insertions(+), 7 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 00d2838..8ab6e1a 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -164,6 +164,9 @@ struct r5l_log {
 	/* to submit async io_units, to fulfill ordering of flush */
 	struct work_struct deferred_io_work;
 
+	/* to disable write back during in degraded mode */
+	struct work_struct disable_writeback_work;
+
 	/* to for chunk_aligned_read in writeback mode, details below */
 	spinlock_t tree_lock;
 	struct radix_tree_root big_stripe_tree;
@@ -653,6 +656,20 @@ static void r5l_submit_io_async(struct work_struct *work)
 		r5l_do_submit_io(log, io);
 }
 
+static void r5c_disable_writeback_async(struct work_struct *work)
+{
+	struct r5l_log *log = container_of(work, struct r5l_log,
+					   disable_writeback_work);
+	struct mddev *mddev = log->rdev->mddev;
+	struct r5conf *conf = mddev->private;
+
+	pr_info("md/raid:%s: Disabling writeback cache for degraded array.\n",
+		mdname(mddev));
+	mddev_suspend(mddev);
+	conf->log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_THROUGH;
+	mddev_resume(mddev);
+}
+
 static void r5l_submit_current_io(struct r5l_log *log)
 {
 	struct r5l_io_unit *io = log->current_io;
@@ -2311,6 +2328,10 @@ static ssize_t r5c_journal_mode_store(struct mddev *mddev,
 	    val > R5C_JOURNAL_MODE_WRITE_BACK)
 		return -EINVAL;
 
+	if (raid5_calc_degraded(conf) > 0 &&
+	    val == R5C_JOURNAL_MODE_WRITE_BACK)
+		return -EINVAL;
+
 	mddev_suspend(mddev);
 	conf->log->r5c_journal_mode = val;
 	mddev_resume(mddev);
@@ -2369,6 +2390,16 @@ int r5c_try_caching_write(struct r5conf *conf,
 		set_bit(STRIPE_R5C_CACHING, &sh->state);
 	}
 
+	/*
+	 * When run in degraded mode, array is set to write-through mode.
+	 * This check helps drain pending write safely in the transition to
+	 * write-through mode.
+	 */
+	if (s->failed) {
+		r5c_make_stripe_write_out(sh);
+		return -EAGAIN;
+	}
+
 	for (i = disks; i--; ) {
 		dev = &sh->dev[i];
 		/* if non-overwrite, use writing-out phase */
@@ -2713,6 +2744,19 @@ static int r5l_load_log(struct r5l_log *log)
 	return ret;
 }
 
+void r5c_update_on_rdev_error(struct mddev *mddev)
+{
+	struct r5conf *conf = mddev->private;
+	struct r5l_log *log = conf->log;
+
+	if (!log)
+		return;
+
+	if (raid5_calc_degraded(conf) > 0 &&
+	    conf->log->r5c_journal_mode == R5C_JOURNAL_MODE_WRITE_BACK)
+		schedule_work(&log->disable_writeback_work);
+}
+
 int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
 {
 	struct request_queue *q = bdev_get_queue(rdev->bdev);
@@ -2788,6 +2832,7 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
 	spin_lock_init(&log->no_space_stripes_lock);
 
 	INIT_WORK(&log->deferred_io_work, r5l_submit_io_async);
+	INIT_WORK(&log->disable_writeback_work, r5c_disable_writeback_async);
 
 	log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_THROUGH;
 	INIT_LIST_HEAD(&log->stripe_in_journal_list);
@@ -2820,6 +2865,7 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
 
 void r5l_exit_log(struct r5l_log *log)
 {
+	flush_scheduled_work();
 	md_unregister_thread(&log->reclaim_thread);
 	mempool_destroy(log->meta_pool);
 	bioset_free(log->bs);
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index ad8f24c..9c93ca1 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -556,7 +556,7 @@ static struct stripe_head *__find_stripe(struct r5conf *conf, sector_t sector,
  * of the two sections, and some non-in_sync devices may
  * be insync in the section most affected by failed devices.
  */
-static int calc_degraded(struct r5conf *conf)
+int raid5_calc_degraded(struct r5conf *conf)
 {
 	int degraded, degraded2;
 	int i;
@@ -619,7 +619,7 @@ static int has_failed(struct r5conf *conf)
 	if (conf->mddev->reshape_position == MaxSector)
 		return conf->mddev->degraded > conf->max_degraded;
 
-	degraded = calc_degraded(conf);
+	degraded = raid5_calc_degraded(conf);
 	if (degraded > conf->max_degraded)
 		return 1;
 	return 0;
@@ -2592,7 +2592,7 @@ static void raid5_error(struct mddev *mddev, struct md_rdev *rdev)
 
 	spin_lock_irqsave(&conf->device_lock, flags);
 	clear_bit(In_sync, &rdev->flags);
-	mddev->degraded = calc_degraded(conf);
+	mddev->degraded = raid5_calc_degraded(conf);
 	spin_unlock_irqrestore(&conf->device_lock, flags);
 	set_bit(MD_RECOVERY_INTR, &mddev->recovery);
 
@@ -2606,6 +2606,7 @@ static void raid5_error(struct mddev *mddev, struct md_rdev *rdev)
 		bdevname(rdev->bdev, b),
 		mdname(mddev),
 		conf->raid_disks - mddev->degraded);
+	r5c_update_on_rdev_error(mddev);
 }
 
 /*
@@ -7147,7 +7148,7 @@ static int raid5_run(struct mddev *mddev)
 	/*
 	 * 0 for a fully functional array, 1 or 2 for a degraded array.
 	 */
-	mddev->degraded = calc_degraded(conf);
+	mddev->degraded = raid5_calc_degraded(conf);
 
 	if (has_failed(conf)) {
 		pr_crit("md/raid:%s: not enough operational devices (%d/%d failed)\n",
@@ -7394,7 +7395,7 @@ static int raid5_spare_active(struct mddev *mddev)
 		}
 	}
 	spin_lock_irqsave(&conf->device_lock, flags);
-	mddev->degraded = calc_degraded(conf);
+	mddev->degraded = raid5_calc_degraded(conf);
 	spin_unlock_irqrestore(&conf->device_lock, flags);
 	print_raid5_conf(conf);
 	return count;
@@ -7754,7 +7755,7 @@ static int raid5_start_reshape(struct mddev *mddev)
 		 * pre and post number of devices.
 		 */
 		spin_lock_irqsave(&conf->device_lock, flags);
-		mddev->degraded = calc_degraded(conf);
+		mddev->degraded = raid5_calc_degraded(conf);
 		spin_unlock_irqrestore(&conf->device_lock, flags);
 	}
 	mddev->raid_disks = conf->raid_disks;
@@ -7842,7 +7843,7 @@ static void raid5_finish_reshape(struct mddev *mddev)
 		} else {
 			int d;
 			spin_lock_irq(&conf->device_lock);
-			mddev->degraded = calc_degraded(conf);
+			mddev->degraded = raid5_calc_degraded(conf);
 			spin_unlock_irq(&conf->device_lock);
 			for (d = conf->raid_disks ;
 			     d < conf->raid_disks - mddev->delta_disks;
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 8ae498c..bbdc5a4 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -762,6 +762,7 @@ extern sector_t raid5_compute_sector(struct r5conf *conf, sector_t r_sector,
 extern struct stripe_head *
 raid5_get_active_stripe(struct r5conf *conf, sector_t sector,
 			int previous, int noblock, int noquiesce);
+extern int raid5_calc_degraded(struct r5conf *conf);
 extern int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev);
 extern void r5l_exit_log(struct r5l_log *log);
 extern int r5l_write_stripe(struct r5l_log *log, struct stripe_head *head_sh);
@@ -791,4 +792,5 @@ extern void r5c_check_stripe_cache_usage(struct r5conf *conf);
 extern void r5c_check_cached_full_stripe(struct r5conf *conf);
 extern struct md_sysfs_entry r5c_journal_mode;
 extern bool r5c_big_stripe_cached(struct r5conf *conf, sector_t sect);
+extern void r5c_update_on_rdev_error(struct mddev *mddev);
 #endif
-- 
2.9.3
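
The three rules in the commit message can be modelled in a few lines of
userspace C. This is a simplified sketch, not kernel code: struct fake_array,
the fake_* functions and the MODE_* constants are hypothetical, and the switch
to write-through is done synchronously here, whereas the patch defers it to a
work item.

```c
#include <errno.h>

enum journal_mode {
	MODE_WRITE_THROUGH = 0,
	MODE_WRITE_BACK = 1,
};

/* Hypothetical, simplified array state. */
struct fake_array {
	int degraded;		/* number of failed devices */
	enum journal_mode mode;
};

/* Rule 2: refuse to enable write-back while the array is degraded. */
static int fake_journal_mode_store(struct fake_array *a, int val)
{
	if (val < MODE_WRITE_THROUGH || val > MODE_WRITE_BACK)
		return -EINVAL;
	if (a->degraded > 0 && val == MODE_WRITE_BACK)
		return -EINVAL;
	a->mode = (enum journal_mode)val;
	return 0;
}

/* Rule 1: on device failure, fall back to write-through. */
static void fake_on_device_error(struct fake_array *a)
{
	a->degraded++;
	if (a->mode == MODE_WRITE_BACK)
		a->mode = MODE_WRITE_THROUGH;
}

/* Rule 3: a stripe with failed devices is not cached. */
static int fake_try_caching_write(int stripe_failed)
{
	return stripe_failed ? -EAGAIN : 0;
}
```

In this model a healthy array may be switched to write-back, a device error
forces it back to write-through, and a later attempt to re-enable write-back
fails with -EINVAL until the array is no longer degraded.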


^ permalink raw reply related

* Re: [PATCH v3] md/r5cache: disable write back for degraded array
From: Shaohua Li @ 2017-01-24 19:17 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-raid, neilb, shli, kernel-team, dan.j.williams, hch,
	liuzhengyuan, liuyun01, jsorensen
In-Reply-To: <20170124184530.1097979-1-songliubraving@fb.com>

On Tue, Jan 24, 2017 at 10:45:30AM -0800, Song Liu wrote:
> write-back cache in degraded mode introduces corner cases to the array.
> Although we try to cover all these corner cases, it is safer to just
> disable write-back cache when the array is in degraded mode.
> 
> In this patch, we disable writeback cache for degraded mode:
> 1. On device failure, if the array enters degraded mode, raid5_error()
>    will submit async job r5c_disable_writeback_async to disable
>    writeback;
> 2. In r5c_journal_mode_store(), it is invalid to enable writeback in
>    degraded mode;
> 3. In r5c_try_caching_write(), stripes with s->failed>0 will be handled
>    in write-through mode.

Applied, thanks! I made one slight change: replaced flush_scheduled_work() with
flush_work().

> Signed-off-by: Song Liu <songliubraving@fb.com>
> ---
>  drivers/md/raid5-cache.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
>  drivers/md/raid5.c       | 15 ++++++++-------
>  drivers/md/raid5.h       |  2 ++
>  3 files changed, 56 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index 00d2838..8ab6e1a 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -164,6 +164,9 @@ struct r5l_log {
>  	/* to submit async io_units, to fulfill ordering of flush */
>  	struct work_struct deferred_io_work;
>  
> +	/* to disable write back during in degraded mode */
> +	struct work_struct disable_writeback_work;
> +
>  	/* to for chunk_aligned_read in writeback mode, details below */
>  	spinlock_t tree_lock;
>  	struct radix_tree_root big_stripe_tree;
> @@ -653,6 +656,20 @@ static void r5l_submit_io_async(struct work_struct *work)
>  		r5l_do_submit_io(log, io);
>  }
>  
> +static void r5c_disable_writeback_async(struct work_struct *work)
> +{
> +	struct r5l_log *log = container_of(work, struct r5l_log,
> +					   disable_writeback_work);
> +	struct mddev *mddev = log->rdev->mddev;
> +	struct r5conf *conf = mddev->private;
> +
> +	pr_info("md/raid:%s: Disabling writeback cache for degraded array.\n",
> +		mdname(mddev));
> +	mddev_suspend(mddev);
> +	conf->log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_THROUGH;
> +	mddev_resume(mddev);
> +}
> +
>  static void r5l_submit_current_io(struct r5l_log *log)
>  {
>  	struct r5l_io_unit *io = log->current_io;
> @@ -2311,6 +2328,10 @@ static ssize_t r5c_journal_mode_store(struct mddev *mddev,
>  	    val > R5C_JOURNAL_MODE_WRITE_BACK)
>  		return -EINVAL;
>  
> +	if (raid5_calc_degraded(conf) > 0 &&
> +	    val == R5C_JOURNAL_MODE_WRITE_BACK)
> +		return -EINVAL;
> +
>  	mddev_suspend(mddev);
>  	conf->log->r5c_journal_mode = val;
>  	mddev_resume(mddev);
> @@ -2369,6 +2390,16 @@ int r5c_try_caching_write(struct r5conf *conf,
>  		set_bit(STRIPE_R5C_CACHING, &sh->state);
>  	}
>  
> +	/*
> +	 * When run in degraded mode, array is set to write-through mode.
> +	 * This check helps drain pending write safely in the transition to
> +	 * write-through mode.
> +	 */
> +	if (s->failed) {
> +		r5c_make_stripe_write_out(sh);
> +		return -EAGAIN;
> +	}
> +
>  	for (i = disks; i--; ) {
>  		dev = &sh->dev[i];
>  		/* if non-overwrite, use writing-out phase */
> @@ -2713,6 +2744,19 @@ static int r5l_load_log(struct r5l_log *log)
>  	return ret;
>  }
>  
> +void r5c_update_on_rdev_error(struct mddev *mddev)
> +{
> +	struct r5conf *conf = mddev->private;
> +	struct r5l_log *log = conf->log;
> +
> +	if (!log)
> +		return;
> +
> +	if (raid5_calc_degraded(conf) > 0 &&
> +	    conf->log->r5c_journal_mode == R5C_JOURNAL_MODE_WRITE_BACK)
> +		schedule_work(&log->disable_writeback_work);
> +}
> +
>  int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
>  {
>  	struct request_queue *q = bdev_get_queue(rdev->bdev);
> @@ -2788,6 +2832,7 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
>  	spin_lock_init(&log->no_space_stripes_lock);
>  
>  	INIT_WORK(&log->deferred_io_work, r5l_submit_io_async);
> +	INIT_WORK(&log->disable_writeback_work, r5c_disable_writeback_async);
>  
>  	log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_THROUGH;
>  	INIT_LIST_HEAD(&log->stripe_in_journal_list);
> @@ -2820,6 +2865,7 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
>  
>  void r5l_exit_log(struct r5l_log *log)
>  {
> +	flush_scheduled_work();
>  	md_unregister_thread(&log->reclaim_thread);
>  	mempool_destroy(log->meta_pool);
>  	bioset_free(log->bs);
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index ad8f24c..9c93ca1 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -556,7 +556,7 @@ static struct stripe_head *__find_stripe(struct r5conf *conf, sector_t sector,
>   * of the two sections, and some non-in_sync devices may
>   * be insync in the section most affected by failed devices.
>   */
> -static int calc_degraded(struct r5conf *conf)
> +int raid5_calc_degraded(struct r5conf *conf)
>  {
>  	int degraded, degraded2;
>  	int i;
> @@ -619,7 +619,7 @@ static int has_failed(struct r5conf *conf)
>  	if (conf->mddev->reshape_position == MaxSector)
>  		return conf->mddev->degraded > conf->max_degraded;
>  
> -	degraded = calc_degraded(conf);
> +	degraded = raid5_calc_degraded(conf);
>  	if (degraded > conf->max_degraded)
>  		return 1;
>  	return 0;
> @@ -2592,7 +2592,7 @@ static void raid5_error(struct mddev *mddev, struct md_rdev *rdev)
>  
>  	spin_lock_irqsave(&conf->device_lock, flags);
>  	clear_bit(In_sync, &rdev->flags);
> -	mddev->degraded = calc_degraded(conf);
> +	mddev->degraded = raid5_calc_degraded(conf);
>  	spin_unlock_irqrestore(&conf->device_lock, flags);
>  	set_bit(MD_RECOVERY_INTR, &mddev->recovery);
>  
> @@ -2606,6 +2606,7 @@ static void raid5_error(struct mddev *mddev, struct md_rdev *rdev)
>  		bdevname(rdev->bdev, b),
>  		mdname(mddev),
>  		conf->raid_disks - mddev->degraded);
> +	r5c_update_on_rdev_error(mddev);
>  }
>  
>  /*
> @@ -7147,7 +7148,7 @@ static int raid5_run(struct mddev *mddev)
>  	/*
>  	 * 0 for a fully functional array, 1 or 2 for a degraded array.
>  	 */
> -	mddev->degraded = calc_degraded(conf);
> +	mddev->degraded = raid5_calc_degraded(conf);
>  
>  	if (has_failed(conf)) {
>  		pr_crit("md/raid:%s: not enough operational devices (%d/%d failed)\n",
> @@ -7394,7 +7395,7 @@ static int raid5_spare_active(struct mddev *mddev)
>  		}
>  	}
>  	spin_lock_irqsave(&conf->device_lock, flags);
> -	mddev->degraded = calc_degraded(conf);
> +	mddev->degraded = raid5_calc_degraded(conf);
>  	spin_unlock_irqrestore(&conf->device_lock, flags);
>  	print_raid5_conf(conf);
>  	return count;
> @@ -7754,7 +7755,7 @@ static int raid5_start_reshape(struct mddev *mddev)
>  		 * pre and post number of devices.
>  		 */
>  		spin_lock_irqsave(&conf->device_lock, flags);
> -		mddev->degraded = calc_degraded(conf);
> +		mddev->degraded = raid5_calc_degraded(conf);
>  		spin_unlock_irqrestore(&conf->device_lock, flags);
>  	}
>  	mddev->raid_disks = conf->raid_disks;
> @@ -7842,7 +7843,7 @@ static void raid5_finish_reshape(struct mddev *mddev)
>  		} else {
>  			int d;
>  			spin_lock_irq(&conf->device_lock);
> -			mddev->degraded = calc_degraded(conf);
> +			mddev->degraded = raid5_calc_degraded(conf);
>  			spin_unlock_irq(&conf->device_lock);
>  			for (d = conf->raid_disks ;
>  			     d < conf->raid_disks - mddev->delta_disks;
> diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
> index 8ae498c..bbdc5a4 100644
> --- a/drivers/md/raid5.h
> +++ b/drivers/md/raid5.h
> @@ -762,6 +762,7 @@ extern sector_t raid5_compute_sector(struct r5conf *conf, sector_t r_sector,
>  extern struct stripe_head *
>  raid5_get_active_stripe(struct r5conf *conf, sector_t sector,
>  			int previous, int noblock, int noquiesce);
> +extern int raid5_calc_degraded(struct r5conf *conf);
>  extern int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev);
>  extern void r5l_exit_log(struct r5l_log *log);
>  extern int r5l_write_stripe(struct r5l_log *log, struct stripe_head *head_sh);
> @@ -791,4 +792,5 @@ extern void r5c_check_stripe_cache_usage(struct r5conf *conf);
>  extern void r5c_check_cached_full_stripe(struct r5conf *conf);
>  extern struct md_sysfs_entry r5c_journal_mode;
>  extern bool r5c_big_stripe_cached(struct r5conf *conf, sector_t sect);
> +extern void r5c_update_on_rdev_error(struct mddev *mddev);
>  #endif
> -- 
> 2.9.3

^ permalink raw reply

* Re: Input/Output error reading from a clean raid
From: Salatiel Filho @ 2017-01-24 21:15 UTC (permalink / raw)
  To: Andreas Klauer; +Cc: linux-raid
In-Reply-To: <20170123173411.GA9270@metamorpher.de>

On Mon, Jan 23, 2017 at 2:34 PM, Andreas Klauer
<Andreas.Klauer@metamorpher.de> wrote:
> On Mon, Jan 23, 2017 at 11:02:24AM -0300, Salatiel Filho wrote:
>> mdadm mdadm --examine-badblocks /dev/sdd1 /dev/sdg1 /dev/sdf1  /dev/sde1
>>
>> Bad-blocks on /dev/sdd1:
>>           1515723072 for 512 sectors
>> Bad-blocks on /dev/sde1:
>>           1515723072 for 512 sectors
>
> md believes you have bad blocks in identical places so it won't return
> whatever data is in these blocks. Thus you get read errors even if there
> is no bad block on the disk itself. Those bad block entries can be caused
> by cable or controller flukes, making temporary problems permanent...
>
> Personally I disable the bad block list everywhere.
>
> You can search this list for old messages regarding --examine-badblocks,
> this problem came up several times. Clearing the mdadm bad block list is
> worth a try. There's an undocumented option, update=force-no-bbl or such.
>
> Regards
> Andreas Klauer

Thanks to all of you for the help.
Andreas, the force-no-bbl option from mdadm 3.4 did the trick. I was able to
retrieve all the files and their MD5 sums match, so that is great =)
I really think it is very unlikely that two different disks from two
different brands would have problems at exactly the same block.
I have a question: what populates the bad block list? Is it the check
action sent to /sys/block/md??/md/sync_action, or does each read error
update it?
I think it may have been a problem with the cable (it is a 4-disk USB3 bay).
Anyway, thank you very much!
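
For anyone comparing --examine-badblocks output across member devices, the
check for "bad blocks in identical places" can be sketched as a range overlap
test. struct bb_range and bb_ranges_overlap() are hypothetical names used only
for illustration; each range corresponds to one "<start> for <len> sectors"
line in the output quoted above.

```c
#include <stdbool.h>
#include <stdint.h>

/* One entry as printed by --examine-badblocks. */
struct bb_range {
	uint64_t start;	/* first bad sector */
	uint64_t len;	/* number of sectors */
};

/* Hypothetical helper: true if the two half-open ranges share a sector. */
static bool bb_ranges_overlap(struct bb_range a, struct bb_range b)
{
	return a.start < b.start + b.len && b.start < a.start + a.len;
}
```

The two entries reported for /dev/sdd1 and /dev/sde1 (sector 1515723072 for
512 sectors on each) overlap completely under this test.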

^ permalink raw reply

