Linux RAID subsystem development


* Re: raid0 vs. mkfs
From: Coly Li @ 2016-11-27 17:09 UTC (permalink / raw)
  To: Avi Kivity; +Cc: linux-raid
In-Reply-To: <56c83c4e-d451-07e5-88e2-40b085d8681c@scylladb.com>

On 2016/11/27 11:24 PM, Avi Kivity wrote:
> mkfs /dev/md0 can take a very long time, if /dev/md0 is a very large
> disk that supports TRIM/DISCARD (erase whichever is inappropriate). 
> That is because mkfs issues a TRIM/DISCARD (erase whichever is
> inappropriate) for the entire partition. As far as I can tell, md
> converts the large TRIM/DISCARD (erase whichever is inappropriate) into
> a large number of TRIM/DISCARD (erase whichever is inappropriate)
> requests, one per chunk-size worth of disk, and issues them to the RAID
> components individually.
> 
> 
> It seems to me that md can convert the large TRIM/DISCARD (erase
> whichever is inappropriate) request it gets into one TRIM/DISCARD (erase
> whichever is inappropriate) per RAID component, converting an O(disk
> size / chunk size) operation into an O(number of RAID components)
> operation, which is much faster.
> 
> 
> I observed this with mkfs.xfs on a RAID0 of four 3TB NVMe devices, with
> the operation taking about a quarter of an hour, continuously pushing
> half-megabyte TRIM/DISCARD (erase whichever is inappropriate) requests
> to the disk. Linux 4.1.12.

Your suggestion might improve DISCARD performance somewhat. The
implementation could be tricky, but it is worth trying.

Indeed, this applies not only to DISCARD; it might also improve read and
write performance. We can check the bio size: if
	bio_sectors(bio)/conf->nr_strip_zones >= SOMETHRESHOLD
then on each underlying device we have more than SOMETHRESHOLD
contiguous chunks to issue, and they can be merged into a larger bio.
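
A minimal user-space sketch of the threshold check described above, not
kernel code. The helper names are hypothetical, and it assumes a
single-zone RAID0 with equally sized members, so it divides by the number
of member devices rather than by conf->nr_strip_zones:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: how many whole chunks a request places on each
 * member device. Assumes a single-zone RAID0 with equally sized members.
 */
static uint64_t chunks_per_device(uint64_t bio_sectors,
				  unsigned int nr_devices,
				  unsigned int chunk_sectors)
{
	if (nr_devices == 0 || chunk_sectors == 0)
		return 0;
	return bio_sectors / nr_devices / chunk_sectors;
}

/*
 * Hypothetical policy: merging is only considered when every device
 * would receive at least threshold_chunks contiguous chunks.
 */
static bool worth_merging(uint64_t bio_sectors, unsigned int nr_devices,
			  unsigned int chunk_sectors,
			  uint64_t threshold_chunks)
{
	return chunks_per_device(bio_sectors, nr_devices, chunk_sectors)
		>= threshold_chunks;
}
```

With four devices and 1024-sector chunks, an 8192-sector request gives two
chunks per device, so it would pass a threshold of 2 but not 3.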

IMHO it's interesting, good suggestion!

Coly



* [PATCH] md/raid5: limit request size according to implementation limits
From: Konstantin Khlebnikov @ 2016-11-27 16:32 UTC (permalink / raw)
  To: Shaohua Li, Neil Brown; +Cc: linux-raid, linux-kernel, stable

The current implementation employs a 16-bit counter of active stripes in
the lower bits of bio->bi_phys_segments. If a request is big enough to
overflow this counter, the bio will be completed and freed too early.

Fortunately, this does not happen in the default configuration because
several other limits prevent it: stripe_cache_size * nr_disks effectively
limits the count of active stripes, and a small max_sectors_kb on the
lower disks prevents it during normal read/write operations.

Overflow happens easily in discard if it is enabled by the module
parameter "devices_handle_discard_safely" and stripe_cache_size is set
big enough.

This patch limits the request size to 256 MiB - 8 KiB to prevent overflows.
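
For reference, the limit can be checked with a small user-space
calculation. This is only a sketch: the SKETCH_* names are made up for
illustration, and it assumes 4 KiB pages, so that STRIPE_SECTORS is 8
(one page of 512-byte sectors):

```c
#include <stdint.h>

#define SKETCH_SECTOR_SIZE	512u	/* bytes per sector */
#define SKETCH_STRIPE_SECTORS	8u	/* assumes 4 KiB pages */
#define SKETCH_MAX_STRIPES	0xfffeu	/* 16-bit counter, minus one */

/* Largest request, in bytes, that keeps the stripe counter in range. */
static uint64_t sketch_max_request_bytes(void)
{
	return (uint64_t)SKETCH_MAX_STRIPES * SKETCH_STRIPE_SECTORS *
	       SKETCH_SECTOR_SIZE;
}
```

Under that assumption, 0xfffe stripes of 4 KiB each come to 65534 * 4096
bytes, which is 256 MiB minus two pages, i.e. 256 MiB - 8 KiB.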

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Shaohua Li <shli@kernel.org>
Cc: Neil Brown <neilb@suse.com>
Cc: stable@vger.kernel.org
---
 drivers/md/raid5.c |    9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 92ac251e91e6..cce6057b9aca 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -6984,6 +6984,15 @@ static int raid5_run(struct mddev *mddev)
 			stripe = (stripe | (stripe-1)) + 1;
 		mddev->queue->limits.discard_alignment = stripe;
 		mddev->queue->limits.discard_granularity = stripe;
+
+		/*
+		 * We use 16-bit counter of active stripes in bi_phys_segments
+		 * (minus one for over-loaded initialization)
+		 */
+		blk_queue_max_hw_sectors(mddev->queue, 0xfffe * STRIPE_SECTORS);
+		blk_queue_max_discard_sectors(mddev->queue,
+					      0xfffe * STRIPE_SECTORS);
+
 		/*
 		 * unaligned part of discard request will be ignored, so can't
 		 * guarantee discard_zeroes_data


* raid0 vs. mkfs
From: Avi Kivity @ 2016-11-27 15:24 UTC (permalink / raw)
  To: linux-raid

mkfs /dev/md0 can take a very long time, if /dev/md0 is a very large 
disk that supports TRIM/DISCARD (erase whichever is inappropriate).  
That is because mkfs issues a TRIM/DISCARD (erase whichever is 
inappropriate) for the entire partition. As far as I can tell, md 
converts the large TRIM/DISCARD (erase whichever is inappropriate) into 
a large number of TRIM/DISCARD (erase whichever is inappropriate) 
requests, one per chunk-size worth of disk, and issues them to the RAID 
components individually.

It seems to me that md can convert the large TRIM/DISCARD (erase 
whichever is inappropriate) request it gets into one TRIM/DISCARD (erase 
whichever is inappropriate) per RAID component, converting an O(disk 
size / chunk size) operation into an O(number of RAID components) 
operation, which is much faster.

I observed this with mkfs.xfs on a RAID0 of four 3TB NVMe devices, with 
the operation taking about a quarter of an hour, continuously pushing 
half-megabyte TRIM/DISCARD (erase whichever is inappropriate) requests 
to the disk. Linux 4.1.12.
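
To put rough numbers on the two approaches, here is a small user-space
sketch. The function names are hypothetical, and the sizes used in the
note below (3 TB taken as 3 * 10^12 bytes per device, 512 KiB chunks) are
assumptions based on the description above rather than measured values:

```c
#include <stdint.h>

/* Requests issued when the discard is split once per chunk. */
static uint64_t discard_requests_per_chunk(uint64_t total_bytes,
					   uint64_t chunk_bytes)
{
	if (chunk_bytes == 0)
		return 0;
	/* Round up so a trailing partial chunk still counts. */
	return (total_bytes + chunk_bytes - 1) / chunk_bytes;
}

/* Requests issued when each RAID component gets one large discard. */
static uint64_t discard_requests_per_device(unsigned int nr_devices)
{
	return nr_devices;
}
```

For four such devices the per-chunk split yields on the order of 2 * 10^7
requests, against four requests for the per-component approach.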


* Re: [PATCH] Avoid nested function definition
From: Coly Li @ 2016-11-27  4:06 UTC (permalink / raw)
  To: Peter Foley, shli, linux-bcache, linux-raid; +Cc: linux-kernel, kent.overstreet
In-Reply-To: <20161126222415.13404-1-pefoley2@pefoley.com>

On 2016/11/27 6:24 AM, Peter Foley wrote:
> Fixes the error below with clang:
> ../drivers/md/bcache/sysfs.c:759:3: error: function definition is not allowed here
>                 {       return *((uint16_t *) r) - *((uint16_t *) l); }
>                 ^
> ../drivers/md/bcache/sysfs.c:789:32: error: use of undeclared identifier 'cmp'
>                 sort(p, n, sizeof(uint16_t), cmp, NULL);
>                                              ^
> 2 errors generated.
> 
> Signed-off-by: Peter Foley <pefoley2@pefoley.com>
> ---
>  drivers/md/bcache/sysfs.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
> index b3ff57d61dde..22ac9e6676a0 100644
> --- a/drivers/md/bcache/sysfs.c
> +++ b/drivers/md/bcache/sysfs.c
> @@ -731,6 +731,11 @@ static struct attribute *bch_cache_set_internal_files[] = {
>  };
>  KTYPE(bch_cache_set_internal);
>  
> +static int cmp(const void *l, const void *r)
> +{
> +	return *((uint16_t *)r) - *((uint16_t *)l);
> +}
> +
>  SHOW(__bch_cache)
>  {
>  	struct cache *ca = container_of(kobj, struct cache, kobj);
> @@ -755,9 +760,6 @@ SHOW(__bch_cache)
>  					       CACHE_REPLACEMENT(&ca->sb));
>  
>  	if (attr == &sysfs_priority_stats) {
> -		int cmp(const void *l, const void *r)
> -		{	return *((uint16_t *) r) - *((uint16_t *) l); }
> -
>  		struct bucket *b;
>  		size_t n = ca->sb.nbuckets, i;
>  		size_t unused = 0, available = 0, dirty = 0, meta = 0;
> 

I agree with this fix. However, can we use a more distinctive name, such
as __bch_cache_cmp()?
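
As a user-space illustration only (not the bcache code), a file-scope
comparator with a less generic name could look like the sketch below. It
uses the C library's qsort() instead of the kernel's sort(), and the
sketch_* names are made up here; a leading double underscore is avoided
because such identifiers are reserved in user space:

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Compare two uint16_t values so that larger values sort first.
 * Both operands are promoted to int, so the subtraction cannot overflow.
 */
static int sketch_bch_cache_cmp(const void *l, const void *r)
{
	return *((const uint16_t *)r) - *((const uint16_t *)l);
}

/* Sort an array of priorities in descending order. */
static void sketch_sort_priorities(uint16_t *p, size_t n)
{
	qsort(p, n, sizeof(uint16_t), sketch_bch_cache_cmp);
}
```

Defining the comparator at file scope is standard C, whereas a function
defined inside another function is a GCC extension that clang rejects.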

Thanks.

-- 
Coly Li


* [PATCH] Avoid nested function definition
From: Peter Foley @ 2016-11-26 22:24 UTC (permalink / raw)
  To: linux-kernel, kent.overstreet, shli, linux-bcache, linux-raid; +Cc: Peter Foley

Fixes the error below with clang:
../drivers/md/bcache/sysfs.c:759:3: error: function definition is not allowed here
                {       return *((uint16_t *) r) - *((uint16_t *) l); }
                ^
../drivers/md/bcache/sysfs.c:789:32: error: use of undeclared identifier 'cmp'
                sort(p, n, sizeof(uint16_t), cmp, NULL);
                                             ^
2 errors generated.

Signed-off-by: Peter Foley <pefoley2@pefoley.com>
---
 drivers/md/bcache/sysfs.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index b3ff57d61dde..22ac9e6676a0 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -731,6 +731,11 @@ static struct attribute *bch_cache_set_internal_files[] = {
 };
 KTYPE(bch_cache_set_internal);
 
+static int cmp(const void *l, const void *r)
+{
+	return *((uint16_t *)r) - *((uint16_t *)l);
+}
+
 SHOW(__bch_cache)
 {
 	struct cache *ca = container_of(kobj, struct cache, kobj);
@@ -755,9 +760,6 @@ SHOW(__bch_cache)
 					       CACHE_REPLACEMENT(&ca->sb));
 
 	if (attr == &sysfs_priority_stats) {
-		int cmp(const void *l, const void *r)
-		{	return *((uint16_t *) r) - *((uint16_t *) l); }
-
 		struct bucket *b;
 		size_t n = ca->sb.nbuckets, i;
 		size_t unused = 0, available = 0, dirty = 0, meta = 0;
-- 
2.11.0.rc2


* Re: Please help RAID1 complete fail no superblock
From: Phil Turmel @ 2016-11-26 19:58 UTC (permalink / raw)
  To: YK
  Cc: linux-raid@vger.kernel.org, antlists@youngman.org.uk,
	george.rapp@gmail.com
In-Reply-To: <2d5f7ae7-16ed-47aa-4d74-7973ab8b6348@gmail.com>

Sorry Yaniv, caught up in the holiday weekend here )-:

On 11/26/2016 12:45 PM, YK wrote:
> On 11/23/2016 01:04 AM, YK wrote:
>> On 11/23/2016 12:01 AM, Phil Turmel wrote:
>>> Please provide the output of these two commands:
>>>
>>> dd if=/dev/sdb1 bs=4k count=4k |hexdump -C |head -n1000
>>>
>>> dd if=/dev/sdc bs=4k count=4k |hexdump -C |head -n1000
>> Here is the output for sdb1
>> ...
>> 00000420  00 80 00 00 00 80 00 00  00 20 00 00 d8 ef ba 53  |......... .....S|
>> 00000430  e6 ef ba 53 01 00 ff ff  53 ef 01 00 01 00 00 00  |...S....S.......|
>> 00000440  ef ee ba 53 00 00 00 00  00 00 00 00 01 00 00 00  |...S............|
> 
> On 11/23/2016 11:47 PM, YK wrote:
>> On 11/23/2016 10:25 PM, Wols Lists wrote:
>>> Basically, you need to run hexdump, and look for evidence of a damaged
>>> superblock, and/or a filesystem superblock.
>>> If you find them, the other experts will be able, hopefully, to tell you
>>> how to get the array back
>>
>> I do see "53 ef" in the fifth line of the output for my sdb1 hexdump ! 
> After a lot of reading, I'm coming to the conclusion that the hexdump
> for sdb1 shows my ext4 superblock in the correct location with the
> expected offset (line 430, 0x38).
> Am I right?

Yes.  sdb1 should be directly mountable.
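
For anyone checking a dump like this by hand, the test amounts to the
sketch below. It is an illustration with made-up names, not code from any
tool: the ext2/3/4 superblock starts 1024 bytes into the device, and the
16-bit little-endian magic 0xEF53 sits 0x38 bytes into the superblock,
i.e. at byte 0x438, which is where "53 ef" appears above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_EXT_SB_OFFSET	1024u	/* superblock start on the device */
#define SKETCH_EXT_MAGIC_OFFSET	0x38u	/* magic field within superblock */
#define SKETCH_EXT_MAGIC	0xEF53u	/* stored little-endian: 53 ef */

/* Return true if buf (the start of the device) has the ext2/3/4 magic. */
static bool sketch_has_ext_magic(const uint8_t *buf, size_t len)
{
	size_t pos = SKETCH_EXT_SB_OFFSET + SKETCH_EXT_MAGIC_OFFSET;

	if (len < pos + 2)
		return false;
	return (uint16_t)(buf[pos] | (buf[pos + 1] << 8)) == SKETCH_EXT_MAGIC;
}
```

A matching magic only shows that the primary superblock is where it
should be; it says nothing about whether the rest of it is intact.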

> I tried to mount the partition without offset, by executing:
> mount /dev/sdb1 -o ro -t ext4 /mnt/recovery
> 
> But I still get this message:
> mount: wrong fs type, bad option, bad superblock on /dev/sdb1
> 
> Is there something I can do to successfully mount my old partition?

So, something stomped on the beginning of your filesystem superblock.
Fortunately, ext2/3/4 keeps backup copies that should let you fix your
filesystem, then mount it.  The manpage for e2fsck shows the '-b' option
that uses a backup superblock, and describes the methods for figuring
out where your backup superblocks are located.
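
As a rough guide to where those backups usually live, here is a sketch
under stated assumptions: the filesystem uses the sparse_super feature
(backups in block groups 0, 1 and powers of 3, 5 and 7) and a block size
larger than 1 KiB, so group g starts at block g * blocks_per_group. The
names are made up, and the e2fsck manpage remains the authority for a
given filesystem.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if n is base^k for some k >= 0. */
static bool sketch_is_power_of(uint64_t n, uint64_t base)
{
	while (n > 1 && n % base == 0)
		n /= base;
	return n == 1;
}

/* With sparse_super, only these block groups carry a superblock copy. */
static bool sketch_group_has_backup(uint64_t group)
{
	return group == 0 || group == 1 ||
	       sketch_is_power_of(group, 3) ||
	       sketch_is_power_of(group, 5) ||
	       sketch_is_power_of(group, 7);
}

/* First block of a group, assuming the first data block is block 0. */
static uint64_t sketch_group_first_block(uint64_t group,
					 uint64_t blocks_per_group)
{
	return group * blocks_per_group;
}
```

The dump above shows "00 80 00 00" at offset 0x420, the blocks-per-group
field, i.e. 32768. Under these assumptions the first backup would be at
block 32768 and the next at block 98304.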

Phil


* Re: Please help RAID1 complete fail no superblock
From: YK @ 2016-11-26 17:45 UTC (permalink / raw)
  To: Phil Turmel
  Cc: linux-raid@vger.kernel.org, antlists@youngman.org.uk,
	george.rapp@gmail.com
In-Reply-To: <14eb4917-a0d2-5a5e-02bb-9b6be8f6626d@gmail.com>

On 11/23/2016 01:04 AM, YK wrote:
> On 11/23/2016 12:01 AM, Phil Turmel wrote:
>> Please provide the output of these two commands:
>>
>> dd if=/dev/sdb1 bs=4k count=4k |hexdump -C |head -n1000
>>
>> dd if=/dev/sdc bs=4k count=4k |hexdump -C |head -n1000
> Here is the output for sdb1
> ...
> 00000420  00 80 00 00 00 80 00 00  00 20 00 00 d8 ef ba 53  |......... .....S|
> 00000430  e6 ef ba 53 01 00 ff ff  53 ef 01 00 01 00 00 00  |...S....S.......|
> 00000440  ef ee ba 53 00 00 00 00  00 00 00 00 01 00 00 00  |...S............|

On 11/23/2016 11:47 PM, YK wrote:
> On 11/23/2016 10:25 PM, Wols Lists wrote:
>> Basically, you need to run hexdump, and look for evidence of a damaged
>> superblock, and/or a filesystem superblock.
>> If you find them, the other experts will be able, hopefully, to tell you
>> how to get the array back
>
> I do see "53 ef" in the fifth line of the output for my sdb1 hexdump ! 
After a lot of reading, I'm coming to the conclusion that the hexdump
for sdb1 shows my ext4 superblock in the correct location with the
expected offset (line 430, 0x38).
Am I right?

I tried to mount the partition without offset, by executing:
mount /dev/sdb1 -o ro -t ext4 /mnt/recovery

But I still get this message:
mount: wrong fs type, bad option, bad superblock on /dev/sdb1

Is there something I can do to successfully mount my old partition?

Thank you all again,

Yaniv



* Re: MD Remnants After --stop
From: Marc Smith @ 2016-11-26 16:41 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid
In-Reply-To: <87oa15938i.fsf@notabene.neil.brown.name>

So, I modified mdopen.c to look like this:

--- a/mdopen.c    2016-11-25 17:04:25.782299330 -0500
+++ b/mdopen.c    2016-11-26 10:57:35.883621355 -0500
@@ -416,7 +416,7 @@
  */
 int open_mddev(char *dev, int report_errors)
 {
-    int mdfd = open(dev, O_RDWR);
+    int mdfd = open(dev, O_RDONLY);
     if (mdfd < 0 && errno == EACCES)
         mdfd = open(dev, O_RDONLY);
     if (mdfd < 0) {


And now, when running 'mdadm --stop' here is what I see...

From the output of 'udevadm monitor -pku':

--snip--
KERNEL[297486.536908] offline  /kernel/dlm/62fccfd6-605f-19e6-be6d-99a1e3cb987e (dlm)
ACTION=offline
DEVPATH=/kernel/dlm/62fccfd6-605f-19e6-be6d-99a1e3cb987e
LOCKSPACE=62fccfd6-605f-19e6-be6d-99a1e3cb987e
SEQNUM=3651
SUBSYSTEM=dlm

UDEV  [297486.537541] offline  /kernel/dlm/62fccfd6-605f-19e6-be6d-99a1e3cb987e (dlm)
ACTION=offline
DEVPATH=/kernel/dlm/62fccfd6-605f-19e6-be6d-99a1e3cb987e
LOCKSPACE=62fccfd6-605f-19e6-be6d-99a1e3cb987e
SEQNUM=3651
SUBSYSTEM=dlm
USEC_INITIALIZED=7486537404

KERNEL[297486.538325] remove   /kernel/dlm/62fccfd6-605f-19e6-be6d-99a1e3cb987e (dlm)
ACTION=remove
DEVPATH=/kernel/dlm/62fccfd6-605f-19e6-be6d-99a1e3cb987e
LOCKSPACE=62fccfd6-605f-19e6-be6d-99a1e3cb987e
SEQNUM=3652
SUBSYSTEM=dlm

UDEV  [297486.538644] remove   /kernel/dlm/62fccfd6-605f-19e6-be6d-99a1e3cb987e (dlm)
ACTION=remove
DEVPATH=/kernel/dlm/62fccfd6-605f-19e6-be6d-99a1e3cb987e
LOCKSPACE=62fccfd6-605f-19e6-be6d-99a1e3cb987e
SEQNUM=3652
SUBSYSTEM=dlm
USEC_INITIALIZED=86538345
--snip--

And from the kernel log:

--snip--
[297504.958244] md127: detected capacity change from 73340747776 to 0
[297504.958249] md: md127 stopped.
[297504.958884] dlm: 62fccfd6-605f-19e6-be6d-99a1e3cb987e: leaving the lockspace group...
[297504.959004] udevd[487]: seq 3651 queued, 'offline' 'dlm'
[297504.959161] udevd[487]: seq 3651 forked new worker [5392]
[297504.959417] udevd[5392]: seq 3651 running
[297504.959474] udevd[5392]: no db file to read /run/udev/data/+dlm:62fccfd6-605f-19e6-be6d-99a1e3cb987e: No such file or directory
[297504.959524] udevd[5392]: passed device to netlink monitor 0x2251c30
[297504.959527] udevd[5392]: seq 3651 processed
[297504.960101] dlm: 62fccfd6-605f-19e6-be6d-99a1e3cb987e: group event done 0 0
[297504.960299] dlm: 62fccfd6-605f-19e6-be6d-99a1e3cb987e: release_lockspace final free
[297504.960329] md: unbind<dm-0>
[297504.960448] udevd[487]: seq 3652 queued, 'remove' 'dlm'
[297504.960500] udevd[487]: passed 214 byte device to netlink monitor 0x224b130
[297504.960584] udevd[5392]: seq 3652 running
[297504.960606] udevd[5392]: no db file to read /run/udev/data/+dlm:62fccfd6-605f-19e6-be6d-99a1e3cb987e: No such file or directory
[297504.967168] md: export_rdev(dm-0)
[297504.967231] md: unbind<dm-1>
[297504.975176] md: export_rdev(dm-1)
--snip--

So that did get rid of the synthesized CHANGE event, but there is still
no REMOVE event. =)

I'm still trying to rule out anything strange with my Linux distro /
setup, but I assume that even if udev were mishandling something, we
should still be seeing a REMOVE event on '--stop'.


--Marc


On Wed, Nov 23, 2016 at 6:38 PM, NeilBrown <neilb@suse.com> wrote:
> On Thu, Nov 24 2016, Marc Smith wrote:
>
>> On Tue, Nov 22, 2016 at 6:51 PM, NeilBrown <neilb@suse.com> wrote:
>>> On Wed, Nov 23 2016, Marc Smith wrote:
>>>
>>>> Hi,
>>>>
>>>> Sorry, I'm not trying to beat a dead horse here, but I do feel
>>>> something has changed... I just tested with Linux 4.5.2 and when
>>>> stopping an md array (with mdadm --stop) the entry in /sys/block/ is
>>>> removed, and even the /dev/mdXXX and /dev/md/name link are removed
>>>> properly.
>>>>
>>>> When testing with Linux 4.9-rc3, the entries in /sys/block/ still
>>>> remain (array_state attribute value is "clear") after using mdadm
>>>> --stop and the /dev/mdXXX device exists (the /dev/md/name link is
>>>> removed, by udev I assume).
>>>
>>> With the latest (git) mdadm, what events are reported by "udevadm monitor"?
>>>
>>> I only see remove events, and the entries from /dev and /sys are
>>> removed.
>>>
>>> If I could reproduce your problem, I would fix it...
>>
>> On one set of hosts I can reliably reproduce this issue, and on
>> another system I could previously reproduce this, but now seems to be
>> working fine... I haven't found the connection (same distro / kernel
>> versions on all hosts).
>>
>> # mdadm --version
>> mdadm - v3.4-100-g52a9408 - 26th October 2016
>>
>>
>> From 'mdadm --stop /dev/md/blah1' (non-clustered RAID0 array):
>>
>> --snip--
>> # udevadm monitor -pku
>> calling: monitor
>> monitor will print the received events for:
>> UDEV - the event which udev sends out after rule processing
>> KERNEL - the kernel uevent
>>
>> KERNEL[32930.834312] change   /devices/virtual/block/md126 (block)
>> ACTION=change
>> DEVNAME=/dev/md126
>> DEVPATH=/devices/virtual/block/md126
>> DEVTYPE=disk
>> MAJOR=9
>> MINOR=126
>> SEQNUM=3678
>> SUBSYSTEM=block
>>
>> UDEV  [32930.836032] change   /devices/virtual/block/md126 (block)
>> ACTION=change
>> DEVNAME=/dev/md126
>> DEVPATH=/devices/virtual/block/md126
>> DEVTYPE=disk
>> MAJOR=9
>> MINOR=126
>> SEQNUM=3678
>> SUBSYSTEM=block
>> SYSTEMD_READY=0
>> USEC_INITIALIZED=843336612
>
> Using the recent version of mdadm, and a kernel newer than 4.6 (commit
> 399146b8) a CHANGE event should only be generated when the array is
> started. To see a CHANGE on stop, you must have an older mdadm or an
> older kernel.... unless something else is synthesizing a CHANGE event.
>
> When using a clustered array, you can also see a CHANGE event when a new
> disk is being added to an array.
>
>> --snip--
>>
>> Kernel logs from that:
>> --snip--
>> [32928.465695] md126: detected capacity change from 146681102336 to 0
>> [32928.465699] md: md126 stopped.
>> [32928.465702] md: unbind<dm-3>
>> [32928.465798] udevd[499]: inotify event: 8 for /dev/md126
>> [32928.465964] udevd[499]: device /dev/md126 closed, synthesising 'change'
>> [32928.466029] udevd[499]: seq 3678 queued, 'change' 'block'
>> [32928.466129] udevd[499]: seq 3678 forked new worker [27035]
>> [32928.466357] udevd[27035]: seq 3678 running
>> [32928.466423] udevd[27035]: removing watch on '/dev/md126'
>> [32928.466492] udevd[27035]: IMPORT 'probe-bcache -o udev /dev/md126'
>> /usr/lib/udev/rules.d/69-bcache.rules:16
>> [32928.466712] udevd[27036]: starting 'probe-bcache -o udev /dev/md126'
>> [32928.467540] udevd[27035]: 'probe-bcache -o udev /dev/md126' [27036]
>> exit with return code 0
>> [32928.467564] udevd[27035]: update old name,
>> '/dev/disk/by-id/md-name-tgtnode1.parodyne.com:blah1' no longer
>> belonging to '/devices/virtual/block/md126'
>> [32928.470851] md: export_rdev(dm-3)
>> [32928.470920] md: unbind<dm-2>
>> [32928.478843] md: export_rdev(dm-2)
>> --snip--
>>
>>
>> From 'mdadm --stop /dev/md/asdf1' (clustered RAID1 array):
>>
>> --snip--
>> # udevadm monitor -pku
>> calling: monitor
>> monitor will print the received events for:
>> UDEV - the event which udev sends out after rule processing
>> KERNEL - the kernel uevent
>>
>> KERNEL[34402.247229] change   /devices/virtual/block/md127 (block)
>> ACTION=change
>> DEVNAME=/dev/md127
>> DEVPATH=/devices/virtual/block/md127
>> DEVTYPE=disk
>> MAJOR=9
>> MINOR=127
>> SEQNUM=3679
>> SUBSYSTEM=block
>>
>> KERNEL[34402.247885] offline
>> /kernel/dlm/62fccfd6-605f-19e6-be6d-99a1e3cb987e (dlm)
>> ACTION=offline
>> DEVPATH=/kernel/dlm/62fccfd6-605f-19e6-be6d-99a1e3cb987e
>> LOCKSPACE=62fccfd6-605f-19e6-be6d-99a1e3cb987e
>> SEQNUM=3680
>> SUBSYSTEM=dlm
>>
>> UDEV  [34402.248269] offline
>> /kernel/dlm/62fccfd6-605f-19e6-be6d-99a1e3cb987e (dlm)
>> ACTION=offline
>> DEVPATH=/kernel/dlm/62fccfd6-605f-19e6-be6d-99a1e3cb987e
>> LOCKSPACE=62fccfd6-605f-19e6-be6d-99a1e3cb987e
>> SEQNUM=3680
>> SUBSYSTEM=dlm
>> USEC_INITIALIZED=402248230
>>
>> KERNEL[34402.248841] remove
>> /kernel/dlm/62fccfd6-605f-19e6-be6d-99a1e3cb987e (dlm)
>> ACTION=remove
>> DEVPATH=/kernel/dlm/62fccfd6-605f-19e6-be6d-99a1e3cb987e
>> LOCKSPACE=62fccfd6-605f-19e6-be6d-99a1e3cb987e
>> SEQNUM=3681
>> SUBSYSTEM=dlm
>>
>> UDEV  [34402.248899] change   /devices/virtual/block/md127 (block)
>> ACTION=change
>> DEVNAME=/dev/md127
>> DEVPATH=/devices/virtual/block/md127
>> DEVTYPE=disk
>> MAJOR=9
>> MINOR=127
>> SEQNUM=3679
>> SUBSYSTEM=block
>> SYSTEMD_READY=0
>> USEC_INITIALIZED=1273670
>>
>> UDEV  [34402.248990] remove
>> /kernel/dlm/62fccfd6-605f-19e6-be6d-99a1e3cb987e (dlm)
>> ACTION=remove
>> DEVPATH=/kernel/dlm/62fccfd6-605f-19e6-be6d-99a1e3cb987e
>> LOCKSPACE=62fccfd6-605f-19e6-be6d-99a1e3cb987e
>> SEQNUM=3681
>> SUBSYSTEM=dlm
>> USEC_INITIALIZED=2248905
>> --snip--
>>
>>
>> Kernel logs from that:
>> --snip--
>> [34399.753876] udevd[499]: inotify event: 8 for /dev/md127
>> [34399.765389] md127: detected capacity change from 73340747776 to 0
>> [34399.765393] md: md127 stopped.
>> [34399.765579] udevd[499]: device /dev/md127 closed, synthesising 'change'
>
> That is weird.  udev is synthesizing a CHANGE event.  How nice of it
> (not!).
> It doesn't do this for dm, only other devices.
>
> This happens when an open-for-write is closed.
> If you edit mdopen.c and change the O_RDWR in open_mddev() to O_RDONLY,
> this change will go away.
>
>
>> [34399.765656] udevd[499]: seq 3679 queued, 'change' 'block'
>> [34399.765751] udevd[499]: seq 3679 forked new worker [6317]
>> [34399.765878] udevd[6317]: seq 3679 running
>> [34399.765943] udevd[6317]: removing watch on '/dev/md127'
>> [34399.766012] udevd[6317]: IMPORT 'probe-bcache -o udev /dev/md127'
>> /usr/lib/udev/rules.d/69-bcache.rules:16
>> [34399.766259] udevd[6318]: starting 'probe-bcache -o udev /dev/md127'
>> [34399.766295] dlm: 62fccfd6-605f-19e6-be6d-99a1e3cb987e: leaving the
>> lockspace group...
>> [34399.766421] udevd[499]: seq 3680 queued, 'offline' 'dlm'
>> [34399.766549] udevd[499]: seq 3680 forked new worker [6319]
>> [34399.767080] dlm: 62fccfd6-605f-19e6-be6d-99a1e3cb987e: group event done 0 0
>> [34399.767297] dlm: 62fccfd6-605f-19e6-be6d-99a1e3cb987e:
>> release_lockspace final free
>> [34399.767320] md: unbind<dm-1>
>> [34399.795574] md: export_rdev(dm-1)
>> [34399.795640] md: unbind<dm-0>
>> [34399.803565] md: export_rdev(dm-0)
>> --snip--
>>
>>
>> On the other machines where the md array stopped correctly (removing
>> the entries from /dev and /sys) I do see the 'remove' events with
>> "udevadm monitor". What produces those remove events? Is that
>> something directly from the mdadm tool, or indirectly as part of the
>> stop/tear-down that mdadm initiates?
>
> REMOVE events are generated by
>
> md_free() -> del_gendisk() ->  blk_unregister_queue()
>
> which should happen when the md object is finally released.
> If there is no CHANGE event, then nothing should stop this from
> happening.
>
> NeilBrown


* [PATCH 2/2] raid5-cache: don't set STRIPE_R5C_PARTIAL_STRIPE flag while load stripe into cache
From: Zhengyuan Liu @ 2016-11-26  2:57 UTC (permalink / raw)
  To: shli, songliubraving; +Cc: neilb, linux-raid, liuyun01
In-Reply-To: <1480129034-14700-1-git-send-email-liuzhengyuan@kylinos.cn>

r5c_recovery_load_one_stripe should not set the STRIPE_R5C_PARTIAL_STRIPE
flag, as the data-only stripe may be a STRIPE_R5C_FULL_STRIPE stripe. The
state machine will release the stripe later, add it to either the
r5c_cached_full_stripes list or the r5c_cached_partial_stripes list, and
set the correct flag. We should also fix the corresponding counter.
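
The counter bookkeeping can be modelled in user space as in the sketch
below. This is not the kernel code: the sketch_* types and helpers are
made up, and C11 atomics stand in for the kernel's test_and_clear_bit()
and atomic_dec(). The idea is that a counter is decremented only when the
matching state bit was actually set on the stripe being taken.

```c
#include <stdatomic.h>
#include <stdbool.h>

enum {
	SKETCH_PARTIAL_STRIPE = 1u << 0,
	SKETCH_FULL_STRIPE    = 1u << 1,
};

struct sketch_stripe {
	atomic_uint state;
};

struct sketch_conf {
	atomic_int cached_partial_stripes;
	atomic_int cached_full_stripes;
};

/* Atomically clear a bit and report whether it had been set. */
static bool sketch_test_and_clear(atomic_uint *state, unsigned int bit)
{
	return (atomic_fetch_and(state, ~bit) & bit) != 0;
}

/* Drop the stripe from whichever cached-stripe count it belonged to. */
static void sketch_take_stripe(struct sketch_conf *conf,
			       struct sketch_stripe *sh)
{
	if (sketch_test_and_clear(&sh->state, SKETCH_PARTIAL_STRIPE))
		atomic_fetch_sub(&conf->cached_partial_stripes, 1);
	if (sketch_test_and_clear(&sh->state, SKETCH_FULL_STRIPE))
		atomic_fetch_sub(&conf->cached_full_stripes, 1);
}
```

Because the decrement is tied to the bit being cleared, taking the same
stripe twice does not decrement a counter twice.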

Reviewed-by: JackieLiu <liuyun01@kylinos.cn>
Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
---
 drivers/md/raid5-cache.c |  3 +--
 drivers/md/raid5.c       | 11 +++++++++++
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 9911164..e0ac758 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -1930,9 +1930,8 @@ static void r5c_recovery_load_one_stripe(struct r5l_log *log,
 			set_bit(R5_UPTODATE, &dev->flags);
 		}
 	}
-	set_bit(STRIPE_R5C_PARTIAL_STRIPE, &sh->state);
-	atomic_inc(&conf->r5c_cached_partial_stripes);
 	list_add_tail(&sh->r5c, &log->stripe_in_journal_list);
+	atomic_inc(&log->stripe_in_journal_count);
 }
 
 /*
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index dbab8c7..8120ce4 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -677,12 +677,23 @@ raid5_get_active_stripe(struct r5conf *conf, sector_t sector,
 					atomic_inc(&conf->active_stripes);
 				BUG_ON(list_empty(&sh->lru) &&
 				       !test_bit(STRIPE_EXPANDING, &sh->state));
+
 				inc_empty_inactive_list_flag = 0;
 				if (!list_empty(conf->inactive_list + hash))
 					inc_empty_inactive_list_flag = 1;
 				list_del_init(&sh->lru);
 				if (list_empty(conf->inactive_list + hash) && inc_empty_inactive_list_flag)
 					atomic_inc(&conf->empty_inactive_list_nr);
+
+				if (test_and_clear_bit(STRIPE_R5C_PARTIAL_STRIPE, &sh->state)) {
+					WARN_ON(atomic_read(&conf->r5c_cached_partial_stripes) == 0);
+					atomic_dec(&conf->r5c_cached_partial_stripes);
+				}
+				if (test_and_clear_bit(STRIPE_R5C_FULL_STRIPE, &sh->state)) {
+					WARN_ON(atomic_read(&conf->r5c_cached_full_stripes) == 0);
+					atomic_dec(&conf->r5c_cached_full_stripes);
+				}
+
 				if (sh->group) {
 					sh->group->stripes_cnt--;
 					sh->group = NULL;
-- 
2.7.4





* [PATCH 1/2] raid5-cache: add another check conditon before replaying one stripe
From: Zhengyuan Liu @ 2016-11-26  2:57 UTC (permalink / raw)
  To: shli, songliubraving; +Cc: neilb, linux-raid, liuyun01

A newly allocated stripe does not have the STRIPE_R5C_CACHING state
either, so adding this check condition avoids unnecessarily replaying an
empty stripe.

r5l_recovery_replay_one_stripe resets the stripe in any case, so delete
the separate r5l_recovery_reset_stripe call to make the code cleaner.

Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
---
 drivers/md/raid5-cache.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 5f817bd..9911164 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -1887,9 +1887,9 @@ r5c_recovery_analyze_meta_block(struct r5l_log *log,
 		}
 
 		if (payload->header.type == R5LOG_PAYLOAD_DATA) {
-			if (!test_bit(STRIPE_R5C_CACHING, &sh->state)) {
+			if (!test_bit(STRIPE_R5C_CACHING, &sh->state) &&
+			    test_bit(R5_Wantwrite, &sh->dev[sh->pd_idx].flags)) {
 				r5l_recovery_replay_one_stripe(conf, sh, ctx);
-				r5l_recovery_reset_stripe(sh);
 				sh->log_start = ctx->pos;
 				list_move_tail(&sh->lru, cached_stripe_list);
 			}
-- 
2.7.4





* Re: [BUG] MD/RAID1 hung forever on freeze_array
From: Jinpu Wang @ 2016-11-25 13:59 UTC (permalink / raw)
  To: linux-raid, Shaohua Li; +Cc: Neil Brown, Nate Dailey
In-Reply-To: <CAMGffE=1kp7gMhG+dFxTvhD6VkRuSfbuXbf9vYSWwGG+=vvhDA@mail.gmail.com>

On Fri, Nov 25, 2016 at 2:30 PM, Jinpu Wang <jinpu.wang@profitbricks.com> wrote:
> Hi,
>
> I'm hitting a hung task in mdx_raid1 when running tests. I can reproduce
> it easily with the steps below:
>
> I create one md array from one local loop device and one remote SCSI
> device exported by SRP, run fio with a mixed read/write workload on top
> of md, and force-close the session on the storage side. mdx_raid1 is
> waiting in freeze_array in D state, and many fio processes are also in
> D state in wait_barrier.
>
> [  335.154711] blk_update_request: I/O error, dev sdb, sector 8
> [  335.154855] md: super_written gets error=-5
> [  335.154999] md/raid1:md1: Disk failure on sdb, disabling device.
>                md/raid1:md1: Operation continuing on 1 devices.
> [  335.155258] sd 1:0:0:0: rejecting I/O to offline device
> [  335.155402] blk_update_request: I/O error, dev sdb, sector 80
> [  335.155547] md: super_written gets error=-5
> [  340.158828] scsi host1: ib_srp: reconnect succeeded
> [  373.017608] md/raid1:md1: redirecting sector 616617 to other mirror: loop1
> [  373.110527] md/raid1:md1: redirecting sector 1320893 to other mirror: loop1
> [  373.117230] md/raid1:md1: redirecting sector 1564499 to other mirror: loop1
> [  373.127652] md/raid1:md1: redirecting sector 104034 to other mirror: loop1
> [  373.135665] md/raid1:md1: redirecting sector 1209765 to other mirror: loop1
> [  373.145634] md/raid1:md1: redirecting sector 51200 to other mirror: loop1
> [  373.158824] md/raid1:md1: redirecting sector 755750 to other mirror: loop1
> [  373.169964] md/raid1:md1: redirecting sector 1681631 to other mirror: loop1
> [  373.178619] md/raid1:md1: redirecting sector 1894296 to other mirror: loop1
> [  373.186153] md/raid1:md1: redirecting sector 1905016 to other mirror: loop1
> [  374.364370] RAID1 conf printout:
> [  374.364377]  --- wd:1 rd:2
> [  374.364379]  disk 0, wo:1, o:0, dev:sdb
> [  374.364381]  disk 1, wo:0, o:1, dev:loop1
> [  374.437099] RAID1 conf printout:
> [  374.437103]  --- wd:1 rd:2
> snip
>
>
> [  810.266112] sysrq: SysRq : Show Blocked State
> [  810.266235]   task                        PC stack   pid father
> [  810.266362] md1_raid1       D ffff88022d927c48     0  4022      2 0x00000000
> [  810.266487]  ffff88022d927c48 ffff8802351a0000 ffff8800b91bc100 000000008010000e
> [  810.266747]  ffff88022d927c30 ffff88022d928000 0000000000000001 ffff880233b49b70
> [  810.266975]  ffff880233b49b88 ffff8802325d5a40 ffff88022d927c60 ffffffff81810600
> [  810.267203] Call Trace:
> [  810.267322]  [<ffffffff81810600>] schedule+0x30/0x80
> [  810.267437]  [<ffffffffa01342c1>] freeze_array+0x71/0xc0 [raid1]
> [  810.267555]  [<ffffffff81095480>] ? wake_atomic_t_function+0x70/0x70
> [  810.267669]  [<ffffffffa013578b>] handle_read_error+0x3b/0x570 [raid1]
> [  810.267816]  [<ffffffff81185783>] ? kmem_cache_free+0x183/0x190
> [  810.267929]  [<ffffffff81094e36>] ? __wake_up+0x46/0x60
> [  810.268045]  [<ffffffffa0136dcd>] raid1d+0x20d/0xfc0 [raid1]
> [  810.268159]  [<ffffffff81813043>] ? schedule_timeout+0x1a3/0x230
> [  810.268274]  [<ffffffff8180fe77>] ? __schedule+0x2e7/0xa40
> [  810.268391]  [<ffffffffa0211839>] md_thread+0x119/0x120 [md_mod]
> [  810.268508]  [<ffffffff81095480>] ? wake_atomic_t_function+0x70/0x70
> [  810.268624]  [<ffffffffa0211720>] ? find_pers+0x70/0x70 [md_mod]
> [  810.268741]  [<ffffffff81075614>] kthread+0xc4/0xe0
> [  810.268853]  [<ffffffff81075550>] ? kthread_worker_fn+0x150/0x150
> [  810.268970]  [<ffffffff8181415f>] ret_from_fork+0x3f/0x70
> [  810.269114]  [<ffffffff81075550>] ? kthread_worker_fn+0x150/0x150
> [  810.269227] fio             D ffff8802325137a0     0  4212   4206 0x00000000
> [  810.269347]  ffff8802325137a0 ffff88022de3db00 ffff8800ba7bb400 0000000000000000
> [  810.269574]  ffff880233b49b00 ffff880232513788 ffff880232514000 ffff880233b49b88
> [  810.269801]  ffff880233b49b70 ffff8800ba7bb400 ffff8800b5f5db00 ffff8802325137b8
> [  810.270028] Call Trace:
> [  810.270138]  [<ffffffff81810600>] schedule+0x30/0x80
> [  810.270282]  [<ffffffffa0133727>] wait_barrier+0x117/0x1f0 [raid1]
> [  810.270396]  [<ffffffff81095480>] ? wake_atomic_t_function+0x70/0x70
> [  810.270513]  [<ffffffffa0135d72>] make_request+0xb2/0xd80 [raid1]
> [  810.270628]  [<ffffffffa02123fc>] md_make_request+0xec/0x230 [md_mod]
> [  810.270746]  [<ffffffff813f96f9>] ? generic_make_request_checks+0x219/0x500
> [  810.270860]  [<ffffffff813fc851>] blk_prologue_bio+0x91/0xc0
> [  810.270976]  [<ffffffff813fc230>] generic_make_request+0xe0/0x1b0
> [  810.271090]  [<ffffffff813fc362>] submit_bio+0x62/0x140
> [  810.271209]  [<ffffffff811d2bbc>] do_blockdev_direct_IO+0x289c/0x33c0
> [  810.271323]  [<ffffffff81810600>] ? schedule+0x30/0x80
> [  810.271468]  [<ffffffff811cd620>] ? I_BDEV+0x10/0x10
> [  810.271580]  [<ffffffff811d371e>] __blockdev_direct_IO+0x3e/0x40
> [  810.271696]  [<ffffffff811cdfb7>] blkdev_direct_IO+0x47/0x50
> [  810.271828]  [<ffffffff81132cbf>] generic_file_read_iter+0x44f/0x570
> [  810.271949]  [<ffffffff811ceaa0>] ? blkdev_write_iter+0x110/0x110
> [  810.272062]  [<ffffffff811cead0>] blkdev_read_iter+0x30/0x40
> [  810.272179]  [<ffffffff811de5a6>] aio_run_iocb+0x126/0x2b0
> [  810.272291]  [<ffffffff8181209d>] ? mutex_lock+0xd/0x30
> [  810.272407]  [<ffffffff811ddd04>] ? aio_read_events+0x284/0x370
> [  810.272521]  [<ffffffff81183c29>] ? kmem_cache_alloc+0xd9/0x180
> [  810.272665]  [<ffffffff811df438>] ? do_io_submit+0x178/0x4a0
> [  810.272778]  [<ffffffff811df4ed>] do_io_submit+0x22d/0x4a0
> [  810.272895]  [<ffffffff811df76b>] SyS_io_submit+0xb/0x10
> [  810.273007]  [<ffffffff81813e17>] entry_SYSCALL_64_fastpath+0x12/0x66
> [  810.273130] fio             D ffff88022fa6f730     0  4213   4206 0x00000000
> [  810.273247]  ffff88022fa6f730 ffff8800b549a700 ffff8800af703400
> 0000000002011200
> [  810.273475]  ffff880236001700 ffff88022fa6f718 ffff88022fa70000
> ffff880233b49b88
> [  810.273702]  ffff880233b49b70 ffff8800af703400 ffff88022f843700
> ffff88022fa6f748
> [  810.273958] Call Trace:
> [  810.274070]  [<ffffffff81810600>] schedule+0x30/0x80
> [  810.274183]  [<ffffffffa0133727>] wait_barrier+0x117/0x1f0 [raid1]
> [  810.274300]  [<ffffffff81095480>] ? wake_atomic_t_function+0x70/0x70
> [  810.274413]  [<ffffffffa0135d72>] make_request+0xb2/0xd80 [raid1]
> [  810.274537]  [<ffffffff81408f15>] ? __bt_get.isra.7+0xd5/0x1b0
> [  810.274650]  [<ffffffff81094feb>] ? finish_wait+0x5b/0x80
> [  810.274766]  [<ffffffff8140917f>] ? bt_get+0x18f/0x1b0
> [  810.274881]  [<ffffffffa02123fc>] md_make_request+0xec/0x230 [md_mod]
> [  810.274998]  [<ffffffff813f96f9>] ? generic_make_request_checks+0x219/0x500
> [  810.275144]  [<ffffffff813fc851>] blk_prologue_bio+0x91/0xc0
> [  810.275257]  [<ffffffff813fc230>] generic_make_request+0xe0/0x1b0
> [  810.275373]  [<ffffffff813fc362>] submit_bio+0x62/0x140
> [  810.275486]  [<ffffffff811d2bbc>] do_blockdev_direct_IO+0x289c/0x33c0
> [  810.275607]  [<ffffffff811cd620>] ? I_BDEV+0x10/0x10
> [  810.275721]  [<ffffffff811d371e>] __blockdev_direct_IO+0x3e/0x40
> [  810.275843]  [<ffffffff811cdfb7>] blkdev_direct_IO+0x47/0x50
> [  810.275956]  [<ffffffff81132e8c>] generic_file_direct_write+0xac/0x170
> [  810.276073]  [<ffffffff8113301d>] __generic_file_write_iter+0xcd/0x1f0
> [  810.276187]  [<ffffffff811ce990>] ? blkdev_close+0x30/0x30
> [  810.276332]  [<ffffffff811cea17>] blkdev_write_iter+0x87/0x110
> [  810.276445]  [<ffffffff811de6d0>] aio_run_iocb+0x250/0x2b0
> [  810.276560]  [<ffffffff8181209d>] ? mutex_lock+0xd/0x30
> [  810.276673]  [<ffffffff811ddd04>] ? aio_read_events+0x284/0x370
> [  810.276786]  [<ffffffff81183c29>] ? kmem_cache_alloc+0xd9/0x180
> [  810.276902]  [<ffffffff811df438>] ? do_io_submit+0x178/0x4a0
> [  810.277015]  [<ffffffff811df4ed>] do_io_submit+0x22d/0x4a0
> [  810.277131]  [<ffffffff811df76b>] SyS_io_submit+0xb/0x10
> [  810.277244]  [<ffffffff81813e17>] entry_SYSCALL_64_fastpath+0x12/0x66
> I dumped r1conf in crash:
> struct r1conf {
>   mddev = 0xffff88022d761800,
>   mirrors = 0xffff88023456a180,
>   raid_disks = 2,
>   next_resync = 18446744073709527039,
>   start_next_window = 18446744073709551615,
>   current_window_requests = 0,
>   next_window_requests = 0,
>   device_lock = {
>     {
>       rlock = {
>         raw_lock = {
>           val = {
>             counter = 0
>           }
>         }
>       }
>     }
>   },
>   retry_list = {
>     next = 0xffff8800b5fe3b40,
>     prev = 0xffff8800b50164c0
>   },
>   bio_end_io_list = {
>     next = 0xffff88022fcd45c0,
>     prev = 0xffff8800b53d57c0
>   },
>   pending_bio_list = {
>     head = 0x0,
>     tail = 0x0
>   },
>   pending_count = 0,
>   wait_barrier = {
>     lock = {
>       {
>         rlock = {
>           raw_lock = {
>             val = {
>               counter = 0
>             }
>           }
>         }
>       }
>     },
>     task_list = {
>       next = 0xffff8800b51d37e0,
>       prev = 0xffff88022fbbb770
>     }
>   },
>   resync_lock = {
>     {
>       rlock = {
>         raw_lock = {
>           val = {
>             counter = 0
>           }
>         }
>       }
>     }
>   },
>   nr_pending = 406,
>   nr_waiting = 100,
>   nr_queued = 404,
>   barrier = 0,
>   array_frozen = 1,
>   fullsync = 0,
>   recovery_disabled = 1,
>   poolinfo = 0xffff88022d829bb0,
>   r1bio_pool = 0xffff88022b4512a0,
>   r1buf_pool = 0x0,
>   tmppage = 0xffffea0008c97b00,
>   thread = 0x0,
>   cluster_sync_low = 0,
>   cluster_sync_high = 0
> }
>
> Every time, nr_pending is 1 bigger than (nr_queued + 1), so it seems we
> forgot to increase nr_queued somewhere?
>
> I've noticed commit ccfc7bf1f09d61 ("raid1: include bio_end_io_list in
> nr_queued to prevent freeze_array hang"). It seems to have fixed a similar bug.
>
> Could you give your suggestion?
>
Sorry, I forgot to mention that the kernel version is 4.4.28.


-- 
Jinpu Wang
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin

Tel:       +49 30 577 008  042
Fax:      +49 30 577 008 299
Email:    jinpu.wang@profitbricks.com
URL:      https://www.profitbricks.de

Registered office: Berlin
Commercial register: Amtsgericht Charlottenburg, HRB 125506 B
Managing director: Achim Weiss

^ permalink raw reply

* [BUG] MD/RAID1 hung forever on freeze_array
From: Jinpu Wang @ 2016-11-25 13:30 UTC (permalink / raw)
  To: linux-raid, Shaohua Li; +Cc: Neil Brown, Nate Dailey

Hi,

I'm hitting a hung task in mdX_raid1 when running tests. I can reproduce
it easily with the setup below:

I create one md array from one local loop device and one remote SCSI
device exported over SRP, run fio with a mixed read/write workload on top
of the array, and then force-close the session on the storage side.
mdX_raid1 is then stuck waiting in freeze_array in D state, and a lot of
fio processes are also in D state in wait_barrier.

[  335.154711] blk_update_request: I/O error, dev sdb, sector 8
[  335.154855] md: super_written gets error=-5
[  335.154999] md/raid1:md1: Disk failure on sdb, disabling device.
               md/raid1:md1: Operation continuing on 1 devices.
[  335.155258] sd 1:0:0:0: rejecting I/O to offline device
[  335.155402] blk_update_request: I/O error, dev sdb, sector 80
[  335.155547] md: super_written gets error=-5
[  340.158828] scsi host1: ib_srp: reconnect succeeded
[  373.017608] md/raid1:md1: redirecting sector 616617 to other mirror: loop1
[  373.110527] md/raid1:md1: redirecting sector 1320893 to other mirror: loop1
[  373.117230] md/raid1:md1: redirecting sector 1564499 to other mirror: loop1
[  373.127652] md/raid1:md1: redirecting sector 104034 to other mirror: loop1
[  373.135665] md/raid1:md1: redirecting sector 1209765 to other mirror: loop1
[  373.145634] md/raid1:md1: redirecting sector 51200 to other mirror: loop1
[  373.158824] md/raid1:md1: redirecting sector 755750 to other mirror: loop1
[  373.169964] md/raid1:md1: redirecting sector 1681631 to other mirror: loop1
[  373.178619] md/raid1:md1: redirecting sector 1894296 to other mirror: loop1
[  373.186153] md/raid1:md1: redirecting sector 1905016 to other mirror: loop1
[  374.364370] RAID1 conf printout:
[  374.364377]  --- wd:1 rd:2
[  374.364379]  disk 0, wo:1, o:0, dev:sdb
[  374.364381]  disk 1, wo:0, o:1, dev:loop1
[  374.437099] RAID1 conf printout:
[  374.437103]  --- wd:1 rd:2
snip


[  810.266112] sysrq: SysRq : Show Blocked State
[  810.266235]   task                        PC stack   pid father
[  810.266362] md1_raid1       D ffff88022d927c48     0  4022      2 0x00000000
[  810.266487]  ffff88022d927c48 ffff8802351a0000 ffff8800b91bc100
000000008010000e
[  810.266747]  ffff88022d927c30 ffff88022d928000 0000000000000001
ffff880233b49b70
[  810.266975]  ffff880233b49b88 ffff8802325d5a40 ffff88022d927c60
ffffffff81810600
[  810.267203] Call Trace:
[  810.267322]  [<ffffffff81810600>] schedule+0x30/0x80
[  810.267437]  [<ffffffffa01342c1>] freeze_array+0x71/0xc0 [raid1]
[  810.267555]  [<ffffffff81095480>] ? wake_atomic_t_function+0x70/0x70
[  810.267669]  [<ffffffffa013578b>] handle_read_error+0x3b/0x570 [raid1]
[  810.267816]  [<ffffffff81185783>] ? kmem_cache_free+0x183/0x190
[  810.267929]  [<ffffffff81094e36>] ? __wake_up+0x46/0x60
[  810.268045]  [<ffffffffa0136dcd>] raid1d+0x20d/0xfc0 [raid1]
[  810.268159]  [<ffffffff81813043>] ? schedule_timeout+0x1a3/0x230
[  810.268274]  [<ffffffff8180fe77>] ? __schedule+0x2e7/0xa40
[  810.268391]  [<ffffffffa0211839>] md_thread+0x119/0x120 [md_mod]
[  810.268508]  [<ffffffff81095480>] ? wake_atomic_t_function+0x70/0x70
[  810.268624]  [<ffffffffa0211720>] ? find_pers+0x70/0x70 [md_mod]
[  810.268741]  [<ffffffff81075614>] kthread+0xc4/0xe0
[  810.268853]  [<ffffffff81075550>] ? kthread_worker_fn+0x150/0x150
[  810.268970]  [<ffffffff8181415f>] ret_from_fork+0x3f/0x70
[  810.269114]  [<ffffffff81075550>] ? kthread_worker_fn+0x150/0x150
[  810.269227] fio             D ffff8802325137a0     0  4212   4206 0x00000000
[  810.269347]  ffff8802325137a0 ffff88022de3db00 ffff8800ba7bb400
0000000000000000
[  810.269574]  ffff880233b49b00 ffff880232513788 ffff880232514000
ffff880233b49b88
[  810.269801]  ffff880233b49b70 ffff8800ba7bb400 ffff8800b5f5db00
ffff8802325137b8
[  810.270028] Call Trace:
[  810.270138]  [<ffffffff81810600>] schedule+0x30/0x80
[  810.270282]  [<ffffffffa0133727>] wait_barrier+0x117/0x1f0 [raid1]
[  810.270396]  [<ffffffff81095480>] ? wake_atomic_t_function+0x70/0x70
[  810.270513]  [<ffffffffa0135d72>] make_request+0xb2/0xd80 [raid1]
[  810.270628]  [<ffffffffa02123fc>] md_make_request+0xec/0x230 [md_mod]
[  810.270746]  [<ffffffff813f96f9>] ? generic_make_request_checks+0x219/0x500
[  810.270860]  [<ffffffff813fc851>] blk_prologue_bio+0x91/0xc0
[  810.270976]  [<ffffffff813fc230>] generic_make_request+0xe0/0x1b0
[  810.271090]  [<ffffffff813fc362>] submit_bio+0x62/0x140
[  810.271209]  [<ffffffff811d2bbc>] do_blockdev_direct_IO+0x289c/0x33c0
[  810.271323]  [<ffffffff81810600>] ? schedule+0x30/0x80
[  810.271468]  [<ffffffff811cd620>] ? I_BDEV+0x10/0x10
[  810.271580]  [<ffffffff811d371e>] __blockdev_direct_IO+0x3e/0x40
[  810.271696]  [<ffffffff811cdfb7>] blkdev_direct_IO+0x47/0x50
[  810.271828]  [<ffffffff81132cbf>] generic_file_read_iter+0x44f/0x570
[  810.271949]  [<ffffffff811ceaa0>] ? blkdev_write_iter+0x110/0x110
[  810.272062]  [<ffffffff811cead0>] blkdev_read_iter+0x30/0x40
[  810.272179]  [<ffffffff811de5a6>] aio_run_iocb+0x126/0x2b0
[  810.272291]  [<ffffffff8181209d>] ? mutex_lock+0xd/0x30
[  810.272407]  [<ffffffff811ddd04>] ? aio_read_events+0x284/0x370
[  810.272521]  [<ffffffff81183c29>] ? kmem_cache_alloc+0xd9/0x180
[  810.272665]  [<ffffffff811df438>] ? do_io_submit+0x178/0x4a0
[  810.272778]  [<ffffffff811df4ed>] do_io_submit+0x22d/0x4a0
[  810.272895]  [<ffffffff811df76b>] SyS_io_submit+0xb/0x10
[  810.273007]  [<ffffffff81813e17>] entry_SYSCALL_64_fastpath+0x12/0x66
[  810.273130] fio             D ffff88022fa6f730     0  4213   4206 0x00000000
[  810.273247]  ffff88022fa6f730 ffff8800b549a700 ffff8800af703400
0000000002011200
[  810.273475]  ffff880236001700 ffff88022fa6f718 ffff88022fa70000
ffff880233b49b88
[  810.273702]  ffff880233b49b70 ffff8800af703400 ffff88022f843700
ffff88022fa6f748
[  810.273958] Call Trace:
[  810.274070]  [<ffffffff81810600>] schedule+0x30/0x80
[  810.274183]  [<ffffffffa0133727>] wait_barrier+0x117/0x1f0 [raid1]
[  810.274300]  [<ffffffff81095480>] ? wake_atomic_t_function+0x70/0x70
[  810.274413]  [<ffffffffa0135d72>] make_request+0xb2/0xd80 [raid1]
[  810.274537]  [<ffffffff81408f15>] ? __bt_get.isra.7+0xd5/0x1b0
[  810.274650]  [<ffffffff81094feb>] ? finish_wait+0x5b/0x80
[  810.274766]  [<ffffffff8140917f>] ? bt_get+0x18f/0x1b0
[  810.274881]  [<ffffffffa02123fc>] md_make_request+0xec/0x230 [md_mod]
[  810.274998]  [<ffffffff813f96f9>] ? generic_make_request_checks+0x219/0x500
[  810.275144]  [<ffffffff813fc851>] blk_prologue_bio+0x91/0xc0
[  810.275257]  [<ffffffff813fc230>] generic_make_request+0xe0/0x1b0
[  810.275373]  [<ffffffff813fc362>] submit_bio+0x62/0x140
[  810.275486]  [<ffffffff811d2bbc>] do_blockdev_direct_IO+0x289c/0x33c0
[  810.275607]  [<ffffffff811cd620>] ? I_BDEV+0x10/0x10
[  810.275721]  [<ffffffff811d371e>] __blockdev_direct_IO+0x3e/0x40
[  810.275843]  [<ffffffff811cdfb7>] blkdev_direct_IO+0x47/0x50
[  810.275956]  [<ffffffff81132e8c>] generic_file_direct_write+0xac/0x170
[  810.276073]  [<ffffffff8113301d>] __generic_file_write_iter+0xcd/0x1f0
[  810.276187]  [<ffffffff811ce990>] ? blkdev_close+0x30/0x30
[  810.276332]  [<ffffffff811cea17>] blkdev_write_iter+0x87/0x110
[  810.276445]  [<ffffffff811de6d0>] aio_run_iocb+0x250/0x2b0
[  810.276560]  [<ffffffff8181209d>] ? mutex_lock+0xd/0x30
[  810.276673]  [<ffffffff811ddd04>] ? aio_read_events+0x284/0x370
[  810.276786]  [<ffffffff81183c29>] ? kmem_cache_alloc+0xd9/0x180
[  810.276902]  [<ffffffff811df438>] ? do_io_submit+0x178/0x4a0
[  810.277015]  [<ffffffff811df4ed>] do_io_submit+0x22d/0x4a0
[  810.277131]  [<ffffffff811df76b>] SyS_io_submit+0xb/0x10
[  810.277244]  [<ffffffff81813e17>] entry_SYSCALL_64_fastpath+0x12/0x66
I dumped r1conf in crash:
struct r1conf {
  mddev = 0xffff88022d761800,
  mirrors = 0xffff88023456a180,
  raid_disks = 2,
  next_resync = 18446744073709527039,
  start_next_window = 18446744073709551615,
  current_window_requests = 0,
  next_window_requests = 0,
  device_lock = {
    {
      rlock = {
        raw_lock = {
          val = {
            counter = 0
          }
        }
      }
    }
  },
  retry_list = {
    next = 0xffff8800b5fe3b40,
    prev = 0xffff8800b50164c0
  },
  bio_end_io_list = {
    next = 0xffff88022fcd45c0,
    prev = 0xffff8800b53d57c0
  },
  pending_bio_list = {
    head = 0x0,
    tail = 0x0
  },
  pending_count = 0,
  wait_barrier = {
    lock = {
      {
        rlock = {
          raw_lock = {
            val = {
              counter = 0
            }
          }
        }
      }
    },
    task_list = {
      next = 0xffff8800b51d37e0,
      prev = 0xffff88022fbbb770
    }
  },
  resync_lock = {
    {
      rlock = {
        raw_lock = {
          val = {
            counter = 0
          }
        }
      }
    }
  },
  nr_pending = 406,
  nr_waiting = 100,
  nr_queued = 404,
  barrier = 0,
  array_frozen = 1,
  fullsync = 0,
  recovery_disabled = 1,
  poolinfo = 0xffff88022d829bb0,
  r1bio_pool = 0xffff88022b4512a0,
  r1buf_pool = 0x0,
  tmppage = 0xffffea0008c97b00,
  thread = 0x0,
  cluster_sync_low = 0,
  cluster_sync_high = 0
}

Every time, nr_pending is 1 bigger than (nr_queued + 1), so it seems we
forgot to increase nr_queued somewhere?

I've noticed commit ccfc7bf1f09d61 ("raid1: include bio_end_io_list in
nr_queued to prevent freeze_array hang"). It seems to have fixed a similar bug.

Could you give your suggestion?
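
To make the counter mismatch concrete, here is a minimal userspace sketch
of the condition freeze_array() is waiting for. It assumes (from my reading
of raid1.c in 4.4, not from anything shown in this report) that
freeze_array(conf, extra) sleeps until nr_pending == nr_queued + extra and
that handle_read_error() calls it with extra = 1. The struct and helper
names are hypothetical and only model the arithmetic.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the two r1conf counters of interest. */
struct r1_counters {
	int nr_pending;	/* requests that passed wait_barrier() and are in flight */
	int nr_queued;	/* requests handed back to raid1d for retry/completion */
};

/*
 * Assumed wake-up condition of freeze_array(conf, extra): every in-flight
 * request is either queued for raid1d or is one of the 'extra' requests
 * the caller itself is holding.
 */
static bool freeze_can_proceed(const struct r1_counters *c, int extra)
{
	assert(extra >= 0);
	return c->nr_pending == c->nr_queued + extra;
}

/* In-flight requests not accounted for by nr_queued + extra. */
static int unaccounted_requests(const struct r1_counters *c, int extra)
{
	assert(extra >= 0);
	return c->nr_pending - (c->nr_queued + extra);
}
```

With the values from the dump above (nr_pending = 406, nr_queued = 404,
extra = 1) the predicate is false and exactly one request is unaccounted
for, which is what keeps freeze_array() asleep.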


-- 
Jinpu Wang
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin

Tel:       +49 30 577 008  042
Fax:      +49 30 577 008 299
Email:    jinpu.wang@profitbricks.com
URL:      https://www.profitbricks.de

Registered office: Berlin
Commercial register: Amtsgericht Charlottenburg, HRB 125506 B
Managing director: Achim Weiss

^ permalink raw reply

* [PATCH] raid5-cache: Fix the logic of raid5-cache recovery
From: Jackie Liu @ 2016-11-25 11:39 UTC (permalink / raw)
  To: Song Liu; +Cc: linux-raid, 刘正元, shli
In-Reply-To: <20161119072057.1302854-1-songliubraving@fb.com>

Hi Song,

I have a doubt about r5l_recovery_log. I think we need to write an empty
meta block first, and only then call r5c_recovery_rewrite_data_only_stripes.
This empty block will be marked as the last_checkpoint. Once the CACHING
stripes have been rewritten, the superblock should be updated. Also, we
cannot release the stripe_head early (in r5c_recovery_flush_log), because
it is still used in r5c_recovery_rewrite_data_only_stripes.

Here is the patch:

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 5f817bd..fad1808 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -67,7 +67,7 @@ static char *r5c_journal_mode_str[] = {"write-through",
 /*
  * raid5 cache state machine
  *
- * With rhe RAID cache, each stripe works in two phases:
+ * With the RAID cache, each stripe works in two phases:
  *	- caching phase
  *	- writing-out phase
  *
@@ -1674,7 +1674,6 @@ r5l_recovery_replay_one_stripe(struct r5conf *conf,

 static struct stripe_head *
 r5c_recovery_alloc_stripe(struct r5conf *conf,
-			  struct list_head *recovery_list,
 			  sector_t stripe_sect,
 			  sector_t log_start)
 {
@@ -1855,8 +1854,8 @@ r5c_recovery_analyze_meta_block(struct r5l_log *log,
 						stripe_sect);

 		if (!sh) {
-			sh = r5c_recovery_alloc_stripe(conf, cached_stripe_list,
-						       stripe_sect, ctx->pos);
+			sh = r5c_recovery_alloc_stripe(conf, stripe_sect,
+							ctx->pos);
 			/*
 			 * cannot get stripe from raid5_get_active_stripe
 			 * try replay some stripes
@@ -1865,8 +1864,7 @@ r5c_recovery_analyze_meta_block(struct r5l_log *log,
 				r5c_recovery_replay_stripes(
 					cached_stripe_list, ctx);
 				sh = r5c_recovery_alloc_stripe(
-					conf, cached_stripe_list,
-					stripe_sect, ctx->pos);
+					conf, stripe_sect, ctx->pos);
 			}
 			if (!sh) {
 				pr_debug("md/raid:%s: Increasing stripe cache size to %d to recovery data on journal.\n",
@@ -1875,8 +1873,7 @@ r5c_recovery_analyze_meta_block(struct r5l_log *log,
 				raid5_set_cache_size(mddev,
 						     conf->min_nr_stripes * 2);
 				sh = r5c_recovery_alloc_stripe(
-					conf, cached_stripe_list, stripe_sect,
-					ctx->pos);
+					conf, stripe_sect, ctx->pos);
 			}
 			if (!sh) {
 				pr_err("md/raid:%s: Cannot get enough stripes due to memory pressure. Recovery failed.\n",
@@ -1986,8 +1983,6 @@ static int r5c_recovery_flush_log(struct r5l_log *log,
 	list_for_each_entry_safe(sh, next, &ctx->cached_list, lru) {
 		WARN_ON(!test_bit(STRIPE_R5C_CACHING, &sh->state));
 		r5c_recovery_load_one_stripe(log, sh);
-		list_del_init(&sh->lru);
-		raid5_release_stripe(sh);
 		ctx->data_only_stripes++;
 	}

@@ -2078,7 +2073,6 @@ r5c_recovery_rewrite_data_only_stripes(struct r5l_log *log,
 		return -ENOMEM;
 	}

-	ctx->seq += 10;
 	list_for_each_entry(sh, &ctx->cached_list, lru) {
 		struct r5l_meta_block *mb;
 		int i;
@@ -2090,7 +2084,7 @@ r5c_recovery_rewrite_data_only_stripes(struct r5l_log *log,
 						     ctx->pos, ctx->seq);
 		mb = page_address(page);
 		offset = le32_to_cpu(mb->meta_size);
-		write_pos = ctx->pos + BLOCK_SECTORS;
+		write_pos = r5l_ring_add(log, ctx->pos, BLOCK_SECTORS);

 		for (i = sh->disks; i--; ) {
 			struct r5dev *dev = &sh->dev[i];
@@ -2125,6 +2119,9 @@ r5c_recovery_rewrite_data_only_stripes(struct r5l_log *log,
 		sh->log_start = ctx->pos;
 		ctx->pos = write_pos;
 		ctx->seq += 1;
+
+		list_del_init(&sh->lru);
+		raid5_release_stripe(sh);
 	}
 	__free_page(page);
 	return 0;
@@ -2135,6 +2132,7 @@ static int r5l_recovery_log(struct r5l_log *log)
 	struct mddev *mddev = log->rdev->mddev;
 	struct r5l_recovery_ctx ctx;
 	int ret;
+	sector_t pos;

 	ctx.pos = log->last_checkpoint;
 	ctx.seq = log->last_cp_seq;
@@ -2152,6 +2150,10 @@ static int r5l_recovery_log(struct r5l_log *log)
 	if (ret)
 		return ret;

+	pos = ctx.pos;
+	r5l_log_write_empty_meta_block(log, ctx.pos, (ctx.seq += 10));
+	ctx.pos = r5l_ring_add(log, ctx.pos, BLOCK_SECTORS);
+
 	if ((ctx.data_only_stripes == 0) && (ctx.data_parity_stripes == 0))
 		pr_debug("md/raid:%s: starting from clean shutdown\n",
 			 mdname(mddev));
@@ -2170,9 +2172,9 @@ static int r5l_recovery_log(struct r5l_log *log)

 	log->log_start = ctx.pos;
 	log->next_checkpoint = ctx.pos;
+	log->last_checkpoint = pos;
 	log->seq = ctx.seq;
-	r5l_log_write_empty_meta_block(log, ctx.pos, ctx.seq);
-	r5l_write_super(log, ctx.pos);
+	r5l_write_super(log, pos);
 	return 0;
 }

--
2.7.4

Thanks.
Jackie
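
One hunk above replaces "ctx->pos + BLOCK_SECTORS" with r5l_ring_add().
The following is a minimal userspace sketch of why that matters, assuming
the log is a ring of device_size sectors and that BLOCK_SECTORS is 8 (one
4 KiB block). The names log_model and ring_add are hypothetical stand-ins,
not the kernel definitions.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t sector_t;

#define BLOCK_SECTORS 8	/* assumed: one 4 KiB block in 512-byte sectors */

/* Hypothetical stand-in for the single r5l_log field used here. */
struct log_model {
	sector_t device_size;	/* usable log size in sectors */
};

/*
 * Advance a log position by 'inc' sectors, wrapping back to the start of
 * the log when the end of the device is reached.
 */
static sector_t ring_add(const struct log_model *log, sector_t start,
			 sector_t inc)
{
	assert(start < log->device_size && inc <= log->device_size);
	start += inc;
	if (start >= log->device_size)
		start -= log->device_size;
	return start;
}
```

For a 1024-sector log and a meta block at sector 1016, plain addition
yields 1024, which is past the end of the log, while the wrapped addition
yields 0.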

^ permalink raw reply related

* Re: [PATCH 05/12] raid5-ppl: Partial Parity Log implementation
From: Artur Paszkiewicz @ 2016-11-25  8:52 UTC (permalink / raw)
  To: kbuild test robot; +Cc: kbuild-all, shli, linux-raid
In-Reply-To: <201611251045.EItygkLe%fengguang.wu@intel.com>

On 11/25/2016 03:26 AM, kbuild test robot wrote:
> Hi Artur,
> 
> [auto build test ERROR on next-20161124]
> [cannot apply to md/for-next v4.9-rc6 v4.9-rc5 v4.9-rc4 v4.9-rc6]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
> 
> url:    https://github.com/0day-ci/linux/commits/Artur-Paszkiewicz/Partial-Parity-Log-for-MD-RAID-5/20161125-100404
> config: m68k-sun3_defconfig (attached as .config)
> compiler: m68k-linux-gcc (GCC) 4.9.0
> reproduce:
>         wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
>         chmod +x ~/bin/make.cross
>         # save the attached .config to linux build tree
>         make.cross ARCH=m68k 
> 
> All errors (new ones prefixed by >>):
> 
>    drivers/md/raid5-ppl.c: In function 'ppl_init_log_child':
>>> drivers/md/raid5-ppl.c:429:2: error: too few arguments to function 'bio_init'
>      bio_init(&log->flush_bio);
>      ^
>    In file included from include/linux/blkdev.h:19:0,
>                     from drivers/md/raid5-ppl.c:16:
>    include/linux/bio.h:423:13: note: declared here
>     extern void bio_init(struct bio *bio, struct bio_vec *table,
>                 ^

This whole patchset applies cleanly on Shaohua's md for-next tree. That
tree does not contain the patches that change bio_init().

Artur

^ permalink raw reply

* Re: [PATCH v5] md/r5cache: handle alloc_page failure
From: NeilBrown @ 2016-11-25  3:38 UTC (permalink / raw)
  To: linux-raid
  Cc: shli, kernel-team, dan.j.williams, hch, liuzhengyuang521,
	liuzhengyuan, Song Liu
In-Reply-To: <20161124065039.2151784-1-songliubraving@fb.com>

[-- Attachment #1: Type: text/plain, Size: 8706 bytes --]

On Thu, Nov 24 2016, Song Liu wrote:

> RMW of r5c write back cache uses an extra page to store old data for
> prexor. handle_stripe_dirtying() allocates this page by calling
> alloc_page(). However, alloc_page() may fail.
>
> To handle alloc_page() failures, this patch adds an extra page to
> disk_info. When alloc_page() fails, handle_stripe() tries to use these
> pages. When these pages are in use by another stripe
> (R5C_EXTRA_PAGE_IN_USE), the stripe is added to delayed_list.
>
> Signed-off-by: Song Liu <songliubraving@fb.com>

Reviewed-by: NeilBrown <neilb@suse.com>

Thanks,
NeilBrown


> ---
>  drivers/md/raid5-cache.c | 27 ++++++++++++++++-
>  drivers/md/raid5.c       | 78 ++++++++++++++++++++++++++++++++++++++++--------
>  drivers/md/raid5.h       |  6 ++++
>  3 files changed, 98 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index 8cb79fc..818874d 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -2334,15 +2334,40 @@ int r5c_try_caching_write(struct r5conf *conf,
>   */
>  void r5c_release_extra_page(struct stripe_head *sh)
>  {
> +	struct r5conf *conf = sh->raid_conf;
>  	int i;
> +	bool using_disk_info_extra_page;
> +
> +	using_disk_info_extra_page =
> +		sh->dev[0].orig_page == conf->disks[0].extra_page;
>  
>  	for (i = sh->disks; i--; )
>  		if (sh->dev[i].page != sh->dev[i].orig_page) {
>  			struct page *p = sh->dev[i].orig_page;
>  
>  			sh->dev[i].orig_page = sh->dev[i].page;
> -			put_page(p);
> +			if (!using_disk_info_extra_page)
> +				put_page(p);
>  		}
> +
> +	if (using_disk_info_extra_page) {
> +		clear_bit(R5C_EXTRA_PAGE_IN_USE, &conf->cache_state);
> +		md_wakeup_thread(conf->mddev->thread);
> +	}
> +}
> +
> +void r5c_use_extra_page(struct stripe_head *sh)
> +{
> +	struct r5conf *conf = sh->raid_conf;
> +	int i;
> +	struct r5dev *dev;
> +
> +	for (i = sh->disks; i--; ) {
> +		dev = &sh->dev[i];
> +		if (dev->orig_page != dev->page)
> +			put_page(dev->orig_page);
> +		dev->orig_page = conf->disks[i].extra_page;
> +	}
>  }
>  
>  /*
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index dbab8c7..db909b9 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -876,6 +876,8 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
>  
>  	if (!test_bit(STRIPE_R5C_CACHING, &sh->state)) {
>  		/* writing out phase */
> +		if (s->waiting_extra_page)
> +			return;
>  		if (r5l_write_stripe(conf->log, sh) == 0)
>  			return;
>  	} else {  /* caching phase */
> @@ -2007,6 +2009,7 @@ static struct stripe_head *alloc_stripe(struct kmem_cache *sc, gfp_t gfp,
>  		INIT_LIST_HEAD(&sh->batch_list);
>  		INIT_LIST_HEAD(&sh->lru);
>  		INIT_LIST_HEAD(&sh->r5c);
> +		INIT_LIST_HEAD(&sh->log_list);
>  		atomic_set(&sh->count, 1);
>  		sh->log_start = MaxSector;
>  		for (i = 0; i < disks; i++) {
> @@ -2253,10 +2256,24 @@ static int resize_stripes(struct r5conf *conf, int newsize)
>  	 */
>  	ndisks = kzalloc(newsize * sizeof(struct disk_info), GFP_NOIO);
>  	if (ndisks) {
> -		for (i=0; i<conf->raid_disks; i++)
> +		for (i = 0; i < conf->pool_size; i++)
>  			ndisks[i] = conf->disks[i];
> -		kfree(conf->disks);
> -		conf->disks = ndisks;
> +
> +		for (i = conf->pool_size; i < newsize; i++) {
> +			ndisks[i].extra_page = alloc_page(GFP_NOIO);
> +			if (!ndisks[i].extra_page)
> +				err = -ENOMEM;
> +		}
> +
> +		if (err) {
> +			for (i = conf->pool_size; i < newsize; i++)
> +				if (ndisks[i].extra_page)
> +					put_page(ndisks[i].extra_page);
> +			kfree(ndisks);
> +		} else {
> +			kfree(conf->disks);
> +			conf->disks = ndisks;
> +		}
>  	} else
>  		err = -ENOMEM;
>  
> @@ -3580,10 +3597,10 @@ static void handle_stripe_clean_event(struct r5conf *conf,
>  		break_stripe_batch_list(head_sh, STRIPE_EXPAND_SYNC_FLAGS);
>  }
>  
> -static void handle_stripe_dirtying(struct r5conf *conf,
> -				   struct stripe_head *sh,
> -				   struct stripe_head_state *s,
> -				   int disks)
> +static int handle_stripe_dirtying(struct r5conf *conf,
> +				  struct stripe_head *sh,
> +				  struct stripe_head_state *s,
> +				  int disks)
>  {
>  	int rmw = 0, rcw = 0, i;
>  	sector_t recovery_cp = conf->mddev->recovery_cp;
> @@ -3649,12 +3666,32 @@ static void handle_stripe_dirtying(struct r5conf *conf,
>  			    dev->page == dev->orig_page &&
>  			    !test_bit(R5_LOCKED, &sh->dev[sh->pd_idx].flags)) {
>  				/* alloc page for prexor */
> -				dev->orig_page = alloc_page(GFP_NOIO);
> +				struct page *p = alloc_page(GFP_NOIO);
> +
> +				if (p) {
> +					dev->orig_page = p;
> +					continue;
> +				}
>  
> -				/* will handle failure in a later patch*/
> -				BUG_ON(!dev->orig_page);
> +				/*
> +				 * alloc_page() failed, try use
> +				 * disk_info->extra_page
> +				 */
> +				if (!test_and_set_bit(R5C_EXTRA_PAGE_IN_USE,
> +						      &conf->cache_state)) {
> +					r5c_use_extra_page(sh);
> +					break;
> +				}
> +
> +				/* extra_page in use, add to delayed_list */
> +				set_bit(STRIPE_DELAYED, &sh->state);
> +				s->waiting_extra_page = 1;
> +				return -EAGAIN;
>  			}
> +		}
>  
> +		for (i = disks; i--; ) {
> +			struct r5dev *dev = &sh->dev[i];
>  			if ((dev->towrite ||
>  			     i == sh->pd_idx || i == sh->qd_idx ||
>  			     test_bit(R5_InJournal, &dev->flags)) &&
> @@ -3730,6 +3767,7 @@ static void handle_stripe_dirtying(struct r5conf *conf,
>  	    (s->locked == 0 && (rcw == 0 || rmw == 0) &&
>  	     !test_bit(STRIPE_BIT_DELAY, &sh->state)))
>  		schedule_reconstruction(sh, s, rcw == 0, 0);
> +	return 0;
>  }
>  
>  static void handle_parity_checks5(struct r5conf *conf, struct stripe_head *sh,
> @@ -4545,8 +4583,12 @@ static void handle_stripe(struct stripe_head *sh)
>  			if (ret == -EAGAIN ||
>  			    /* stripe under reclaim: !caching && injournal */
>  			    (!test_bit(STRIPE_R5C_CACHING, &sh->state) &&
> -			     s.injournal > 0))
> -				handle_stripe_dirtying(conf, sh, &s, disks);
> +			     s.injournal > 0)) {
> +				ret = handle_stripe_dirtying(conf, sh, &s,
> +							     disks);
> +				if (ret == -EAGAIN)
> +					goto finish;
> +			}
>  		}
>  	}
>  
> @@ -6458,6 +6500,8 @@ static void raid5_free_percpu(struct r5conf *conf)
>  
>  static void free_conf(struct r5conf *conf)
>  {
> +	int i;
> +
>  	if (conf->log)
>  		r5l_exit_log(conf->log);
>  	if (conf->shrinker.nr_deferred)
> @@ -6466,6 +6510,9 @@ static void free_conf(struct r5conf *conf)
>  	free_thread_groups(conf);
>  	shrink_stripes(conf);
>  	raid5_free_percpu(conf);
> +	for (i = 0; i < conf->pool_size; i++)
> +		if (conf->disks[i].extra_page)
> +			put_page(conf->disks[i].extra_page);
>  	kfree(conf->disks);
>  	kfree(conf->stripe_hashtbl);
>  	kfree(conf);
> @@ -6612,9 +6659,16 @@ static struct r5conf *setup_conf(struct mddev *mddev)
>  
>  	conf->disks = kzalloc(max_disks * sizeof(struct disk_info),
>  			      GFP_KERNEL);
> +
>  	if (!conf->disks)
>  		goto abort;
>  
> +	for (i = 0; i < max_disks; i++) {
> +		conf->disks[i].extra_page = alloc_page(GFP_KERNEL);
> +		if (!conf->disks[i].extra_page)
> +			goto abort;
> +	}
> +
>  	conf->mddev = mddev;
>  
>  	if ((conf->stripe_hashtbl = kzalloc(PAGE_SIZE, GFP_KERNEL)) == NULL)
> diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
> index d13fe45..ed8e136 100644
> --- a/drivers/md/raid5.h
> +++ b/drivers/md/raid5.h
> @@ -276,6 +276,7 @@ struct stripe_head_state {
>  	struct md_rdev *blocked_rdev;
>  	int handle_bad_blocks;
>  	int log_failed;
> +	int waiting_extra_page;
>  };
>  
>  /* Flags for struct r5dev.flags */
> @@ -439,6 +440,7 @@ enum {
>  
>  struct disk_info {
>  	struct md_rdev	*rdev, *replacement;
> +	struct page	*extra_page; /* extra page to use in prexor */
>  };
>  
>  /*
> @@ -559,6 +561,9 @@ enum r5_cache_state {
>  				 * only process stripes that are already
>  				 * occupying the log
>  				 */
> +	R5C_EXTRA_PAGE_IN_USE,	/* a stripe is using disk_info.extra_page
> +				 * for prexor
> +				 */
>  };
>  
>  struct r5conf {
> @@ -765,6 +770,7 @@ extern void
>  r5c_finish_stripe_write_out(struct r5conf *conf, struct stripe_head *sh,
>  			    struct stripe_head_state *s);
>  extern void r5c_release_extra_page(struct stripe_head *sh);
> +extern void r5c_use_extra_page(struct stripe_head *sh);
>  extern void r5l_wake_reclaim(struct r5l_log *log, sector_t space);
>  extern void r5c_handle_cached_data_endio(struct r5conf *conf,
>  	struct stripe_head *sh, int disks, struct bio_list *return_bi);
> -- 
> 2.9.3
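
The fallback order in the quoted patch (normal allocation first, then the
preallocated reserve guarded by a single in-use bit, then -EAGAIN so the
stripe is delayed) can be sketched in userspace as follows. This is only a
simplified model under stated assumptions: it uses one reserve page instead
of one per disk, C11 atomic_flag in place of test_and_set_bit(), and
hypothetical names (page_pool, get_prexor_page, put_prexor_page,
failing_alloc).

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

#define PAGE_SZ 4096

struct page_pool {
	void *extra_page;		/* preallocated at setup time */
	atomic_flag extra_in_use;	/* models R5C_EXTRA_PAGE_IN_USE */
};

/* Allocator hook so that an allocation failure can be simulated. */
typedef void *(*alloc_fn)(size_t);

/*
 * Returns 0 and stores a page in *out, or -EAGAIN if the allocation failed
 * and the reserve page is already held by another user.
 */
static int get_prexor_page(struct page_pool *pool, alloc_fn alloc, void **out)
{
	void *p;

	assert(out != NULL);
	p = alloc(PAGE_SZ);
	if (p) {
		*out = p;
		return 0;
	}
	/* Allocation failed: try to claim the reserve page atomically. */
	if (!atomic_flag_test_and_set(&pool->extra_in_use)) {
		*out = pool->extra_page;
		return 0;
	}
	return -EAGAIN;	/* caller should delay and retry later */
}

/* Release a page; the reserve page is unlocked rather than freed. */
static void put_prexor_page(struct page_pool *pool, void *p)
{
	if (p == pool->extra_page)
		atomic_flag_clear(&pool->extra_in_use);
	else
		free(p);
}

/* Always-failing allocator, used to exercise the fallback path. */
static void *failing_alloc(size_t size)
{
	(void)size;
	return NULL;
}
```

Passing failing_alloc twice in a row shows the behaviour the commit message
describes: the first caller gets the reserve page, the second gets -EAGAIN
until the first releases it.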


^ permalink raw reply

* Re: [PATCH 05/12] raid5-ppl: Partial Parity Log implementation
From: kbuild test robot @ 2016-11-25  2:26 UTC (permalink / raw)
  Cc: kbuild-all, shli, linux-raid, Artur Paszkiewicz
In-Reply-To: <20161124122847.16456-6-artur.paszkiewicz@intel.com>

[-- Attachment #1: Type: text/plain, Size: 1727 bytes --]

Hi Artur,

[auto build test ERROR on next-20161124]
[cannot apply to md/for-next v4.9-rc6 v4.9-rc5 v4.9-rc4 v4.9-rc6]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Artur-Paszkiewicz/Partial-Parity-Log-for-MD-RAID-5/20161125-100404
config: m68k-sun3_defconfig (attached as .config)
compiler: m68k-linux-gcc (GCC) 4.9.0
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=m68k 

All errors (new ones prefixed by >>):

   drivers/md/raid5-ppl.c: In function 'ppl_init_log_child':
>> drivers/md/raid5-ppl.c:429:2: error: too few arguments to function 'bio_init'
     bio_init(&log->flush_bio);
     ^
   In file included from include/linux/blkdev.h:19:0,
                    from drivers/md/raid5-ppl.c:16:
   include/linux/bio.h:423:13: note: declared here
    extern void bio_init(struct bio *bio, struct bio_vec *table,
                ^

vim +/bio_init +429 drivers/md/raid5-ppl.c

   423		spin_lock_init(&log->io_list_lock);
   424		INIT_LIST_HEAD(&log->running_ios);
   425		INIT_LIST_HEAD(&log->io_end_ios);
   426		INIT_LIST_HEAD(&log->flushing_ios);
   427		INIT_LIST_HEAD(&log->finished_ios);
   428		INIT_LIST_HEAD(&log->no_mem_stripes);
 > 429		bio_init(&log->flush_bio);
   430	
   431		log->io_kc = log_parent->io_kc;
   432		log->io_pool = log_parent->io_pool;

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 11676 bytes --]


* [mdadm PATCH] Add failfast support.
From: NeilBrown @ 2016-11-24 23:55 UTC (permalink / raw)
  To: Jes.Sorensen
  Cc: Shaohua Li, linux-raid, linux-block, Christoph Hellwig,
	linux-kernel, hare
In-Reply-To: <20161122020238.qtuxwo5etcwmts4r@kernel.org>


Allow per-device "failfast" flag to be set when creating an
array or adding devices to an array.

When re-adding a device which had the failfast flag, it can be removed
using --nofailfast.

failfast status is printed in --detail and --examine output.

Signed-off-by: NeilBrown <neilb@suse.com>
---

Hi Jes,
 this patch adds mdadm support for the failfast functionality that
Shaohua recently included in his for-next.
Hopefully the man-page additions provide all necessary context.
If there is anything that seems to be missing, I'll be very happy to
add it.

Thanks,
NeilBrown


 Create.c      |  2 ++
 Detail.c      |  1 +
 Incremental.c |  1 +
 Manage.c      | 20 +++++++++++++++++++-
 ReadMe.c      |  2 ++
 md.4          | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 md_p.h        |  1 +
 mdadm.8.in    | 32 +++++++++++++++++++++++++++++++-
 mdadm.c       | 11 +++++++++++
 mdadm.h       |  5 +++++
 super0.c      | 12 ++++++++----
 super1.c      | 13 +++++++++++++
 12 files changed, 148 insertions(+), 6 deletions(-)
 mode change 100755 => 100644 mdadm.h

diff --git a/Create.c b/Create.c
index 1594a3919139..bd114eabafc1 100644
--- a/Create.c
+++ b/Create.c
@@ -890,6 +890,8 @@ int Create(struct supertype *st, char *mddev,
 
 				if (dv->writemostly == 1)
 					inf->disk.state |= (1<<MD_DISK_WRITEMOSTLY);
+				if (dv->failfast == 1)
+					inf->disk.state |= (1<<MD_DISK_FAILFAST);
 
 				if (have_container)
 					fd = -1;
diff --git a/Detail.c b/Detail.c
index 925e4794c983..509b0d418768 100644
--- a/Detail.c
+++ b/Detail.c
@@ -658,6 +658,7 @@ This is pretty boring
 			}
 			if (disk.state & (1<<MD_DISK_REMOVED)) printf(" removed");
 			if (disk.state & (1<<MD_DISK_WRITEMOSTLY)) printf(" writemostly");
+			if (disk.state & (1<<MD_DISK_FAILFAST)) printf(" failfast");
 			if (disk.state & (1<<MD_DISK_JOURNAL)) printf(" journal");
 			if ((disk.state &
 			     ((1<<MD_DISK_ACTIVE)|(1<<MD_DISK_SYNC)
diff --git a/Incremental.c b/Incremental.c
index cc01d41e641a..75d95ccc497a 100644
--- a/Incremental.c
+++ b/Incremental.c
@@ -1035,6 +1035,7 @@ static int array_try_spare(char *devname, int *dfdp, struct dev_policy *pol,
 			devlist.next = NULL;
 			devlist.used = 0;
 			devlist.writemostly = 0;
+			devlist.failfast = 0;
 			devlist.devname = chosen_devname;
 			sprintf(chosen_devname, "%d:%d", major(stb.st_rdev),
 				minor(stb.st_rdev));
diff --git a/Manage.c b/Manage.c
index 1b7b0c111c83..429d8631cd23 100644
--- a/Manage.c
+++ b/Manage.c
@@ -683,8 +683,13 @@ int attempt_re_add(int fd, int tfd, struct mddev_dev *dv,
 			disc.state |= 1 << MD_DISK_WRITEMOSTLY;
 		if (dv->writemostly == 2)
 			disc.state &= ~(1 << MD_DISK_WRITEMOSTLY);
+		if (dv->failfast == 1)
+			disc.state |= 1 << MD_DISK_FAILFAST;
+		if (dv->failfast == 2)
+			disc.state &= ~(1 << MD_DISK_FAILFAST);
 		remove_partitions(tfd);
-		if (update || dv->writemostly > 0) {
+		if (update || dv->writemostly > 0
+			|| dv->failfast > 0) {
 			int rv = -1;
 			tfd = dev_open(dv->devname, O_RDWR);
 			if (tfd < 0) {
@@ -700,6 +705,14 @@ int attempt_re_add(int fd, int tfd, struct mddev_dev *dv,
 				rv = dev_st->ss->update_super(
 					dev_st, NULL, "readwrite",
 					devname, verbose, 0, NULL);
+			if (dv->failfast == 1)
+				rv = dev_st->ss->update_super(
+					dev_st, NULL, "failfast",
+					devname, verbose, 0, NULL);
+			if (dv->failfast == 2)
+				rv = dev_st->ss->update_super(
+					dev_st, NULL, "nofailfast",
+					devname, verbose, 0, NULL);
 			if (update)
 				rv = dev_st->ss->update_super(
 					dev_st, NULL, update,
@@ -964,6 +977,8 @@ int Manage_add(int fd, int tfd, struct mddev_dev *dv,
 			disc.state |= (1 << MD_DISK_JOURNAL) | (1 << MD_DISK_SYNC);
 		if (dv->writemostly == 1)
 			disc.state |= 1 << MD_DISK_WRITEMOSTLY;
+		if (dv->failfast == 1)
+			disc.state |= 1 << MD_DISK_FAILFAST;
 		dfd = dev_open(dv->devname, O_RDWR | O_EXCL|O_DIRECT);
 		if (tst->ss->add_to_super(tst, &disc, dfd,
 					  dv->devname, INVALID_SECTORS))
@@ -1009,6 +1024,8 @@ int Manage_add(int fd, int tfd, struct mddev_dev *dv,
 
 	if (dv->writemostly == 1)
 		disc.state |= (1 << MD_DISK_WRITEMOSTLY);
+	if (dv->failfast == 1)
+		disc.state |= (1 << MD_DISK_FAILFAST);
 	if (tst->ss->external) {
 		/* add a disk
 		 * to an external metadata container */
@@ -1785,6 +1802,7 @@ int move_spare(char *from_devname, char *to_devname, dev_t devid)
 	devlist.next = NULL;
 	devlist.used = 0;
 	devlist.writemostly = 0;
+	devlist.failfast = 0;
 	devlist.devname = devname;
 	sprintf(devname, "%d:%d", major(devid), minor(devid));
 
diff --git a/ReadMe.c b/ReadMe.c
index d3fcb6132fe9..8da49ef46dfb 100644
--- a/ReadMe.c
+++ b/ReadMe.c
@@ -136,6 +136,8 @@ struct option long_options[] = {
     {"bitmap-chunk", 1, 0, BitmapChunk},
     {"write-behind", 2, 0, WriteBehind},
     {"write-mostly",0, 0, WriteMostly},
+    {"failfast",  0, 0,  FailFast},
+    {"nofailfast",0, 0,  NoFailFast},
     {"re-add",    0, 0,  ReAdd},
     {"homehost",  1, 0,  HomeHost},
     {"symlinks",  1, 0,  Symlinks},
diff --git a/md.4 b/md.4
index f1b88ee6bb03..5bdf7a7bd375 100644
--- a/md.4
+++ b/md.4
@@ -916,6 +916,60 @@ slow).  The extra latency of the remote link will not slow down normal
 operations, but the remote system will still have a reasonably
 up-to-date copy of all data.
 
+.SS FAILFAST
+
+From Linux 4.10,
+.I
+md
+supports FAILFAST for RAID1 and RAID10 arrays.  This is a flag that
+can be set on individual drives, though it is usually set on all
+drives, or no drives.
+
+When
+.I md
+sends an I/O request to a drive that is marked as FAILFAST, and when
+the array could survive the loss of that drive without losing data,
+.I md
+will request that the underlying device does not perform any retries.
+This means that a failure will be reported to
+.I md
+promptly, and it can mark the device as faulty and continue using the
+other device(s).
+.I md
+cannot control the timeout that the underlying devices use to
+determine failure.  Any changes desired to that timeout must be set
+explicitly on the underlying device, separately from using
+.IR mdadm .
+
+If a FAILFAST request does fail, and if it is still safe to mark the
+device as faulty without data loss, that will be done and the array
+will continue functioning on a reduced number of devices.  If it is not
+possible to safely mark the device as faulty,
+.I md
+will retry the request without disabling retries in the underlying
+device.  In any case,
+.I md
+will not attempt to repair read errors on a device marked as FAILFAST
+by writing out the correct data.  It will just mark the device as faulty.
+
+FAILFAST is appropriate for storage arrays that have a low probability
+of true failure, but will sometimes introduce unacceptable delays to
+I/O requests while performing internal maintenance.  The value of
+setting FAILFAST involves a trade-off.  The gain is that the chance of
+unacceptable delays is substantially reduced.  The cost is that the
+unlikely event of data-loss on one device is slightly more likely to
+result in data-loss for the array.
+
+When a device in an array using FAILFAST is marked as faulty, it will
+usually become usable again in a short while.
+.I mdadm
+makes no attempt to detect that possibility.  Some separate
+mechanism, tuned to the specific details of the expected failure modes,
+needs to be created to monitor devices to see when they return to full
+functionality, and to then re-add them to the array.  In order for
+this "re-add" functionality to be effective, an array using FAILFAST
+should always have a write-intent bitmap.
+
 .SS RESTRIPING
 
 .IR Restriping ,
diff --git a/md_p.h b/md_p.h
index 0d691fbc987d..dc9fec165cb6 100644
--- a/md_p.h
+++ b/md_p.h
@@ -89,6 +89,7 @@
 				   * read requests will only be sent here in
 				   * dire need
 				   */
+#define	MD_DISK_FAILFAST	10 /* Fewer retries, more failures */
 
 #define MD_DISK_REPLACEMENT	17
 #define MD_DISK_JOURNAL		18 /* disk is used as the write journal in RAID-5/6 */
diff --git a/mdadm.8.in b/mdadm.8.in
index 3c0c58f95f35..aa80f0c1a631 100644
--- a/mdadm.8.in
+++ b/mdadm.8.in
@@ -747,7 +747,7 @@ subsequent devices listed in a
 .BR \-\-create ,
 or
 .B \-\-add
-command will be flagged as 'write-mostly'.  This is valid for RAID1
+command will be flagged as 'write\-mostly'.  This is valid for RAID1
 only and means that the 'md' driver will avoid reading from these
 devices if at all possible.  This can be useful if mirroring over a
 slow link.
@@ -762,6 +762,25 @@ mode, and write-behind is only attempted on drives marked as
 .IR write-mostly .
 
 .TP
+.BR \-\-failfast
+subsequent devices listed in a
+.B \-\-create
+or
+.B \-\-add
+command will be flagged as  'failfast'.  This is valid for RAID1 and
+RAID10 only.  IO requests to these devices will be encouraged to fail
+quickly rather than cause long delays due to error handling.  Also no
+attempt is made to repair a read error on these devices.
+
+If an array becomes degraded so that the 'failfast' device is the only
+usable device, the 'failfast' flag will then be ignored and extended
+delays will be preferred to complete failure.
+
+The 'failfast' flag is appropriate for storage arrays which have a
+low probability of true failure, but which may sometimes
+cause unacceptable delays due to internal maintenance functions.
+
+.TP
 .BR \-\-assume\-clean
 Tell
 .I mdadm
@@ -1452,6 +1471,17 @@ that had a failed journal. To avoid interrupting on-going write opertions,
 .B \-\-add-journal
 only works for array in Read-Only state.
 
+.TP
+.BR \-\-failfast
+Subsequent devices that are added or re\-added will have
+the 'failfast' flag set.  This is only valid for RAID1 and RAID10 and
+means that the 'md' driver will avoid long timeouts on error handling
+where possible.
+.TP
+.BR \-\-nofailfast
+Subsequent devices that are re\-added will be re\-added without
+the 'failfast' flag set.
+
 .P
 Each of these options requires that the first device listed is the array
 to be acted upon, and the remainder are component devices to be added,
diff --git a/mdadm.c b/mdadm.c
index cca093318d8d..3c8f273c8254 100644
--- a/mdadm.c
+++ b/mdadm.c
@@ -90,6 +90,7 @@ int main(int argc, char *argv[])
 	int spare_sharing = 1;
 	struct supertype *ss = NULL;
 	int writemostly = 0;
+	int failfast = 0;
 	char *shortopt = short_options;
 	int dosyslog = 0;
 	int rebuild_map = 0;
@@ -295,6 +296,7 @@ int main(int argc, char *argv[])
 					dv->devname = optarg;
 					dv->disposition = devmode;
 					dv->writemostly = writemostly;
+					dv->failfast = failfast;
 					dv->used = 0;
 					dv->next = NULL;
 					*devlistend = dv;
@@ -351,6 +353,7 @@ int main(int argc, char *argv[])
 			dv->devname = optarg;
 			dv->disposition = devmode;
 			dv->writemostly = writemostly;
+			dv->failfast = failfast;
 			dv->used = 0;
 			dv->next = NULL;
 			*devlistend = dv;
@@ -417,6 +420,14 @@ int main(int argc, char *argv[])
 			writemostly = 2;
 			continue;
 
+		case O(MANAGE,FailFast):
+		case O(CREATE,FailFast):
+			failfast = 1;
+			continue;
+		case O(MANAGE,NoFailFast):
+			failfast = 2;
+			continue;
+
 		case O(GROW,'z'):
 		case O(CREATE,'z'):
 		case O(BUILD,'z'): /* size */
diff --git a/mdadm.h b/mdadm.h
old mode 100755
new mode 100644
index 240ab7f831bc..d47de01f725b
--- a/mdadm.h
+++ b/mdadm.h
@@ -383,6 +383,8 @@ enum special_options {
 	ConfigFile,
 	ChunkSize,
 	WriteMostly,
+	FailFast,
+	NoFailFast,
 	Layout,
 	Auto,
 	Force,
@@ -516,6 +518,7 @@ struct mddev_dev {
 				 * Not set for names read from .config
 				 */
 	char writemostly;	/* 1 for 'set writemostly', 2 for 'clear writemostly' */
+	char failfast;		/* Ditto but for 'failfast' flag */
 	int used;		/* set when used */
 	long long data_offset;
 	struct mddev_dev *next;
@@ -821,6 +824,8 @@ extern struct superswitch {
 	 *   linear-grow-update - now change the size of the array.
 	 *   writemostly - set the WriteMostly1 bit in the superblock devflags
 	 *   readwrite - clear the WriteMostly1 bit in the superblock devflags
+	 *   failfast - set the FailFast1 bit in the superblock
+	 *   nofailfast - clear the FailFast1 bit
 	 *   no-bitmap - clear any record that a bitmap is present.
 	 *   bbl       - add a bad-block-log if possible
 	 *   no-bbl    - remove any bad-block-log is it is empty.
diff --git a/super0.c b/super0.c
index 55ebd8bc7877..938cfd95fa25 100644
--- a/super0.c
+++ b/super0.c
@@ -232,14 +232,15 @@ static void examine_super0(struct supertype *st, char *homehost)
 		mdp_disk_t *dp;
 		char *dv;
 		char nb[5];
-		int wonly;
+		int wonly, failfast;
 		if (d>=0) dp = &sb->disks[d];
 		else dp = &sb->this_disk;
 		snprintf(nb, sizeof(nb), "%4d", d);
 		printf("%4s %5d   %5d    %5d    %5d     ", d < 0 ? "this" : nb,
 		       dp->number, dp->major, dp->minor, dp->raid_disk);
 		wonly = dp->state & (1 << MD_DISK_WRITEMOSTLY);
-		dp->state &= ~(1 << MD_DISK_WRITEMOSTLY);
+		failfast = dp->state & (1<<MD_DISK_FAILFAST);
+		dp->state &= ~(wonly | failfast);
 		if (dp->state & (1 << MD_DISK_FAULTY))
 			printf(" faulty");
 		if (dp->state & (1 << MD_DISK_ACTIVE))
@@ -250,6 +251,8 @@ static void examine_super0(struct supertype *st, char *homehost)
 			printf(" removed");
 		if (wonly)
 			printf(" write-mostly");
+		if (failfast)
+			printf(" failfast");
 		if (dp->state == 0)
 			printf(" spare");
 		if ((dv = map_dev(dp->major, dp->minor, 0)))
@@ -581,7 +584,8 @@ static int update_super0(struct supertype *st, struct mdinfo *info,
 	} else if (strcmp(update, "assemble")==0) {
 		int d = info->disk.number;
 		int wonly = sb->disks[d].state & (1<<MD_DISK_WRITEMOSTLY);
-		int mask = (1<<MD_DISK_WRITEMOSTLY);
+		int failfast = sb->disks[d].state & (1<<MD_DISK_FAILFAST);
+		int mask = (1<<MD_DISK_WRITEMOSTLY)|(1<<MD_DISK_FAILFAST);
 		int add = 0;
 		if (sb->minor_version >= 91)
 			/* During reshape we don't insist on everything
@@ -590,7 +594,7 @@ static int update_super0(struct supertype *st, struct mdinfo *info,
 			add = (1<<MD_DISK_SYNC);
 		if (((sb->disks[d].state & ~mask) | add)
 		    != (unsigned)info->disk.state) {
-			sb->disks[d].state = info->disk.state | wonly;
+			sb->disks[d].state = info->disk.state | wonly |failfast;
 			rv = 1;
 		}
 		if (info->reshape_active &&
diff --git a/super1.c b/super1.c
index d3234392d453..87a74cb94508 100644
--- a/super1.c
+++ b/super1.c
@@ -77,6 +77,7 @@ struct mdp_superblock_1 {
 	__u8	device_uuid[16]; /* user-space setable, ignored by kernel */
 	__u8    devflags;        /* per-device flags.  Only one defined...*/
 #define WriteMostly1    1        /* mask for writemostly flag in above */
+#define FailFast1	2        /* Device should get FailFast requests */
 	/* bad block log.  If there are any bad blocks the feature flag is set.
 	 * if offset and size are non-zero, that space is reserved and available.
 	 */
@@ -430,6 +431,8 @@ static void examine_super1(struct supertype *st, char *homehost)
 		printf("          Flags :");
 		if (sb->devflags & WriteMostly1)
 			printf(" write-mostly");
+		if (sb->devflags & FailFast1)
+			printf(" failfast");
 		printf("\n");
 	}
 
@@ -1020,6 +1023,8 @@ static void getinfo_super1(struct supertype *st, struct mdinfo *info, char *map)
 	}
 	if (sb->devflags & WriteMostly1)
 		info->disk.state |= (1 << MD_DISK_WRITEMOSTLY);
+	if (sb->devflags & FailFast1)
+		info->disk.state |= (1 << MD_DISK_FAILFAST);
 	info->events = __le64_to_cpu(sb->events);
 	sprintf(info->text_version, "1.%d", st->minor_version);
 	info->safe_mode_delay = 200;
@@ -1377,6 +1382,10 @@ static int update_super1(struct supertype *st, struct mdinfo *info,
 		sb->devflags |= WriteMostly1;
 	else if (strcmp(update, "readwrite")==0)
 		sb->devflags &= ~WriteMostly1;
+	else if (strcmp(update, "failfast") == 0)
+		sb->devflags |= FailFast1;
+	else if (strcmp(update, "nofailfast") == 0)
+		sb->devflags &= ~FailFast1;
 	else
 		rv = -1;
 
@@ -1713,6 +1722,10 @@ static int write_init_super1(struct supertype *st)
 			sb->devflags |= WriteMostly1;
 		else
 			sb->devflags &= ~WriteMostly1;
+		if (di->disk.state & (1<<MD_DISK_FAILFAST))
+			sb->devflags |= FailFast1;
+		else
+			sb->devflags &= ~FailFast1;
 
 		random_uuid(sb->device_uuid);
 
-- 
2.10.2
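
The flag handling in the patch above follows one pattern throughout: a per-device request value (0 = leave alone, 1 = set, 2 = clear) is applied to bit MD_DISK_FAILFAST of disk.state, and that bit is mirrored into the FailFast1 mask of the v1 superblock devflags. A minimal sketch of that pattern follows. The two constants are taken from the patch; the helper names apply_failfast() and state_to_devflags() are hypothetical and do not exist in mdadm.

```c
#define MD_DISK_FAILFAST 10  /* bit number in disk.state (md_p.h in the patch) */
#define FailFast1        2   /* mask in v1 superblock devflags (super1.c) */

/*
 * request: 0 = leave the bit alone,
 *          1 = set it   (--failfast),
 *          2 = clear it (--nofailfast, re-add only).
 */
static unsigned int apply_failfast(unsigned int state, int request)
{
	if (request == 1)
		state |= 1u << MD_DISK_FAILFAST;
	else if (request == 2)
		state &= ~(1u << MD_DISK_FAILFAST);
	return state;
}

/* Mirror the disk.state bit into the superblock devflags byte. */
static unsigned char state_to_devflags(unsigned int state,
				       unsigned char devflags)
{
	if (state & (1u << MD_DISK_FAILFAST))
		devflags |= FailFast1;
	else
		devflags &= (unsigned char)~FailFast1;
	return devflags;
}
```

Keeping "clear" as a distinct request value, rather than treating 0 as clear, is what lets a plain re-add preserve whatever flag the device already had.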



* Re: [PATCH/RFC] add "failfast" support for raid1/raid10.
From: Jack Wang @ 2016-11-24 16:06 UTC (permalink / raw)
  To: NeilBrown
  Cc: Shaohua Li, linux-raid, linux-block, Christoph Hellwig,
	linux-kernel, hare
In-Reply-To: <877f7tbi20.fsf@notabene.neil.brown.name>

Hi Neil,

2016-11-24 5:47 GMT+01:00 NeilBrown <neilb@suse.com>:
> On Sat, Nov 19 2016, Jack Wang wrote:
>
>> 2016-11-18 6:16 GMT+01:00 NeilBrown <neilb@suse.com>:
>>> Hi,
>>>
>>>  I've been sitting on these patches for a while because although they
>>>  solve a real problem, it is a fairly limited use-case, and I don't
>>>  really like some of the details.
>>>
>>>  So I'm posting them as RFC in the hope that a different perspective
>>>  might help me like them better, or find a better approach.
>>>
>>>  The core idea is that when you have multiple copies of data
>>>  (i.e. mirrored drives) it doesn't make sense to wait for a read from
>>>  a drive that seems to be having problems.  It will probably be faster
>>>  to just cancel that read, and read from the other device.
>>>  Similarly, in some circumstances, it might be better to fail a drive
>>>  that is being slow to respond to writes, rather than cause all writes
>>>  to be very slow.
>>>
>>>  The particular context where this comes up is when mirroring across
>>>  storage arrays, where the storage arrays can temporarily take an
>>>  unusually long time to respond to requests (firmware updates have
>>>  been mentioned).  As the array will have redundancy internally, there
>>>  is little risk to the data.  The mirrored pair is really only for
>>>  disaster recovery, and it is deemed better to lose the last few
>>>  minutes of updates in the case of a serious disaster, rather than
>>>  occasionally having latency issues because one array needs to do some
>>>  maintenance for a few minutes.  The particular storage arrays in
>>>  question are DASD devices which are part of the s390 ecosystem.
>>
>> Hi Neil,
>>
>> Thanks for pushing this feature also to mainline.
>> We at Profitbricks use raid1 across IB network, one pserver with
>> raid1, both legs on 2 remote storages.
>> We've noticed that if one remote storage crashes, raid1 keeps
>> sending IO to the faulty leg; even after 5 minutes,
>> md still redirects I/Os and refuses to remove active disks, e.g.:
>
> That makes sense.  It cannot remove the active disk until all pending IO
> completes, either with an error or with success.
>
> If the target has a long timeout, that can delay progress a lot.
>
>>
>> I tried to port your patch from SLES[1]; with the patchset, it reduces
>> the time to ~30 seconds.
>>
>> I'm happy to see this feature upstream :)
>> I will test again this new patchset.
>
> Thanks for your confirmation that this is more generally useful than I
> thought, and I'm always happy to hear of more testing :-)
>
> Thanks,
> NeilBrown

Just want to update the test result: so far it's working fine, no regressions :)
Will report if anything breaks.

Thanks
Jack


* Re: [PATCH 1/4 v2] mdadm: bad block support for external metadata - initialization
From: Jes Sorensen @ 2016-11-24 15:53 UTC (permalink / raw)
  To: Tomasz Majchrzak; +Cc: linux-raid
In-Reply-To: <20161124140103.GA8047@proton.igk.intel.com>

Tomasz Majchrzak <tomasz.majchrzak@intel.com> writes:
> On Thu, Oct 27, 2016 at 10:53:42AM +0200, Tomasz Majchrzak wrote:
>> If metadata handler provides support for bad blocks, tell md by writing
>> 'external_bbl' to rdev state file (both on create and assemble),
>> followed by a list of known bad blocks written via sysfs 'bad_blocks'
>> file.
>> 
>> Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
>> Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
>> ---
>>  mdadm.h | 13 +++++++++++++
>>  sysfs.c | 29 ++++++++++++++++++++++++++++-
>>  2 files changed, 41 insertions(+), 1 deletion(-)
>
> Hi Jes,
>
> Do you have any comments for this patch set? It is generic code for bad
> block support for external metadata. It comes ahead of my other patch set
> which provides the IMSM implementation of the new functionality.
>
> Regards,
>
> Tomek

Hi Tomek,

I thought I had responded to this one - I'll have a look early next week
(holiday here).

Cheers,
jes


* Re: [PATCH 1/4 v2] mdadm: bad block support for external metadata - initialization
From: Tomasz Majchrzak @ 2016-11-24 14:01 UTC (permalink / raw)
  To: Jes.Sorensen; +Cc: linux-raid
In-Reply-To: <1477558425-13332-1-git-send-email-tomasz.majchrzak@intel.com>

On Thu, Oct 27, 2016 at 10:53:42AM +0200, Tomasz Majchrzak wrote:
> If metadata handler provides support for bad blocks, tell md by writing
> 'external_bbl' to rdev state file (both on create and assemble),
> followed by a list of known bad blocks written via sysfs 'bad_blocks'
> file.
> 
> Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
> Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> ---
>  mdadm.h | 13 +++++++++++++
>  sysfs.c | 29 ++++++++++++++++++++++++++++-
>  2 files changed, 41 insertions(+), 1 deletion(-)

Hi Jes,

Do you have any comments for this patch set? It is generic code for bad
block support for external metadata. It comes ahead of my other patch set
which provides the IMSM implementation of the new functionality.

Regards,

Tomek


* [PATCH 7/7] Man page changes for --rwh-policy
From: Artur Paszkiewicz @ 2016-11-24 12:29 UTC (permalink / raw)
  To: jes.sorensen; +Cc: linux-raid, Artur Paszkiewicz
In-Reply-To: <20161124122952.16529-1-artur.paszkiewicz@intel.com>

Describe the usage of the --rwh-policy parameter in Create and Misc
modes.

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 mdadm.8.in | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/mdadm.8.in b/mdadm.8.in
index 3c0c58f..9295dcb 100644
--- a/mdadm.8.in
+++ b/mdadm.8.in
@@ -996,6 +996,26 @@ simultaneously. If not specified, this defaults to 4.
 Specify journal device for the RAID-4/5/6 array. The journal device
 should be a SSD with reasonable lifetime.
 
+.TP
+.BR \-\-rwh-policy=
+Specify the RAID Write Hole policy for a RAID-4/5/6 array. Currently supported
+options are
+.BR off ,
+.B journal
+and
+.BR ppl .
+
+The
+.B journal
+policy is implicitly selected when using
+.BR \-\-write-journal .
+
+The
+.B ppl
+policy (Partial Parity Log) is a mechanism that can be used with RAID5 arrays.
+This feature prevents data loss by keeping parity consistent with data even in
+case of a drive failure during a dirty shutdown. PPL is stored in the metadata
+region of RAID member drives, so no additional journal drive is needed.
 
 .SH For assemble:
 
@@ -1675,6 +1695,14 @@ can be found it
 under
 .BR "SCRUBBING AND MISMATCHES" .
 
+.TP
+.BR \-\-rwh-policy=
+Change the RAID Write Hole policy for a RAID-4/5/6 array at runtime. For
+details about the RWH policies, see the description for the same parameter
+under
+.B Create mode
+options.
+
 .SH For Incremental Assembly mode:
 .TP
 .BR \-\-rebuild\-map ", " \-r
-- 
2.10.1
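
The man page text above names three policy values: off, journal and ppl. A minimal sketch of how such a name could be turned into a numeric policy is shown below, in the style of a name/value table. Everything here is hypothetical: the enum values, the table and parse_rwh_policy() are illustrations only, and the real series uses map_name() with a rwh_policies table whose contents are not shown in this thread.

```c
#include <string.h>

/* Hypothetical values; the real constants are defined elsewhere in the series. */
enum sketch_policy {
	POLICY_UNSET = -1,
	POLICY_OFF,
	POLICY_JOURNAL,
	POLICY_PPL
};

struct name_map {
	const char *name;
	int value;
};

/* Table terminated by a NULL name, so lookup needs no separate length. */
static const struct name_map policy_names[] = {
	{ "off",     POLICY_OFF },
	{ "journal", POLICY_JOURNAL },
	{ "ppl",     POLICY_PPL },
	{ NULL,      POLICY_UNSET },
};

/* Return the policy for an option argument, or POLICY_UNSET if unknown. */
static int parse_rwh_policy(const char *arg)
{
	const struct name_map *m;

	for (m = policy_names; m->name; m++)
		if (strcmp(m->name, arg) == 0)
			return m->value;
	return POLICY_UNSET;
}
```

An unknown name maps to POLICY_UNSET, which a caller can report as an invalid option argument.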



* [PATCH 6/7] Allow changing the RWH policy for a running array
From: Artur Paszkiewicz @ 2016-11-24 12:29 UTC (permalink / raw)
  To: jes.sorensen; +Cc: linux-raid, Artur Paszkiewicz
In-Reply-To: <20161124122952.16529-1-artur.paszkiewicz@intel.com>

This extends the --rwh-policy parameter to work also in Misc mode. Using
it changes the currently active RWH policy in the kernel driver and
updates the metadata to make this change permanent. Updating metadata is
not yet implemented for super1, so this is limited to IMSM for now.

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 Manage.c      | 79 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 mdadm.c       |  9 +++++++
 mdadm.h       |  1 +
 super-intel.c | 65 +++++++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 153 insertions(+), 1 deletion(-)

diff --git a/Manage.c b/Manage.c
index 1b7b0c1..2343eba 100644
--- a/Manage.c
+++ b/Manage.c
@@ -1805,4 +1805,83 @@ int move_spare(char *from_devname, char *to_devname, dev_t devid)
 	close(fd2);
 	return 0;
 }
+
+int ChangeRwhPolicy(char *dev, char *update, int verbose)
+{
+	struct supertype *st;
+	struct mdinfo *info;
+	char *subarray = NULL;
+	int ret = 0;
+	int fd;
+	int new_policy = map_name(rwh_policies, update);
+
+	if (new_policy == UnSet)
+		return 1;
+
+	fd = open(dev, O_RDONLY);
+	if (fd < 0)
+		return 1;
+
+	st = super_by_fd(fd, &subarray);
+	if (!st) {
+		close(fd);
+		return 1;
+	}
+
+	info = sysfs_read(fd, NULL, GET_RWH_POLICY|GET_LEVEL);
+	close(fd);
+	if (!info) {
+		ret = 1;
+		goto free_st;
+	}
+
+	if (new_policy == RWH_POLICY_PPL && !st->ss->supports_ppl) {
+		pr_err("%s metadata does not support PPL\n", st->ss->name);
+		ret = 1;
+		goto free_info;
+	}
+
+	if (info->array.level < 4 || info->array.level > 6) {
+		pr_err("Operation not supported for array level %d\n",
+				info->array.level);
+		ret = 1;
+		goto free_info;
+	}
+
+	if (info->rwh_policy != (unsigned)new_policy) {
+		if (!st->ss->external && new_policy == RWH_POLICY_PPL) {
+			pr_err("Operation supported for external metadata only.\n");
+			ret = 1;
+			goto free_info;
+		}
+
+		if (sysfs_set_str(info, NULL, "rwh_policy", update)) {
+			pr_err("Failed to change array RWH Policy\n");
+			ret = 1;
+			goto free_info;
+		}
+		info->rwh_policy = new_policy;
+	}
+
+	if (subarray) {
+		char container_dev[PATH_MAX];
+		struct mddev_ident ident;
+
+		sprintf(container_dev, "/dev/%s", st->container_devnm);
+
+		st->info = info;
+		ident.st = st;
+
+		ret = Update_subarray(container_dev, subarray, "rwh-policy",
+				&ident, verbose);
+	}
+
+free_info:
+	sysfs_free(info);
+free_st:
+	free(st);
+	free(subarray);
+
+	return ret;
+}
 #endif
diff --git a/mdadm.c b/mdadm.c
index 9ecdce6..82db22c 100644
--- a/mdadm.c
+++ b/mdadm.c
@@ -251,6 +251,7 @@ int main(int argc, char *argv[])
 		case UpdateSubarray:
 		case UdevRules:
 		case KillOpt:
+		case RwhPolicy:
 			if (!mode)
 				newmode = MISC;
 			break;
@@ -1200,11 +1201,16 @@ int main(int argc, char *argv[])
 			s.journaldisks = 1;
 			continue;
 		case O(CREATE, RwhPolicy):
+		case O(MISC, RwhPolicy):
 			s.rwh_policy = map_name(rwh_policies, optarg);
 			if (s.rwh_policy == UnSet) {
 				pr_err("Invalid RWH policy: %s\n", optarg);
 				exit(2);
 			}
+			if (mode == MISC) {
+				devmode = opt;
+				c.update = optarg;
+			}
 			continue;
 		}
 		/* We have now processed all the valid options. Anything else is
@@ -1916,6 +1922,9 @@ static int misc_list(struct mddev_dev *devlist,
 		case Action:
 			rv |= SetAction(dv->devname, c->action);
 			continue;
+		case RwhPolicy:
+			rv |= ChangeRwhPolicy(dv->devname, c->update, c->verbose);
+			continue;
 		}
 		if (dv->devname[0] == '/')
 			mdfd = open_mddev(dv->devname, 1);
diff --git a/mdadm.h b/mdadm.h
index 570d108..60a964a 100755
--- a/mdadm.h
+++ b/mdadm.h
@@ -1332,6 +1332,7 @@ extern int Update_subarray(char *dev, char *subarray, char *update, struct mddev
 extern int Wait(char *dev);
 extern int WaitClean(char *dev, int sock, int verbose);
 extern int SetAction(char *dev, char *action);
+extern int ChangeRwhPolicy(char *dev, char *update, int verbose);
 
 extern int Incremental(struct mddev_dev *devlist, struct context *c,
 		       struct supertype *st);
diff --git a/super-intel.c b/super-intel.c
index e6bd9ec..aa44f8c 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -450,6 +450,7 @@ enum imsm_update_type {
 	update_takeover,
 	update_general_migration_checkpoint,
 	update_size_change,
+	update_rwh_policy,
 };
 
 struct imsm_update_activate_spare {
@@ -538,6 +539,12 @@ struct imsm_update_add_remove_disk {
 	enum imsm_update_type type;
 };
 
+struct imsm_update_rwh_policy {
+	enum imsm_update_type type;
+	int new_policy;
+	int dev_idx;
+};
+
 static const char *_sys_dev_type[] = {
 	[SYS_DEV_UNKNOWN] = "Unknown",
 	[SYS_DEV_SAS] = "SAS",
@@ -2896,7 +2903,6 @@ static void getinfo_super_imsm_volume(struct supertype *st, struct mdinfo *info,
 	info->custom_array_size   <<= 32;
 	info->custom_array_size   |= __le32_to_cpu(dev->size_low);
 	info->recovery_blocked = imsm_reshape_blocks_arrays_changes(st->sb);
-	info->journal_clean = dev->rwh_policy;
 
 	if (is_gen_migration(dev)) {
 		info->reshape_active = 1;
@@ -3061,6 +3067,8 @@ static void getinfo_super_imsm_volume(struct supertype *st, struct mdinfo *info,
 			info->rwh_policy = RWH_POLICY_PPL;
 		else
 			info->rwh_policy = RWH_POLICY_UNKNOWN;
+
+		info->journal_clean = info->rwh_policy == RWH_POLICY_PPL;
 	}
 }
 
@@ -6875,6 +6883,41 @@ static int update_subarray_imsm(struct supertype *st, char *subarray,
 			}
 			super->updates_pending++;
 		}
+	} else if (strcmp(update, "rwh-policy") == 0) {
+		struct mdinfo *info;
+		int new_policy;
+		char *ep;
+		int vol = strtoul(subarray, &ep, 10);
+
+		if (!ident->st || !ident->st->info)
+			return 2;
+
+		info = ident->st->info;
+
+		if (*ep != '\0' || vol >= super->anchor->num_raid_devs)
+			return 2;
+
+		if (info->rwh_policy == RWH_POLICY_OFF)
+			new_policy = RWH_OFF;
+		else if (info->rwh_policy == RWH_POLICY_PPL)
+			new_policy = RWH_DISTRIBUTED;
+		else
+			return 2;
+
+		if (st->update_tail) {
+			struct imsm_update_rwh_policy *u = xmalloc(sizeof(*u));
+
+			u->type = update_rwh_policy;
+			u->dev_idx = vol;
+			u->new_policy = new_policy;
+			append_metadata_update(st, u, sizeof(*u));
+		} else {
+			struct imsm_dev *dev;
+
+			dev = get_imsm_dev(super, vol);
+			dev->rwh_policy = new_policy;
+			super->updates_pending++;
+		}
 	} else
 		return 2;
 
@@ -9029,6 +9072,21 @@ static void imsm_process_update(struct supertype *st,
 		}
 		break;
 	}
+	case update_rwh_policy: {
+		struct imsm_update_rwh_policy *u = (void *)update->buf;
+		int target = u->dev_idx;
+		struct imsm_dev *dev = get_imsm_dev(super, target);
+		if (!dev) {
+			dprintf("could not find subarray-%d\n", target);
+			break;
+		}
+
+		if (dev->rwh_policy != u->new_policy) {
+			dev->rwh_policy = u->new_policy;
+			super->updates_pending++;
+		}
+		break;
+	}
 	default:
 		pr_err("error: unsuported process update type:(type: %d)\n",	type);
 	}
@@ -9270,6 +9328,11 @@ static int imsm_prepare_update(struct supertype *st,
 	case update_add_remove_disk:
 		/* no update->len needed */
 		break;
+	case update_rwh_policy: {
+		if (update->len < (int)sizeof(struct imsm_update_rwh_policy))
+			return 0;
+		break;
+	}
 	default:
 		return 0;
 	}
-- 
2.10.1
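
The order of the checks in ChangeRwhPolicy() above can be summarized as a pure decision function, which may make the accepted and rejected cases easier to see. This is a sketch only: the enum names and check_rwh_change() are hypothetical, the policy values are placeholders, and the sysfs write and metadata update from the patch are deliberately left out.

```c
#include <stdbool.h>

/* Placeholder policy values, not the ones used by mdadm. */
enum { SK_POLICY_OFF, SK_POLICY_JOURNAL, SK_POLICY_PPL };

enum check_result {
	CHANGE_OK,            /* kernel policy would be rewritten via sysfs */
	SYSFS_WRITE_SKIPPED,  /* kernel already uses the wanted policy */
	ERR_NO_PPL_SUPPORT,   /* metadata handler cannot store PPL */
	ERR_BAD_LEVEL,        /* only RAID levels 4, 5 and 6 qualify */
	ERR_NOT_EXTERNAL      /* PPL switch limited to external metadata */
};

/* Same order as the patch: PPL support, array level, then the change itself. */
static enum check_result check_rwh_change(int level, int current, int wanted,
					  bool supports_ppl, bool external)
{
	if (wanted == SK_POLICY_PPL && !supports_ppl)
		return ERR_NO_PPL_SUPPORT;
	if (level < 4 || level > 6)
		return ERR_BAD_LEVEL;
	if (current == wanted)
		return SYSFS_WRITE_SKIPPED;
	if (!external && wanted == SK_POLICY_PPL)
		return ERR_NOT_EXTERNAL;
	return CHANGE_OK;
}
```

Note that in the patch a skipped sysfs write does not end the operation: for a subarray, the metadata update through Update_subarray() still runs.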



* [PATCH 5/7] imsm: allow to assemble with PPL even if dirty degraded
From: Artur Paszkiewicz @ 2016-11-24 12:29 UTC (permalink / raw)
  To: jes.sorensen; +Cc: linux-raid, Pawel Baldysiak, Artur Paszkiewicz
In-Reply-To: <20161124122952.16529-1-artur.paszkiewicz@intel.com>

From: Pawel Baldysiak <pawel.baldysiak@intel.com>

Allowing assembly of a dirty degraded array is necessary to let the
kernel driver perform PPL recovery. If the recovery succeeds, the array
is no longer dirty and can be started normally as degraded.

Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
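Reader's note: the effect of the Assemble.c change can be sketched as
below. This is a minimal standalone model, not mdadm code. Both helper
names (effective_clean, raid5_can_start) are hypothetical, and the RAID5
rule shown is a simplified assumption about what enough() decides for
level 5: a clean array may start with one member missing, a dirty one
needs all members.

```c
#include <assert.h>

/* Hypothetical helper: the "clean" argument handed to enough() after this
 * patch. Bit 0 of array.state is the clean flag; journal_clean is nonzero
 * when the metadata reports that PPL is in use. */
static int effective_clean(int array_state, int journal_clean)
{
	return (array_state & 1) || journal_clean;
}

/* Simplified, assumed model of the RAID5 rule: a clean array may start
 * with one member missing, a dirty array needs every member present. */
static int raid5_can_start(int raid_disks, int avail_disks, int clean)
{
	assert(raid_disks > 0);
	return clean ? avail_disks >= raid_disks - 1
		     : avail_disks >= raid_disks;
}
```

Under these assumptions, a dirty 4-disk array with 3 members available
is refused without PPL and accepted with PPL, which leaves the recovery
itself to the kernel.
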
 Assemble.c    | 4 +++-
 super-intel.c | 1 +
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/Assemble.c b/Assemble.c
index 3da0903..022d826 100644
--- a/Assemble.c
+++ b/Assemble.c
@@ -1943,7 +1943,9 @@ int assemble_container_content(struct supertype *st, int mdfd,
 		   content->uuid, chosen_name);
 
 	if (enough(content->array.level, content->array.raid_disks,
-		   content->array.layout, content->array.state & 1, avail) == 0) {
+		   content->array.layout,
+		   (content->array.state & 1) || content->journal_clean,
+		   avail) == 0) {
 		if (c->export && result)
 			*result |= INCR_NO;
 		else if (c->verbose >= 0) {
diff --git a/super-intel.c b/super-intel.c
index 7f12230..e6bd9ec 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -2896,6 +2896,7 @@ static void getinfo_super_imsm_volume(struct supertype *st, struct mdinfo *info,
 	info->custom_array_size   <<= 32;
 	info->custom_array_size   |= __le32_to_cpu(dev->size_low);
 	info->recovery_blocked = imsm_reshape_blocks_arrays_changes(st->sb);
+	info->journal_clean = dev->rwh_policy;
 
 	if (is_gen_migration(dev)) {
 		info->reshape_active = 1;
-- 
2.10.1



* [PATCH 4/7] super1: PPL support
From: Artur Paszkiewicz @ 2016-11-24 12:29 UTC (permalink / raw)
  To: jes.sorensen; +Cc: linux-raid, Artur Paszkiewicz
In-Reply-To: <20161124122952.16529-1-artur.paszkiewicz@intel.com>

Enable creating and assembling raid5 arrays with PPL for 1.1 and 1.2
metadata.

When creating an array, reserve enough space for the PPL, store its size
and location in the superblock, and set the MD_FEATURE_PPL bit. The PPL
is stored in the metadata region reserved for the internal write-intent
bitmap, so do not allow a bitmap and PPL to be used together.

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
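Reader's note: the space reservation described above can be sketched as
below. choose_ppl_space() is copied from the super1.c hunk in this
patch. The other two helpers (ppl_space_for_chunk_kib,
ppl_size_fits_u16) are hypothetical and only illustrate the units
involved. The sketch assumes the chunk size arrives in KiB and is
converted to 512-byte sectors, as validate_geometry1() does with
(*chunk)*2.

```c
#include <assert.h>
#include <stdint.h>

/* Same calculation as choose_ppl_space() in the patch: 8 sectors plus the
 * larger of the chunk size or 256 sectors. 'chunk' is in 512-byte
 * sectors. */
static unsigned int choose_ppl_space(int chunk)
{
	return 8 + (chunk > 128 * 2 ? chunk : 128 * 2);
}

/* Hypothetical helper: the same calculation for a chunk size in KiB. */
static unsigned int ppl_space_for_chunk_kib(int chunk_kib)
{
	assert(chunk_kib > 0);
	return choose_ppl_space(chunk_kib * 2);
}

/* Hypothetical helper: the superblock stores ppl.size as a __u16, so the
 * reserved size must fit in 16 bits to be recorded without truncation. */
static int ppl_size_fits_u16(unsigned int sectors)
{
	return sectors <= UINT16_MAX;
}
```

With a 64 KiB chunk this reserves 264 sectors; with a 512 KiB chunk it
reserves 1032 sectors.
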
 Create.c      | 19 +++++++++++--
 Grow.c        | 15 +++++++++-
 mdadm.h       |  4 ++-
 super-ddf.c   |  2 +-
 super-gpt.c   |  2 +-
 super-intel.c |  4 +--
 super-mbr.c   |  2 +-
 super0.c      |  2 +-
 super1.c      | 88 +++++++++++++++++++++++++++++++++++++++++++++++++----------
 9 files changed, 113 insertions(+), 25 deletions(-)

diff --git a/Create.c b/Create.c
index 52e7e2b..590ed69 100644
--- a/Create.c
+++ b/Create.c
@@ -201,6 +201,15 @@ int Create(struct supertype *st, char *mddev,
 		return 1;
 	}
 
+	if (s->rwh_policy == RWH_POLICY_PPL) {
+		if (s->bitmap_file) {
+			pr_err("PPL is not compatible with bitmap\n");
+			return 1;
+		} else {
+			s->bitmap_file = "none";
+		}
+	}
+
 	/* now set some defaults */
 
 	if (s->layout == UnSet) {
@@ -259,7 +268,8 @@ int Create(struct supertype *st, char *mddev,
 	if (st && ! st->ss->validate_geometry(st, s->level, s->layout, s->raiddisks,
 					      &s->chunk, s->size*2,
 					      data_offset, NULL,
-					      &newsize, c->verbose>=0))
+					      &newsize, s->rwh_policy,
+					      c->verbose>=0))
 		return 1;
 
 	if (s->chunk && s->chunk != UnSet) {
@@ -358,7 +368,8 @@ int Create(struct supertype *st, char *mddev,
 						st, s->level, s->layout, s->raiddisks,
 						&s->chunk, s->size*2,
 						dv->data_offset, dname,
-						&freesize, c->verbose > 0)) {
+						&freesize, s->rwh_policy,
+						c->verbose > 0)) {
 				case -1: /* Not valid, message printed, and not
 					  * worth checking any further */
 					exit(2);
@@ -395,6 +406,7 @@ int Create(struct supertype *st, char *mddev,
 						       &s->chunk, s->size*2,
 						       dv->data_offset,
 						       dname, &freesize,
+						       s->rwh_policy,
 						       c->verbose >= 0)) {
 
 				pr_err("%s is not suitable for this array.\n",
@@ -501,7 +513,8 @@ int Create(struct supertype *st, char *mddev,
 						       s->raiddisks,
 						       &s->chunk, minsize*2,
 						       data_offset,
-						       NULL, NULL, 0)) {
+						       NULL, NULL,
+						       s->rwh_policy, 0)) {
 				pr_err("devices too large for RAID level %d\n", s->level);
 				return 1;
 			}
diff --git a/Grow.c b/Grow.c
index 455c5f9..8bfb09c 100755
--- a/Grow.c
+++ b/Grow.c
@@ -290,6 +290,7 @@ int Grow_addbitmap(char *devname, int fd, struct context *c, struct shape *s)
 	int major = BITMAP_MAJOR_HI;
 	int vers = md_get_version(fd);
 	unsigned long long bitmapsize, array_size;
+	struct mdinfo *mdi;
 
 	if (vers < 9003) {
 		major = BITMAP_MAJOR_HOSTENDIAN;
@@ -389,12 +390,23 @@ int Grow_addbitmap(char *devname, int fd, struct context *c, struct shape *s)
 		free(st);
 		return 1;
 	}
+
+	mdi = sysfs_read(fd, NULL, GET_RWH_POLICY);
+	if (mdi) {
+		if (mdi->rwh_policy == RWH_POLICY_PPL) {
+			pr_err("Cannot add bitmap to array with PPL\n");
+			free(mdi);
+			free(st);
+			return 1;
+		}
+		free(mdi);
+	}
+
 	if (strcmp(s->bitmap_file, "internal") == 0 ||
 	    strcmp(s->bitmap_file, "clustered") == 0) {
 		int rv;
 		int d;
 		int offset_setable = 0;
-		struct mdinfo *mdi;
 		if (st->ss->add_internal_bitmap == NULL) {
 			pr_err("Internal bitmaps not supported with %s metadata\n", st->ss->name);
 			return 1;
@@ -446,6 +458,7 @@ int Grow_addbitmap(char *devname, int fd, struct context *c, struct shape *s)
 			sysfs_init(mdi, fd, NULL);
 			rv = sysfs_set_num_signed(mdi, NULL, "bitmap/location",
 						  mdi->bitmap_offset);
+			free(mdi);
 		} else {
 			if (strcmp(s->bitmap_file, "clustered") == 0)
 				array.state |= (1 << MD_SB_CLUSTERED);
diff --git a/mdadm.h b/mdadm.h
index 5600341..570d108 100755
--- a/mdadm.h
+++ b/mdadm.h
@@ -288,6 +288,8 @@ struct mdinfo {
 		#define MaxSector  (~0ULL) /* resync/recovery complete position */
 	};
 	long			bitmap_offset;	/* 0 == none, 1 == a file */
+	unsigned int		ppl_offset;
+	unsigned int		ppl_sectors;
 	unsigned long		safe_mode_delay; /* ms delay to mark clean */
 	int			new_level, delta_disks, new_layout, new_chunk;
 	int			errors;
@@ -948,7 +950,7 @@ extern struct superswitch {
 				 int *chunk, unsigned long long size,
 				 unsigned long long data_offset,
 				 char *subdev, unsigned long long *freesize,
-				 int verbose);
+				 int rwh_policy, int verbose);
 
 	/* Return a linked list of 'mdinfo' structures for all arrays
 	 * in the container.  For non-containers, it is like
diff --git a/super-ddf.c b/super-ddf.c
index 18e1e77..6184a73 100644
--- a/super-ddf.c
+++ b/super-ddf.c
@@ -3347,7 +3347,7 @@ static int validate_geometry_ddf(struct supertype *st,
 				 int *chunk, unsigned long long size,
 				 unsigned long long data_offset,
 				 char *dev, unsigned long long *freesize,
-				 int verbose)
+				 int rwh_policy, int verbose)
 {
 	int fd;
 	struct mdinfo *sra;
diff --git a/super-gpt.c b/super-gpt.c
index 1a2adce..efb0c00 100644
--- a/super-gpt.c
+++ b/super-gpt.c
@@ -195,7 +195,7 @@ static int validate_geometry(struct supertype *st, int level,
 			     int *chunk, unsigned long long size,
 			     unsigned long long data_offset,
 			     char *subdev, unsigned long long *freesize,
-			     int verbose)
+			     int rwh_policy, int verbose)
 {
 	pr_err("gpt metadata cannot be used this way\n");
 	return 0;
diff --git a/super-intel.c b/super-intel.c
index 79a3d78..7f12230 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -6631,7 +6631,7 @@ static int validate_geometry_imsm(struct supertype *st, int level, int layout,
 				  int raiddisks, int *chunk, unsigned long long size,
 				  unsigned long long data_offset,
 				  char *dev, unsigned long long *freesize,
-				  int verbose)
+				  int rwh_policy, int verbose)
 {
 	int fd, cfd;
 	struct mdinfo *sra;
@@ -10447,7 +10447,7 @@ enum imsm_reshape_type imsm_analyze_change(struct supertype *st,
 				    geo->raid_disks + devNumChange,
 				    &chunk,
 				    geo->size, INVALID_SECTORS,
-				    0, 0, 1))
+				    0, 0, info.rwh_policy, 1))
 		change = -1;
 
 	if (check_devs) {
diff --git a/super-mbr.c b/super-mbr.c
index f5e4cea..66d984c 100644
--- a/super-mbr.c
+++ b/super-mbr.c
@@ -193,7 +193,7 @@ static int validate_geometry(struct supertype *st, int level,
 			     int *chunk, unsigned long long size,
 			     unsigned long long data_offset,
 			     char *subdev, unsigned long long *freesize,
-			     int verbose)
+			     int rwh_policy, int verbose)
 {
 	pr_err("mbr metadata cannot be used this way\n");
 	return 0;
diff --git a/super0.c b/super0.c
index 151e52a..be6256f 100644
--- a/super0.c
+++ b/super0.c
@@ -1263,7 +1263,7 @@ static int validate_geometry0(struct supertype *st, int level,
 			      int *chunk, unsigned long long size,
 			      unsigned long long data_offset,
 			      char *subdev, unsigned long long *freesize,
-			      int verbose)
+			      int rwh_policy, int verbose)
 {
 	unsigned long long ldsize;
 	int fd;
diff --git a/super1.c b/super1.c
index 8a98ac2..b5825e6 100644
--- a/super1.c
+++ b/super1.c
@@ -48,10 +48,18 @@ struct mdp_superblock_1 {
 
 	__u32	chunksize;	/* in 512byte sectors */
 	__u32	raid_disks;
-	__u32	bitmap_offset;	/* sectors after start of superblock that bitmap starts
-				 * NOTE: signed, so bitmap can be before superblock
-				 * only meaningful of feature_map[0] is set.
-				 */
+	union {
+		__u32	bitmap_offset;	/* sectors after start of superblock that bitmap starts
+					 * NOTE: signed, so bitmap can be before superblock
+					 * only meaningful of feature_map[0] is set.
+					 */
+
+		/* only meaningful when feature_map[MD_FEATURE_PPL] is set */
+		struct {
+			__u16 offset; /* sectors after start of superblock that ppl starts */
+			__u16 size; /* PPL size (including header) in sectors */
+		} ppl;
+	};
 
 	/* These are only valid with feature bit '4' */
 	__u32	new_level;	/* new level we are reshaping to		*/
@@ -130,6 +138,7 @@ struct misc_dev_info {
 #define	MD_FEATURE_NEW_OFFSET		64 /* new_offset must be honoured */
 #define	MD_FEATURE_BITMAP_VERSIONED	256 /* bitmap version number checked properly */
 #define	MD_FEATURE_JOURNAL		512 /* support write journal */
+#define	MD_FEATURE_PPL			1024 /* support PPL */
 #define	MD_FEATURE_ALL			(MD_FEATURE_BITMAP_OFFSET	\
 					|MD_FEATURE_RECOVERY_OFFSET	\
 					|MD_FEATURE_RESHAPE_ACTIVE	\
@@ -139,6 +148,7 @@ struct misc_dev_info {
 					|MD_FEATURE_NEW_OFFSET		\
 					|MD_FEATURE_BITMAP_VERSIONED	\
 					|MD_FEATURE_JOURNAL		\
+					|MD_FEATURE_PPL			\
 					)
 
 #ifndef MDASSEMBLE
@@ -288,6 +298,11 @@ static int awrite(struct align_fd *afd, void *buf, int len)
 	return len;
 }
 
+static inline unsigned int choose_ppl_space(int chunk)
+{
+	return 8 + (chunk > 128*2 ? chunk : 128*2);
+}
+
 #ifndef MDASSEMBLE
 static void examine_super1(struct supertype *st, char *homehost)
 {
@@ -391,6 +406,10 @@ static void examine_super1(struct supertype *st, char *homehost)
 	if (sb->feature_map & __cpu_to_le32(MD_FEATURE_BITMAP_OFFSET)) {
 		printf("Internal Bitmap : %ld sectors from superblock\n",
 		       (long)(int32_t)__le32_to_cpu(sb->bitmap_offset));
+	} else if (sb->feature_map & __cpu_to_le32(MD_FEATURE_PPL)) {
+		printf("            PPL : %u sectors at offset %u sectors from superblock\n",
+		       __le16_to_cpu(sb->ppl.size),
+		       __le16_to_cpu(sb->ppl.offset));
 	}
 	if (sb->feature_map & __cpu_to_le32(MD_FEATURE_RESHAPE_ACTIVE)) {
 		printf("  Reshape pos'n : %llu%s\n", (unsigned long long)__le64_to_cpu(sb->reshape_position)/2,
@@ -932,8 +951,12 @@ static void getinfo_super1(struct supertype *st, struct mdinfo *info, char *map)
 
 	info->data_offset = __le64_to_cpu(sb->data_offset);
 	info->component_size = __le64_to_cpu(sb->size);
-	if (sb->feature_map & __le32_to_cpu(MD_FEATURE_BITMAP_OFFSET))
+	if (sb->feature_map & __le32_to_cpu(MD_FEATURE_BITMAP_OFFSET)) {
 		info->bitmap_offset = (int32_t)__le32_to_cpu(sb->bitmap_offset);
+	} else if (sb->feature_map & __le32_to_cpu(MD_FEATURE_PPL)) {
+		info->ppl_offset = __le16_to_cpu(sb->ppl.offset);
+		info->ppl_sectors = __le16_to_cpu(sb->ppl.size);
+	}
 
 	info->disk.major = 0;
 	info->disk.minor = 0;
@@ -978,6 +1001,11 @@ static void getinfo_super1(struct supertype *st, struct mdinfo *info, char *map)
 			bmend += size;
 			if (bmend > earliest)
 				earliest = bmend;
+		} else if (info->ppl_offset > 0) {
+			unsigned long long pplend = info->ppl_offset +
+						    info->ppl_sectors;
+			if (pplend > earliest)
+				earliest = pplend;
 		}
 		if (sb->bblog_offset && sb->bblog_size) {
 			unsigned long long bbend = super_offset;
@@ -1069,9 +1097,16 @@ static void getinfo_super1(struct supertype *st, struct mdinfo *info, char *map)
 	}
 
 	info->array.working_disks = working;
-	if (sb->feature_map & __le32_to_cpu(MD_FEATURE_JOURNAL))
+
+	if (sb->feature_map & __le32_to_cpu(MD_FEATURE_JOURNAL)) {
 		info->journal_device_required = 1;
-	info->journal_clean = 0;
+		info->rwh_policy = RWH_POLICY_JOURNAL;
+	} else if (sb->feature_map & __le32_to_cpu(MD_FEATURE_PPL)) {
+		info->rwh_policy = RWH_POLICY_PPL;
+		info->journal_clean = 1;
+	} else {
+		info->rwh_policy = RWH_POLICY_UNKNOWN;
+	}
 }
 
 static struct mdinfo *container_content1(struct supertype *st, char *subarray)
@@ -1233,6 +1268,9 @@ static int update_super1(struct supertype *st, struct mdinfo *info,
 		if (sb->feature_map & __cpu_to_le32(MD_FEATURE_BITMAP_OFFSET)) {
 			bitmap_offset = (long)__le32_to_cpu(sb->bitmap_offset);
 			bm_sectors = calc_bitmap_size(bms, 4096) >> 9;
+		} else if (sb->feature_map & __cpu_to_le32(MD_FEATURE_PPL)) {
+			bitmap_offset = (long)__le16_to_cpu(sb->ppl.offset);
+			bm_sectors = (long)__le16_to_cpu(sb->ppl.size);
 		}
 #endif
 		if (sb_offset < data_offset) {
@@ -1462,6 +1500,9 @@ static int init_super1(struct supertype *st, mdu_array_info_t *info,
 
 	memset(sb->dev_roles, 0xff, MAX_SB_SIZE - sizeof(struct mdp_superblock_1));
 
+	if (s->rwh_policy == RWH_POLICY_PPL)
+		sb->feature_map |= __cpu_to_le32(MD_FEATURE_PPL);
+
 	return 1;
 }
 
@@ -1663,7 +1704,7 @@ static int write_empty_r5l_meta_block(struct supertype *st, int fd)
 	crc = crc32c_le(crc, (void *)mb, META_BLOCK_SIZE);
 	mb->checksum = crc;
 
-	if (lseek64(fd, (sb->data_offset) * 512, 0) < 0LL) {
+	if (lseek64(fd, __le64_to_cpu(sb->data_offset) * 512, 0) < 0LL) {
 		pr_err("cannot seek to offset of the meta block\n");
 		goto fail_to_write;
 	}
@@ -1696,7 +1737,7 @@ static int write_init_super1(struct supertype *st)
 
 	for (di = st->info; di; di = di->next) {
 		if (di->disk.state & (1 << MD_DISK_JOURNAL))
-			sb->feature_map |= MD_FEATURE_JOURNAL;
+			sb->feature_map |= __cpu_to_le32(MD_FEATURE_JOURNAL);
 	}
 
 	for (di = st->info; di; di = di->next) {
@@ -1767,6 +1808,11 @@ static int write_init_super1(struct supertype *st)
 					(((char *)sb) + MAX_SB_SIZE);
 			bm_space = calc_bitmap_size(bms, 4096) >> 9;
 			bm_offset = (long)__le32_to_cpu(sb->bitmap_offset);
+		} else if (sb->feature_map & __cpu_to_le32(MD_FEATURE_PPL)) {
+			bm_space = choose_ppl_space(__le32_to_cpu(sb->chunksize));
+			bm_offset = 8;
+			sb->ppl.offset = __cpu_to_le16(bm_offset);
+			sb->ppl.size = __cpu_to_le16(bm_space);
 		} else {
 			bm_space = choose_bm_space(array_size);
 			bm_offset = 8;
@@ -1838,8 +1884,10 @@ static int write_init_super1(struct supertype *st)
 				goto error_out;
 		}
 
-		if (rv == 0 && (__le32_to_cpu(sb->feature_map) & 1))
+		if (rv == 0 &&
+		    (__le32_to_cpu(sb->feature_map) & MD_FEATURE_BITMAP_OFFSET))
 			rv = st->ss->write_bitmap(st, di->fd, NodeNumUpdate);
+
 		close(di->fd);
 		di->fd = -1;
 		if (rv)
@@ -2107,11 +2155,13 @@ static __u64 avail_size1(struct supertype *st, __u64 devsize,
 		return 0;
 
 #ifndef MDASSEMBLE
-	if (__le32_to_cpu(super->feature_map)&MD_FEATURE_BITMAP_OFFSET) {
+	if (__le32_to_cpu(super->feature_map) & MD_FEATURE_BITMAP_OFFSET) {
 		/* hot-add. allow for actual size of bitmap */
 		struct bitmap_super_s *bsb;
 		bsb = (struct bitmap_super_s *)(((char*)super)+MAX_SB_SIZE);
 		bmspace = calc_bitmap_size(bsb, 4096) >> 9;
+	} else if (__le32_to_cpu(super->feature_map) & MD_FEATURE_PPL) {
+		bmspace = __le16_to_cpu(super->ppl.size);
 	}
 #endif
 	/* Allow space for bad block log */
@@ -2479,7 +2529,7 @@ static int validate_geometry1(struct supertype *st, int level,
 			      int *chunk, unsigned long long size,
 			      unsigned long long data_offset,
 			      char *subdev, unsigned long long *freesize,
-			      int verbose)
+			      int rwh_policy, int verbose)
 {
 	unsigned long long ldsize, devsize;
 	int bmspace;
@@ -2501,6 +2551,14 @@ static int validate_geometry1(struct supertype *st, int level,
 		/* not specified, so time to set default */
 		st->minor_version = 2;
 
+	if (st->minor_version != 2 && st->minor_version != 1 &&
+	    rwh_policy == RWH_POLICY_PPL) {
+		if (verbose)
+			pr_err("1.%d metadata does not support PPL\n",
+			       st->minor_version);
+		return 0;
+	}
+
 	fd = open(subdev, O_RDONLY|O_EXCL, 0);
 	if (fd < 0) {
 		if (verbose)
@@ -2521,8 +2579,9 @@ static int validate_geometry1(struct supertype *st, int level,
 		return 0;
 	}
 
-	/* creating:  allow suitable space for bitmap */
-	bmspace = choose_bm_space(devsize);
+	/* creating:  allow suitable space for bitmap or PPL */
+	bmspace = rwh_policy == RWH_POLICY_PPL ?
+		  choose_ppl_space((*chunk)*2) : choose_bm_space(devsize);
 
 	if (data_offset == INVALID_SECTORS)
 		data_offset = st->data_offset;
@@ -2654,5 +2713,6 @@ struct superswitch super1 = {
 #else
 	.swapuuid = 1,
 #endif
+	.supports_ppl = 1,
 	.name = "1.x",
 };
-- 
2.10.1



* [PATCH 3/7] imsm: PPL support
From: Artur Paszkiewicz @ 2016-11-24 12:29 UTC (permalink / raw)
  To: jes.sorensen; +Cc: linux-raid, Artur Paszkiewicz
In-Reply-To: <20161124122952.16529-1-artur.paszkiewicz@intel.com>

Enable creating and assembling IMSM raid5 arrays with PPL.

Write the IMSM MPB location for a device to the newly added rdev
'sb_start' sysfs attribute, and write 'journal_ppl' to the 'state'
attribute of every active member.

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
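Reader's note: the sb_start value written to sysfs can be sketched as
below. This is a minimal standalone sketch of the arithmetic in the
getinfo_super_imsm() hunk. The helper name imsm_sb_start is
hypothetical, and the sketch assumes the total block count is expressed
in 512-byte units while sector_size is the device's native sector size
in bytes.

```c
#include <assert.h>

/* Hypothetical helper mirroring the sb_start calculation in the patch:
 * the MPB anchor lies two native sectors before the end of the device,
 * with the result expressed in 512-byte units. */
static unsigned long long imsm_sb_start(unsigned long long total_blocks,
					unsigned int sector_size)
{
	unsigned long long reserved = (2ULL * sector_size) >> 9;

	assert(total_blocks >= reserved);
	return total_blocks - reserved;
}
```

Under these assumptions, a device with 512-byte sectors places the
anchor 2 units before the end, and a device with 4096-byte sectors
places it 16 units before the end.
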
 mdadm.h       |  1 +
 super-intel.c | 33 +++++++++++++++++++++++++++++++++
 sysfs.c       |  4 ++++
 3 files changed, 38 insertions(+)

diff --git a/mdadm.h b/mdadm.h
index 4eabf59..5600341 100755
--- a/mdadm.h
+++ b/mdadm.h
@@ -252,6 +252,7 @@ struct mdinfo {
 	unsigned long long	custom_array_size; /* size for non-default sized
 						    * arrays (in sectors)
 						    */
+	unsigned long long	sb_start;
 #define NO_RESHAPE		0
 #define VOLUME_RESHAPE		1
 #define CONTAINER_RESHAPE	2
diff --git a/super-intel.c b/super-intel.c
index df09272..79a3d78 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -1261,6 +1261,15 @@ static void print_imsm_dev(struct intel_super *super,
 	}
 	printf("\n");
 	printf("    Dirty State : %s\n", dev->vol.dirty ? "dirty" : "clean");
+	printf("     RWH Policy : ");
+	if (dev->rwh_policy == RWH_OFF)
+		printf("off\n");
+	else if (dev->rwh_policy == RWH_DISTRIBUTED)
+		printf("PPL distributed\n");
+	else if (dev->rwh_policy == RWH_JOURNALING_DRIVE)
+		printf("PPL journaling drive\n");
+	else
+		printf("<unknown:%d>\n", dev->rwh_policy);
 }
 
 static void print_imsm_disk(struct imsm_disk *disk, int index, __u32 reserved)
@@ -3043,6 +3052,15 @@ static void getinfo_super_imsm_volume(struct supertype *st, struct mdinfo *info,
 			}
 		}
 	}
+
+	if (info->array.level == 5) {
+		if (dev->rwh_policy == RWH_OFF)
+			info->rwh_policy = RWH_POLICY_OFF;
+		else if (dev->rwh_policy == RWH_DISTRIBUTED)
+			info->rwh_policy = RWH_POLICY_PPL;
+		else
+			info->rwh_policy = RWH_POLICY_UNKNOWN;
+	}
 }
 
 static __u8 imsm_check_degraded(struct intel_super *super, struct imsm_dev *dev,
@@ -3177,6 +3195,9 @@ static void getinfo_super_imsm(struct supertype *st, struct mdinfo *info, char *
 
 		disk = &super->disks->disk;
 		info->data_offset = total_blocks(&super->disks->disk) - reserved;
+		/* mpb anchor sector - see store_imsm_mpb() */
+		info->sb_start = total_blocks(&super->disks->disk) -
+				 ((2 * super->sector_size) >> 9);
 		info->component_size = reserved;
 		info->disk.state  = is_configured(disk) ? (1 << MD_DISK_ACTIVE) : 0;
 		/* we don't change info->disk.raid_disk here because
@@ -5034,6 +5055,17 @@ static int init_super_imsm_volume(struct supertype *st, mdu_array_info_t *info,
 	}
 	mpb->num_raid_devs++;
 
+	if (s->rwh_policy == UnSet || s->rwh_policy == RWH_POLICY_OFF) {
+		dev->rwh_policy = RWH_OFF;
+	} else if (s->rwh_policy == RWH_POLICY_PPL) {
+		dev->rwh_policy = RWH_DISTRIBUTED;
+	} else {
+		free(dev);
+		free(dv);
+		pr_err("imsm supports only PPL RWH Policy\n");
+		return 0;
+	}
+
 	dv->dev = dev;
 	dv->index = super->current_vol;
 	dv->next = super->devlist;
@@ -11061,6 +11093,7 @@ struct superswitch super_imsm = {
 	.container_content = container_content_imsm,
 	.validate_container = validate_container_imsm,
 
+	.supports_ppl	= 1,
 	.external	= 1,
 	.name = "imsm",
 
diff --git a/sysfs.c b/sysfs.c
index 4772d77..b4437a3 100644
--- a/sysfs.c
+++ b/sysfs.c
@@ -732,7 +732,11 @@ int sysfs_add_disk(struct mdinfo *sra, struct mdinfo *sd, int resume)
 			rv |= sysfs_set_num(sra, sd, "slot", sd->disk.raid_disk);
 		if (resume)
 			sysfs_set_num(sra, sd, "recovery_start", sd->recovery_start);
+		if (sra->rwh_policy == RWH_POLICY_PPL &&
+		    (sd->recovery_start == MaxSector || !resume))
+			sysfs_set_str(sra, sd, "state", "journal_ppl");
 	}
+	sysfs_set_num(sra, sd, "sb_start", sd->sb_start);
 	return rv;
 }
 
-- 
2.10.1


