Linux RAID subsystem development

Linux RAID subsystem development
 help / color / mirror / Atom feed

* RAID10 reshape and change lout possible?
From: Reindl Harald @ 2017-02-26 14:37 UTC (permalink / raw)
  To: linux-raid

since in recent news i saw that convert several raid-levels with respahe 
is possible now

is it somehow possible to change a existing md RAID10 array from "2 
near-copies" to "2 far-copies" and what would be the mdadm syntax? the 
IO intense workload here are large reads within virtual disks and as far 
as i can see far-layout would here be the better option but as installed 
the stuff in 2011 not sure if anaconda even offered that in the OS installer

so i would like to change the layout "on-the-fly" without dump/restore 
data (which is hard for the OS itself while keep dracut and friends 
working just fine) and also important keep all the UUID's (there is a 
second cloned machine which is in large parts rsynced including 
grub-config and fstab)

^ permalink raw reply

* Re: RAID10 reshape and change lout possible?
From: Andy Smith @ 2017-02-26 18:51 UTC (permalink / raw)
  To: linux-raid
In-Reply-To: <ae9c2ffb-808a-c905-ea29-1c8f5493b2ad@thelounge.net>

Hello,

On Sun, Feb 26, 2017 at 03:37:24PM +0100, Reindl Harald wrote:
> is it somehow possible to change a existing md RAID10 array from "2
> near-copies" to "2 far-copies" and what would be the mdadm syntax?

https://manpages.debian.org/jessie/mdadm/mdadm.8.en.html suggests
that --grow --layout can be changed on RAID-10. I have never tried
it myself. Experiment with loop devices. It looks like the syntax
would be:

# mdadm --grow /dev/mdX --layout=f2

Cheers,
Andy

-- 
https://bitfolk.com/ -- No-nonsense VPS hosting

> The optimum programming team size is 1.
Has Jurassic Park taught us nothing?         — pfilandr

^ permalink raw reply

* Re: RAID10 reshape and change lout possible?
From: Reindl Harald @ 2017-02-26 20:28 UTC (permalink / raw)
  To: linux-raid
In-Reply-To: <20170226185147.GD21587@bitfolk.com>



Am 26.02.2017 um 19:51 schrieb Andy Smith:
> On Sun, Feb 26, 2017 at 03:37:24PM +0100, Reindl Harald wrote:
>> is it somehow possible to change a existing md RAID10 array from "2
>> near-copies" to "2 far-copies" and what would be the mdadm syntax?
>
> https://manpages.debian.org/jessie/mdadm/mdadm.8.en.html suggests
> that --grow --layout can be changed on RAID-10. I have never tried
> it myself. Experiment with loop devices. It looks like the syntax
> would be:
>
> # mdadm --grow /dev/mdX --layout=f2

thanks for your answer

the syntax seems to be correct given the error message says clear 
"Cannot reshape RAID10 to far-mode" which sadly answers the question - 
at least for now, maybe that will change in the future

[root@testserver:~]$ mdadm --grow /dev/md0 --layout=f2
mdadm: Cannot reshape RAID10 to far-mode

[root@testserver:~]$ rpm -q mdadm
mdadm-3.4-3.fc24.x86_64

[root@testserver:~]$ uname -a
Linux testserver.rhsoft.net 4.9.12-100.fc24.x86_64 #1 SMP Thu Feb 23 
21:29:13 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

^ permalink raw reply

* Re: RAID10 reshape and change lout possible?
From: Andy Smith @ 2017-02-26 21:17 UTC (permalink / raw)
  To: linux-raid
In-Reply-To: <3ef7f0b2-1f3e-40e2-ab23-3ff9436f9ced@thelounge.net>

Hi,

On Sun, Feb 26, 2017 at 09:28:34PM +0100, Reindl Harald wrote:
> Am 26.02.2017 um 19:51 schrieb Andy Smith:
> >https://manpages.debian.org/jessie/mdadm/mdadm.8.en.html suggests
> >that --grow --layout can be changed on RAID-10.

[…]

> [root@testserver:~]$ mdadm --grow /dev/md0 --layout=f2
> mdadm: Cannot reshape RAID10 to far-mode

Hmm, that man page says that Debian jessie has version 3.3.2-5 and
still it says:

Grow

[…]

    …changing the chunk size and layout for RAID 0,4,5,6,10 as well
    as adding or removing a write-intent bitmap.

So is the man page wrong?

Admittedly until I looked I thought RAID-10 couldn't be reshaped at
all in any way. And now I don't know at all. :)

Cheers,
Andy

-- 
https://bitfolk.com/ -- No-nonsense VPS hosting

^ permalink raw reply

* Re: RAID10 reshape and change lout possible?
From: Reindl Harald @ 2017-02-26 22:57 UTC (permalink / raw)
  To: linux-raid
In-Reply-To: <20170226211728.GE21587@bitfolk.com>



Am 26.02.2017 um 22:17 schrieb Andy Smith:
> Hi,
>
> On Sun, Feb 26, 2017 at 09:28:34PM +0100, Reindl Harald wrote:
>> Am 26.02.2017 um 19:51 schrieb Andy Smith:
>>> https://manpages.debian.org/jessie/mdadm/mdadm.8.en.html suggests
>>> that --grow --layout can be changed on RAID-10.
>
> […]
>
>> [root@testserver:~]$ mdadm --grow /dev/md0 --layout=f2
>> mdadm: Cannot reshape RAID10 to far-mode
>
> Hmm, that man page says that Debian jessie has version 3.3.2-5 and
> still it says:
>     …changing the chunk size and layout for RAID 0,4,5,6,10 as well
>     as adding or removing a write-intent bitmap.
>
> So is the man page wrong?
>
> Admittedly until I looked I thought RAID-10 couldn't be reshaped at
> all in any way. And now I don't know at all. :)

well, change to "offset" seems to be supported (also not that long at 
all) if there is enough room and Google round confirms that

[root@testserver:~]$ mdadm --grow /dev/md0 --layout=o2
mdadm: insufficient head-room on /dev/sdg1

with the knowledge from today i would have chosen RAID1+0 or two RAID1 
and LVM stripe on top at least for the data partitions which would 
support the current 4 disk setup with "writemostly" and so achieve the 
full read performance out of the two SSD instead need another 2 SSD 
disks or build the raid-setup from scratch, both not making me terrible 
happy :-)

^ permalink raw reply

* Re: RAID10 reshape and change lout possible?
From: Brad Campbell @ 2017-02-27  0:42 UTC (permalink / raw)
  To: Reindl Harald, linux-raid
In-Reply-To: <ae9c2ffb-808a-c905-ea29-1c8f5493b2ad@thelounge.net>

On 26/02/17 22:37, Reindl Harald wrote:
> since in recent news i saw that convert several raid-levels with respahe
> is possible now
>
> is it somehow possible to change a existing md RAID10 array from "2
> near-copies" to "2 far-copies" and what would be the mdadm syntax? the
> IO intense workload here are large reads within virtual disks and as far
> as i can see far-layout would here be the better option but as installed
> the stuff in 2011 not sure if anaconda even offered that in the OS
> installer

Far layout will significantly improve read performance, however it will 
also significantly impact write performance. You really want to spool up 
a test system and benchmark the performance in your specific application 
before considering changes like this. The write performance degradation 
on magnetic media is significant due to the extra seeking required.

As you've already discovered. It's a backup and restore exercise on RAID10.

Regards,
Brad

^ permalink raw reply

* Re: [BUG] non-metadata arrays cannot use more than 27 component devices
From: NeilBrown @ 2017-02-27  5:55 UTC (permalink / raw)
  To: ian_bruce, linux-raid
In-Reply-To: <20170224040816.41f2f372.ian_bruce@mail.ru>

[-- Attachment #1: Type: text/plain, Size: 1225 bytes --]

On Fri, Feb 24 2017, ian_bruce@mail.ru wrote:

> When assembling non-metadata arrays ("mdadm --build"), the in-kernel
> superblock apparently defaults to the MD-RAID v0.90 type. This imposes a
> maximum of 27 component block devices, presumably as well as limits on
> device size.
>
> mdadm does not allow you to override this default, by specifying the
> v1.2 superblock. It is not clear whether mdadm tells the kernel to use
> the v0.90 superblock, or the kernel assumes this by itself. One or other
> of them should be fixed; there does not appear to be any reason why the
> v1.2 superblock should not be the default in this case.

Can you see if this change improves the behavior for you?

NeilBrown


diff --git a/drivers/md/md.c b/drivers/md/md.c
index ba485dcf1064..e0ac7f5a8e68 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -6464,9 +6464,8 @@ static int set_array_info(struct mddev *mddev, mdu_array_info_t *info)
 	mddev->layout        = info->layout;
 	mddev->chunk_sectors = info->chunk_size >> 9;
 
-	mddev->max_disks     = MD_SB_DISKS;
-
 	if (mddev->persistent) {
+		mddev->max_disks     = MD_SB_DISKS;
 		mddev->flags         = 0;
 		mddev->sb_flags         = 0;
 	}


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply related

* Re: IMSM RAID10: Rebuild/Resync difference on Linux vs Windows
From: Matthias Dahl @ 2017-02-27  9:30 UTC (permalink / raw)
  To: Artur Paszkiewicz, linux-raid
In-Reply-To: <4f749c02-3478-cafc-2ae4-40af25459f65@intel.com>

[-- Attachment #1: Type: text/plain, Size: 1190 bytes --]

Hello Artur...

On 23/02/17 16:27, Artur Paszkiewicz wrote:

> Windows IRSM should not make any "shortcuts" in this case.

I expected and dreaded that answer equally. ;-)

> [...] When it finishes you can check md/mismatch_cnt if there were any
> unsynchronized blocks.

I did a resync completely on Linux and the mismatch_cnt stayed at zero,
which is nice to know but does unfortunately not change the fact that
there was a huge jump going from Linux to Windows with regards to sync
completion the last time around. :(

Knowing almost nothing about the on-disk (meta-data) structure here, I'd
guess that having an in-sync raid could also be completely due to luck
in this case.

> Can you share the output of mdadm -E <raid disks> and
 > mdadm --detail-platform?

It is attached. If there is anything else you need, please let me know.
The data was pulled, btw, before I did the "test resync" above -- if
that is relevant to you.

> What version of RST software are you using on Windows?

14.8.0.1042 on Win10 Pro (x86_64)

Thanks,
Matthias

-- 
Dipl.-Inf. (FH) Matthias Dahl | Software Engineer | binary-island.eu
  Hire me for contract work: Open Source, Proprietary, Web/Mobile/Desktop

[-- Attachment #2: raid-details.txt --]
[-- Type: text/plain, Size: 5901 bytes --]

       Platform : Intel(R) Rapid Storage Technology
        Version : 14.6.0.2285
    RAID Levels : raid0 raid1 raid10 raid5
    Chunk Sizes : 4k 8k 16k 32k 64k 128k
    2TB volumes : supported
      2TB disks : supported
      Max Disks : 7
    Max Volumes : 2 per array, 4 per controller
 I/O Controller : /sys/devices/pci0000:00/0000:00:17.0 (SATA)
          Port5 : /dev/sdd (Z1X69GC2)
          Port3 : /dev/sdb (Z1X6VEEQ)
          Port4 : /dev/sdc (Z1X69HB2)
          Port1 : - non-disk device (PIONEER BD-RW   BDR-S09) -
          Port2 : /dev/sda (Z1X6V5JZ)
          Port0 : - no device attached -

/dev/sda:
          Magic : Intel Raid ISM Cfg Sig.
        Version : 1.3.00
    Orig Family : 8bf33aa4
         Family : 8bf33aa4
     Generation : 00312a2c
     Attributes : All supported
           UUID : 9c45a5a7:83c364db:f1e8b47c:e9aee53f
       Checksum : 65c084b8 correct
    MPB Sectors : 2
          Disks : 4
   RAID Devices : 1

  Disk00 Serial : Z1X6V5JZ
          State : active
             Id : 00000002
    Usable Size : 3907024136 (1863.01 GiB 2000.40 GB)

[MainR10]:
           UUID : 92adf4b0:129c3b94:9a2b5610:d7c8c5d5
     RAID Level : 10
        Members : 4
          Slots : [UUUU]
    Failed disk : none
      This Slot : 0
     Array Size : 7814047744 (3726.03 GiB 4000.79 GB)
   Per Dev Size : 3907024136 (1863.01 GiB 2000.40 GB)
  Sector Offset : 0
    Num Stripes : 15261812
     Chunk Size : 64 KiB
       Reserved : 0
  Migrate State : idle
      Map State : normal
    Dirty State : dirty

  Disk01 Serial : Z1X6VEEQ
          State : active
             Id : 00000003
    Usable Size : 3907024136 (1863.01 GiB 2000.40 GB)

  Disk02 Serial : Z1X69HB2
          State : active
             Id : 00000004
    Usable Size : 3907024136 (1863.01 GiB 2000.40 GB)

  Disk03 Serial : Z1X69GC2
          State : active
             Id : 00000005
    Usable Size : 3907024136 (1863.01 GiB 2000.40 GB)

/dev/sdb:
          Magic : Intel Raid ISM Cfg Sig.
        Version : 1.3.00
    Orig Family : 8bf33aa4
         Family : 8bf33aa4
     Generation : 00312a2c
     Attributes : All supported
           UUID : 9c45a5a7:83c364db:f1e8b47c:e9aee53f
       Checksum : 65c084b8 correct
    MPB Sectors : 2
          Disks : 4
   RAID Devices : 1

  Disk01 Serial : Z1X6VEEQ
          State : active
             Id : 00000003
    Usable Size : 3907024136 (1863.01 GiB 2000.40 GB)

[MainR10]:
           UUID : 92adf4b0:129c3b94:9a2b5610:d7c8c5d5
     RAID Level : 10
        Members : 4
          Slots : [UUUU]
    Failed disk : none
      This Slot : 1
     Array Size : 7814047744 (3726.03 GiB 4000.79 GB)
   Per Dev Size : 3907024136 (1863.01 GiB 2000.40 GB)
  Sector Offset : 0
    Num Stripes : 15261812
     Chunk Size : 64 KiB
       Reserved : 0
  Migrate State : idle
      Map State : normal
    Dirty State : dirty

  Disk00 Serial : Z1X6V5JZ
          State : active
             Id : 00000002
    Usable Size : 3907024136 (1863.01 GiB 2000.40 GB)

  Disk02 Serial : Z1X69HB2
          State : active
             Id : 00000004
    Usable Size : 3907024136 (1863.01 GiB 2000.40 GB)

  Disk03 Serial : Z1X69GC2
          State : active
             Id : 00000005
    Usable Size : 3907024136 (1863.01 GiB 2000.40 GB)

/dev/sdc:
          Magic : Intel Raid ISM Cfg Sig.
        Version : 1.3.00
    Orig Family : 8bf33aa4
         Family : 8bf33aa4
     Generation : 00312a2e
     Attributes : All supported
           UUID : 9c45a5a7:83c364db:f1e8b47c:e9aee53f
       Checksum : 65c084ba correct
    MPB Sectors : 2
          Disks : 4
   RAID Devices : 1

  Disk02 Serial : Z1X69HB2
          State : active
             Id : 00000004
    Usable Size : 3907024136 (1863.01 GiB 2000.40 GB)

[MainR10]:
           UUID : 92adf4b0:129c3b94:9a2b5610:d7c8c5d5
     RAID Level : 10
        Members : 4
          Slots : [UUUU]
    Failed disk : none
      This Slot : 2
     Array Size : 7814047744 (3726.03 GiB 4000.79 GB)
   Per Dev Size : 3907024136 (1863.01 GiB 2000.40 GB)
  Sector Offset : 0
    Num Stripes : 15261812
     Chunk Size : 64 KiB
       Reserved : 0
  Migrate State : idle
      Map State : normal
    Dirty State : dirty

  Disk00 Serial : Z1X6V5JZ
          State : active
             Id : 00000002
    Usable Size : 3907024136 (1863.01 GiB 2000.40 GB)

  Disk01 Serial : Z1X6VEEQ
          State : active
             Id : 00000003
    Usable Size : 3907024136 (1863.01 GiB 2000.40 GB)

  Disk03 Serial : Z1X69GC2
          State : active
             Id : 00000005
    Usable Size : 3907024136 (1863.01 GiB 2000.40 GB)

/dev/sdd:
          Magic : Intel Raid ISM Cfg Sig.
        Version : 1.3.00
    Orig Family : 8bf33aa4
         Family : 8bf33aa4
     Generation : 00312a2f
     Attributes : All supported
           UUID : 9c45a5a7:83c364db:f1e8b47c:e9aee53f
       Checksum : 65bf84bb correct
    MPB Sectors : 2
          Disks : 4
   RAID Devices : 1

  Disk03 Serial : Z1X69GC2
          State : active
             Id : 00000005
    Usable Size : 3907024136 (1863.01 GiB 2000.40 GB)

[MainR10]:
           UUID : 92adf4b0:129c3b94:9a2b5610:d7c8c5d5
     RAID Level : 10
        Members : 4
          Slots : [UUUU]
    Failed disk : none
      This Slot : 3
     Array Size : 7814047744 (3726.03 GiB 4000.79 GB)
   Per Dev Size : 3907024136 (1863.01 GiB 2000.40 GB)
  Sector Offset : 0
    Num Stripes : 15261812
     Chunk Size : 64 KiB
       Reserved : 0
  Migrate State : idle
      Map State : normal
    Dirty State : clean

  Disk00 Serial : Z1X6V5JZ
          State : active
             Id : 00000002
    Usable Size : 3907024136 (1863.01 GiB 2000.40 GB)

  Disk01 Serial : Z1X6VEEQ
          State : active
             Id : 00000003
    Usable Size : 3907024136 (1863.01 GiB 2000.40 GB)

  Disk02 Serial : Z1X69HB2
          State : active
             Id : 00000004
    Usable Size : 3907024136 (1863.01 GiB 2000.40 GB)

^ permalink raw reply

* exposed and pushed the mdadm/test improvement progress
From: zhilong @ 2017-02-27 12:37 UTC (permalink / raw)
  To: NeilBrown, Jes Sorensen; +Cc: linux-raid@vger.kernel.org, Guoqing Jiang

hi, Neil and Jes;

    Several weeks ago, the mdadm version 4.0 has announced and we talked 
about how to improve the testing part in that mail-tree.
I have spent two weeks to verify the testing part in different testing 
environments include of
    kernel:                                    mdadm: 
                                   filesystem:
    SUSE 12 SP2 Enterprise      mdadm 4.0 source code       ext3, ext4 
and btrfs
    linux upstream 4.10             mdadm 
4.0                             ext3, ext4 and btrfs

    According to the two weeks' testing and my understanding, my first 
step has done to redraft the 'test' script, as my purpose,
I mainly made it easier to read, added some important checking I thought 
and changed some func(). After the 'test' file has done,
the second part is to improve every case under tests/ folder.

    The 'tests/check' and 'tests/testdev' files would be deleted after 
finish the testing part improvement. I have added comments in the
RFC patch, I wanna ask for all ideas about the testing improvement in 
the mailing-list, would you have any suggestion?

Thanks very much,
-Zhilong

^ permalink raw reply

* Re: [PATCH v5 4/4] dt-bindings: Add DT bindings document for Broadcom SBA RAID driver
From: Rob Herring @ 2017-02-27 14:47 UTC (permalink / raw)
  To: Anup Patel
  Cc: Mark Rutland, devicetree, Herbert Xu, Scott Branden, Vinod Koul,
	Ray Jui, Jassi Brar, linux-kernel, linux-raid, Jon Mason,
	bcm-kernel-feedback-list, linux-crypto, Rob Rice, dmaengine,
	Dan Williams, David S . Miller, linux-arm-kernel
In-Reply-To: <1487227695-965-5-git-send-email-anup.patel@broadcom.com>

On Thu, Feb 16, 2017 at 12:18:15PM +0530, Anup Patel wrote:
> This patch adds the DT bindings document for newly added Broadcom
> SBA RAID driver.
> 
> Signed-off-by: Anup Patel <anup.patel@broadcom.com>
> Reviewed-by: Ray Jui <ray.jui@broadcom.com>
> Reviewed-by: Scott Branden <scott.branden@broadcom.com>
> ---
>  .../devicetree/bindings/dma/brcm,iproc-sba.txt     | 29 ++++++++++++++++++++++
>  1 file changed, 29 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/dma/brcm,iproc-sba.txt

Acked-by: Rob Herring <robh@kernel.org>

^ permalink raw reply

* Re: Process stuck in md_flush_request (state: D)
From: Les Stroud @ 2017-02-27 14:49 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid
In-Reply-To: <9589FA71-C458-4B44-B2F3-3D42E8B0885D@lesstroud.com>

After a period of a couple of weeks with one of our test instances having this problem every other day, they were all nice enough to operate without an issue for 9 days.  It finally reoccurred last night on one of the machines.  

It exhibits the same symptoms and the call traces look as they did previously.  This particular instance is configured with a deadline scheduler.  I was able to capture the inflight you requested:

$ cat /sys/block/xvd[abcde]/inflight
       0        0
       0        0
       0        0
       0        0
       0        0

I’ve had this happen on instances with the deadline scheduler and the noop scheduler.  At this point, I have not had this happen on an instance that is noop and the raid filesystem (ext4) is mounted with nobarrier.  The instances with noop/nobarrier have not been running long enough for me to make any sort of conclusion that it works around the problem.  Frankly, I’m not sure I understand the interaction between ext4 barriers and raid0 block flushes well enough to theorize whether it should or shouldn’t make a difference.

Does any of this help with identifying the bug?  Is there anymore information I can get that would be useful?  

I have some systems that need to be moved into production in the next couple of weeks that are having this problem.  Do you have any ideas how I might be able to work around the problem?

Thanx for the time,
LES

> On Feb 17, 2017, at 3:40 PM, Les Stroud <les@lesstroud.com> wrote:
> 
> It’ll take a day or two for it to happen again.  When it does, I’ll pull the inflight stats.  Anything else I should grab while I’m at it?
> 
> Thanx,
> LES
> 
> 
>> On Feb 17, 2017, at 3:06 PM, Shaohua Li <shli@kernel.org> wrote:
>> 
>> On Fri, Feb 17, 2017 at 02:05:49PM -0500, Les Stroud wrote:
>>> 
>>> I have a problem with processes entering an uninterruptible sleep state in md_flush_request and never returning. I having trouble identifying the underlying issue. I’m hoping someone on here may be able to help.
>>> 
>>> The servers in question are running in aws (xen hvm) with kernel 3.8.13-118.16.2.el6uek.x86_64.  The server has two mounts.  The first is vanilla ext4.  The second is a software RAID0 array, striped with 256k chunks, buiIt with md.  It’s file system is ext4. 
>>> 
>>> The most immediately and obvious symptom of the issue are kernel errors “kernel: INFO task [some process]: blocked for more than 120 seconds.”.  Shortly there after, other processes start entering the same uninterruptible wait state (D). This almost always impacts ssh logins.
>>> 
>>> The problem does not occur when the system is under load, or was recently under load.  It happens when the system is quiet (no cpu, very little io).
>> 
>> This seems suggesting we have a missed blk-plug flush in light workload. Can
>> you check the output of /sys/block/disk-bame/inflight for both md and the
>> underlayer disks? This will let us know if there is IO pending.
>> Also it would be great if you can test a upstream kernel.
>> 
>> Thanks,
>> Shaohua
> 

^ permalink raw reply

* Re: LSI RAID
From: Gandalf Corvotempesta @ 2017-02-27 16:02 UTC (permalink / raw)
  To: linux-raid
In-Reply-To: <CAJH6TXj+766Xj4w8BAOoq9wQ_Vym_JD66pbqRgpXriNguG6zTQ@mail.gmail.com>

anyone?

2017-02-13 11:33 GMT+01:00 Gandalf Corvotempesta
<gandalf.corvotempesta@gmail.com>:
> Hi to all
> silly question: i've read that LSI/PERC Hardware controller supports DDF.
>
> Would be possible, for mdadm, to use a RAID created with an LSI/PERC
> controller supporting DDF ?

^ permalink raw reply

* Re: Process stuck in md_flush_request (state: D)
From: Shaohua Li @ 2017-02-27 18:28 UTC (permalink / raw)
  To: Les Stroud; +Cc: linux-raid
In-Reply-To: <829563C6-A2AF-4E5F-B5AF-D33D2E5A734E@lesstroud.com>

On Mon, Feb 27, 2017 at 09:49:59AM -0500, Les Stroud wrote:
> After a period of a couple of weeks with one of our test instances having this problem every other day, they were all nice enough to operate without an issue for 9 days.  It finally reoccurred last night on one of the machines.  
> 
> It exhibits the same symptoms and the call traces look as they did previously.  This particular instance is configured with a deadline scheduler.  I was able to capture the inflight you requested:
> 
> $ cat /sys/block/xvd[abcde]/inflight
>        0        0
>        0        0
>        0        0
>        0        0
>        0        0
> 
> I’ve had this happen on instances with the deadline scheduler and the noop scheduler.  At this point, I have not had this happen on an instance that is noop and the raid filesystem (ext4) is mounted with nobarrier.  The instances with noop/nobarrier have not been running long enough for me to make any sort of conclusion that it works around the problem.  Frankly, I’m not sure I understand the interaction between ext4 barriers and raid0 block flushes well enough to theorize whether it should or shouldn’t make a difference.

If nobarrier, ext4 doesn't send flush request.
 
> Does any of this help with identifying the bug?  Is there anymore information I can get that would be useful?  


Unfortunately I can't find anything fishing. Does the xcdx disk correctly
handle flush request? For example, you can do the same test with a single such
disk and check if anything wrong.

Thanks,
Shaohua

^ permalink raw reply

* Re: Process stuck in md_flush_request (state: D)
From: Les Stroud @ 2017-02-27 18:48 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid
In-Reply-To: <20170227182806.jntxzhyw3nkohl5r@kernel.org>





> On Feb 27, 2017, at 1:28 PM, Shaohua Li <shli@kernel.org> wrote:
> 
> On Mon, Feb 27, 2017 at 09:49:59AM -0500, Les Stroud wrote:
>> After a period of a couple of weeks with one of our test instances having this problem every other day, they were all nice enough to operate without an issue for 9 days.  It finally reoccurred last night on one of the machines.  
>> 
>> It exhibits the same symptoms and the call traces look as they did previously.  This particular instance is configured with a deadline scheduler.  I was able to capture the inflight you requested:
>> 
>> $ cat /sys/block/xvd[abcde]/inflight
>>        0        0
>>        0        0
>>        0        0
>>        0        0
>>        0        0
>> 
>> I’ve had this happen on instances with the deadline scheduler and the noop scheduler.  At this point, I have not had this happen on an instance that is noop and the raid filesystem (ext4) is mounted with nobarrier.  The instances with noop/nobarrier have not been running long enough for me to make any sort of conclusion that it works around the problem. Frankly, I’m not sure I understand the interaction between ext4 barriers and raid0 block flushes well enough to theorize whether it should or shouldn’t make a difference.
> 
> If nobarrier, ext4 doesn't send flush request.

So, could ext4’s flush request deadlock with an md_flush_request?  Do they share a mutex of some sort? Could one of them be failing to acquire a mutex and not handling it?

> 
>> Does any of this help with identifying the bug?  Is there anymore information I can get that would be useful?  
> 
> 
> Unfortunately I can't find anything fishing. Does the xcdx disk correctly
> handle flush request? For example, you can do the same test with a single such
> disk and check if anything wrong.

Until recently, we had a number of these systems setup without raid0.  This issue never occurred on those systems.  Unfortunately, I can’t find a way to make it happen other than stand a server up and let it run.

I suppose I could try a different filesystem and see if that makes a difference (maybe ext3, xfs, etc).


> 
> Thanks,
> Shaohua


^ permalink raw reply

* Re: LSI RAID
From: Stan Hoeppner @ 2017-02-27 23:04 UTC (permalink / raw)
  To: Gandalf Corvotempesta, linux-raid
In-Reply-To: <CAJH6TXhwfoWKM2=0v0+qhUWp=WppF5qO7zm5Y1-vS4zWQHA5WA@mail.gmail.com>

No, you must use the LSI tools.


On 02/27/2017 10:02 AM, Gandalf Corvotempesta wrote:
> anyone?
>
> 2017-02-13 11:33 GMT+01:00 Gandalf Corvotempesta
> <gandalf.corvotempesta@gmail.com>:
>> Hi to all
>> silly question: i've read that LSI/PERC Hardware controller supports DDF.
>>
>> Would be possible, for mdadm, to use a RAID created with an LSI/PERC
>> controller supporting DDF ?
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply

* GRUB warning after replacing disk drive in RAID1
From: Peter Sangas @ 2017-02-27 23:37 UTC (permalink / raw)
  To: linux-raid

Hi all.

I have a RAID1 with 3 disks sda,sdb,sdc.  After replacing sdc and re-syncing
it to the array I issued the following command to load grub but I get this 
warning:

grub-install /dev/sdc

Installing for i386-pc platform.
grub-install: warning: Couldn't find physical volume `(null)'. Some modules
may be missing from core image..
grub-install: warning: Couldn't find physical volume `(null)'. Some modules
may be missing from core image..
Installation finished. No error reported.

Does anyone know why I get this warning and how to avoid it.

uname -a 
Linux green 4.4.0-47-generic #68-Ubuntu SMP Wed Oct 26 19:39:52 UTC 2016
x86_64 x86_64 x86_64 GNU/Linux

mdadm -V
mdadm - v3.3 - 3rd September 2013

Thanks,
Pete

^ permalink raw reply

* Re: [PATCH v4 2/7] raid5: calculate partial parity for a stripe
From: Shaohua Li @ 2017-02-27 23:45 UTC (permalink / raw)
  To: Artur Paszkiewicz; +Cc: shli, linux-raid
In-Reply-To: <20170221194401.18733-3-artur.paszkiewicz@intel.com>

On Tue, Feb 21, 2017 at 08:43:56PM +0100, Artur Paszkiewicz wrote:
> Attach a page for holding the partial parity data to stripe_head.
> Allocate it only if mddev has the MD_HAS_PPL flag set.
> 
> Partial parity is the xor of not modified data chunks of a stripe and is
> calculated as follows:
> 
> - reconstruct-write case:
>   xor data from all not updated disks in a stripe
> 
> - read-modify-write case:
>   xor old data and parity from all updated disks in a stripe
> 
> Implement it using the async_tx API and integrate into raid_run_ops().
> It must be called when we still have access to old data, so do it when
> STRIPE_OP_BIODRAIN is set, but before ops_run_prexor5(). The result is
> stored into sh->ppl_page.
> 
> Partial parity is not meaningful for full stripe write and is not stored
> in the log or used for recovery, so don't attempt to calculate it when
> stripe has STRIPE_FULL_WRITE.
> 
> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> ---
>  drivers/md/raid5.c | 100 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  drivers/md/raid5.h |   3 ++
>  2 files changed, 103 insertions(+)
> 
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 7b7722bb2e8d..02e02fe5b04e 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -463,6 +463,11 @@ static void shrink_buffers(struct stripe_head *sh)
>  		sh->dev[i].page = NULL;
>  		put_page(p);
>  	}
> +
> +	if (sh->ppl_page) {
> +		put_page(sh->ppl_page);
> +		sh->ppl_page = NULL;
> +	}
>  }
>  
>  static int grow_buffers(struct stripe_head *sh, gfp_t gfp)
> @@ -479,6 +484,13 @@ static int grow_buffers(struct stripe_head *sh, gfp_t gfp)
>  		sh->dev[i].page = page;
>  		sh->dev[i].orig_page = page;
>  	}
> +
> +	if (test_bit(MD_HAS_PPL, &sh->raid_conf->mddev->flags)) {
> +		sh->ppl_page = alloc_page(gfp);
> +		if (!sh->ppl_page)
> +			return 1;
> +	}
> +
>  	return 0;
>  }
>  
> @@ -1974,6 +1986,55 @@ static void ops_run_check_pq(struct stripe_head *sh, struct raid5_percpu *percpu
>  			   &sh->ops.zero_sum_result, percpu->spare_page, &submit);
>  }
>  
> +static struct dma_async_tx_descriptor *
> +ops_run_partial_parity(struct stripe_head *sh, struct raid5_percpu *percpu,
> +		       struct dma_async_tx_descriptor *tx)
> +{
> +	int disks = sh->disks;
> +	struct page **xor_srcs = flex_array_get(percpu->scribble, 0);
> +	int count = 0, pd_idx = sh->pd_idx, i;
> +	struct async_submit_ctl submit;
> +
> +	pr_debug("%s: stripe %llu\n", __func__, (unsigned long long)sh->sector);
> +
> +	/*
> +	 * Partial parity is the XOR of stripe data chunks that are not changed
> +	 * during the write request. Depending on available data
> +	 * (read-modify-write vs. reconstruct-write case) we calculate it
> +	 * differently.
> +	 */
> +	if (sh->reconstruct_state == reconstruct_state_prexor_drain_run) {
> +		/* rmw: xor old data and parity from updated disks */
> +		for (i = disks; i--;) {
> +			struct r5dev *dev = &sh->dev[i];
> +			if (test_bit(R5_Wantdrain, &dev->flags) || i == pd_idx)
> +				xor_srcs[count++] = dev->page;
> +		}
> +	} else if (sh->reconstruct_state == reconstruct_state_drain_run) {
> +		/* rcw: xor data from all not updated disks */
> +		for (i = disks; i--;) {
> +			struct r5dev *dev = &sh->dev[i];
> +			if (test_bit(R5_UPTODATE, &dev->flags))
> +				xor_srcs[count++] = dev->page;
> +		}
> +	} else {
> +		return tx;
> +	}
> +
> +	init_async_submit(&submit, ASYNC_TX_XOR_ZERO_DST, tx, NULL, sh,
> +			  flex_array_get(percpu->scribble, 0)
> +			  + sizeof(struct page *) * (sh->disks + 2));
> +
> +	if (count == 1)
> +		tx = async_memcpy(sh->ppl_page, xor_srcs[0], 0, 0, PAGE_SIZE,
> +				  &submit);
> +	else
> +		tx = async_xor(sh->ppl_page, xor_srcs, 0, count, PAGE_SIZE,
> +			       &submit);
> +
> +	return tx;
> +}

Can you put this function to raid5-ppl.c? I'd like to keep all the log related
out raid5.c if possible.

Thanks,
Shaohua

^ permalink raw reply

* Re: [PATCH v4 3/7] raid5-ppl: Partial Parity Log write logging implementation
From: Shaohua Li @ 2017-02-28  0:04 UTC (permalink / raw)
  To: Artur Paszkiewicz; +Cc: shli, linux-raid
In-Reply-To: <20170221194401.18733-4-artur.paszkiewicz@intel.com>

On Tue, Feb 21, 2017 at 08:43:57PM +0100, Artur Paszkiewicz wrote:
> This implements the PPL write logging functionality. The description
> of PPL is added to the documentation. More details can be found in the
> comments in raid5-ppl.c.
> 
> Put the PPL metadata structures to md_p.h because userspace tools
> (mdadm) will also need to read/write PPL.
> 
> Warn about using PPL with enabled disk volatile write-back cache for
> now. It can be removed once disk cache flushing before writing PPL is
> implemented.
> 
> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> ---
>  Documentation/md/raid5-ppl.txt |  44 +++
>  drivers/md/Makefile            |   2 +-
>  drivers/md/raid5-ppl.c         | 617 +++++++++++++++++++++++++++++++++++++++++
>  drivers/md/raid5.c             |  49 +++-
>  drivers/md/raid5.h             |   8 +
>  include/uapi/linux/raid/md_p.h |  26 ++
>  6 files changed, 738 insertions(+), 8 deletions(-)
>  create mode 100644 Documentation/md/raid5-ppl.txt
>  create mode 100644 drivers/md/raid5-ppl.c
> 
> diff --git a/Documentation/md/raid5-ppl.txt b/Documentation/md/raid5-ppl.txt
> new file mode 100644
> index 000000000000..127072b09363
> --- /dev/null
> +++ b/Documentation/md/raid5-ppl.txt
> @@ -0,0 +1,44 @@
> +Partial Parity Log
> +
> +Partial Parity Log (PPL) is a feature available for RAID5 arrays. The issue
> +addressed by PPL is that after a dirty shutdown, parity of a particular stripe
> +may become inconsistent with data on other member disks. If the array is also
> +in degraded state, there is no way to recalculate parity, because one of the
> +disks is missing. This can lead to silent data corruption when rebuilding the
> +array or using it is as degraded - data calculated from parity for array blocks
> +that have not been touched by a write request during the unclean shutdown can
> +be incorrect. Such condition is known as the RAID5 Write Hole. Because of
> +this, md by default does not allow starting a dirty degraded array.
> +
> +Partial parity for a write operation is the XOR of stripe data chunks not
> +modified by this write. It is just enough data needed for recovering from the
> +write hole. XORing partial parity with the modified chunks produces parity for
> +the stripe, consistent with its state before the write operation, regardless of
> +which chunk writes have completed. If one of the not modified data disks of
> +this stripe is missing, this updated parity can be used to recover its
> +contents. PPL recovery is also performed when starting an array after an
> +unclean shutdown and all disks are available, eliminating the need to resync
> +the array. Because of this, using write-intent bitmap and PPL together is not
> +supported.
> +
> +When handling a write request PPL writes partial parity before new data and
> +parity are dispatched to disks. PPL is a distributed log - it is stored on
> +array member drives in the metadata area, on the parity drive of a particular
> +stripe.  It does not require a dedicated journaling drive. Write performance is
> +reduced by up to 30%-40% but it scales with the number of drives in the array
> +and the journaling drive does not become a bottleneck or a single point of
> +failure.
> +
> +Unlike raid5-cache, the other solution in md for closing the write hole, PPL is
> +not a true journal. It does not protect from losing in-flight data, only from
> +silent data corruption. If a dirty disk of a stripe is lost, no PPL recovery is
> +performed for this stripe (parity is not updated). So it is possible to have
> +arbitrary data in the written part of a stripe if that disk is lost. In such
> +case the behavior is the same as in plain raid5.
> +
> +PPL is available for md version-1 metadata and external (specifically IMSM)
> +metadata arrays. It can be enabled using mdadm option --consistency-policy=ppl.
> +
> +Currently, volatile write-back cache should be disabled on all member drives
> +when using PPL. Otherwise it cannot guarantee consistency in case of power
> +failure.
> diff --git a/drivers/md/Makefile b/drivers/md/Makefile
> index 3cbda1af87a0..4d48714ccc6b 100644
> --- a/drivers/md/Makefile
> +++ b/drivers/md/Makefile
> @@ -18,7 +18,7 @@ dm-cache-cleaner-y += dm-cache-policy-cleaner.o
>  dm-era-y	+= dm-era-target.o
>  dm-verity-y	+= dm-verity-target.o
>  md-mod-y	+= md.o bitmap.o
> -raid456-y	+= raid5.o raid5-cache.o
> +raid456-y	+= raid5.o raid5-cache.o raid5-ppl.o
>  
>  # Note: link order is important.  All raid personalities
>  # and must come before md.o, as they each initialise 
> diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
> new file mode 100644
> index 000000000000..a00cabf1adf6
> --- /dev/null
> +++ b/drivers/md/raid5-ppl.c
> @@ -0,0 +1,617 @@
> +/*
> + * Partial Parity Log for closing the RAID5 write hole
> + * Copyright (c) 2017, Intel Corporation.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/blkdev.h>
> +#include <linux/slab.h>
> +#include <linux/crc32c.h>
> +#include <linux/raid/md_p.h>
> +#include "md.h"
> +#include "raid5.h"
> +
> +/*
> + * PPL consists of a 4KB header (struct ppl_header) and at least 128KB for
> + * partial parity data. The header contains an array of entries
> + * (struct ppl_header_entry) which describe the logged write requests.
> + * Partial parity for the entries comes after the header, written in the same
> + * sequence as the entries:
> + *
> + * Header
> + *   entry0
> + *   ...
> + *   entryN
> + * PP data
> + *   PP for entry0
> + *   ...
> + *   PP for entryN
> + *
> + * Every entry holds a checksum of its partial parity, the header also has a
> + * checksum of the header itself. Entries for full stripes writes contain no
> + * partial parity, they only mark the stripes for which parity should be
> + * recalculated after an unclean shutdown.

Please describe the disk format in details. I had trouble to understand
ppl_log_stripe() and ppl_submit_iounit(), things like the ppl merge, full
stripes handling.

> + *
> + * A write request is always logged to the PPL instance stored on the parity
> + * disk of the corresponding stripe. For each member disk there is one ppl_log
> + * used to handle logging for this disk, independently from others. They are
> + * grouped in child_logs array in struct ppl_conf, which is assigned to
> + * r5conf->ppl and used in raid5 core.
> + *
> + * ppl_io_unit represents a full PPL write, header_page contains the ppl_header.
> + * PPL entries for logged stripes are added in ppl_log_stripe(). A stripe can
> + * be appended to the last entry if the chunks to write are the same, otherwise
> + * a new entry is added. Checksums of entries are calculated incrementally as
> + * stripes containing partial parity are being added to entries.
> + * ppl_submit_iounit() calculates the checksum of the header and submits a bio
> + * containing the header page and partial parity pages (sh->ppl_page) for all
> + * stripes of the io_unit. When the PPL write completes, the stripes associated
> + * with the io_unit are released and raid5d starts writing their data and
> + * parity. When all stripes are written, the io_unit is freed and the next can
> + * be submitted.
> + *
> + * An io_unit is used to gather stripes until it is submitted or becomes full
> + * (if the maximum number of entries or size of PPL is reached). Another io_unit
> + * can't be submitted until the previous has completed (PPL and stripe
> + * data+parity is written). The log->io_list tracks all io_units of a log
> + * (for a single member disk). New io_units are added to the end of the list
> + * and the first io_unit is submitted, if it is not submitted already.
> + * The current io_unit accepting new stripes is always at the end of the list.
> + */
> +
> +struct ppl_conf {
> +	struct mddev *mddev;
> +
> +	/* array of child logs, one for each raid disk */
> +	struct ppl_log *child_logs;
> +	int count;
> +
> +	int block_size;		/* the logical block size used for data_sector
> +				 * in ppl_header_entry */
> +	u32 signature;		/* raid array identifier */
> +	atomic64_t seq;		/* current log write sequence number */
> +
> +	struct kmem_cache *io_kc;
> +	mempool_t *io_pool;
> +	struct bio_set *bs;
> +	mempool_t *meta_pool;
> +};
> +
> +struct ppl_log {
> +	struct ppl_conf *ppl_conf;	/* shared between all log instances */
> +
> +	struct md_rdev *rdev;		/* array member disk associated with
> +					 * this log instance */
> +	struct mutex io_mutex;
> +	struct ppl_io_unit *current_io;	/* current io_unit accepting new data
> +					 * always at the end of io_list */
> +	spinlock_t io_list_lock;
> +	struct list_head io_list;	/* all io_units of this log */
> +	struct list_head no_mem_stripes;/* stripes to retry if failed to
> +					 * allocate io_unit */
> +};
> +
> +struct ppl_io_unit {
> +	struct ppl_log *log;
> +
> +	struct page *header_page;	/* for ppl_header */
> +
> +	unsigned int entries_count;	/* number of entries in ppl_header */
> +	unsigned int pp_size;		/* total size current of partial parity */
> +
> +	u64 seq;			/* sequence number of this log write */
> +	struct list_head log_sibling;	/* log->io_list */
> +
> +	struct list_head stripe_list;	/* stripes added to the io_unit */
> +	atomic_t pending_stripes;	/* how many stripes not written to raid */
> +
> +	bool submitted;			/* true if write to log started */
> +};
> +
> +static struct ppl_io_unit *ppl_new_iounit(struct ppl_log *log,
> +					  struct stripe_head *sh)
> +{
> +	struct ppl_conf *ppl_conf = log->ppl_conf;
> +	struct ppl_io_unit *io;
> +	struct ppl_header *pplhdr;
> +
> +	io = mempool_alloc(ppl_conf->io_pool, GFP_ATOMIC);
> +	if (!io)
> +		return NULL;
> +
> +	memset(io, 0, sizeof(*io));
> +	io->log = log;
> +	INIT_LIST_HEAD(&io->log_sibling);
> +	INIT_LIST_HEAD(&io->stripe_list);
> +	atomic_set(&io->pending_stripes, 0);
> +
> +	io->header_page = mempool_alloc(ppl_conf->meta_pool, GFP_NOIO);
> +	pplhdr = page_address(io->header_page);
> +	clear_page(pplhdr);
> +	memset(pplhdr->reserved, 0xff, PPL_HDR_RESERVED);
> +	pplhdr->signature = cpu_to_le32(ppl_conf->signature);
> +
> +	io->seq = atomic64_add_return(1, &ppl_conf->seq);
> +	pplhdr->generation = cpu_to_le64(io->seq);
> +
> +	return io;
> +}
> +
> +static int ppl_log_stripe(struct ppl_log *log, struct stripe_head *sh)
> +{
> +	struct ppl_io_unit *io = log->current_io;
> +	struct ppl_header_entry *e = NULL;
> +	struct ppl_header *pplhdr;
> +	int i;
> +	sector_t data_sector = 0;
> +	int data_disks = 0;
> +	unsigned int entry_space = (log->rdev->ppl.size << 9) - PPL_HEADER_SIZE;
> +	struct r5conf *conf = sh->raid_conf;
> +
> +	pr_debug("%s: stripe: %llu\n", __func__, (unsigned long long)sh->sector);
> +
> +	/* check if current io_unit is full */
> +	if (io && (io->pp_size == entry_space ||
> +		   io->entries_count == PPL_HDR_MAX_ENTRIES)) {
> +		pr_debug("%s: add io_unit blocked by seq: %llu\n",
> +			 __func__, io->seq);
> +		io = NULL;
> +	}
> +
> +	/* add a new unit if there is none or the current is full */
> +	if (!io) {
> +		io = ppl_new_iounit(log, sh);
> +		if (!io)
> +			return -ENOMEM;
> +		spin_lock_irq(&log->io_list_lock);
> +		list_add_tail(&io->log_sibling, &log->io_list);
> +		spin_unlock_irq(&log->io_list_lock);
> +
> +		log->current_io = io;
> +	}
> +
> +	for (i = 0; i < sh->disks; i++) {
> +		struct r5dev *dev = &sh->dev[i];
> +
> +		if (i != sh->pd_idx && test_bit(R5_Wantwrite, &dev->flags)) {
> +			if (!data_disks || dev->sector < data_sector)
> +				data_sector = dev->sector;
> +			data_disks++;
> +		}
> +	}
> +	BUG_ON(!data_disks);
> +
> +	pr_debug("%s: seq: %llu data_sector: %llu data_disks: %d\n", __func__,
> +		 io->seq, (unsigned long long)data_sector, data_disks);
> +
> +	pplhdr = page_address(io->header_page);
> +
> +	if (io->entries_count > 0) {
> +		struct ppl_header_entry *last =
> +				&pplhdr->entries[io->entries_count - 1];
> +		u64 data_sector_last = le64_to_cpu(last->data_sector);
> +		u32 data_size_last = le32_to_cpu(last->data_size);
> +		u32 pp_size_last = le32_to_cpu(last->pp_size);
> +
> +		/*
> +		 * Check if we can merge with the last entry. Must be on
> +		 * the same stripe and disks. Use bit shift and logarithm
> +		 * to avoid 64-bit division.
> +		 */
> +		if ((data_sector >> ilog2(conf->chunk_sectors) ==
> +		     data_sector_last >> ilog2(conf->chunk_sectors)) &&
> +		    ((pp_size_last == 0 &&
> +		      test_bit(STRIPE_FULL_WRITE, &sh->state)) ||
> +		     ((data_sector_last + (pp_size_last >> 9) == data_sector) &&
> +		      (data_size_last == pp_size_last * data_disks))))

please explain this part

> +			e = last;
> +	}
> +
> +	if (!e) {
> +		e = &pplhdr->entries[io->entries_count++];
> +		e->data_sector = cpu_to_le64(data_sector);

Don't really understand what's the data_sector for. Should this be stripe->sector?

> +		e->parity_disk = cpu_to_le32(sh->pd_idx);
> +		e->checksum = cpu_to_le32(~0);
> +	}
> +
> +	le32_add_cpu(&e->data_size, data_disks << PAGE_SHIFT);
> +
> +	/* don't write any PP if full stripe write */
> +	if (!test_bit(STRIPE_FULL_WRITE, &sh->state)) {
> +		le32_add_cpu(&e->pp_size, PAGE_SIZE);
> +		io->pp_size += PAGE_SIZE;
> +		e->checksum = cpu_to_le32(crc32c_le(le32_to_cpu(e->checksum),
> +						    page_address(sh->ppl_page),
> +						    PAGE_SIZE));
> +	}
> +
> +	list_add_tail(&sh->log_list, &io->stripe_list);
> +	atomic_inc(&io->pending_stripes);
> +	sh->ppl_log_io = io;
> +
> +	return 0;
> +}
> +
> +int ppl_write_stripe(struct ppl_conf *ppl_conf, struct stripe_head *sh)
> +{
> +	struct ppl_log *log;
> +	struct ppl_io_unit *io = sh->ppl_log_io;
> +
> +	if (io || test_bit(STRIPE_SYNCING, &sh->state) ||
> +	    !test_bit(R5_Wantwrite, &sh->dev[sh->pd_idx].flags) ||
> +	    !test_bit(R5_Insync, &sh->dev[sh->pd_idx].flags)) {
> +		clear_bit(STRIPE_LOG_TRAPPED, &sh->state);
> +		return -EAGAIN;
> +	}
> +
> +	log = &ppl_conf->child_logs[sh->pd_idx];
> +
> +	mutex_lock(&log->io_mutex);
> +
> +	if (!log->rdev || test_bit(Faulty, &log->rdev->flags)) {

is the log->rdev check required?

> +		mutex_unlock(&log->io_mutex);
> +		return -EAGAIN;
> +	}
> +
> +	set_bit(STRIPE_LOG_TRAPPED, &sh->state);
> +	clear_bit(STRIPE_DELAYED, &sh->state);
> +	atomic_inc(&sh->count);
> +
> +	if (ppl_log_stripe(log, sh)) {
> +		spin_lock_irq(&log->io_list_lock);
> +		list_add_tail(&sh->log_list, &log->no_mem_stripes);
> +		spin_unlock_irq(&log->io_list_lock);
> +	}
> +
> +	mutex_unlock(&log->io_mutex);
> +
> +	return 0;
> +}
> +
> +static void ppl_log_endio(struct bio *bio)
> +{
> +	struct ppl_io_unit *io = bio->bi_private;
> +	struct ppl_log *log = io->log;
> +	struct ppl_conf *ppl_conf = log->ppl_conf;
> +	struct stripe_head *sh, *next;
> +
> +	pr_debug("%s: seq: %llu\n", __func__, io->seq);
> +
> +	if (bio->bi_error)
> +		md_error(ppl_conf->mddev, log->rdev);
> +
> +	bio_put(bio);
> +	mempool_free(io->header_page, ppl_conf->meta_pool);
> +
> +	list_for_each_entry_safe(sh, next, &io->stripe_list, log_list) {
> +		list_del_init(&sh->log_list);
> +
> +		set_bit(STRIPE_HANDLE, &sh->state);
> +		raid5_release_stripe(sh);
> +	}
> +}
> +
> +static void ppl_submit_iounit(struct ppl_io_unit *io)
> +{
> +	struct ppl_log *log = io->log;
> +	struct ppl_conf *ppl_conf = log->ppl_conf;
> +	struct r5conf *conf = ppl_conf->mddev->private;
> +	struct ppl_header *pplhdr = page_address(io->header_page);
> +	struct bio *bio;
> +	struct stripe_head *sh;
> +	int i;
> +	struct bio_list bios = BIO_EMPTY_LIST;
> +	char b[BDEVNAME_SIZE];
> +
> +	bio = bio_alloc_bioset(GFP_NOIO, BIO_MAX_PAGES, ppl_conf->bs);
> +	bio->bi_private = io;
> +	bio->bi_end_io = ppl_log_endio;
> +	bio->bi_opf = REQ_OP_WRITE | REQ_FUA;
> +	bio->bi_bdev = log->rdev->bdev;
> +	bio->bi_iter.bi_sector = log->rdev->ppl.sector;
> +	bio_add_page(bio, io->header_page, PAGE_SIZE, 0);
> +	bio_list_add(&bios, bio);
> +
> +	sh = list_first_entry(&io->stripe_list, struct stripe_head, log_list);
> +
> +	for (i = 0; i < io->entries_count; i++) {
> +		struct ppl_header_entry *e = &pplhdr->entries[i];
> +		u32 pp_size = le32_to_cpu(e->pp_size);
> +		u32 data_size = le32_to_cpu(e->data_size);
> +		u64 data_sector = le64_to_cpu(e->data_sector);
> +		int stripes_count;
> +
> +		if (pp_size > 0)
> +			stripes_count = pp_size >> PAGE_SHIFT;
> +		else
> +			stripes_count = (data_size /
> +					 (conf->raid_disks -
> +					  conf->max_degraded)) >> PAGE_SHIFT;

please explain this part

> +
> +		while (stripes_count--) {
> +			/*
> +			 * if entry without partial parity just skip its stripes
> +			 * without adding pages to bio
> +			 */
> +			if (pp_size > 0 &&
> +			    !bio_add_page(bio, sh->ppl_page, PAGE_SIZE, 0)) {
> +				struct bio *prev = bio;
> +
> +				bio = bio_alloc_bioset(GFP_NOIO, BIO_MAX_PAGES,
> +						       ppl_conf->bs);

The code keeps allocating bios but doesn't dispatch any yet. The bioset can't
guarantee the allocation successes. To have the guarantee, we must make sure
previous bios could finish.

> +				bio->bi_opf = prev->bi_opf;
> +				bio->bi_bdev = prev->bi_bdev;
> +				bio->bi_iter.bi_sector = bio_end_sector(prev);
> +				bio_add_page(bio, sh->ppl_page, PAGE_SIZE, 0);
> +				bio_chain(bio, prev);
> +				bio_list_add(&bios, bio);
> +			}
> +			sh = list_next_entry(sh, log_list);
> +		}
> +
> +		pr_debug("%s: seq: %llu entry: %d data_sector: %llu pp_size: %u data_size: %u\n",
> +			 __func__, io->seq, i, data_sector, pp_size, data_size);
> +
> +		e->data_sector = cpu_to_le64(data_sector >>
> +					     ilog2(ppl_conf->block_size >> 9));
> +		e->checksum = cpu_to_le32(~le32_to_cpu(e->checksum));
> +	}
> +
> +	pplhdr->entries_count = cpu_to_le32(io->entries_count);
> +	pplhdr->checksum = cpu_to_le32(~crc32c_le(~0, pplhdr, PPL_HEADER_SIZE));

so the checksum is ~crc32c, right? please doc in ppl_header_entry definition

> +
> +	while ((bio = bio_list_pop(&bios))) {
> +		pr_debug("%s: seq: %llu submit_bio() size: %u sector: %llu dev: %s\n",
> +			 __func__, io->seq, bio->bi_iter.bi_size,
> +			 (unsigned long long)bio->bi_iter.bi_sector,
> +			 bdevname(bio->bi_bdev, b));
> +		submit_bio(bio);
> +	}
> +}
> +
> +static void ppl_submit_current_io(struct ppl_log *log)
> +{
> +	struct ppl_io_unit *io;
> +
> +	spin_lock_irq(&log->io_list_lock);
> +
> +	io = list_first_entry_or_null(&log->io_list, struct ppl_io_unit,
> +				      log_sibling);
> +	if (io && io->submitted)
> +		io = NULL;
> +
> +	spin_unlock_irq(&log->io_list_lock);
> +
> +	if (io) {
> +		io->submitted = true;
> +
> +		if (io == log->current_io)
> +			log->current_io = NULL;
> +
> +		ppl_submit_iounit(io);
> +	}
> +}
> +
> +void ppl_write_stripe_run(struct ppl_conf *ppl_conf)
> +{
> +	struct ppl_log *log;
> +	int i;
> +
> +	for (i = 0; i < ppl_conf->count; i++) {
> +		log = &ppl_conf->child_logs[i];
> +
> +		mutex_lock(&log->io_mutex);
> +		ppl_submit_current_io(log);
> +		mutex_unlock(&log->io_mutex);
> +	}
> +}
> +
> +static void ppl_io_unit_finished(struct ppl_io_unit *io)
> +{
> +	struct ppl_log *log = io->log;
> +	unsigned long flags;
> +
> +	pr_debug("%s: seq: %llu\n", __func__, io->seq);
> +
> +	spin_lock_irqsave(&log->io_list_lock, flags);
> +
> +	list_del(&io->log_sibling);
> +	mempool_free(io, log->ppl_conf->io_pool);
> +
> +	if (!list_empty(&log->no_mem_stripes)) {
> +		struct stripe_head *sh = list_first_entry(&log->no_mem_stripes,
> +							  struct stripe_head,
> +							  log_list);
> +		list_del_init(&sh->log_list);
> +		set_bit(STRIPE_HANDLE, &sh->state);
> +		raid5_release_stripe(sh);
> +	}
> +
> +	spin_unlock_irqrestore(&log->io_list_lock, flags);
> +}
> +
> +void ppl_stripe_write_finished(struct stripe_head *sh)
> +{
> +	struct ppl_io_unit *io;
> +
> +	io = sh->ppl_log_io;
> +	sh->ppl_log_io = NULL;
> +
> +	if (io && atomic_dec_and_test(&io->pending_stripes))
> +		ppl_io_unit_finished(io);
> +}
> +
> +void ppl_exit_log(struct ppl_conf *ppl_conf)
> +{
> +	kfree(ppl_conf->child_logs);
> +
> +	mempool_destroy(ppl_conf->meta_pool);
> +	if (ppl_conf->bs)
> +		bioset_free(ppl_conf->bs);
> +	mempool_destroy(ppl_conf->io_pool);
> +	kmem_cache_destroy(ppl_conf->io_kc);
> +
> +	kfree(ppl_conf);
> +}
> +
> +static int ppl_validate_rdev(struct md_rdev *rdev)
> +{
> +	char b[BDEVNAME_SIZE];
> +	int ppl_data_sectors;
> +	int ppl_size_new;
> +
> +	/*
> +	 * The configured PPL size must be enough to store
> +	 * the header and (at the very least) partial parity
> +	 * for one stripe. Round it down to ensure the data
> +	 * space is cleanly divisible by stripe size.
> +	 */
> +	ppl_data_sectors = rdev->ppl.size - (PPL_HEADER_SIZE >> 9);
> +
> +	if (ppl_data_sectors > 0)
> +		ppl_data_sectors = rounddown(ppl_data_sectors, STRIPE_SECTORS);
> +
> +	if (ppl_data_sectors <= 0) {
> +		pr_warn("md/raid:%s: PPL space too small on %s\n",
> +			mdname(rdev->mddev), bdevname(rdev->bdev, b));
> +		return -ENOSPC;
> +	}
> +
> +	ppl_size_new = ppl_data_sectors + (PPL_HEADER_SIZE >> 9);
> +
> +	if ((rdev->ppl.sector < rdev->data_offset &&
> +	     rdev->ppl.sector + ppl_size_new > rdev->data_offset) ||
> +	    (rdev->ppl.sector >= rdev->data_offset &&
> +	     rdev->data_offset + rdev->sectors > rdev->ppl.sector)) {
> +		pr_warn("md/raid:%s: PPL space overlaps with data on %s\n",
> +			mdname(rdev->mddev), bdevname(rdev->bdev, b));
> +		return -EINVAL;
> +	}
> +
> +	if (!rdev->mddev->external &&
> +	    ((rdev->ppl.offset > 0 && rdev->ppl.offset < (rdev->sb_size >> 9)) ||
> +	     (rdev->ppl.offset <= 0 && rdev->ppl.offset + ppl_size_new > 0))) {
> +		pr_warn("md/raid:%s: PPL space overlaps with superblock on %s\n",
> +			mdname(rdev->mddev), bdevname(rdev->bdev, b));
> +		return -EINVAL;
> +	}
> +
> +	rdev->ppl.size = ppl_size_new;
> +
> +	return 0;
> +}
> +
> +int ppl_init_log(struct r5conf *conf)
> +{
> +	struct ppl_conf *ppl_conf;
> +	struct mddev *mddev = conf->mddev;
> +	int ret = 0;
> +	int i;
> +	bool need_cache_flush;
> +
> +	if (PAGE_SIZE != 4096)
> +		return -EINVAL;
> +
> +	if (mddev->bitmap_info.file || mddev->bitmap_info.offset) {
> +		pr_warn("md/raid:%s PPL is not compatible with bitmap\n",
> +			mdname(mddev));
> +		return -EINVAL;
> +	}
> +
> +	if (test_bit(MD_HAS_JOURNAL, &mddev->flags)) {
> +		pr_warn("md/raid:%s PPL is not compatible with journal\n",
> +			mdname(mddev));
> +		return -EINVAL;
> +	}

shouldn't these two checks be put into md.c, which we load the rdev?

> +
> +	ppl_conf = kzalloc(sizeof(struct ppl_conf), GFP_KERNEL);
> +	if (!ppl_conf)
> +		return -ENOMEM;
> +
> +	ppl_conf->io_kc = KMEM_CACHE(ppl_io_unit, 0);
> +	if (!ppl_conf->io_kc) {
> +		ret = -EINVAL;
> +		goto err;
> +	}
> +
> +	ppl_conf->io_pool = mempool_create_slab_pool(conf->raid_disks, ppl_conf->io_kc);
> +	if (!ppl_conf->io_pool) {
> +		ret = -EINVAL;
> +		goto err;
> +	}
> +
> +	ppl_conf->bs = bioset_create(conf->raid_disks, 0);
> +	if (!ppl_conf->bs) {
> +		ret = -EINVAL;
> +		goto err;
> +	}
> +
> +	ppl_conf->meta_pool = mempool_create_page_pool(conf->raid_disks, 0);
> +	if (!ppl_conf->meta_pool) {
> +		ret = -EINVAL;
> +		goto err;
> +	}
> +
> +	ppl_conf->mddev = mddev;
> +	ppl_conf->count = conf->raid_disks;
> +	ppl_conf->child_logs = kcalloc(ppl_conf->count, sizeof(struct ppl_log),
> +				       GFP_KERNEL);
> +	if (!ppl_conf->child_logs) {
> +		ret = -ENOMEM;
> +		goto err;
> +	}
> +
> +	atomic64_set(&ppl_conf->seq, 0);
> +
> +	if (!mddev->external) {
> +		ppl_conf->signature = ~crc32c_le(~0, mddev->uuid, sizeof(mddev->uuid));
> +		ppl_conf->block_size = 512;
> +	} else {
> +		ppl_conf->block_size = queue_logical_block_size(mddev->queue);
> +	}

Can we always set block_size 512? Don't really understand this part.

> +
> +	for (i = 0; i < ppl_conf->count; i++) {
> +		struct ppl_log *log = &ppl_conf->child_logs[i];
> +		struct md_rdev *rdev = conf->disks[i].rdev;
> +
> +		mutex_init(&log->io_mutex);
> +		spin_lock_init(&log->io_list_lock);
> +		INIT_LIST_HEAD(&log->io_list);
> +		INIT_LIST_HEAD(&log->no_mem_stripes);
> +
> +		log->ppl_conf = ppl_conf;
> +		log->rdev = rdev;
> +
> +		if (rdev) {
> +			struct request_queue *q;
> +
> +			ret = ppl_validate_rdev(rdev);
> +			if (ret)
> +				goto err;
> +
> +			q = bdev_get_queue(rdev->bdev);
> +			if (test_bit(QUEUE_FLAG_WC, &q->queue_flags))
> +				need_cache_flush = true;
> +		}
> +	}
> +
> +	if (need_cache_flush)
> +		pr_warn("md/raid:%s: Volatile write-back cache should be disabled on all member drives when using PPL!\n",
> +			mdname(mddev));
> +
> +	conf->ppl = ppl_conf;
> +
> +	return 0;
> +err:
> +	ppl_exit_log(ppl_conf);
> +	return ret;
> +}
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 02e02fe5b04e..21440b594878 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -739,7 +739,7 @@ static bool stripe_can_batch(struct stripe_head *sh)
>  {
>  	struct r5conf *conf = sh->raid_conf;
>  
> -	if (conf->log)
> +	if (conf->log || conf->ppl)
>  		return false;

Maybe add a new inline function into raid5-log.h

>  	return test_bit(STRIPE_BATCH_READY, &sh->state) &&
>  		!test_bit(STRIPE_BITMAP_PENDING, &sh->state) &&
> @@ -936,6 +936,9 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
>  		}
>  	}
>  
> +	if (conf->ppl && ppl_write_stripe(conf->ppl, sh) == 0)
> +		return;

please put the one and the above raid5-cache part into a separate function into raid5-log.h

> +
>  	for (i = disks; i--; ) {
>  		int op, op_flags = 0;
>  		int replace_only = 0;
> @@ -3308,6 +3311,16 @@ static void stripe_set_idx(sector_t stripe, struct r5conf *conf, int previous,
>  			     &dd_idx, sh);
>  }
>  
> +static void log_stripe_write_finished(struct stripe_head *sh)
> +{
> +	struct r5conf *conf = sh->raid_conf;
> +
> +	if (conf->log)
> +		r5l_stripe_write_finished(sh);
> +	else if (conf->ppl)
> +		ppl_stripe_write_finished(sh);
> +}

please put this into raid5-log.h

> +
>  static void
>  handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
>  				struct stripe_head_state *s, int disks,
> @@ -3347,7 +3360,7 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
>  		if (bi)
>  			bitmap_end = 1;
>  
> -		r5l_stripe_write_finished(sh);
> +		log_stripe_write_finished(sh);
>  
>  		if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
>  			wake_up(&conf->wait_for_overlap);
> @@ -3766,7 +3779,7 @@ static void handle_stripe_clean_event(struct r5conf *conf,
>  				discard_pending = 1;
>  		}
>  
> -	r5l_stripe_write_finished(sh);
> +	log_stripe_write_finished(sh);
>  
>  	if (!discard_pending &&
>  	    test_bit(R5_Discard, &sh->dev[sh->pd_idx].flags)) {
> @@ -4756,7 +4769,7 @@ static void handle_stripe(struct stripe_head *sh)
>  
>  	if (s.just_cached)
>  		r5c_handle_cached_data_endio(conf, sh, disks, &s.return_bi);
> -	r5l_stripe_write_finished(sh);
> +	log_stripe_write_finished(sh);
>  
>  	/* Now we might consider reading some blocks, either to check/generate
>  	 * parity, or to satisfy requests
> @@ -6120,6 +6133,14 @@ static int  retry_aligned_read(struct r5conf *conf, struct bio *raid_bio)
>  	return handled;
>  }
>  
> +static void log_write_stripe_run(struct r5conf *conf)
> +{
> +	if (conf->log)
> +		r5l_write_stripe_run(conf->log);
> +	else if (conf->ppl)
> +		ppl_write_stripe_run(conf->ppl);
> +}

ditto

> +
>  static int handle_active_stripes(struct r5conf *conf, int group,
>  				 struct r5worker *worker,
>  				 struct list_head *temp_inactive_list)
> @@ -6157,7 +6178,7 @@ static int handle_active_stripes(struct r5conf *conf, int group,
>  
>  	for (i = 0; i < batch_size; i++)
>  		handle_stripe(batch[i]);
> -	r5l_write_stripe_run(conf->log);
> +	log_write_stripe_run(conf);
>  
>  	cond_resched();
>  
> @@ -6735,6 +6756,8 @@ static void free_conf(struct r5conf *conf)
>  
>  	if (conf->log)
>  		r5l_exit_log(conf->log);
> +	if (conf->ppl)
> +		ppl_exit_log(conf->ppl);

Ditto

>  	if (conf->shrinker.nr_deferred)
>  		unregister_shrinker(&conf->shrinker);
>  
> @@ -7196,6 +7219,13 @@ static int raid5_run(struct mddev *mddev)
>  		BUG_ON(mddev->delta_disks != 0);
>  	}
>  
> +	if (test_bit(MD_HAS_JOURNAL, &mddev->flags) &&
> +	    test_bit(MD_HAS_PPL, &mddev->flags)) {
> +		pr_warn("md/raid:%s: using journal device and PPL not allowed - disabling PPL\n",
> +			mdname(mddev));
> +		clear_bit(MD_HAS_PPL, &mddev->flags);
> +	}
> +
>  	if (mddev->private == NULL)
>  		conf = setup_conf(mddev);
>  	else
> @@ -7422,6 +7452,11 @@ static int raid5_run(struct mddev *mddev)
>  			 mdname(mddev), bdevname(journal_dev->bdev, b));
>  		if (r5l_init_log(conf, journal_dev))
>  			goto abort;
> +	} else if (test_bit(MD_HAS_PPL, &mddev->flags)) {
> +		pr_debug("md/raid:%s: enabling distributed Partial Parity Log\n",
> +			 mdname(mddev));
> +		if (ppl_init_log(conf))
> +			goto abort;
>  	}

Ditto, if possible

>  
>  	return 0;
> @@ -7690,7 +7725,7 @@ static int raid5_resize(struct mddev *mddev, sector_t sectors)
>  	sector_t newsize;
>  	struct r5conf *conf = mddev->private;
>  
> -	if (conf->log)
> +	if (conf->log || conf->ppl)
>  		return -EINVAL;
>  	sectors &= ~((sector_t)conf->chunk_sectors - 1);
>  	newsize = raid5_size(mddev, sectors, mddev->raid_disks);
> @@ -7743,7 +7778,7 @@ static int check_reshape(struct mddev *mddev)
>  {
>  	struct r5conf *conf = mddev->private;
>  
> -	if (conf->log)
> +	if (conf->log || conf->ppl)

there are other places we check conf->log. like stripe_can_batch(), I think the
ppl code can't hanel batch right now.

>  		return -EINVAL;
>  	if (mddev->delta_disks == 0 &&
>  	    mddev->new_layout == mddev->layout &&
> diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
> index 3cc4cb28f7e6..f915a7a0e752 100644
> --- a/drivers/md/raid5.h
> +++ b/drivers/md/raid5.h
> @@ -230,6 +230,7 @@ struct stripe_head {
>  	struct list_head	r5c; /* for r5c_cache->stripe_in_journal */
>  
>  	struct page		*ppl_page; /* partial parity of this stripe */
> +	struct ppl_io_unit	*ppl_log_io;

Maybe we should put the ppl fields and raid5 fields to a union

>  	/**
>  	 * struct stripe_operations
>  	 * @target - STRIPE_OP_COMPUTE_BLK target
> @@ -689,6 +690,7 @@ struct r5conf {
>  	int			group_cnt;
>  	int			worker_cnt_per_group;
>  	struct r5l_log		*log;
> +	struct ppl_conf		*ppl;

Maybe use void * log_private, so works for both raid5-cache and ppl

>  
>  	struct bio_list		pending_bios;
>  	spinlock_t		pending_bios_lock;
> @@ -798,4 +800,10 @@ extern void r5c_check_cached_full_stripe(struct r5conf *conf);
>  extern struct md_sysfs_entry r5c_journal_mode;
>  extern void r5c_update_on_rdev_error(struct mddev *mddev);
>  extern bool r5c_big_stripe_cached(struct r5conf *conf, sector_t sect);
> +
> +extern int ppl_init_log(struct r5conf *conf);
> +extern void ppl_exit_log(struct ppl_conf *log);
> +extern int ppl_write_stripe(struct ppl_conf *log, struct stripe_head *sh);
> +extern void ppl_write_stripe_run(struct ppl_conf *log);
> +extern void ppl_stripe_write_finished(struct stripe_head *sh);

These part and the raid5 part should be put into raid5-log.h

>  #endif
> diff --git a/include/uapi/linux/raid/md_p.h b/include/uapi/linux/raid/md_p.h
> index fe2112810c43..2c28711cc5f1 100644
> --- a/include/uapi/linux/raid/md_p.h
> +++ b/include/uapi/linux/raid/md_p.h
> @@ -398,4 +398,30 @@ struct r5l_meta_block {
>  
>  #define R5LOG_VERSION 0x1
>  #define R5LOG_MAGIC 0x6433c509
> +
> +struct ppl_header_entry {
> +	__le64 data_sector;	/* Raid sector of the new data */
> +	__le32 pp_size;		/* Length of partial parity */
> +	__le32 data_size;	/* Length of data */
> +	__le32 parity_disk;	/* Member disk containing parity */
> +	__le32 checksum;	/* Checksum of this entry */

So we changed the disk format. Will this new format be put into IMSM standard?
Do we have a version to verify the old/new format for IMSM?

Thanks,
Shaohua

^ permalink raw reply

* Re: [PATCH v4 4/7] md: add sysfs entries for PPL
From: Shaohua Li @ 2017-02-28  0:37 UTC (permalink / raw)
  To: Artur Paszkiewicz; +Cc: shli, linux-raid
In-Reply-To: <20170221194401.18733-5-artur.paszkiewicz@intel.com>

On Tue, Feb 21, 2017 at 08:43:58PM +0100, Artur Paszkiewicz wrote:
> Add 'consistency_policy' attribute for array. It indicates how the array
> maintains consistency in case of unexpected shutdown.
> 
> Add 'ppl_sector' and 'ppl_size' for rdev, which describe the location
> and size of the PPL space on the device. They can't be changed for
> active members if the array is started and PPL is enabled, so in the
> setter functions only basic checks are performed. More checks are done
> in ppl_validate_rdev() when starting the log.
> 
> These attributes are writable to allow enabling PPL for external
> metadata arrays and (later) to enable/disable PPL for a running array.
> 
> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> ---
>  Documentation/admin-guide/md.rst |  32 ++++++++++-
>  drivers/md/md.c                  | 115 +++++++++++++++++++++++++++++++++++++++
>  2 files changed, 144 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/admin-guide/md.rst b/Documentation/admin-guide/md.rst
> index 1e61bf50595c..84de718f24a4 100644
> --- a/Documentation/admin-guide/md.rst
> +++ b/Documentation/admin-guide/md.rst
> @@ -276,14 +276,14 @@ All md devices contain:
>       array creation it will default to 0, though starting the array as
>       ``clean`` will set it much larger.
>  
> -   new_dev
> +  new_dev
>       This file can be written but not read.  The value written should
>       be a block device number as major:minor.  e.g. 8:0
>       This will cause that device to be attached to the array, if it is
>       available.  It will then appear at md/dev-XXX (depending on the
>       name of the device) and further configuration is then possible.
>  
> -   safe_mode_delay
> +  safe_mode_delay
>       When an md array has seen no write requests for a certain period
>       of time, it will be marked as ``clean``.  When another write
>       request arrives, the array is marked as ``dirty`` before the write
> @@ -292,7 +292,7 @@ All md devices contain:
>       period as a number of seconds.  The default is 200msec (0.200).
>       Writing a value of 0 disables safemode.
>  
> -   array_state
> +  array_state
>       This file contains a single word which describes the current
>       state of the array.  In many cases, the state can be set by
>       writing the word for the desired state, however some states
> @@ -401,7 +401,30 @@ All md devices contain:
>       once the array becomes non-degraded, and this fact has been
>       recorded in the metadata.
>  
> +  consistency_policy
> +     This indicates how the array maintains consistency in case of unexpected
> +     shutdown. It can be:
>  
> +     none
> +       Array has no redundancy information, e.g. raid0, linear.
> +
> +     resync
> +       Full resync is performed and all redundancy is regenerated when the
> +       array is started after unclean shutdown.
> +
> +     bitmap
> +       Resync assisted by a write-intent bitmap.
> +
> +     journal
> +       For raid4/5/6, journal device is used to log transactions and replay
> +       after unclean shutdown.
> +
> +     ppl
> +       For raid5 only, Partial Parity Log is used to close the write hole and
> +       eliminate resync.
> +
> +     The accepted values when writing to this file are ``ppl`` and ``resync``,
> +     used to enable and disable PPL.
>  
>  
>  As component devices are added to an md array, they appear in the ``md``
> @@ -563,6 +586,9 @@ Each directory contains:
>  	adds bad blocks without acknowledging them. This is largely
>  	for testing.
>  
> +      ppl_sector, ppl_size
> +        Location and size (in sectors) of the space used for Partial Parity Log
> +        on this device.
>  
>  
>  An active md device will also contain an entry for each active device
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index c2028007b209..3ff979d538d4 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -3157,6 +3157,78 @@ static ssize_t ubb_store(struct md_rdev *rdev, const char *page, size_t len)
>  static struct rdev_sysfs_entry rdev_unack_bad_blocks =
>  __ATTR(unacknowledged_bad_blocks, S_IRUGO|S_IWUSR, ubb_show, ubb_store);
>  
> +static ssize_t
> +ppl_sector_show(struct md_rdev *rdev, char *page)
> +{
> +	return sprintf(page, "%llu\n", (unsigned long long)rdev->ppl.sector);
> +}
> +
> +static ssize_t
> +ppl_sector_store(struct md_rdev *rdev, const char *buf, size_t len)
> +{
> +	unsigned long long sector;
> +
> +	if (kstrtoull(buf, 10, &sector) < 0)
> +		return -EINVAL;

Maybe use base 0 to be more friendly

> +	if (sector != (sector_t)sector)
> +		return -EINVAL;
> +
> +	if (rdev->mddev->pers && test_bit(MD_HAS_PPL, &rdev->mddev->flags) &&
> +	    rdev->raid_disk >= 0)
> +		return -EBUSY;
> +
> +	if (rdev->mddev->persistent) {
> +		if (rdev->mddev->major_version == 0)
> +			return -EINVAL;
> +		if ((sector > rdev->sb_start &&
> +		     sector - rdev->sb_start > S16_MAX) ||
> +		    (sector < rdev->sb_start &&
> +		     rdev->sb_start - sector > -S16_MIN))
> +			return -EINVAL;

Don't check if the address overlaps with data?

> +		rdev->ppl.offset = sector - rdev->sb_start;
> +	} else if (!rdev->mddev->external) {
> +		return -EBUSY;
> +	}

generally we don't use {} for single line code

Thanks,
Shaohua

^ permalink raw reply

* Re: Process stuck in md_flush_request (state: D)
From: Shaohua Li @ 2017-02-28  0:44 UTC (permalink / raw)
  To: Les Stroud; +Cc: linux-raid
In-Reply-To: <ED1913CC-96EF-4A25-A1F3-E664396A2710@lesstroud.com>

On Mon, Feb 27, 2017 at 01:48:00PM -0500, Les Stroud wrote:
> 
> 
> 
> 
> > On Feb 27, 2017, at 1:28 PM, Shaohua Li <shli@kernel.org> wrote:
> > 
> > On Mon, Feb 27, 2017 at 09:49:59AM -0500, Les Stroud wrote:
> >> After a period of a couple of weeks with one of our test instances having this problem every other day, they were all nice enough to operate without an issue for 9 days.  It finally reoccurred last night on one of the machines.  
> >> 
> >> It exhibits the same symptoms and the call traces look as they did previously.  This particular instance is configured with a deadline scheduler.  I was able to capture the inflight you requested:
> >> 
> >> $ cat /sys/block/xvd[abcde]/inflight
> >>        0        0
> >>        0        0
> >>        0        0
> >>        0        0
> >>        0        0
> >> 
> >> I’ve had this happen on instances with the deadline scheduler and the noop scheduler.  At this point, I have not had this happen on an instance that is noop and the raid filesystem (ext4) is mounted with nobarrier.  The instances with noop/nobarrier have not been running long enough for me to make any sort of conclusion that it works around the problem. Frankly, I’m not sure I understand the interaction between ext4 barriers and raid0 block flushes well enough to theorize whether it should or shouldn’t make a difference.
> > 
> > If nobarrier, ext4 doesn't send flush request.
> 
> So, could ext4’s flush request deadlock with an md_flush_request?  Do they share a mutex of some sort? Could one of them be failing to acquire a mutex and not handling it?

No, it shouldn't deadlock. I don't have other reports for such issue. Yours are the only one.
 
> > 
> >> Does any of this help with identifying the bug?  Is there anymore information I can get that would be useful?  
> > 
> > 
> > Unfortunately I can't find anything fishing. Does the xcdx disk correctly
> > handle flush request? For example, you can do the same test with a single such
> > disk and check if anything wrong.
> 
> Until recently, we had a number of these systems setup without raid0.  This issue never occurred on those systems.  Unfortunately, I can’t find a way to make it happen other than stand a server up and let it run.
> 
> I suppose I could try a different filesystem and see if that makes a difference (maybe ext3, xfs, etc).

You could format a xcdx disk and do a test against it, and check if there is
anything wrong. To be honest, I don't think it's a problme in ext4 side too,
but better try other filesystems. If the xcdx is a proprietory driver, I highly
recommend a check with a single such disk first.

Thanks,
Shaohua

^ permalink raw reply

* Re: [PATCH v4 5/7] raid5-ppl: load and recover the log
From: Shaohua Li @ 2017-02-28  1:12 UTC (permalink / raw)
  To: Artur Paszkiewicz; +Cc: shli, linux-raid
In-Reply-To: <20170221194401.18733-6-artur.paszkiewicz@intel.com>

On Tue, Feb 21, 2017 at 08:43:59PM +0100, Artur Paszkiewicz wrote:
> Load the log from each disk when starting the array and recover if the
> array is dirty.
> 
> The initial empty PPL is written by mdadm. When loading the log we
> verify the header checksum and signature. For external metadata arrays
> the signature is verified in userspace, so here we read it from the
> header, verifying only if it matches on all disks, and use it later when
> writing PPL.
> 
> In addition to the header checksum, each header entry also contains a
> checksum of its partial parity data. If the header is valid, recovery is
> performed for each entry until an invalid entry is found. If the array
> is not degraded and recovery using PPL fully succeeds, there is no need
> to resync the array because data and parity will be consistent, so in
> this case resync will be disabled.
> 
> Due to compatibility with IMSM implementations on other systems, we
> can't assume that the recovery data block size is always 4K. Writes
> generated by MD raid5 don't have this issue, but when recovering PPL
> written in other environments it is possible to have entries with
> 512-byte sector granularity. The recovery code takes this into account
> and also the logical sector size of the underlying drives.
> 
> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> ---
>  drivers/md/raid5-ppl.c | 483 +++++++++++++++++++++++++++++++++++++++++++++++++
>  drivers/md/raid5.c     |   5 +-
>  2 files changed, 487 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
> index a00cabf1adf6..a7693353243a 100644
> --- a/drivers/md/raid5-ppl.c
> +++ b/drivers/md/raid5-ppl.c
> @@ -16,6 +16,7 @@
>  #include <linux/blkdev.h>
>  #include <linux/slab.h>
>  #include <linux/crc32c.h>
> +#include <linux/async_tx.h>
>  #include <linux/raid/md_p.h>
>  #include "md.h"
>  #include "raid5.h"
> @@ -84,6 +85,10 @@ struct ppl_conf {
>  	mempool_t *io_pool;
>  	struct bio_set *bs;
>  	mempool_t *meta_pool;
> +
> +	/* used only for recovery */
> +	int recovered_entries;
> +	int mismatch_count;
>  };
>  
>  struct ppl_log {
> @@ -450,6 +455,467 @@ void ppl_stripe_write_finished(struct stripe_head *sh)
>  		ppl_io_unit_finished(io);
>  }
>  
> +static void ppl_xor(int size, struct page *page1, struct page *page2,
> +		    struct page *page_result)
> +{
> +	struct async_submit_ctl submit;
> +	struct dma_async_tx_descriptor *tx;
> +	struct page *xor_srcs[] = { page1, page2 };
> +
> +	init_async_submit(&submit, ASYNC_TX_ACK|ASYNC_TX_XOR_DROP_DST,
> +			  NULL, NULL, NULL, NULL);
> +	tx = async_xor(page_result, xor_srcs, 0, 2, size, &submit);
> +
> +	async_tx_quiesce(&tx);
> +}
> +
> +/*
> + * PPL recovery strategy: xor partial parity and data from all modified data
> + * disks within a stripe and write the result as the new stripe parity. If all
> + * stripe data disks are modified (full stripe write), no partial parity is
> + * available, so just xor the data disks.
> + *
> + * Recovery of a PPL entry shall occur only if all modified data disks are
> + * available and read from all of them succeeds.
> + *
> + * A PPL entry applies to a stripe, partial parity size for an entry is at most
> + * the size of the chunk. Examples of possible cases for a single entry:
> + *
> + * case 0: single data disk write:
> + *   data0    data1    data2     ppl        parity
> + * +--------+--------+--------+           +--------------------+
> + * | ------ | ------ | ------ | +----+    | (no change)        |
> + * | ------ | -data- | ------ | | pp | -> | data1 ^ pp         |
> + * | ------ | -data- | ------ | | pp | -> | data1 ^ pp         |
> + * | ------ | ------ | ------ | +----+    | (no change)        |
> + * +--------+--------+--------+           +--------------------+
> + * pp_size = data_size
> + *
> + * case 1: more than one data disk write:
> + *   data0    data1    data2     ppl        parity
> + * +--------+--------+--------+           +--------------------+
> + * | ------ | ------ | ------ | +----+    | (no change)        |
> + * | -data- | -data- | ------ | | pp | -> | data0 ^ data1 ^ pp |
> + * | -data- | -data- | ------ | | pp | -> | data0 ^ data1 ^ pp |
> + * | ------ | ------ | ------ | +----+    | (no change)        |
> + * +--------+--------+--------+           +--------------------+
> + * pp_size = data_size / modified_data_disks
> + *
> + * case 2: write to all data disks (also full stripe write):
> + *   data0    data1    data2                parity
> + * +--------+--------+--------+           +--------------------+
> + * | ------ | ------ | ------ |           | (no change)        |
> + * | -data- | -data- | -data- | --------> | xor all data       |
> + * | ------ | ------ | ------ | --------> | (no change)        |
> + * | ------ | ------ | ------ |           | (no change)        |
> + * +--------+--------+--------+           +--------------------+
> + * pp_size = 0
> + *
> + * The following cases are possible only in other implementations. The recovery
> + * code can handle them, but they are not generated at runtime because they can
> + * be reduced to cases 0, 1 and 2:
> + *
> + * case 3:
> + *   data0    data1    data2     ppl        parity
> + * +--------+--------+--------+ +----+    +--------------------+
> + * | ------ | -data- | -data- | | pp |    | data1 ^ data2 ^ pp |
> + * | ------ | -data- | -data- | | pp | -> | data1 ^ data2 ^ pp |
> + * | -data- | -data- | -data- | | -- | -> | xor all data       |
> + * | -data- | -data- | ------ | | pp |    | data0 ^ data1 ^ pp |
> + * +--------+--------+--------+ +----+    +--------------------+
> + * pp_size = chunk_size
> + *
> + * case 4:
> + *   data0    data1    data2     ppl        parity
> + * +--------+--------+--------+ +----+    +--------------------+
> + * | ------ | -data- | ------ | | pp |    | data1 ^ pp         |
> + * | ------ | ------ | ------ | | -- | -> | (no change)        |
> + * | ------ | ------ | ------ | | -- | -> | (no change)        |
> + * | -data- | ------ | ------ | | pp |    | data0 ^ pp         |
> + * +--------+--------+--------+ +----+    +--------------------+
> + * pp_size = chunk_size
> + */
> +static int ppl_recover_entry(struct ppl_log *log, struct ppl_header_entry *e,
> +			     sector_t ppl_sector)
> +{
> +	struct ppl_conf *ppl_conf = log->ppl_conf;
> +	struct mddev *mddev = ppl_conf->mddev;
> +	struct r5conf *conf = mddev->private;
> +	int block_size = ppl_conf->block_size;
> +	struct page *pages;
> +	struct page *page1;
> +	struct page *page2;
> +	sector_t r_sector_first;
> +	sector_t r_sector_last;
> +	int strip_sectors;
> +	int data_disks;
> +	int i;
> +	int ret = 0;
> +	char b[BDEVNAME_SIZE];
> +	unsigned int pp_size = le32_to_cpu(e->pp_size);
> +	unsigned int data_size = le32_to_cpu(e->data_size);
> +
> +	r_sector_first = le64_to_cpu(e->data_sector) * (block_size >> 9);
> +
> +	if ((pp_size >> 9) < conf->chunk_sectors) {
> +		if (pp_size > 0) {
> +			data_disks = data_size / pp_size;
> +			strip_sectors = pp_size >> 9;
> +		} else {
> +			data_disks = conf->raid_disks - conf->max_degraded;
> +			strip_sectors = (data_size >> 9) / data_disks;
> +		}
> +		r_sector_last = r_sector_first +
> +				(data_disks - 1) * conf->chunk_sectors +
> +				strip_sectors;
> +	} else {
> +		data_disks = conf->raid_disks - conf->max_degraded;
> +		strip_sectors = conf->chunk_sectors;
> +		r_sector_last = r_sector_first + (data_size >> 9);
> +	}
> +
> +	pages = alloc_pages(GFP_KERNEL, 1);

order != 0 allocation should be avoid if possible. Here sounds not necessary

> +	if (!pages)
> +		return -ENOMEM;
> +	page1 = pages;
> +	page2 = pages + 1;
> +
> +	pr_debug("%s: array sector first: %llu last: %llu\n", __func__,
> +		 (unsigned long long)r_sector_first,
> +		 (unsigned long long)r_sector_last);
> +
> +	/* if start and end is 4k aligned, use a 4k block */
> +	if (block_size == 512 &&
> +	    (r_sector_first & (STRIPE_SECTORS - 1)) == 0 &&
> +	    (r_sector_last & (STRIPE_SECTORS - 1)) == 0)
> +		block_size = STRIPE_SIZE;
> +
> +	/* iterate through blocks in strip */
> +	for (i = 0; i < strip_sectors; i += (block_size >> 9)) {
> +		bool update_parity = false;
> +		sector_t parity_sector;
> +		struct md_rdev *parity_rdev;
> +		struct stripe_head sh;
> +		int disk;
> +		int indent = 0;
> +
> +		pr_debug("%s:%*s iter %d start\n", __func__, indent, "", i);
> +		indent += 2;
> +
> +		memset(page_address(page1), 0, PAGE_SIZE);
> +
> +		/* iterate through data member disks */
> +		for (disk = 0; disk < data_disks; disk++) {
> +			int dd_idx;
> +			struct md_rdev *rdev;
> +			sector_t sector;
> +			sector_t r_sector = r_sector_first + i +
> +					    (disk * conf->chunk_sectors);
> +
> +			pr_debug("%s:%*s data member disk %d start\n",
> +				 __func__, indent, "", disk);
> +			indent += 2;
> +
> +			if (r_sector >= r_sector_last) {
> +				pr_debug("%s:%*s array sector %llu doesn't need parity update\n",
> +					 __func__, indent, "",
> +					 (unsigned long long)r_sector);
> +				indent -= 2;
> +				continue;
> +			}
> +
> +			update_parity = true;
> +
> +			/* map raid sector to member disk */
> +			sector = raid5_compute_sector(conf, r_sector, 0,
> +						      &dd_idx, NULL);
> +			pr_debug("%s:%*s processing array sector %llu => data member disk %d, sector %llu\n",
> +				 __func__, indent, "",
> +				 (unsigned long long)r_sector, dd_idx,
> +				 (unsigned long long)sector);
> +
> +			rdev = conf->disks[dd_idx].rdev;
> +			if (!rdev) {
> +				pr_debug("%s:%*s data member disk %d missing\n",
> +					 __func__, indent, "", dd_idx);
> +				update_parity = false;
> +				break;
> +			}
> +
> +			pr_debug("%s:%*s reading data member disk %s sector %llu\n",
> +				 __func__, indent, "", bdevname(rdev->bdev, b),
> +				 (unsigned long long)sector);
> +			if (!sync_page_io(rdev, sector, block_size, page2,
> +					REQ_OP_READ, 0, false)) {
> +				md_error(mddev, rdev);
> +				pr_debug("%s:%*s read failed!\n", __func__,
> +					 indent, "");
> +				ret = -EIO;
> +				goto out;
> +			}
> +
> +			ppl_xor(block_size, page1, page2, page1);
> +
> +			indent -= 2;
> +		}
> +
> +		if (!update_parity)
> +			continue;
> +
> +		if (pp_size > 0) {
> +			pr_debug("%s:%*s reading pp disk sector %llu\n",
> +				 __func__, indent, "",
> +				 (unsigned long long)(ppl_sector + i));
> +			if (!sync_page_io(log->rdev,
> +					ppl_sector - log->rdev->data_offset + i,
> +					block_size, page2, REQ_OP_READ, 0,
> +					false)) {
> +				pr_debug("%s:%*s read failed!\n", __func__,
> +					 indent, "");
> +				md_error(mddev, log->rdev);
> +				ret = -EIO;
> +				goto out;
> +			}
> +
> +			ppl_xor(block_size, page1, page2, page1);
> +		}
> +
> +		/* map raid sector to parity disk */
> +		parity_sector = raid5_compute_sector(conf, r_sector_first + i,
> +				0, &disk, &sh);
> +		BUG_ON(sh.pd_idx != le32_to_cpu(e->parity_disk));
> +		parity_rdev = conf->disks[sh.pd_idx].rdev;
> +
> +		BUG_ON(parity_rdev->bdev->bd_dev != log->rdev->bdev->bd_dev);
> +		pr_debug("%s:%*s write parity at sector %llu, disk %s\n",
> +			 __func__, indent, "",
> +			 (unsigned long long)parity_sector,
> +			 bdevname(parity_rdev->bdev, b));
> +		if (!sync_page_io(parity_rdev, parity_sector, block_size,
> +				page1, REQ_OP_WRITE, 0, false)) {
> +			pr_debug("%s:%*s parity write error!\n", __func__,
> +				 indent, "");
> +			md_error(mddev, parity_rdev);
> +			ret = -EIO;
> +			goto out;
> +		}
> +	}
> +out:
> +	__free_pages(pages, 1);
> +	return ret;
> +}
> +
> +static int ppl_recover(struct ppl_log *log, struct ppl_header *pplhdr)
> +{
> +	struct ppl_conf *ppl_conf = log->ppl_conf;
> +	struct md_rdev *rdev = log->rdev;
> +	struct mddev *mddev = rdev->mddev;
> +	sector_t ppl_sector = rdev->ppl.sector + (PPL_HEADER_SIZE >> 9);
> +	struct page *page;
> +	int i;
> +	int ret = 0;
> +
> +	page = alloc_page(GFP_KERNEL);
> +	if (!page)
> +		return -ENOMEM;
> +
> +	/* iterate through all PPL entries saved */
> +	for (i = 0; i < le32_to_cpu(pplhdr->entries_count); i++) {
> +		struct ppl_header_entry *e = &pplhdr->entries[i];
> +		u32 pp_size = le32_to_cpu(e->pp_size);
> +		sector_t sector = ppl_sector;
> +		int ppl_entry_sectors = pp_size >> 9;
> +		u32 crc, crc_stored;
> +
> +		pr_debug("%s: disk: %d entry: %d ppl_sector: %llu pp_size: %u\n",
> +			 __func__, rdev->raid_disk, i,
> +			 (unsigned long long)ppl_sector, pp_size);
> +
> +		crc = ~0;
> +		crc_stored = le32_to_cpu(e->checksum);
> +
> +		/* read parial parity for this entry and calculate its checksum */
> +		while (pp_size) {
> +			int s = pp_size > PAGE_SIZE ? PAGE_SIZE : pp_size;
> +
> +			if (!sync_page_io(rdev, sector - rdev->data_offset,
> +					s, page, REQ_OP_READ, 0, false)) {
> +				md_error(mddev, rdev);
> +				ret = -EIO;
> +				goto out;
> +			}
> +
> +			crc = crc32c_le(crc, page_address(page), s);
> +
> +			pp_size -= s;
> +			sector += s >> 9;
> +		}
> +
> +		crc = ~crc;
> +
> +		if (crc != crc_stored) {
> +			/*
> +			 * Don't recover this entry if the checksum does not
> +			 * match, but keep going and try to recover other
> +			 * entries.
> +			 */
> +			pr_debug("%s: ppl entry crc does not match: stored: 0x%x calculated: 0x%x\n",
> +				 __func__, crc_stored, crc);
> +			ppl_conf->mismatch_count++;
> +		} else {
> +			ret = ppl_recover_entry(log, e, ppl_sector);
> +			if (ret)
> +				goto out;
> +			ppl_conf->recovered_entries++;
> +		}
> +
> +		ppl_sector += ppl_entry_sectors;
> +	}
> +out:
> +	__free_page(page);
> +	return ret;
> +}
> +
> +static int ppl_write_empty_header(struct ppl_log *log)
> +{
> +	struct page *page;
> +	struct ppl_header *pplhdr;
> +	struct md_rdev *rdev = log->rdev;
> +	int ret = 0;
> +
> +	pr_debug("%s: disk: %d ppl_sector: %llu\n", __func__,
> +		 rdev->raid_disk, (unsigned long long)rdev->ppl.sector);
> +
> +	page = alloc_page(GFP_NOIO | __GFP_ZERO);
> +	if (!page)
> +		return -ENOMEM;
> +
> +	pplhdr = page_address(page);
> +	memset(pplhdr->reserved, 0xff, PPL_HDR_RESERVED);
> +	pplhdr->signature = cpu_to_le32(log->ppl_conf->signature);
> +	pplhdr->checksum = cpu_to_le32(~crc32c_le(~0, pplhdr, PAGE_SIZE));
> +
> +	if (!sync_page_io(rdev, rdev->ppl.sector - rdev->data_offset,
> +			  PPL_HEADER_SIZE, page, REQ_OP_WRITE, 0, false)) {

write with FUA? otherwise, the empty header might not set down in the disks.

> +		md_error(rdev->mddev, rdev);
> +		ret = -EIO;
> +	}
> +
> +	__free_page(page);
> +	return ret;
> +}
> +
> +static int ppl_load_distributed(struct ppl_log *log)
> +{
> +	struct ppl_conf *ppl_conf = log->ppl_conf;
> +	struct md_rdev *rdev = log->rdev;
> +	struct mddev *mddev = rdev->mddev;
> +	struct page *page;
> +	struct ppl_header *pplhdr;
> +	u32 crc, crc_stored;
> +	u32 signature;
> +	int ret = 0;
> +
> +	pr_debug("%s: disk: %d\n", __func__, rdev->raid_disk);
> +
> +	/* read PPL header */
> +	page = alloc_page(GFP_KERNEL);
> +	if (!page)
> +		return -ENOMEM;
> +
> +	if (!sync_page_io(rdev, rdev->ppl.sector - rdev->data_offset,
> +			  PAGE_SIZE, page, REQ_OP_READ, 0, false)) {
> +		md_error(mddev, rdev);
> +		ret = -EIO;
> +		goto out;
> +	}
> +	pplhdr = page_address(page);
> +
> +	/* check header validity */
> +	crc_stored = le32_to_cpu(pplhdr->checksum);
> +	pplhdr->checksum = 0;
> +	crc = ~crc32c_le(~0, pplhdr, PAGE_SIZE);
> +
> +	if (crc_stored != crc) {
> +		pr_debug("%s: ppl header crc does not match: stored: 0x%x calculated: 0x%x\n",
> +			 __func__, crc_stored, crc);
> +		ppl_conf->mismatch_count++;
> +		goto out;
> +	}
> +
> +	signature = le32_to_cpu(pplhdr->signature);
> +
> +	if (mddev->external) {
> +		/*
> +		 * For external metadata the header signature is set and
> +		 * validated in userspace.
> +		 */
> +		ppl_conf->signature = signature;
> +	} else if (ppl_conf->signature != signature) {
> +		pr_debug("%s: ppl header signature does not match: stored: 0x%x configured: 0x%x\n",
> +			 __func__, signature, ppl_conf->signature);
> +		ppl_conf->mismatch_count++;
> +		goto out;
> +	}
> +
> +	/* attempt to recover from log if we are starting a dirty array */
> +	if (!mddev->pers && mddev->recovery_cp != MaxSector)
> +		ret = ppl_recover(log, pplhdr);

when could mddev->pers be null?

we need flush disk cache here. I know you assume disks don't enable cache yet,
but it's simple to do a flush here.

Thanks,
Shaohua

^ permalink raw reply

* Re: [PATCH v4 6/7] raid5-ppl: support disk hot add/remove with PPL
From: Shaohua Li @ 2017-02-28  1:22 UTC (permalink / raw)
  To: Artur Paszkiewicz; +Cc: shli, linux-raid
In-Reply-To: <20170221194401.18733-7-artur.paszkiewicz@intel.com>

On Tue, Feb 21, 2017 at 08:44:00PM +0100, Artur Paszkiewicz wrote:
> Add a function to modify the log by removing an rdev when a drive fails
> or adding when a spare/replacement is activated as a raid member.
> 
> Removing a disk just clears the child log rdev pointer. No new stripes
> will be accepted for this child log in ppl_write_stripe() and running io
> units will be processed without writing PPL to the device.
> 
> Adding a disk sets the child log rdev pointer and writes an empty PPL
> header.
> 
> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> ---
>  drivers/md/raid5-ppl.c | 47 +++++++++++++++++++++++++++++++++++++++++++++++
>  drivers/md/raid5.c     | 15 +++++++++++++++
>  drivers/md/raid5.h     |  8 ++++++++
>  3 files changed, 70 insertions(+)
> 
> diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
> index a7693353243a..c2070d124849 100644
> --- a/drivers/md/raid5-ppl.c
> +++ b/drivers/md/raid5-ppl.c
> @@ -319,6 +319,12 @@ static void ppl_submit_iounit(struct ppl_io_unit *io)
>  
>  	bio = bio_alloc_bioset(GFP_NOIO, BIO_MAX_PAGES, ppl_conf->bs);
>  	bio->bi_private = io;
> +
> +	if (!log->rdev || test_bit(Faulty, &log->rdev->flags)) {
> +		ppl_log_endio(bio);
> +		return;
> +	}
> +
>  	bio->bi_end_io = ppl_log_endio;
>  	bio->bi_opf = REQ_OP_WRITE | REQ_FUA;
>  	bio->bi_bdev = log->rdev->bdev;
> @@ -1098,3 +1104,44 @@ int ppl_init_log(struct r5conf *conf)
>  	ppl_exit_log(ppl_conf);
>  	return ret;
>  }
> +
> +int ppl_modify_log(struct ppl_conf *ppl_conf, struct md_rdev *rdev,
> +		   enum ppl_modify_log_operation operation)
> +{
> +	struct ppl_log *log;
> +	int ret = 0;
> +	char b[BDEVNAME_SIZE];
> +
> +	if (!rdev)
> +		return -EINVAL;
> +
> +	pr_debug("%s: disk: %d operation: %s dev: %s\n",
> +		 __func__, rdev->raid_disk,
> +		 operation == PPL_MODIFY_LOG_DISK_REMOVE ? "remove" :
> +		 (operation == PPL_MODIFY_LOG_DISK_ADD ? "add" : "?"),
> +		 bdevname(rdev->bdev, b));
> +
> +	if (rdev->raid_disk < 0)
> +		return 0;
> +
> +	if (rdev->raid_disk >= ppl_conf->count)
> +		return -ENODEV;
> +
> +	log = &ppl_conf->child_logs[rdev->raid_disk];
> +
> +	mutex_lock(&log->io_mutex);
> +	if (operation == PPL_MODIFY_LOG_DISK_REMOVE) {
> +		log->rdev = NULL;
> +	} else if (operation == PPL_MODIFY_LOG_DISK_ADD) {
> +		ret = ppl_validate_rdev(rdev);
> +		if (!ret) {
> +			log->rdev = rdev;
> +			ret = ppl_write_empty_header(log);
> +		}
> +	} else {
> +		ret = -EINVAL;
> +	}
> +	mutex_unlock(&log->io_mutex);
> +
> +	return ret;
> +}
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 8b52392457d8..b17e90f06f19 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -7623,6 +7623,12 @@ static int raid5_remove_disk(struct mddev *mddev, struct md_rdev *rdev)
>  			*rdevp = rdev;
>  		}
>  	}
> +	if (conf->ppl) {
> +		err = ppl_modify_log(conf->ppl, rdev,
> +				     PPL_MODIFY_LOG_DISK_REMOVE);
> +		if (err)
> +			goto abort;
> +	}
>  	if (p->replacement) {
>  		/* We must have just cleared 'rdev' */
>  		p->rdev = p->replacement;
> @@ -7632,6 +7638,10 @@ static int raid5_remove_disk(struct mddev *mddev, struct md_rdev *rdev)
>  			   */
>  		p->replacement = NULL;
>  		clear_bit(WantReplacement, &rdev->flags);
> +
> +		if (conf->ppl)
> +			err = ppl_modify_log(conf->ppl, p->rdev,
> +					     PPL_MODIFY_LOG_DISK_ADD);
>  	} else
>  		/* We might have just removed the Replacement as faulty-
>  		 * clear the bit just in case
> @@ -7695,6 +7705,11 @@ static int raid5_add_disk(struct mddev *mddev, struct md_rdev *rdev)
>  			if (rdev->saved_raid_disk != disk)
>  				conf->fullsync = 1;
>  			rcu_assign_pointer(p->rdev, rdev);
> +
> +			if (conf->ppl)
> +				err = ppl_modify_log(conf->ppl, rdev,
> +						     PPL_MODIFY_LOG_DISK_ADD);
> +
>  			goto out;
>  		}
>  	}
> diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
> index f915a7a0e752..43cfa5fa71b3 100644
> --- a/drivers/md/raid5.h
> +++ b/drivers/md/raid5.h
> @@ -806,4 +806,12 @@ extern void ppl_exit_log(struct ppl_conf *log);
>  extern int ppl_write_stripe(struct ppl_conf *log, struct stripe_head *sh);
>  extern void ppl_write_stripe_run(struct ppl_conf *log);
>  extern void ppl_stripe_write_finished(struct stripe_head *sh);
> +
> +enum ppl_modify_log_operation {
> +	PPL_MODIFY_LOG_DISK_REMOVE,
> +	PPL_MODIFY_LOG_DISK_ADD,
> +};

a bool for this parameter is better

> +extern int ppl_modify_log(struct ppl_conf *log, struct md_rdev *rdev,
> +			  enum ppl_modify_log_operation operation);

So this should be put into raid5-log.h and rename to log_modify_log(conf, xxx)
or other sensible name.

Thanks,
Shaohua

^ permalink raw reply

* Re: [PATCH v4 7/7] raid5-ppl: runtime PPL enabling or disabling
From: Shaohua Li @ 2017-02-28  1:31 UTC (permalink / raw)
  To: Artur Paszkiewicz; +Cc: shli, linux-raid
In-Reply-To: <20170221194401.18733-8-artur.paszkiewicz@intel.com>

On Tue, Feb 21, 2017 at 08:44:01PM +0100, Artur Paszkiewicz wrote:
> Allow writing to 'consistency_policy' attribute when the array is
> active. Add a new function 'change_consistency_policy' to the
> md_personality operations structure to handle the change in the
> personality code. Values "ppl" and "resync" are accepted and
> turn PPL on and off respectively.
> 
> When enabling PPL its location and size should first be set using
> 'ppl_sector' and 'ppl_size' attributes and a valid PPL header should be
> written at this location on each member device.
> 
> Enabling or disabling PPL is performed under a suspended array.  The
> raid5_reset_stripe_cache function frees the stripe cache and allocates
> it again in order to allocate or free the ppl_pages for the stripes in
> the stripe cache.
> 
> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> ---
>  drivers/md/md.c        | 12 ++++++++---
>  drivers/md/md.h        |  2 ++
>  drivers/md/raid5-ppl.c |  4 ++++
>  drivers/md/raid5.c     | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 73 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 3ff979d538d4..b70e19513588 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -5003,14 +5003,20 @@ consistency_policy_show(struct mddev *mddev, char *page)
>  static ssize_t
>  consistency_policy_store(struct mddev *mddev, const char *buf, size_t len)
>  {
> +	int err = 0;
> +
>  	if (mddev->pers) {
> -		return -EBUSY;
> +		if (mddev->pers->change_consistency_policy)
> +			err = mddev->pers->change_consistency_policy(mddev, buf);
> +		else
> +			err = -EBUSY;
>  	} else if (mddev->external && strncmp(buf, "ppl", 3) == 0) {
>  		set_bit(MD_HAS_PPL, &mddev->flags);
> -		return len;
>  	} else {
> -		return -EINVAL;
> +		err = -EINVAL;
>  	}
> +
> +	return err ? err : len;
>  }
>  
>  static struct md_sysfs_entry md_consistency_policy =
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index 66a5a16e79f7..88a5155c152e 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -548,6 +548,8 @@ struct md_personality
>  	/* congested implements bdi.congested_fn().
>  	 * Will not be called while array is 'suspended' */
>  	int (*congested)(struct mddev *mddev, int bits);
> +	/* Changes the consistency policy of an active array. */
> +	int (*change_consistency_policy)(struct mddev *mddev, const char *buf);
>  };
>  
>  struct md_sysfs_entry {
> diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
> index c2070d124849..a5c4d3333bce 100644
> --- a/drivers/md/raid5-ppl.c
> +++ b/drivers/md/raid5-ppl.c
> @@ -1095,6 +1095,10 @@ int ppl_init_log(struct r5conf *conf)
>  		 */
>  		mddev->recovery_cp = MaxSector;
>  		set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
> +	} else if (mddev->pers && ppl_conf->mismatch_count > 0) {
> +		/* no mismatch allowed when enabling PPL for a running array */
> +		ret = -EINVAL;
> +		goto err;
>  	}
>  
>  	conf->ppl = ppl_conf;
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index b17e90f06f19..754fb8e1c76f 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -8319,6 +8319,63 @@ static void *raid6_takeover(struct mddev *mddev)
>  	return setup_conf(mddev);
>  }
>  
> +static void raid5_reset_stripe_cache(struct mddev *mddev)
> +{
> +	struct r5conf *conf = mddev->private;
> +
> +	mutex_lock(&conf->cache_size_mutex);
> +	while (conf->max_nr_stripes &&
> +	       drop_one_stripe(conf))
> +		;
> +	while (conf->min_nr_stripes > conf->max_nr_stripes &&
> +	       grow_one_stripe(conf, GFP_KERNEL))
> +		;
> +	mutex_unlock(&conf->cache_size_mutex);
> +}
> +
> +static int raid5_change_consistency_policy(struct mddev *mddev, const char *buf)
> +{
> +	struct r5conf *conf;
> +	int err;
> +
> +	err = mddev_lock(mddev);
> +	if (err)
> +		return err;
> +	conf = mddev->private;
> +	if (!conf) {
> +		mddev_unlock(mddev);
> +		return -ENODEV;
> +	}
> +
> +	if (strncmp(buf, "ppl", 3) == 0 &&
> +	    !test_bit(MD_HAS_PPL, &mddev->flags)) {
> +		mddev_suspend(mddev);
> +		err = ppl_init_log(conf);

I didn't see you check if all rdev have valid ppl setting in ppl_load. Am I
missing anything? Is it possible some rdevs have ppl set, but some not. In the
case, we should just bail out.

Also check if this is raid5

Thanks,
Shaohua

^ permalink raw reply

* Re: [PATCH v4 0/7] Partial Parity Log for MD RAID 5
From: Shaohua Li @ 2017-02-28  1:35 UTC (permalink / raw)
  To: Artur Paszkiewicz; +Cc: shli, linux-raid
In-Reply-To: <20170221194401.18733-1-artur.paszkiewicz@intel.com>

On Tue, Feb 21, 2017 at 08:43:54PM +0100, Artur Paszkiewicz wrote:
> This series of patches implements the Partial Parity Log for RAID5 arrays. The
> purpose of this feature is closing the RAID 5 Write Hole. It is a solution
> alternative to the existing raid5-cache, but the logging workflow and much of
> the implementation is based on it.
> 
> The main differences compared to raid5-cache is that PPL is a distributed log -
> it is stored on array member drives in the metadata area and does not require a
> dedicated journaling drive. Write performance is reduced by up to 30%-40% but
> it scales with the number of drives in the array and the journaling drive does
> not become a bottleneck or a single point of failure. PPL does not protect from
> losing in-flight data, only from silent data corruption. More details about how
> the log works can be found in patches 3 and 5.
> 
> This feature originated from Intel RSTe, which uses IMSM metadata. This
> patchset implements PPL for external metadata (specifically IMSM) as well as
> native MD v1.x metadata.
> 
> Changes in mdadm are also required to make this fully usable. Patches for mdadm
> will be sent later.

Generally looks ok, I replied separatly. I'd like to make sure the new format
will be supported by Intel RSTe, and intel will continue working on this.

Thanks,
Shaohua

^ permalink raw reply

* [PATCH RFC] test: revise 'test' and make it easier to understand
From: Zhilong Liu @ 2017-02-28  2:47 UTC (permalink / raw)
  To: neilb, Jes.Sorensen; +Cc: linux-raid, gqjiang, Zhilong Liu
In-Reply-To: <a81137d5-3042-e983-fe0e-1b9a0de836ff@suse.com>

1. use 'Tab' as the code style.
2. arrange the testing steps and provide the 'main' entrance.
3. draft the log_save feature, it captures the /proc/mdstat,
   md superblock info, bitmap info and the detail dmesg.
4. modified the mdadm() func, adding the operation that clear
   the superblock when create or build one new array, and it
   would exit testing when mdadm command returned non-0 value.
5. delete no_errors() func, it only used in tests/04update-uuid,
   I recommend the new mdadm() using method.
6. delete fast_sync() func.
7. testdev(), add the object file checking, otherwise this command
   would create one regular file, it's one trouble thing.
8. add dmesg checking in do_test() func, it's necessary to check
   dmesg whether or not printed abnormal message.
9. add checking conditions in main(), such as $pwd/raid6check need
   exists, here is a prompt to remind users to 'make everything'
   before testing; the $targetdir should mount under ext[2-4] FS,
   because the external bitmap only supports ext, the bmap() API
   of bitmap.c doesn't exist in all filesystem, such as btrfs.

Signed-off-by: Zhilong Liu <zlliu@suse.com>

diff --git a/test b/test
index 13f1bda..e23addb 100755
--- a/test
+++ b/test
@@ -1,36 +1,21 @@
 #!/bin/bash
 #
 # run test suite for mdadm
-user=`id -un`
-if [ " $user" != " root" ]
-then echo >&2 "test: testing can only be done as 'root'."
-     exit 1;
-fi
-
-prefix='[0-9][0-9]'
-
-dir=`pwd`
+dir=$(pwd)
+DEVTYPE=loop
 mdadm=$dir/mdadm
-if [ \! -x $mdadm ]
-then
-   echo >&2 "test: $mdadm isn't usable."
-fi
-
 testdir="tests"
-logdir="$testdir/logs"
 logsave=0
 exitonerror=1
-
-echo "Testing on linux-$(uname -r) kernel"
-
-# Check whether to run multipath tests
-modprobe multipath 2> /dev/null
-if grep -s 'Personalities : .*multipath' > /dev/null /proc/mdstat ; then
-    MULTIPATH="yes"
-fi
+prefix='[0-9][0-9]'
 INTEGRITY=yes
-DEVTYPE=loop
 LVM_VOLGROUP=mdtest
+targetdir="/var/tmp/mdtest"
+[ -d "$targetdir" ] &&
+	rm -fr $targetdir
+logdir="$dir/$testdir/log"
+[ -d "$logdir" ] &&
+	rm -fr $logdir
 
 # make sure to test local mdmon, not system one
 export MDADM_NO_SYSTEMCTL=1
@@ -42,7 +27,7 @@ mdp1=/dev/md_d1
 
 # We test mdadm on loop-back block devices.
 # dir for storing files should be settable by command line maybe
-targetdir=/var/tmp
+#targetdir=/var/tmp/mdtest
 size=20000
 # super0, round down to multiple of 64 and substract 64
 mdsize0=19904
@@ -68,22 +53,64 @@ config=/tmp/mdadm.conf
 
 cleanup() {
 	udevadm settle
-	$mdadm -Ssq 2> /dev/null
-        case $DEVTYPE in
-        loop)
-	  for d in 0 1 2 3 4 5 6 7  8 9 10 11 12 13
-	  do
-	    losetup -d /dev/loop$d ; # rm -f $targetdir/mdtest$d
-	    rm -f /dev/disk/by-path/loop*
-	  done
-          ;;
-        lvm)
-	  for d in 0 1 2 3 4 5 6 7  8 9 10 11 12 13
-	  do
-	    eval "lvremove --quiet -f \$dev$d"
-	  done
-          ;;
-        esac
+	$mdadm -Ssq
+	case $DEVTYPE in
+		loop )
+			for d in 0 1 2 3 4 5 6 7  8 9 10 11 12 13
+			do
+				losetup -d /dev/loop$d ; # rm -f $targetdir/mdtest$d
+				rm -f /dev/disk/by-path/loop*
+			done
+		;;
+		lvm )
+			for d in 0 1 2 3 4 5 6 7  8 9 10 11 12 13
+			do
+				eval "lvremove --quiet -f \$dev$d"
+			done
+		;;
+	esac
+	dmesg -c > /dev/null
+}
+
+die()
+{
+	echo -e "\n\tERROR: $* \n"
+	log_save fail
+	exit 2
+}
+
+# add save_log func, $1 is the flag why save log.
+log_save() {
+	status=$1
+	save_log="$status""$_basename".log
+
+	echo "## $HOSTNAME: dmesg saved." >> $logdir/$save_log
+	dmesg -c >> $logdir/$save_log
+	$mdadm -As 2> /dev/null
+	echo "## $HOSTNAME: md status message saved." >> $logdir/$save_log
+	cat /proc/mdstat >> $logdir/$save_log
+
+	if [ $DEVTYPE == 'lvm' ]
+	then
+		# waiting for supporting.
+		echo
+	elif [ $DEVTYPE == 'loop' ]
+	then
+		array=($($mdadm -Ds | cut -d' ' -f2))
+		if [ ${#array[@]} -ge 1 ]; then
+			md_disks=($($mdadm -D -Y ${array[@]} | grep "/dev/$DEVTYPE" | cut -d'=' -f2))
+			echo "## $HOSTNAME: mdadm -D ${array[@]}" >> $logdir/$save_log
+			$mdadm -D ${array[@]} >> $logdir/$save_log
+			$mdadm -X $md_disks &> /dev/null
+			if [ $? -eq 0 ]
+			then
+				echo "## $HOSTNAME: mdadm -X ${md_disks[@]}" >> $logdir/$save_log
+				$mdadm -X ${md_disks[@]} >> $logdir/$save_log
+			fi
+		elif [ ${#array[@]} -lt 1 ]; then
+			echo "## $HOSTNAME: no array assembled!" >> $logdir/$save_log
+		fi
+	fi
 }
 
 ctrl_c() {
@@ -91,350 +118,374 @@ ctrl_c() {
 }
 
 do_setup() {
-  trap cleanup 0 1 3 15
-  trap ctrl_c 2
+	trap cleanup 0 1 3 15
+	trap ctrl_c 2
 
-  # make sure there are no loop devices remaining.
-  # udev started things can sometimes prevent them being stopped
-  # immediately
-  while grep loop /proc/partitions > /dev/null 2>&1
-  do
-    mdadm -Ss
-    losetup -d /dev/loop[0-9]* 2> /dev/null
-    sleep 1
-  done
-  devlist=
-  for d in 0 1 2 3 4 5 6 7 8 9 10 11 12 13
-  do
-    sz=$size
-    if [ $d -gt 7 ]; then sz=$ddfsize ; fi
-    case $DEVTYPE in
-    loop)
-      [ -f $targetdir/mdtest$d ] || dd if=/dev/zero of=$targetdir/mdtest$d count=$sz bs=1K > /dev/null 2>&1
-      # make sure udev doesn't touch
-      mdadm --zero $targetdir/mdtest$d 2> /dev/null
-      [ -b /dev/loop$d ] || mknod /dev/loop$d b 7 $d
-      if [ $d -eq 7 ]
-      then
-        losetup /dev/loop$d $targetdir/mdtest6 # for multipath use
-      else
-        losetup /dev/loop$d $targetdir/mdtest$d
-      fi
-      eval dev$d=/dev/loop$d
-      eval file$d=$targetdir/mdtest$d
-      ;;
-    lvm)
-      unset MULTIPATH
-      eval dev$d=/dev/mapper/${LVM_VOLGROUP}-mdtest$d
-      if ! lvcreate --quiet -L ${sz}K -n mdtest$d $LVM_VOLGROUP; then
-	  trap '' 0 # make sure lvremove is not called
-	  eval echo error creating \$dev$d
-	  exit 129
-      fi
-      ;;
-    ram)
-      unset MULTIPATH
-      eval dev$d=/dev/ram$d
-      ;;
-    esac
-    eval devlist=\"\$devlist \$dev$d\"
-    eval devlist$d=\"\$devlist\"
-   #" <-- add this quote to un-confuse vim syntax highlighting
-  done
-  path0=$dev6
-  path1=$dev7
+	# make sure there are no loop devices remaining.
+	# udev started things can sometimes prevent them being stopped
+	# immediately
+	while grep loop /proc/partitions > /dev/null 2>&1
+	do
+		mdadm -Ss
+		losetup -d /dev/loop[0-9]* 2> /dev/null
+		sleep 0.2
+	done
+	devlist=
+	for d in 0 1 2 3 4 5 6 7 8 9 10 11 12 13
+	do
+		sz=$size
+		[ $d -gt 7 ] && sz=$ddfsize
+		case $DEVTYPE in
+		loop )
+			[ -f $targetdir/mdtest$d ] ||
+				dd if=/dev/zero of=$targetdir/mdtest$d count=$sz bs=1K > /dev/null 2>&1
+			# make sure udev doesn't touch
+			mdadm --zero $targetdir/mdtest$d 2> /dev/null
+			[ -b /dev/loop$d ] || mknod /dev/loop$d b 7 $d
+			if [ $d -eq 7 ]
+			then
+				losetup /dev/loop$d $targetdir/mdtest6 # for multipath use
+			else
+				losetup /dev/loop$d $targetdir/mdtest$d
+			fi
+			eval dev$d=/dev/loop$d
+			eval file$d=$targetdir/mdtest$d
+		;;
+		lvm )
+			unset MULTIPATH
+			eval dev$d=/dev/mapper/${LVM_VOLGROUP}-mdtest$d
+			if ! lvcreate --quiet -L ${sz}K -n mdtest$d $LVM_VOLGROUP; then
+				trap '' 0 # make sure lvremove is not called
+				eval echo error creating \$dev$d
+				exit 129
+			fi
+		;;
+		ram )
+			unset MULTIPATH
+			eval dev$d=/dev/ram$d
+		;;
+		esac
+		eval devlist=\"\$devlist \$dev$d\"
+		eval devlist$d=\"\$devlist\"
+		#" <-- add this quote to un-confuse vim syntax highlighting
+	done
+	path0=$dev6
+	path1=$dev7
 
-  ulimit -c unlimited
-  [ -f /proc/mdstat ] || modprobe md_mod
-  echo 2000 > /proc/sys/dev/raid/speed_limit_max
-  echo 0 > /sys/module/md_mod/parameters/start_ro
+	ulimit -c unlimited
+	[ -f /proc/mdstat ] || modprobe md_mod
+	echo 2000 > /proc/sys/dev/raid/speed_limit_max
+	echo 0 > /sys/module/md_mod/parameters/start_ro
 }
 
 # mdadm always adds --quiet, and we want to see any unexpected messages
 mdadm() {
-    rm -f $targetdir/stderr
-    case $* in
-	*-S* ) udevadm settle
-	       p=`cat /proc/sys/dev/raid/speed_limit_max`
-	       echo 20000 > /proc/sys/dev/raid/speed_limit_max
-    esac
-    case $* in
-	*-C* ) $mdadm 2> $targetdir/stderr --quiet "$@" --auto=yes;;
-	* )    $mdadm 2> $targetdir/stderr --quiet "$@"
-    esac
-    rv=$?
-    case $* in
-	*-S* ) udevadm settle
-	       echo $p > /proc/sys/dev/raid/speed_limit_max
-    esac
-    cat >&2 $targetdir/stderr
-    return $rv
+	rm -f $targetdir/stderr
+	case $* in
+		*-S* )
+			udevadm settle
+			p=`cat /proc/sys/dev/raid/speed_limit_max`
+			echo 20000 > /proc/sys/dev/raid/speed_limit_max
+	esac
+	case $* in
+		*-C* | *--create* | *-B* | *--build* )
+			for args in $*
+			do
+				[[ $args =~ "/dev/" ]] && {
+					[[ $args =~ "md" ]] ||
+						$mdadm --zero $args > /dev/null
+					}
+			done
+			$mdadm 2> $targetdir/stderr --quiet "$@" --auto=yes
+		;;
+		* )
+			$mdadm 2> $targetdir/stderr --quiet "$@"
+	esac
+	rv=$?
+	case $* in
+		*-S* )
+			udevadm settle
+			echo $p > /proc/sys/dev/raid/speed_limit_max
+	esac
+	cat >&2 $targetdir/stderr > $targetdir/log
+	[  $rv -ne 0 ] && exit 1
+	return $rv
 }
 
 # check various things
 check() {
-   case $1 in
-    spares )
-       spares=`tr '] ' '\012\012' < /proc/mdstat | grep -c '(S)' || exit 0`
-       if [ $spares -ne $2 ]
-       then
-          echo >&2 "ERROR expected $2 spares, found $spares"; exit 1;
-       fi
-      ;;
-    raid* | linear )
-      grep -s "active $1 " /proc/mdstat > /dev/null || {
-		echo >&2 "ERROR active $1 not found" ; cat /proc/mdstat ; exit 1;}
-     ;;
-    algorithm )
-      grep -s " algorithm $2 " /proc/mdstat > /dev/null || {
-	  echo >&2 "ERROR algorithm $2 not found"; cat /proc/mdstat; exit 1;}
-     ;;
-    resync | recovery | reshape)
-	cnt=5
-	while ! grep -s $1 /proc/mdstat > /dev/null
-	do
-	    if [ $cnt -gt 0 ] && grep -v idle /sys/block/md*/md/sync_action > /dev/null
-	    then # Something isn't idle - wait a bit
-		sleep 0.5
-		cnt=$[cnt-1]
-	    else
-		echo >&2 ERROR no $1 happening; cat /proc/mdstat; exit 1
-	    fi
-	done
-	;;
-
-     nosync )
-       sleep 0.5
-       # Since 4.2 we delay the close of recovery until there has been a chance for
-       # spares to be activated.  That means that a recovery that finds nothing
-       # to do can still take a little longer than expected.
-       # add an extra check: is sync_completed shows the end is reached, assume
-       # there is no recovery.
-       if grep -s -E '(resync|recovery|reshape) *=' > /dev/null /proc/mdstat ; then
-	   incomplete=`grep / /sys/block/md*/md/sync_completed 2> /dev/null | sed '/^ *\([0-9]*\) \/ \1/d'`
-	   if [ -n "$incomplete" ]; then
-		echo >&2 "ERROR resync or recovery is happening!"; cat /proc/mdstat ; exit 1;
-	   fi
-       fi
-     ;;
-
-    wait )
-      p=`cat /proc/sys/dev/raid/speed_limit_max`
-      echo 2000000 > /proc/sys/dev/raid/speed_limit_max
-      sleep 0.1
-      while grep -E '(resync|recovery|reshape|check|repair) *=' > /dev/null /proc/mdstat ||
-	      grep -v idle > /dev/null /sys/block/md*/md/sync_action
-      do sleep 0.5;
-      done
-      echo $p > /proc/sys/dev/raid/speed_limit_max
-      ;;
-
-    state )
-       grep -s "blocks.*\[$2\]\$" /proc/mdstat > /dev/null || {
-		echo >&2 "ERROR state $2 not found!"; cat /proc/mdstat ; exit 1; }
-       sleep 0.5
-      ;;
-
-    bitmap )
-       grep -s bitmap > /dev/null /proc/mdstat || {
-		echo >&2 ERROR no bitmap ; cat /proc/mdstat ; exit 1; }
-      ;;
-    nobitmap )
-       if grep -s "bitmap" > /dev/null /proc/mdstat
-       then
-		echo >&2 ERROR bitmap present ; cat /proc/mdstat ; exit 1;
-       fi
-      ;;
-
-    readonly )
-       grep -s "read-only" > /dev/null /proc/mdstat || {
-                echo >&2 "ERROR array is not read-only!"; cat /proc/mdstat ; exit 1; }
-      ;;
-
-    inactive )
-       grep -s "inactive" > /dev/null /proc/mdstat || {
-                echo >&2 "ERROR array is not inactive!"; cat /proc/mdstat ; exit 1; }
-      ;;
-    * ) echo >&2 ERROR unknown check $1 ; exit 1;
-   esac
+	case $1 in
+		spares )
+			spares=$(tr '] ' '\012\012' < /proc/mdstat | grep -c '(S)')
+			[ $spares -ne $2 ] &&
+				die "expected $2 spares, found $spares"
+		;;
+		raid* | linear )
+			grep -s -q "active $1 " /proc/mdstat ||
+				die "active $1 not found"
+		;;
+		algorithm )
+			grep -s -q " algorithm $2 " /proc/mdstat ||
+				die "algorithm $2 not found"
+		;;
+		resync | recovery | reshape )
+			cnt=5
+			while ! grep -s $1 /proc/mdstat > /dev/null
+			do
+				if [ $cnt -gt 0 ]
+				then # Something isn't idle - wait a bit
+					sleep 0.5
+					cnt=$[cnt-1]
+				else
+					die "no $1 happening"
+				fi
+			done
+		;;
+		nosync )
+			sleep 0.5
+			# Since 4.2 we delay the close of recovery until there has been a chance for
+			# spares to be activated.  That means that a recovery that finds nothing
+			# to do can still take a little longer than expected.
+			# add an extra check: is sync_completed shows the end is reached, assume
+			# there is no recovery.
+			if grep -s -E -q '(resync|recovery|reshape) *=' /proc/mdstat
+			then
+				incomplete=`grep / /sys/block/md*/md/sync_completed 2> /dev/null | sed '/^ *\([0-9]*\) \/ \1/d'`
+				[ -n "$incomplete" ] &&
+					die "resync or recovery is happening!"
+			fi
+		;;
+		wait )
+			p=$(cat /proc/sys/dev/raid/speed_limit_max)
+			echo 2000000 > /proc/sys/dev/raid/speed_limit_max
+			sleep 0.1
+			while grep -E -q '(resync|recovery|reshape|check|repair) *=' /proc/mdstat ||
+				grep -v idle > /dev/null /sys/block/md*/md/sync_action
+			do sleep 0.5;
+			done
+			echo $p > /proc/sys/dev/raid/speed_limit_max
+		;;
+		state )
+			grep -s -q "blocks.*\[$2\]\$" /proc/mdstat ||
+				die "state $2 not found!"
+			sleep 0.5
+		;;
+		bitmap )
+			grep -s -q bitmap /proc/mdstat ||
+				die "no bitmap found in /proc/mdstat"
+		;;
+		nobitmap )
+			grep -s -q "bitmap" /proc/mdstat &&
+				die "bitmap present in /proc/mdstat"
+		;;
+		readonly )
+			grep -s -q "read-only" /proc/mdstat ||
+				die "array is not read-only!"
+		;;
+		inactive )
+			grep -s -q "inactive" /proc/mdstat ||
+				die "array is not inactive!"
+		;;
+		* )
+			die "check $1 is unknown!"
+	esac
 }
 
-no_errors() {
-  if [ -s $targetdir/stderr ]
-  then echo Bad errors from mdadm: ; cat $targetdir/stderr; exit 2;
-  fi
-}
 # basic device test
-
 testdev() {
-   udevadm settle
-   dev=$1
-   cnt=$2
-   dvsize=$3
-   chunk=$4
-   if [ -z "$5" ]; then
-      mkfs.ext3 -F -j $dev > /dev/null 2>&1 && fsck -fn $dev >&2
-   fi
-   dsize=$[dvsize/chunk]
-   dsize=$[dsize*chunk]
-   rasize=$[dsize*2*cnt]
-   # rasize is in sectors
-   if [ -n "$DEV_ROUND_K" ]; then
-      rasize=$[rasize/DEV_ROUND_K/2]
-      rasize=$[rasize*DEV_ROUND_K*2]
-   fi
-   if [ `/sbin/blockdev --getsize $dev` -eq 0 ]; then sleep 2 ; fi
-   _sz=`/sbin/blockdev --getsize $dev`
-   if [ $rasize -lt $_sz -o $[rasize*4/5] -gt $_sz ]
-   then
-     echo "ERROR: size is wrong for $dev: $cnt * $dvsize (chunk=$chunk) = $rasize, not $_sz"
-     exit 1
-   fi
-}
-
-fast_sync() {
-  echo 200000 > /proc/sys/dev/raid/speed_limit_max
+# add the necessary checking when received object file, such as testdev /dev/md0
+# it would be created one regular file if /dev/md0 doesn't pull up, the rest testing
+# scripts would be affected.
+	[ -f $1 ] && rm -f $1
+	[ -b $1 ] || die "$1 doesn't exist!"
+	udevadm settle
+	dev=$1
+	cnt=$2
+	dvsize=$3
+	chunk=$4
+	if [ -z "$5" ]; then
+		mkfs.ext3 -F -j $dev > /dev/null 2>&1 && fsck -fn $dev >&2
+	fi
+	dsize=$[dvsize/chunk]
+	dsize=$[dsize*chunk]
+	rasize=$[dsize*2*cnt]
+	# rasize is in sectors
+	if [ -n "$DEV_ROUND_K" ]; then
+		rasize=$[rasize/DEV_ROUND_K/2]
+		rasize=$[rasize*DEV_ROUND_K*2]
+	fi
+	[ `/sbin/blockdev --getsize $dev` -eq 0 ] && sleep 2
+	_sz=`/sbin/blockdev --getsize $dev`
+	[ $rasize -lt $_sz -o $[rasize*4/5] -gt $_sz ] &&
+		die "size is wrong for $dev: $cnt * $dvsize (chunk=$chunk) = $rasize, not $_sz"
+
+# sometimes the above command would return non-0
+	return 0
 }
 
 rotest() {
-  dev=$1
-  fsck -fn $dev >&2
+	dev=$1
+	fsck -fn $dev >&2
 }
 
 do_test() {
-  _script=$1
-  _basename=`basename $_script`
-  if [ -f "$_script" ]
-  then
-    rm -f $targetdir/stderr
-    # stop all arrays, just incase some script left an array active.
-    $mdadm -Ssq 2> /dev/null
-    mdadm --zero $devlist 2> /dev/null
-    mdadm --zero $devlist 2> /dev/null
-    # this might have been reset: restore the default.
-    echo 2000 > /proc/sys/dev/raid/speed_limit_max
-    # source script in a subshell, so it has access to our
-    # namespace, but cannot change it.
-    echo -ne "$_script... "
-    if ( set -ex ; . $_script ) &> $targetdir/log
-    then
-      echo "succeeded"
-      _fail=0
-    else
-      log=log
-      cat $targetdir/stderr >> $targetdir/log
-      echo "=======================dmesg=================" >> $targetdir/log
-      dmesg | tail -n 200 >> $targetdir/log
-      if [ $exitonerror == 0 ]; then
-	  log=log-`basename $_script`
-	  mv $targetdir/log $logdir/$log
-      fi
-      echo "FAILED - see $logdir/$log for details"
-      _fail=1
-    fi
-    if [ "$savelogs" == "1" ]; then
-      cp $targetdir/log $logdir/$_basename.log
-    fi
-    if [ "$_fail" == "1" -a "$exitonerror" == "1" ]; then
-      exit 1
-    fi
-  fi
+	_script=$1
+	_basename=`basename $_script`
+	if [ -f "$_script" ]
+	then
+		rm -f $targetdir/stderr
+		# stop all arrays, just incase some script left an array active.
+		$mdadm -Ssq 2> /dev/null
+		mdadm --zero $devlist 2> /dev/null
+		# this might have been reset: restore the default.
+		echo 2000 > /proc/sys/dev/raid/speed_limit_max
+		# source script in a subshell, so it has access to our
+		# namespace, but cannot change it.
+		echo -ne "$_script... "
+		if ( set -ex ; . $_script ) &> $targetdir/log
+		then
+# put the dmesg checking here, the following key-words shouldn't appeared during testing.
+			dmesg | grep -i "error\|call trace\|segfault" &&
+				die "dmesg printed error when testing $_basename!"
+			echo "succeeded"
+			_fail=0
+		else
+			log=log-$_basename
+			cat $targetdir/stderr >> $targetdir/log
+			log_save fail
+			mv $targetdir/log $logdir/$log
+			echo "FAILED - see $logdir/$log for details"
+			_fail=1
+		fi
+		[ "$savelogs" == "1" ] &&
+			cp $targetdir/log $logdir/$_basename.log
+		[ "$_fail" == "1" -a "$exitonerror" == "1" ] && exit 1
+	fi
+
+	return 0
 }
 
+# just a recommend.
 do_help() {
-  echo "Usage: $0 [options]"
-  echo " Options:"
-  echo "    --tests=<test1,test2,..>    Comma separated list of tests to run"
-  echo "    --disable-multipath         Disable any tests involving multipath"
-  echo "    --disable-integrity         Disable slow tests of RAID[56] consistency"
-  echo "    --logdir=<directory>        Directory to save logfiles in"
-  echo "    --save-logs                 Save all logs in <logdir>"
-  echo "    --keep-going                Don't stop on error, ie. run all tests"
-  echo "    --dev=[loop|lvm|ram]        Use loop devices (default), LVM, or RAM disk"
-  echo "    --volgroup=<name>           LVM volume group for LVM test"
-  echo "    setup                       Setup test environment and exit"
-  echo "    cleanup                     Cleanup test environment"
-  echo "    <prefix>                    Run tests with <prefix>"
+	cat <<-EOF
+	Usage: $0 [options]
+	Options:
+		--tests=<test1,test2,..>    Comma separated list of tests to run
+		--disable-multipath         Disable any tests involving multipath
+		--disable-integrity         Disable slow tests of RAID[56] consistency
+		--logdir=<directory>        Directory to save logfiles in
+		--save-logs                 Save all logs in <logdir>
+		--keep-going                Don't stop on error, ie. run all tests
+		--dev=[loop|lvm|ram]        Use loop devices (default), LVM, or RAM disk
+		--volgroup=<name>           LVM volume group for LVM test
+		setup                       Setup test environment and exit
+		cleanup                     Cleanup test environment
+		<prefix>                    Run tests with <prefix>
+	EOF
+	exit 0
 }
 
 parse_args() {
-  for i in $*
-  do
-    case $i in
-    [0-9]*)
-      prefix=$i
-      ;;
-    setup)
-      echo "mdadm test environment setup"
-      do_setup
-      trap 0; exit 0
-      ;;
-    cleanup)
-      cleanup
-      exit 0
-      ;;
-    --tests=*)
-      TESTLIST=`expr "x$i" : 'x[^=]*=\(.*\)' | sed -e 's/,/ /g'`
-      ;;
-    --logdir=*)
-      logdir=`expr "x$i" : 'x[^=]*=\(.*\)'`
-      ;;
-    --save-logs)
-      savelogs=1
-      ;;
-    --keep-going | --no-error)
-      exitonerror=0
-      ;;
-    --disable-multipath)
-      unset MULTIPATH
-      ;;
-    --disable-integrity)
-      unset INTEGRITY
-      ;;
-    --dev=loop)
-      DEVTYPE=loop
-      ;;
-    --dev=lvm)
-      DEVTYPE=lvm
-      ;;
-    --dev=ram)
-      DEVTYPE=ram
-      ;;
-    --volgroup=*)
-      LVM_VOLGROUP=`expr "x$i" : 'x[^=]*=\(.*\)'`
-      ;;
-    --help)
-      do_help
-      exit 0;
-      ;;
-    -*)
-      echo " $0: Unknown argument: $i"
-      do_help
-      exit 0;
-      ;;
-    esac
-done
+	for i in $*
+	do
+		case $i in
+			[0-9]* )
+				prefix=$i
+			;;
+			setup )
+				echo "mdadm test environment setup"
+				do_setup
+				trap 0; exit 0
+			;;
+			cleanup )
+				cleanup
+				exit 0
+			;;
+			--tests=* )
+				TESTLIST=`expr "x$i" : 'x[^=]*=\(.*\)' | sed -e 's/,/ /g'`
+			;;
+			--logdir=* )
+				logdir=`expr "x$i" : 'x[^=]*=\(.*\)'`
+			;;
+			--save-logs )
+				savelogs=1
+			;;
+			--keep-going | --no-error )
+				exitonerror=0
+			;;
+			--disable-multipath )
+				unset MULTIPATH
+			;;
+			--disable-integrity )
+				unset INTEGRITY
+			;;
+			--dev=loop )
+				DEVTYPE=loop
+			;;
+			--dev=lvm )
+				DEVTYPE=lvm
+			;;
+			--dev=ram )
+				DEVTYPE=ram
+			;;
+			--volgroup=* )
+				LVM_VOLGROUP=`expr "x$i" : 'x[^=]*=\(.*\)'`
+			;;
+			--help )
+				do_help
+			;;
+			-* )
+				echo " $0: Unknown argument: $i"
+				do_help
+		;;
+		esac
+	done
 }
 
-logdir=$targetdir
-parse_args $@
-
-do_setup
-mkdir -p $logdir
-
-if [ "$savelogs" == "1" ]; then
-  echo "Saving logs to $logdir"
-fi
+# draft the main func
+main() {
+	[ "X$(id -un)" != "Xroot" ] && {
+		echo "test: testing can only be done as 'root'."
+		exit 1
+	}
+	[ -x $mdadm -a -x "test" ] || {
+		echo "test: $mdadm or '$dir/test' isn't usable."
+		exit 1
+	}
+	[ -x raid6check ] || {
+		echo "test: please run 'make everything' before testing."
+		exit 1
+	}
+	mkdir -p $targetdir
+	mkdir -p $logdir
+	# such as the external bitmap only support the ext file system.
+	# users can modify the $targetdir path under ext3 mount point.
+	[[ $(df $targetdir -T) =~ ext ]] || {
+		echo "ensure that $targetdir mounted under ext[2,3,4] filesystem!"
+		exit 1
+	}
+	echo "Testing on linux-$(uname -r) kernel"
+	[ "$savelogs" == "1" ] &&
+		echo "Saving logs to $logdir"
+	# Check whether to run multipath tests
+	modprobe multipath 2> /dev/null
+	grep -s -q 'Personalities : .*multipath' /proc/mdstat &&
+		MULTIPATH="yes"
+	do_setup
+	if [ "x$TESTLIST" != "x" ]; then
+		for script in $TESTLIST
+		do
+			do_test $testdir/$script
+		done
+	else
+		for script in $testdir/$prefix $testdir/$prefix*[^~]
+		do
+			do_test $script
+		done
+	fi
+
+	exit 0
+}
 
-if [ "x$TESTLIST" != "x" ]; then
-  for script in $TESTLIST
-  do
-    do_test $testdir/$script
-  done
-else
-  for script in $testdir/$prefix $testdir/$prefix*[^~]
-  do
-    do_test $script
-  done
-fi
-exit 0
+parse_args $@
+main
-- 
2.6.6


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox