Linux RAID subsystem development

* [PATCH v6 1/4] lib/raid6: Add log-of-2 table for RAID6 HW requiring disk position
From: Anup Patel @ 2017-03-06  9:43 UTC (permalink / raw)
  To: Vinod Koul, Rob Herring, Mark Rutland, Herbert Xu,
	David S . Miller, Jassi Brar
  Cc: Dan Williams, Ray Jui, Scott Branden, Jon Mason, Rob Rice,
	bcm-kernel-feedback-list, dmaengine, devicetree, linux-arm-kernel,
	linux-kernel, linux-crypto, linux-raid, Anup Patel
In-Reply-To: <1488793408-25592-1-git-send-email-anup.patel@broadcom.com>

The raid6_gfexp table represents {2}^n values for 0 <= n < 256. The
Linux async_tx framework passes values from raid6_gfexp as coefficients
for each source to the prep_dma_pq() callback of a DMA channel with PQ
capability. This creates a problem for RAID6 offload engines (such as
Broadcom SBA) which take the disk position (i.e. log of {2}) instead of
multiplicative coefficients from the raid6_gfexp table.

This patch adds a raid6_gflog table holding the log-of-2 value for any
given x such that 0 <= x < 256. For any given disk coefficient x, the
corresponding disk position is given by raid6_gflog[x]. The RAID6
offload engine driver can use this newly added raid6_gflog table to
get the disk position from the multiplicative coefficient.

Signed-off-by: Anup Patel <anup.patel@broadcom.com>
Reviewed-by: Scott Branden <scott.branden@broadcom.com>
Reviewed-by: Ray Jui <ray.jui@broadcom.com>
---
 include/linux/raid/pq.h |  1 +
 lib/raid6/mktables.c    | 20 ++++++++++++++++++++
 2 files changed, 21 insertions(+)

diff --git a/include/linux/raid/pq.h b/include/linux/raid/pq.h
index 4d57bba..30f9453 100644
--- a/include/linux/raid/pq.h
+++ b/include/linux/raid/pq.h
@@ -142,6 +142,7 @@ int raid6_select_algo(void);
 extern const u8 raid6_gfmul[256][256] __attribute__((aligned(256)));
 extern const u8 raid6_vgfmul[256][32] __attribute__((aligned(256)));
 extern const u8 raid6_gfexp[256]      __attribute__((aligned(256)));
+extern const u8 raid6_gflog[256]      __attribute__((aligned(256)));
 extern const u8 raid6_gfinv[256]      __attribute__((aligned(256)));
 extern const u8 raid6_gfexi[256]      __attribute__((aligned(256)));
 
diff --git a/lib/raid6/mktables.c b/lib/raid6/mktables.c
index 39787db..e824d08 100644
--- a/lib/raid6/mktables.c
+++ b/lib/raid6/mktables.c
@@ -125,6 +125,26 @@ int main(int argc, char *argv[])
 	printf("EXPORT_SYMBOL(raid6_gfexp);\n");
 	printf("#endif\n");
 
+	/* Compute log-of-2 table */
+	printf("\nconst u8 __attribute__((aligned(256)))\n"
+	       "raid6_gflog[256] =\n" "{\n");
+	for (i = 0; i < 256; i += 8) {
+		printf("\t");
+		for (j = 0; j < 8; j++) {
+			v = 255;
+			for (k = 0; k < 256; k++)
+				if (exptbl[k] == (i + j)) {
+					v = k;
+					break;
+				}
+			printf("0x%02x,%c", v, (j == 7) ? '\n' : ' ');
+		}
+	}
+	printf("};\n");
+	printf("#ifdef __KERNEL__\n");
+	printf("EXPORT_SYMBOL(raid6_gflog);\n");
+	printf("#endif\n");
+
 	/* Compute inverse table x^-1 == x^254 */
 	printf("\nconst u8 __attribute__((aligned(256)))\n"
 	       "raid6_gfinv[256] =\n" "{\n");
-- 
2.7.4
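
To illustrate the relationship the commit message describes, here is a
minimal userspace sketch, not the kernel code itself. It assumes the usual
RAID6 field GF(2^8) with reduction polynomial 0x11d and generator {02}; the
names gfexp, gflog, gf_mul2(), build_tables() and coef_to_disk_pos() are
hypothetical stand-ins for the generated raid6_gfexp/raid6_gflog tables.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the generated raid6_gfexp / raid6_gflog. */
static uint8_t gfexp[256];
static uint8_t gflog[256];

/* Multiply by the generator {02} in GF(2^8), reducing by 0x11d. */
static uint8_t gf_mul2(uint8_t x)
{
	return (uint8_t)((x << 1) ^ ((x & 0x80) ? 0x1d : 0));
}

/* Fill both tables in one pass: gfexp[n] = {02}^n and gflog[{02}^n] = n. */
static void build_tables(void)
{
	uint8_t v = 1;
	int n;

	/* log of 0 is undefined; the patch's generator also emits 255 here. */
	gflog[0] = 255;

	/* {02} has order 255, so n = 0..254 covers every non-zero element. */
	for (n = 0; n < 255; n++) {
		gfexp[n] = v;
		gflog[v] = (uint8_t)n;
		v = gf_mul2(v);
	}
	/* gfexp[255] is not a real power and stays 0 in this sketch. */
}

/* Convert a PQ coefficient (a power of {02}) back to its disk position. */
static uint8_t coef_to_disk_pos(uint8_t coef)
{
	assert(coef != 0);	/* 0 is never a valid coefficient */
	return gflog[coef];
}
```

After build_tables(), a coefficient of 0x1d maps back to position 8, since
{02}^8 reduces to 0x1d. The one-pass fill is a design choice of this sketch;
the patch instead searches the exponent table for each value.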

^ permalink raw reply related

* [PATCH v6 0/4] Broadcom SBA RAID support
From: Anup Patel @ 2017-03-06  9:43 UTC (permalink / raw)
  To: Vinod Koul, Rob Herring, Mark Rutland, Herbert Xu,
	David S . Miller, Jassi Brar
  Cc: devicetree, Anup Patel, Scott Branden, Jon Mason, Ray Jui,
	linux-kernel, linux-raid, bcm-kernel-feedback-list, linux-crypto,
	Rob Rice, dmaengine, Dan Williams, linux-arm-kernel

The Broadcom SBA RAID is a stream-based device which provides
RAID5/6 offload.

It requires a SoC specific ring manager (such as the Broadcom FlexRM
ring manager) to provide a ring-based programming interface. Due to
this, the Broadcom SBA RAID driver (mailbox client) implements a
DMA device having one DMA channel using a set of mailbox channels
provided by the Broadcom SoC specific ring manager driver (mailbox
controller).

The Broadcom SBA RAID hardware requires PQ disk position instead
of PQ disk coefficient. To address this, we have added the raid6_gflog
table which will help the driver to convert PQ disk coefficient to PQ
disk position.

This patchset is based on Linux-4.11-rc1 and depends on patchset
"[PATCH v5 0/2] Broadcom FlexRM ring manager support"

It is also available at sba-raid-v6 branch of
https://github.com/Broadcom/arm64-linux.git

Changes since v5:
 - Rebased patches for Linux-4.11-rc1

Changes since v4:
 - Removed dependency of bcm-sba-raid driver on kconfig option
   ASYNC_TX_ENABLE_CHANNEL_SWITCH
 - Select kconfig options ASYNC_TX_DISABLE_XOR_VAL_DMA and
   ASYNC_TX_DISABLE_PQ_VAL_DMA for bcm-sba-raid driver
 - Implemented device_prep_dma_interrupt() using dummy 8-byte
   copy operation so that the dma_async_device_register() can
   set DMA_ASYNC_TX capability for the DMA device provided
   by bcm-sba-raid driver

Changes since v3:
 - Replaced SBA_ENC() with sba_cmd_enc() inline function
 - Use list_first_entry_or_null() wherever possible
 - Remove unwanted braces around loops wherever possible
 - Use lockdep_assert_held() where required

Changes since v2:
 - Dropped patch to handle DMA devices having support for fewer
   PQ coefficients in Linux Async Tx
 - Added work-around in bcm-sba-raid driver to handle unsupported
   PQ coefficients using multiple SBA requests

Changes since v1:
 - Dropped patch to add mbox_channel_device() API
 - Used GENMASK and BIT macros wherever possible in bcm-sba-raid driver
 - Replaced C_MDATA macros with static inline functions in
   bcm-sba-raid driver
 - Removed sba_alloc_chan_resources() callback in bcm-sba-raid driver
 - Used dev_err() instead of dev_info() wherever applicable
 - Removed call to sba_issue_pending() from sba_tx_submit() in
   bcm-sba-raid driver
 - Implemented SBA request chaining for handling (len > sba->req_size)
   in bcm-sba-raid driver
 - Implemented device_terminate_all() callback in bcm-sba-raid driver

Anup Patel (4):
  lib/raid6: Add log-of-2 table for RAID6 HW requiring disk position
  async_tx: Fix DMA_PREP_FENCE usage in do_async_gen_syndrome()
  dmaengine: Add Broadcom SBA RAID driver
  dt-bindings: Add DT bindings document for Broadcom SBA RAID driver

 .../devicetree/bindings/dma/brcm,iproc-sba.txt     |   29 +
 crypto/async_tx/async_pq.c                         |    5 +-
 drivers/dma/Kconfig                                |   14 +
 drivers/dma/Makefile                               |    1 +
 drivers/dma/bcm-sba-raid.c                         | 1785 ++++++++++++++++++++
 include/linux/raid/pq.h                            |    1 +
 lib/raid6/mktables.c                               |   20 +
 7 files changed, 1852 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/dma/brcm,iproc-sba.txt
 create mode 100644 drivers/dma/bcm-sba-raid.c

-- 
2.7.4

^ permalink raw reply

* Re: [RAID recovery] Unable to recover RAID5 array after disk failure
From: Olivier Swinkels @ 2017-03-06  9:17 UTC (permalink / raw)
  To: Phil Turmel; +Cc: linux-raid
In-Reply-To: <CAJ0QwkKAVLh7+kV_GsEL4O5hA+Pj=Y9sE_CBbWJJqDyA+gVqpA@mail.gmail.com>

On Mon, Mar 6, 2017 at 9:26 AM, Olivier Swinkels
<olivier.swinkels@gmail.com> wrote:
> On Sun, Mar 5, 2017 at 7:55 PM, Phil Turmel <philip@turmel.org> wrote:
>>
>> On 03/03/2017 04:35 PM, Olivier Swinkels wrote:
>> > Hi,
>> >
>> > I'm in quite a pickle here. I can't recover from a disk failure on my
>> > 6 disk raid 5 array.
>> > Any help would really be appreciated!
>> >
>> > Please bear with me as I lay out the steps that got me here:
>>
>> [trim /]
>>
>> Well, you've learned that mdadm --create is not a good idea. /-:
>>
>> However, you did save your pre-re-create --examine reports, and it
>> looks like you've reconstructed correctly.  (Very brief look.)
>>
>> However, you discovered that mdadm's defaults have long since changed
>> to v1.2 superblock, 512k chunks, bitmaps, and a substantially different
>> metadata layout.  In fact, I'm certain your LVM metadata has been
>> damaged by the brief existence of mdadm's v1.2 metadata on your member
>> devices.  Including removal of the LVM magic signature.
>>
>> What you need is a backup of your lvm configuration, which is commonly
>> available in /etc/ of an install, but naturally not available if /etc/
>> was inside this array.  In addition, though, LVM generally writes
>> multiple copies of this backup in its metadata.  And that is likely
>> still there, near the beginning of your array.
>>
>> You should hexdump the first several megabytes of your array looking for
>> LVM's XML formatted configuration.  If you can locate some of those
>> copies, you can probably use dd to extract a copy to a file, then use
>> that with LVM's recovery tools to re-establish all of your LVs.
>>
>> There is a possibility that some of your actual LV content was damaged
>> by the mdadm v1.2 metadata, too, but first recover the LVM setup.
>>
>> Phil
>
>
> That sounds promising, as /etc was not on the array.
> I found a backup in /etc/lvm/backup/lvm-raid (contents shown below).
>
> Unfortunately when I try to use it to restore the LVM I get the
> following error:
> vgcfgrestore -f /etc/lvm/backup/lvm-raid lvm-raid
> Aborting vg_write: No metadata areas to write to!
> Restore failed.
>
> So I guess I also need to recreate the physical volume using:

Correction: (Put the wrong ID in the pvcreate example):
pvcreate --uuid "DWv51O-lg9s-Dl4w-EBp9-QeIF-Vv60-8wt2uS" --restorefile
/etc/lvm/backup/lvm-raid

> Is this correct? (I'm a bit hesitant with another 'create' command as
> you might understand.)
>
> Regards,
>
> Olivier
>
>
> ===============================================================================
> /etc/lvm/backup/lvm-raid
> ===============================================================================
> # Generated by LVM2 version 2.02.133(2) (2015-10-30): Fri Oct 14 15:55:36 2016
>
> contents = "Text Format Volume Group"
> version = 1
>
> description = "Created *after* executing 'vgcfgbackup'"
>
> creation_host = "horus-server"  # Linux horus-server 3.13.0-98-generic
> #145-Ubuntu SMP Sat Oct 8 20:13:07 UTC 2016 x86_64
> creation_time = 1476453336      # Fri Oct 14 15:55:36 2016
>
> lvm-raid {
>         id = "0Esja8-U0EZ-fndQ-vjUq-oIuX-3KgA-uTL6rP"
>         seqno = 8
>         format = "lvm2"                 # informational
>         status = ["RESIZEABLE", "READ", "WRITE"]
>         flags = []
>         extent_size = 524288            # 256 Megabytes
>         max_lv = 0
>         max_pv = 0
>         metadata_copies = 0
>
>         physical_volumes {
>
>                 pv0 {
>                         id = "DWv51O-lg9s-Dl4w-EBp9-QeIF-Vv60-8wt2uS"
>                         device = "/dev/md0"     # Hint only
>
>                         status = ["ALLOCATABLE"]
>                         flags = []
>                         dev_size = 19535144448  # 9.09676 Terabytes
>                         pe_start = 512
>                         pe_count = 37260        # 9.09668 Terabytes
>                 }
>         }
>
>         logical_volumes {
>
>                 lvm0 {
>                         id = "OpWRpy-O4JT-Ua3t-E1A4-2SuN-GLLR-5CFMLh"
>                         status = ["READ", "WRITE", "VISIBLE"]
>                         flags = []
>                         segment_count = 1
>
>                         segment1 {
>                                 start_extent = 0
>                                 extent_count = 37260    # 9.09668 Terabytes
>
>                                 type = "striped"
>                                 stripe_count = 1        # linear
>
>                                 stripes = [
>                                         "pv0", 0
>                                 ]
>                         }
>                 }
>         }
> }

^ permalink raw reply

* Re: [RAID recovery] Unable to recover RAID5 array after disk failure
From: Olivier Swinkels @ 2017-03-06  8:26 UTC (permalink / raw)
  To: Phil Turmel; +Cc: linux-raid
In-Reply-To: <0a6e4a82-665c-6e62-293f-151e41c67842@turmel.org>

On Sun, Mar 5, 2017 at 7:55 PM, Phil Turmel <philip@turmel.org> wrote:
>
> On 03/03/2017 04:35 PM, Olivier Swinkels wrote:
> > Hi,
> >
> > I'm in quite a pickle here. I can't recover from a disk failure on my
> > 6 disk raid 5 array.
> > Any help would really be appreciated!
> >
> > Please bear with me as I lay out the steps that got me here:
>
> [trim /]
>
> Well, you've learned that mdadm --create is not a good idea. /-:
>
> However, you did save your pre-re-create --examine reports, and it
> looks like you've reconstructed correctly.  (Very brief look.)
>
> However, you discovered that mdadm's defaults have long since changed
> to v1.2 superblock, 512k chunks, bitmaps, and a substantially different
> metadata layout.  In fact, I'm certain your LVM metadata has been
> damaged by the brief existence of mdadm's v1.2 metadata on your member
> devices.  Including removal of the LVM magic signature.
>
> What you need is a backup of your lvm configuration, which is commonly
> available in /etc/ of an install, but naturally not available if /etc/
> was inside this array.  In addition, though, LVM generally writes
> multiple copies of this backup in its metadata.  And that is likely
> still there, near the beginning of your array.
>
> You should hexdump the first several megabytes of your array looking for
> LVM's XML formatted configuration.  If you can locate some of those
> copies, you can probably use dd to extract a copy to a file, then use
> that with LVM's recovery tools to re-establish all of your LVs.
>
> There is a possibility that some of your actual LV content was damaged
> by the mdadm v1.2 metadata, too, but first recover the LVM setup.
>
> Phil


That sounds promising, as /etc was not on the array.
I found a backup in /etc/lvm/backup/lvm-raid (contents shown below).

Unfortunately when I try to use it to restore the LVM I get the
following error:
vgcfgrestore -f /etc/lvm/backup/lvm-raid lvm-raid
Aborting vg_write: No metadata areas to write to!
Restore failed.

So I guess I also need to recreate the physical volume using:
pvcreate --uuid "0Esja8-U0EZ-fndQ-vjUq-oIuX-3KgA-uTL6rP" --restorefile
/etc/lvm/backup/lvm-raid
Is this correct? (I'm a bit hesitant with another 'create' command as
you might understand.)

Regards,

Olivier


===============================================================================
/etc/lvm/backup/lvm-raid
===============================================================================
# Generated by LVM2 version 2.02.133(2) (2015-10-30): Fri Oct 14 15:55:36 2016

contents = "Text Format Volume Group"
version = 1

description = "Created *after* executing 'vgcfgbackup'"

creation_host = "horus-server"  # Linux horus-server 3.13.0-98-generic
#145-Ubuntu SMP Sat Oct 8 20:13:07 UTC 2016 x86_64
creation_time = 1476453336      # Fri Oct 14 15:55:36 2016

lvm-raid {
        id = "0Esja8-U0EZ-fndQ-vjUq-oIuX-3KgA-uTL6rP"
        seqno = 8
        format = "lvm2"                 # informational
        status = ["RESIZEABLE", "READ", "WRITE"]
        flags = []
        extent_size = 524288            # 256 Megabytes
        max_lv = 0
        max_pv = 0
        metadata_copies = 0

        physical_volumes {

                pv0 {
                        id = "DWv51O-lg9s-Dl4w-EBp9-QeIF-Vv60-8wt2uS"
                        device = "/dev/md0"     # Hint only

                        status = ["ALLOCATABLE"]
                        flags = []
                        dev_size = 19535144448  # 9.09676 Terabytes
                        pe_start = 512
                        pe_count = 37260        # 9.09668 Terabytes
                }
        }

        logical_volumes {

                lvm0 {
                        id = "OpWRpy-O4JT-Ua3t-E1A4-2SuN-GLLR-5CFMLh"
                        status = ["READ", "WRITE", "VISIBLE"]
                        flags = []
                        segment_count = 1

                        segment1 {
                                start_extent = 0
                                extent_count = 37260    # 9.09668 Terabytes

                                type = "striped"
                                stripe_count = 1        # linear

                                stripes = [
                                        "pv0", 0
                                ]
                        }
                }
        }
}
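
The sizes recorded in the backup above can be cross-checked with a little
arithmetic before attempting any restore. This is only a sketch: it assumes
the values in the file are counted in 512-byte sectors (which matches the
"256 Megabytes" comment next to extent_size), and the struct and function
names are hypothetical, not part of LVM.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_BYTES 512ULL	/* assumed unit of the values in the backup */

/* Hypothetical container for the fields copied from the backup file. */
struct pv_layout {
	uint64_t dev_size;	/* sectors */
	uint64_t pe_start;	/* sectors */
	uint64_t pe_count;	/* number of physical extents */
	uint64_t extent_size;	/* sectors per extent */
};

/* Values taken verbatim from /etc/lvm/backup/lvm-raid as posted. */
static const struct pv_layout backup_pv = {
	.dev_size = 19535144448ULL,
	.pe_start = 512,
	.pe_count = 37260,
	.extent_size = 524288,
};

/* Size of one extent in bytes. */
static uint64_t extent_bytes(const struct pv_layout *pv)
{
	return pv->extent_size * SECTOR_BYTES;
}

/* First sector past the last physical extent. */
static uint64_t pv_data_end(const struct pv_layout *pv)
{
	return pv->pe_start + pv->pe_count * pv->extent_size;
}

/* Non-zero if all extents fit inside the recorded device size. */
static int pv_fits(const struct pv_layout *pv)
{
	return pv_data_end(pv) <= pv->dev_size;
}

/* Sectors left over after the last extent. */
static uint64_t pv_slack_sectors(const struct pv_layout *pv)
{
	assert(pv_fits(pv));	/* avoid unsigned underflow */
	return pv->dev_size - pv_data_end(pv);
}
```

With the posted numbers, one extent is 256 MiB, the extents end at sector
19534971392, and 173056 sectors remain unused at the end of the device, so
the recorded layout is at least self-consistent.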

^ permalink raw reply

* Re: [PATCH 3/3] md/raid5: sort bios
From: NeilBrown @ 2017-03-06  6:40 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Shaohua Li, linux-raid, songliubraving, kernel-team
In-Reply-To: <20170303175909.q2t2hoxsorjhk77k@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 3826 bytes --]

On Fri, Mar 03 2017, Shaohua Li wrote:

> On Fri, Mar 03, 2017 at 02:43:49PM +1100, Neil Brown wrote:
>> On Fri, Feb 17 2017, Shaohua Li wrote:
>> 
>> > Previous patch (raid5: only dispatch IO from raid5d for harddisk raid)
>> > defers IO dispatching. The goal is to create a better IO pattern. At that
>> > time, we didn't sort the deferred IO and hoped the block layer could do IO
>> > merge and sort. Now the raid5-cache writeback could create a large amount
>> > of bios. And if we enable multi-thread for stripe handling, we can't
>> > control when to dispatch IO to raid disks. Much of the time, we are
>> > dispatching IO which the block layer can't merge effectively.
>> >
>> > This patch takes the IO dispatch deferral further. We accumulate
>> > bios, but we don't dispatch all the bios after a threshold is met. This
>> > 'dispatch partial portion of bios' strategy allows bios arriving in a
>> > large time window to be sent to disks together. At the dispatching time,
>> > there is a large chance the block layer can merge the bios. To make this
>> > more effective, we dispatch IO in ascending order. This increases
>> > request merge chance and reduces disk seek.
>> 
>> I can see the benefit of batching and sorting requests.
>> 
>> I wonder if the extra complexity of grouping together 512 requests, then
>> submitting the "first" 128 is really worth it.  Have you measured the
>> value of that?
>
> I'm pretty sure I tried. The whole point of dispatching the first 128 is we
> don't have a better pipeline. Grouping 512 and then dispatching them together
> definitely improves the IO pattern, but the request accumulation takes time,
> and we will have no IO running in that window.

But we don't wait for the first batch before we start collecting the
next batch - do we?  Why would there be a window with no IO running?


>
>> If you just submitted every time you got 512 requests, you could use
>> list_sort() on the bio list and wouldn't need an array.
>> 
>> If an array really is best, it would be really nice if "sort" could pass
>> a 'void*' down to the cmp function,
>> and it could sort all bios that are
>> *after* last_bio_pos first, and then the others.  That would make the
>> code much simpler.  I guess sort() could be changed (list_sort() already
>> has a 'priv' argument like this).
>
> Ok, I'll change this to a list. And add extra pointer to record the last sorted
> entry. I didn't see the sort uses much time in my profile, but the merge sort
> looks better. Will do the change.

I think both sorts are O(N log(N)).
I had thought that list_sort() would work on a bio_list, but it requires
a list_head (even though it doesn't use the prev pointer).
If it worked on a bio_list and if you could just submit the whole batch,
then using list_sort would have meant that you don't need to allocate a
table of r5pending_data.
Now with the struct list_head in there, the data is twice the size.

I guess that doesn't matter too much.

It just feels like there should be a cleaner solution, but I cannot find
it without writing a new sort function (not that it would be so hard to
do that).

Thanks,
NeilBrown
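
The idea of passing a 'priv' pointer down to the compare function, so that
bios at or after last_bio_pos sort first and the rest follow, can be sketched
in userspace as below. This is an illustration under assumptions, not the
raid5 code: it sorts plain sector numbers rather than bios, sector_t,
cmp_after_last() and sort_with_priv() are hypothetical names, and an
insertion sort is used only to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t sector_t;	/* userspace stand-in for the kernel type */

/* Compare function that receives caller context, like list_sort()'s priv. */
typedef int (*cmp_priv_fn)(const void *priv, sector_t a, sector_t b);

/*
 * Order sectors as an elevator starting at *priv (the last dispatched
 * position). Subtracting in unsigned arithmetic wraps sectors below the
 * start to very large keys, so they sort after everything at or above it.
 */
static int cmp_after_last(const void *priv, sector_t a, sector_t b)
{
	sector_t last = *(const sector_t *)priv;
	sector_t ka = a - last;
	sector_t kb = b - last;

	return (ka > kb) - (ka < kb);
}

/* Simple insertion sort that forwards priv to the compare function. */
static void sort_with_priv(sector_t *v, size_t n, cmp_priv_fn cmp,
			   const void *priv)
{
	size_t i, j;

	assert(cmp != NULL);
	for (i = 1; i < n; i++) {
		sector_t key = v[i];

		for (j = i; j > 0 && cmp(priv, v[j - 1], key) > 0; j--)
			v[j] = v[j - 1];
		v[j] = key;
	}
}
```

For example, with a last position of 100, the sectors 50, 900, 10, 300, 120
come out as 120, 300, 900, 10, 50. A single compare function handles both
groups, which is the simplification the 'priv' argument buys.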


>
>> If we cannot change sort(), then maybe use lib/bsearch.c for the binary
>> search.  Performing two comparisons in the loop of a binary search
>> should get a *fail* in any algorithms class!!
>> The "pending_data" array that you have added to the r5conf structure
>> adds 4096 bytes.  This means it is larger than a page, which is best
>> avoided (though it is unlikely to cause problems).  I would allocate it
>> separately.
>
> Yep, already fixed internally.
>
>> 
>> So there is a lot that I don't really like, but it seems like a good
>> idea in principle.
>
> ok, thanks for your time!
>
> Thanks,
> Shaohua
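
On the binary search remark quoted above, a search that performs a single
key comparison per loop iteration might look like the following. It is a
generic sketch over an ascending array of sector numbers, with the
hypothetical name first_after(); it is not taken from the patch or from
lib/bsearch.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return the index of the first element strictly greater than pos in an
 * ascending array, or n if there is none. The half-open range [lo, hi)
 * shrinks by one key comparison per iteration.
 */
static size_t first_after(const uint64_t *sorted, size_t n, uint64_t pos)
{
	size_t lo = 0, hi = n;

	assert(sorted != NULL || n == 0);
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (sorted[mid] <= pos)	/* the only key comparison */
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}
```

The equality case is folded into the "<=" branch instead of being tested
separately, which is what removes the second comparison.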

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [PATCH 2/3] md/raid5-cache: bump flush stripe batch size
From: NeilBrown @ 2017-03-06  6:23 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Shaohua Li, linux-raid, songliubraving, kernel-team
In-Reply-To: <20170303174138.iecwisogdkkdh6iy@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 1974 bytes --]

On Fri, Mar 03 2017, Shaohua Li wrote:

> On Fri, Mar 03, 2017 at 02:03:31PM +1100, Neil Brown wrote:
>> On Fri, Feb 17 2017, Shaohua Li wrote:
>> 
>> > Bump the flush stripe batch size to 2048. For my 12 disks raid
>> > array, the stripes take:
>> > 12 * 4k * 2048 = 96MB
>> >
>> > This is still quite small. A hardware raid card generally has 1GB of
>> > cache, and we suggest the raid5-cache have a similar cache size.
>> >
>> > The advantage of a big batch size is that we can dispatch a lot of IO
>> > at the same time, then do some scheduling to make a better IO pattern.
>> >
>> > The last patch prioritizes stripes, so we don't worry that a big flush
>> > stripe batch will starve normal stripes.
>> >
>> > Signed-off-by: Shaohua Li <shli@fb.com>
>> > ---
>> >  drivers/md/raid5-cache.c | 2 +-
>> >  1 file changed, 1 insertion(+), 1 deletion(-)
>> >
>> > diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
>> > index 3f307be..b25512c 100644
>> > --- a/drivers/md/raid5-cache.c
>> > +++ b/drivers/md/raid5-cache.c
>> > @@ -43,7 +43,7 @@
>> >  /* wake up reclaim thread periodically */
>> >  #define R5C_RECLAIM_WAKEUP_INTERVAL (30 * HZ)
>> >  /* start flush with these full stripes */
>> > -#define R5C_FULL_STRIPE_FLUSH_BATCH 256
>> > +#define R5C_FULL_STRIPE_FLUSH_BATCH 2048
>> 
>> Fixed numbers are warning signs... I wonder if there is something better
>> we could do?   "conf->max_nr_stripes / 4" maybe?  We use that sort of
>> number elsewhere.
>> Would that make sense?
>
> The code where we check the batch size (in r5c_do_reclaim) already has a check:
> total_cached > conf->min_nr_stripes * 1 / 2
> so I think that's ok, no?

I'm not sure what you are saying.

I'm suggesting that we get rid of R5C_FULL_STRIPE_FLUSH_BATCH and use a
number like "conf->max_nr_stripes / 4"
Are you agreeing, or are you saying that you don't think we need to get
rid of R5C_FULL_STRIPE_FLUSH_BATCH??

Thanks,
NeilBrown
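
The arithmetic in the quoted commit message, and the suggestion to derive
the batch from the stripe cache size instead of a fixed constant, can be
sketched as follows. The helper names are hypothetical, the 4k figure is
taken from the message above, and nothing here is the actual raid5-cache
code.

```c
#include <assert.h>
#include <stdint.h>

#define STRIPE_PAGE_BYTES 4096ULL	/* the "4k" per disk from the message */

/* Memory covered by one flush batch: disks * 4k * stripes in the batch. */
static uint64_t flush_batch_bytes(unsigned int raid_disks, unsigned int batch)
{
	assert(raid_disks > 0);
	return (uint64_t)raid_disks * STRIPE_PAGE_BYTES * batch;
}

/*
 * Hypothetical alternative to a fixed R5C_FULL_STRIPE_FLUSH_BATCH: scale
 * the batch with the stripe cache, as in "conf->max_nr_stripes / 4".
 */
static unsigned int flush_batch_from_cache(unsigned int max_nr_stripes)
{
	unsigned int batch = max_nr_stripes / 4;

	return batch ? batch : 1;	/* never return an empty batch */
}
```

For 12 disks, a batch of 2048 covers 96 MiB and the old value of 256 covers
12 MiB. A stripe cache of 8192 entries would give the same batch of 2048
under the scaled rule.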

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [PATCH RFC] test: revise 'test' and make it easier to understand
From: zhilong @ 2017-03-06  3:27 UTC (permalink / raw)
  To: Guoqing Jiang, neilb, Jes.Sorensen; +Cc: linux-raid
In-Reply-To: <58B92400.7010800@suse.de>


On 03/03/2017 04:06 PM, Guoqing Jiang wrote:
>
>
> On 02/28/2017 10:47 AM, Zhilong Liu wrote:
>> 1. use 'Tab' as the code style.
>> 2. arrange the testing steps and provide the 'main' entrance.
>> 3. draft the log_save feature; it captures /proc/mdstat,
>>     md superblock info, bitmap info and the detailed dmesg.
>> 4. modified the mdadm() func, adding an operation that clears
>>     the superblock when creating or building a new array; it
>>     exits testing when the mdadm command returns a non-0 value.
>> 5. delete the no_errors() func; it is only used in tests/04update-uuid,
>>     and I recommend using the new mdadm() instead.
>> 6. delete fast_sync() func.
>> 7. testdev(): add a check for the object file, otherwise this command
>>     would create a regular file, which causes trouble.
>> 8. add dmesg checking in do_test() func; it's necessary to check
>>     whether dmesg printed any abnormal message.
>> 9. add checking conditions in main(), such as that $pwd/raid6check
>>     exists, with a prompt reminding users to 'make everything'
>>     before testing; $targetdir should be mounted on an ext[2-4] FS,
>>     because the external bitmap only supports ext: the bmap() API
>>     used by bitmap.c doesn't exist in all filesystems, such as btrfs.
>>
>
> I like the improvement for the test, and I would suggest you split
> those changes into smaller patches, make each patch do one thing,
> it would be easier for Jes to review I think, and you still can merge
> them into one finally if Jes prefer one patch with huge changes, :-) .
>

Copy that, really appreciate this nice point.

Thanks,
-Zhilong
> Cheers,
> Guoqing


^ permalink raw reply

* [PATCH] mdadm:add man page for --symlinks
From: Zhilong Liu @ 2017-03-06  2:39 UTC (permalink / raw)
  To: Jes.Sorensen; +Cc: linux-raid, Zhilong Liu

In build and create mode:
--symlinks
	Auto creation of symlinks in /dev to /dev/md, option --symlinks
	must be 'no' or 'yes' and work with --create and --build.
In assemble mode:
--symlinks
	See this option under Create and Build options.

Signed-off-by: Zhilong Liu <zlliu@suse.com>

diff --git a/mdadm.8.in b/mdadm.8.in
index 1e4f91d..df1d460 100644
--- a/mdadm.8.in
+++ b/mdadm.8.in
@@ -1015,6 +1015,11 @@ simultaneously. If not specified, this defaults to 4.
 Specify journal device for the RAID-4/5/6 array. The journal device
 should be a SSD with reasonable lifetime.
 
+.TP
+.BR \-\-symlinks
+Auto creation of symlinks in /dev to /dev/md, option --symlinks must
+be 'no' or 'yes' and work with --create and --build.
+
 
 .SH For assemble:
 
@@ -1291,6 +1296,10 @@ Reshape can be continued later using the
 .B \-\-continue
 option for the grow command.
 
+.TP
+.BR \-\-symlinks
+See this option under Create and Build options.
+
 .SH For Manage mode:
 
 .TP
-- 
2.6.6


^ permalink raw reply related

* When will Linux support M2 on RAID ?
From: David F. @ 2017-03-06  2:09 UTC (permalink / raw)
  To: linux-kernel, linux-raid@vger.kernel.org

More and more systems are coming with M2 on RAID and Linux doesn't
work unless you change the system out of RAID mode.  This is becoming
more and more of a problem.   What is the status of Linux support for
the new systems?

TIA!!

^ permalink raw reply

* Re: [RAID recovery] Unable to recover RAID5 array after disk failure
From: Phil Turmel @ 2017-03-05 18:55 UTC (permalink / raw)
  To: Olivier Swinkels, linux-raid
In-Reply-To: <CAJ0QwkJqRbJtM9HiXL+cCj2TkQQRahGgWTwD9QKH_CBfbJeLKA@mail.gmail.com>

On 03/03/2017 04:35 PM, Olivier Swinkels wrote:
> Hi,
> 
> I'm in quite a pickle here. I can't recover from a disk failure on my
> 6 disk raid 5 array.
> Any help would really be appreciated!
> 
> Please bear with me as I lay out the steps that got me here:

[trim /]

Well, you've learned that mdadm --create is not a good idea. /-:

However, you did save your pre-re-create --examine reports, and it
looks like you've reconstructed correctly.  (Very brief look.)

However, you discovered that mdadm's defaults have long since changed
to v1.2 superblock, 512k chunks, bitmaps, and a substantially different
metadata layout.  In fact, I'm certain your LVM metadata has been
damaged by the brief existence of mdadm's v1.2 metadata on your member
devices.  Including removal of the LVM magic signature.

What you need is a backup of your lvm configuration, which is commonly
available in /etc/ of an install, but naturally not available if /etc/
was inside this array.  In addition, though, LVM generally writes
multiple copies of this backup in its metadata.  And that is likely
still there, near the beginning of your array.

You should hexdump the first several megabytes of your array looking for
LVM's XML formatted configuration.  If you can locate some of those
copies, you can probably use dd to extract a copy to a file, then use
that with LVM's recovery tools to re-establish all of your LVs.

There is a possibility that some of your actual LV content was damaged
by the mdadm v1.2 metadata, too, but first recover the LVM setup.

Phil

^ permalink raw reply

* Re: Bit-Rot
From: Mikael Abrahamsson @ 2017-03-05  8:15 UTC (permalink / raw)
  To: Gandalf Corvotempesta; +Cc: Anthony Youngman, linux-raid
In-Reply-To: <CAJH6TXimNQ8raxBCT=aLhzLzRnbMh8adcF21BrOqp6G-cajLpw@mail.gmail.com>

On Fri, 3 Mar 2017, Gandalf Corvotempesta wrote:

> This is what ZFS does. This is what Gluster does. This is what Btrfs does.
> Adding this in mdadm could be an interesting feature.

This has been discussed several times. Yes, it would be interesting. It's 
not easy to do because mdadm maps 4k blocks to 4k blocks. Only way to 
"easily" add this I imagine, would be to have an additional "checksum" 
block, so that raid6 would require 3 extra drives instead of 2.

The answer historically has been "patches welcome".

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se

^ permalink raw reply

* Re: Bit-Rot
From: Chris Murphy @ 2017-03-05  6:01 UTC (permalink / raw)
  To: Anthony Youngman; +Cc: Gandalf Corvotempesta, Linux-RAID
In-Reply-To: <dce0bbfd-ea6b-4ed8-668a-c883c2380aca@youngman.org.uk>

On Fri, Mar 3, 2017 at 3:16 PM, Anthony Youngman
<antlists@youngman.org.uk> wrote:
>
>
> On 03/03/17 21:54, Gandalf Corvotempesta wrote:
>>
>> 2017-03-03 22:41 GMT+01:00 Anthony Youngman <antlists@youngman.org.uk>:
>>>
>>> Isn't that what raid 5 does?
>>
>>
>> nothing to do with raid-5
>>
>>> Actually, iirc, it doesn't read every stripe and check parity on a read,
>>> because it would clobber performance. But I guess you could have a switch
>>> to
>>> turn it on. It's unlikely to achieve anything.
>>>
>>> Barring bugs in the firmware, it's pretty near 100% that a drive will
>>> either
>>> return what was written, or return a read error. Drives don't return dud
>>> data, they have quite a lot of error correction built in.
>>
>>
>> This is wrong.
>> Sometimes drives return data differently from what was stored, or,
>> store data differently from the original.
>> In this case, if real data is "1" and you store "0", when you read
>> "0", no read error is made, but data is still corrupted.
>
>
> Do you have any figures? I didn't say it can't happen, I just said it was
> very unlikely.

Torn and misdirected writes do happen. There are a bunch of papers on
this problem indicating it's real. This and various other sources of
silent corruption are why ZFS and Btrfs exist.

>>
>>
>> With a bit-rot prevention this could be fixed, you checksum "1" from
>> the source, write that to disks and if you read back "0", the checksum
>> would be invalid.
>
>
> Or you just read the raid5 parity (which I don't think, by default, is what
> happens). That IS your checksum. So if you think the performance hit is
> worth it, write the code to add it, and turn it on. Not only will it detect
> a bit-flip, but it will tell you which bit flipped, and let you correct it.

Parity isn't a checksum. Using it in this fashion is expensive because
it means computing parity for all reads, and means you can't do
partial stripe reads anymore. Next, even once you get a mismatch it's
ambiguous which strip (mdadm chunk) is corrupt. That'd normally be
exposed by the drive reporting an explicit read error. Since that
doesn't exist you'd have to fake "fail" each strip, rebuild from
parity, and compare.

>>
>>
>> This is what ZFS does. This is what Gluster does. This is what Btrfs does.
>> Adding this in mdadm could be an interesting feature.
>>
> Well, seeing as I understand btrfs doesn't do raid5, only raid1, then of
> course it needs some way of detecting whether a mirror is corrupt. I don't
> know about gluster or ZFS. (I believe raid5/btrfs is currently experimental,
> and dangerous.)

Btrfs supports raid1, 10, 5 and 6. It's reasonable to consider raid56
experimental because it has a number of gotchas, not least of which is
that certain kinds of writes are not COW, so the COW
safeguards don't always apply in a power failure. As for dangerous,
opinions vary, but probably something everyone can agree on is that any
ambiguity about the stability of a file system looks bad.

> But the question remains - is the effort worth it?

That's the central question. And to answer it, you'd need some sort of
rough design. Where are the csums going to be stored? Do you update
data strips before or after the csums? Either way, if this is not COW,
you have a moment of complete mismatch between data and csums, with
live data. So... that's a big problem actually. And if you have a
crash or power failure during writes, it's an even bigger problem. Do
you csum the parity?

-- 
Chris Murphy


* [PATCH V2] md/raid5: sort bios
From: Shaohua Li @ 2017-03-04  6:06 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb, songliubraving

The previous patch (raid5: only dispatch IO from raid5d for harddisk raid)
defers IO dispatching. The goal is to create a better IO pattern. At that
time, we didn't sort the deferred IO and hoped the block layer could merge
and sort it. Now raid5-cache writeback can create a large number of bios.
And if we enable multi-threading for stripe handling, we can't control
when IO is dispatched to the raid disks. Much of the time, we dispatch IO
that the block layer can't merge effectively.

This patch takes the deferred IO dispatching further. We accumulate
bios, but we don't dispatch all of them once a threshold is met. This
'dispatch a partial portion of bios' strategy allows bios arriving over a
large time window to be sent to disks together. At dispatch time, there
is a good chance the block layer can merge the bios. To make this more
effective, we dispatch IO in ascending order. This increases the chance
of request merging and reduces disk seeks.
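
The accumulate/sort/partial-dispatch idea can be modelled in userspace
roughly as below. This is only a sketch under simplifying assumptions, not
the code in the diff: it tracks bare sector numbers in an array instead of
bio lists, the demo_* names and the small DEMO_* limits are hypothetical
stand-ins for PENDING_IO_MAX and PENDING_IO_ONE_FLUSH, and it always
dispatches the lowest sectors rather than remembering where the previous
batch stopped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_PENDING_MAX 8  /* queue depth that triggers a partial flush */
#define DEMO_ONE_FLUSH   4  /* entries dispatched per partial flush */

struct demo_queue {
    uint64_t pending[DEMO_PENDING_MAX];  /* stripe sectors waiting */
    size_t   count;
};

/* Ascending sector order, same three-way contract as cmp_stripe(). */
static int cmp_sector(const void *a, const void *b)
{
    uint64_t sa = *(const uint64_t *)a;
    uint64_t sb = *(const uint64_t *)b;

    return (sa > sb) - (sa < sb);
}

/*
 * Sort the pending sectors, move up to 'target' of the lowest ones into
 * out[] and keep the rest queued. Returns the number dispatched.
 */
static size_t demo_dispatch(struct demo_queue *q, size_t target,
                            uint64_t *out)
{
    size_t n = target < q->count ? target : q->count;

    qsort(q->pending, q->count, sizeof(q->pending[0]), cmp_sector);
    memcpy(out, q->pending, n * sizeof(out[0]));
    memmove(q->pending, q->pending + n,
            (q->count - n) * sizeof(q->pending[0]));
    q->count -= n;
    return n;
}

/*
 * Queue one sector. Nothing is dispatched until the queue is full; then
 * only DEMO_ONE_FLUSH entries go out, in ascending order.
 */
static size_t demo_defer(struct demo_queue *q, uint64_t sector,
                         uint64_t *out)
{
    assert(q->count < DEMO_PENDING_MAX);
    q->pending[q->count++] = sector;
    if (q->count >= DEMO_PENDING_MAX)
        return demo_dispatch(q, DEMO_ONE_FLUSH, out);
    return 0;
}
```

Feeding eight out-of-order sectors through demo_defer() returns 0 for the
first seven calls; the eighth hands back the four lowest sectors in
ascending order and leaves the other four queued, so later arrivals can
still be sorted in with them.
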

Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/raid5.c | 138 +++++++++++++++++++++++++++++++++++++++++++----------
 drivers/md/raid5.h |  14 +++++-
 2 files changed, 126 insertions(+), 26 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index b58cbb4..3921c77 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -56,6 +56,7 @@
 #include <linux/nodemask.h>
 #include <linux/flex_array.h>
 #include <trace/events/block.h>
+#include <linux/list_sort.h>
 
 #include "md.h"
 #include "raid5.h"
@@ -876,41 +877,107 @@ static int use_new_offset(struct r5conf *conf, struct stripe_head *sh)
 	return 1;
 }
 
-static void flush_deferred_bios(struct r5conf *conf)
+static void dispatch_bio_list(struct bio_list *tmp)
 {
-	struct bio_list tmp;
 	struct bio *bio;
 
-	if (!conf->batch_bio_dispatch || !conf->group_cnt)
+	while ((bio = bio_list_pop(tmp)))
+		generic_make_request(bio);
+}
+
+static int cmp_stripe(void *priv, struct list_head *a, struct list_head *b)
+{
+	const struct r5pending_data *da = list_entry(a,
+				struct r5pending_data, sibling);
+	const struct r5pending_data *db = list_entry(b,
+				struct r5pending_data, sibling);
+	if (da->sector > db->sector)
+		return 1;
+	if (da->sector < db->sector)
+		return -1;
+	return 0;
+}
+
+static void dispatch_defer_bios(struct r5conf *conf, int target,
+				struct bio_list *list)
+{
+	struct r5pending_data *data;
+	struct list_head *first, *next = NULL;
+	int cnt = 0;
+
+	if (conf->pending_data_cnt == 0)
+		return;
+
+	list_sort(NULL, &conf->pending_list, cmp_stripe);
+
+	first = conf->pending_list.next;
+
+	/* temporarily move the head */
+	if (conf->next_pending_data)
+		list_move_tail(&conf->pending_list,
+				&conf->next_pending_data->sibling);
+
+	while (!list_empty(&conf->pending_list)) {
+		data = list_first_entry(&conf->pending_list,
+			struct r5pending_data, sibling);
+		if (&data->sibling == first)
+			first = data->sibling.next;
+		next = data->sibling.next;
+
+		bio_list_merge(list, &data->bios);
+		list_move(&data->sibling, &conf->free_list);
+		cnt++;
+		if (cnt >= target)
+			break;
+	}
+	conf->pending_data_cnt -= cnt;
+	BUG_ON(conf->pending_data_cnt < 0 || cnt < target);
+
+	if (next != &conf->pending_list)
+		conf->next_pending_data = list_entry(next,
+				struct r5pending_data, sibling);
+	else
+		conf->next_pending_data = NULL;
+	/* list isn't empty */
+	if (first != &conf->pending_list)
+		list_move_tail(&conf->pending_list, first);
+}
+
+static void flush_deferred_bios(struct r5conf *conf)
+{
+	struct bio_list tmp = BIO_EMPTY_LIST;
+
+	if (conf->pending_data_cnt == 0)
 		return;
 
-	bio_list_init(&tmp);
 	spin_lock(&conf->pending_bios_lock);
-	bio_list_merge(&tmp, &conf->pending_bios);
-	bio_list_init(&conf->pending_bios);
+	dispatch_defer_bios(conf, conf->pending_data_cnt, &tmp);
+	BUG_ON(conf->pending_data_cnt != 0);
 	spin_unlock(&conf->pending_bios_lock);
 
-	while ((bio = bio_list_pop(&tmp)))
-		generic_make_request(bio);
+	dispatch_bio_list(&tmp);
 }
 
-static void defer_bio_issue(struct r5conf *conf, struct bio *bio)
+static void defer_issue_bios(struct r5conf *conf, sector_t sector,
+				struct bio_list *bios)
 {
-	/*
-	 * change group_cnt will drain all bios, so this is safe
-	 *
-	 * A read generally means a read-modify-write, which usually means a
-	 * randwrite, so we don't delay it
-	 */
-	if (!conf->batch_bio_dispatch || !conf->group_cnt ||
-	    bio_op(bio) == REQ_OP_READ) {
-		generic_make_request(bio);
-		return;
-	}
+	struct bio_list tmp = BIO_EMPTY_LIST;
+	struct r5pending_data *ent;
+
 	spin_lock(&conf->pending_bios_lock);
-	bio_list_add(&conf->pending_bios, bio);
+	ent = list_first_entry(&conf->free_list, struct r5pending_data,
+							sibling);
+	list_move_tail(&ent->sibling, &conf->pending_list);
+	ent->sector = sector;
+	bio_list_init(&ent->bios);
+	bio_list_merge(&ent->bios, bios);
+	conf->pending_data_cnt++;
+	if (conf->pending_data_cnt >= PENDING_IO_MAX)
+		dispatch_defer_bios(conf, PENDING_IO_ONE_FLUSH, &tmp);
+
 	spin_unlock(&conf->pending_bios_lock);
-	md_wakeup_thread(conf->mddev->thread);
+
+	dispatch_bio_list(&tmp);
 }
 
 static void
@@ -923,6 +990,8 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
 	struct r5conf *conf = sh->raid_conf;
 	int i, disks = sh->disks;
 	struct stripe_head *head_sh = sh;
+	struct bio_list pending_bios = BIO_EMPTY_LIST;
+	bool should_defer;
 
 	might_sleep();
 
@@ -939,6 +1008,8 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
 		}
 	}
 
+	should_defer = conf->batch_bio_dispatch && conf->group_cnt;
+
 	for (i = disks; i--; ) {
 		int op, op_flags = 0;
 		int replace_only = 0;
@@ -1093,7 +1164,10 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
 				trace_block_bio_remap(bdev_get_queue(bi->bi_bdev),
 						      bi, disk_devt(conf->mddev->gendisk),
 						      sh->dev[i].sector);
-			defer_bio_issue(conf, bi);
+			if (should_defer && op_is_write(op))
+				bio_list_add(&pending_bios, bi);
+			else
+				generic_make_request(bi);
 		}
 		if (rrdev) {
 			if (s->syncing || s->expanding || s->expanded
@@ -1138,7 +1212,10 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
 				trace_block_bio_remap(bdev_get_queue(rbi->bi_bdev),
 						      rbi, disk_devt(conf->mddev->gendisk),
 						      sh->dev[i].sector);
-			defer_bio_issue(conf, rbi);
+			if (should_defer && op_is_write(op))
+				bio_list_add(&pending_bios, rbi);
+			else
+				generic_make_request(rbi);
 		}
 		if (!rdev && !rrdev) {
 			if (op_is_write(op))
@@ -1156,6 +1233,9 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
 		if (sh != head_sh)
 			goto again;
 	}
+
+	if (should_defer && !bio_list_empty(&pending_bios))
+		defer_issue_bios(conf, head_sh->sector, &pending_bios);
 }
 
 static struct dma_async_tx_descriptor *
@@ -6675,6 +6755,7 @@ static void free_conf(struct r5conf *conf)
 			put_page(conf->disks[i].extra_page);
 	kfree(conf->disks);
 	kfree(conf->stripe_hashtbl);
+	kfree(conf->pending_data);
 	kfree(conf);
 }
 
@@ -6784,6 +6865,14 @@ static struct r5conf *setup_conf(struct mddev *mddev)
 	conf = kzalloc(sizeof(struct r5conf), GFP_KERNEL);
 	if (conf == NULL)
 		goto abort;
+	INIT_LIST_HEAD(&conf->free_list);
+	INIT_LIST_HEAD(&conf->pending_list);
+	conf->pending_data = kzalloc(sizeof(struct r5pending_data) *
+		PENDING_IO_MAX, GFP_KERNEL);
+	if (!conf->pending_data)
+		goto abort;
+	for (i = 0; i < PENDING_IO_MAX; i++)
+		list_add(&conf->pending_data[i].sibling, &conf->free_list);
 	/* Don't enable multi-threading by default*/
 	if (!alloc_thread_groups(conf, 0, &group_cnt, &worker_cnt_per_group,
 				 &new_group)) {
@@ -6808,7 +6897,6 @@ static struct r5conf *setup_conf(struct mddev *mddev)
 	atomic_set(&conf->active_stripes, 0);
 	atomic_set(&conf->preread_active_stripes, 0);
 	atomic_set(&conf->active_aligned_reads, 0);
-	bio_list_init(&conf->pending_bios);
 	spin_lock_init(&conf->pending_bios_lock);
 	conf->batch_bio_dispatch = true;
 	rdev_for_each(rdev, mddev) {
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 6b9d2e8..985cdc4 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -572,6 +572,14 @@ enum r5_cache_state {
 				 */
 };
 
+#define PENDING_IO_MAX 512
+#define PENDING_IO_ONE_FLUSH 128
+struct r5pending_data {
+	struct list_head sibling;
+	sector_t sector; /* stripe sector */
+	struct bio_list bios;
+};
+
 struct r5conf {
 	struct hlist_head	*stripe_hashtbl;
 	/* only protect corresponding hash list and inactive_list */
@@ -689,9 +697,13 @@ struct r5conf {
 	int			worker_cnt_per_group;
 	struct r5l_log		*log;
 
-	struct bio_list		pending_bios;
 	spinlock_t		pending_bios_lock;
 	bool			batch_bio_dispatch;
+	struct r5pending_data	*pending_data;
+	struct list_head	free_list;
+	struct list_head	pending_list;
+	int			pending_data_cnt;
+	struct r5pending_data	*next_pending_data;
 };
 
 
-- 
2.9.3



* Re: Bit-Rot
From: Anthony Youngman @ 2017-03-03 22:16 UTC (permalink / raw)
  To: Gandalf Corvotempesta; +Cc: linux-raid
In-Reply-To: <CAJH6TXimNQ8raxBCT=aLhzLzRnbMh8adcF21BrOqp6G-cajLpw@mail.gmail.com>



On 03/03/17 21:54, Gandalf Corvotempesta wrote:
> 2017-03-03 22:41 GMT+01:00 Anthony Youngman <antlists@youngman.org.uk>:
>> Isn't that what raid 5 does?
>
> nothing to do with raid-5
>
>> Actually, iirc, it doesn't read every stripe and check parity on a read,
>> because it would clobber performance. But I guess you could have a switch to
>> turn it on. It's unlikely to achieve anything.
>>
>> Barring bugs in the firmware, it's pretty near 100% that a drive will either
>> return what was written, or return a read error. Drives don't return dud
>> data, they have quite a lot of error correction built in.
>
> This is wrong.
> Sometimes drives return data differently from what was stored, or,
> store data differently from the original.
> In this case, if real data is "1" and you store "0", when you read
> "0", no read error is made, but data is still corrupted.

Do you have any figures? I didn't say it can't happen, I just said it 
was very unlikely.
>
> With a bit-rot prevention this could be fixed, you checksum "1" from
> the source, write that to disks and if you read back "0", the checksum
> would be invalid.

Or you just read the raid5 parity (which I don't think, by default, is 
what happens). That IS your checksum. So if you think the performance 
hit is worth it, write the code to add it, and turn it on. Not only will 
it detect a bit-flip, but it will tell you which bit flipped, and let 
you correct it.
>
> This is what ZFS does. This is what Gluster does. This is what Btrfs does.
> Adding this in mdadm could be an interesting feature.
>
Well, seeing as I understand btrfs doesn't do raid5, only raid1, then of 
course it needs some way of detecting whether a mirror is corrupt. I 
don't know about gluster or ZFS. (I believe raid5/btrfs is currently 
experimental, and dangerous.)

But the question remains - is the effort worth it?

Can I refer you to a very interesting article on LWN? About git, which 
assumes that "if hash(A) == hash(B) then A == B". And how that was 
actually MORE accurate than "if (memcmp(A, B) == 0) then A == B".

Cheers,
Wol


* Re: Bit-Rot
From: Gandalf Corvotempesta @ 2017-03-03 21:54 UTC (permalink / raw)
  To: Anthony Youngman; +Cc: linux-raid
In-Reply-To: <95588e39-f42e-54d4-e7ff-a70af51a77c1@youngman.org.uk>

2017-03-03 22:41 GMT+01:00 Anthony Youngman <antlists@youngman.org.uk>:
> Isn't that what raid 5 does?

nothing to do with raid-5

> Actually, iirc, it doesn't read every stripe and check parity on a read,
> because it would clobber performance. But I guess you could have a switch to
> turn it on. It's unlikely to achieve anything.
>
> Barring bugs in the firmware, it's pretty near 100% that a drive will either
> return what was written, or return a read error. Drives don't return dud
> data, they have quite a lot of error correction built in.

This is wrong.
Sometimes drives return data that differs from what was stored, or
store data that differs from the original.
In that case, if the real data is "1" and the drive stores "0", then when
you read "0" no read error is raised, but the data is still corrupted.

With bit-rot prevention this could be fixed: you checksum "1" from
the source, write that to disk, and if you read back "0", the checksum
would be invalid.
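
A minimal sketch of that idea, assuming a made-up in-memory "block" with
its checksum stored alongside it (stored_block, block_write and block_read
are hypothetical names, and this is not how mdadm, ZFS or Btrfs lay things
out on disk). The checksum is a plain bitwise CRC-32 taken when the block
is written and checked again when it is read:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_LEN 16  /* illustrative; a real block would be 4K or more */

struct stored_block {
    uint8_t  data[BLOCK_LEN];
    uint32_t csum;  /* kept next to the data only to keep the demo short */
};

/* Plain bitwise CRC-32 (reflected polynomial 0xEDB88320). */
static uint32_t crc32_simple(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Checksum the source data, then store data and checksum together. */
static void block_write(struct stored_block *blk, const uint8_t *src)
{
    assert(blk && src);
    memcpy(blk->data, src, BLOCK_LEN);
    blk->csum = crc32_simple(src, BLOCK_LEN);
}

/*
 * Returns 0 and copies the data to dst if the stored checksum still
 * matches, or -1 if the block changed behind our back.
 */
static int block_read(const struct stored_block *blk, uint8_t *dst)
{
    if (crc32_simple(blk->data, BLOCK_LEN) != blk->csum)
        return -1;
    memcpy(dst, blk->data, BLOCK_LEN);
    return 0;
}
```

If a single bit of blk.data is flipped after block_write(), block_read()
returns -1 instead of handing back the bad data. Detection is all this
gives you; repairing the block still needs a good copy from a mirror or
from parity.
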

This is what ZFS does. This is what Gluster does. This is what Btrfs does.
Adding this in mdadm could be an interesting feature.


* Re: Bit-Rot
From: Anthony Youngman @ 2017-03-03 21:41 UTC (permalink / raw)
  To: Gandalf Corvotempesta, linux-raid
In-Reply-To: <CAJH6TXgimQ9Ru58GCvaxqTZrYE9yDYh9osq2tEP_sLTfV0gQmg@mail.gmail.com>

On 03/03/17 21:25, Gandalf Corvotempesta wrote:
> Hi to all
> Wouldn't it be possible to add some sort of bitrot detection to mdadm?
> I know that MD is working on blocks and not files, but checksumming a
> block should still be possible.
>
> For example, if you read some blocks with dd, you can still hash the
> content and verify it on the next read/consistency check

Isn't that what raid 5 does?

Actually, iirc, it doesn't read every stripe and check parity on a read, 
because it would clobber performance. But I guess you could have a 
switch to turn it on. It's unlikely to achieve anything.

Barring bugs in the firmware, it's pretty near 100% that a drive will 
either return what was written, or return a read error. Drives don't 
return dud data, they have quite a lot of error correction built in.

Cheers,
Wol


* [RAID recovery] Unable to recover RAID5 array after disk failure
From: Olivier Swinkels @ 2017-03-03 21:35 UTC (permalink / raw)
  To: linux-raid

Hi,

I'm in quite a pickle here. I can't recover from a disk failure on my
6 disk raid 5 array.
Any help would really be appreciated!

Please bear with me as I lay out the steps that got me here:

- I got a message my raid went down as 3 disks seemed to have failed.
I've dealt with this before and usually it meant that one disk failed
and took out the complete SATA controller.

- 1 of the disks was quite old and the 2 others quite new (<1 year).
So I removed the old drive and the controller came up again. I tried to
reassemble the RAID using:   sudo mdadm -v --assemble --force /dev/md0
 /dev/sdb /dev/sdc /dev/sdg /dev/sdf /dev/sde

- However I got the message :
mdadm: /dev/md0 assembled from 4 drives - not enough to start the array.

- This got me worried and this was the place I screwed up:

- Against the recommendations on the wiki I tried to recover the RAID
using a re-create:
sudo mdadm --verbose --create --assume-clean --level=5
--raid-devices=6 /dev/md0 /dev/sdb /dev/sdc missing /dev/sdg /dev/sdf
/dev/sde

- The second error I made was I forgot to add the correct superblock
version and chunksize.

- The resulting RAID did not seem correct as I couldn't find the LVM
which should be there.

- Subsequently the SATA controller went down again, so my assumption
on the failed disk was also incorrect and I disconnected the wrong
disk.

- After some trial and error I found out one of the newer disks was the
culprit and I tried to recover the RAID by re-creating the array with
the healthy disks and the correct superblock configuration using:
sudo mdadm --verbose --create --bitmap=none --chunk=64 --metadata=0.90
--assume-clean --level=5 --raid-devices=6 /dev/md0 /dev/sdb missing
/dev/sdc /dev/sdf /dev/sde /dev/sdd

- This gives me a degraded array, but unfortunately the LVM is still
not available.

- Is this situation still rescue-able?
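
For reference, the arithmetic behind the "not enough to start the array"
message and the sizes in the dumps below can be sketched as follows. This
is a rough model with hypothetical helper names, not mdadm code. It assumes
that raid5 tolerates exactly one missing member, that usable capacity is
(members - 1) times the per-device size, and that "left-symmetric"
(algorithm 2 in /proc/mdstat) rotates parity as I understand the md layout.
The last part is why the slot order given to --create matters: the same
disks in a different order put data chunks in different places.

```c
#include <assert.h>
#include <stdint.h>

/* raid5 survives one missing member, so it needs at least n - 1 to start. */
static unsigned raid5_min_members(unsigned raid_devices)
{
    assert(raid_devices >= 2);
    return raid_devices - 1;
}

static int raid5_can_start(unsigned present, unsigned raid_devices)
{
    return present >= raid5_min_members(raid_devices);
}

/* One device's worth of space goes to parity; sizes are in 1K blocks. */
static uint64_t raid5_array_size(uint64_t used_dev_size,
                                 unsigned raid_devices)
{
    return used_dev_size * (raid_devices - 1);
}

/*
 * Left-symmetric: parity starts on the last slot and moves one slot to
 * the left with every stripe.
 */
static unsigned ls_parity_disk(uint64_t stripe, unsigned raid_devices)
{
    return (raid_devices - 1) - (unsigned)(stripe % raid_devices);
}

/* Data chunk 'data_idx' of a stripe sits right after the parity slot. */
static unsigned ls_data_disk(uint64_t stripe, unsigned data_idx,
                             unsigned raid_devices)
{
    return (ls_parity_disk(stripe, raid_devices) + 1 + data_idx)
           % raid_devices;
}
```

With 6 raid devices, raid5_can_start(4, 6) is false and
raid5_can_start(5, 6) is true, and raid5_array_size(1953514496, 6) gives
9767572480, matching the Used Dev Size and Array Size reported by mdadm.
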


===============================================================================
===============================================================================
- Below is the output of "mdadm --examine /dev/sd*" BEFORE the first
create action.

/dev/sdb:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 7af7d0ad:b37b1b49:151db09a:a68c27d9 (local to host
horus-server)
  Creation Time : Sun Apr 10 17:59:16 2011
     Raid Level : raid5
  Used Dev Size : 1953514496 (1863.02 GiB 2000.40 GB)
     Array Size : 9767572480 (9315.08 GiB 10001.99 GB)
   Raid Devices : 6
  Total Devices : 3
Preferred Minor : 0

    Update Time : Fri Feb 24 16:31:02 2017
          State : clean
 Active Devices : 3
Working Devices : 3
 Failed Devices : 3
  Spare Devices : 0
       Checksum : ae0d0dec - correct
         Events : 51108

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     0       8       16        0      active sync   /dev/sdb

   0     0       8       16        0      active sync   /dev/sdb
   1     1       0        0        1      active sync
   2     2       0        0        2      faulty removed
   3     3       8       96        3      active sync   /dev/sdg
   4     4       8       80        4      active sync   /dev/sdf
   5     5       0        0        5      faulty removed
/dev/sdc:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 7af7d0ad:b37b1b49:151db09a:a68c27d9 (local to host
horus-server)
  Creation Time : Sun Apr 10 17:59:16 2011
     Raid Level : raid5
  Used Dev Size : 1953514496 (1863.02 GiB 2000.40 GB)
     Array Size : 9767572480 (9315.08 GiB 10001.99 GB)
   Raid Devices : 6
  Total Devices : 6
Preferred Minor : 0

    Update Time : Fri Feb 24 02:01:04 2017
          State : clean
 Active Devices : 6
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 0
       Checksum : ae0c42ac - correct
         Events : 51108

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     1       8       32        1      active sync   /dev/sdc

   0     0       8       16        0      active sync   /dev/sdb
   1     1       8       32        1      active sync   /dev/sdc
   2     2       8       48        2      active sync   /dev/sdd
   3     3       8       96        3      active sync   /dev/sdg
   4     4       8       80        4      active sync   /dev/sdf
   5     5       8       64        5      active sync   /dev/sde
/dev/sde:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 7af7d0ad:b37b1b49:151db09a:a68c27d9 (local to host
horus-server)
  Creation Time : Sun Apr 10 17:59:16 2011
     Raid Level : raid5
  Used Dev Size : 1953514496 (1863.02 GiB 2000.40 GB)
     Array Size : 9767572480 (9315.08 GiB 10001.99 GB)
   Raid Devices : 6
  Total Devices : 6
Preferred Minor : 0

    Update Time : Fri Feb 24 02:01:04 2017
          State : clean
 Active Devices : 6
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 0
       Checksum : ae0c42c0 - correct
         Events : 51088

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     5       8       64        5      active sync   /dev/sde

   0     0       8       16        0      active sync   /dev/sdb
   1     1       8       32        1      active sync   /dev/sdc
   2     2       8       48        2      active sync   /dev/sdd
   3     3       8       96        3      active sync   /dev/sdg
   4     4       8       80        4      active sync   /dev/sdf
   5     5       8       64        5      active sync   /dev/sde
/dev/sdf:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 7af7d0ad:b37b1b49:151db09a:a68c27d9 (local to host
horus-server)
  Creation Time : Sun Apr 10 17:59:16 2011
     Raid Level : raid5
  Used Dev Size : 1953514496 (1863.02 GiB 2000.40 GB)
     Array Size : 9767572480 (9315.08 GiB 10001.99 GB)
   Raid Devices : 6
  Total Devices : 3
Preferred Minor : 0

    Update Time : Fri Feb 24 16:31:02 2017
          State : clean
 Active Devices : 3
Working Devices : 3
 Failed Devices : 3
  Spare Devices : 0
       Checksum : ae0d0e37 - correct
         Events : 51108

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     4       8       80        4      active sync   /dev/sdf

   0     0       8       16        0      active sync   /dev/sdb
   1     1       0        0        1      faulty removed
   2     2       0        0        2      faulty removed
   3     3       8       96        3      active sync   /dev/sdg
   4     4       8       80        4      active sync   /dev/sdf
   5     5       0        0        5      faulty removed
/dev/sdg:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 7af7d0ad:b37b1b49:151db09a:a68c27d9 (local to host
horus-server)
  Creation Time : Sun Apr 10 17:59:16 2011
     Raid Level : raid5
  Used Dev Size : 1953514496 (1863.02 GiB 2000.40 GB)
     Array Size : 9767572480 (9315.08 GiB 10001.99 GB)
   Raid Devices : 6
  Total Devices : 3
Preferred Minor : 0

    Update Time : Fri Feb 24 16:31:02 2017
          State : clean
 Active Devices : 3
Working Devices : 3
 Failed Devices : 3
  Spare Devices : 0
       Checksum : ae0d0e45 - correct
         Events : 51108

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     3       8       96        3      active sync   /dev/sdg

   0     0       8       16        0      active sync   /dev/sdb
   1     1       0        0        1      faulty removed
   2     2       0        0        2      faulty removed
   3     3       8       96        3      active sync   /dev/sdg
   4     4       8       80        4      active sync   /dev/sdf
   5     5       0        0        5      faulty removed

===============================================================================
===============================================================================
- below is status of the current situation:
===============================================================================
===============================================================================

- Phil Turmel's lsdrv:

sudo ./lsdrv
[sudo] password for horus:
PCI [ata_piix] 00:1f.2 IDE interface: Intel Corporation NM10/ICH7
Family SATA Controller [IDE mode] (rev 02)
├scsi 0:0:0:0 ATA      Samsung SSD 850  {S21UNX0H601730R}
│└sda 111.79g [8:0] Partitioned (dos)
│ └sda1 97.66g [8:1] Partitioned (dos) {a2d2e5b3-cef5-44f8-83a7-3c25f285c7b4}
│  └Mounted as /dev/sda1 @ /
└scsi 1:0:0:0 ATA      SAMSUNG HD204UI  {S2H7JD2B201244}
 └sdb 1.82t [8:16] MD raid5 (0/6) (w/ sdc,sdd,sde,sdf) in_sync
{4c0518af-d198-d804-151d-b09aa68c27d9}
  └md0 9.10t [9:0] MD v0.90 raid5 (6) clean DEGRADED, 64k Chunk
{4c0518af:d198d804:151db09a:a68c27d9}
                   PV LVM2_member (inactive)
{DWv51O-lg9s-Dl4w-EBp9-QeIF-Vv60-8wt2uS}
PCI [pata_jmicron] 01:00.1 IDE interface: JMicron Technology Corp.
JMB363 SATA/IDE Controller (rev 03)
├scsi 2:x:x:x [Empty]
└scsi 3:x:x:x [Empty]
PCI [ahci] 01:00.0 SATA controller: JMicron Technology Corp. JMB363
SATA/IDE Controller (rev 03)
├scsi 4:0:0:0 ATA      SAMSUNG HD204UI  {S2H7JD2B201246}
│└sdc 1.82t [8:32] MD raid5 (2/6) (w/ sdb,sdd,sde,sdf) in_sync
{4c0518af-d198-d804-151d-b09aa68c27d9}
│ └md0 9.10t [9:0] MD v0.90 raid5 (6) clean DEGRADED, 64k Chunk
{4c0518af:d198d804:151db09a:a68c27d9}
│                  PV LVM2_member (inactive)
{DWv51O-lg9s-Dl4w-EBp9-QeIF-Vv60-8wt2uS}
├scsi 4:1:0:0 ATA      WDC WD40EFRX-68W {WD-WCC4E6JF3EE3}
│└sdd 3.64t [8:48] MD raid5 (5/6) (w/ sdb,sdc,sde,sdf) in_sync
{4c0518af-d198-d804-151d-b09aa68c27d9}
│ ├md0 9.10t [9:0] MD v0.90 raid5 (6) clean DEGRADED, 64k Chunk
{4c0518af:d198d804:151db09a:a68c27d9}
│ │                PV LVM2_member (inactive)
{DWv51O-lg9s-Dl4w-EBp9-QeIF-Vv60-8wt2uS}
└scsi 5:x:x:x [Empty]
PCI [ahci] 05:00.0 SATA controller: Marvell Technology Group Ltd.
88SE9120 SATA 6Gb/s Controller (rev 12)
├scsi 6:0:0:0 ATA      Hitachi HDS72202 {JK11A1YAJN30GV}
│└sde 1.82t [8:64] MD raid5 (4/6) (w/ sdb,sdc,sdd,sdf) in_sync
{4c0518af-d198-d804-151d-b09aa68c27d9}
│ └md0 9.10t [9:0] MD v0.90 raid5 (6) clean DEGRADED, 64k Chunk
{4c0518af:d198d804:151db09a:a68c27d9}
│                  PV LVM2_member (inactive)
{DWv51O-lg9s-Dl4w-EBp9-QeIF-Vv60-8wt2uS}
└scsi 7:0:0:0 ATA      Hitachi HDS72202 {JK1174YAH779AW}
 └sdf 1.82t [8:80] MD raid5 (3/6) (w/ sdb,sdc,sdd,sde) in_sync
{4c0518af-d198-d804-151d-b09aa68c27d9}
  └md0 9.10t [9:0] MD v0.90 raid5 (6) clean DEGRADED, 64k Chunk
{4c0518af:d198d804:151db09a:a68c27d9}
                   PV LVM2_member (inactive)
{DWv51O-lg9s-Dl4w-EBp9-QeIF-Vv60-8wt2uS}

===============================================================================
===============================================================================
cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5]
[raid4] [raid10]
md0 : active raid5 sdd[5] sde[4] sdf[3] sdc[2] sdb[0]
      9767572480 blocks level 5, 64k chunk, algorithm 2 [6/5] [U_UUUU]

unused devices: <none>

===============================================================================
===============================================================================
This is the current non-functional RAID.
sudo mdadm --detail /dev/md0
/dev/md0:
        Version : 0.90
  Creation Time : Fri Mar  3 21:09:22 2017
     Raid Level : raid5
     Array Size : 9767572480 (9315.08 GiB 10001.99 GB)
  Used Dev Size : 1953514496 (1863.02 GiB 2000.40 GB)
   Raid Devices : 6
  Total Devices : 5
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Fri Mar  3 21:09:22 2017
          State : clean, degraded
 Active Devices : 5
Working Devices : 5
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

           UUID : 4c0518af:d198d804:151db09a:a68c27d9 (local to host
horus-server)
         Events : 0.1

    Number   Major   Minor   RaidDevice State
       0       8       16        0      active sync   /dev/sdb
       2       0        0        2      removed
       2       8       32        2      active sync   /dev/sdc
       3       8       80        3      active sync   /dev/sdf
       4       8       64        4      active sync   /dev/sde
       5       8       48        5      active sync   /dev/sdd


===============================================================================
===============================================================================

sudo mdadm --examine /dev/sd*
/dev/sda:
   MBR Magic : aa55
Partition[0] :    204800000 sectors at         2048 (type 83)
/dev/sda1:
   MBR Magic : aa55
/dev/sdb:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 4c0518af:d198d804:151db09a:a68c27d9 (local to host
horus-server)
  Creation Time : Fri Mar  3 21:09:22 2017
     Raid Level : raid5
  Used Dev Size : 1953514496 (1863.02 GiB 2000.40 GB)
     Array Size : 9767572480 (9315.08 GiB 10001.99 GB)
   Raid Devices : 6
  Total Devices : 6
Preferred Minor : 0

    Update Time : Fri Mar  3 21:09:22 2017
          State : clean
 Active Devices : 5
Working Devices : 5
 Failed Devices : 1
  Spare Devices : 0
       Checksum : a857f8f3 - correct
         Events : 1

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     0       8       16        0      active sync   /dev/sdb

   0     0       8       16        0      active sync   /dev/sdb
   1     0       0        0        0      spare
   2     2       8       32        2      active sync   /dev/sdc
   3     3       8       80        3      active sync   /dev/sdf
   4     4       8       64        4      active sync   /dev/sde
   5     5       8       48        5      active sync   /dev/sdd
/dev/sdc:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 4c0518af:d198d804:151db09a:a68c27d9 (local to host
horus-server)
  Creation Time : Fri Mar  3 21:09:22 2017
     Raid Level : raid5
  Used Dev Size : 1953514496 (1863.02 GiB 2000.40 GB)
     Array Size : 9767572480 (9315.08 GiB 10001.99 GB)
   Raid Devices : 6
  Total Devices : 6
Preferred Minor : 0

    Update Time : Fri Mar  3 21:09:22 2017
          State : clean
 Active Devices : 5
Working Devices : 5
 Failed Devices : 1
  Spare Devices : 0
       Checksum : a857f907 - correct
         Events : 1

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     2       8       32        2      active sync   /dev/sdc

   0     0       8       16        0      active sync   /dev/sdb
   1     0       0        0        0      spare
   2     2       8       32        2      active sync   /dev/sdc
   3     3       8       80        3      active sync   /dev/sdf
   4     4       8       64        4      active sync   /dev/sde
   5     5       8       48        5      active sync   /dev/sdd
/dev/sdd:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 4c0518af:d198d804:151db09a:a68c27d9 (local to host
horus-server)
  Creation Time : Fri Mar  3 21:09:22 2017
     Raid Level : raid5
  Used Dev Size : 1953514496 (1863.02 GiB 2000.40 GB)
     Array Size : 9767572480 (9315.08 GiB 10001.99 GB)
   Raid Devices : 6
  Total Devices : 6
Preferred Minor : 0

    Update Time : Fri Mar  3 21:09:22 2017
          State : clean
 Active Devices : 5
Working Devices : 5
 Failed Devices : 1
  Spare Devices : 0
       Checksum : a857f91d - correct
         Events : 1

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     5       8       48        5      active sync   /dev/sdd

   0     0       8       16        0      active sync   /dev/sdb
   1     0       0        0        0      spare
   2     2       8       32        2      active sync   /dev/sdc
   3     3       8       80        3      active sync   /dev/sdf
   4     4       8       64        4      active sync   /dev/sde
   5     5       8       48        5      active sync   /dev/sdd
mdadm: No md superblock detected on /dev/sdd1.
/dev/sde:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 4c0518af:d198d804:151db09a:a68c27d9 (local to host
horus-server)
  Creation Time : Fri Mar  3 21:09:22 2017
     Raid Level : raid5
  Used Dev Size : 1953514496 (1863.02 GiB 2000.40 GB)
     Array Size : 9767572480 (9315.08 GiB 10001.99 GB)
   Raid Devices : 6
  Total Devices : 6
Preferred Minor : 0

    Update Time : Fri Mar  3 21:09:22 2017
          State : clean
 Active Devices : 5
Working Devices : 5
 Failed Devices : 1
  Spare Devices : 0
       Checksum : a857f92b - correct
         Events : 1

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     4       8       64        4      active sync   /dev/sde

   0     0       8       16        0      active sync   /dev/sdb
   1     0       0        0        0      spare
   2     2       8       32        2      active sync   /dev/sdc
   3     3       8       80        3      active sync   /dev/sdf
   4     4       8       64        4      active sync   /dev/sde
   5     5       8       48        5      active sync   /dev/sdd
/dev/sdf:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 4c0518af:d198d804:151db09a:a68c27d9 (local to host horus-server)
  Creation Time : Fri Mar  3 21:09:22 2017
     Raid Level : raid5
  Used Dev Size : 1953514496 (1863.02 GiB 2000.40 GB)
     Array Size : 9767572480 (9315.08 GiB 10001.99 GB)
   Raid Devices : 6
  Total Devices : 6
Preferred Minor : 0

    Update Time : Fri Mar  3 21:09:22 2017
          State : clean
 Active Devices : 5
Working Devices : 5
 Failed Devices : 1
  Spare Devices : 0
       Checksum : a857f939 - correct
         Events : 1

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     3       8       80        3      active sync   /dev/sdf

   0     0       8       16        0      active sync   /dev/sdb
   1     0       0        0        0      spare
   2     2       8       32        2      active sync   /dev/sdc
   3     3       8       80        3      active sync   /dev/sdf
   4     4       8       64        4      active sync   /dev/sde
   5     5       8       48        5      active sync   /dev/sdd


* Bit-Rot
From: Gandalf Corvotempesta @ 2017-03-03 21:25 UTC (permalink / raw)
  To: linux-raid

Hi all,
Wouldn't it be possible to add some sort of bitrot detection to mdadm?
I know that MD works on blocks and not files, but checksumming a
block should still be possible.

For example, if you read some blocks with dd, you can still hash the
content and verify it on the next read/consistency check.
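
The idea can be sketched in userspace as follows. This is only an
illustration under assumptions, not anything MD provides: the 4096-byte
block size, the side table of checksums, and the names record_block()
and verify_block() are all hypothetical. CRC32C is used here simply
because it is a common block checksum.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 4096	/* assumed block size for this sketch */

/* Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t crc32c(const void *buf, size_t len)
{
	const uint8_t *p = buf;
	uint32_t crc = 0xFFFFFFFFu;

	while (len--) {
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return ~crc;
}

/* Hypothetical side table: remember the checksum of block 'idx'. */
static void record_block(uint32_t *sums, size_t idx, const uint8_t *block)
{
	assert(sums != NULL && block != NULL);
	sums[idx] = crc32c(block, BLOCK_SIZE);
}

/* Returns 1 if the block still matches its recorded checksum, else 0. */
static int verify_block(const uint32_t *sums, size_t idx, const uint8_t *block)
{
	assert(sums != NULL && block != NULL);
	return sums[idx] == crc32c(block, BLOCK_SIZE);
}
```

A scrub pass would call verify_block() on every block it reads; where
the checksums are stored and how they stay consistent with writes is
the hard part and is not addressed by this sketch.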


* [PATCH] md/r5cache: improve recovery with read ahead page pool
From: Song Liu @ 2017-03-03 21:06 UTC (permalink / raw)
  To: linux-raid; +Cc: shli, neilb, kernel-team, dan.j.williams, hch, Song Liu

In r5cache recovery, the journal device is scanned page by page.
Currently, we use sync_page_io() to read the journal device. This is
not efficient when we have to recover many stripes from the journal.

To improve the speed of recovery, this patch introduces a read ahead
page pool (ra_pool) to recovery_ctx. With ra_pool, multiple consecutive
pages are read in one IO. Then the recovery code reads the journal from
ra_pool.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/raid5-cache.c | 151 +++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 134 insertions(+), 17 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 3f307be..46afea8 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -1552,6 +1552,8 @@ bool r5l_log_disk_error(struct r5conf *conf)
 	return ret;
 }
 
+#define R5L_RECOVERY_PAGE_POOL_SIZE 64
+
 struct r5l_recovery_ctx {
 	struct page *meta_page;		/* current meta */
 	sector_t meta_total_blocks;	/* total size of current meta and data */
@@ -1560,18 +1562,130 @@ struct r5l_recovery_ctx {
 	int data_parity_stripes;	/* number of data_parity stripes */
 	int data_only_stripes;		/* number of data_only stripes */
 	struct list_head cached_list;
+
+	/*
+	 * read ahead page pool (ra_pool)
+	 * in recovery, log is read sequentially. It is not efficient to
+	 * read every page with sync_page_io(). The read ahead page pool
+	 * reads multiple pages with one IO, so further log read can
+	 * just copy data from the pool.
+	 */
+	struct page *ra_pool[R5L_RECOVERY_PAGE_POOL_SIZE];
+	sector_t pool_offset;	/* offset of first page in the pool */
+	int total_pages;	/* total allocated pages */
+	int valid_pages;	/* pages with valid data */
+	struct bio *ra_bio;	/* bio to do the read ahead*/
 };
 
+static int r5l_recovery_allocate_ra_pool(struct r5l_log *log,
+					    struct r5l_recovery_ctx *ctx)
+{
+	struct page *page;
+
+	ctx->ra_bio = bio_alloc_bioset(GFP_KERNEL, BIO_MAX_PAGES, log->bs);
+	if (!ctx->ra_bio)
+		return -ENOMEM;
+
+	ctx->valid_pages = 0;
+	ctx->total_pages = 0;
+	while (ctx->total_pages < R5L_RECOVERY_PAGE_POOL_SIZE) {
+		page = alloc_page(GFP_KERNEL);
+
+		if (!page)
+			break;
+		ctx->ra_pool[ctx->total_pages] = page;
+		ctx->total_pages += 1;
+	}
+
+	if (ctx->total_pages == 0) {
+		bio_put(ctx->ra_bio);
+		return -ENOMEM;
+	}
+
+	ctx->pool_offset = 0;
+	return 0;
+}
+
+static void r5l_recovery_free_ra_pool(struct r5l_log *log,
+					struct r5l_recovery_ctx *ctx)
+{
+	int i;
+
+	for (i = 0; i < ctx->total_pages; ++i)
+		put_page(ctx->ra_pool[i]);
+	bio_put(ctx->ra_bio);
+}
+
+/*
+ * fetch ctx->valid_pages pages from offset
+ * In normal cases, ctx->valid_pages == ctx->total_pages after the call.
+ * However, if the offset is close to the end of the journal device,
+ * ctx->valid_pages could be smaller than ctx->total_pages
+ */
+static int r5l_recovery_fetch_ra_pool(struct r5l_log *log,
+				      struct r5l_recovery_ctx *ctx,
+				      sector_t offset)
+{
+	bio_reset(ctx->ra_bio);
+	ctx->ra_bio->bi_bdev = log->rdev->bdev;
+	bio_set_op_attrs(ctx->ra_bio, REQ_OP_READ, 0);
+	ctx->ra_bio->bi_iter.bi_sector = log->rdev->data_offset + offset;
+
+	ctx->valid_pages = 0;
+	ctx->pool_offset = offset;
+
+	while (ctx->valid_pages < ctx->total_pages) {
+		bio_add_page(ctx->ra_bio,
+			     ctx->ra_pool[ctx->valid_pages], PAGE_SIZE, 0);
+		ctx->valid_pages += 1;
+
+		offset = r5l_ring_add(log, offset, BLOCK_SECTORS);
+
+		if (offset == 0)  /* reached end of the device */
+			break;
+	}
+
+	return submit_bio_wait(ctx->ra_bio);
+}
+
+/*
+ * try read a page from the read ahead page pool, if the page is not in the
+ * pool, call r5l_recovery_fetch_ra_pool
+ */
+static int r5l_recovery_read_page(struct r5l_log *log,
+				  struct r5l_recovery_ctx *ctx,
+				  struct page *page,
+				  sector_t offset)
+{
+	int ret;
+
+	if (offset < ctx->pool_offset ||
+	    offset >= ctx->pool_offset + ctx->valid_pages * BLOCK_SECTORS) {
+		ret = r5l_recovery_fetch_ra_pool(log, ctx, offset);
+		if (ret)
+			return ret;
+	}
+
+	BUG_ON(offset < ctx->pool_offset ||
+	       offset >= ctx->pool_offset + ctx->valid_pages * BLOCK_SECTORS);
+
+	memcpy(page_address(page),
+	       page_address(ctx->ra_pool[(offset - ctx->pool_offset) / BLOCK_SECTORS]),
+	       PAGE_SIZE);
+	return 0;
+}
+
 static int r5l_recovery_read_meta_block(struct r5l_log *log,
 					struct r5l_recovery_ctx *ctx)
 {
 	struct page *page = ctx->meta_page;
 	struct r5l_meta_block *mb;
 	u32 crc, stored_crc;
+	int ret;
 
-	if (!sync_page_io(log->rdev, ctx->pos, PAGE_SIZE, page, REQ_OP_READ, 0,
-			  false))
-		return -EIO;
+	ret = r5l_recovery_read_page(log, ctx, page, ctx->pos);
+	if (ret != 0)
+		return ret;
 
 	mb = page_address(page);
 	stored_crc = le32_to_cpu(mb->checksum);
@@ -1653,8 +1767,7 @@ static void r5l_recovery_load_data(struct r5l_log *log,
 	raid5_compute_sector(conf,
 			     le64_to_cpu(payload->location), 0,
 			     &dd_idx, sh);
-	sync_page_io(log->rdev, log_offset, PAGE_SIZE,
-		     sh->dev[dd_idx].page, REQ_OP_READ, 0, false);
+	r5l_recovery_read_page(log, ctx, sh->dev[dd_idx].page, log_offset);
 	sh->dev[dd_idx].log_checksum =
 		le32_to_cpu(payload->checksum[0]);
 	ctx->meta_total_blocks += BLOCK_SECTORS;
@@ -1673,17 +1786,13 @@ static void r5l_recovery_load_parity(struct r5l_log *log,
 	struct r5conf *conf = mddev->private;
 
 	ctx->meta_total_blocks += BLOCK_SECTORS * conf->max_degraded;
-	sync_page_io(log->rdev, log_offset, PAGE_SIZE,
-		     sh->dev[sh->pd_idx].page, REQ_OP_READ, 0, false);
+	r5l_recovery_read_page(log, ctx, sh->dev[sh->pd_idx].page, log_offset);
 	sh->dev[sh->pd_idx].log_checksum =
 		le32_to_cpu(payload->checksum[0]);
 	set_bit(R5_Wantwrite, &sh->dev[sh->pd_idx].flags);
 
 	if (sh->qd_idx >= 0) {
-		sync_page_io(log->rdev,
-			     r5l_ring_add(log, log_offset, BLOCK_SECTORS),
-			     PAGE_SIZE, sh->dev[sh->qd_idx].page,
-			     REQ_OP_READ, 0, false);
+		r5l_recovery_read_page(log, ctx, sh->dev[sh->qd_idx].page, r5l_ring_add(log, log_offset, BLOCK_SECTORS));
 		sh->dev[sh->qd_idx].log_checksum =
 			le32_to_cpu(payload->checksum[1]);
 		set_bit(R5_Wantwrite, &sh->dev[sh->qd_idx].flags);
@@ -1814,14 +1923,15 @@ r5c_recovery_replay_stripes(struct list_head *cached_stripe_list,
 
 /* if matches return 0; otherwise return -EINVAL */
 static int
-r5l_recovery_verify_data_checksum(struct r5l_log *log, struct page *page,
+r5l_recovery_verify_data_checksum(struct r5l_log *log,
+				  struct r5l_recovery_ctx *ctx,
+				  struct page *page,
 				  sector_t log_offset, __le32 log_checksum)
 {
 	void *addr;
 	u32 checksum;
 
-	sync_page_io(log->rdev, log_offset, PAGE_SIZE,
-		     page, REQ_OP_READ, 0, false);
+	r5l_recovery_read_page(log, ctx, page, log_offset);
 	addr = kmap_atomic(page);
 	checksum = crc32c_le(log->uuid_checksum, addr, PAGE_SIZE);
 	kunmap_atomic(addr);
@@ -1853,17 +1963,17 @@ r5l_recovery_verify_data_checksum_for_mb(struct r5l_log *log,
 
 		if (payload->header.type == R5LOG_PAYLOAD_DATA) {
 			if (r5l_recovery_verify_data_checksum(
-				    log, page, log_offset,
+				    log, ctx, page, log_offset,
 				    payload->checksum[0]) < 0)
 				goto mismatch;
 		} else if (payload->header.type == R5LOG_PAYLOAD_PARITY) {
 			if (r5l_recovery_verify_data_checksum(
-				    log, page, log_offset,
+				    log, ctx, page, log_offset,
 				    payload->checksum[0]) < 0)
 				goto mismatch;
 			if (conf->max_degraded == 2 && /* q for RAID 6 */
 			    r5l_recovery_verify_data_checksum(
-				    log, page,
+				    log, ctx, page,
 				    r5l_ring_add(log, log_offset,
 						 BLOCK_SECTORS),
 				    payload->checksum[1]) < 0)
@@ -2255,9 +2365,16 @@ static int r5l_recovery_log(struct r5l_log *log)
 	if (!ctx.meta_page)
 		return -ENOMEM;
 
+	if (r5l_recovery_allocate_ra_pool(log, &ctx) != 0) {
+		__free_page(ctx.meta_page);
+		return -ENOMEM;
+	}
+
 	ret = r5c_recovery_flush_log(log, &ctx);
+	r5l_recovery_free_ra_pool(log, &ctx);
 	__free_page(ctx.meta_page);
 
+
 	if (ret)
 		return ret;
 
-- 
2.9.3
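
For readers who want the ra_pool idea outside the kernel, here is a
minimal userspace sketch. It is an analogue under assumptions, not the
patch's code: the in-memory struct fake_dev, the 4-page pool, and the
names pool_fetch() and pool_read_page() are hypothetical, and the
journal's ring wrap-around is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_BYTES 4096
#define POOL_PAGES 4	/* the patch uses 64; kept small here */

/* Hypothetical in-memory "journal device" made of whole pages. */
struct fake_dev {
	const unsigned char *data;
	size_t npages;
	unsigned long reads;	/* number of bulk reads issued */
};

struct ra_pool {
	unsigned char buf[POOL_PAGES][PAGE_BYTES];
	size_t first;		/* index of the first page in the pool */
	size_t valid;		/* number of pages holding valid data */
};

/* Refill the pool with one bulk read starting at page 'first'. */
static void pool_fetch(struct fake_dev *dev, struct ra_pool *pool, size_t first)
{
	size_t n = dev->npages - first;

	if (n > POOL_PAGES)
		n = POOL_PAGES;	/* fewer pages near the end of the device */
	memcpy(pool->buf, dev->data + first * PAGE_BYTES, n * PAGE_BYTES);
	pool->first = first;
	pool->valid = n;
	dev->reads++;
}

/* Copy one page out of the pool, refilling the pool first on a miss. */
static void pool_read_page(struct fake_dev *dev, struct ra_pool *pool,
			   size_t page, unsigned char *out)
{
	assert(page < dev->npages);
	if (page < pool->first || page >= pool->first + pool->valid)
		pool_fetch(dev, pool, page);
	memcpy(out, pool->buf[page - pool->first], PAGE_BYTES);
}
```

Scanning a 10-page device sequentially with this pool issues three bulk
reads (pages 0-3, 4-7, 8-9) instead of ten single-page reads, which is
the saving the commit message describes.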



* Re: [PATCH] Fix oddity where mdadm did not recognise a relative path
From: Wols Lists @ 2017-03-03 18:36 UTC (permalink / raw)
  To: jes.sorensen; +Cc: linux-raid
In-Reply-To: <wrfjzih4wt7l.fsf@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 397 bytes --]

> 
> I thought this was still pending but going back, it looks like I did
> apply it. The problem is I noticed a major issue in the patch, which
> doesn't belong in code, it is doing 'if () break' on the same line.

Sorry about that - corporate style and all that :-) but it is important
to not break the standard style.
> 
> Mind sending me a patch to fix that.
> 
> Thanks,
> Jes
> 
Cheers,
Wol

[-- Attachment #2: 0001-Fix-formatting-error.patch --]
[-- Type: text/x-patch, Size: 624 bytes --]

From 945b835cc89fd16279e01f743f8507b80219c8a1 Mon Sep 17 00:00:00 2001
From: Wol <anthony@youngman.org.uk>
Date: Fri, 3 Mar 2017 18:31:52 +0000
Subject: [PATCH] Fix formatting error

---
 mdadm.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mdadm.c b/mdadm.c
index b5d89e4..d973ae4 100644
--- a/mdadm.c
+++ b/mdadm.c
@@ -1902,7 +1902,8 @@ static int misc_list(struct mddev_dev *devlist,
 		switch(dv->devname[0] == '/') {
 			case 0:
 				mdfd = open_dev(dv->devname);
-				if (mdfd >= 0) break;
+				if (mdfd >= 0)
+					break;
 			case 1:
 				mdfd = open_mddev(dv->devname, 1);  
 		}
-- 
2.7.3



* Re: [PATCH 3/3] md/raid5: sort bios
From: Shaohua Li @ 2017-03-03 17:59 UTC (permalink / raw)
  To: NeilBrown; +Cc: Shaohua Li, linux-raid, songliubraving, kernel-team
In-Reply-To: <87k287m3d6.fsf@notabene.neil.brown.name>

On Fri, Mar 03, 2017 at 02:43:49PM +1100, Neil Brown wrote:
> On Fri, Feb 17 2017, Shaohua Li wrote:
> 
> > Previous patch (raid5: only dispatch IO from raid5d for harddisk raid)
> > defers IO dispatching. The goal is to create a better IO pattern. At that
> > time, we didn't sort the deferred IO and hoped the block layer could do IO
> > merge and sort. Now the raid5-cache writeback could create a large amount
> > of bios. And if we enable multi-thread for stripe handling, we can't
> > control when to dispatch IO to raid disks. Much of the time, we are
> > dispatching IO which the block layer can't merge effectively.
> >
> > This patch moves further for the IO dispatching defer. We accumulate
> > bios, but we don't dispatch all the bios after a threshold is met. This
> > 'dispatch partial portion of bios' strategy allows bios coming in a
> > large time window to be sent to disks together. At the dispatching time,
> > there is a large chance the block layer can merge the bios. To make this
> > more effective, we dispatch IO in ascending order. This increases
> > request merge chance and reduces disk seek.
> 
> I can see the benefit of batching and sorting requests.
> 
> I wonder if the extra complexity of grouping together 512 requests, then
> submitting the "first" 128 is really worth it.  Have you measured the
> value of that?

I'm pretty sure I tried. The whole point of dispatching the first 128 is we
don't have a better pipeline. Grouping 512 and then dispatching them together
definitely improves the IO pattern, but the request accumulation takes time,
and no IO is running during that window.

> If you just submitted every time you got 512 requests, you could use
> list_sort() on the bio list and wouldn't need an array.
> 
> If an array really is best, it would be really nice if "sort" could pass
> a 'void*' down to the cmp function,
> and it could sort all bios that are
> *after* last_bio_pos first, and then the others.  That would make the
> code much simpler.  I guess sort() could be changed (list_sort() already
> has a 'priv' argument like this).

Ok, I'll change this to a list and add an extra pointer to record the last
sorted entry. I didn't see the sort use much time in my profile, but the merge
sort looks better. Will do the change.
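
The ordering Neil describes (entries past the last dispatched position
first, then the rest) can be sketched in plain C. This is a userspace
illustration under assumptions: it sorts bare sector numbers rather
than bios, treats a sector equal to the pivot as "after" it, and the
names sort_pivot, cmp_from_pivot() and sort_sectors_from() are
hypothetical. Standard qsort() has no private argument either, so the
pivot is kept in a file-scope variable here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* qsort() passes no private pointer to the comparator, so the pivot
 * lives in a file-scope variable in this sketch. */
static uint64_t sort_pivot;

static int cmp_from_pivot(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a;
	uint64_t y = *(const uint64_t *)b;
	int x_low = x < sort_pivot;
	int y_low = y < sort_pivot;

	if (x_low != y_low)
		return x_low - y_low;	/* sectors at or past the pivot go first */
	return (x > y) - (x < y);	/* ascending within each group */
}

/* Sort so that sectors >= last_pos come first, each group ascending. */
static void sort_sectors_from(uint64_t *sectors, size_t n, uint64_t last_pos)
{
	assert(sectors != NULL || n == 0);
	sort_pivot = last_pos;
	qsort(sectors, n, sizeof(*sectors), cmp_from_pivot);
}
```

With a 'priv' argument, as list_sort() has, the pivot would be passed
through the call instead of being shared state.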

> If we cannot change sort(), then maybe use lib/bsearch.c for the binary
> search.  Performing two comparisons in the loop of a binary search
> should get a *fail* in any algorithms class!!
>
> The "pending_data" array that you have added to the r5conf structure
> adds 4096 bytes.  This means it is larger than a page, which is best
> avoided (though it is unlikely to cause problems).  I would allocate it
> separately.

Yep, already fixed internally.
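
On the binary-search remark quoted above: a search loop needs only one
comparison per iteration if it looks for the first element that is not
less than the key. A minimal sketch, assuming a sorted array of sector
numbers (the name lower_bound() is illustrative and this is not the
code from the patch under review):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return the index of the first element >= key in the sorted array v,
 * or n if every element is smaller. One comparison per iteration.
 */
static size_t lower_bound(const uint64_t *v, size_t n, uint64_t key)
{
	size_t lo = 0, hi = n;

	assert(v != NULL || n == 0);
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (v[mid] < key)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}
```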

> 
> So there is a lot that I don't really like, but it seems like a good
> idea in principle.

ok, thanks for your time!

Thanks,
Shaohua


* Re: [PATCH 2/3] md/raid5-cache: bump flush stripe batch size
From: Shaohua Li @ 2017-03-03 17:41 UTC (permalink / raw)
  To: NeilBrown; +Cc: Shaohua Li, linux-raid, songliubraving, kernel-team
In-Reply-To: <87mvd3m58c.fsf@notabene.neil.brown.name>

On Fri, Mar 03, 2017 at 02:03:31PM +1100, Neil Brown wrote:
> On Fri, Feb 17 2017, Shaohua Li wrote:
> 
> > Bump the flush stripe batch size to 2048. For my 12 disks raid
> > array, the stripes takes:
> > 12 * 4k * 2048 = 96MB
> >
> > This is still quite small. A hardware raid card generally has 1GB size,
> > which we suggest the raid5-cache has similar cache size.
> >
> > The advantage of a big batch size is we can dispatch a lot of IO in the
> > same time, then we can do some scheduling to make better IO pattern.
> >
> > Last patch prioritizes stripes, so we don't worry about a big flush
> > stripe batch will starve normal stripes.
> >
> > Signed-off-by: Shaohua Li <shli@fb.com>
> > ---
> >  drivers/md/raid5-cache.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> > index 3f307be..b25512c 100644
> > --- a/drivers/md/raid5-cache.c
> > +++ b/drivers/md/raid5-cache.c
> > @@ -43,7 +43,7 @@
> >  /* wake up reclaim thread periodically */
> >  #define R5C_RECLAIM_WAKEUP_INTERVAL (30 * HZ)
> >  /* start flush with these full stripes */
> > -#define R5C_FULL_STRIPE_FLUSH_BATCH 256
> > +#define R5C_FULL_STRIPE_FLUSH_BATCH 2048
> 
> Fixed numbers are warning signs... I wonder if there is something better
> we could do?   "conf->max_nr_stripes / 4" maybe?  We use that sort of
> number elsewhere.
> Would that make sense?

The code where we check the batch size (in r5c_do_reclaim) already has a check:
total_cached > conf->min_nr_stripes * 1 / 2
so I think that's ok, no?
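
The numbers in this exchange can be checked with a few lines of C. This
is only a sketch: flush_batch_bytes() and above_reclaim_threshold() are
hypothetical helpers, the second one simply mirrors the expression
quoted above, and 4 KiB pages are assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes held by one flush batch: disks * page size * stripes. */
static uint64_t flush_batch_bytes(unsigned int disks, unsigned int page_bytes,
				  unsigned int stripes)
{
	assert(disks > 0 && page_bytes > 0);
	return (uint64_t)disks * page_bytes * stripes;
}

/* Mirrors the quoted check: total_cached > min_nr_stripes * 1 / 2. */
static int above_reclaim_threshold(unsigned int total_cached,
				   unsigned int min_nr_stripes)
{
	return total_cached > min_nr_stripes * 1 / 2;
}
```

For 12 disks, 4096-byte pages and 2048 stripes this gives 100663296
bytes, i.e. the 96MB from the commit message; the old batch of 256
stripes corresponds to 12MB.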

Thanks,
Shaohua


* Re: [PATCH v2 06/13] md: raid1: don't use bio's vec table to manage resync pages
From: Shaohua Li @ 2017-03-03 17:38 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, open list:SOFTWARE RAID (Multiple Disks) SUPPORT,
	linux-block, Christoph Hellwig
In-Reply-To: <CACVXFVPQx0rQqQS0PG2MoYh+pGvVQb_H_+fMCGbWeVDM3JNcnQ@mail.gmail.com>

On Fri, Mar 03, 2017 at 10:11:31AM +0800, Ming Lei wrote:
> On Fri, Mar 3, 2017 at 1:48 AM, Shaohua Li <shli@kernel.org> wrote:
> > On Thu, Mar 02, 2017 at 10:25:10AM +0800, Ming Lei wrote:
> >> Hi Shaohua,
> >>
> >> On Wed, Mar 1, 2017 at 7:37 AM, Shaohua Li <shli@kernel.org> wrote:
> >> > On Tue, Feb 28, 2017 at 11:41:36PM +0800, Ming Lei wrote:
> >> >> Now we allocate one page array for managing resync pages, instead
> >> >> of using bio's vec table to do that, and the old way is very hacky
> >> >> and won't work any more if multipage bvec is enabled.
> >> >>
> >> >> The introduced cost is that we need to allocate (128 + 16) * raid_disks
> >> >> bytes per r1_bio, and it is fine because the inflight r1_bio for
> >> >> resync shouldn't be much, as pointed by Shaohua.
> >> >>
> >> >> Also the bio_reset() in raid1_sync_request() is removed because
> >> >> all bios are freshly new now and not necessary to reset any more.
> >> >>
> >> >> This patch can be thought as a cleanup too
> >> >>
> >> >> Suggested-by: Shaohua Li <shli@kernel.org>
> >> >> Signed-off-by: Ming Lei <tom.leiming@gmail.com>
> >> >> ---
> >> >>  drivers/md/raid1.c | 83 ++++++++++++++++++++++++++++++++++--------------------
> >> >>  1 file changed, 53 insertions(+), 30 deletions(-)
> >> >>
> >> >> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> >> >> index c442b4657e2f..900144f39630 100644
> >> >> --- a/drivers/md/raid1.c
> >> >> +++ b/drivers/md/raid1.c
> >> >> @@ -77,6 +77,16 @@ static void lower_barrier(struct r1conf *conf, sector_t sector_nr);
> >> >>  #define raid1_log(md, fmt, args...)                          \
> >> >>       do { if ((md)->queue) blk_add_trace_msg((md)->queue, "raid1 " fmt, ##args); } while (0)
> >> >>
> >> >> +static inline struct resync_pages *get_resync_pages(struct bio *bio)
> >> >> +{
> >> >> +     return bio->bi_private;
> >> >> +}
> >> >> +
> >> >> +static inline struct r1bio *get_resync_r1bio(struct bio *bio)
> >> >> +{
> >> >> +     return get_resync_pages(bio)->raid_bio;
> >> >> +}
> >> >
> >> > This is a weird relationship between bio, r1bio and the resync_pages. I'd like the pages are
> >>
> >> It is only a bit weird inside allocating and freeing r1bio, once all
> >> are allocated, you can see everything is clean and simple:
> >>
> >>     - r1bio includes lots of bioes,
> >>     - and one bio is attached by one resync_pages via .bi_private
> >
> > I don't know how complex it is to let r1bio point to the pages, but that's
> > the natural way. r1bio owns the pages, not the pages own r1bio, so we should
> > let r1bio point to the pages. The bio.bi_private still points to r1bio.
> 
> Actually it is bio which owns the pages for doing its own I/O, and the only
> thing related with r10bio is that bios may share these pages, but using
> page refcount trick will make the relation quite implicit.
>
> The only reason to allocate all resync_pages together is for sake of efficiency,
> and just for avoiding to allocate one resync_pages one time for each bio.
> 
> We have to make .bi_private point to resync_pages(per bio), otherwise we
> can't fetch pages into one bio at all, thinking about where to store the index
> for each bio's pre-allocated pages, and it has to be per bio.

So the reason is we can't find the corresponding pages of the bio if bi_private
points to r1bio, right? Got it. We don't have many choices in this way. Ok, I
don't insist. Please add some comments in the get_resync_r1bio to describe how
the data structure is organized.
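
The relationship discussed here can be sketched with userspace
stand-ins. The two accessors are copied from the quoted patch; the
layouts of struct resync_pages, struct bio and struct r1bio are
assumptions for illustration (the real kernel structures carry many
more fields), and resync_fetch_page() is a hypothetical helper showing
why a per-bio index is needed.

```c
#include <assert.h>
#include <stddef.h>

#define RESYNC_PAGES 16	/* assumed number of pages per resync bio */

struct r1bio;

/* One of these per bio: the bio's pages plus a back pointer. */
struct resync_pages {
	unsigned int idx;		/* next unused slot in pages[] */
	struct r1bio *raid_bio;		/* the r1bio this bio belongs to */
	void *pages[RESYNC_PAGES];	/* pre-allocated pages for this bio */
};

struct bio {
	void *bi_private;		/* points at this bio's resync_pages */
};

struct r1bio {
	struct bio *bios[2];		/* one bio per mirror (two assumed) */
};

static struct resync_pages *get_resync_pages(struct bio *bio)
{
	return bio->bi_private;
}

static struct r1bio *get_resync_r1bio(struct bio *bio)
{
	return get_resync_pages(bio)->raid_bio;
}

/* Hand out the next pre-allocated page of this bio, or NULL if none left. */
static void *resync_fetch_page(struct resync_pages *rp)
{
	assert(rp != NULL);
	if (rp->idx >= RESYNC_PAGES)
		return NULL;
	return rp->pages[rp->idx++];
}
```

With 16 page pointers plus the index and back pointer, this layout
would come to 128 + 16 bytes on a typical 64-bit build, which matches
the per-disk cost given in the commit message.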

Thanks,
Shaohua


* Re: [PATCH v2 09/13] md: raid1: use bio_segments_all()
From: Shaohua Li @ 2017-03-03 17:12 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, open list:SOFTWARE RAID (Multiple Disks) SUPPORT,
	linux-block, Christoph Hellwig
In-Reply-To: <CACVXFVNRhekBMAnVN9co-jdkP=+5S+28CyFr9Z_nsJVmdxPxSg@mail.gmail.com>

On Fri, Mar 03, 2017 at 02:22:30PM +0800, Ming Lei wrote:
> On Fri, Mar 3, 2017 at 10:20 AM, Ming Lei <tom.leiming@gmail.com> wrote:
> > On Thu, Mar 2, 2017 at 3:52 PM, Shaohua Li <shli@kernel.org> wrote:
> >> On Thu, Mar 02, 2017 at 10:34:25AM +0800, Ming Lei wrote:
> >>> Hi Shaohua,
> >>>
> >>> On Wed, Mar 1, 2017 at 7:42 AM, Shaohua Li <shli@kernel.org> wrote:
> >>> > On Tue, Feb 28, 2017 at 11:41:39PM +0800, Ming Lei wrote:
> >>> >> Use this helper, instead of direct access to .bi_vcnt.
> >>> >
> >>> > What we really need to do for the behind IO is:
> >>> > - allocate memory and copy bio data to the memory
> >>> > - let behind bio do IO against the memory
> >>> >
> >>> > The behind bio doesn't need to have the exactly same bio_vec setting. If we
> >>> > just track the new memory, we don't need use the bio_segments_all and access
> >>> > bio_vec too.
> >>>
> >>> But we need to figure out how many vecs(each vec store one page) to be
> >>> allocated for the cloned/behind bio, and that is the only value of
> >>> bio_segments_all() here. Or you have idea to avoid that?
> >>
> >> As I said, the behind bio doesn't need to have the exactly same bio_vec
> >> setting. We just allocate memory and copy original bio data to the memory,
> >> then do IO against the new memory. The behind bio
> >> segments == (bio->bi_iter.bi_size + PAGE_SIZE - 1) >> PAGE_SHIFT
> >
> > The equation isn't always correct, especially when bvec includes just
> > part of page, and it is quite often in case of mkfs, in which one bvec often
> > includes 512byte buffer.
> 
> Think it further, your idea could be workable and more clean, but the change
> can be a bit big, looks we need to switch handling write behind into
> the following way:
> 
> 1) replace bio_clone_bioset_partial() with bio_allocate(nr_vecs), and 'nr_vecs'
> is computed with your equation;
> 
> 2) allocate 'nr_vecs' pages once and share them among all created bio in 1)
> 
> 3) for each created bio, add each page into the bio via bio_add_page()
> 
> 4) only for the 1st created bio, call bio_copy_data() to copy data from
> master bio.
> 
> Let me know if you are OK with the above implementation.

Right, this is exactly what I'd like to do. This way we don't need to touch
bvec and should be much cleaner.
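
A small userspace sketch of the approach agreed on here: size the
behind copy from the total byte count rather than from the number of
segments, then pack the data into freshly allocated pages. Everything
named below is a stand-in: struct seg plays the role of a bio_vec,
PG_SHIFT/PG_SIZE stand for PAGE_SHIFT/PAGE_SIZE (4 KiB assumed), and
the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PG_SHIFT 12				/* stand-in for PAGE_SHIFT */
#define PG_SIZE  ((size_t)1 << PG_SHIFT)	/* stand-in for PAGE_SIZE */

/* Hypothetical stand-in for one bio_vec: a buffer and its length. */
struct seg {
	const unsigned char *base;
	size_t len;
};

/* Pages needed to hold 'bytes' of payload, per the quoted formula. */
static size_t behind_pages_needed(size_t bytes)
{
	return (bytes + PG_SIZE - 1) >> PG_SHIFT;
}

/* Sum of all segment lengths, the role bi_iter.bi_size plays. */
static size_t total_bytes(const struct seg *segs, size_t nsegs)
{
	size_t sum = 0;

	for (size_t i = 0; i < nsegs; i++)
		sum += segs[i].len;
	return sum;
}

/* Pack the segments back to back into 'dst'; returns bytes copied. */
static size_t copy_segments(unsigned char *dst, size_t dst_len,
			    const struct seg *segs, size_t nsegs)
{
	size_t off = 0;

	for (size_t i = 0; i < nsegs; i++) {
		assert(off + segs[i].len <= dst_len);
		memcpy(dst + off, segs[i].base, segs[i].len);
		off += segs[i].len;
	}
	return off;
}
```

Three 512-byte segments, as in the mkfs case mentioned earlier in the
thread, need only one page for the copy even though the source has
three vecs, which is why the segment count of the original bio no
longer matters.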

Thanks,
Shaohua


* Re: [mdadm] compiler warning
From: Michael Shigorin @ 2017-03-03  8:14 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid
In-Reply-To: <87bmtjnt5v.fsf@notabene.neil.brown.name>

On Fri, Mar 03, 2017 at 10:41:16AM +1100, NeilBrown wrote:
> > Thought I'd better write this. :)
> Thanks for the report, though we prefer such things to be sent to
>   linux-raid@vger.kernel.org

Ack!

> The warning is just telling us that the compiler is smart
> enough to deduce that whenever st is uninitialised, err = 0.
> So the code is safe, but maybe more complex than needed.
> I'll submit a patch to improve it a little.

Thanks for the explanation, I'll file a bug against the compiler then.
Your commit has cleared the warning in the meantime.

-- 
 ---- WBR, Michael Shigorin / http://altlinux.org
  ------ http://opennet.ru / http://anna-news.info

