Linux RAID subsystem development

Linux RAID subsystem development
 help / color / mirror / Atom feed

* Re: [PATCH v2 00/12] Partial Parity Log for MD RAID 5
From: NeilBrown @ 2016-12-07  0:32 UTC (permalink / raw)
  To: Artur Paszkiewicz, shli; +Cc: linux-raid
In-Reply-To: <20161205153113.7268-1-artur.paszkiewicz@intel.com>

[-- Attachment #1: Type: text/plain, Size: 1768 bytes --]

On Tue, Dec 06 2016, Artur Paszkiewicz wrote:

> This series of patches implements the Partial Parity Log for RAID5 arrays. The
> purpose of this feature is closing the RAID Write Hole. It is a solution
> alternative to the existing raid5-cache, but the implementation is based on it
> and reuses some of the code by introducing support for interchangeable
> policies. This allows decoupling policy from mechanism and not adding more
> boilerplate code in raid5.c.
>
> The issue addressed by PPL is, that on a dirty shutdown, parity for a
> particular stripe may be inconsistent with data on other member disks. In
> degraded state, there is no way to recalculate parity, because one of the disks
> is missing. PPL addresses this issue and allows recalculating the parity. It
> stores only enough data needed for recovering from RWH and is not a true
> journal, like the raid5-cache implementation. It does not protect from losing
> in-flight data.
>
> PPL is a distributed log - data is stored on all RAID member drives in the
> metadata area. It does not need a dedicated journaling drive. Performance is
> reduced by up to 30%-40% but it scales with the number of drives in the array
> and the journaling drive does not become a bottleneck.

I would expect to see as description of what a PPL actually is and how
it works here... but there is none.

The change-log for patch 06 has a tiny bit more information which is
just enough to be able to start trying to understand the code, but it
isn't much.
And none of this description gets into the code, or into the
Documentation/.  This makes it hard to review and hard to maintain.

Remember: if you want people to review you code, it is in your interest
to make it easy.  That means give lots of details.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [PATCH v2 03/12] raid5-cache: add a new policy
From: NeilBrown @ 2016-12-07  0:46 UTC (permalink / raw)
  To: Artur Paszkiewicz, shli; +Cc: linux-raid
In-Reply-To: <20161205153113.7268-4-artur.paszkiewicz@intel.com>

[-- Attachment #1: Type: text/plain, Size: 1279 bytes --]

On Tue, Dec 06 2016, Artur Paszkiewicz wrote:

> Add a source file for the new policy implementation and allow selecting
> the policy based on the policy_type parameter in r5l_init_log().
>
> Introduce a new flag for rdev state flags to allow enabling the new
> policy from userspace.

This seems odd.  Why is this a per-device flag?
It makes sense for "journal" to be a per-device flag, because only one
device is the journal device and it is obviously different from the
others.

But with the ppl, all devices serve as journal devices.  So we would
need to set journal_ppl on all devices?  What happens if you set it on
some, but not others?  I see you get an error.

I think some sort of array-wide setting would make more sense, would it
not?

And what is an RWH???  A Really Weird Handle ??

I guess it is probably a Raid5 Write Hole ?  At the very least there
should be a comment explaining this when you define the enum. (remember,
you are trying to make it easier for reviewers).

It might almost make sense for bitmap/metadata to be used here.
It can currently be "external" "internal" "clustered".
Allow also "journalled" or "partial-partiy-log" ???

Maybe not ... but I'd definitely prefer a global setting, and one that
didn't use an obscure abbreviation.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [PATCH v2 05/12] raid5-ppl: Partial Parity Log implementation
From: NeilBrown @ 2016-12-07  1:17 UTC (permalink / raw)
  To: Artur Paszkiewicz, shli; +Cc: linux-raid
In-Reply-To: <20161205153113.7268-6-artur.paszkiewicz@intel.com>

[-- Attachment #1: Type: text/plain, Size: 6455 bytes --]

On Tue, Dec 06 2016, Artur Paszkiewicz wrote:

> This implements the write logging functionality, using the policy logic
> introduced in previous patches.
>
> PPL is a distributed log - data is stored on all RAID member drives in
> the metadata area. PPL is written to the parity disk of a particular
> stripe. Distributed log is implemented by using one r5l_log instance per
> each array member. They are grouped in child_logs array in struct
> ppl_conf, which is assigned to a common parent log. This parent log
> serves as a proxy and is used in raid5 personality code - it is assigned
> as _the_ log in r5conf->log. The child logs are where all the real work
> is done.
>
> The PPL consists of a 4KB header (struct ppl_header), and at least 128KB
> for partial parity data. It is stored right after the array data (for
> IMSM) or in the bitmap area (super 1.1 and 1.2) and can be overwritten
> even at each array write request.
>
> Attach a page for holding the partial parity data to each stripe_head.
> Allocate it only if mddev has the MD_HAS_PPL flag set.
>
> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> ---
>  drivers/md/raid5-cache.c |  12 +-
>  drivers/md/raid5-cache.h |   6 +
>  drivers/md/raid5-ppl.c   | 594 ++++++++++++++++++++++++++++++++++++++++++++++-
>  drivers/md/raid5.c       |  15 +-
>  drivers/md/raid5.h       |   1 +
>  5 files changed, 620 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index fa82b9a..be534d8 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -119,8 +119,8 @@ static bool r5l_has_free_space(struct r5l_log *log, sector_t size)
>  	return log->device_size > used_size + size;
>  }
>  
> -static void __r5l_set_io_unit_state(struct r5l_io_unit *io,
> -				    enum r5l_io_unit_state state)
> +void __r5l_set_io_unit_state(struct r5l_io_unit *io,
> +			     enum r5l_io_unit_state state)
>  {
>  	if (WARN_ON(io->state >= state))
>  		return;
> @@ -340,7 +340,7 @@ static void r5c_finish_cache_stripe(struct stripe_head *sh)
>  	}
>  }
>  
> -static void r5l_io_run_stripes(struct r5l_io_unit *io)
> +void r5l_io_run_stripes(struct r5l_io_unit *io)
>  {
>  	struct stripe_head *sh, *next;
>  
> @@ -935,7 +935,7 @@ static sector_t r5l_reclaimable_space(struct r5l_log *log)
>  				 r5c_calculate_new_cp(conf));
>  }
>  
> -static void r5l_run_no_mem_stripe(struct r5l_log *log)
> +void r5l_run_no_mem_stripe(struct r5l_log *log)
>  {
>  	struct stripe_head *sh;
>  
> @@ -1039,7 +1039,7 @@ static void r5l_log_flush_endio(struct bio *bio)
>   * only write stripes of an io_unit to raid disks till the io_unit is the first
>   * one whose data/parity is in log.
>   */
> -static void __r5l_flush_stripe_to_raid(struct r5l_log *log)
> +void __r5l_flush_stripe_to_raid(struct r5l_log *log)
>  {
>  	bool do_flush;
>  
> @@ -1359,7 +1359,7 @@ bool r5l_log_disk_error(struct r5conf *conf)
>  	if (!log)
>  		ret = test_bit(MD_HAS_JOURNAL, &conf->mddev->flags);
>  	else
> -		ret = test_bit(Faulty, &log->rdev->flags);
> +		ret = log->rdev && test_bit(Faulty, &log->rdev->flags);
>  	rcu_read_unlock();
>  	return ret;
>  }
> diff --git a/drivers/md/raid5-cache.h b/drivers/md/raid5-cache.h
> index 4ba11d3..0446100 100644
> --- a/drivers/md/raid5-cache.h
> +++ b/drivers/md/raid5-cache.h
> @@ -157,4 +157,10 @@ extern int r5l_handle_flush_request(struct r5l_log *log, struct bio *bio);
>  extern void r5l_quiesce(struct r5l_log *log, int state);
>  extern bool r5l_log_disk_error(struct r5conf *conf);
>  
> +extern void __r5l_set_io_unit_state(struct r5l_io_unit *io,
> +				    enum r5l_io_unit_state state);
> +extern void r5l_io_run_stripes(struct r5l_io_unit *io);
> +extern void r5l_run_no_mem_stripe(struct r5l_log *log);
> +extern void __r5l_flush_stripe_to_raid(struct r5l_log *log);
> +
>  #endif /* _RAID5_CACHE_H */
> diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
> index 263fad7..2d4c90f 100644
> --- a/drivers/md/raid5-ppl.c
> +++ b/drivers/md/raid5-ppl.c
> @@ -14,7 +14,599 @@
>  
>  #include <linux/kernel.h>
>  #include <linux/blkdev.h>
> +#include <linux/slab.h>
> +#include <linux/crc32c.h>
> +#include <linux/module.h>
> +#include <linux/raid/md_p.h>
> +#include "md.h"
>  #include "raid5.h"
>  #include "raid5-cache.h"
>  
> -struct r5l_policy r5l_ppl;
> +static bool ppl_debug;
> +module_param(ppl_debug, bool, 0644);
> +MODULE_PARM_DESC(ppl_debug, "Debug mode for md raid5 PPL");
> +
> +#define dbg(format, args...)						\
> +do {									\
> +	if (ppl_debug)							\
> +		printk(KERN_DEBUG"[%d] %s() "format,			\
> +			current->pid, __func__, ##args);		\
> +} while (0)

Please don't do this.  Just use pr_debug(), and use
 /sys/kernel/debug/dynamic_debug/control
to turn them on and off.

> +
> +struct ppl_conf {
> +	int count;
> +	struct r5l_log **child_logs;
> +};
> +
> +struct ppl_header_entry {
> +	__le64 data_sector;	/* Raid sector of the new data */
> +	__le32 pp_size;		/* Length of partial parity */
> +	__le32 data_size;	/* Length of data */
> +	__u8 parity_disk;	/* Member disk containing parity */
> +	__le32 checksum;	/* Checksum of this entry */
> +} __packed;

Really?  "checksum" is 32bits but not aligned?
I *think* you should be using get_unaligned_le32() to access this
and put_unaligned_le32() to set it.

> +
> +#define PPL_HEADER_SIZE PAGE_SIZE
> +#define PPL_HDR_RESERVED 512
> +#define PPL_HDR_ENTRY_SPACE \
> +	(PPL_HEADER_SIZE - PPL_HDR_RESERVED - 3 * sizeof(u32) - sizeof(u64))
> +#define PPL_HDR_MAX_ENTRIES \
> +	(PPL_HDR_ENTRY_SPACE / sizeof(struct ppl_header_entry))
> +#define PPL_ENTRY_SPACE_IMSM (128 * 1024)
> +
> +struct ppl_header {
> +	__u8 reserved[PPL_HDR_RESERVED];/* Reserved space */
> +	__le32 signature;		/* Signature (family number of volume) */
> +	__le64 generation;		/* Generation number of PP Header */

This probably needs to use the 'unaligned' macros too.

> +	__le32 entries_count;		/* Number of entries in entry array */
> +	__le32 checksum;		/* Checksum of PP Header */
> +	struct ppl_header_entry entries[PPL_HDR_MAX_ENTRIES];
> +} __packed;

ppl_header_entry doesn't seem to be a multiple of 4 bytes long.
This means all the fields in it could be unaligned...

Maybe we should make this code refuse to compile except on x86 ???


NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [PATCH v2 07/12] md: mddev_find_container helper function
From: NeilBrown @ 2016-12-07  1:23 UTC (permalink / raw)
  To: Artur Paszkiewicz, shli; +Cc: linux-raid
In-Reply-To: <20161205153113.7268-8-artur.paszkiewicz@intel.com>

[-- Attachment #1: Type: text/plain, Size: 279 bytes --]

On Tue, Dec 06 2016, Artur Paszkiewicz wrote:

> This allows finding the parent container of a subarray for external
> metadata arrays.

This doesn't belong in the kernel.
If it did, there would be a detailed description here of why it is
needed.
But there isn't one.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [PATCH v2 08/12] md: expose rdev->sb_start as sysfs attribute
From: NeilBrown @ 2016-12-07  1:25 UTC (permalink / raw)
  To: Artur Paszkiewicz, shli; +Cc: linux-raid
In-Reply-To: <20161205153113.7268-9-artur.paszkiewicz@intel.com>

[-- Attachment #1: Type: text/plain, Size: 424 bytes --]

On Tue, Dec 06 2016, Artur Paszkiewicz wrote:

> This is meant for external metadata arrays, to pass the location of the
> superblock to the kernel.

The whole point of external metadata arrays is that the metadata is
handled externally.  The kernel doesn't need to know where it is.

Certainly the kernel needs to know where the PPL is, but that is quite a
different question to knowing where the superblock is.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [PATCH v2 09/12] raid5-ppl: read PPL signature from IMSM metadata
From: NeilBrown @ 2016-12-07  1:25 UTC (permalink / raw)
  To: Artur Paszkiewicz, shli; +Cc: linux-raid
In-Reply-To: <20161205153113.7268-10-artur.paszkiewicz@intel.com>

[-- Attachment #1: Type: text/plain, Size: 464 bytes --]

On Tue, Dec 06 2016, Artur Paszkiewicz wrote:

> The PPL signature is used to determine if the stored PPL is valid for a
> given array. With IMSM, the PPL signature should match the
> orig_family_num field of the superblock. To avoid passing this value
> from userspace, it can be read from the IMSM MPB when initializing the
> log.

It is up to mdadm to determine if the PPL is valid.  It would only tell
the kernel that a PPL exists if it is valid...

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [PATCH v2 11/12] raid5-ppl: support disk add/remove with distributed PPL
From: NeilBrown @ 2016-12-07  1:29 UTC (permalink / raw)
  To: Artur Paszkiewicz, shli; +Cc: linux-raid
In-Reply-To: <20161205153113.7268-12-artur.paszkiewicz@intel.com>

[-- Attachment #1: Type: text/plain, Size: 834 bytes --]

On Tue, Dec 06 2016, Artur Paszkiewicz wrote:

> Add a function to modify the log by adding or removing an rdev when a
> drive fails or is added as a spare.
>
> Adding a drive to the log is as simple as initializing and adding a new
> child log, removing a drive is more complicated because it requires
> stopping the child log and freeing all of its resources. In order to do
> that, we busy wait for any submitted log bios to complete and then
> manually finish and free the io_units. No new log requests will happen
> at this point. A new list is added to struct r5l_io_unit to have access
> to stripes that have been written to the log but are not completely
> processed yet.

Busy-wait for IO to finish?  Seriously?

At the very very least you need to cond_resched() in there, but an
actual wait would be much better.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: raid0 vs. mkfs
From: Mike Snitzer @ 2016-12-07  5:08 UTC (permalink / raw)
  To: Avi Kivity; +Cc: NeilBrown, linux-raid@vger.kernel.org
In-Reply-To: <33bb250a-4dfd-0acc-9958-30fdac10918c@scylladb.com>

On Tue, Nov 29, 2016 at 5:45 PM, Avi Kivity <avi@scylladb.com> wrote:
> On 11/29/2016 11:14 PM, NeilBrown wrote:
>>
>> On Mon, Nov 28 2016, Avi Kivity wrote:
>>>>
>>>> If it is easy for the upper layer to break a very large request into a
>>>> few very large requests, then I wouldn't necessarily object.
>>>
>>> I can't see why it would be hard.  It's simple arithmetic.
>>
>> That is easy to say before writing the code :-)
>> It probably is easy for RAID0.  Less so for RAID10.  Even less for
>> RAID6.
>>
>
> pick the largest subrange wihin the inpu range whose bounds are 0 (mod
> stripe-size); TRIM it (for all members); apply the regular algorithm to the
> head and tail subranges.  Works for all RAID types.  If the RAID is
> undergoing reshaping, exclude the range undergoing reshaping, and treat the
> two halves separately.
>
>>>> But unless it is very hard for the lower layer to merge requests, it
>>>> should be doing that too.
>>>
>>> Merging has tradeoffs.  When you merge requests R1, R2, ... Rn you make
>>> the latency request R1 sum of the latencies of R1..Rn.  You may gain
>>> some efficiency in the process, but that's not going to make up for a
>>> factor of n.  The queuing layer has no way to tell whether the caller is
>>> interested in the latency of individual requests.  By sending large
>>> requests, the caller indicates it's not interested in the latency of
>>> individual subranges.  The queuing layer is still free to internally
>>> split the request to smaller ranges, to satisfy hardware constraints, or
>>> to reduce worst-case latencies for competing request streams.
>>
>> I would have thought that using plug/unplug to group requests is a
>> fairly strong statement that they can be handled as a unit if that is
>> convenient.
>
>
> It is not.  As an example, consider a read and a few associated read-ahead
> requests submitted in a batch.  The last thing you want is for them to be
> treated as a unit.
>
> Plug/unplug means: I have a bunch of requests here.  Whether they should be
> merged or reordered is orthogonal to whether they are members of a batch or
> not.
>
>>
>>
>>> So I disagree that all the work should be pushed to the merging layer.
>>> It has less information to work with, so the fewer decisions it has to
>>> make, the better.
>>
>> I think that the merging layer should be as efficient as it reasonably
>> can be, and particularly should take into account plugging.  This
>> benefits all callers.
>
>
> Yes, but plugging does not mean "please merge anything you can until the
> unplug".
>
>> If it can be demonstrated that changes to some of the upper layers bring
>> further improvements with acceptable costs, then certainly it is good to
>> have those too.
>
>
> Generating millions of requests only to merge them again is inefficient.  It
> happens in an edge case (TRIM of the entirety of a very large RAID), but it
> already caused on user to believe the system failed.  I think the system
> should be more robust than that.

DM's stripe target (raid0) has had Avi's proposed implementation
(breaks large discard into N discards, one per component device) for a
long while.
See commit 7b76ec11fec4 ("dm stripe: support discards")

This improvement applies to other raid levels too.  And I completely
agree that it would be a mistake to to continue to have the block
layer issue millions of chunk-sized bios only to _maybe_ have them be
merged by the IO scheduler and/or plugging.  Needless make-work that
just eats cpu and memory.

^ permalink raw reply

* Recovering a RAID6 after all disks were disconnected
From: Giuseppe Bilotta @ 2016-12-07  8:59 UTC (permalink / raw)
  To: linux-raid

Hello all,

my situation is the following: I have a small 4-disk JBOD that I use
to hold a RAID6 software raid setup controlled by mdraid (currently
Debian version 3.4-4 on Linux kernel 4.7.8-1

I've had sporadic resets of the JBOD due to a variety of reasons
(power failures or disk failures —the JBOD has the bad habit of
resetting when one disk has an I/O error, which causes all of the
disks to go offline temporarily).

 When this happens, all the disks get kicked from the RAID, as md
fails to find them until the reset of the JBOD is complete. When the
disks come back online, even if it's just a few seconds later, the
RAID remains in the failed configuration with all 4 disks missing, of
course.

Normally, the way I would proceed in this case is to unmount the
filesystem sitting on top of the RAID, stop the RAID, and then try to
start it again, which works reasonably well (aside from the obvious
filesystem check that is often needed).

The thing happened again a couple of days ago, but this time I tried
re-adding the disks directly when they came back online, using mdadm
-a and confident that since they _had_ been recently part of the
array, the array would actually go back to work fine —except that this
is not the case when ALL disks were kicked out of the array! Instead,
what happened was that all the disks were marked as 'spare' and the
RAID would not assemble anymore.

At this point I stopped everything and made a full copy of the RAID
disks (lucky me, I had just bought a new JBOD for an upgrade, and a
bunch of new disks, even if one of them is apparently defective so I
have only been able to backup 3 of the 4 disks) and I have been toying
around with ways to recover the array by playing on the copies I've
made (I've set the original disks to readonly at the kernel level just
to be sure).

So now my situation is this, and I would like to know if there is
something I can try to recover the RAID (I've made a few tests that I
will describe momentarily). (I would like to know if there is any
possibility for md to handle these kind of issue —all disks in a RAID
going temporarily offline— more gracefully, which is likely needed for
a lot of home setup where SATA is used instead of SAS).

So one thing that I've done is to hack around the superblock in the
disks (copies) to put back the device roles as they were (getting the
information from the pre-failure dmesg output). (By the way, I've been
using Andy's Binary Editor for the superblock editing, so if anyone is
interested in a be.ini for mdraid v1 superblocks, including checksum
verification, I'd be happy to share). Specifically, I've left the
device number untouched, but I have edited the dev_roles array so that
the slots corresponding to the dev_number from all the disks map to
appropriate device roles.

I can then assemble the array with only 3 of 4 disks (because I do not
have a copy of the fourth, essentially) and force-run it. However,
when I do this, I get two things:

(1) a complaint about the bitmap being out of date (number of events
too low by 3) and
(2) I/O errors on logical block 0 (and the RAID data thus completely
inaccessible)

I'm now wondering about what I should try next. Prevent a resync by
matching the event count with that of the bitmap (or conversely)? Try
a different permutation of the roles? (I have triple-checked but who
knows)? Try a different subset of disks? Try and recreate the array?

Thanks in advance for any suggestion you may have,

-- 
Giuseppe "Oblomov" Bilotta

^ permalink raw reply

* Re: raid0 vs. mkfs
From: Coly Li @ 2016-12-07 11:50 UTC (permalink / raw)
  To: Avi Kivity, NeilBrown; +Cc: linux-raid, linux-block
In-Reply-To: <33bb250a-4dfd-0acc-9958-30fdac10918c@scylladb.com>

[-- Attachment #1: Type: text/plain, Size: 3401 bytes --]

On 2016/11/30 上午6:45, Avi Kivity wrote:
> On 11/29/2016 11:14 PM, NeilBrown wrote:
[snip]

>>> So I disagree that all the work should be pushed to the merging layer.
>>> It has less information to work with, so the fewer decisions it has to
>>> make, the better.
>> I think that the merging layer should be as efficient as it reasonably
>> can be, and particularly should take into account plugging.  This
>> benefits all callers.
> 
> Yes, but plugging does not mean "please merge anything you can until the
> unplug".
> 
>> If it can be demonstrated that changes to some of the upper layers bring
>> further improvements with acceptable costs, then certainly it is good to
>> have those too.
> 
> Generating millions of requests only to merge them again is
> inefficient.  It happens in an edge case (TRIM of the entirety of a very
> large RAID), but it already caused on user to believe the system
> failed.  I think the system should be more robust than that.

Neil,

As my understand, if a large discard bio received by
raid0_make_request(), for example it requests to discard chunk 1 to 24
on a raid0 device built by 4 SSDs. This large discard bio will be split
and written to each SSD as the following layout,

SSD1: C1,C5,C9,C13,C17,C21
SSD2: C2,C6,C10,C14,C18,C22
SSD3: C3,C7,C11,C15,C19,C23
SSD4: C4,C8,C12,C16,C20,C24

Current raid0 code will call generic_make_request() for 24 times for
each split bio. But it is possible to calculate the final layout of each
split bio, so we can combine all the bios into four per-SSD large bio,
like this,

bio1 (on SSD1): C{1,5,9,13,17,21}
bio2 (on SSD2): C{2,6,10,14,18,22}
bio3 (on SSD3): C{3,7,11,15,19,23}
bio4 (on SSD4): C{4,8,12,16,20,24}

Now we only need to call generic_make_request() for 4 times. Rebuild the
per-device discard bios is more efficient in raid0 code then in block
layer. There are some reasons that I know,
- there are splice timeout, block layer cannot merge all split bio into
one large bio before time out.
- rebuilt per-device bio in raid0 is just by a few calculation, block
layer does merge on queue with list operations, it is slower.
- raid0 code knows its on disk layout, so rebuild per-device bio is
possible here. block layer has no idea on raid0 layout, it can only do
request merge.

Avi,

I compose a prototype patch, the code is not simple, indeed it is quite
complicated IMHO.

I do a little research, some NVMe SSDs support whole device size
DISCARD, also I observe mkfs.xfs sends out a raid0 device size DISCARD
bio to block layer. But raid0_make_request() only receives 512KB size
DISCARD bio, block/blk-lib.c:__blkdev_issue_discard() splits the
original large bio into 512KB small bios, the limitation is from
q->limits.discard_granularity.

At this moment, I don't know why a q->limits.discard_granularity is
512KB even the underlying SSD supports whole device size discard. We
also need to fix q->limits.discard_granularity, otherwise
block/blk-lib.c:__blkdev_issue_discard() still does an inefficient loop
to split the original large discard bio into smaller ones and sends them
to raid0 code by next_bio().

I also CC this email to linux-block@vger.kernel.org to ask for help. My
question is, if a NVMe SSD supports whole-device-size DISCARD, is
q->limits.discard_granularity still necessary ?

Here I also attach my prototype patch as a proof of concept, it is
runnable with Linux 4.9-rc7.

Coly



[-- Attachment #2: raid0_handle_large_discard_bio.patch --]
[-- Type: text/plain, Size: 9329 bytes --]

Subject: optimization for large size DISCARD bio by per-device bios 

This is a very early prototype, still needs more block layer code
modification to make it work.

Current upstream raid0_make_request() only handles TRIM/DISCARD bio by
chunk size, it meams for large raid0 device built by SSDs will call
million times generic_make_request() for the split bio. This patch
tries to combine small bios into large one if they are on same real
device and continuous on this real device, then send the combined large
bio to underlying device by single call to generic_make_request().

For example, use mkfs.xfs to trim a raid0 device built with 4 x 3TB
NVMeSSD, current upstream raid0_make_request() will call
generic_make_request() 5.7 million times, with this patch only 4 calls
to generic_make_request() is required.

This patch won't work in real world, because in block/blk-lib.c:
__blkdev_issue_discard() the original large bio will be split into
smaller ones by restriction of discard_granularity.

If some day SSD supports whole device sized discard_granularity, it
will be very interesting then...

The basic idea is, if a large discard bio received
by raid0_make_request(), for example it requests to discard chunk 1
to 24 on a raid0 device built by 4 SSDs. This large discard bio will
be split and written to each SSD as the following layout,
	SSD1: C1,C5,C9,C13,C17,C21
	SSD2: C2,C6,C10,C14,C18,C22
	SSD3: C3,C7,C11,C15,C19,C23
	SSD4: C4,C8,C12,C16,C20,C24
Current raid0 code will call generic_make_request() for 24 times for
each split bio. But it is possible to calculate the final layout of
each split bio, so we can combine all the bios into four per-SSD large
bio, like this,
	bio1 (on SSD1): C{1,5,9,13,17,21}
	bio2 (on SSD2): C{2,6,10,14,18,22}
	bio3 (on SSD3): C{3,7,11,15,19,23}
	bio4 (on SSD4): C{4,8,12,16,20,24}
Now we only need to call generic_make_request() for 4 times.

The code is not simple, I need more time to write text to complain how
it works. Currently you can treat it as a proof of concept.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/raid0.c | 224 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 224 insertions(+)

diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index 258986a..73c5fac 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -452,6 +452,225 @@ static inline int is_io_in_chunk_boundary(struct mddev *mddev,
 	}
 }
 
+
+struct bio_record {
+	sector_t	bi_sector;
+	unsigned long	sectors;
+	struct md_rdev	*rdev;
+};
+
+static void handle_discard_request(struct mddev *mddev, struct bio *bio)
+{
+	struct bio_record *recs = NULL;
+	struct bio *split;
+	struct r0conf *conf = mddev->private;
+	sector_t sectors, sector;
+	struct strip_zone *first_zone;
+	int zone_idx;
+	sector_t zone_start, zone_end;
+	int nr_strip_zones = conf->nr_strip_zones;
+	int disks;
+	int first_rdev_idx = -1, rdev_idx;
+	struct md_rdev *first_rdev;
+	unsigned int chunk_sects = mddev->chunk_sectors;
+
+	sector = bio->bi_iter.bi_sector;
+	first_zone = find_zone(conf, &sector);
+	first_rdev = map_sector(mddev, first_zone, sector, &sector);
+
+	sectors = chunk_sects -
+		(likely(is_power_of_2(chunk_sects))
+		 ? (sector & (chunk_sects - 1))
+		 : sector_div(sector, chunk_sects));
+
+	/* if bio size is not exceed a chunk boundary,
+	 * simply handle it here.
+	 */
+	if (sectors >= bio_sectors(bio)) {
+		bio->bi_bdev = first_rdev->bdev;
+		bio->bi_iter.bi_sector = sector + first_zone->dev_start +
+					 first_rdev->data_offset;
+		if (unlikely(!blk_queue_discard(bdev_get_queue(bio->bi_bdev))))
+			/* Just ignore it */
+			bio_endio(bio);
+		else
+			generic_make_request(bio);
+		return;
+	}
+
+	/* bio is large enough to be split, allocate recs firstly */
+	disks = mddev->raid_disks;
+	recs = kcalloc(disks, sizeof(struct bio_record), GFP_NOIO);
+	if (recs == NULL) {
+		printk(KERN_ERR "md/raid0:%s: failed to allocate memory " \
+				"for bio_record", mdname(mddev));
+		bio->bi_error = -ENOMEM;
+		bio_endio(bio);
+		return;
+	}
+
+	zone_idx = first_zone - conf->strip_zone;
+	for (rdev_idx = 0; rdev_idx < first_zone->nb_dev; rdev_idx++) {
+		struct md_rdev *rdev;
+
+		rdev = conf->devlist[zone_idx * disks + rdev_idx];
+		recs[rdev_idx].rdev = rdev;
+		if (rdev == first_rdev)
+			first_rdev_idx = rdev_idx;
+	}
+
+	/* Restore due to sector_div */
+	sector = bio->bi_iter.bi_sector;
+	recs[first_rdev_idx].bi_sector = sector + first_zone->dev_start;
+	recs[first_rdev_idx].sectors = sectors;
+	BUG_ON(recs[first_rdev_idx].rdev != first_rdev);
+
+	/* recs[first_rdev_idx] is initialized with 'sectors', we need to
+	 * handle the rested sectors, which is sotred in 'sectors' too.
+	 */
+	sectors = bio_sectors(bio) - sectors;
+
+	/* bio may not be chunk size aligned, the split bio on first rdev
+	 * may not be chunk size aligned too. But the rested split bios
+	 * on rested rdevs must be chunk size aligned, and aligned to
+	 * round down chunk number.
+	 */
+	zone_end = first_zone->zone_end;
+	rdev_idx = first_rdev_idx + 1;
+	sector = likely(is_power_of_2(chunk_sects))
+		 ? sector & (~(chunk_sects - 1))
+		 : chunk_sects * (sector/chunk_sects);
+
+	while (rdev_idx < first_zone->nb_dev) {
+		recs[rdev_idx].bi_sector = sector + first_zone->dev_start;
+		if (sectors <= chunk_sects) {
+			recs[rdev_idx].sectors = sectors;
+			goto issue;
+		}
+		recs[rdev_idx].sectors = chunk_sects;
+		sectors -= chunk_sects;
+		rdev_idx++;
+	}
+
+	sector += chunk_sects;
+	zone_start = sector + first_zone->dev_start;
+	if (zone_start == zone_end) {
+		zone_idx++;
+		zone_start = conf->strip_zone[zone_idx].dev_start;
+	}
+
+	while (zone_idx < nr_strip_zones) {
+		int rdevs_in_zone = conf->strip_zone[zone_idx].nb_dev;
+		int chunks_per_rdev, rested_chunks, rested_sectors;
+		sector_t zone_sectors, grow_sectors;
+		int add_rested_sectors = 0;
+
+		zone_end = conf->strip_zone[zone_idx].zone_end;
+		zone_sectors = zone_end - zone_start;
+		chunks_per_rdev = sectors;
+		rested_sectors =
+			sector_div(chunks_per_rdev, chunk_sects * rdevs_in_zone);
+		rested_chunks = rested_sectors;
+		rested_sectors = sector_div(rested_chunks, chunk_sects);
+
+		if ((chunks_per_rdev * chunk_sects) > zone_sectors)
+			chunks_per_rdev = zone_sectors/chunk_sects;
+
+		/* rested_chunks and rested_sectors go into next zone, we won't
+		 * handle them in this zone. Set them to 0.
+		 */
+		if ((chunks_per_rdev * chunk_sects) == zone_sectors &&
+		    (rested_chunks != 0 || rested_sectors != 0)) {
+			if (rested_chunks != 0)
+				rested_chunks = 0;
+			if (rested_sectors != 0)
+				rested_sectors = 0;
+		}
+
+		if (rested_chunks == 0 && rested_sectors != 0)
+			add_rested_sectors = 1;
+
+		for (rdev_idx = 0; rdev_idx < rdevs_in_zone; rdev_idx++) {
+			/* if .sectors is not initailized (== 0), it indicates
+			 * .bi_sector is not initialized neither. We initiate
+			 * .bi_sector firstly, then set .sectors by
+			 * grow_sectors.
+			 */
+			if (recs[rdev_idx].sectors == 0)
+				recs[rdev_idx].bi_sector = zone_start;
+			grow_sectors = chunks_per_rdev * chunk_sects;
+			if (rested_chunks) {
+				grow_sectors += chunk_sects;
+				rested_chunks--;
+				if (rested_chunks == 0 &&
+				    rested_sectors != 0) {
+					recs[rdev_idx].sectors += grow_sectors;
+					sectors -= grow_sectors;
+					add_rested_sectors = 1;
+					continue;
+				}
+			}
+
+			/* if add_rested_sectors != 0, it indicates
+			 * rested_sectors != 0
+			 */
+			if (add_rested_sectors)
+				grow_sectors += rested_sectors;
+			recs[rdev_idx].sectors += grow_sectors;
+			sectors -= grow_sectors;
+			if (add_rested_sectors)
+				break;
+		}
+
+		if (sectors == 0)
+			break;
+		zone_start = zone_end;
+		zone_idx++;
+		BUG_ON(zone_start != conf->strip_zone[zone_idx].dev_start);
+	}
+
+
+issue:
+	/* recs contains the re-ordered requests layout, now we can
+	 * chain split bios from recs
+	 */
+	for (rdev_idx = 0; rdev_idx < disks; rdev_idx++) {
+		if (rdev_idx == first_rdev_idx ||
+		    recs[rdev_idx].sectors == 0)
+			continue;
+		split = bio_split(bio,
+				  recs[rdev_idx].sectors,
+				  GFP_NOIO,
+				  fs_bio_set);
+		if (split == NULL)
+			break;
+		bio_chain(split, bio);
+		BUG_ON(split->bi_iter.bi_size != recs[rdev_idx].sectors << 9);
+		split->bi_bdev = recs[rdev_idx].rdev->bdev;
+		split->bi_iter.bi_sector = recs[rdev_idx].bi_sector +
+				recs[rdev_idx].rdev->data_offset;
+
+		if (unlikely(!blk_queue_discard(
+				bdev_get_queue(split->bi_bdev))))
+			/* Just ignore it */
+			bio_endio(split);
+		else
+			generic_make_request(split);
+	}
+	BUG_ON(bio->bi_iter.bi_size != recs[first_rdev_idx].sectors << 9);
+	bio->bi_iter.bi_sector = recs[first_rdev_idx].bi_sector +
+					recs[first_rdev_idx].rdev->data_offset;
+	bio->bi_bdev = recs[first_rdev_idx].rdev->bdev;
+
+	if (unlikely(!blk_queue_discard(bdev_get_queue(bio->bi_bdev))))
+		/* Just ignore it */
+		bio_endio(bio);
+	else
+		generic_make_request(bio);
+
+	kfree(recs);
+}
+
 static void raid0_make_request(struct mddev *mddev, struct bio *bio)
 {
 	struct strip_zone *zone;
@@ -463,6 +682,11 @@ static void raid0_make_request(struct mddev *mddev, struct bio *bio)
 		return;
 	}
 
+	if (unlikely(bio_op(bio) == REQ_OP_DISCARD)) {
+		handle_discard_request(mddev, bio);
+		return;
+	}
+
 	do {
 		sector_t sector = bio->bi_iter.bi_sector;
 		unsigned chunk_sects = mddev->chunk_sectors;

^ permalink raw reply related

* Re: raid0 vs. mkfs
From: Coly Li @ 2016-12-07 12:03 UTC (permalink / raw)
  To: Avi Kivity, NeilBrown; +Cc: linux-raid, linux-block
In-Reply-To: <ccf447df-e7b5-0a08-7753-67ee4ad574ef@suse.de>

On 2016/12/7 下午7:50, Coly Li wrote:
> On 2016/11/30 上午6:45, Avi Kivity wrote:
>> On 11/29/2016 11:14 PM, NeilBrown wrote:
> [snip]
> 
>>>> So I disagree that all the work should be pushed to the merging layer.
>>>> It has less information to work with, so the fewer decisions it has to
>>>> make, the better.
>>> I think that the merging layer should be as efficient as it reasonably
>>> can be, and particularly should take into account plugging.  This
>>> benefits all callers.
>>
>> Yes, but plugging does not mean "please merge anything you can until the
>> unplug".
>>
>>> If it can be demonstrated that changes to some of the upper layers bring
>>> further improvements with acceptable costs, then certainly it is good to
>>> have those too.
>>
>> Generating millions of requests only to merge them again is
>> inefficient.  It happens in an edge case (TRIM of the entirety of a very
>> large RAID), but it already caused on user to believe the system
>> failed.  I think the system should be more robust than that.
> 
> Neil,
> 
> As my understand, if a large discard bio received by
> raid0_make_request(), for example it requests to discard chunk 1 to 24
> on a raid0 device built by 4 SSDs. This large discard bio will be split
> and written to each SSD as the following layout,
> 
> SSD1: C1,C5,C9,C13,C17,C21
> SSD2: C2,C6,C10,C14,C18,C22
> SSD3: C3,C7,C11,C15,C19,C23
> SSD4: C4,C8,C12,C16,C20,C24
> 
> Current raid0 code will call generic_make_request() for 24 times for
> each split bio. But it is possible to calculate the final layout of each
> split bio, so we can combine all the bios into four per-SSD large bio,
> like this,
> 
> bio1 (on SSD1): C{1,5,9,13,17,21}
> bio2 (on SSD2): C{2,6,10,14,18,22}
> bio3 (on SSD3): C{3,7,11,15,19,23}
> bio4 (on SSD4): C{4,8,12,16,20,24}
> 
> Now we only need to call generic_make_request() for 4 times. Rebuild the
> per-device discard bios is more efficient in raid0 code then in block
> layer. There are some reasons that I know,
> - there are splice timeout, block layer cannot merge all split bio into
> one large bio before time out.
> - rebuilt per-device bio in raid0 is just by a few calculation, block
> layer does merge on queue with list operations, it is slower.
> - raid0 code knows its on disk layout, so rebuild per-device bio is
> possible here. block layer has no idea on raid0 layout, it can only do
> request merge.
> 
> Avi,
> 
> I compose a prototype patch, the code is not simple, indeed it is quite
> complicated IMHO.
> 
> I do a little research, some NVMe SSDs support whole device size
> DISCARD, also I observe mkfs.xfs sends out a raid0 device size DISCARD
> bio to block layer. But raid0_make_request() only receives 512KB size
> DISCARD bio, block/blk-lib.c:__blkdev_issue_discard() splits the
> original large bio into 512KB small bios, the limitation is from
> q->limits.discard_granularity.
> 
> At this moment, I don't know why a q->limits.discard_granularity is
> 512KB even the underlying SSD supports whole device size discard. We
> also need to fix q->limits.discard_granularity, otherwise
> block/blk-lib.c:__blkdev_issue_discard() still does an inefficient loop
> to split the original large discard bio into smaller ones and sends them
> to raid0 code by next_bio().
> 
> I also CC this email to linux-block@vger.kernel.org to ask for help. My
> question is, if a NVMe SSD supports whole-device-size DISCARD, is
> q->limits.discard_granularity still necessary ?
> 
> Here I also attach my prototype patch as a proof of concept, it is
> runnable with Linux 4.9-rc7.

Aha! It is not limits.discard_granularity, it is
limits.max_discard_sectors. Which is set in raid0.c:raid0_run() by
blk_queue_max_discard_sectors(). Here limits.maxZ_discard_sectors is set
to raid0 chunk size. Interesting ....

And, linux-block@vger.kernel.org, please ignore my noise.

Coly

^ permalink raw reply

* Re: [BUG] MD/RAID1 hung forever on freeze_array
From: Jinpu Wang @ 2016-12-07 14:17 UTC (permalink / raw)
  To: Coly Li; +Cc: linux-raid, Shaohua Li, Neil Brown, Nate Dailey
In-Reply-To: <CAMGffEkbddHsJ4LSWcd1DmueM8XTm1pUiowa=_MUHmysVEFN_w@mail.gmail.com>

On Tue, Nov 29, 2016 at 12:15 PM, Jinpu Wang
<jinpu.wang@profitbricks.com> wrote:
> On Mon, Nov 28, 2016 at 10:10 AM, Coly Li <colyli@suse.de> wrote:
>> On 2016/11/28 下午5:02, Jinpu Wang wrote:
>>> On Mon, Nov 28, 2016 at 9:54 AM, Coly Li <colyli@suse.de> wrote:
>>>> On 2016/11/28 下午4:24, Jinpu Wang wrote:
>>>>> snip
>>>>>>>>
>>>>>>>> every time nr_pending is 1 bigger then (nr_queued + 1), so seems we
>>>>>>>> forgot to increase nr_queued somewhere?
>>>>>>>>
>>>>>>>> I've noticed (commit ccfc7bf1f09d61)raid1: include bio_end_io_list in
>>>>>>>> nr_queued to prevent freeze_array hang. Seems it fixed similar bug.
>>>>>>>>
>>>>>>>> Could you give your suggestion?
>>>>>>>>
>>>>>>> Sorry, forgot to mention kernel version is 4.4.28
>>>>>>
>>>>>> This commit is Cced to stable@vger.kernel.org for v4.3+, do you use a
>>>>>> stable kernel or a distribution with 4.4.28 kernel ?
>>>>>>
>>>>>> Coly
>>>>>>
>>>>>>
>>>>> Hi Coly,
>>>>>
>>>>> I'm using Debian8 with 4.4.28 kernel.
>>>>
>>>> Hi Jinpu,
>>>>
>>>> Is it possible for your to run a upstream kernel or vanilla kernel to
>>>> test whether the issue still can be reproduced ? Then we can know
>>>> whether it is an upstream bug or a distro issue.
>>>>
>>>> Thanks.
>>>>
>>>> Coly
>>>
>>> Hi Coly,
>>>
>>> I did run kernel 4.4.34 (I download from kernel.org), I can reproduce
>>> the same bug.
>>>
>>> I can also try latest 4.8 or 4.9 rc kernel, if you think it's necessary?
>>>
>> Yes, please. If it can be reproduced on upstream kernel by a set of
>> scripts, it will be very helpful to debug and fix this issue.
>>
>> Thanks in advance.
>>
>> Coly
>
> Hi Coly,
>
> I tried with kernel 4.9-cr7, I can't reproduce it with my testcase anymore.
>
> It's hard to say the bug is fixed or harder to reproduce because code
> changed a lot.
>
> --
> Jinpu Wang
> Linux Kernel Developer
>
> ProfitBricks GmbH
> Greifswalder Str. 207
> D - 10405 Berlin
>
> Tel:       +49 30 577 008  042
> Fax:      +49 30 577 008 299
> Email:    jinpu.wang@profitbricks.com
> URL:      https://www.profitbricks.de
>
> Sitz der Gesellschaft: Berlin
> Registergericht: Amtsgericht Charlottenburg, HRB 125506 B
> Geschäftsführer: Achim Weiss
Hi,

I continue debug the bug:

20161207
crash> struct r1conf 0xffff8800b9792000
struct r1conf {
  mddev = 0xffff88022db03800,
  mirrors = 0xffff880227729200,
  raid_disks = 2,
  next_resync = 18446744073709527039,
  start_next_window = 18446744073709551615,
  current_window_requests = 0,
  next_window_requests = 0,
  device_lock = {
    {
      rlock = {
        raw_lock = {
          val = {
            counter = 0
          }
        }
      }
    }
  },
  retry_list = {
    next = 0xffff8800afe690c0,
    prev = 0xffff8800b96acac0
  },
  bio_end_io_list = {
    next = 0xffff8800b96ac2c0,
    prev = 0xffff88003735f140
  },
  pending_bio_list = {
    head = 0x0,
    tail = 0x0
  },
  pending_count = 0,
  wait_barrier = {
    lock = {
      {
        rlock = {
          raw_lock = {
            val = {
              counter = 0
            }
          }
        }
      }
    },
    task_list = {
      next = 0xffff880221bc7770,
      prev = 0xffff8800ad44bc88
    }
  },
  resync_lock = {
    {
      rlock = {
        raw_lock = {
          val = {
            counter = 0
          }
        }
      }
    }
  },
  nr_pending = 948,
  nr_waiting = 9,
  nr_queued = 946, // again we need one more to finished wait_event.
  barrier = 0,
  array_frozen = 1,
  fullsync = 0,
  recovery_disabled = 2,
  poolinfo = 0xffff88022c567580,
  r1bio_pool = 0xffff88022fdccea0,
  r1buf_pool = 0x0,
  tmppage = 0xffffea0002bf1600,
  thread = 0x0,
  cluster_sync_low = 0,
  cluster_sync_high = 0
}
crash> exit

on conf->bio_end_io_list we have 91 entries.
crash> list -H 0xffff88003735f140
ffff8800b9792048
ffff8800b96ac2c0
ffff8800b96ac1c0

snip
ffff88022243dfc0
ffff88022243dc40
crash>

on conf->retry_list we have 855
crash> list -H 0xffff8800b96acac0
ffff8800b9792038
ffff8800afe690c0
snip

list -H 0xffff8800b96acac0 r1bio.retry_list -s r1bio

ffff8800b9791ff8
struct r1bio {
  remaining = {
    counter = 0
  },
  behind_remaining = {
    counter = 0
  },
  sector = 18446612141670676480, // corrupted?
  start_next_window = 18446612141565972992, //ditto
  sectors = 2,
  state = 18446744073709527039, // ditto
  mddev = 0xffffffffffffffff,
  master_bio = 0x0,
  read_disk = 0,
  retry_list = {
    next = 0xffff8800afe690c0,
    prev = 0xffff8800b96acac0
  },
  behind_bvecs = 0xffff8800b96ac2c0,
  behind_page_count = 926282048,
  bios = 0xffff8800b9792058
}
ffff8800afe69080
struct r1bio {
  remaining = {
    counter = 0
  },
  behind_remaining = {
    counter = 0
  },
  sector = 1566,
  start_next_window = 0,
  sectors = 128,
  state = 257,
  mddev = 0xffff88022db03800,
  master_bio = 0xffff8800371c1f00,
  read_disk = 0,
  retry_list = {
    next = 0xffff8800ad41a540,
    prev = 0xffff8800b9792038
  },
  behind_bvecs = 0x0,
  behind_page_count = 0,
  bios = 0xffff8800afe690e0
}

check conf->bio_end_io_list
list -H 0xffff88003735f140 r1bio.retry_list -s r1bio

ffff8800b9792008
struct r1bio {
  remaining = {
    counter = 661819904
  },
  behind_remaining = {
    counter = -30718
  },
  sector = 2,
  start_next_window = 18446744073709527039, // corrupted?
  sectors = -1,                                                 // corrupted?
  state = 0,
  mddev = 0x0,                                               // corrupted?
  master_bio = 0xffff8800afe690c0,
  read_disk = -1184183616,                             // ?
  retry_list = {
    next = 0xffff8800b96ac2c0,
    prev = 0xffff88003735f140
  },
  behind_bvecs = 0x0,
  behind_page_count = 0,
  bios = 0xffff8800b9792068
}
ffff8800b96ac280
struct r1bio {
  remaining = {
    counter = 0
  },
  behind_remaining = {
    counter = 0
  },
  sector = 980009,
  start_next_window = 0,
  sectors = 16,
  state = 257,
  mddev = 0xffff88022db03800,
  master_bio = 0xffff8800370b0600,
  read_disk = 0,
  retry_list = {
    next = 0xffff8800b96ac1c0,
    prev = 0xffff8800b9792048
  },
  behind_bvecs = 0x0,
  behind_page_count = 0,
  bios = 0xffff8800b96ac2e0
}

I still have no clue what it could be, any one has idea?

-- 
Jinpu Wang
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin

Tel:       +49 30 577 008  042
Fax:      +49 30 577 008 299
Email:    jinpu.wang@profitbricks.com
URL:      https://www.profitbricks.de

Sitz der Gesellschaft: Berlin
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B
Geschäftsführer: Achim Weiss

^ permalink raw reply

* Re: Recovering a RAID6 after all disks were disconnected
From: John Stoffel @ 2016-12-07 14:31 UTC (permalink / raw)
  To: Giuseppe Bilotta; +Cc: linux-raid
In-Reply-To: <CAOxFTczhq3Gr4WW=-sMamaYDmAoMFGh7kkanwJzhXe5NBdMHHQ@mail.gmail.com>

Giuseppe> my situation is the following: I have a small 4-disk JBOD that I use
Giuseppe> to hold a RAID6 software raid setup controlled by mdraid (currently
Giuseppe> Debian version 3.4-4 on Linux kernel 4.7.8-1

Giuseppe> I've had sporadic resets of the JBOD due to a variety of reasons
Giuseppe> (power failures or disk failures —the JBOD has the bad habit of
Giuseppe> resetting when one disk has an I/O error, which causes all of the
Giuseppe> disks to go offline temporarily).

Please toss that JBOD out the window!  *grin*

Giuseppe>  When this happens, all the disks get kicked from the RAID,
Giuseppe> as md fails to find them until the reset of the JBOD is
Giuseppe> complete. When the disks come back online, even if it's just
Giuseppe> a few seconds later, the RAID remains in the failed
Giuseppe> configuration with all 4 disks missing, of course.

Giuseppe> Normally, the way I would proceed in this case is to unmount
Giuseppe> the filesystem sitting on top of the RAID, stop the RAID,
Giuseppe> and then try to start it again, which works reasonably well
Giuseppe> (aside from the obvious filesystem check that is often
Giuseppe> needed).

Giuseppe> The thing happened again a couple of days ago, but this time
Giuseppe> I tried re-adding the disks directly when they came back
Giuseppe> online, using mdadm -a and confident that since they _had_
Giuseppe> been recently part of the array, the array would actually go
Giuseppe> back to work fine —except that this is not the case when ALL
Giuseppe> disks were kicked out of the array! Instead, what happened
Giuseppe> was that all the disks were marked as 'spare' and the RAID
Giuseppe> would not assemble anymore.

Can you please send us the full details of each disk using the
command:

  mdadm -E /dev/sda1

Where of course 'a' and '1' depend on whether or not you are using
whole disk or partitioned disks for your arrays.

You might be able to just for the three spare disks (assumed in this
case to be sda1, sdb1, sdc1; but you need to be sure first!) to
assemble into a full array with:

 mdadm -A /dev/md50 /dev/sda1 /dev/sdb1 /dev/sdc1

And if that works, great.   If not, post the error message(s) you get
back.

Basically provide more details on your setup so we can help you.

John

Giuseppe> At this point I stopped everything and made a full copy of
Giuseppe> the RAID disks (lucky me, I had just bought a new JBOD for
Giuseppe> an upgrade, and a bunch of new disks, even if one of them is
Giuseppe> apparently defective so I have only been able to backup 3 of
Giuseppe> the 4 disks) and I have been toying around with ways to
Giuseppe> recover the array by playing on the copies I've made (I've
Giuseppe> set the original disks to readonly at the kernel level just
Giuseppe> to be sure).

Giuseppe> So now my situation is this, and I would like to know if there is
Giuseppe> something I can try to recover the RAID (I've made a few tests that I
Giuseppe> will describe momentarily). (I would like to know if there is any
Giuseppe> possibility for md to handle these kind of issue —all disks in a RAID
Giuseppe> going temporarily offline— more gracefully, which is likely needed for
Giuseppe> a lot of home setup where SATA is used instead of SAS).

Giuseppe> So one thing that I've done is to hack around the superblock in the
Giuseppe> disks (copies) to put back the device roles as they were (getting the
Giuseppe> information from the pre-failure dmesg output). (By the way, I've been
Giuseppe> using Andy's Binary Editor for the superblock editing, so if anyone is
Giuseppe> interested in a be.ini for mdraid v1 superblocks, including checksum
Giuseppe> verification, I'd be happy to share). Specifically, I've left the
Giuseppe> device number untouched, but I have edited the dev_roles array so that
Giuseppe> the slots corresponding to the dev_number from all the disks map to
Giuseppe> appropriate device roles.

Giuseppe> I can then assemble the array with only 3 of 4 disks (because I do not
Giuseppe> have a copy of the fourth, essentially) and force-run it. However,
Giuseppe> when I do this, I get two things:

Giuseppe> (1) a complaint about the bitmap being out of date (number of events
Giuseppe> too low by 3) and
Giuseppe> (2) I/O errors on logical block 0 (and the RAID data thus completely
Giuseppe> inaccessible)

Giuseppe> I'm now wondering about what I should try next. Prevent a resync by
Giuseppe> matching the event count with that of the bitmap (or conversely)? Try
Giuseppe> a different permutation of the roles? (I have triple-checked but who
Giuseppe> knows)? Try a different subset of disks? Try and recreate the array?

Giuseppe> Thanks in advance for any suggestion you may have,

Giuseppe> -- 
Giuseppe> Giuseppe "Oblomov" Bilotta
Giuseppe> --
Giuseppe> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
Giuseppe> the body of a message to majordomo@vger.kernel.org
Giuseppe> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v2 00/12] Partial Parity Log for MD RAID 5
From: Artur Paszkiewicz @ 2016-12-07 14:36 UTC (permalink / raw)
  To: NeilBrown, shli; +Cc: linux-raid
In-Reply-To: <87fum04m0m.fsf@notabene.neil.brown.name>

On 12/07/2016 01:32 AM, NeilBrown wrote:
> 
> I would expect to see as description of what a PPL actually is and how
> it works here... but there is none.
> 
> The change-log for patch 06 has a tiny bit more information which is
> just enough to be able to start trying to understand the code, but it
> isn't much.
> And none of this description gets into the code, or into the
> Documentation/.  This makes it hard to review and hard to maintain.
> 
> Remember: if you want people to review you code, it is in your interest
> to make it easy.  That means give lots of details.

Hi Neil,

Thank you for taking the time to look at this and for your feedback. I
didn't try to make it hard to review... Sometimes it's easy to forget
how non-obvious things are after looking at them for too long :) I will
improve the descriptions and address the issues that you found in the
next version of the patches.

Thanks,
Artur

^ permalink raw reply

* Re: [PATCH v2 03/12] raid5-cache: add a new policy
From: Artur Paszkiewicz @ 2016-12-07 14:36 UTC (permalink / raw)
  To: NeilBrown, shli; +Cc: linux-raid
In-Reply-To: <87d1h44lcj.fsf@notabene.neil.brown.name>

On 12/07/2016 01:46 AM, NeilBrown wrote:
> On Tue, Dec 06 2016, Artur Paszkiewicz wrote:
> 
>> Add a source file for the new policy implementation and allow selecting
>> the policy based on the policy_type parameter in r5l_init_log().
>>
>> Introduce a new flag for rdev state flags to allow enabling the new
>> policy from userspace.
> 
> This seems odd.  Why is this a per-device flag?
> It makes sense for "journal" to be a per-device flag, because only one
> device is the journal device and it is obviously different from the
> others.
> 
> But with the ppl, all devices serve as journal devices.  So we would
> need to set journal_ppl on all devices?  What happens if you set it on
> some, but not others?  I see you get an error.
> 
> I think some sort of array-wide setting would make more sense, would it
> not?

Yes, it would. The problem exists only for external metadata, because
for native there is a feature flag in the superblock and a corresponding
flag in mddev->flags. Patch 12 adds a sysfs attribute to control the
policy at runtime but it would have to be moved out of raid5 personality
into the main md code. I didn't like the idea of adding something
specific to raid5 to generic code and visible in sysfs for unrelated
raid levels.

> And what is an RWH???  A Really Weird Handle ??
> 
> I guess it is probably a Raid5 Write Hole ?  At the very least there
> should be a comment explaining this when you define the enum. (remember,
> you are trying to make it easier for reviewers).

That's right, RWH stands for RAID Write Hole. I think I introduced it in
the cover letter, but I'll explain it also in the code.

> It might almost make sense for bitmap/metadata to be used here.
> It can currently be "external" "internal" "clustered".
> Allow also "journalled" or "partial-partiy-log" ???
> 
> Maybe not ... but I'd definitely prefer a global setting, and one that
> didn't use an obscure abbreviation.

So do you think the sysfs attribute from patch 12 ("rwh_policy") could
be made global? This would simplify things but it doesn't seem right.
And about the abbreviation, should it be called "write_hole_policy" or
"raid5_write_hole_policy"? Maybe using bitmap/metadata is not a bad
idea...

Artur

^ permalink raw reply

* Re: [PATCH v2 05/12] raid5-ppl: Partial Parity Log implementation
From: Artur Paszkiewicz @ 2016-12-07 14:37 UTC (permalink / raw)
  To: NeilBrown, shli; +Cc: linux-raid
In-Reply-To: <87a8c84jwo.fsf@notabene.neil.brown.name>

On 12/07/2016 02:17 AM, NeilBrown wrote:
> On Tue, Dec 06 2016, Artur Paszkiewicz wrote:
> 
>> This implements the write logging functionality, using the policy logic
>> introduced in previous patches.
>>
>> PPL is a distributed log - data is stored on all RAID member drives in
>> the metadata area. PPL is written to the parity disk of a particular
>> stripe. Distributed log is implemented by using one r5l_log instance per
>> each array member. They are grouped in child_logs array in struct
>> ppl_conf, which is assigned to a common parent log. This parent log
>> serves as a proxy and is used in raid5 personality code - it is assigned
>> as _the_ log in r5conf->log. The child logs are where all the real work
>> is done.
>>
>> The PPL consists of a 4KB header (struct ppl_header), and at least 128KB
>> for partial parity data. It is stored right after the array data (for
>> IMSM) or in the bitmap area (super 1.1 and 1.2) and can be overwritten
>> even at each array write request.
>>
>> Attach a page for holding the partial parity data to each stripe_head.
>> Allocate it only if mddev has the MD_HAS_PPL flag set.
>>
>> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
>> ---
>>  drivers/md/raid5-cache.c |  12 +-
>>  drivers/md/raid5-cache.h |   6 +
>>  drivers/md/raid5-ppl.c   | 594 ++++++++++++++++++++++++++++++++++++++++++++++-
>>  drivers/md/raid5.c       |  15 +-
>>  drivers/md/raid5.h       |   1 +
>>  5 files changed, 620 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
>> index fa82b9a..be534d8 100644
>> --- a/drivers/md/raid5-cache.c
>> +++ b/drivers/md/raid5-cache.c
>> @@ -119,8 +119,8 @@ static bool r5l_has_free_space(struct r5l_log *log, sector_t size)
>>  	return log->device_size > used_size + size;
>>  }
>>  
>> -static void __r5l_set_io_unit_state(struct r5l_io_unit *io,
>> -				    enum r5l_io_unit_state state)
>> +void __r5l_set_io_unit_state(struct r5l_io_unit *io,
>> +			     enum r5l_io_unit_state state)
>>  {
>>  	if (WARN_ON(io->state >= state))
>>  		return;
>> @@ -340,7 +340,7 @@ static void r5c_finish_cache_stripe(struct stripe_head *sh)
>>  	}
>>  }
>>  
>> -static void r5l_io_run_stripes(struct r5l_io_unit *io)
>> +void r5l_io_run_stripes(struct r5l_io_unit *io)
>>  {
>>  	struct stripe_head *sh, *next;
>>  
>> @@ -935,7 +935,7 @@ static sector_t r5l_reclaimable_space(struct r5l_log *log)
>>  				 r5c_calculate_new_cp(conf));
>>  }
>>  
>> -static void r5l_run_no_mem_stripe(struct r5l_log *log)
>> +void r5l_run_no_mem_stripe(struct r5l_log *log)
>>  {
>>  	struct stripe_head *sh;
>>  
>> @@ -1039,7 +1039,7 @@ static void r5l_log_flush_endio(struct bio *bio)
>>   * only write stripes of an io_unit to raid disks till the io_unit is the first
>>   * one whose data/parity is in log.
>>   */
>> -static void __r5l_flush_stripe_to_raid(struct r5l_log *log)
>> +void __r5l_flush_stripe_to_raid(struct r5l_log *log)
>>  {
>>  	bool do_flush;
>>  
>> @@ -1359,7 +1359,7 @@ bool r5l_log_disk_error(struct r5conf *conf)
>>  	if (!log)
>>  		ret = test_bit(MD_HAS_JOURNAL, &conf->mddev->flags);
>>  	else
>> -		ret = test_bit(Faulty, &log->rdev->flags);
>> +		ret = log->rdev && test_bit(Faulty, &log->rdev->flags);
>>  	rcu_read_unlock();
>>  	return ret;
>>  }
>> diff --git a/drivers/md/raid5-cache.h b/drivers/md/raid5-cache.h
>> index 4ba11d3..0446100 100644
>> --- a/drivers/md/raid5-cache.h
>> +++ b/drivers/md/raid5-cache.h
>> @@ -157,4 +157,10 @@ extern int r5l_handle_flush_request(struct r5l_log *log, struct bio *bio);
>>  extern void r5l_quiesce(struct r5l_log *log, int state);
>>  extern bool r5l_log_disk_error(struct r5conf *conf);
>>  
>> +extern void __r5l_set_io_unit_state(struct r5l_io_unit *io,
>> +				    enum r5l_io_unit_state state);
>> +extern void r5l_io_run_stripes(struct r5l_io_unit *io);
>> +extern void r5l_run_no_mem_stripe(struct r5l_log *log);
>> +extern void __r5l_flush_stripe_to_raid(struct r5l_log *log);
>> +
>>  #endif /* _RAID5_CACHE_H */
>> diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
>> index 263fad7..2d4c90f 100644
>> --- a/drivers/md/raid5-ppl.c
>> +++ b/drivers/md/raid5-ppl.c
>> @@ -14,7 +14,599 @@
>>  
>>  #include <linux/kernel.h>
>>  #include <linux/blkdev.h>
>> +#include <linux/slab.h>
>> +#include <linux/crc32c.h>
>> +#include <linux/module.h>
>> +#include <linux/raid/md_p.h>
>> +#include "md.h"
>>  #include "raid5.h"
>>  #include "raid5-cache.h"
>>  
>> -struct r5l_policy r5l_ppl;
>> +static bool ppl_debug;
>> +module_param(ppl_debug, bool, 0644);
>> +MODULE_PARM_DESC(ppl_debug, "Debug mode for md raid5 PPL");
>> +
>> +#define dbg(format, args...)						\
>> +do {									\
>> +	if (ppl_debug)							\
>> +		printk(KERN_DEBUG"[%d] %s() "format,			\
>> +			current->pid, __func__, ##args);		\
>> +} while (0)
> 
> Please don't do this.  Just use pr_debug(), and use
>  /sys/kernel/debug/dynamic_debug/control
> to turn them on and off.
> 
>> +
>> +struct ppl_conf {
>> +	int count;
>> +	struct r5l_log **child_logs;
>> +};
>> +
>> +struct ppl_header_entry {
>> +	__le64 data_sector;	/* Raid sector of the new data */
>> +	__le32 pp_size;		/* Length of partial parity */
>> +	__le32 data_size;	/* Length of data */
>> +	__u8 parity_disk;	/* Member disk containing parity */
>> +	__le32 checksum;	/* Checksum of this entry */
>> +} __packed;
> 
> Really?  "checksum" is 32bits but not aligned?
> I *think* you should be using get_unaligned_le32() to access this
> and put_unaligned_le32() to set it.
> 
>> +
>> +#define PPL_HEADER_SIZE PAGE_SIZE
>> +#define PPL_HDR_RESERVED 512
>> +#define PPL_HDR_ENTRY_SPACE \
>> +	(PPL_HEADER_SIZE - PPL_HDR_RESERVED - 3 * sizeof(u32) - sizeof(u64))
>> +#define PPL_HDR_MAX_ENTRIES \
>> +	(PPL_HDR_ENTRY_SPACE / sizeof(struct ppl_header_entry))
>> +#define PPL_ENTRY_SPACE_IMSM (128 * 1024)
>> +
>> +struct ppl_header {
>> +	__u8 reserved[PPL_HDR_RESERVED];/* Reserved space */
>> +	__le32 signature;		/* Signature (family number of volume) */
>> +	__le64 generation;		/* Generation number of PP Header */
> 
> This probably needs to use the 'unaligned' macros too.
> 
>> +	__le32 entries_count;		/* Number of entries in entry array */
>> +	__le32 checksum;		/* Checksum of PP Header */
>> +	struct ppl_header_entry entries[PPL_HDR_MAX_ENTRIES];
>> +} __packed;
> 
> ppl_header_entry doesn't seem to be a multiple of 4 bytes long.
> This means all the fields in it could be unaligned...
> 
> Maybe we should make this code refuse to compile except on x86 ???

Oh... you are right, this is bad. I hope I can still persuade some
people to fix those structures in the platform firmware, etc. Like
changing "parity_disk" to 4 bytes or inserting a padding there.

Artur

^ permalink raw reply

* Re: [PATCH v2 09/12] raid5-ppl: read PPL signature from IMSM metadata
From: Artur Paszkiewicz @ 2016-12-07 14:38 UTC (permalink / raw)
  To: NeilBrown, shli; +Cc: linux-raid
In-Reply-To: <871sxk4jj0.fsf@notabene.neil.brown.name>

On 12/07/2016 02:25 AM, NeilBrown wrote:
> On Tue, Dec 06 2016, Artur Paszkiewicz wrote:
> 
>> The PPL signature is used to determine if the stored PPL is valid for a
>> given array. With IMSM, the PPL signature should match the
>> orig_family_num field of the superblock. To avoid passing this value
>> from userspace, it can be read from the IMSM MPB when initializing the
>> log.
> 
> It is up to mdadm to determine if the PPL is valid.  It would only tell
> the kernel that a PPL exists if it is valid...

The kernel also has to know this value because it writes it to the PPL
header. So yet another sysfs attribute just for this value? How about
adding a directory similar to "bitmap" to hold all the PPL related
settings?

Artur

^ permalink raw reply

* Re: raid0 vs. mkfs
From: Shaohua Li @ 2016-12-07 16:59 UTC (permalink / raw)
  To: Coly Li; +Cc: Avi Kivity, NeilBrown, linux-raid, linux-block
In-Reply-To: <ccf447df-e7b5-0a08-7753-67ee4ad574ef@suse.de>

On Wed, Dec 07, 2016 at 07:50:33PM +0800, Coly Li wrote:
> On 2016/11/30 上午6:45, Avi Kivity wrote:
> > On 11/29/2016 11:14 PM, NeilBrown wrote:
> [snip]
> 
> >>> So I disagree that all the work should be pushed to the merging layer.
> >>> It has less information to work with, so the fewer decisions it has to
> >>> make, the better.
> >> I think that the merging layer should be as efficient as it reasonably
> >> can be, and particularly should take into account plugging.  This
> >> benefits all callers.
> > 
> > Yes, but plugging does not mean "please merge anything you can until the
> > unplug".
> > 
> >> If it can be demonstrated that changes to some of the upper layers bring
> >> further improvements with acceptable costs, then certainly it is good to
> >> have those too.
> > 
> > Generating millions of requests only to merge them again is
> > inefficient.  It happens in an edge case (TRIM of the entirety of a very
> > large RAID), but it already caused on user to believe the system
> > failed.  I think the system should be more robust than that.
> 
> Neil,
> 
> As my understand, if a large discard bio received by
> raid0_make_request(), for example it requests to discard chunk 1 to 24
> on a raid0 device built by 4 SSDs. This large discard bio will be split
> and written to each SSD as the following layout,
> 
> SSD1: C1,C5,C9,C13,C17,C21
> SSD2: C2,C6,C10,C14,C18,C22
> SSD3: C3,C7,C11,C15,C19,C23
> SSD4: C4,C8,C12,C16,C20,C24
> 
> Current raid0 code will call generic_make_request() for 24 times for
> each split bio. But it is possible to calculate the final layout of each
> split bio, so we can combine all the bios into four per-SSD large bio,
> like this,
> 
> bio1 (on SSD1): C{1,5,9,13,17,21}
> bio2 (on SSD2): C{2,6,10,14,18,22}
> bio3 (on SSD3): C{3,7,11,15,19,23}
> bio4 (on SSD4): C{4,8,12,16,20,24}
> 
> Now we only need to call generic_make_request() for 4 times. Rebuild the
> per-device discard bios is more efficient in raid0 code then in block
> layer. There are some reasons that I know,
> - there are splice timeout, block layer cannot merge all split bio into
> one large bio before time out.
> - rebuilt per-device bio in raid0 is just by a few calculation, block
> layer does merge on queue with list operations, it is slower.
> - raid0 code knows its on disk layout, so rebuild per-device bio is
> possible here. block layer has no idea on raid0 layout, it can only do
> request merge.

Thanks for doing this, Coly! For raid0, this totally makes sense. The raid0
zones make things a little complicated though. I just had a brief look of your
proposed patch, which looks really complicated. I'd suggest something like
this:
1. split the bio according to zone boundary.
2. handle the splitted bio. since the bio is within zone range, calculating
the start and end sector for each rdev should be easy.

This will create slightly more bio to each rdev (not too many, since there
aren't too many zones in practice) and block layer should easily merge these
bios without much overhead. The benefit is a much simpler implementation.

> I compose a prototype patch, the code is not simple, indeed it is quite
> complicated IMHO.
> 
> I do a little research, some NVMe SSDs support whole device size
> DISCARD, also I observe mkfs.xfs sends out a raid0 device size DISCARD
> bio to block layer. But raid0_make_request() only receives 512KB size
> DISCARD bio, block/blk-lib.c:__blkdev_issue_discard() splits the
> original large bio into 512KB small bios, the limitation is from
> q->limits.discard_granularity.

please adjust the max discard sectors for the queue. The original setting is
chunk size.

Thanks,
Shaohua

^ permalink raw reply

* Re: [PATCH v2 00/12] Partial Parity Log for MD RAID 5
From: Shaohua Li @ 2016-12-07 17:09 UTC (permalink / raw)
  To: Artur Paszkiewicz; +Cc: NeilBrown, linux-raid
In-Reply-To: <84c59771-48b7-da47-3497-c82c6c6939c9@intel.com>

On Wed, Dec 07, 2016 at 03:36:01PM +0100, Artur Paszkiewicz wrote:
> On 12/07/2016 01:32 AM, NeilBrown wrote:
> > 
> > I would expect to see as description of what a PPL actually is and how
> > it works here... but there is none.
> > 
> > The change-log for patch 06 has a tiny bit more information which is
> > just enough to be able to start trying to understand the code, but it
> > isn't much.
> > And none of this description gets into the code, or into the
> > Documentation/.  This makes it hard to review and hard to maintain.
> > 
> > Remember: if you want people to review you code, it is in your interest
> > to make it easy.  That means give lots of details.
> 
> Hi Neil,
> 
> Thank you for taking the time to look at this and for your feedback. I
> didn't try to make it hard to review... Sometimes it's easy to forget
> how non-obvious things are after looking at them for too long :) I will
> improve the descriptions and address the issues that you found in the
> next version of the patches.

Havn't looked at the patches yet, being busy recently, sorry! When you repost
these, I'd like to know why we need another log for raid5 considering we
already had one to fix similar issue. What's the good/bad side of this new log?
There is such feature in Intel RSTe doesn't sound like a technical reason we
should support this.

Thanks,
Shaohua

^ permalink raw reply

* Re: Recovering a RAID6 after all disks were disconnected
From: Giuseppe Bilotta @ 2016-12-07 17:21 UTC (permalink / raw)
  To: John Stoffel; +Cc: linux-raid
In-Reply-To: <22600.7486.444800.536687@quad.stoffel.home>

Hello John, and thanks for your time

Giuseppe> I've had sporadic resets of the JBOD due to a variety of reasons
Giuseppe> (power failures or disk failures —the JBOD has the bad habit of
Giuseppe> resetting when one disk has an I/O error, which causes all of the
Giuseppe> disks to go offline temporarily).

John> Please toss that JBOD out the window!  *grin*

Well, that's exactly why I bought the new one which is the one I'm
currently using to host the backup disks I'm experimenting on! 8-)
However I suspect this is a misfeature common to many if not all
'home' JBODS which are all SATA based and only provide eSATA and/or
USB3 connection to the machine.

Giuseppe> The thing happened again a couple of days ago, but this time
Giuseppe> I tried re-adding the disks directly when they came back
Giuseppe> online, using mdadm -a and confident that since they _had_
Giuseppe> been recently part of the array, the array would actually go
Giuseppe> back to work fine —except that this is not the case when ALL
Giuseppe> disks were kicked out of the array! Instead, what happened
Giuseppe> was that all the disks were marked as 'spare' and the RAID
Giuseppe> would not assemble anymore.

John> Can you please send us the full details of each disk using the
John> command:
John>
John>   mdadm -E /dev/sda1
John>

Here it is. Notice that this is the result of -E _after_ the attempted
re-add while the RAID was running, which marked all the disks as
spares:

==8<=======

/dev/sdc:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x9
     Array UUID : 943d287e:af28b455:88a047f2:d714b8c6
           Name : labrador:oneforall  (local to host labrador)
  Creation Time : Fri Nov 30 19:57:45 2012
     Raid Level : raid6
   Raid Devices : 4

 Avail Dev Size : 5860271024 (2794.39 GiB 3000.46 GB)
     Array Size : 5860270080 (5588.79 GiB 6000.92 GB)
  Used Dev Size : 5860270080 (2794.39 GiB 3000.46 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
   Unused Space : before=262048 sectors, after=944 sectors
          State : clean
    Device UUID : 543f75ac:a1f3cf99:1c6b71d9:52e358b9

Internal Bitmap : 8 sectors from superblock
    Update Time : Sun Dec  4 17:11:19 2016
  Bad Block Log : 512 entries available at offset 80 sectors - bad
blocks present.
       Checksum : 1e2f00fc - correct
         Events : 31196

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : spare
   Array State : .... ('A' == active, '.' == missing, 'R' == replacing)

/dev/sdd:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x9
     Array UUID : 943d287e:af28b455:88a047f2:d714b8c6
           Name : labrador:oneforall  (local to host labrador)
  Creation Time : Fri Nov 30 19:57:45 2012
     Raid Level : raid6
   Raid Devices : 4

 Avail Dev Size : 5860271024 (2794.39 GiB 3000.46 GB)
     Array Size : 5860270080 (5588.79 GiB 6000.92 GB)
  Used Dev Size : 5860270080 (2794.39 GiB 3000.46 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
   Unused Space : before=262048 sectors, after=944 sectors
          State : clean
    Device UUID : 649d53ad:f909b7a9:cd0f57f2:08a55e3b

Internal Bitmap : 8 sectors from superblock
    Update Time : Sun Dec  4 17:11:19 2016
  Bad Block Log : 512 entries available at offset 80 sectors - bad
blocks present.
       Checksum : c9dfe033 - correct
         Events : 31196

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : spare
   Array State : .... ('A' == active, '.' == missing, 'R' == replacing)

/dev/sde:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x9
     Array UUID : 943d287e:af28b455:88a047f2:d714b8c6
           Name : labrador:oneforall  (local to host labrador)
  Creation Time : Fri Nov 30 19:57:45 2012
     Raid Level : raid6
   Raid Devices : 4

 Avail Dev Size : 5860271024 (2794.39 GiB 3000.46 GB)
     Array Size : 5860270080 (5588.79 GiB 6000.92 GB)
  Used Dev Size : 5860270080 (2794.39 GiB 3000.46 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
   Unused Space : before=262048 sectors, after=944 sectors
          State : clean
    Device UUID : dd3f90ab:619684c0:942a7d88:f116f2db

Internal Bitmap : 8 sectors from superblock
    Update Time : Sun Dec  4 17:11:19 2016
  Bad Block Log : 512 entries available at offset 80 sectors - bad
blocks present.
       Checksum : 15a3975a - correct
         Events : 31196

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : spare
   Array State : .... ('A' == active, '.' == missing, 'R' == replacing)

/dev/sdf:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x9
     Array UUID : 943d287e:af28b455:88a047f2:d714b8c6
           Name : labrador:oneforall  (local to host labrador)
  Creation Time : Fri Nov 30 19:57:45 2012
     Raid Level : raid6
   Raid Devices : 4

 Avail Dev Size : 5860271024 (2794.39 GiB 3000.46 GB)
     Array Size : 5860270080 (5588.79 GiB 6000.92 GB)
  Used Dev Size : 5860270080 (2794.39 GiB 3000.46 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
   Unused Space : before=262048 sectors, after=944 sectors
          State : clean
    Device UUID : f7359c4e:c1f04b22:ce7aa32f:ed5bb054

Internal Bitmap : 8 sectors from superblock
    Update Time : Sun Dec  4 17:11:19 2016
  Bad Block Log : 512 entries available at offset 80 sectors - bad
blocks present.
       Checksum : 3a5b94a7 - correct
         Events : 31196

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : spare
   Array State : .... ('A' == active, '.' == missing, 'R' == replacing)

==8<=======

I do however know the _original_ positions of the respective disks
from the kernel messages

At assembly time:

[  +0.000638] RAID conf printout:
[  +0.000001]  --- level:6 rd:4 wd:4
[  +0.000001]  disk 0, o:1, dev:sdf
[  +0.000001]  disk 1, o:1, dev:sde
[  +0.000000]  disk 2, o:1, dev:sdd
[  +0.000001]  disk 3, o:1, dev:sdc

After the JBOD disappeared and right before they all get kicked out:

[  +0.000438] RAID conf printout:
[  +0.000001]  --- level:6 rd:4 wd:0
[  +0.000001]  disk 0, o:0, dev:sdf
[  +0.000001]  disk 1, o:0, dev:sde
[  +0.000000]  disk 2, o:0, dev:sdd
[  +0.000001]  disk 3, o:0, dev:sdc


John> You might be able to just for the three spare disks (assumed in this
John> case to be sda1, sdb1, sdc1; but you need to be sure first!) to
John> assemble into a full array with:
John>
John>  mdadm -A /dev/md50 /dev/sda1 /dev/sdb1 /dev/sdc1
John>
John> And if that works, great.   If not, post the error message(s) you get
John> back.

Note that the RAID has no active disks anymore, since when I tried
re-adding the formerly active disks that
where kicked from the array they got marked as spares, and mdraid
simply refuses to start a RAID6 setup with only spares. The message I
get is indeed

mdadm: /dev/md126 assembled from 0 drives and 3 spares - not enough to
start the array.

This is the point at which I made a copy of 3 of the 4 disks and
started playing around. Specifically, I dd'ed sdc into sdh, sdd into
sdi and sde into sdj and started playing around with sd[hij] rather
than the original disks, as I mentioned:

Giuseppe> So one thing that I've done is to hack around the superblock in the
Giuseppe> disks (copies) to put back the device roles as they were (getting the
Giuseppe> information from the pre-failure dmesg output). (By the way, I've been
Giuseppe> using Andy's Binary Editor for the superblock editing, so if anyone is
Giuseppe> interested in a be.ini for mdraid v1 superblocks, including checksum
Giuseppe> verification, I'd be happy to share). Specifically, I've left the
Giuseppe> device number untouched, but I have edited the dev_roles array so that
Giuseppe> the slots corresponding to the dev_number from all the disks map to
Giuseppe> appropriate device roles.

Specifically, I hand-edited the superblocks to achieve this:


==8<===============

/dev/sdh:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x9
     Array UUID : 943d287e:af28b455:88a047f2:d714b8c6
           Name : labrador:oneforall  (local to host labrador)
  Creation Time : Fri Nov 30 19:57:45 2012
     Raid Level : raid6
   Raid Devices : 4

 Avail Dev Size : 5860271024 (2794.39 GiB 3000.46 GB)
     Array Size : 5860270080 (5588.79 GiB 6000.92 GB)
  Used Dev Size : 5860270080 (2794.39 GiB 3000.46 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
   Unused Space : before=262048 sectors, after=944 sectors
          State : clean
    Device UUID : 543f75ac:a1f3cf99:1c6b71d9:52e358b9

Internal Bitmap : 8 sectors from superblock
    Update Time : Sun Dec  4 17:11:19 2016
  Bad Block Log : 512 entries available at offset 80 sectors - bad
blocks present.
       Checksum : 1e3300fe - correct
         Events : 31196

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 3
   Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)

/dev/sdi:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x9
     Array UUID : 943d287e:af28b455:88a047f2:d714b8c6
           Name : labrador:oneforall  (local to host labrador)
  Creation Time : Fri Nov 30 19:57:45 2012
     Raid Level : raid6
   Raid Devices : 4

 Avail Dev Size : 5860271024 (2794.39 GiB 3000.46 GB)
     Array Size : 5860270080 (5588.79 GiB 6000.92 GB)
  Used Dev Size : 5860270080 (2794.39 GiB 3000.46 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
   Unused Space : before=262048 sectors, after=944 sectors
          State : clean
    Device UUID : 649d53ad:f909b7a9:cd0f57f2:08a55e3b

Internal Bitmap : 8 sectors from superblock
    Update Time : Sun Dec  4 17:11:19 2016
  Bad Block Log : 512 entries available at offset 80 sectors - bad
blocks present.
       Checksum : c9e3e035 - correct
         Events : 31196

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 2
   Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)

/dev/sdj:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x9
     Array UUID : 943d287e:af28b455:88a047f2:d714b8c6
           Name : labrador:oneforall  (local to host labrador)
  Creation Time : Fri Nov 30 19:57:45 2012
     Raid Level : raid6
   Raid Devices : 4

 Avail Dev Size : 5860271024 (2794.39 GiB 3000.46 GB)
     Array Size : 5860270080 (5588.79 GiB 6000.92 GB)
  Used Dev Size : 5860270080 (2794.39 GiB 3000.46 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
   Unused Space : before=262048 sectors, after=944 sectors
          State : clean
    Device UUID : dd3f90ab:619684c0:942a7d88:f116f2db

Internal Bitmap : 8 sectors from superblock
    Update Time : Sun Dec  4 17:11:19 2016
  Bad Block Log : 512 entries available at offset 80 sectors - bad
blocks present.
       Checksum : 15a7975c - correct
         Events : 31196

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 1
   Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)

==8<===============

And I _can_ assemble the array, but what I get is this:

[  +0.003574] md: bind<sdi>
[  +0.001823] md: bind<sdh>
[  +0.000978] md: bind<sdj>
[  +0.003971] md/raid:md127: device sdj operational as raid disk 1
[  +0.000125] md/raid:md127: device sdh operational as raid disk 3
[  +0.000105] md/raid:md127: device sdi operational as raid disk 2
[  +0.015017] md/raid:md127: allocated 4374kB
[  +0.000139] md/raid:md127: raid level 6 active with 3 out of 4
devices, algorithm 2
[  +0.000063] RAID conf printout:
[  +0.000002]  --- level:6 rd:4 wd:3
[  +0.000003]  disk 1, o:1, dev:sdj
[  +0.000002]  disk 2, o:1, dev:sdi
[  +0.000001]  disk 3, o:1, dev:sdh
[  +0.004187] md127: bitmap file is out of date (31193 < 31196) --
forcing full recovery
[  +0.000065] created bitmap (22 pages) for device md127
[  +0.000072] md127: bitmap file is out of date, doing full recovery
[  +0.100300] md127: bitmap initialized from disk: read 2 pages, set
44711 of 44711 bits
[  +0.039741] md127: detected capacity change from 0 to 6000916561920
[  +0.000085] Buffer I/O error on dev md127, logical block 0, async page read
[  +0.000064] Buffer I/O error on dev md127, logical block 0, async page read
[  +0.000022] Buffer I/O error on dev md127, logical block 0, async page read
[  +0.000022] Buffer I/O error on dev md127, logical block 0, async page read
[  +0.000019] ldm_validate_partition_table(): Disk read failed.
[  +0.000021] Buffer I/O error on dev md127, logical block 0, async page read
[  +0.000026] Buffer I/O error on dev md127, logical block 0, async page read
[  +0.000022] Buffer I/O error on dev md127, logical block 0, async page read
[  +0.000021] Buffer I/O error on dev md127, logical block 0, async page read
[  +0.000019] Dev md127: unable to read RDB block 0
[  +0.000016] Buffer I/O error on dev md127, logical block 0, async page read
[  +0.000022] Buffer I/O error on dev md127, logical block 0, async page read
[  +0.000030]  md127: unable to read partition table

and any attempt to access md127 content gives an I/O error.


-- 
Giuseppe "Oblomov" Bilotta

^ permalink raw reply

* Re: [PATCH] md/raid5-cache: no recovery is required when create super-block
From: Shaohua Li @ 2016-12-07 17:26 UTC (permalink / raw)
  To: JackieLiu; +Cc: songliubraving, liuzhengyuan, shli, linux-raid
In-Reply-To: <20161206061317.4388-1-liuyun01@kylinos.cn>

On Tue, Dec 06, 2016 at 02:13:17PM +0800, JackieLiu wrote:
> When create the super-block information, We do not need to do this
> recovery stage, only need to initialize some variables.

This makes sense.

please do look at Documentation/SubmittingPatches. I can't apply patch without
'signed-off-by'.

> ---
>  drivers/md/raid5-cache.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index c3b3124..96f25df 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -2545,7 +2545,13 @@ static int r5l_load_log(struct r5l_log *log)
>  
>  	__free_page(page);
>  
> -	ret = r5l_recovery_log(log);
> +	if (create_super) {
> +		log->log_start = r5l_ring_add(log, cp, BLOCK_SECTORS);
> +		log->seq = log->last_cp_seq + 1;
> +		log->next_checkpoint = cp;

must set ret = 0 here. Next time please make sure there isn't compiling
warnning.

Thanks,
Shaohua

^ permalink raw reply

* [PATCH 1/3] md/raid5-cache: fix crc in rewrite_data_only_stripes()
From: Song Liu @ 2016-12-07 17:42 UTC (permalink / raw)
  To: linux-raid
  Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuan,
	liuyun01, Song Liu

r5l_recovery_create_empty_meta_block() creates crc for the empty
metablock. After the metablock is updated, we need clear the
checksum before recalculate it.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/raid5-cache.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index c3b3124..875f963 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -2117,7 +2117,9 @@ r5c_recovery_rewrite_data_only_stripes(struct r5l_log *log,
 			}
 		}
 		mb->meta_size = cpu_to_le32(offset);
-		mb->checksum = crc32c_le(log->uuid_checksum, mb, PAGE_SIZE);
+		mb->checksum = 0;
+		mb->checksum = cpu_to_le32(crc32c_le(log->uuid_checksum,
+						     mb, PAGE_SIZE));
 		sync_page_io(log->rdev, ctx->pos, PAGE_SIZE, page,
 			     REQ_OP_WRITE, WRITE_FUA, false);
 		sh->log_start = ctx->pos;
-- 
2.9.3


^ permalink raw reply related

* [PATCH 2/3] md/r5cache: after recovery, increase journal seq by 1000
From: Song Liu @ 2016-12-07 17:42 UTC (permalink / raw)
  To: linux-raid
  Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuan,
	liuyun01, Song Liu
In-Reply-To: <20161207174207.3685260-1-songliubraving@fb.com>

Currently, we increase journal entry seq by 10 after recovery.
However, this is not sufficient in the following case.

After crash the journal looks like

| seq+0 | +1 | +2 | +3 | +4 | +5 | +6 | +7 | ... | +11 | +12 |

If +1 is not valid, we dropped all entries from +1 to +12; and
write seq+10:

| seq+0 | +10 | +2 | +3 | +4 | +5 | +6 | +7 | ... | +11 | +12 |

However, if we write a big journal entry with seq+11, it will
connect with some stale journal entry:

| seq+0 | +10 |                     +11                 | +12 |

To reduce the risk of this issue, we increase seq by 1000 instead.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/raid5-cache.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 875f963..5301081 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -2003,8 +2003,8 @@ static int r5c_recovery_flush_log(struct r5l_log *log,
  * happens again, new recovery will start from meta 1. Since meta 2n is
  * valid now, recovery will think meta 3 is valid, which is wrong.
  * The solution is we create a new meta in meta2 with its seq == meta
- * 1's seq + 10 and let superblock points to meta2. The same recovery will
- * not think meta 3 is a valid meta, because its seq doesn't match
+ * 1's seq + 1000 and let superblock points to meta2. The same recovery
+ * will not think meta 3 is a valid meta, because its seq doesn't match
  */
 
 /*
@@ -2034,7 +2034,7 @@ static int r5c_recovery_flush_log(struct r5l_log *log,
  *   ---------------------------------------------
  *   ^                              ^
  *   |- log->last_checkpoint        |- ctx->pos+1
- *   |- log->last_cp_seq            |- ctx->seq+11
+ *   |- log->last_cp_seq            |- ctx->seq+1001
  *
  * However, it is not safe to start the state machine yet, because data only
  * parities are not yet secured in RAID. To save these data only parities, we
@@ -2045,7 +2045,7 @@ static int r5c_recovery_flush_log(struct r5l_log *log,
  *   -----------------------------------------------------------------
  *   ^                                                ^
  *   |- log->last_checkpoint                          |- ctx->pos+n
- *   |- log->last_cp_seq                              |- ctx->seq+10+n
+ *   |- log->last_cp_seq                              |- ctx->seq+1000+n
  *
  * If failure happens again during this process, the recovery can safe start
  * again from log->last_checkpoint.
@@ -2057,7 +2057,7 @@ static int r5c_recovery_flush_log(struct r5l_log *log,
  *   -----------------------------------------------------------------
  *                        ^                         ^
  *                        |- log->last_checkpoint   |- ctx->pos+n
- *                        |- log->last_cp_seq       |- ctx->seq+10+n
+ *                        |- log->last_cp_seq       |- ctx->seq+1000+n
  *
  * Then we can safely start the state machine. If failure happens from this
  * point on, the recovery will start from new log->last_checkpoint.
@@ -2157,8 +2157,8 @@ static int r5l_recovery_log(struct r5l_log *log)
 	if (ret)
 		return ret;
 
-       pos = ctx.pos;
-       ctx.seq += 10;
+	pos = ctx.pos;
+	ctx.seq += 1000;
 
 	if (ctx.data_only_stripes == 0) {
 		log->next_checkpoint = ctx.pos;
-- 
2.9.3


^ permalink raw reply related

* [PATCH 3/3] md/r5cache: sh->log_start in recovery
From: Song Liu @ 2016-12-07 17:42 UTC (permalink / raw)
  To: linux-raid
  Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuan,
	liuyun01, Song Liu
In-Reply-To: <20161207174207.3685260-1-songliubraving@fb.com>

We only need to update sh->log_start at the end of recovery,
which is r5c_recovery_rewrite_data_only_stripes().

In when there is data-only stripes, log->next_checkpoint is
also set in r5c_recovery_rewrite_data_only_stripes().

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/raid5-cache.c | 23 ++++++++---------------
 1 file changed, 8 insertions(+), 15 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 5301081..ae2684a 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -1681,8 +1681,7 @@ r5l_recovery_replay_one_stripe(struct r5conf *conf,
 
 static struct stripe_head *
 r5c_recovery_alloc_stripe(struct r5conf *conf,
-			  sector_t stripe_sect,
-			  sector_t log_start)
+			  sector_t stripe_sect)
 {
 	struct stripe_head *sh;
 
@@ -1691,7 +1690,6 @@ r5c_recovery_alloc_stripe(struct r5conf *conf,
 		return NULL;  /* no more stripe available */
 
 	r5l_recovery_reset_stripe(sh);
-	sh->log_start = log_start;
 
 	return sh;
 }
@@ -1861,7 +1859,7 @@ r5c_recovery_analyze_meta_block(struct r5l_log *log,
 						stripe_sect);
 
 		if (!sh) {
-			sh = r5c_recovery_alloc_stripe(conf, stripe_sect, ctx->pos);
+			sh = r5c_recovery_alloc_stripe(conf, stripe_sect);
 			/*
 			 * cannot get stripe from raid5_get_active_stripe
 			 * try replay some stripes
@@ -1870,7 +1868,7 @@ r5c_recovery_analyze_meta_block(struct r5l_log *log,
 				r5c_recovery_replay_stripes(
 					cached_stripe_list, ctx);
 				sh = r5c_recovery_alloc_stripe(
-					conf, stripe_sect, ctx->pos);
+					conf, stripe_sect);
 			}
 			if (!sh) {
 				pr_debug("md/raid:%s: Increasing stripe cache size to %d to recovery data on journal.\n",
@@ -1878,8 +1876,8 @@ r5c_recovery_analyze_meta_block(struct r5l_log *log,
 					conf->min_nr_stripes * 2);
 				raid5_set_cache_size(mddev,
 						     conf->min_nr_stripes * 2);
-				sh = r5c_recovery_alloc_stripe(
-					conf, stripe_sect, ctx->pos);
+				sh = r5c_recovery_alloc_stripe(conf,
+							       stripe_sect);
 			}
 			if (!sh) {
 				pr_err("md/raid:%s: Cannot get enough stripes due to memory pressure. Recovery failed.\n",
@@ -1893,7 +1891,6 @@ r5c_recovery_analyze_meta_block(struct r5l_log *log,
 			if (!test_bit(STRIPE_R5C_CACHING, &sh->state) &&
 			    test_bit(R5_Wantwrite, &sh->dev[sh->pd_idx].flags)) {
 				r5l_recovery_replay_one_stripe(conf, sh, ctx);
-				sh->log_start = ctx->pos;
 				list_move_tail(&sh->lru, cached_stripe_list);
 			}
 			r5l_recovery_load_data(log, sh, ctx, payload,
@@ -1932,8 +1929,6 @@ static void r5c_recovery_load_one_stripe(struct r5l_log *log,
 			set_bit(R5_UPTODATE, &dev->flags);
 		}
 	}
-	list_add_tail(&sh->r5c, &log->stripe_in_journal_list);
-	atomic_inc(&log->stripe_in_journal_count);
 }
 
 /*
@@ -2123,9 +2118,11 @@ r5c_recovery_rewrite_data_only_stripes(struct r5l_log *log,
 		sync_page_io(log->rdev, ctx->pos, PAGE_SIZE, page,
 			     REQ_OP_WRITE, WRITE_FUA, false);
 		sh->log_start = ctx->pos;
+		list_add_tail(&sh->r5c, &log->stripe_in_journal_list);
+		atomic_inc(&log->stripe_in_journal_count);
 		ctx->pos = write_pos;
 		ctx->seq += 1;
-
+		log->next_checkpoint = sh->log_start;
 		list_del_init(&sh->lru);
 		raid5_release_stripe(sh);
 	}
@@ -2139,7 +2136,6 @@ static int r5l_recovery_log(struct r5l_log *log)
 	struct r5l_recovery_ctx ctx;
 	int ret;
 	sector_t pos;
-	struct stripe_head *sh;
 
 	ctx.pos = log->last_checkpoint;
 	ctx.seq = log->last_cp_seq;
@@ -2164,9 +2160,6 @@ static int r5l_recovery_log(struct r5l_log *log)
 		log->next_checkpoint = ctx.pos;
 		r5l_log_write_empty_meta_block(log, ctx.pos, ctx.seq++);
 		ctx.pos = r5l_ring_add(log, ctx.pos, BLOCK_SECTORS);
-	} else {
-		sh = list_last_entry(&ctx.cached_list, struct stripe_head, lru);
-		log->next_checkpoint = sh->log_start;
 	}
 
 	if ((ctx.data_only_stripes == 0) && (ctx.data_parity_stripes == 0))
-- 
2.9.3


^ permalink raw reply related

* [PATCH 1/2] md/r5cache: flush data only stripes in r5l_recovery_log()
From: Song Liu @ 2016-12-07 19:36 UTC (permalink / raw)
  To: linux-raid
  Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuan,
	liuyun01, Song Liu

When there is data only stripes in the journal, we flush them out in
r5l_recovery_log(). Ths logic is implemented in a new function:
r5c_recovery_flush_data_only_stripes();

We need conf->log in r5l_load_log(), so we need to set it before calling
r5l_load_log(). If r5l_load_log() fails, we set conf->log back to NULL.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/raid5-cache.c | 60 +++++++++++++++++++++++++++++++++++-------------
 drivers/md/raid5.c       |  9 +++++++-
 drivers/md/raid5.h       |  4 ++++
 3 files changed, 56 insertions(+), 17 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index ae2684a..830bb7f 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -2061,7 +2061,7 @@ static int
 r5c_recovery_rewrite_data_only_stripes(struct r5l_log *log,
 				       struct r5l_recovery_ctx *ctx)
 {
-	struct stripe_head *sh, *next;
+	struct stripe_head *sh;
 	struct mddev *mddev = log->rdev->mddev;
 	struct page *page;
 
@@ -2072,7 +2072,7 @@ r5c_recovery_rewrite_data_only_stripes(struct r5l_log *log,
 		return -ENOMEM;
 	}
 
-	list_for_each_entry_safe(sh, next, &ctx->cached_list, lru) {
+	list_for_each_entry(sh, &ctx->cached_list, lru) {
 		struct r5l_meta_block *mb;
 		int i;
 		int offset;
@@ -2123,13 +2123,39 @@ r5c_recovery_rewrite_data_only_stripes(struct r5l_log *log,
 		ctx->pos = write_pos;
 		ctx->seq += 1;
 		log->next_checkpoint = sh->log_start;
-		list_del_init(&sh->lru);
-		raid5_release_stripe(sh);
 	}
 	__free_page(page);
 	return 0;
 }
 
+static void r5c_recovery_flush_data_only_stripes(struct r5l_log *log,
+						 struct r5l_recovery_ctx *ctx)
+{
+	struct mddev *mddev = log->rdev->mddev;
+	struct r5conf *conf = mddev->private;
+	struct stripe_head *sh, *next;
+
+	if (ctx->data_only_stripes == 0)
+		return;
+
+	log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_BACK;
+	set_bit(R5C_PRE_INIT_FLUSH, &conf->cache_state);
+
+	list_for_each_entry_safe(sh, next, &ctx->cached_list, lru) {
+		list_del_init(&sh->lru);
+		raid5_release_stripe(sh);
+	}
+
+	md_wakeup_thread(conf->mddev->thread);
+	wait_event(conf->wait_for_r5c_pre_init_flush,
+		   atomic_read(&conf->active_stripes) == 0 &&
+		   atomic_read(&conf->r5c_cached_full_stripes) == 0 &&
+		   atomic_read(&conf->r5c_cached_partial_stripes) == 0);
+
+	clear_bit(R5C_PRE_INIT_FLUSH, &conf->cache_state);
+	log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_THROUGH;
+}
+
 static int r5l_recovery_log(struct r5l_log *log)
 {
 	struct mddev *mddev = log->rdev->mddev;
@@ -2156,11 +2182,6 @@ static int r5l_recovery_log(struct r5l_log *log)
 	pos = ctx.pos;
 	ctx.seq += 1000;
 
-	if (ctx.data_only_stripes == 0) {
-		log->next_checkpoint = ctx.pos;
-		r5l_log_write_empty_meta_block(log, ctx.pos, ctx.seq++);
-		ctx.pos = r5l_ring_add(log, ctx.pos, BLOCK_SECTORS);
-	}
 
 	if ((ctx.data_only_stripes == 0) && (ctx.data_parity_stripes == 0))
 		pr_debug("md/raid:%s: starting from clean shutdown\n",
@@ -2169,19 +2190,24 @@ static int r5l_recovery_log(struct r5l_log *log)
 		pr_debug("md/raid:%s: recoverying %d data-only stripes and %d data-parity stripes\n",
 			 mdname(mddev), ctx.data_only_stripes,
 			 ctx.data_parity_stripes);
+	}
 
-		if (ctx.data_only_stripes > 0)
-			if (r5c_recovery_rewrite_data_only_stripes(log, &ctx)) {
-				pr_err("md/raid:%s: failed to rewrite stripes to journal\n",
-				       mdname(mddev));
-				return -EIO;
-			}
+	if (ctx.data_only_stripes == 0) {
+		log->next_checkpoint = ctx.pos;
+		r5l_log_write_empty_meta_block(log, ctx.pos, ctx.seq++);
+		ctx.pos = r5l_ring_add(log, ctx.pos, BLOCK_SECTORS);
+	} else if (r5c_recovery_rewrite_data_only_stripes(log, &ctx)) {
+		pr_err("md/raid:%s: failed to rewrite stripes to journal\n",
+		       mdname(mddev));
+		return -EIO;
 	}
 
 	log->log_start = ctx.pos;
 	log->seq = ctx.seq;
 	log->last_checkpoint = pos;
 	r5l_write_super(log, pos);
+
+	r5c_recovery_flush_data_only_stripes(log, &ctx);
 	return 0;
 }
 
@@ -2626,14 +2652,16 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
 	spin_lock_init(&log->stripe_in_journal_lock);
 	atomic_set(&log->stripe_in_journal_count, 0);
 
+	rcu_assign_pointer(conf->log, log);
+
 	if (r5l_load_log(log))
 		goto error;
 
-	rcu_assign_pointer(conf->log, log);
 	set_bit(MD_HAS_JOURNAL, &conf->mddev->flags);
 	return 0;
 
 error:
+	rcu_assign_pointer(conf->log, NULL);
 	md_unregister_thread(&log->reclaim_thread);
 reclaim_thread:
 	mempool_destroy(log->meta_pool);
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 6bf3c26..524b041 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -232,7 +232,9 @@ static void do_release_stripe(struct r5conf *conf, struct stripe_head *sh,
 	 * When quiesce in r5c write back, set STRIPE_HANDLE for stripes with
 	 * data in journal, so they are not released to cached lists
 	 */
-	if (conf->quiesce && r5c_is_writeback(conf->log) &&
+	if ((conf->quiesce ||
+	     test_bit(R5C_PRE_INIT_FLUSH, &conf->cache_state)) &&
+	    r5c_is_writeback(conf->log) &&
 	    !test_bit(STRIPE_HANDLE, &sh->state) && injournal != 0) {
 		if (test_bit(STRIPE_R5C_CACHING, &sh->state))
 			r5c_make_stripe_write_out(sh);
@@ -264,6 +266,10 @@ static void do_release_stripe(struct r5conf *conf, struct stripe_head *sh,
 			    < IO_THRESHOLD)
 				md_wakeup_thread(conf->mddev->thread);
 		atomic_dec(&conf->active_stripes);
+		if (test_bit(R5C_PRE_INIT_FLUSH, &conf->cache_state) &&
+		    atomic_read(&conf->active_stripes) == 0)
+			wake_up(&sh->raid_conf->wait_for_r5c_pre_init_flush);
+
 		if (!test_bit(STRIPE_EXPANDING, &sh->state)) {
 			if (!r5c_is_writeback(conf->log))
 				list_add_tail(&sh->lru, temp_inactive_list);
@@ -6638,6 +6644,7 @@ static struct r5conf *setup_conf(struct mddev *mddev)
 	init_waitqueue_head(&conf->wait_for_quiescent);
 	init_waitqueue_head(&conf->wait_for_stripe);
 	init_waitqueue_head(&conf->wait_for_overlap);
+	init_waitqueue_head(&conf->wait_for_r5c_pre_init_flush);
 	INIT_LIST_HEAD(&conf->handle_list);
 	INIT_LIST_HEAD(&conf->hold_list);
 	INIT_LIST_HEAD(&conf->delayed_list);
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index ed8e136..b39fe46 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -564,6 +564,9 @@ enum r5_cache_state {
 	R5C_EXTRA_PAGE_IN_USE,	/* a stripe is using disk_info.extra_page
 				 * for prexor
 				 */
+	R5C_PRE_INIT_FLUSH,	/* flushing data only stripes recovered from
+				 * the journal
+				 */
 };
 
 struct r5conf {
@@ -679,6 +682,7 @@ struct r5conf {
 	int			group_cnt;
 	int			worker_cnt_per_group;
 	struct r5l_log		*log;
+	wait_queue_head_t	wait_for_r5c_pre_init_flush;
 };
 
 
-- 
2.9.3


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox