* [PATCH 2/3] md: MD_RECOVERY_NEEDED is set for mddev->recovery
From: Shaohua Li @ 2016-12-08 23:48 UTC (permalink / raw)
To: linux-raid; +Cc: neilb
In-Reply-To: <cover.1481240632.git.shli@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
---
drivers/md/md.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 84dc891..5e66648 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -6856,7 +6856,7 @@ static int md_ioctl(struct block_device *bdev, fmode_t mode,
/* need to ensure recovery thread has run */
wait_event_interruptible_timeout(mddev->sb_wait,
!test_bit(MD_RECOVERY_NEEDED,
- &mddev->flags),
+ &mddev->recovery),
msecs_to_jiffies(5000));
if (cmd == STOP_ARRAY || cmd == STOP_ARRAY_RO) {
/* Need to flush page cache, and ensure no-one else opens
--
2.9.3
^ permalink raw reply related
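The one-line change above matters because a bit number only means something relative to the word it was defined for. A minimal userspace sketch of that pitfall follows. The struct, the helper names and the bit numbers are all hypothetical stand-ins, not the kernel's definitions:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit number, for illustration only. */
enum toy_recovery_bits { TOY_RECOVERY_NEEDED = 5 };

struct toy_mddev {
	unsigned long flags;    /* array/superblock state bits */
	unsigned long recovery; /* resync/recovery state bits  */
};

bool toy_test_bit(int nr, const unsigned long *word)
{
	assert(nr >= 0 && nr < (int)(8 * sizeof(*word)));
	return (*word >> nr) & 1UL;
}

void toy_set_bit(int nr, unsigned long *word)
{
	assert(nr >= 0 && nr < (int)(8 * sizeof(*word)));
	*word |= 1UL << nr;
}

/* Wait condition as written before the patch: looks in the wrong word. */
bool toy_recovery_settled_old(const struct toy_mddev *m)
{
	return !toy_test_bit(TOY_RECOVERY_NEEDED, &m->flags);
}

/* Wait condition after the patch: looks where the bit is actually set. */
bool toy_recovery_settled_new(const struct toy_mddev *m)
{
	return !toy_test_bit(TOY_RECOVERY_NEEDED, &m->recovery);
}
```

With the bit set in `recovery` and `flags` empty, the old condition reports "settled" immediately while the new one does not, so in this model the old wait would not actually wait for the recovery thread.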
* [PATCH 1/3] md: takeover should clear unrelated bits
From: Shaohua Li @ 2016-12-08 23:48 UTC (permalink / raw)
To: linux-raid; +Cc: neilb
In-Reply-To: <cover.1481240632.git.shli@fb.com>
When we change the level from raid1 to raid5, the MD_FAILFAST_SUPPORTED bit
accidentally stays set, but raid5 doesn't support it. The same is true
for the MD_HAS_JOURNAL bit.
Fixes: 46533ff (md: Use REQ_FAILFAST_* on metadata writes where appropriate)
Cc: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
---
drivers/md/raid0.c | 5 +++++
drivers/md/raid1.c | 5 ++++-
drivers/md/raid5.c | 6 +++++-
3 files changed, 14 insertions(+), 2 deletions(-)
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index e628f18..a162fed 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -539,8 +539,11 @@ static void *raid0_takeover_raid45(struct mddev *mddev)
mddev->delta_disks = -1;
/* make sure it will be not marked as dirty */
mddev->recovery_cp = MaxSector;
+ clear_bit(MD_HAS_JOURNAL, &mddev->flags);
+ clear_bit(MD_JOURNAL_CLEAN, &mddev->flags);
create_strip_zones(mddev, &priv_conf);
+
return priv_conf;
}
@@ -580,6 +583,7 @@ static void *raid0_takeover_raid10(struct mddev *mddev)
mddev->degraded = 0;
/* make sure it will be not marked as dirty */
mddev->recovery_cp = MaxSector;
+ clear_bit(MD_FAILFAST_SUPPORTED, &mddev->flags);
create_strip_zones(mddev, &priv_conf);
return priv_conf;
@@ -622,6 +626,7 @@ static void *raid0_takeover_raid1(struct mddev *mddev)
mddev->raid_disks = 1;
/* make sure it will be not marked as dirty */
mddev->recovery_cp = MaxSector;
+ clear_bit(MD_FAILFAST_SUPPORTED, &mddev->flags);
create_strip_zones(mddev, &priv_conf);
return priv_conf;
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 94e0afc..efc2e74 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -3243,9 +3243,12 @@ static void *raid1_takeover(struct mddev *mddev)
mddev->new_layout = 0;
mddev->new_chunk_sectors = 0;
conf = setup_conf(mddev);
- if (!IS_ERR(conf))
+ if (!IS_ERR(conf)) {
/* Array must appear to be quiesced */
conf->array_frozen = 1;
+ clear_bit(MD_HAS_JOURNAL, &mddev->flags);
+ clear_bit(MD_JOURNAL_CLEAN, &mddev->flags);
+ }
return conf;
}
return ERR_PTR(-EINVAL);
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 6bf3c26..3e6a2a0 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -7811,6 +7811,7 @@ static void *raid45_takeover_raid0(struct mddev *mddev, int level)
static void *raid5_takeover_raid1(struct mddev *mddev)
{
int chunksect;
+ void *ret;
if (mddev->raid_disks != 2 ||
mddev->degraded > 1)
@@ -7832,7 +7833,10 @@ static void *raid5_takeover_raid1(struct mddev *mddev)
mddev->new_layout = ALGORITHM_LEFT_SYMMETRIC;
mddev->new_chunk_sectors = chunksect;
- return setup_conf(mddev);
+ ret = setup_conf(mddev);
+ if (!IS_ERR(ret))
+ clear_bit(MD_FAILFAST_SUPPORTED, &mddev->flags);
+ return ret;
}
static void *raid5_takeover_raid6(struct mddev *mddev)
--
2.9.3
^ permalink raw reply related
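The idea in the patch above can be sketched outside the kernel as "mask off whatever the target personality cannot honour". The bit values, level names and function names below are hypothetical. The per-level sets are only what the hunks above suggest (raid0 clears journal and failfast bits, raid1 clears journal bits, raid5 clears the failfast bit), not an authoritative list:

```c
#include <assert.h>

/* Hypothetical bit values; the real definitions live in drivers/md/md.h. */
#define TOY_HAS_JOURNAL        (1UL << 0)
#define TOY_JOURNAL_CLEAN      (1UL << 1)
#define TOY_FAILFAST_SUPPORTED (1UL << 2)

enum toy_level { TOY_RAID0, TOY_RAID1, TOY_RAID5 };

/* Flags the target level does not support, as inferred from the patch. */
unsigned long toy_unsupported_flags(enum toy_level target)
{
	switch (target) {
	case TOY_RAID0:
		return TOY_HAS_JOURNAL | TOY_JOURNAL_CLEAN |
		       TOY_FAILFAST_SUPPORTED;
	case TOY_RAID1:
		return TOY_HAS_JOURNAL | TOY_JOURNAL_CLEAN;
	case TOY_RAID5:
		return TOY_FAILFAST_SUPPORTED;
	}
	return 0;
}

/* Flags that survive a takeover to the target level. */
unsigned long toy_takeover_flags(unsigned long flags, enum toy_level target)
{
	return flags & ~toy_unsupported_flags(target);
}
```

Keeping the unsupported set in one place per personality, rather than scattering clear_bit() calls through each takeover path, is a design choice of this sketch and not something the patch itself does.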
* [PATCH 0/3] md: fix mddev->flags issues
From: Shaohua Li @ 2016-12-08 23:48 UTC (permalink / raw)
To: linux-raid; +Cc: neilb
We have some places where mddev->flags is abused. Today I hit a reshape hang
which is related to this issue.
Neil,
I would appreciate it if you could review the patches before the merge window
starts.
Thanks,
Shaohua
Shaohua Li (3):
md: takeover should clear unrelated bits
md: MD_RECOVERY_NEEDED is set for mddev->recovery
md: separate flags for superblock changes
drivers/md/bitmap.c | 4 +-
drivers/md/dm-raid.c | 4 +-
drivers/md/md.c | 117 ++++++++++++++++++++++++-----------------------
drivers/md/md.h | 16 ++++---
drivers/md/multipath.c | 2 +-
drivers/md/raid0.c | 5 ++
drivers/md/raid1.c | 17 ++++---
drivers/md/raid10.c | 22 ++++-----
drivers/md/raid5-cache.c | 6 +--
drivers/md/raid5.c | 32 +++++++------
10 files changed, 121 insertions(+), 104 deletions(-)
--
2.9.3
^ permalink raw reply
* Re: [PATCH] dm: Avoid sleeping while holding the dm_bufio lock
From: Mikulas Patocka @ 2016-12-08 23:20 UTC (permalink / raw)
To: Doug Anderson
Cc: Alasdair Kergon, Mike Snitzer, Shaohua Li, Dmitry Torokhov,
linux-kernel@vger.kernel.org, linux-raid, dm-devel,
David Rientjes, Sonny Rao, Guenter Roeck
In-Reply-To: <CAD=FV=V85=ZTXURVZ3Xo1gS80rqA8gOhr+Xm47axrwVUpjQFuw@mail.gmail.com>
On Wed, 7 Dec 2016, Doug Anderson wrote:
> Hi,
>
> On Wed, Nov 23, 2016 at 12:57 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> > Hi
> >
> > The GFP_NOIO allocation frees clean cached pages. The GFP_NOWAIT
> > allocation doesn't. Your patch would incorrectly reuse buffers in a
> > situation when the memory is filled with clean cached pages.
> >
> > Here I'm proposing an alternate patch that first tries GFP_NOWAIT
> > allocation, then drops the lock and tries GFP_NOIO allocation.
> >
> > Note that the root cause of the stacktrace you are seeing is that your
> > block device is congested - i.e. there are too many requests in the
> > device's queue - and note that fixing this wait won't fix the root cause
> > (congested device).
> >
> > The congestion limits are set in blk_queue_congestion_threshold to 7/8 to
> > 13/16 size of the nr_requests value.
> >
> > If you don't want your device to report the congested status, you can
> > increase /sys/block/<device>/queue/nr_requests - you should test if your
> > chromebook is faster or slower with this setting increased. But note that
> > this setting won't increase the IO-per-second of the device.
>
> Cool, thanks for the insight!
>
> Can you clarify which block device is relevant here? Is this the DM
> block device, the underlying block device, or the swap block device?
> I'm not at all an expert on DM, but I think we have:
>
> 1. /dev/mmcblk0 - the underlying storage device.
> 2. /dev/dm-0 - The verity device that's run atop /dev/mmcblk0p3
> 3. /dev/zram0 - Our swap device
The /dev/mmcblk0 device is congested. You can see the number of requests
in /sys/block/mmcblk0/inflight
> As stated in the original email, I'm running on a downstream kernel
> (kernel-4.4) with bunches of local patches, so it's plausible that
> things have changed in the meantime, but:
>
> * At boot time the "nr_requests" for all block devices was 128
> * I was unable to set the "nr_requests" for dm-0 and zram0 (it just
> gives an error in sysfs).
> * When I set "nr_requests" to 4096 for /dev/mmcblk0 it didn't seem to
> affect the problem.
The eMMC has a fixed IOPS ceiling that can't be improved. Use a faster block
device, but it will cost more money.
If you want to handle such a situation where you run 4 tasks each eating
900MB, just use more memory; don't expect this to work smoothly on a
4GB machine.
If you want to protect the chromebook from runaway memory allocations, you
can detect this situation in some watchdog process and either kill the
process that consumes most memory with the kill syscall or trigger the
kernel OOM killer by writing 'f' to /proc/sysrq-trigger.
The question is what you really want - handle this situation smoothly
(then, you must add more memory) or protect chromeOS from applications
allocating too much memory?
Mikulas
^ permalink raw reply
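To put numbers on the "7/8 to 13/16" remark in the quoted mail, here is a small arithmetic sketch. It assumes the 7/8 mark is where the queue starts reporting congestion and the 13/16 mark is where it stops. That reading, the struct and the function name are assumptions, and any off-by-one adjustments the kernel applies are deliberately left out:

```c
#include <assert.h>

struct toy_congestion {
	unsigned int on;  /* queue depth at which congestion is reported */
	unsigned int off; /* queue depth at which it is cleared again    */
};

/* Approximate thresholds from the fractions quoted in the mail. */
struct toy_congestion toy_congestion_thresholds(unsigned int nr_requests)
{
	struct toy_congestion t;

	assert(nr_requests >= 16);
	t.on  = nr_requests - nr_requests / 8;                    /* ~7/8   */
	t.off = nr_requests - nr_requests / 8 - nr_requests / 16; /* ~13/16 */
	return t;
}
```

For the boot-time value of 128 mentioned in the thread this gives roughly 112 and 104 requests. For 4096 it gives 3584 and 3328, which only moves the point at which congestion is reported and does nothing for the throughput of the device.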
* Re: [PATCH v2 03/12] raid5-cache: add a new policy
From: NeilBrown @ 2016-12-08 21:22 UTC (permalink / raw)
To: Artur Paszkiewicz, shli; +Cc: linux-raid
In-Reply-To: <a2b7d2a9-4b49-6f11-bbc5-228f28c83389@intel.com>
On Thu, Dec 08 2016, Artur Paszkiewicz wrote:
>>
>> How about we call it "resync_policy" which describes how to cope with
>> unexpected shutdown. Options include:
>>
>> full regenerate all redundancy info after a crash
>> bitmap only regenerate redundancy info indicated by bitmap
>> (both these susceptible to write-hole on raid456)
>> journal raid456 only, though could theoretically be extended
>> to raid1, raid10 : log transactions and replay after crash
>> ppl raid456 only: log partial-parity before writes.
>>
>> With external metadata, this must be set explicitly. With internal
>> metadata, it is set based on flags etc.
>>
>> Thoughts? I'm not really happy with "full", but I cannot think of a
>> better name.
>
> I'm OK with this approach. A corresponding option will also be needed in
> mdadm. But won't this name be a little misleading because this option is
> not just about unexpected shutdown (where resync applies), but also
> degraded array unexpected shutdown. So maybe something like
> "consistency_policy" and "resync" option instead of "full"?
"consistency_policy" and "resync" sounds good to me. Thanks.
NeilBrown
^ permalink raw reply
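The naming agreed on in this exchange ("consistency_policy" with the values resync, bitmap, journal and ppl) could be modelled as an enum plus a string parser. This is only a sketch of the proposal as discussed here; the type and function names are hypothetical and no sysfs or mdadm interface is implied:

```c
#include <assert.h>
#include <string.h>

enum toy_policy {
	TOY_POLICY_UNKNOWN = -1,
	TOY_POLICY_RESYNC  = 0, /* regenerate all redundancy after a crash  */
	TOY_POLICY_BITMAP,      /* regenerate only what the bitmap marks    */
	TOY_POLICY_JOURNAL,     /* log transactions, replay after a crash   */
	TOY_POLICY_PPL          /* log partial parity before writes         */
};

static const char *const toy_policy_names[] = {
	"resync", "bitmap", "journal", "ppl"
};

/* Map a policy name from the thread to its enum value. */
enum toy_policy toy_parse_policy(const char *name)
{
	size_t i;

	assert(name != NULL);
	for (i = 0; i < sizeof(toy_policy_names) / sizeof(toy_policy_names[0]); i++)
		if (strcmp(name, toy_policy_names[i]) == 0)
			return (enum toy_policy)i;
	return TOY_POLICY_UNKNOWN;
}

/* Per the thread, resync and bitmap leave raid456 exposed to the write hole. */
int toy_policy_exposed_to_write_hole(enum toy_policy p)
{
	return p == TOY_POLICY_RESYNC || p == TOY_POLICY_BITMAP;
}
```

Note that "full" is rejected by this parser on purpose, since the thread settles on "resync" as the name for that option.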
* Re: raid0 vs. mkfs
From: Shaohua Li @ 2016-12-08 19:19 UTC (permalink / raw)
To: Coly Li; +Cc: Avi Kivity, NeilBrown, linux-raid, linux-block
In-Reply-To: <c384d070-457c-cfee-e35f-53b9195ace10@suse.de>
On Fri, Dec 09, 2016 at 12:44:57AM +0800, Coly Li wrote:
> On 2016/12/8 上午12:59, Shaohua Li wrote:
> > On Wed, Dec 07, 2016 at 07:50:33PM +0800, Coly Li wrote:
> [snip]
> > Thanks for doing this, Coly! For raid0, this totally makes sense. The raid0
> > zones make things a little complicated though. I just had a brief look at your
> > proposed patch, which looks really complicated. I'd suggest something like
> > this:
> > 1. split the bio according to zone boundary.
> > 2. handle the split bio. Since the bio is within the zone range, calculating
> > the start and end sector for each rdev should be easy.
> >
>
> Hi Shaohua,
>
> Thanks for your suggestion! I tried to modify the code as you suggested,
> but it is even harder to write the code that way ...
>
> Because even if the bios are split per zone, all the corner cases still exist
> and have to be taken care of in every zone. The code will be more complicated.
Not sure why it makes the code more complicated. Probably I'm wrong, but I just
want to make sure we are on the same page: split the bio according to zone
boundary, then handle each split bio separately. Calculating the start/end point
of each rdev for the new bio within a zone should be simple. We then clone a
bio for each rdev and dispatch. So for example:
Disk 0: D0 D2 D4 D6 D7
Disk 1: D1 D3 D5
zone 0 is from D0 - D5, zone 1 is from D6 - D7
If bio is from D1 to D7, we split it to 2 bios, one is D1 - D5, the other D6 - D7.
For D1 - D5, we dispatch 2 bios. D1 - D5 for disk 1, D2 - D4 for disk 0
For D6 - D7, we just dispatch to disk 0.
What kind of corner case makes this more complicated?
> > This will create slightly more bio to each rdev (not too many, since there
> > aren't too many zones in practice) and block layer should easily merge these
> > bios without much overhead. The benefit is a much simpler implementation.
> >
> >> I compose a prototype patch, the code is not simple, indeed it is quite
> >> complicated IMHO.
> >>
> >> I do a little research, some NVMe SSDs support whole device size
> >> DISCARD, also I observe mkfs.xfs sends out a raid0 device size DISCARD
> >> bio to block layer. But raid0_make_request() only receives 512KB size
> >> DISCARD bio, block/blk-lib.c:__blkdev_issue_discard() splits the
> >> original large bio into 512KB small bios, the limitation is from
> >> q->limits.discard_granularity.
> >
> > please adjust the max discard sectors for the queue. The original setting is
> > chunk size.
>
> This is a powerful suggestion, I change the max_discard_sectors to raid0
> size, and fix some bugs, now the patch looks working well. The
> performance number is not bad.
>
> On 4x3TB NVMe raid0, format it with mkfs.xfs. Current upstream kernel
> spends 306 seconds, the patched kernel spends 15 seconds. I see average
> request size increases from 1 chunk (1024 sectors) to 2048 chunks
> (2097152 sectors).
>
> I don't know why the bios are still being split before raid0_make_request()
> receives them, after I set q->limits.max_discard_sectors to
> mddev->array_sectors. Can anybody give me a hint?
That probably comes from disk_stack_limits. Try setting max_discard_sectors after it.
> Here I attach the RFC v2 patch. If anybody wants to try it, please do
> and report the result :-)
>
> I will take time to write a very detailed commit log and code comments
> to make this patch easier to understand. Ugly code, that's what
> I have to pay to gain better performance ....
Can't say I like it :). Hard to read and the memory allocation is ugly. Please
check if there is a simpler solution first before writing a detailed commit log.
Thanks,
Shaohua
^ permalink raw reply
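The zone split described in the reply above (Disk 0: D0 D2 D4 D6 D7, Disk 1: D1 D3 D5, zone 0 covering D0 to D5 and zone 1 covering D6 to D7) can be worked through with a small model. It counts in whole chunks rather than sectors, uses hypothetical struct and function names, and walks the chunks in a loop instead of using a closed-form calculation. It illustrates the suggestion and is not code from either patch:

```c
#include <assert.h>

struct toy_zone {
	unsigned int first, last; /* first/last chunk of the zone, inclusive */
	unsigned int nb_dev;      /* devices striped across in this zone     */
};

struct toy_piece {
	unsigned int zone;        /* index of the zone this piece lies in */
	unsigned int first, last; /* chunk range of the piece, inclusive  */
};

/* Step 1: cut the chunk range [first, last] at zone boundaries. */
unsigned int toy_split_by_zone(const struct toy_zone *z, unsigned int nz,
			       unsigned int first, unsigned int last,
			       struct toy_piece *out, unsigned int max_out)
{
	unsigned int i, n = 0;

	assert(first <= last);
	for (i = 0; i < nz; i++) {
		unsigned int lo = first > z[i].first ? first : z[i].first;
		unsigned int hi = last < z[i].last ? last : z[i].last;

		if (lo > hi)
			continue; /* range does not touch this zone */
		assert(n < max_out);
		out[n].zone = i;
		out[n].first = lo;
		out[n].last = hi;
		n++;
	}
	return n;
}

/*
 * Step 2: for one piece, find the first and last chunk that land on
 * device index 'dev' of its zone. Returns how many chunks that is.
 */
unsigned int toy_range_on_dev(const struct toy_zone *z,
			      const struct toy_piece *p, unsigned int dev,
			      unsigned int *lo, unsigned int *hi)
{
	unsigned int c, count = 0;

	assert(dev < z->nb_dev);
	for (c = p->first; c <= p->last; c++) {
		if ((c - z->first) % z->nb_dev != dev)
			continue;
		if (count == 0)
			*lo = c;
		*hi = c;
		count++;
	}
	return count;
}
```

Fed the range D1 to D7, step 1 yields the pieces D1 to D5 and D6 to D7. Step 2 then gives D1 to D5 on disk 1, D2 to D4 on disk 0, and D6 to D7 on disk 0, matching the example in the mail. In zone 1 the only device index (0) happens to be disk 0; a real layout would need a per-zone device table.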
* Re: Recovering a RAID6 after all disks were disconnected
From: John Stoffel @ 2016-12-08 19:02 UTC (permalink / raw)
To: Giuseppe Bilotta; +Cc: John Stoffel, linux-raid
In-Reply-To: <CAOxFTczn4Su6KwjDGS0S5BrUdirv9Tu_zCeo26iFMvL3378xpw@mail.gmail.com>
Sorry for not getting back to you sooner, I've been under the weather
lately. And I'm NOT an expert on this, but it's good you've made
copies of the disks.
Giuseppe> Hello John, and thanks for your time
Giuseppe> I've had sporadic resets of the JBOD due to a variety of reasons
Giuseppe> (power failures or disk failures —the JBOD has the bad habit of
Giuseppe> resetting when one disk has an I/O error, which causes all of the
Giuseppe> disks to go offline temporarily).
John> Please toss that JBOD out the window! *grin*
Giuseppe> Well, that's exactly why I bought the new one which is the one I'm
Giuseppe> currently using to host the backup disks I'm experimenting on! 8-)
Giuseppe> However I suspect this is a misfeature common to many if not all
Giuseppe> 'home' JBODS which are all SATA based and only provide eSATA and/or
Giuseppe> USB3 connection to the machine.
Giuseppe> The thing happened again a couple of days ago, but this time
Giuseppe> I tried re-adding the disks directly when they came back
Giuseppe> online, using mdadm -a and confident that since they _had_
Giuseppe> been recently part of the array, the array would actually go
Giuseppe> back to work fine —except that this is not the case when ALL
Giuseppe> disks were kicked out of the array! Instead, what happened
Giuseppe> was that all the disks were marked as 'spare' and the RAID
Giuseppe> would not assemble anymore.
John> Can you please send us the full details of each disk using the
John> command:
John>
John> mdadm -E /dev/sda1
John>
Giuseppe> Here it is. Notice that this is the result of -E _after_ the attempted
Giuseppe> re-add while the RAID was running, which marked all the disks as
Giuseppe> spares:
Yeah, this is probably a bad state. I would suggest you try to just
assemble the disks in various orders using your clones:
mdadm -A /dev/md0 /dev/sdc /dev/sdd /dev/sde /dev/sdf
And then mix up the order until you get a working array. You might
also want to try assembling using the 'missing' flag for the original
disk which dropped out of the array, so that just the three good disks
are used. This might take a while to test all the possible
permutations.
You might also want to look back in the archives of this mailing
list. Phil Turmel has some great advice and howto guides for this.
You can do the test assembles using loop back devices so that you
don't write to the originals, or even to the clones.
This should let you do testing more quickly.
Here's some other pointers for drive timeout issues that you should
look at as well:
Readings for timeout mismatch issues: (whole threads if possible)
http://marc.info/?l=linux-raid&m=139050322510249&w=2
http://marc.info/?l=linux-raid&m=135863964624202&w=2
http://marc.info/?l=linux-raid&m=135811522817345&w=1
http://marc.info/?l=linux-raid&m=133761065622164&w=2
http://marc.info/?l=linux-raid&m=132477199207506
http://marc.info/?l=linux-raid&m=133665797115876&w=2
http://marc.info/?l=linux-raid&m=142487508806844&w=3
http://marc.info/?l=linux-raid&m=144535576302583&w=2
Giuseppe> ==8<=======
Giuseppe> /dev/sdc:
Giuseppe> Magic : a92b4efc
Giuseppe> Version : 1.2
Giuseppe> Feature Map : 0x9
Giuseppe> Array UUID : 943d287e:af28b455:88a047f2:d714b8c6
Giuseppe> Name : labrador:oneforall (local to host labrador)
Giuseppe> Creation Time : Fri Nov 30 19:57:45 2012
Giuseppe> Raid Level : raid6
Giuseppe> Raid Devices : 4
Giuseppe> Avail Dev Size : 5860271024 (2794.39 GiB 3000.46 GB)
Giuseppe> Array Size : 5860270080 (5588.79 GiB 6000.92 GB)
Giuseppe> Used Dev Size : 5860270080 (2794.39 GiB 3000.46 GB)
Giuseppe> Data Offset : 262144 sectors
Giuseppe> Super Offset : 8 sectors
Giuseppe> Unused Space : before=262048 sectors, after=944 sectors
Giuseppe> State : clean
Giuseppe> Device UUID : 543f75ac:a1f3cf99:1c6b71d9:52e358b9
Giuseppe> Internal Bitmap : 8 sectors from superblock
Giuseppe> Update Time : Sun Dec 4 17:11:19 2016
Giuseppe> Bad Block Log : 512 entries available at offset 80 sectors - bad
Giuseppe> blocks present.
Giuseppe> Checksum : 1e2f00fc - correct
Giuseppe> Events : 31196
Giuseppe> Layout : left-symmetric
Giuseppe> Chunk Size : 512K
Giuseppe> Device Role : spare
Giuseppe> Array State : .... ('A' == active, '.' == missing, 'R' == replacing)
Giuseppe> /dev/sdd:
Giuseppe> Magic : a92b4efc
Giuseppe> Version : 1.2
Giuseppe> Feature Map : 0x9
Giuseppe> Array UUID : 943d287e:af28b455:88a047f2:d714b8c6
Giuseppe> Name : labrador:oneforall (local to host labrador)
Giuseppe> Creation Time : Fri Nov 30 19:57:45 2012
Giuseppe> Raid Level : raid6
Giuseppe> Raid Devices : 4
Giuseppe> Avail Dev Size : 5860271024 (2794.39 GiB 3000.46 GB)
Giuseppe> Array Size : 5860270080 (5588.79 GiB 6000.92 GB)
Giuseppe> Used Dev Size : 5860270080 (2794.39 GiB 3000.46 GB)
Giuseppe> Data Offset : 262144 sectors
Giuseppe> Super Offset : 8 sectors
Giuseppe> Unused Space : before=262048 sectors, after=944 sectors
Giuseppe> State : clean
Giuseppe> Device UUID : 649d53ad:f909b7a9:cd0f57f2:08a55e3b
Giuseppe> Internal Bitmap : 8 sectors from superblock
Giuseppe> Update Time : Sun Dec 4 17:11:19 2016
Giuseppe> Bad Block Log : 512 entries available at offset 80 sectors - bad
Giuseppe> blocks present.
Giuseppe> Checksum : c9dfe033 - correct
Giuseppe> Events : 31196
Giuseppe> Layout : left-symmetric
Giuseppe> Chunk Size : 512K
Giuseppe> Device Role : spare
Giuseppe> Array State : .... ('A' == active, '.' == missing, 'R' == replacing)
Giuseppe> /dev/sde:
Giuseppe> Magic : a92b4efc
Giuseppe> Version : 1.2
Giuseppe> Feature Map : 0x9
Giuseppe> Array UUID : 943d287e:af28b455:88a047f2:d714b8c6
Giuseppe> Name : labrador:oneforall (local to host labrador)
Giuseppe> Creation Time : Fri Nov 30 19:57:45 2012
Giuseppe> Raid Level : raid6
Giuseppe> Raid Devices : 4
Giuseppe> Avail Dev Size : 5860271024 (2794.39 GiB 3000.46 GB)
Giuseppe> Array Size : 5860270080 (5588.79 GiB 6000.92 GB)
Giuseppe> Used Dev Size : 5860270080 (2794.39 GiB 3000.46 GB)
Giuseppe> Data Offset : 262144 sectors
Giuseppe> Super Offset : 8 sectors
Giuseppe> Unused Space : before=262048 sectors, after=944 sectors
Giuseppe> State : clean
Giuseppe> Device UUID : dd3f90ab:619684c0:942a7d88:f116f2db
Giuseppe> Internal Bitmap : 8 sectors from superblock
Giuseppe> Update Time : Sun Dec 4 17:11:19 2016
Giuseppe> Bad Block Log : 512 entries available at offset 80 sectors - bad
Giuseppe> blocks present.
Giuseppe> Checksum : 15a3975a - correct
Giuseppe> Events : 31196
Giuseppe> Layout : left-symmetric
Giuseppe> Chunk Size : 512K
Giuseppe> Device Role : spare
Giuseppe> Array State : .... ('A' == active, '.' == missing, 'R' == replacing)
Giuseppe> /dev/sdf:
Giuseppe> Magic : a92b4efc
Giuseppe> Version : 1.2
Giuseppe> Feature Map : 0x9
Giuseppe> Array UUID : 943d287e:af28b455:88a047f2:d714b8c6
Giuseppe> Name : labrador:oneforall (local to host labrador)
Giuseppe> Creation Time : Fri Nov 30 19:57:45 2012
Giuseppe> Raid Level : raid6
Giuseppe> Raid Devices : 4
Giuseppe> Avail Dev Size : 5860271024 (2794.39 GiB 3000.46 GB)
Giuseppe> Array Size : 5860270080 (5588.79 GiB 6000.92 GB)
Giuseppe> Used Dev Size : 5860270080 (2794.39 GiB 3000.46 GB)
Giuseppe> Data Offset : 262144 sectors
Giuseppe> Super Offset : 8 sectors
Giuseppe> Unused Space : before=262048 sectors, after=944 sectors
Giuseppe> State : clean
Giuseppe> Device UUID : f7359c4e:c1f04b22:ce7aa32f:ed5bb054
Giuseppe> Internal Bitmap : 8 sectors from superblock
Giuseppe> Update Time : Sun Dec 4 17:11:19 2016
Giuseppe> Bad Block Log : 512 entries available at offset 80 sectors - bad
Giuseppe> blocks present.
Giuseppe> Checksum : 3a5b94a7 - correct
Giuseppe> Events : 31196
Giuseppe> Layout : left-symmetric
Giuseppe> Chunk Size : 512K
Giuseppe> Device Role : spare
Giuseppe> Array State : .... ('A' == active, '.' == missing, 'R' == replacing)
Giuseppe> ==8<=======
Giuseppe> I do however know the _original_ positions of the respective disks
Giuseppe> from the kernel messages
Giuseppe> At assembly time:
Giuseppe> [ +0.000638] RAID conf printout:
Giuseppe> [ +0.000001] --- level:6 rd:4 wd:4
Giuseppe> [ +0.000001] disk 0, o:1, dev:sdf
Giuseppe> [ +0.000001] disk 1, o:1, dev:sde
Giuseppe> [ +0.000000] disk 2, o:1, dev:sdd
Giuseppe> [ +0.000001] disk 3, o:1, dev:sdc
Giuseppe> After the JBOD disappeared and right before they all get kicked out:
Giuseppe> [ +0.000438] RAID conf printout:
Giuseppe> [ +0.000001] --- level:6 rd:4 wd:0
Giuseppe> [ +0.000001] disk 0, o:0, dev:sdf
Giuseppe> [ +0.000001] disk 1, o:0, dev:sde
Giuseppe> [ +0.000000] disk 2, o:0, dev:sdd
Giuseppe> [ +0.000001] disk 3, o:0, dev:sdc
John> You might be able to just force the three spare disks (assumed in this
John> case to be sda1, sdb1, sdc1; but you need to be sure first!) to
John> assemble into a full array with:
John>
John> mdadm -A /dev/md50 /dev/sda1 /dev/sdb1 /dev/sdc1
John>
John> And if that works, great. If not, post the error message(s) you get
John> back.
Giuseppe> Note that the RAID has no active disks anymore, since when I tried
Giuseppe> re-adding the formerly active disks that
Giuseppe> were kicked from the array they got marked as spares, and mdraid
Giuseppe> simply refuses to start a RAID6 setup with only spares. The message I
Giuseppe> get is indeed
Giuseppe> mdadm: /dev/md126 assembled from 0 drives and 3 spares - not enough to
Giuseppe> start the array.
Giuseppe> This is the point at which I made a copy of 3 of the 4 disks and
Giuseppe> started playing around. Specifically, I dd'ed sdc into sdh, sdd into
Giuseppe> sdi and sde into sdj and started playing around with sd[hij] rather
Giuseppe> than the original disks, as I mentioned:
Giuseppe> So one thing that I've done is to hack around the superblock in the
Giuseppe> disks (copies) to put back the device roles as they were (getting the
Giuseppe> information from the pre-failure dmesg output). (By the way, I've been
Giuseppe> using Andy's Binary Editor for the superblock editing, so if anyone is
Giuseppe> interested in a be.ini for mdraid v1 superblocks, including checksum
Giuseppe> verification, I'd be happy to share). Specifically, I've left the
Giuseppe> device number untouched, but I have edited the dev_roles array so that
Giuseppe> the slots corresponding to the dev_number from all the disks map to
Giuseppe> appropriate device roles.
Giuseppe> Specifically, I hand-edited the superblocks to achieve this:
Giuseppe> ==8<===============
Giuseppe> /dev/sdh:
Giuseppe> Magic : a92b4efc
Giuseppe> Version : 1.2
Giuseppe> Feature Map : 0x9
Giuseppe> Array UUID : 943d287e:af28b455:88a047f2:d714b8c6
Giuseppe> Name : labrador:oneforall (local to host labrador)
Giuseppe> Creation Time : Fri Nov 30 19:57:45 2012
Giuseppe> Raid Level : raid6
Giuseppe> Raid Devices : 4
Giuseppe> Avail Dev Size : 5860271024 (2794.39 GiB 3000.46 GB)
Giuseppe> Array Size : 5860270080 (5588.79 GiB 6000.92 GB)
Giuseppe> Used Dev Size : 5860270080 (2794.39 GiB 3000.46 GB)
Giuseppe> Data Offset : 262144 sectors
Giuseppe> Super Offset : 8 sectors
Giuseppe> Unused Space : before=262048 sectors, after=944 sectors
Giuseppe> State : clean
Giuseppe> Device UUID : 543f75ac:a1f3cf99:1c6b71d9:52e358b9
Giuseppe> Internal Bitmap : 8 sectors from superblock
Giuseppe> Update Time : Sun Dec 4 17:11:19 2016
Giuseppe> Bad Block Log : 512 entries available at offset 80 sectors - bad
Giuseppe> blocks present.
Giuseppe> Checksum : 1e3300fe - correct
Giuseppe> Events : 31196
Giuseppe> Layout : left-symmetric
Giuseppe> Chunk Size : 512K
Giuseppe> Device Role : Active device 3
Giuseppe> Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)
Giuseppe> /dev/sdi:
Giuseppe> Magic : a92b4efc
Giuseppe> Version : 1.2
Giuseppe> Feature Map : 0x9
Giuseppe> Array UUID : 943d287e:af28b455:88a047f2:d714b8c6
Giuseppe> Name : labrador:oneforall (local to host labrador)
Giuseppe> Creation Time : Fri Nov 30 19:57:45 2012
Giuseppe> Raid Level : raid6
Giuseppe> Raid Devices : 4
Giuseppe> Avail Dev Size : 5860271024 (2794.39 GiB 3000.46 GB)
Giuseppe> Array Size : 5860270080 (5588.79 GiB 6000.92 GB)
Giuseppe> Used Dev Size : 5860270080 (2794.39 GiB 3000.46 GB)
Giuseppe> Data Offset : 262144 sectors
Giuseppe> Super Offset : 8 sectors
Giuseppe> Unused Space : before=262048 sectors, after=944 sectors
Giuseppe> State : clean
Giuseppe> Device UUID : 649d53ad:f909b7a9:cd0f57f2:08a55e3b
Giuseppe> Internal Bitmap : 8 sectors from superblock
Giuseppe> Update Time : Sun Dec 4 17:11:19 2016
Giuseppe> Bad Block Log : 512 entries available at offset 80 sectors - bad
Giuseppe> blocks present.
Giuseppe> Checksum : c9e3e035 - correct
Giuseppe> Events : 31196
Giuseppe> Layout : left-symmetric
Giuseppe> Chunk Size : 512K
Giuseppe> Device Role : Active device 2
Giuseppe> Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)
Giuseppe> /dev/sdj:
Giuseppe> Magic : a92b4efc
Giuseppe> Version : 1.2
Giuseppe> Feature Map : 0x9
Giuseppe> Array UUID : 943d287e:af28b455:88a047f2:d714b8c6
Giuseppe> Name : labrador:oneforall (local to host labrador)
Giuseppe> Creation Time : Fri Nov 30 19:57:45 2012
Giuseppe> Raid Level : raid6
Giuseppe> Raid Devices : 4
Giuseppe> Avail Dev Size : 5860271024 (2794.39 GiB 3000.46 GB)
Giuseppe> Array Size : 5860270080 (5588.79 GiB 6000.92 GB)
Giuseppe> Used Dev Size : 5860270080 (2794.39 GiB 3000.46 GB)
Giuseppe> Data Offset : 262144 sectors
Giuseppe> Super Offset : 8 sectors
Giuseppe> Unused Space : before=262048 sectors, after=944 sectors
Giuseppe> State : clean
Giuseppe> Device UUID : dd3f90ab:619684c0:942a7d88:f116f2db
Giuseppe> Internal Bitmap : 8 sectors from superblock
Giuseppe> Update Time : Sun Dec 4 17:11:19 2016
Giuseppe> Bad Block Log : 512 entries available at offset 80 sectors - bad
Giuseppe> blocks present.
Giuseppe> Checksum : 15a7975c - correct
Giuseppe> Events : 31196
Giuseppe> Layout : left-symmetric
Giuseppe> Chunk Size : 512K
Giuseppe> Device Role : Active device 1
Giuseppe> Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)
Giuseppe> ==8<===============
Giuseppe> And I _can_ assemble the array, but what I get is this:
Giuseppe> [ +0.003574] md: bind<sdi>
Giuseppe> [ +0.001823] md: bind<sdh>
Giuseppe> [ +0.000978] md: bind<sdj>
Giuseppe> [ +0.003971] md/raid:md127: device sdj operational as raid disk 1
Giuseppe> [ +0.000125] md/raid:md127: device sdh operational as raid disk 3
Giuseppe> [ +0.000105] md/raid:md127: device sdi operational as raid disk 2
Giuseppe> [ +0.015017] md/raid:md127: allocated 4374kB
Giuseppe> [ +0.000139] md/raid:md127: raid level 6 active with 3 out of 4
Giuseppe> devices, algorithm 2
Giuseppe> [ +0.000063] RAID conf printout:
Giuseppe> [ +0.000002] --- level:6 rd:4 wd:3
Giuseppe> [ +0.000003] disk 1, o:1, dev:sdj
Giuseppe> [ +0.000002] disk 2, o:1, dev:sdi
Giuseppe> [ +0.000001] disk 3, o:1, dev:sdh
Giuseppe> [ +0.004187] md127: bitmap file is out of date (31193 < 31196) --
Giuseppe> forcing full recovery
Giuseppe> [ +0.000065] created bitmap (22 pages) for device md127
Giuseppe> [ +0.000072] md127: bitmap file is out of date, doing full recovery
Giuseppe> [ +0.100300] md127: bitmap initialized from disk: read 2 pages, set
Giuseppe> 44711 of 44711 bits
Giuseppe> [ +0.039741] md127: detected capacity change from 0 to 6000916561920
Giuseppe> [ +0.000085] Buffer I/O error on dev md127, logical block 0, async page read
Giuseppe> [ +0.000064] Buffer I/O error on dev md127, logical block 0, async page read
Giuseppe> [ +0.000022] Buffer I/O error on dev md127, logical block 0, async page read
Giuseppe> [ +0.000022] Buffer I/O error on dev md127, logical block 0, async page read
Giuseppe> [ +0.000019] ldm_validate_partition_table(): Disk read failed.
Giuseppe> [ +0.000021] Buffer I/O error on dev md127, logical block 0, async page read
Giuseppe> [ +0.000026] Buffer I/O error on dev md127, logical block 0, async page read
Giuseppe> [ +0.000022] Buffer I/O error on dev md127, logical block 0, async page read
Giuseppe> [ +0.000021] Buffer I/O error on dev md127, logical block 0, async page read
Giuseppe> [ +0.000019] Dev md127: unable to read RDB block 0
Giuseppe> [ +0.000016] Buffer I/O error on dev md127, logical block 0, async page read
Giuseppe> [ +0.000022] Buffer I/O error on dev md127, logical block 0, async page read
Giuseppe> [ +0.000030] md127: unable to read partition table
Giuseppe> and any attempt to access md127 content gives an I/O error.
Giuseppe> --
Giuseppe> Giuseppe "Oblomov" Bilotta
^ permalink raw reply
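As a sanity check on the figures quoted in this thread: a four-device RAID6 stores data on two devices' worth of space, so the capacity the kernel reports for md127 should equal two times the "Used Dev Size". The sketch below assumes that "Used Dev Size" is counted in 512-byte sectors, which is consistent with the 3000.46 GB shown next to it. The function name is hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Usable bytes of a RAID6 array: (devices - 2) data devices per stripe. */
uint64_t toy_raid6_array_bytes(unsigned int raid_devices,
			       uint64_t used_dev_sectors)
{
	assert(raid_devices >= 4);
	return (uint64_t)(raid_devices - 2) * used_dev_sectors * 512u;
}
```

With 4 devices and 5860270080 sectors per device this gives 6000916561920 bytes, the same value as in the "detected capacity change from 0 to 6000916561920" line, so the hand-edited superblocks at least agree with the array geometry.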
* Re: [PATCH 3/3] md/r5cache: sh->log_start in recovery
From: Shaohua Li @ 2016-12-08 18:41 UTC (permalink / raw)
To: Song Liu
Cc: linux-raid, neilb, shli, kernel-team, dan.j.williams, hch,
liuzhengyuan, liuyun01
In-Reply-To: <20161207174207.3685260-3-songliubraving@fb.com>
On Wed, Dec 07, 2016 at 09:42:07AM -0800, Song Liu wrote:
> We only need to update sh->log_start at the end of recovery,
> which is r5c_recovery_rewrite_data_only_stripes().
>
> When there are data-only stripes, log->next_checkpoint is
> also set in r5c_recovery_rewrite_data_only_stripes().
I didn't get this. Please describe why this patch is required instead of what
this patch does.
> Signed-off-by: Song Liu <songliubraving@fb.com>
> ---
> drivers/md/raid5-cache.c | 23 ++++++++---------------
> 1 file changed, 8 insertions(+), 15 deletions(-)
>
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index 5301081..ae2684a 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -1681,8 +1681,7 @@ r5l_recovery_replay_one_stripe(struct r5conf *conf,
>
> static struct stripe_head *
> r5c_recovery_alloc_stripe(struct r5conf *conf,
> - sector_t stripe_sect,
> - sector_t log_start)
> + sector_t stripe_sect)
> {
> struct stripe_head *sh;
>
> @@ -1691,7 +1690,6 @@ r5c_recovery_alloc_stripe(struct r5conf *conf,
> return NULL; /* no more stripe available */
>
> r5l_recovery_reset_stripe(sh);
> - sh->log_start = log_start;
>
> return sh;
> }
> @@ -1861,7 +1859,7 @@ r5c_recovery_analyze_meta_block(struct r5l_log *log,
> stripe_sect);
>
> if (!sh) {
> - sh = r5c_recovery_alloc_stripe(conf, stripe_sect, ctx->pos);
> + sh = r5c_recovery_alloc_stripe(conf, stripe_sect);
> /*
> * cannot get stripe from raid5_get_active_stripe
> * try replay some stripes
> @@ -1870,7 +1868,7 @@ r5c_recovery_analyze_meta_block(struct r5l_log *log,
> r5c_recovery_replay_stripes(
> cached_stripe_list, ctx);
> sh = r5c_recovery_alloc_stripe(
> - conf, stripe_sect, ctx->pos);
> + conf, stripe_sect);
> }
> if (!sh) {
> pr_debug("md/raid:%s: Increasing stripe cache size to %d to recovery data on journal.\n",
> @@ -1878,8 +1876,8 @@ r5c_recovery_analyze_meta_block(struct r5l_log *log,
> conf->min_nr_stripes * 2);
> raid5_set_cache_size(mddev,
> conf->min_nr_stripes * 2);
> - sh = r5c_recovery_alloc_stripe(
> - conf, stripe_sect, ctx->pos);
> + sh = r5c_recovery_alloc_stripe(conf,
> + stripe_sect);
> }
> if (!sh) {
> pr_err("md/raid:%s: Cannot get enough stripes due to memory pressure. Recovery failed.\n",
> @@ -1893,7 +1891,6 @@ r5c_recovery_analyze_meta_block(struct r5l_log *log,
> if (!test_bit(STRIPE_R5C_CACHING, &sh->state) &&
> test_bit(R5_Wantwrite, &sh->dev[sh->pd_idx].flags)) {
> r5l_recovery_replay_one_stripe(conf, sh, ctx);
> - sh->log_start = ctx->pos;
> list_move_tail(&sh->lru, cached_stripe_list);
> }
> r5l_recovery_load_data(log, sh, ctx, payload,
> @@ -1932,8 +1929,6 @@ static void r5c_recovery_load_one_stripe(struct r5l_log *log,
> set_bit(R5_UPTODATE, &dev->flags);
> }
> }
> - list_add_tail(&sh->r5c, &log->stripe_in_journal_list);
> - atomic_inc(&log->stripe_in_journal_count);
> }
>
> /*
> @@ -2123,9 +2118,11 @@ r5c_recovery_rewrite_data_only_stripes(struct r5l_log *log,
> sync_page_io(log->rdev, ctx->pos, PAGE_SIZE, page,
> REQ_OP_WRITE, WRITE_FUA, false);
> sh->log_start = ctx->pos;
> + list_add_tail(&sh->r5c, &log->stripe_in_journal_list);
> + atomic_inc(&log->stripe_in_journal_count);
> ctx->pos = write_pos;
> ctx->seq += 1;
> -
> + log->next_checkpoint = sh->log_start;
> list_del_init(&sh->lru);
> raid5_release_stripe(sh);
> }
> @@ -2139,7 +2136,6 @@ static int r5l_recovery_log(struct r5l_log *log)
> struct r5l_recovery_ctx ctx;
> int ret;
> sector_t pos;
> - struct stripe_head *sh;
>
> ctx.pos = log->last_checkpoint;
> ctx.seq = log->last_cp_seq;
> @@ -2164,9 +2160,6 @@ static int r5l_recovery_log(struct r5l_log *log)
> log->next_checkpoint = ctx.pos;
> r5l_log_write_empty_meta_block(log, ctx.pos, ctx.seq++);
> ctx.pos = r5l_ring_add(log, ctx.pos, BLOCK_SECTORS);
> - } else {
> - sh = list_last_entry(&ctx.cached_list, struct stripe_head, lru);
> - log->next_checkpoint = sh->log_start;
> }
>
> if ((ctx.data_only_stripes == 0) && (ctx.data_parity_stripes == 0))
> --
> 2.9.3
>
^ permalink raw reply
* Re: [PATCH 1/3] md/raid5-cache: fix crc in rewrite_data_only_stripes()
From: Shaohua Li @ 2016-12-08 18:28 UTC (permalink / raw)
To: Song Liu
Cc: linux-raid, neilb, shli, kernel-team, dan.j.williams, hch,
liuzhengyuan, liuyun01
In-Reply-To: <20161207174207.3685260-1-songliubraving@fb.com>
On Wed, Dec 07, 2016 at 09:42:05AM -0800, Song Liu wrote:
> r5l_recovery_create_empty_meta_block() creates a crc for the empty
> metablock. After the metablock is updated, we need to clear the
> checksum before recalculating it.
Applied this one. However, I moved the checksum calculation out of
r5l_recovery_create_empty_meta_block(). We should calculate the checksum after
all fields in the meta block are updated.
Thanks,
Shaohua
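The ordering issue discussed above (clear the checksum field, then checksum
the fully updated block) can be illustrated with a small user-space sketch.
Everything here is an assumption made for illustration: the struct is a
much-simplified stand-in for the on-disk meta block, crc32c_sketch() is a
plain bitwise CRC32C rather than the kernel helper, and the cpu_to_le32()
conversion from the patch is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC32C (reflected polynomial 0x82F63B78), seeded by the caller.
 * No pre- or post-inversion is applied in this sketch. */
static uint32_t crc32c_sketch(uint32_t crc, const void *buf, size_t len)
{
    const uint8_t *p = buf;

    assert(buf != NULL || len == 0);
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc;
}

/* Hypothetical, simplified meta block: 64 bytes, no padding. */
struct meta_block_sketch {
    uint32_t magic;
    uint32_t checksum;
    uint32_t meta_size;
    uint8_t payload[52];
};

/* Seal the block: zero the checksum field first, then checksum everything.
 * This must be the last step after all other fields are final. */
static void meta_block_seal(struct meta_block_sketch *mb, uint32_t seed)
{
    mb->checksum = 0;
    mb->checksum = crc32c_sketch(seed, mb, sizeof(*mb));
}

/* Verify the way a reader would: recompute with the field zeroed. */
static int meta_block_verify(const struct meta_block_sketch *mb, uint32_t seed)
{
    struct meta_block_sketch tmp = *mb;

    tmp.checksum = 0;
    return crc32c_sketch(seed, &tmp, sizeof(tmp)) == mb->checksum;
}
```

Recomputing the checksum while the stale value is still stored in the field
folds the old checksum into the new one, which is the mismatch the patch
avoids by zeroing the field first.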
> Signed-off-by: Song Liu <songliubraving@fb.com>
> ---
> drivers/md/raid5-cache.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index c3b3124..875f963 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -2117,7 +2117,9 @@ r5c_recovery_rewrite_data_only_stripes(struct r5l_log *log,
> }
> }
> mb->meta_size = cpu_to_le32(offset);
> - mb->checksum = crc32c_le(log->uuid_checksum, mb, PAGE_SIZE);
> + mb->checksum = 0;
> + mb->checksum = cpu_to_le32(crc32c_le(log->uuid_checksum,
> + mb, PAGE_SIZE));
> sync_page_io(log->rdev, ctx->pos, PAGE_SIZE, page,
> REQ_OP_WRITE, WRITE_FUA, false);
> sh->log_start = ctx->pos;
> --
> 2.9.3
>
^ permalink raw reply
* Re: [PATCH] md/raid5-cache: no recovery is required when create super-block
From: Shaohua Li @ 2016-12-08 18:00 UTC (permalink / raw)
To: JackieLiu; +Cc: shli, songliubraving, liuzhengyuan, linux-raid
In-Reply-To: <20161208004739.2889-1-liuyun01@kylinos.cn>
On Thu, Dec 08, 2016 at 08:47:39AM +0800, JackieLiu wrote:
> When creating the super-block information, we do not need to run the
> recovery stage; we only need to initialize some variables.
applied, thanks!
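For readers following along, the create_super branch in the patch quoted
below only sets three fields. A minimal user-space sketch of that
initialization, assuming a ring-style log where positions wrap at the device
size and a block is 8 sectors (the struct and the *_sketch names are
hypothetical, not the kernel definitions):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t sector_sketch_t;

#define BLOCK_SECTORS_SKETCH 8 /* assumed: one 4 KiB block in 512-byte sectors */

/* Hypothetical subset of the log state touched by the patch. */
struct log_sketch {
    sector_sketch_t device_size;     /* ring size in sectors */
    sector_sketch_t log_start;
    sector_sketch_t next_checkpoint;
    uint64_t seq;
    uint64_t last_cp_seq;
};

/* Advance a position inside the ring, wrapping at device_size. */
static sector_sketch_t ring_add_sketch(const struct log_sketch *log,
                                       sector_sketch_t start,
                                       sector_sketch_t inc)
{
    assert(start < log->device_size && inc <= log->device_size);
    start += inc;
    if (start >= log->device_size)
        start -= log->device_size;
    return start;
}

/* Fresh super-block: nothing to replay, so just position the log. */
static void init_fresh_log_sketch(struct log_sketch *log, sector_sketch_t cp)
{
    log->log_start = ring_add_sketch(log, cp, BLOCK_SECTORS_SKETCH);
    log->seq = log->last_cp_seq + 1;
    log->next_checkpoint = cp;
}
```

With a 64-sector ring and a checkpoint at sector 56, the new log_start wraps
around to sector 0.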
> Signed-off-by: JackieLiu <liuyun01@kylinos.cn>
> ---
> drivers/md/raid5-cache.c | 10 ++++++++--
> 1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index c3b3124..7c732c5 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -2492,7 +2492,7 @@ static int r5l_load_log(struct r5l_log *log)
> sector_t cp = log->rdev->journal_tail;
> u32 stored_crc, expected_crc;
> bool create_super = false;
> - int ret;
> + int ret = 0;
>
> /* Make sure it's valid */
> if (cp >= rdev->sectors || round_down(cp, BLOCK_SECTORS) != cp)
> @@ -2545,7 +2545,13 @@ static int r5l_load_log(struct r5l_log *log)
>
> __free_page(page);
>
> - ret = r5l_recovery_log(log);
> + if (create_super) {
> + log->log_start = r5l_ring_add(log, cp, BLOCK_SECTORS);
> + log->seq = log->last_cp_seq + 1;
> + log->next_checkpoint = cp;
> + } else
> + ret = r5l_recovery_log(log);
> +
> r5c_update_log_state(log);
> return ret;
> ioerr:
> --
> 2.10.2
>
>
>
^ permalink raw reply
* Re: raid0 vs. mkfs
From: Coly Li @ 2016-12-08 16:44 UTC (permalink / raw)
To: Shaohua Li; +Cc: Avi Kivity, NeilBrown, linux-raid, linux-block
In-Reply-To: <20161207165933.isq64dbkxye772nz@kernel.org>
[-- Attachment #1: Type: text/plain, Size: 2545 bytes --]
On 2016/12/8 12:59 AM, Shaohua Li wrote:
> On Wed, Dec 07, 2016 at 07:50:33PM +0800, Coly Li wrote:
[snip]
> Thanks for doing this, Coly! For raid0, this totally makes sense. The raid0
> zones make things a little complicated though. I just had a brief look at your
> proposed patch, which looks really complicated. I'd suggest something like
> this:
> 1. split the bio according to zone boundary.
> 2. handle the split bio. Since the bio is within the zone range, calculating
> the start and end sector for each rdev should be easy.
>
Hi Shaohua,
Thanks for your suggestion! I tried to modify the code as you suggested,
but it is even harder to write the code that way ...
Even if the bios are split per zone, all the corner cases still exist
and must be handled in every zone. The code would be more complicated.
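For reference, step 1 of the suggestion quoted above (splitting a request at
zone boundaries) can be sketched in user space as follows. This is only an
illustration under stated assumptions: zones are described by a hypothetical
zone_end[] array holding the first array sector past each zone, and the
struct and function names are not from raid0.c. The per-rdev corner cases
mentioned in the reply are not handled here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One piece of the original range, fully contained in a single zone. */
struct piece_sketch {
    uint64_t start; /* first array sector of the piece */
    uint64_t len;   /* length in sectors */
    size_t zone;    /* index of the zone that contains it */
};

/* Split [start, start + len) at zone boundaries.
 * zone_end[i] is the first array sector past zone i, strictly increasing.
 * Returns the number of pieces written to out[]. */
static size_t split_by_zone_sketch(const uint64_t *zone_end, size_t nr_zones,
                                   uint64_t start, uint64_t len,
                                   struct piece_sketch *out, size_t max_out)
{
    size_t n = 0;
    uint64_t end = start + len;

    assert(nr_zones > 0 && end <= zone_end[nr_zones - 1]);
    for (size_t z = 0; z < nr_zones && start < end; z++) {
        uint64_t stop;

        if (start >= zone_end[z])
            continue; /* the range begins after this zone */
        stop = end < zone_end[z] ? end : zone_end[z];
        assert(n < max_out);
        out[n].start = start;
        out[n].len = stop - start;
        out[n].zone = z;
        n++;
        start = stop;
    }
    return n;
}
```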
> This will create slightly more bio to each rdev (not too many, since there
> aren't too many zones in practice) and block layer should easily merge these
> bios without much overhead. The benefit is a much simpler implementation.
>
>> I composed a prototype patch; the code is not simple, indeed it is quite
>> complicated IMHO.
>>
>> I did a little research: some NVMe SSDs support whole-device-size
>> DISCARD, and I observe that mkfs.xfs sends a raid0-device-size DISCARD
>> bio to the block layer. But raid0_make_request() only receives 512KB
>> DISCARD bios; block/blk-lib.c:__blkdev_issue_discard() splits the
>> original large bio into 512KB small bios, and the limitation comes from
>> q->limits.discard_granularity.
>
> please adjust the max discard sectors for the queue. The original setting is
> chunk size.
This is a powerful suggestion. I changed max_discard_sectors to the raid0
size and fixed some bugs; now the patch seems to work well. The
performance numbers are not bad.
On a 4x3TB NVMe raid0, formatting with mkfs.xfs takes 306 seconds on the
current upstream kernel and 15 seconds on the patched kernel. I see the
average request size increase from 1 chunk (1024 sectors) to 2048 chunks
(2097152 sectors).
I don't know why the bios are still split before raid0_make_request()
receives them, after I set q->limits.max_discard_sectors to
mddev->array_sectors. Can anybody give me a hint?
Here I attach the RFC v2 patch; if anybody wants to try it, please do
and report the result :-)
I will take time to write a very detailed commit log and code comments
to make this patch easier to understand. Ugly code, that's the price
I have to pay to gain better performance ....
Thanks in advance.
Coly
[-- Attachment #2: raid0_handle_large_discard_bio.patch --]
[-- Type: text/plain, Size: 11145 bytes --]
Subject: [RFC v2] optimization for large size DISCARD bio by per-device bios
This is a very early prototype; it still needs more block layer code
modification to make it work.
Current upstream raid0_make_request() only handles TRIM/DISCARD bios by
chunk size, which means a large raid0 device built from SSDs will call
generic_make_request() millions of times for the split bios. This patch
tries to combine small bios into a large one if they are on the same real
device and contiguous on that device, then sends the combined large
bio to the underlying device with a single call to generic_make_request().
For example, when mkfs.xfs trims a raid0 device built with 4 x 3TB
NVMe SSDs, current upstream raid0_make_request() will call
generic_make_request() 5.7 million times; with this patch only 4 calls
to generic_make_request() are required.
This patch won't work in the real world, because in block/blk-lib.c:
__blkdev_issue_discard() the original large bio will be split into
smaller ones by the restriction of discard_granularity.
If some day SSDs support whole-device-sized discard_granularity, it
will be very interesting then...
The basic idea is as follows. Suppose raid0_make_request() receives a
large discard bio, for example one that requests discarding chunks 1
to 24 on a raid0 device built from 4 SSDs. This large discard bio will
be split and written to each SSD in the following layout,
SSD1: C1,C5,C9,C13,C17,C21
SSD2: C2,C6,C10,C14,C18,C22
SSD3: C3,C7,C11,C15,C19,C23
SSD4: C4,C8,C12,C16,C20,C24
Current raid0 code will call generic_make_request() 24 times, once for
each split bio. But it is possible to calculate the final layout of
each split bio, so we can combine all the bios into four per-SSD large
bios, like this,
bio1 (on SSD1): C{1,5,9,13,17,21}
bio2 (on SSD2): C{2,6,10,14,18,22}
bio3 (on SSD3): C{3,7,11,15,19,23}
bio4 (on SSD4): C{4,8,12,16,20,24}
Now we only need to call generic_make_request() 4 times.
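The per-SSD layout above can be computed directly. The following is a small
user-space sketch, not the code from this patch: it assumes a single zone,
0-based chunk numbers (so C1..C24 above are chunks 0..23), and that array
chunk c lives on disk c % ndisks at device chunk c / ndisks. The names
extent_sketch and per_disk_extents_sketch are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* One contiguous run of chunks on a single member disk. */
struct extent_sketch {
    uint64_t dev_chunk; /* first chunk index on the member disk */
    uint64_t nr_chunks; /* number of consecutive chunks, 0 if none */
};

/* For array chunks [first, first + count), compute one extent per disk.
 * out[] must have room for ndisks entries. */
static void per_disk_extents_sketch(uint64_t first, uint64_t count,
                                    unsigned int ndisks,
                                    struct extent_sketch *out)
{
    assert(ndisks > 0);
    for (unsigned int d = 0; d < ndisks; d++) {
        /* first array chunk >= 'first' that maps to disk d */
        uint64_t c0 = first + (d + ndisks - first % ndisks) % ndisks;

        out[d].dev_chunk = 0;
        out[d].nr_chunks = 0;
        if (count == 0 || c0 > first + count - 1)
            continue; /* this disk gets nothing */
        out[d].dev_chunk = c0 / ndisks;
        out[d].nr_chunks = (first + count - 1 - c0) / ndisks + 1;
    }
}
```

For the 24-chunk example on 4 disks, every disk ends up with one extent of 6
chunks starting at device chunk 0, which is why 4 submissions are enough.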
The code is not simple, I need more time to write text to explain how
it works. Currently you can treat it as a proof of concept.
Changelogs
v1, Initial prototype.
v2, Major changes include,
- rename functions; handle_discard_bio() now takes care of
chunk-size DISCARD bios and the single-disk situation, while large
DISCARD bios are handled in handle_large_discard_bio().
- Set max_discard_sectors to raid0 device size.
- Fix several bugs which I found in basic testing.
Signed-off-by: Coly Li <colyli@suse.de>
---
drivers/md/raid0.c | 267 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 266 insertions(+), 1 deletion(-)
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index 258986a..c7afe0c 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -378,7 +378,7 @@ static int raid0_run(struct mddev *mddev)
blk_queue_max_hw_sectors(mddev->queue, mddev->chunk_sectors);
blk_queue_max_write_same_sectors(mddev->queue, mddev->chunk_sectors);
- blk_queue_max_discard_sectors(mddev->queue, mddev->chunk_sectors);
+ blk_queue_max_discard_sectors(mddev->queue, raid0_size(mddev, 0, 0));
blk_queue_io_min(mddev->queue, mddev->chunk_sectors << 9);
blk_queue_io_opt(mddev->queue,
@@ -452,6 +452,266 @@ static inline int is_io_in_chunk_boundary(struct mddev *mddev,
}
}
+
+struct bio_record {
+ sector_t bi_sector;
+ unsigned long sectors;
+ struct md_rdev *rdev;
+};
+
+static void handle_large_discard_bio(struct mddev *mddev, struct bio *bio)
+{
+ struct bio_record *recs = NULL;
+ struct bio *split;
+ struct r0conf *conf = mddev->private;
+ sector_t sectors, sector;
+ struct strip_zone *first_zone;
+ int zone_idx;
+ sector_t zone_start, zone_end;
+ int nr_strip_zones = conf->nr_strip_zones;
+ int disks;
+ int first_rdev_idx = -1, rdev_idx;
+ struct md_rdev *first_rdev;
+ unsigned int chunk_sects = mddev->chunk_sectors;
+
+ sector = bio->bi_iter.bi_sector;
+ first_zone = find_zone(conf, &sector);
+ first_rdev = map_sector(mddev, first_zone, sector, &sector);
+
+ /* bio is large enough to be split, allocate recs firstly */
+ disks = mddev->raid_disks;
+ recs = kcalloc(disks, sizeof(struct bio_record), GFP_NOIO);
+ if (recs == NULL) {
+ printk(KERN_ERR "md/raid0:%s: failed to allocate memory " \
+ "for bio_record", mdname(mddev));
+ bio->bi_error = -ENOMEM;
+ bio_endio(bio);
+ return;
+ }
+
+ zone_idx = first_zone - conf->strip_zone;
+ for (rdev_idx = 0; rdev_idx < first_zone->nb_dev; rdev_idx++) {
+ struct md_rdev *rdev;
+
+ rdev = conf->devlist[zone_idx * disks + rdev_idx];
+ recs[rdev_idx].rdev = rdev;
+ if (rdev == first_rdev)
+ first_rdev_idx = rdev_idx;
+ }
+
+ sectors = chunk_sects -
+ (likely(is_power_of_2(chunk_sects))
+ ? (sector & (chunk_sects - 1))
+ : sector_div(sector, chunk_sects));
+ sector = bio->bi_iter.bi_sector;
+
+ recs[first_rdev_idx].bi_sector = sector + first_zone->dev_start;
+ recs[first_rdev_idx].sectors = sectors;
+
+ /* recs[first_rdev_idx] is initialized with 'sectors', we need to
+ * handle the remaining sectors, which are stored in 'sectors' too.
+ */
+ sectors = bio_sectors(bio) - sectors;
+
+ /* bio may not be chunk size aligned, the split bio on first rdev
+ * may not be chunk size aligned too. But the rested split bios
+ * on rested rdevs must be chunk size aligned, and aligned to
+ * round down chunk number.
+ */
+ zone_end = first_zone->zone_end;
+ rdev_idx = first_rdev_idx + 1;
+ sector = likely(is_power_of_2(chunk_sects))
+ ? sector & (~(chunk_sects - 1))
+ : chunk_sects * (sector/chunk_sects);
+
+ while (rdev_idx < first_zone->nb_dev) {
+ if (recs[rdev_idx].sectors == 0) {
+ recs[rdev_idx].bi_sector = sector + first_zone->dev_start;
+ if (sectors <= chunk_sects) {
+ recs[rdev_idx].sectors = sectors;
+ goto issue;
+ }
+ recs[rdev_idx].sectors = chunk_sects;
+ sectors -= chunk_sects;
+ }
+ rdev_idx++;
+ }
+
+ sector += chunk_sects;
+ zone_start = sector + first_zone->dev_start;
+ if (zone_start == zone_end) {
+ zone_idx++;
+ if (zone_idx == nr_strip_zones) {
+ if (sectors != 0)
+ printk(KERN_INFO "bio size exceeds raid0 " \
+ "capability, ignore extra " \
+ "TRIM/DISCARD range.\n");
+ goto issue;
+ }
+ zone_start = conf->strip_zone[zone_idx].dev_start;
+ }
+
+ while (zone_idx < nr_strip_zones) {
+ int rdevs_in_zone = conf->strip_zone[zone_idx].nb_dev;
+ int chunks_per_rdev, rested_chunks, rested_sectors;
+ sector_t zone_sectors, grow_sectors;
+ int add_rested_sectors = 0;
+
+ zone_end = conf->strip_zone[zone_idx].zone_end;
+ zone_sectors = zone_end - zone_start;
+ chunks_per_rdev = sectors;
+ rested_sectors =
+ sector_div(chunks_per_rdev, chunk_sects * rdevs_in_zone);
+ rested_chunks = rested_sectors;
+ rested_sectors = sector_div(rested_chunks, chunk_sects);
+
+ if ((chunks_per_rdev * chunk_sects) > zone_sectors)
+ chunks_per_rdev = zone_sectors/chunk_sects;
+
+ /* rested_chunks and rested_sectors go into next zone, we won't
+ * handle them in this zone. Set them to 0.
+ */
+ if ((chunks_per_rdev * chunk_sects) == zone_sectors &&
+ (rested_chunks != 0 || rested_sectors != 0)) {
+ if (rested_chunks != 0)
+ rested_chunks = 0;
+ if (rested_sectors != 0)
+ rested_sectors = 0;
+ }
+
+ if (rested_chunks == 0 && rested_sectors != 0)
+ add_rested_sectors ++;
+
+ for (rdev_idx = 0; rdev_idx < rdevs_in_zone; rdev_idx++) {
+ /* if .sectors is not initialized (== 0), it indicates
+ * .bi_sector is not initialized either. We initialize
+ * .bi_sector first, then set .sectors by
+ * grow_sectors.
+ */
+ if (recs[rdev_idx].sectors == 0)
+ recs[rdev_idx].bi_sector = zone_start;
+ grow_sectors = chunks_per_rdev * chunk_sects;
+ if (rested_chunks) {
+ grow_sectors += chunk_sects;
+ rested_chunks--;
+ if (rested_chunks == 0 &&
+ rested_sectors != 0) {
+ recs[rdev_idx].sectors += grow_sectors;
+ sectors -= grow_sectors;
+ add_rested_sectors ++;
+ continue;
+ }
+ }
+
+ /* if add_rested_sectors != 0, it indicates
+ * rested_sectors != 0
+ */
+ if (add_rested_sectors == 1) {
+ grow_sectors += rested_sectors;
+ add_rested_sectors ++;
+ }
+ recs[rdev_idx].sectors += grow_sectors;
+ sectors -= grow_sectors;
+ if (sectors == 0)
+ break;
+ }
+
+ if (sectors == 0)
+ break;
+ zone_start = zone_end;
+ zone_idx++;
+ if (zone_idx < nr_strip_zones)
+ BUG_ON(zone_start != conf->strip_zone[zone_idx].dev_start);
+ }
+
+
+issue:
+ /* recs contains the re-ordered requests layout, now we can
+ * chain split bios from recs
+ */
+ for (rdev_idx = 0; rdev_idx < disks; rdev_idx++) {
+ if (rdev_idx == first_rdev_idx ||
+ recs[rdev_idx].sectors == 0)
+ continue;
+ split = bio_split(bio,
+ recs[rdev_idx].sectors,
+ GFP_NOIO,
+ fs_bio_set);
+ if (split == NULL)
+ break;
+ bio_chain(split, bio);
+ BUG_ON(split->bi_iter.bi_size != recs[rdev_idx].sectors << 9);
+ split->bi_bdev = recs[rdev_idx].rdev->bdev;
+ split->bi_iter.bi_sector = recs[rdev_idx].bi_sector +
+ recs[rdev_idx].rdev->data_offset;
+
+ if (unlikely(!blk_queue_discard(
+ bdev_get_queue(split->bi_bdev))))
+ /* Just ignore it */
+ bio_endio(split);
+ else
+ generic_make_request(split);
+ }
+ BUG_ON(bio->bi_iter.bi_size != recs[first_rdev_idx].sectors << 9);
+ bio->bi_iter.bi_sector = recs[first_rdev_idx].bi_sector +
+ recs[first_rdev_idx].rdev->data_offset;
+ bio->bi_bdev = recs[first_rdev_idx].rdev->bdev;
+
+ if (unlikely(!blk_queue_discard(bdev_get_queue(bio->bi_bdev))))
+ /* Just ignore it */
+ bio_endio(bio);
+ else
+ generic_make_request(bio);
+
+ kfree(recs);
+}
+
+static void handle_discard_bio(struct mddev *mddev, struct bio *bio)
+{
+ struct r0conf *conf = mddev->private;
+ unsigned int chunk_sects = mddev->chunk_sectors;
+ sector_t sector, sectors;
+ struct md_rdev *rdev;
+ struct strip_zone *zone;
+
+ sector = bio->bi_iter.bi_sector;
+ zone = find_zone(conf, &sector);
+ rdev = map_sector(mddev, zone, sector, &sector);
+ bio->bi_bdev = rdev->bdev;
+ sectors = chunk_sects -
+ (likely(is_power_of_2(chunk_sects))
+ ? (sector & (chunk_sects - 1))
+ : sector_div(sector, chunk_sects));
+
+ if (unlikely(sectors >= bio_sectors(bio))) {
+ bio->bi_iter.bi_sector = sector + zone->dev_start +
+ rdev->data_offset;
+ goto single_bio;
+ }
+
+ if (unlikely(zone->nb_dev == 1)) {
+ sectors = conf->strip_zone[0].zone_end -
+ sector;
+ if (bio_sectors(bio) > sectors)
+ bio->bi_iter.bi_size = sectors << 9;
+ bio->bi_iter.bi_sector = sector + rdev->data_offset;
+ goto single_bio;
+ }
+
+ handle_large_discard_bio(mddev, bio);
+ return;
+
+single_bio:
+ if (unlikely(!blk_queue_discard(bdev_get_queue(bio->bi_bdev))))
+ /* Just ignore it */
+ bio_endio(bio);
+ else
+ generic_make_request(bio);
+
+ return;
+}
+
+
static void raid0_make_request(struct mddev *mddev, struct bio *bio)
{
struct strip_zone *zone;
@@ -463,6 +723,11 @@ static void raid0_make_request(struct mddev *mddev, struct bio *bio)
return;
}
+ if (unlikely(bio_op(bio) == REQ_OP_DISCARD)) {
+ handle_discard_bio(mddev, bio);
+ return;
+ }
+
do {
sector_t sector = bio->bi_iter.bi_sector;
unsigned chunk_sects = mddev->chunk_sectors;
^ permalink raw reply related
* [PATCH] Always return last partition end address in 512B blocks
From: Mariusz Dabrowski @ 2016-12-08 11:13 UTC (permalink / raw)
To: linux-raid; +Cc: jes.sorensen, Mariusz Dabrowski
For 4K disks, 'endofpart' is the index of the last 4K sector used by a
partition. mdadm uses counts of 512-byte sectors, so the value returned by
get_last_partition_end must be multiplied by 8 for devices with 4K sectors.
Also, the unused 'ret' variable has been removed.
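As a quick illustration of the unit conversion described above, here is a
minimal sketch, assuming the logical sector size is a multiple of 512 bytes.
The helper name to_512_blocks_sketch is hypothetical and is not part of
mdadm.

```c
#include <assert.h>

/* Convert a count of native sectors into 512-byte blocks.
 * For 4096-byte sectors the factor is 4096 / 512 = 8. */
static unsigned long long to_512_blocks_sketch(unsigned long long native_sectors,
                                               unsigned int sector_size)
{
    assert(sector_size >= 512 && sector_size % 512 == 0);
    return native_sectors * (sector_size / 512);
}
```

On a 512-byte-sector disk the factor is 1, so the value is unchanged there.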
Signed-off-by: Mariusz Dabrowski <mariusz.dabrowski@intel.com>
---
util.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/util.c b/util.c
index 883eaa4..46c1280 100644
--- a/util.c
+++ b/util.c
@@ -1430,6 +1430,7 @@ static int get_last_partition_end(int fd, unsigned long long *endofpart)
struct MBR boot_sect;
unsigned long long curr_part_end;
unsigned part_nr;
+ unsigned int sector_size;
int retval = 0;
*endofpart = 0;
@@ -1469,6 +1470,9 @@ static int get_last_partition_end(int fd, unsigned long long *endofpart)
/* Unknown partition table */
retval = -1;
}
+ /* calculate number of 512-byte blocks */
+ if (get_dev_sector_size(fd, NULL, &sector_size))
+ *endofpart *= sector_size/512;
abort:
return retval;
}
@@ -1480,9 +1484,8 @@ int check_partitions(int fd, char *dname, unsigned long long freesize,
* Check where the last partition ends
*/
unsigned long long endofpart;
- int ret;
- if ((ret = get_last_partition_end(fd, &endofpart)) > 0) {
+ if (get_last_partition_end(fd, &endofpart) > 0) {
/* There appears to be a partition table here */
if (freesize == 0) {
/* partitions will not be visible in new device */
--
1.8.3.1
^ permalink raw reply related
* [PATCH] Use disk sector size value to set offset for reading GPT
From: Mariusz Dabrowski @ 2016-12-08 11:13 UTC (permalink / raw)
To: linux-raid; +Cc: jes.sorensen, Mariusz Dabrowski
mdadm uses an invalid byte offset when reading the GPT header to get
partition info (size, first sector, last sector etc.). Currently this offset
is hardcoded to 512 bytes, which is not valid for disks with a sector
size other than 512 bytes: MBR and GPT headers are aligned
to LBA, so the valid offset for 4k drives is 4096 bytes.
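The offsets involved can be written down as a small sketch. It assumes the
usual GPT layout with the header at LBA 1; the entry array location is passed
in as a parameter because the GPT header records it in its own field, while
this patch simply assumes the third block (LBA 2). The helper names are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define GPT_HEADER_LBA_SKETCH 1 /* header follows the protective MBR at LBA 0 */

/* Byte offset of a logical block on a device with the given sector size. */
static uint64_t lba_to_bytes_sketch(uint64_t lba, unsigned int sector_size)
{
    assert(sector_size >= 512 && sector_size % 512 == 0);
    return lba * (uint64_t)sector_size;
}

/* Byte offset of the GPT header: one logical sector into the device. */
static uint64_t gpt_header_offset_sketch(unsigned int sector_size)
{
    return lba_to_bytes_sketch(GPT_HEADER_LBA_SKETCH, sector_size);
}

/* Byte offset of the partition entry array starting at entries_lba. */
static uint64_t gpt_entries_offset_sketch(uint64_t entries_lba,
                                          unsigned int sector_size)
{
    return lba_to_bytes_sketch(entries_lba, sector_size);
}
```

For a 4096-byte-sector disk this gives a header offset of 4096 and, with the
entries at LBA 2, an entry array offset of 8192, instead of the hardcoded 512
and 1024.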
Signed-off-by: Mariusz Dabrowski <mariusz.dabrowski@intel.com>
---
super-gpt.c | 10 ++++++++++
util.c | 7 ++++++-
2 files changed, 16 insertions(+), 1 deletion(-)
diff --git a/super-gpt.c b/super-gpt.c
index 1a2adce..8b080a0 100644
--- a/super-gpt.c
+++ b/super-gpt.c
@@ -73,6 +73,7 @@ static int load_gpt(struct supertype *st, int fd, char *devname)
struct MBR *super;
struct GPT *gpt_head;
int to_read;
+ unsigned int sector_size;
free_gpt(st);
@@ -81,6 +82,11 @@ static int load_gpt(struct supertype *st, int fd, char *devname)
return 1;
}
+ if (!get_dev_sector_size(fd, devname, &sector_size)) {
+ free(super);
+ return 1;
+ }
+
lseek(fd, 0, 0);
if (read(fd, super, sizeof(*super)) != sizeof(*super)) {
no_read:
@@ -100,6 +106,8 @@ static int load_gpt(struct supertype *st, int fd, char *devname)
free(super);
return 1;
}
+ /* Set offset to second block (GPT header) */
+ lseek(fd, sector_size, SEEK_SET);
/* Seem to have GPT, load the header */
gpt_head = (struct GPT*)(super+1);
if (read(fd, gpt_head, sizeof(*gpt_head)) != sizeof(*gpt_head))
@@ -111,6 +119,8 @@ static int load_gpt(struct supertype *st, int fd, char *devname)
to_read = __le32_to_cpu(gpt_head->part_cnt) * sizeof(struct GPT_part_entry);
to_read = ((to_read+511)/512) * 512;
+ /* Set offset to third block (GPT entries) */
+ lseek(fd, sector_size*2, SEEK_SET);
if (read(fd, gpt_head+1, to_read) != to_read)
goto no_read;
diff --git a/util.c b/util.c
index 883eaa4..818f839 100644
--- a/util.c
+++ b/util.c
@@ -1378,12 +1378,15 @@ static int get_gpt_last_partition_end(int fd, unsigned long long *endofpart)
unsigned long long curr_part_end;
unsigned all_partitions, entry_size;
unsigned part_nr;
+ unsigned int sector_size = 0;
*endofpart = 0;
BUILD_BUG_ON(sizeof(gpt) != 512);
/* skip protective MBR */
- lseek(fd, 512, SEEK_SET);
+ if (!get_dev_sector_size(fd, NULL, &sector_size))
+ return 0;
+ lseek(fd, sector_size, SEEK_SET);
/* read GPT header */
if (read(fd, &gpt, 512) != 512)
return 0;
@@ -1403,6 +1406,8 @@ static int get_gpt_last_partition_end(int fd, unsigned long long *endofpart)
part = (struct GPT_part_entry *)buf;
+ /* set offset to third block (GPT entries) */
+ lseek(fd, sector_size*2, SEEK_SET);
for (part_nr = 0; part_nr < all_partitions; part_nr++) {
/* read partition entry */
if (read(fd, buf, entry_size) != (ssize_t)entry_size)
--
1.8.3.1
^ permalink raw reply related
* [PATCH] imsm: set generation number when reading superblock
From: Mariusz Dabrowski @ 2016-12-08 11:12 UTC (permalink / raw)
To: linux-raid; +Cc: jes.sorensen, Mariusz Dabrowski
IMSM doesn't set the 'events' field with the generation number, so sometimes
mdadm tries to re-assemble a container using metadata which isn't the most
recent (e.g. from a spare disk).
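To see why an unset 'events' value matters, consider a simplified selection
loop that prefers the device with the highest event count. This is a
hypothetical sketch for illustration only, not mdadm's assembly code; the
struct and function names are made up.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-device summary used while choosing metadata. */
struct dev_info_sketch {
    const char *name;
    unsigned long long events; /* generation number of the metadata */
};

/* Return the index of the device with the highest event count.
 * Ties go to the first device seen; returns -1 for an empty list. */
static int pick_most_recent_sketch(const struct dev_info_sketch *devs, int n)
{
    int best = -1;

    assert(devs != NULL || n <= 0);
    for (int i = 0; i < n; i++) {
        if (best < 0 || devs[i].events > devs[best].events)
            best = i;
    }
    return best;
}
```

If every device reports 0 events, the comparison cannot distinguish them and
whichever device comes first wins, which may be a spare with stale metadata.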
Signed-off-by: Mariusz Dabrowski <mariusz.dabrowski@intel.com>
---
super-intel.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/super-intel.c b/super-intel.c
index 0407d43..06b199a 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -3381,6 +3381,7 @@ static void getinfo_super_imsm(struct supertype *st, struct mdinfo *info, char *
/* do we have the all the insync disks that we expect? */
mpb = super->anchor;
+ info->events = __le32_to_cpu(mpb->generation_num);
for (i = 0; i < mpb->num_raid_devs; i++) {
struct imsm_dev *dev = get_imsm_dev(super, i);
--
1.8.3.1
^ permalink raw reply related
* Re: [PATCH v2 09/12] raid5-ppl: read PPL signature from IMSM metadata
From: Artur Paszkiewicz @ 2016-12-08 10:36 UTC (permalink / raw)
To: NeilBrown, shli; +Cc: linux-raid
In-Reply-To: <87inqv2ubl.fsf@notabene.neil.brown.name>
On 12/08/2016 12:27 AM, NeilBrown wrote:
> On Thu, Dec 08 2016, Artur Paszkiewicz wrote:
>
>> On 12/07/2016 02:25 AM, NeilBrown wrote:
>>> On Tue, Dec 06 2016, Artur Paszkiewicz wrote:
>>>
>>>> The PPL signature is used to determine if the stored PPL is valid for a
>>>> given array. With IMSM, the PPL signature should match the
>>>> orig_family_num field of the superblock. To avoid passing this value
>>>> from userspace, it can be read from the IMSM MPB when initializing the
>>>> log.
>>>
>>> It is up to mdadm to determine if the PPL is valid. It would only tell
>>> the kernel that a PPL exists if it is valid...
>>
>> The kernel also has to know this value because it writes it to the PPL
>> header. So yet another sysfs attribute just for this value? How about
>> adding a directory similar to "bitmap" to hold all the PPL related
>> settings?
>
> There is only one PPL header (per device) - correct?
> md can read that header, change the few fields that it knows about, and
> write it back out again. Does it ever need to change the PPL signature?
>
> When a PPL is created, mdadm can write the initial header.
>
> Am I missing something?
Well, you are right. If mdadm takes care of validating PPL and writes
the initial header, md can use the signature read from the header. Much
simpler that way :) Thanks for the suggestion.
Artur
^ permalink raw reply
* Re: [PATCH v2 03/12] raid5-cache: add a new policy
From: Artur Paszkiewicz @ 2016-12-08 10:28 UTC (permalink / raw)
To: NeilBrown, shli; +Cc: linux-raid
In-Reply-To: <87lgvr2uhk.fsf@notabene.neil.brown.name>
On 12/08/2016 12:24 AM, NeilBrown wrote:
> On Thu, Dec 08 2016, Artur Paszkiewicz wrote:
>
>> On 12/07/2016 01:46 AM, NeilBrown wrote:
>>> On Tue, Dec 06 2016, Artur Paszkiewicz wrote:
>>>
>>>> Add a source file for the new policy implementation and allow selecting
>>>> the policy based on the policy_type parameter in r5l_init_log().
>>>>
>>>> Introduce a new flag for rdev state flags to allow enabling the new
>>>> policy from userspace.
>>>
>>> This seems odd. Why is this a per-device flag?
>>> It makes sense for "journal" to be a per-device flag, because only one
>>> device is the journal device and it is obviously different from the
>>> others.
>>>
>>> But with the ppl, all devices serve as journal devices. So we would
>>> need to set journal_ppl on all devices? What happens if you set it on
>>> some, but not others? I see you get an error.
>>>
>>> I think some sort of array-wide setting would make more sense, would it
>>> not?
>>
>> Yes, it would. The problem exists only for external metadata, because
>> for native there is a feature flag in the superblock and a corresponding
>> flag in mddev->flags. Patch 12 adds a sysfs attribute to control the
>> policy at runtime but it would have to be moved out of raid5 personality
>> into the main md code. I didn't like the idea of adding something
>> specific to raid5 to generic code and visible in sysfs for unrelated
>> raid levels.
>>
>>> And what is an RWH??? A Really Weird Handle ??
>>>
>>> I guess it is probably a Raid5 Write Hole ? At the very least there
>>> should be a comment explaining this when you define the enum. (remember,
>>> you are trying to make it easier for reviewers).
>>
>> That's right, RWH stands for RAID Write Hole. I think I introduced it in
>> the cover letter, but I'll explain it also in the code.
>>
>>> It might almost make sense for bitmap/metadata to be used here.
>>> It can currently be "external" "internal" "clustered".
>>> Allow also "journalled" or "partial-parity-log" ???
>>>
>>> Maybe not ... but I'd definitely prefer a global setting, and one that
>>> didn't use an obscure abbreviation.
>>
>> So do you think the sysfs attribute from patch 12 ("rwh_policy") could
>> be made global? This would simplify things but it doesn't seem right.
>> And about the abbreviation, should it be called "write_hole_policy" or
>> "raid5_write_hole_policy"? Maybe using bitmap/metadata is not a bad
>> idea...
>
> How about we call it "resync_policy" which describes how to cope with
> unexpected shutdown. Options include:
>
> full regenerate all redundancy info after a crash
> bitmap only regenerate redundancy info indicated by bitmap
> (both these susceptible to write-hole on raid456)
> journal raid456 only, though could theoretically be extended
> to raid1, raid10 : log transactions and replay after crash
> ppl raid456 only: log partial-parity before writes.
>
> With external metadata, this must be set explicitly. With internal
> metadata, it is set based on flags etc.
>
> Thoughts? I'm not really happy with "full", but I cannot think of a
> better name.
I'm OK with this approach. A corresponding option will also be needed in
mdadm. But won't this name be a little misleading, because this option is
not just about unexpected shutdown (where resync applies), but also about
unexpected shutdown of a degraded array? So maybe something like
"consistency_policy" and a "resync" option instead of "full"?
Thanks,
Artur
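The proposal in this thread amounts to one array-wide setting with four named
values. A minimal sketch of how such names could map to an enum is shown
below. The identifiers are hypothetical and this is not the eventual kernel
or mdadm interface; the value names follow the list quoted above, with
"resync" used in place of "full" as suggested in this reply.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical array-wide policy values discussed in this thread. */
enum consistency_policy_sketch {
    POLICY_UNKNOWN = -1,
    POLICY_RESYNC,  /* regenerate all redundancy info after a crash */
    POLICY_BITMAP,  /* regenerate only what the bitmap marks dirty */
    POLICY_JOURNAL, /* log transactions, replay after a crash */
    POLICY_PPL,     /* log partial parity before writes */
};

static const char *const policy_names_sketch[] = {
    "resync", "bitmap", "journal", "ppl",
};

/* Map a textual value (e.g. written to a sysfs file) to the enum. */
static enum consistency_policy_sketch parse_policy_sketch(const char *s)
{
    assert(s != NULL);
    for (int i = 0; i < 4; i++) {
        if (strcmp(s, policy_names_sketch[i]) == 0)
            return (enum consistency_policy_sketch)i;
    }
    return POLICY_UNKNOWN;
}

/* Per the list quoted above, only journal and ppl address the write hole. */
static int policy_closes_write_hole_sketch(enum consistency_policy_sketch p)
{
    return p == POLICY_JOURNAL || p == POLICY_PPL;
}
```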
^ permalink raw reply
* Re: [BUG] MD/RAID1 hung forever on freeze_array
From: Jinpu Wang @ 2016-12-08 9:50 UTC (permalink / raw)
To: NeilBrown; +Cc: Coly Li, linux-raid, Shaohua Li, Nate Dailey
In-Reply-To: <871sxj2jpd.fsf@notabene.neil.brown.name>
Thanks Neil for the valuable input, please see comments inline.
On Thu, Dec 8, 2016 at 4:17 AM, NeilBrown <neilb@suse.com> wrote:
> On Thu, Dec 08 2016, Jinpu Wang wrote:
>
>> On Tue, Nov 29, 2016 at 12:15 PM, Jinpu Wang
>> <jinpu.wang@profitbricks.com> wrote:
>>> On Mon, Nov 28, 2016 at 10:10 AM, Coly Li <colyli@suse.de> wrote:
>>>> On 2016/11/28 5:02 PM, Jinpu Wang wrote:
>>>>> On Mon, Nov 28, 2016 at 9:54 AM, Coly Li <colyli@suse.de> wrote:
>>>>>> On 2016/11/28 4:24 PM, Jinpu Wang wrote:
>>>>>>> snip
>>>>>>>>>>
>>>>>>>>>> every time nr_pending is 1 bigger than (nr_queued + 1), so it seems we
>>>>>>>>>> forgot to increase nr_queued somewhere?
>>>>>>>>>>
>>>>>>>>>> I've noticed (commit ccfc7bf1f09d61)raid1: include bio_end_io_list in
>>>>>>>>>> nr_queued to prevent freeze_array hang. Seems it fixed similar bug.
>>>>>>>>>>
>>>>>>>>>> Could you give your suggestion?
>>>>>>>>>>
>>>>>>>>> Sorry, forgot to mention kernel version is 4.4.28
>
>>
>> I continued debugging the bug:
>>
>> 20161207
>
>> nr_pending = 948,
>> nr_waiting = 9,
>> nr_queued = 946, // again we need one more to finish wait_event.
>> barrier = 0,
>> array_frozen = 1,
>
>> on conf->bio_end_io_list we have 91 entries.
>
>> on conf->retry_list we have 855
>
> This is useful. It confirms that nr_queued is correct, and that
> nr_pending is consistently 1 higher than expected.
> This suggests that a request has been counted in nr_pending, but hasn't
> yet been submitted, or has been taken off one of the queues but has not
> yet been processed.
>
> I notice that in your first email the Blocked tasks listed included
> raid1d which is blocked in freeze_array() and a few others in
> make_request() blocked on wait_barrier().
> In that case nr_waiting was 100, so there should have been 100 threads
> blocked in wait_barrier(). Is that correct? I assume you thought it
> was pointless to list them all, which seems reasonable.
This is correct. In my test I initially set numjobs to 100 in fio; I
later reduced that to 10 and can still trigger the bug.
>
> I asked because I wonder if there might have been one thread in
> make_request() which was blocked on something else. There are a couple
> of places where make_request() will wait after having successfully called
> wait_barrier(). If that happened, it would cause exactly the symptoms
> you report. Could you check all blocked threads carefully please?
Every time it's the same hung task; raid1d is blocked in freeze_array:
PID: 5002 TASK: ffff8800ad430d00 CPU: 0 COMMAND: "md1_raid1"
#0 [ffff8800ad44bbf8] __schedule at ffffffff81811483
#1 [ffff8800ad44bc50] schedule at ffffffff81811c60
#2 [ffff8800ad44bc68] freeze_array at ffffffffa085a1e1 [raid1]
#3 [ffff8800ad44bcc0] handle_read_error at ffffffffa085bfe9 [raid1]
#4 [ffff8800ad44bd68] raid1d at ffffffffa085d29b [raid1]
#5 [ffff8800ad44be60] md_thread at ffffffffa0549040 [md_mod]
and 9 fio threads are blocked in wait_barrier:
PID: 5172 TASK: ffff88022bf4a700 CPU: 3 COMMAND: "fio"
#0 [ffff88022186b6d8] __schedule at ffffffff81811483
#1 [ffff88022186b730] schedule at ffffffff81811c60
#2 [ffff88022186b748] wait_barrier at ffffffffa085a0b6 [raid1]
#3 [ffff88022186b7b0] make_request at ffffffffa085c3cd [raid1]
#4 [ffff88022186b890] md_make_request at ffffffffa0549903 [md_mod]
The other fio thread is sleeping in futex wait; I think it's unrelated:
PID: 5171 TASK: ffff88022bf49a00 CPU: 2 COMMAND: "fio"
#0 [ffff880221f97c08] __schedule at ffffffff81811483
#1 [ffff880221f97c60] schedule at ffffffff81811c60
#2 [ffff880221f97c78] futex_wait_queue_me at ffffffff810c8a42
#3 [ffff880221f97cb0] futex_wait at ffffffff810c9c72
>
>
> There are other ways that nr_pending and nr_queued can get out of sync,
> though I think they would result in nr_pending being less than
> nr_queued, not more.
>
> If the presence of a bad block in the bad block log causes a request to
> be split into two r1bios, and if both of those end up on one of the
> queues, then they would be added to nr_queued twice, but to nr_pending
> only once. We should fix that.
I checked: md has detected some bad blocks. I'm not sure whether this helps.
root@ibnbd-clt1:~/jack# cat /sys/block/md1/md/dev-ibnbd0/bad_blocks
180630 512
181142 7
982011 8
1013386 255
I also checked each md_rdev:
crash> struct md_rdev 0xffff880230630000
struct md_rdev {
same_set = {
next = 0xffff88022db03818,
prev = 0xffff880230653e00
},
sectors = 2095104,
mddev = 0xffff88022db03800,
last_events = 125803676,
meta_bdev = 0x0,
bdev = 0xffff880234a56080,
sb_page = 0xffffea0008d3edc0,
bb_page = 0xffffea0008b17400,
sb_loaded = 1,
sb_events = 72,
data_offset = 2048,
new_data_offset = 2048,
sb_start = 8,
sb_size = 512,
preferred_minor = 65535,
kobj = {
name = 0xffff88022c567510 "dev-loop1",
entry = {
next = 0xffff880230630080,
prev = 0xffff880230630080
},
parent = 0xffff88022db03850,
kset = 0x0,
ktype = 0xffffffffa055f200 <rdev_ktype>,
sd = 0xffff88022c4ac708,
kref = {
refcount = {
counter = 1
}
},
state_initialized = 1,
state_in_sysfs = 1,
state_add_uevent_sent = 0,
state_remove_uevent_sent = 0,
uevent_suppress = 0
},
flags = 2,
blocked_wait = {
lock = {
{
rlock = {
raw_lock = {
val = {
counter = 0
}
}
}
}
},
task_list = {
next = 0xffff8802306300c8,
prev = 0xffff8802306300c8
}
},
desc_nr = 1,
raid_disk = 1,
new_raid_disk = 0,
saved_raid_disk = -1,
{
recovery_offset = 18446744073709551615,
journal_tail = 18446744073709551615
},
nr_pending = {
counter = 1
},
read_errors = {
counter = 0
},
last_read_error = {
tv_sec = 0,
tv_nsec = 0
},
corrected_errors = {
counter = 0
},
del_work = {
data = {
counter = 0
},
entry = {
next = 0x0,
prev = 0x0
},
func = 0x0
},
sysfs_state = 0xffff88022c4ac780,
badblocks = {
count = 0,
unacked_exist = 0,
shift = 0,
page = 0xffff88022740d000,
changed = 0,
lock = {
seqcount = {
sequence = 60
},
lock = {
{
rlock = {
raw_lock = {
val = {
counter = 0
}
}
}
}
}
},
sector = 0,
size = 0
}
}
crash>
crash> struct md_rdev 0xffff880230653e00
struct md_rdev {
same_set = {
next = 0xffff880230630000,
prev = 0xffff88022db03818
},
sectors = 2095104,
mddev = 0xffff88022db03800,
last_events = 9186098,
meta_bdev = 0x0,
bdev = 0xffff880234a56700,
sb_page = 0xffffea0007758f40,
bb_page = 0xffffea000887b480,
sb_loaded = 1,
sb_events = 42,
data_offset = 2048,
new_data_offset = 2048,
sb_start = 8,
sb_size = 512,
preferred_minor = 65535,
kobj = {
name = 0xffff880233c825b0 "dev-ibnbd0",
entry = {
next = 0xffff880230653e80,
prev = 0xffff880230653e80
},
parent = 0xffff88022db03850,
kset = 0x0,
ktype = 0xffffffffa055f200 <rdev_ktype>,
sd = 0xffff8800b9fcce88,
kref = {
refcount = {
counter = 1
}
},
state_initialized = 1,
state_in_sysfs = 1,
state_add_uevent_sent = 0,
state_remove_uevent_sent = 0,
uevent_suppress = 0
},
flags = 581,
blocked_wait = {
lock = {
{
rlock = {
raw_lock = {
val = {
counter = 0
}
}
}
}
},
task_list = {
next = 0xffff880230653ec8,
prev = 0xffff880230653ec8
}
},
desc_nr = 0,
raid_disk = 0,
new_raid_disk = 0,
saved_raid_disk = -1,
{
recovery_offset = 18446744073709551615,
journal_tail = 18446744073709551615
},
nr_pending = {
counter = 856
},
read_errors = {
counter = 0
},
last_read_error = {
tv_sec = 0,
tv_nsec = 0
},
corrected_errors = {
counter = 0
},
del_work = {
data = {
counter = 0
},
entry = {
next = 0x0,
prev = 0x0
},
func = 0x0
},
sysfs_state = 0xffff8800b9fccf78,
badblocks = {
count = 4,
unacked_exist = 0,
shift = 0,
page = 0xffff880227211000,
changed = 0,
lock = {
seqcount = {
sequence = 1624
},
lock = {
{
rlock = {
raw_lock = {
val = {
counter = 0
}
}
}
}
}
},
sector = 80,
size = 8
}
}
>
>
>>
>> list -H 0xffff8800b96acac0 r1bio.retry_list -s r1bio
>>
>> ffff8800b9791ff8
>> struct r1bio {
>> remaining = {
>> counter = 0
>> },
>> behind_remaining = {
>> counter = 0
>> },
>> sector = 18446612141670676480, // corrupted?
>> start_next_window = 18446612141565972992, //ditto
>
> I don't think this is corruption.
>
>> crash> struct r1conf 0xffff8800b9792000
>> struct r1conf {
> ....
>> retry_list = {
>> next = 0xffff8800afe690c0,
>> prev = 0xffff8800b96acac0
>> },
>
> The pointer you started at was at the end of the list.
> So this r1bio structure you are seeing is not an r1bio at all but the
> memory out of the middle of the r1conf, being interpreted as an r1bio.
> You can confirm this by noticing that retry_list in the r1bio:
>
>> retry_list = {
>> next = 0xffff8800afe690c0,
>> prev = 0xffff8800b96acac0
>> },
>
> is exactly the same as the retry_list in the r1conf.
>
> NeilBrown
Oh, thanks for the explanation. I had noticed this coincidence and was
curious why.
I still have the machine with the hung task alive, so I can check whatever
you think is necessary.
Thanks again for debugging this with me.
--
Jinpu Wang
Linux Kernel Developer
ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin
Tel: +49 30 577 008 042
Fax: +49 30 577 008 299
Email: jinpu.wang@profitbricks.com
URL: https://www.profitbricks.de
Sitz der Gesellschaft: Berlin
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B
Geschäftsführer: Achim Weiss
^ permalink raw reply
* Re: [BUG] MD/RAID1 hung forever on freeze_array
From: NeilBrown @ 2016-12-08 3:17 UTC (permalink / raw)
To: Jinpu Wang, Coly Li; +Cc: linux-raid, Shaohua Li, Nate Dailey
In-Reply-To: <CAMGffE=T15eLaROLCDGBA_OxQgUZbo22LQZJuSji=Z=rZRGr6Q@mail.gmail.com>
[-- Attachment #1: Type: text/plain, Size: 3489 bytes --]
On Thu, Dec 08 2016, Jinpu Wang wrote:
> On Tue, Nov 29, 2016 at 12:15 PM, Jinpu Wang
> <jinpu.wang@profitbricks.com> wrote:
>> On Mon, Nov 28, 2016 at 10:10 AM, Coly Li <colyli@suse.de> wrote:
>>> On 2016/11/28 下午5:02, Jinpu Wang wrote:
>>>> On Mon, Nov 28, 2016 at 9:54 AM, Coly Li <colyli@suse.de> wrote:
>>>>> On 2016/11/28 下午4:24, Jinpu Wang wrote:
>>>>>> snip
>>>>>>>>>
>>>>>>>>> every time nr_pending is 1 bigger than (nr_queued + 1), so it seems we
>>>>>>>>> forgot to increase nr_queued somewhere?
>>>>>>>>>
>>>>>>>>> I've noticed (commit ccfc7bf1f09d61)raid1: include bio_end_io_list in
>>>>>>>>> nr_queued to prevent freeze_array hang. Seems it fixed similar bug.
>>>>>>>>>
>>>>>>>>> Could you give your suggestion?
>>>>>>>>>
>>>>>>>> Sorry, forgot to mention kernel version is 4.4.28
>
> I continued debugging the bug:
>
> 20161207
> nr_pending = 948,
> nr_waiting = 9,
> nr_queued = 946, // again we need one more to finish wait_event.
> barrier = 0,
> array_frozen = 1,
> on conf->bio_end_io_list we have 91 entries.
> on conf->retry_list we have 855
This is useful. It confirms that nr_queued is correct, and that
nr_pending is consistently 1 higher than expected.
This suggests that a request has been counted in nr_pending, but hasn't
yet been submitted, or has been taken off one of the queues but has not
yet been processed.
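To make the arithmetic concrete, here is a minimal user-space sketch of the
condition freeze_array() sleeps on, nr_pending == nr_queued + extra, applied
to the counters from the dump. The struct and function names are
illustrative and not the kernel's; extra = 1 is an assumption based on the
caller being handle_read_error(), which owns one pending request itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative subset of the raid1 counters; not the kernel's struct r1conf. */
struct r1_counters {
	int nr_pending;	/* requests that have passed wait_barrier() */
	int nr_queued;	/* requests parked on retry_list or bio_end_io_list */
};

/*
 * Model of the condition freeze_array() waits for:
 *     nr_pending == nr_queued + extra
 * 'extra' is the number of pending requests owned by the caller itself.
 */
static bool freeze_can_complete(const struct r1_counters *c, int extra)
{
	assert(extra >= 0);
	return c->nr_pending == c->nr_queued + extra;
}

/* Number of pending requests not accounted for (0 means the wait can end). */
static int freeze_missing(const struct r1_counters *c, int extra)
{
	return c->nr_pending - (c->nr_queued + extra);
}
```

With the reported values (nr_pending = 948, nr_queued = 91 + 855 = 946,
extra = 1) the condition is false and exactly one request is unaccounted for.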
I notice that in your first email the Blocked tasks listed included
raid1d which is blocked in freeze_array() and a few others in
make_request() blocked on wait_barrier().
In that case nr_waiting was 100, so there should have been 100 threads
blocked in wait_barrier(). Is that correct? I assume you thought it
was pointless to list them all, which seems reasonable.
I asked because I wonder if there might have been one thread in
make_request() which was blocked on something else. There are a couple
of places where make_request() will wait after having successfully called
wait_barrier(). If that happened, it would cause exactly the symptoms
you report. Could you check all blocked threads carefully please?
There are other ways that nr_pending and nr_queued can get out of sync,
though I think they would result in nr_pending being less than
nr_queued, not more.
If the presence of a bad block in the bad block log causes a request to
be split into two r1bios, and if both of those end up on one of the
queues, then they would be added to nr_queued twice, but to nr_pending
only once. We should fix that.
>
> list -H 0xffff8800b96acac0 r1bio.retry_list -s r1bio
>
> ffff8800b9791ff8
> struct r1bio {
> remaining = {
> counter = 0
> },
> behind_remaining = {
> counter = 0
> },
> sector = 18446612141670676480, // corrupted?
> start_next_window = 18446612141565972992, //ditto
I don't think this is corruption.
> crash> struct r1conf 0xffff8800b9792000
> struct r1conf {
....
> retry_list = {
> next = 0xffff8800afe690c0,
> prev = 0xffff8800b96acac0
> },
The pointer you started at was at the end of the list.
So this r1bio structure you are seeing is not an r1bio at all but the
memory out of the middle of the r1conf, being interpreted as an r1bio.
You can confirm this by noticing that retry_list in the r1bio:
> retry_list = {
> next = 0xffff8800afe690c0,
> prev = 0xffff8800b96acac0
> },
is exactly the same as the retry_list in the r1conf.
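The effect described above can be reproduced in a small user-space sketch:
applying the entry offset to a node that is really the list head yields an
address inside (or next to) the owning structure, not a real entry. The
demo_conf and demo_bio types below are hypothetical stand-ins with an
invented layout; only the embedded-list_head idea matches the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct list_head {
	struct list_head *next, *prev;
};

/* Hypothetical stand-ins; the field layout is illustrative only. */
struct demo_conf {
	long other_state[4];
	struct list_head retry_list;	/* list head: anchors the list */
};

struct demo_bio {
	long sector;
	struct list_head retry_list;	/* list node: embedded in each entry */
};

/* Address an entry would have if 'node' were embedded in a demo_bio. */
static uintptr_t entry_addr(const struct list_head *node)
{
	return (uintptr_t)node - offsetof(struct demo_bio, retry_list);
}

/* Count real entries, stopping when the walk returns to the head. */
static int count_entries(const struct list_head *head)
{
	int n = 0;

	for (const struct list_head *p = head->next; p != head; p = p->next)
		n++;
	return n;
}

/* True if treating 'node' as an entry would land inside the conf object. */
static bool entry_overlaps_conf(const struct list_head *node,
				const struct demo_conf *conf)
{
	uintptr_t a = entry_addr(node);
	uintptr_t lo = (uintptr_t)conf, hi = lo + sizeof(*conf);

	assert(node != NULL);
	return a >= lo && a < hi;
}
```

A walk that starts from a real entry and treats it as the head will visit
the true head as if it were an entry, which is where the bogus "r1bio"
comes from; starting the walk at the head in the conf avoids this.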
NeilBrown
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]
^ permalink raw reply
* Re: [PATCH] dm: Avoid sleeping while holding the dm_bufio lock
From: Doug Anderson @ 2016-12-08 0:54 UTC (permalink / raw)
To: Mikulas Patocka
Cc: Alasdair Kergon, Mike Snitzer, Shaohua Li, Dmitry Torokhov,
linux-kernel@vger.kernel.org, linux-raid, dm-devel,
David Rientjes, Sonny Rao, Guenter Roeck
In-Reply-To: <alpine.LRH.2.02.1611231543020.31481@file01.intranet.prod.int.rdu2.redhat.com>
Hi,
On Wed, Nov 23, 2016 at 12:57 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> Hi
>
> The GFP_NOIO allocation frees clean cached pages. The GFP_NOWAIT
> allocation doesn't. Your patch would incorrectly reuse buffers in a
> situation when the memory is filled with clean cached pages.
>
> Here I'm proposing an alternate patch that first tries GFP_NOWAIT
> allocation, then drops the lock and tries GFP_NOIO allocation.
>
> Note that the root cause why you are seeing this stacktrace is, that your
> block device is congested - i.e. there are too many requests in the
> device's queue - and note that fixing this wait won't fix the root cause
> (congested device).
>
> The congestion limits are set in blk_queue_congestion_threshold to 7/8 to
> 13/16 size of the nr_requests value.
>
> If you don't want your device to report the congested status, you can
> increase /sys/block/<device>/queue/nr_requests - you should test if your
> chromebook is faster or slower with this setting increased. But note that
> this setting won't increase the IO-per-second of the device.
Cool, thanks for the insight!
Can you clarify which block device is relevant here? Is this the DM
block device, the underlying block device, or the swap block device?
I'm not at all an expert on DM, but I think we have:
1. /dev/mmcblk0 - the underlying storage device.
2. /dev/dm-0 - The verity device that's run atop /dev/mmcblk0p3
3. /dev/zram0 - Our swap device
As stated in the original email, I'm running on a downstream kernel
(kernel-4.4) with bunches of local patches, so it's plausible that
things have changed in the meantime, but:
* At boot time the "nr_requests" for all block devices was 128
* I was unable to set the "nr_requests" for dm-0 and zram0 (it just
gives an error in sysfs).
* When I set "nr_requests" to 4096 for /dev/mmcblk0 it didn't seem to
affect the problem.
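For reference, the 7/8 and 13/16 fractions quoted from Mikulas can be worked
out for the two nr_requests values mentioned here (128 and 4096). This is a
rough sketch only: the helper and struct names are hypothetical, and the
kernel's own calculation may round differently by a request or so.

```c
#include <assert.h>

/* Hypothetical names; only the 7/8 and 13/16 fractions come from the thread. */
struct congestion_limits {
	unsigned long on;	/* approximate depth where congestion is reported */
	unsigned long off;	/* approximate depth where it is cleared again */
};

static struct congestion_limits congestion_limits_for(unsigned long nr_requests)
{
	struct congestion_limits lim;

	assert(nr_requests > 0);
	lim.on = nr_requests * 7 / 8;
	lim.off = nr_requests * 13 / 16;
	return lim;
}
```

Under these assumptions nr_requests = 128 gives roughly 112/104, and
nr_requests = 4096 gives roughly 3584/3328.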
> Mikulas
>
>
> On Thu, 17 Nov 2016, Douglas Anderson wrote:
>
>> We've seen in-field reports showing _lots_ (18 in one case, 41 in
>> another) of tasks all sitting there blocked on:
>>
>> mutex_lock+0x4c/0x68
>> dm_bufio_shrink_count+0x38/0x78
>> shrink_slab.part.54.constprop.65+0x100/0x464
>> shrink_zone+0xa8/0x198
>>
>> In the two cases analyzed, we see one task that looks like this:
>>
>> Workqueue: kverityd verity_prefetch_io
>>
>> __switch_to+0x9c/0xa8
>> __schedule+0x440/0x6d8
>> schedule+0x94/0xb4
>> schedule_timeout+0x204/0x27c
>> schedule_timeout_uninterruptible+0x44/0x50
>> wait_iff_congested+0x9c/0x1f0
>> shrink_inactive_list+0x3a0/0x4cc
>> shrink_lruvec+0x418/0x5cc
>> shrink_zone+0x88/0x198
>> try_to_free_pages+0x51c/0x588
>> __alloc_pages_nodemask+0x648/0xa88
>> __get_free_pages+0x34/0x7c
>> alloc_buffer+0xa4/0x144
>> __bufio_new+0x84/0x278
>> dm_bufio_prefetch+0x9c/0x154
>> verity_prefetch_io+0xe8/0x10c
>> process_one_work+0x240/0x424
>> worker_thread+0x2fc/0x424
>> kthread+0x10c/0x114
>>
>> ...and that looks to be the one holding the mutex.
>>
>> The problem has been reproduced fairly easily:
>> 0. Be running Chrome OS w/ verity enabled on the root filesystem
>> 1. Pick test patch: http://crosreview.com/412360
>> 2. Install launchBalloons.sh and balloon.arm from
>> http://crbug.com/468342
>> ...that's just a memory stress test app.
>> 3. On a 4GB rk3399 machine, run
>> nice ./launchBalloons.sh 4 900 100000
>> ...that tries to eat 4 * 900 MB of memory and keep accessing.
>> 4. Login to the Chrome web browser and restore many tabs
>>
>> With that, I've seen printouts like:
>> DOUG: long bufio 90758 ms
>> ...and the stack trace always shows we're in dm_bufio_prefetch().
>>
>> The problem is that we try to allocate memory with GFP_NOIO while
>> we're holding the dm_bufio lock. Instead we should be using
>> GFP_NOWAIT. Using GFP_NOIO can cause us to sleep while holding the
>> lock and that causes the above problems.
>>
>> The current behavior explained by David Rientjes:
>>
>> It will still try reclaim initially because __GFP_WAIT (or
>> __GFP_KSWAPD_RECLAIM) is set by GFP_NOIO. This is the cause of
>> contention on dm_bufio_lock() that the thread holds. You want to
>> pass GFP_NOWAIT instead of GFP_NOIO to alloc_buffer() when holding a
>> mutex that can be contended by a concurrent slab shrinker (if
>> count_objects didn't use a trylock, this pattern would trivially
>> deadlock).
>>
>> Suggested-by: David Rientjes <rientjes@google.com>
>> Signed-off-by: Douglas Anderson <dianders@chromium.org>
>> ---
>> Note that this change was developed and tested against the Chrome OS
>> 4.4 kernel tree, not mainline. Due to slight differences in verity
>> between mainline and Chrome OS it became too difficult to reproduce my
>> testing setup on mainline. This patch still seems correct and
>> relevant to upstream, so I'm posting it. If this is not acceptable to
>> you then please ignore this patch.
>>
>> Also note that when I tested the Chrome OS 3.14 kernel tree I couldn't
>> reproduce the long delays described in the patch. Presumably
>> something changed in either the kernel config or the memory management
>> code between the two kernel versions that made this crop up. In a
>> similar vein, it is possible that problems described in this patch are
>> no longer reproducible upstream. However, the arguments made in this
>> patch (that we don't want to block while holding the mutex) still
>> apply so I think the patch may still have merit.
>>
>> drivers/md/dm-bufio.c | 6 ++++--
>> 1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
>> index b3ba142e59a4..3c767399cc59 100644
>> --- a/drivers/md/dm-bufio.c
>> +++ b/drivers/md/dm-bufio.c
>> @@ -827,7 +827,8 @@ static struct dm_buffer *__alloc_buffer_wait_no_callback(struct dm_bufio_client
>> * dm-bufio is resistant to allocation failures (it just keeps
>> * one buffer reserved in cases all the allocations fail).
>> * So set flags to not try too hard:
>> - * GFP_NOIO: don't recurse into the I/O layer
>> + * GFP_NOWAIT: don't wait; if we need to sleep we'll release our
>> + * mutex and wait ourselves.
>> * __GFP_NORETRY: don't retry and rather return failure
>> * __GFP_NOMEMALLOC: don't use emergency reserves
>> * __GFP_NOWARN: don't print a warning in case of failure
>> @@ -837,7 +838,8 @@ static struct dm_buffer *__alloc_buffer_wait_no_callback(struct dm_bufio_client
>> */
>> while (1) {
>> if (dm_bufio_cache_size_latch != 1) {
>> - b = alloc_buffer(c, GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
>> + b = alloc_buffer(c, GFP_NOWAIT | __GFP_NORETRY |
>> + __GFP_NOMEMALLOC | __GFP_NOWARN);
>> if (b)
>> return b;
>> }
>> --
>> 2.8.0.rc3.226.g39d4020
>>
>> --
>> dm-devel mailing list
>> dm-devel@redhat.com
>> https://www.redhat.com/mailman/listinfo/dm-devel
>>
>
> From: Mikulas Patocka <mpatocka@redhat.com>
>
> Subject: dm-bufio: drop the lock when doing GFP_NOIO allocation
>
> Drop the lock when doing GFP_NOIO allocation because the allocation can
> take some time.
>
> Note that we won't do GFP_NOIO allocation when we loop for the second
> time, because the lock shouldn't be dropped between __wait_for_free_buffer
> and __get_unclaimed_buffer.
>
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
>
> ---
> drivers/md/dm-bufio.c | 13 ++++++++++++-
> 1 file changed, 12 insertions(+), 1 deletion(-)
>
> Index: linux-2.6/drivers/md/dm-bufio.c
> ===================================================================
> --- linux-2.6.orig/drivers/md/dm-bufio.c
> +++ linux-2.6/drivers/md/dm-bufio.c
> @@ -822,11 +822,13 @@ enum new_flag {
> static struct dm_buffer *__alloc_buffer_wait_no_callback(struct dm_bufio_client *c, enum new_flag nf)
> {
> struct dm_buffer *b;
> + bool tried_noio_alloc = false;
>
> /*
> * dm-bufio is resistant to allocation failures (it just keeps
> * one buffer reserved in cases all the allocations fail).
> * So set flags to not try too hard:
> + * GFP_NOWAIT: don't sleep and don't release cache
> * GFP_NOIO: don't recurse into the I/O layer
> * __GFP_NORETRY: don't retry and rather return failure
> * __GFP_NOMEMALLOC: don't use emergency reserves
> @@ -837,7 +839,7 @@ static struct dm_buffer *__alloc_buffer_
> */
> while (1) {
> if (dm_bufio_cache_size_latch != 1) {
> - b = alloc_buffer(c, GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
> + b = alloc_buffer(c, GFP_NOWAIT | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
> if (b)
> return b;
> }
> @@ -845,6 +847,15 @@ static struct dm_buffer *__alloc_buffer_
> if (nf == NF_PREFETCH)
> return NULL;
>
> + if (dm_bufio_cache_size_latch != 1 && !tried_noio_alloc) {
> + dm_bufio_unlock(c);
> + b = alloc_buffer(c, GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
> + dm_bufio_lock(c);
> + if (b)
> + return b;
> + tried_noio_alloc = true;
> + }
> +
> if (!list_empty(&c->reserved_buffers)) {
> b = list_entry(c->reserved_buffers.next,
> struct dm_buffer, lru_list);
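The locking pattern in the quoted patch (try a non-blocking allocation with
the lock held, then drop the lock around one blocking attempt) can be
sketched in user space as follows. Everything here is a toy model: the
demo_client type, the boolean standing in for the mutex, and the function
names are assumptions for illustration, and the retry loop and
reserved-buffer fallback of the real function are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy client: 'locked' stands in for the dm_bufio mutex. */
struct demo_client {
	bool locked;
	bool memory_tight;	/* when true, the non-blocking attempt fails */
	int slow_allocs;	/* number of blocking attempts made */
	int buffer;		/* the single "buffer" we can hand out */
};

/* Non-blocking attempt (GFP_NOWAIT in the patch): fine with the lock held. */
static int *alloc_nowait(struct demo_client *c)
{
	return c->memory_tight ? NULL : &c->buffer;
}

/* Blocking attempt (GFP_NOIO in the patch): may sleep, so it runs unlocked. */
static int *alloc_may_sleep(struct demo_client *c)
{
	assert(!c->locked);
	c->slow_allocs++;
	return &c->buffer;
}

/* Called with the lock held; returns with the lock held. */
static int *alloc_buffer_sketch(struct demo_client *c)
{
	int *b;

	assert(c->locked);
	b = alloc_nowait(c);
	if (b)
		return b;

	c->locked = false;	/* drop the lock around the slow path */
	b = alloc_may_sleep(c);
	c->locked = true;	/* re-take it before touching shared state */
	return b;
}
```

The point of the ordering is that other tasks contending for the lock (for
example the shrinker) are only ever held up by the fast attempt.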
I agree: I believe it is safe to drop and re-grab the
dm_bufio lock in this function. I also believe it to be safe to call
alloc_buffer() without holding the dm_bufio lock.
That means that this looks fine to me. It also fixes the test case
that I have. ...so for what it's worth:
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Tested-by: Douglas Anderson <dianders@chromium.org>
-Doug
^ permalink raw reply
* [PATCH] md/raid5-cache: no recovery is required when create super-block
From: JackieLiu @ 2016-12-08 0:47 UTC (permalink / raw)
To: shli; +Cc: songliubraving, liuzhengyuan, linux-raid, JackieLiu
When creating the super-block information, we do not need to run the
recovery stage; we only need to initialize some variables.
Signed-off-by: JackieLiu <liuyun01@kylinos.cn>
---
drivers/md/raid5-cache.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index c3b3124..7c732c5 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -2492,7 +2492,7 @@ static int r5l_load_log(struct r5l_log *log)
sector_t cp = log->rdev->journal_tail;
u32 stored_crc, expected_crc;
bool create_super = false;
- int ret;
+ int ret = 0;
/* Make sure it's valid */
if (cp >= rdev->sectors || round_down(cp, BLOCK_SECTORS) != cp)
@@ -2545,7 +2545,13 @@ static int r5l_load_log(struct r5l_log *log)
__free_page(page);
- ret = r5l_recovery_log(log);
+ if (create_super) {
+ log->log_start = r5l_ring_add(log, cp, BLOCK_SECTORS);
+ log->seq = log->last_cp_seq + 1;
+ log->next_checkpoint = cp;
+ } else
+ ret = r5l_recovery_log(log);
+
r5c_update_log_state(log);
return ret;
ioerr:
--
2.10.2
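As a rough illustration of what the new create_super branch initializes,
the following user-space sketch mirrors the three assignments in the diff.
It assumes that r5l_ring_add() advances a position on a circular log and
wraps at the device size, and that BLOCK_SECTORS is 8; the demo_log type and
helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t sector_t;

#define BLOCK_SECTORS 8	/* assumed: one 4 KiB block in 512-byte sectors */

/* Illustrative subset of the log state touched by the create_super branch. */
struct demo_log {
	sector_t device_size;	/* usable journal size in sectors */
	sector_t log_start;
	sector_t next_checkpoint;
	uint64_t last_cp_seq;
	uint64_t seq;
};

/* Advance a position on the circular log, wrapping at device_size. */
static sector_t ring_add(const struct demo_log *log, sector_t start, sector_t inc)
{
	assert(start < log->device_size && inc <= log->device_size);
	start += inc;
	if (start >= log->device_size)
		start -= log->device_size;
	return start;
}

/* State set up when a fresh super-block has just been written at 'cp'. */
static void init_fresh_log(struct demo_log *log, sector_t cp)
{
	log->log_start = ring_add(log, cp, BLOCK_SECTORS);
	log->seq = log->last_cp_seq + 1;
	log->next_checkpoint = cp;
}
```

For a fresh log there is nothing to replay, so the state can be derived
directly from the checkpoint position without scanning the journal.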
^ permalink raw reply related
* Re: [PATCH v2 09/12] raid5-ppl: read PPL signature from IMSM metadata
From: NeilBrown @ 2016-12-07 23:27 UTC (permalink / raw)
To: Artur Paszkiewicz, shli; +Cc: linux-raid
In-Reply-To: <0e661d59-033e-6797-b5d1-d30cdb04ef11@intel.com>
[-- Attachment #1: Type: text/plain, Size: 1128 bytes --]
On Thu, Dec 08 2016, Artur Paszkiewicz wrote:
> On 12/07/2016 02:25 AM, NeilBrown wrote:
>> On Tue, Dec 06 2016, Artur Paszkiewicz wrote:
>>
>>> The PPL signature is used to determine if the stored PPL is valid for a
>>> given array. With IMSM, the PPL signature should match the
>>> orig_family_num field of the superblock. To avoid passing this value
>>> from userspace, it can be read from the IMSM MPB when initializing the
>>> log.
>>
>> It is up to mdadm to determine if the PPL is valid. It would only tell
>> the kernel that a PPL exists if it is valid...
>
> The kernel also has to know this value because it writes it to the PPL
> header. So yet another sysfs attribute just for this value? How about
> adding a directory similar to "bitmap" to hold all the PPL related
> settings?
There is only one PPL header (per device) - correct?
md can read that header, change the few fields that it knows about, and
write it back out again. Does it ever need to change the PPL signature?
When a PPL is created, mdadm can write the initial header.
Am I missing something?
Thanks,
NeilBrown
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]
^ permalink raw reply
* Re: [PATCH v2 03/12] raid5-cache: add a new policy
From: NeilBrown @ 2016-12-07 23:24 UTC (permalink / raw)
To: Artur Paszkiewicz, shli; +Cc: linux-raid
In-Reply-To: <ba67c43b-067e-c565-8c70-4bb5544392a3@intel.com>
[-- Attachment #1: Type: text/plain, Size: 3087 bytes --]
On Thu, Dec 08 2016, Artur Paszkiewicz wrote:
> On 12/07/2016 01:46 AM, NeilBrown wrote:
>> On Tue, Dec 06 2016, Artur Paszkiewicz wrote:
>>
>>> Add a source file for the new policy implementation and allow selecting
>>> the policy based on the policy_type parameter in r5l_init_log().
>>>
>>> Introduce a new flag for rdev state flags to allow enabling the new
>>> policy from userspace.
>>
>> This seems odd. Why is this a per-device flag?
>> It makes sense for "journal" to be a per-device flag, because only one
>> device is the journal device and it is obviously different from the
>> others.
>>
>> But with the ppl, all devices serve as journal devices. So we would
>> need to set journal_ppl on all devices? What happens if you set it on
>> some, but not others? I see you get an error.
>>
>> I think some sort of array-wide setting would make more sense, would it
>> not?
>
> Yes, it would. The problem exists only for external metadata, because
> for native there is a feature flag in the superblock and a corresponding
> flag in mddev->flags. Patch 12 adds a sysfs attribute to control the
> policy at runtime but it would have to be moved out of raid5 personality
> into the main md code. I didn't like the idea of adding something
> specific to raid5 to generic code and visible in sysfs for unrelated
> raid levels.
>
>> And what is an RWH??? A Really Weird Handle ??
>>
>> I guess it is probably a Raid5 Write Hole ? At the very least there
>> should be a comment explaining this when you define the enum. (remember,
>> you are trying to make it easier for reviewers).
>
> That's right, RWH stands for RAID Write Hole. I think I introduced it in
> the cover letter, but I'll explain it also in the code.
>
>> It might almost make sense for bitmap/metadata to be used here.
>> It can currently be "external" "internal" "clustered".
>> Allow also "journalled" or "partial-parity-log" ???
>>
>> Maybe not ... but I'd definitely prefer a global setting, and one that
>> didn't use an obscure abbreviation.
>
> So do you think the sysfs attribute from patch 12 ("rwh_policy") could
> be made global? This would simplify things but it doesn't seem right.
> And about the abbreviation, should it be called "write_hole_policy" or
> "raid5_write_hole_policy"? Maybe using bitmap/metadata is not a bad
> idea...
How about we call it "resync_policy", which describes how to cope with
an unexpected shutdown? Options include:
  full     regenerate all redundancy info after a crash
  bitmap   only regenerate redundancy info indicated by the bitmap
           (both of these are susceptible to the write hole on raid456)
  journal  raid456 only, though could theoretically be extended
           to raid1, raid10: log transactions and replay after a crash
  ppl      raid456 only: log partial parity before writes.
With external metadata, this must be set explicitly. With internal
metadata, it is set based on flags etc.
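One way to picture the proposal is as an array-wide enum with a small
validity check. This is only a sketch of the idea discussed here, not an
existing md interface: the enum, the function names, and the string parsing
are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical names mirroring the four options proposed above. */
enum resync_policy {
	POLICY_FULL,
	POLICY_BITMAP,
	POLICY_JOURNAL,
	POLICY_PPL,
	POLICY_INVALID
};

static enum resync_policy parse_policy(const char *s)
{
	static const char *const names[] = { "full", "bitmap", "journal", "ppl" };

	assert(s != NULL);
	for (int i = 0; i < 4; i++)
		if (strcmp(s, names[i]) == 0)
			return (enum resync_policy)i;
	return POLICY_INVALID;
}

/* journal and ppl are described as raid456-only; the others are not restricted. */
static bool policy_allowed(enum resync_policy p, bool is_raid456)
{
	switch (p) {
	case POLICY_FULL:
	case POLICY_BITMAP:
		return true;
	case POLICY_JOURNAL:
	case POLICY_PPL:
		return is_raid456;
	default:
		return false;
	}
}

/* Per the list above, full and bitmap leave the write hole open on raid456. */
static bool exposed_to_write_hole(enum resync_policy p, bool is_raid456)
{
	return is_raid456 && (p == POLICY_FULL || p == POLICY_BITMAP);
}
```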
Thoughts? I'm not really happy with "full", but I cannot think of a
better name.
Thanks,
NeilBrown
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]
^ permalink raw reply
* [PATCH 2/2] md/r5cache: read data into orig_page for prexor of cached data
From: Song Liu @ 2016-12-07 19:36 UTC (permalink / raw)
To: linux-raid
Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuan,
liuyun01, Song Liu
In-Reply-To: <20161207193637.3905500-1-songliubraving@fb.com>
With write back cache, we use orig_page to do prexor. This patch
makes sure we read data into orig_page for it.
Flag R5_OrigPageUPTDODATE is added to show when orig_page
has the latest data from raid disk.
Signed-off-by: Song Liu <songliubraving@fb.com>
---
drivers/md/raid5-cache.c | 2 ++
drivers/md/raid5.c | 37 ++++++++++++++++++++++++++++---------
drivers/md/raid5.h | 5 +++++
3 files changed, 35 insertions(+), 9 deletions(-)
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 830bb7f..03cef8b 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -2373,6 +2373,8 @@ void r5c_release_extra_page(struct stripe_head *sh)
struct page *p = sh->dev[i].orig_page;
sh->dev[i].orig_page = sh->dev[i].page;
+ clear_bit(R5_OrigPageUPTDODATE, &sh->dev[i].flags);
+
if (!using_disk_info_extra_page)
put_page(p);
}
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 524b041..eb7425a 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1019,7 +1019,13 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
if (test_bit(R5_SkipCopy, &sh->dev[i].flags))
WARN_ON(test_bit(R5_UPTODATE, &sh->dev[i].flags));
- sh->dev[i].vec.bv_page = sh->dev[i].page;
+
+ if (!op_is_write(op) &&
+ test_bit(R5_InJournal, &sh->dev[i].flags) &&
+ sh->dev[i].page != sh->dev[i].orig_page)
+ sh->dev[i].vec.bv_page = sh->dev[i].orig_page;
+ else
+ sh->dev[i].vec.bv_page = sh->dev[i].page;
bi->bi_vcnt = 1;
bi->bi_io_vec[0].bv_len = STRIPE_SIZE;
bi->bi_io_vec[0].bv_offset = 0;
@@ -2389,6 +2395,10 @@ static void raid5_end_read_request(struct bio * bi)
} else if (test_bit(R5_ReadNoMerge, &sh->dev[i].flags))
clear_bit(R5_ReadNoMerge, &sh->dev[i].flags);
+ if (test_bit(R5_InJournal, &sh->dev[i].flags) &&
+ sh->dev[i].page != sh->dev[i].orig_page)
+ set_bit(R5_OrigPageUPTDODATE, &sh->dev[i].flags);
+
if (atomic_read(&rdev->read_errors))
atomic_set(&rdev->read_errors, 0);
} else {
@@ -3603,6 +3613,21 @@ static void handle_stripe_clean_event(struct r5conf *conf,
break_stripe_batch_list(head_sh, STRIPE_EXPAND_SYNC_FLAGS);
}
+/*
+ * For RMW in write back cache, we need extra page in prexor to store the
+ * old data. This page is stored in dev->orig_page.
+ *
+ * This function checks whether we have data for prexor. The exact logic
+ * is:
+ * R5_UPTODATE && (!R5_InJournal || R5_OrigPageUPTDODATE)
+ */
+static inline bool uptodate_for_rmw(struct r5dev *dev)
+{
+ return (test_bit(R5_UPTODATE, &dev->flags)) &&
+ (!test_bit(R5_InJournal, &dev->flags) ||
+ test_bit(R5_OrigPageUPTDODATE, &dev->flags));
+}
+
static int handle_stripe_dirtying(struct r5conf *conf,
struct stripe_head *sh,
struct stripe_head_state *s,
@@ -3634,9 +3659,7 @@ static int handle_stripe_dirtying(struct r5conf *conf,
if ((dev->towrite || i == sh->pd_idx || i == sh->qd_idx ||
test_bit(R5_InJournal, &dev->flags)) &&
!test_bit(R5_LOCKED, &dev->flags) &&
- !((test_bit(R5_UPTODATE, &dev->flags) &&
- (!test_bit(R5_InJournal, &dev->flags) ||
- dev->page != dev->orig_page)) ||
+ !(uptodate_for_rmw(dev) ||
test_bit(R5_Wantcompute, &dev->flags))) {
if (test_bit(R5_Insync, &dev->flags))
rmw++;
@@ -3648,7 +3671,6 @@ static int handle_stripe_dirtying(struct r5conf *conf,
i != sh->pd_idx && i != sh->qd_idx &&
!test_bit(R5_LOCKED, &dev->flags) &&
!(test_bit(R5_UPTODATE, &dev->flags) ||
- test_bit(R5_InJournal, &dev->flags) ||
test_bit(R5_Wantcompute, &dev->flags))) {
if (test_bit(R5_Insync, &dev->flags))
rcw++;
@@ -3702,9 +3724,7 @@ static int handle_stripe_dirtying(struct r5conf *conf,
i == sh->pd_idx || i == sh->qd_idx ||
test_bit(R5_InJournal, &dev->flags)) &&
!test_bit(R5_LOCKED, &dev->flags) &&
- !((test_bit(R5_UPTODATE, &dev->flags) &&
- (!test_bit(R5_InJournal, &dev->flags) ||
- dev->page != dev->orig_page)) ||
+ !(uptodate_for_rmw(dev) ||
test_bit(R5_Wantcompute, &dev->flags)) &&
test_bit(R5_Insync, &dev->flags)) {
if (test_bit(STRIPE_PREREAD_ACTIVE,
@@ -3731,7 +3751,6 @@ static int handle_stripe_dirtying(struct r5conf *conf,
i != sh->pd_idx && i != sh->qd_idx &&
!test_bit(R5_LOCKED, &dev->flags) &&
!(test_bit(R5_UPTODATE, &dev->flags) ||
- test_bit(R5_InJournal, &dev->flags) ||
test_bit(R5_Wantcompute, &dev->flags))) {
rcw++;
if (test_bit(R5_Insync, &dev->flags) &&
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index b39fe46..6cc8d4c 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -322,6 +322,11 @@ enum r5dev_flags {
* data and parity being written are in the journal
* device
*/
+ R5_OrigPageUPTDODATE, /* with write back cache, we read old data into
+ * dev->orig_page for prexor. When this flag is
+ * set, orig_page contains latest data in the
+ * raid disk.
+ */
};
/*
--
2.9.3
^ permalink raw reply related
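A note on the uptodate_for_rmw() condition in the raid5.c hunk above: the
predicate can be checked in user space with a small sketch. Everything named
demo_* or DEMO_* below is a hypothetical stand-in for this note, not kernel
code; the bit values are arbitrary and only the boolean shape of the test is
taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the r5dev flag bits used by the patch. */
enum demo_flag {
	DEMO_UPTODATE           = 1u << 0, /* models R5_UPTODATE */
	DEMO_IN_JOURNAL         = 1u << 1, /* models R5_InJournal */
	DEMO_ORIG_PAGE_UPTODATE = 1u << 2  /* models R5_OrigPageUPTDODATE */
};

/*
 * Same boolean shape as uptodate_for_rmw():
 *   UPTODATE && (!InJournal || OrigPageUPTDODATE)
 */
static bool demo_uptodate_for_rmw(unsigned int flags)
{
	return (flags & DEMO_UPTODATE) &&
	       (!(flags & DEMO_IN_JOURNAL) ||
		(flags & DEMO_ORIG_PAGE_UPTODATE));
}

/* Walk the interesting rows of the truth table. */
static void demo_self_check(void)
{
	/* Up to date and not in the journal: usable for prexor. */
	assert(demo_uptodate_for_rmw(DEMO_UPTODATE));
	/* In the journal, old data not yet read into orig_page. */
	assert(!demo_uptodate_for_rmw(DEMO_UPTODATE | DEMO_IN_JOURNAL));
	/* In the journal, old data already in orig_page. */
	assert(demo_uptodate_for_rmw(DEMO_UPTODATE | DEMO_IN_JOURNAL |
				     DEMO_ORIG_PAGE_UPTODATE));
	/* Never usable without the up-to-date bit. */
	assert(!demo_uptodate_for_rmw(DEMO_IN_JOURNAL |
				      DEMO_ORIG_PAGE_UPTODATE));
}
```

The second row is the case the patch is concerned with: a device whose data
sits in the journal still counts as needing a read for RMW until the old
on-disk data has been read into dev->orig_page.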
* [PATCH 1/2] md/r5cache: flush data only stripes in r5l_recovery_log()
From: Song Liu @ 2016-12-07 19:36 UTC (permalink / raw)
To: linux-raid
Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuan,
liuyun01, Song Liu
When there are data-only stripes in the journal, we flush them out in
r5l_recovery_log(). This logic is implemented in a new function:
r5c_recovery_flush_data_only_stripes().
We need conf->log in r5l_load_log(), so we set it before calling
r5l_load_log(). If r5l_load_log() fails, we set conf->log back to NULL.
Signed-off-by: Song Liu <songliubraving@fb.com>
---
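Illustration only, not part of the commit: the ordering change in
r5l_init_log() described above (publish conf->log before loading, clear it
again on failure) can be sketched in user space as follows. All demo_* names
are hypothetical, and the plain pointer assignments stand in for the
rcu_assign_pointer() calls the patch actually uses.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, stripped-down models of r5l_log and r5conf. */
struct demo_log {
	int data_only_stripes;
};

struct demo_conf {
	struct demo_log *log;
};

/*
 * Models r5l_load_log(): the recovery path looks the log up through
 * conf, so it cannot work unless conf->log is already set.
 * 'fail' simulates a load error.
 */
static int demo_load_log(struct demo_conf *conf, struct demo_log *log,
			 int fail)
{
	if (conf->log != log)
		return -1;	/* log not visible to the flush path */
	return fail ? -1 : 0;
}

/* Models the new ordering in r5l_init_log(). */
static int demo_init_log(struct demo_conf *conf, struct demo_log *log,
			 int fail)
{
	conf->log = log;		/* publish before loading */
	if (demo_load_log(conf, log, fail)) {
		conf->log = NULL;	/* roll back on failure */
		return -1;
	}
	return 0;
}

/* Success keeps the pointer; failure leaves conf->log cleared. */
static void demo_self_check(void)
{
	struct demo_log log = { 1 };
	struct demo_conf conf = { NULL };

	assert(demo_init_log(&conf, &log, 0) == 0);
	assert(conf.log == &log);
	assert(demo_init_log(&conf, &log, 1) == -1);
	assert(conf.log == NULL);
}
```

With the old ordering (assign after a successful load), demo_load_log()
would see conf->log == NULL and the flush step could not find the log.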
drivers/md/raid5-cache.c | 60 +++++++++++++++++++++++++++++++++++-------------
drivers/md/raid5.c | 9 +++++++-
drivers/md/raid5.h | 4 ++++
3 files changed, 56 insertions(+), 17 deletions(-)
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index ae2684a..830bb7f 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -2061,7 +2061,7 @@ static int
r5c_recovery_rewrite_data_only_stripes(struct r5l_log *log,
struct r5l_recovery_ctx *ctx)
{
- struct stripe_head *sh, *next;
+ struct stripe_head *sh;
struct mddev *mddev = log->rdev->mddev;
struct page *page;
@@ -2072,7 +2072,7 @@ r5c_recovery_rewrite_data_only_stripes(struct r5l_log *log,
return -ENOMEM;
}
- list_for_each_entry_safe(sh, next, &ctx->cached_list, lru) {
+ list_for_each_entry(sh, &ctx->cached_list, lru) {
struct r5l_meta_block *mb;
int i;
int offset;
@@ -2123,13 +2123,39 @@ r5c_recovery_rewrite_data_only_stripes(struct r5l_log *log,
ctx->pos = write_pos;
ctx->seq += 1;
log->next_checkpoint = sh->log_start;
- list_del_init(&sh->lru);
- raid5_release_stripe(sh);
}
__free_page(page);
return 0;
}
+static void r5c_recovery_flush_data_only_stripes(struct r5l_log *log,
+ struct r5l_recovery_ctx *ctx)
+{
+ struct mddev *mddev = log->rdev->mddev;
+ struct r5conf *conf = mddev->private;
+ struct stripe_head *sh, *next;
+
+ if (ctx->data_only_stripes == 0)
+ return;
+
+ log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_BACK;
+ set_bit(R5C_PRE_INIT_FLUSH, &conf->cache_state);
+
+ list_for_each_entry_safe(sh, next, &ctx->cached_list, lru) {
+ list_del_init(&sh->lru);
+ raid5_release_stripe(sh);
+ }
+
+ md_wakeup_thread(conf->mddev->thread);
+ wait_event(conf->wait_for_r5c_pre_init_flush,
+ atomic_read(&conf->active_stripes) == 0 &&
+ atomic_read(&conf->r5c_cached_full_stripes) == 0 &&
+ atomic_read(&conf->r5c_cached_partial_stripes) == 0);
+
+ clear_bit(R5C_PRE_INIT_FLUSH, &conf->cache_state);
+ log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_THROUGH;
+}
+
static int r5l_recovery_log(struct r5l_log *log)
{
struct mddev *mddev = log->rdev->mddev;
@@ -2156,11 +2182,6 @@ static int r5l_recovery_log(struct r5l_log *log)
pos = ctx.pos;
ctx.seq += 1000;
- if (ctx.data_only_stripes == 0) {
- log->next_checkpoint = ctx.pos;
- r5l_log_write_empty_meta_block(log, ctx.pos, ctx.seq++);
- ctx.pos = r5l_ring_add(log, ctx.pos, BLOCK_SECTORS);
- }
if ((ctx.data_only_stripes == 0) && (ctx.data_parity_stripes == 0))
pr_debug("md/raid:%s: starting from clean shutdown\n",
@@ -2169,19 +2190,24 @@ static int r5l_recovery_log(struct r5l_log *log)
pr_debug("md/raid:%s: recoverying %d data-only stripes and %d data-parity stripes\n",
mdname(mddev), ctx.data_only_stripes,
ctx.data_parity_stripes);
+ }
- if (ctx.data_only_stripes > 0)
- if (r5c_recovery_rewrite_data_only_stripes(log, &ctx)) {
- pr_err("md/raid:%s: failed to rewrite stripes to journal\n",
- mdname(mddev));
- return -EIO;
- }
+ if (ctx.data_only_stripes == 0) {
+ log->next_checkpoint = ctx.pos;
+ r5l_log_write_empty_meta_block(log, ctx.pos, ctx.seq++);
+ ctx.pos = r5l_ring_add(log, ctx.pos, BLOCK_SECTORS);
+ } else if (r5c_recovery_rewrite_data_only_stripes(log, &ctx)) {
+ pr_err("md/raid:%s: failed to rewrite stripes to journal\n",
+ mdname(mddev));
+ return -EIO;
}
log->log_start = ctx.pos;
log->seq = ctx.seq;
log->last_checkpoint = pos;
r5l_write_super(log, pos);
+
+ r5c_recovery_flush_data_only_stripes(log, &ctx);
return 0;
}
@@ -2626,14 +2652,16 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
spin_lock_init(&log->stripe_in_journal_lock);
atomic_set(&log->stripe_in_journal_count, 0);
+ rcu_assign_pointer(conf->log, log);
+
if (r5l_load_log(log))
goto error;
- rcu_assign_pointer(conf->log, log);
set_bit(MD_HAS_JOURNAL, &conf->mddev->flags);
return 0;
error:
+ rcu_assign_pointer(conf->log, NULL);
md_unregister_thread(&log->reclaim_thread);
reclaim_thread:
mempool_destroy(log->meta_pool);
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 6bf3c26..524b041 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -232,7 +232,9 @@ static void do_release_stripe(struct r5conf *conf, struct stripe_head *sh,
* When quiesce in r5c write back, set STRIPE_HANDLE for stripes with
* data in journal, so they are not released to cached lists
*/
- if (conf->quiesce && r5c_is_writeback(conf->log) &&
+ if ((conf->quiesce ||
+ test_bit(R5C_PRE_INIT_FLUSH, &conf->cache_state)) &&
+ r5c_is_writeback(conf->log) &&
!test_bit(STRIPE_HANDLE, &sh->state) && injournal != 0) {
if (test_bit(STRIPE_R5C_CACHING, &sh->state))
r5c_make_stripe_write_out(sh);
@@ -264,6 +266,10 @@ static void do_release_stripe(struct r5conf *conf, struct stripe_head *sh,
< IO_THRESHOLD)
md_wakeup_thread(conf->mddev->thread);
atomic_dec(&conf->active_stripes);
+ if (test_bit(R5C_PRE_INIT_FLUSH, &conf->cache_state) &&
+ atomic_read(&conf->active_stripes) == 0)
+ wake_up(&sh->raid_conf->wait_for_r5c_pre_init_flush);
+
if (!test_bit(STRIPE_EXPANDING, &sh->state)) {
if (!r5c_is_writeback(conf->log))
list_add_tail(&sh->lru, temp_inactive_list);
@@ -6638,6 +6644,7 @@ static struct r5conf *setup_conf(struct mddev *mddev)
init_waitqueue_head(&conf->wait_for_quiescent);
init_waitqueue_head(&conf->wait_for_stripe);
init_waitqueue_head(&conf->wait_for_overlap);
+ init_waitqueue_head(&conf->wait_for_r5c_pre_init_flush);
INIT_LIST_HEAD(&conf->handle_list);
INIT_LIST_HEAD(&conf->hold_list);
INIT_LIST_HEAD(&conf->delayed_list);
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index ed8e136..b39fe46 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -564,6 +564,9 @@ enum r5_cache_state {
R5C_EXTRA_PAGE_IN_USE, /* a stripe is using disk_info.extra_page
* for prexor
*/
+ R5C_PRE_INIT_FLUSH, /* flushing data only stripes recovered from
+ * the journal
+ */
};
struct r5conf {
@@ -679,6 +682,7 @@ struct r5conf {
int group_cnt;
int worker_cnt_per_group;
struct r5l_log *log;
+ wait_queue_head_t wait_for_r5c_pre_init_flush;
};
--
2.9.3
^ permalink raw reply related