Linux RAID subsystem development
 help / color / mirror / Atom feed
* RAID 5 3-drive array inactive - no superblock - can anything be saved?
From: Robert Schultz @ 2017-03-10 15:18 UTC (permalink / raw)
  To: linux-raid

Before I mess things up even worse.

I have a PC running BackupPC.

The system contains 4 disks:
boot & system: 1x WD 20GB IDE (9.3 years powered on)
backup data: RAID 5 array md0 containing 3 x Seagate 2TB SATA drives
     ST2000DM001-1ER164   /dev/sdb
     ST2000DM001-1CH164   /dev/sdc
     ST2000DM001-1CH164   /dev/sdd

Back around Christmas I was notified of disk errors on the oldest 2TB 
disk. I purchased a new one, failed out the old one, replaced it and 
everything seemed fine.

At some point I noticed that the system had stopped and I was getting 
XFS errors metadata I/O error: block 0x12910b88 
("xfs_trans_read_buf_map") error 117 ....
<snip>
Corruption of in-memory data detected. Shutting down filesystem.

I did a bit of searching and it appears this may have been a cable 
issue. I shutdown, reseated the SATA cables and restarted. I unmounted 
the filesystem, ran xfs_repair and everything seemed fine.

Then it repeated the issue a week or two later. Unfortunately at this 
time it failed one of the disks.

I shutdown, replaced the SATA cables and restarted. I had some issues on 
restart where it would go into recovery mode. I solved this by removing 
the md0 array from fstab and restarted.

At this point I am unable to restart, or re-assemble the array as mdadm 
can't find the superblocks.

 >cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] 
[raid4] [raid10]
md0 : inactive sdd1[4](S) sdc1[3](S) sdb1[5](S)
       5860147464 blocks super 1.2

unused devices: <none>

 >sudo mdadm --detail /dev/md0
/dev/md0:
         Version : 1.2
      Raid Level : raid0
   Total Devices : 3
     Persistence : Superblock is persistent

           State : inactive

            Name : bkp1:0  (local to host bkp1)
            UUID : 77965a25:38a24b98:9ab5899c:7795ded7
          Events : 396761

     Number   Major   Minor   RaidDevice

        -       8       17        -        /dev/sdb1
        -       8       33        -        /dev/sdc1
        -       8       49        -        /dev/sdd1

Trying to reassemble the array:

 >sudo mdadm --assemble --force /dev/md0 /dev/sdb /dev/sdc /dev/sdd
mdadm: Cannot assemble mbr metadata on /dev/sdb
mdadm: /dev/sdb has no superblock - assembly aborted

Running the command with any of the disks says the same thing - no 
superblock.





Here is the mdadm examine for the three disks:
sudo mdadm --examine /dev/sdb1
/dev/sdb1:
           Magic : a92b4efc
         Version : 1.2
     Feature Map : 0x0
      Array UUID : 77965a25:38a24b98:9ab5899c:7795ded7
            Name : bkp1:0  (local to host bkp1)
   Creation Time : Fri May 31 11:06:39 2013
      Raid Level : raid5
    Raid Devices : 3

  Avail Dev Size : 3906764976 (1862.89 GiB 2000.26 GB)
      Array Size : 3906763776 (3725.78 GiB 4000.53 GB)
   Used Dev Size : 3906763776 (1862.89 GiB 2000.26 GB)
     Data Offset : 262144 sectors
    Super Offset : 8 sectors
    Unused Space : before=262056 sectors, after=1200 sectors
           State : clean
     Device UUID : f8a5b84c:6c63d9bf:1b930a93:55f12cb5

     Update Time : Tue Mar  7 08:01:31 2017
   Bad Block Log : 512 entries available at offset 72 sectors
        Checksum : dd687715 - correct
          Events : 396761

          Layout : left-symmetric
      Chunk Size : 512K

    Device Role : Active device 0
    Array State : A.. ('A' == active, '.' == missing, 'R' == replacing)

---------------------------------------------------------------------
sudo mdadm --examine /dev/sdc1
/dev/sdc1:
           Magic : a92b4efc
         Version : 1.2
     Feature Map : 0x0
      Array UUID : 77965a25:38a24b98:9ab5899c:7795ded7
            Name : bkp1:0  (local to host bkp1)
   Creation Time : Fri May 31 11:06:39 2013
      Raid Level : raid5
    Raid Devices : 3

  Avail Dev Size : 3906764976 (1862.89 GiB 2000.26 GB)
      Array Size : 3906763776 (3725.78 GiB 4000.53 GB)
   Used Dev Size : 3906763776 (1862.89 GiB 2000.26 GB)
     Data Offset : 262144 sectors
    Super Offset : 8 sectors
    Unused Space : before=262064 sectors, after=1200 sectors
           State : clean
     Device UUID : 2d4ade03:d6b7e7ce:3744b40b:21a3d17e

     Update Time : Tue Feb 21 23:26:13 2017
        Checksum : e5d43607 - correct
          Events : 396755

          Layout : left-symmetric
      Chunk Size : 512K

    Device Role : Active device 2
    Array State : AAA ('A' == active, '.' == missing, 'R' == replacing)

----------------------------------------------------------------------
  sudo mdadm --examine /dev/sdd1
/dev/sdd1:
           Magic : a92b4efc
         Version : 1.2
     Feature Map : 0x0
      Array UUID : 77965a25:38a24b98:9ab5899c:7795ded7
            Name : bkp1:0  (local to host bkp1)
   Creation Time : Fri May 31 11:06:39 2013
      Raid Level : raid5
    Raid Devices : 3

  Avail Dev Size : 3906764976 (1862.89 GiB 2000.26 GB)
      Array Size : 3906763776 (3725.78 GiB 4000.53 GB)
   Used Dev Size : 3906763776 (1862.89 GiB 2000.26 GB)
     Data Offset : 262144 sectors
    Super Offset : 8 sectors
    Unused Space : before=262064 sectors, after=1200 sectors
           State : clean
     Device UUID : 97c3bfe9:8ed96f77:ef13ee6b:007874b3

     Update Time : Tue Feb 21 23:26:13 2017
        Checksum : 91a793b - correct
          Events : 396755

          Layout : left-symmetric
      Chunk Size : 512K

    Device Role : Active device 1
    Array State : AAA ('A' == active, '.' == missing, 'R' == replacing)
------------------------------------------------------------------------
Is the fact that the Array State on sdb only shows 'A..' significant?


fdisk -l shows:

sudo fdisk -l /dev/sd[b-d]
Disk /dev/sdb: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: dos
Disk identifier: 0x5ebd3967

Device     Boot Start        End    Sectors  Size Id Type
/dev/sdb1        2048 3907029167 3907027120  1.8T fd Linux raid autodetect
Disk /dev/sdc: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: dos
Disk identifier: 0x5ebd3967

Device     Boot Start        End    Sectors  Size Id Type
/dev/sdc1        2048 3907029167 3907027120  1.8T fd Linux raid autodetect
Disk /dev/sdd: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: dos
Disk identifier: 0x00000000

Device     Boot Start        End    Sectors  Size Id Type
/dev/sdd1        2048 3907029167 3907027120  1.8T fd Linux raid autodetect

Any change of putting this thing back together?











^ permalink raw reply

* Re: [PATCH v5 5/7] raid5-ppl: load and recover the log
From: Artur Paszkiewicz @ 2017-03-10 15:23 UTC (permalink / raw)
  To: Shaohua Li; +Cc: shli, linux-raid
In-Reply-To: <20170309233058.5rdwdehn2iwgm5r6@kernel.org>

On 03/10/2017 12:30 AM, Shaohua Li wrote:
> On Thu, Mar 09, 2017 at 10:00:01AM +0100, Artur Paszkiewicz wrote:
>> Load the log from each disk when starting the array and recover if the
>> array is dirty.
>>
>> The initial empty PPL is written by mdadm. When loading the log we
>> verify the header checksum and signature. For external metadata arrays
>> the signature is verified in userspace, so here we read it from the
>> header, verifying only if it matches on all disks, and use it later when
>> writing PPL.
>>
>> In addition to the header checksum, each header entry also contains a
>> checksum of its partial parity data. If the header is valid, recovery is
>> performed for each entry until an invalid entry is found. If the array
>> is not degraded and recovery using PPL fully succeeds, there is no need
>> to resync the array because data and parity will be consistent, so in
>> this case resync will be disabled.
>>
>> Due to compatibility with IMSM implementations on other systems, we
>> can't assume that the recovery data block size is always 4K. Writes
>> generated by MD raid5 don't have this issue, but when recovering PPL
>> written in other environments it is possible to have entries with
>> 512-byte sector granularity. The recovery code takes this into account
>> and also the logical sector size of the underlying drives.
>>
>> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
>> ---
>>  drivers/md/raid5-ppl.c | 497 +++++++++++++++++++++++++++++++++++++++++++++++++
>>  drivers/md/raid5.c     |   5 +-
>>  2 files changed, 501 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
>> index 92783586743d..548d1028a3ce 100644
>> --- a/drivers/md/raid5-ppl.c
>> +++ b/drivers/md/raid5-ppl.c
>> @@ -103,6 +103,10 @@ struct ppl_conf {
>>  	mempool_t *io_pool;
>>  	struct bio_set *bs;
>>  	mempool_t *meta_pool;
>> +
>> +	/* used only for recovery */
>> +	int recovered_entries;
>> +	int mismatch_count;
>>  };
>>  
>>  struct ppl_log {
>> @@ -514,6 +518,482 @@ void ppl_stripe_write_finished(struct stripe_head *sh)
>>  		ppl_io_unit_finished(io);
>>  }
>>  
>> +static void ppl_xor(int size, struct page *page1, struct page *page2,
>> +		    struct page *page_result)
>> +{
> 
> I'd remove the page_result parameter, it should always be page1. And this will
> make it clear why we need ASYNC_TX_XOR_DROP_DST.
> 
>> +	struct async_submit_ctl submit;
>> +	struct dma_async_tx_descriptor *tx;
>> +	struct page *xor_srcs[] = { page1, page2 };
>> +
>> +	init_async_submit(&submit, ASYNC_TX_ACK|ASYNC_TX_XOR_DROP_DST,
>> +			  NULL, NULL, NULL, NULL);
>> +	tx = async_xor(page_result, xor_srcs, 0, 2, size, &submit);
>> +
>> +	async_tx_quiesce(&tx);
>> +}
> 
> ...
>> +			ret = ppl_recover_entry(log, e, ppl_sector);
>> +			if (ret)
>> +				goto out;
>> +			ppl_conf->recovered_entries++;
>> +		}
>> +
>> +		ppl_sector += ppl_entry_sectors;
>> +	}
>> +
>> +	/* flush the disk cache after recovery if necessary */
>> +	if (test_bit(QUEUE_FLAG_WC, &bdev_get_queue(rdev->bdev)->queue_flags)) {
> 
> The block layer will handle this, so you don't need to check
> 
>> +		struct bio *bio = bio_alloc_bioset(GFP_KERNEL, 0, ppl_conf->bs);
>> +
>> +		bio->bi_bdev = rdev->bdev;
>> +		bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;
>> +		ret = submit_bio_wait(bio);
>> +		bio_put(bio);
> 
> please use blkdev_issue_flush() instead.
> 
>> +	}
>> +out:
>> +	__free_page(page);
>> +	return ret;
>> +}
>> +
> ...
>> +static int ppl_load(struct ppl_conf *ppl_conf)
>> +{
>> +	int ret = 0;
>> +	u32 signature = 0;
>> +	bool signature_set = false;
>> +	int i;
>> +
>> +	for (i = 0; i < ppl_conf->count; i++) {
>> +		struct ppl_log *log = &ppl_conf->child_logs[i];
>> +
>> +		/* skip missing drive */
>> +		if (!log->rdev)
>> +			continue;
>> +
>> +		ret = ppl_load_distributed(log);
>> +		if (ret)
>> +			break;
> 
> Not sure about the strategy here. But if one disk fails, why don't we continue
> do the recovery from other disks? This way we can at least recovery more data.

I thought it would be safer to abort early. Then we can for example
remove the failed drive try again or disable ppl. And if the array is
already degraded and another disk fails, the recovery won't be
meaningful anyway.

Thanks,
Artur

^ permalink raw reply

* Re: [PATCH v2] blk: improve order of bio handling in generic_make_request()
From: Mike Snitzer @ 2017-03-10 15:35 UTC (permalink / raw)
  To: Jack Wang
  Cc: Mikulas Patocka, Lars Ellenberg, NeilBrown, Jens Axboe, LKML,
	Kent Overstreet, Pavel Machek, linux-raid,
	device-mapper development, linux-block
In-Reply-To: <153a6cff-c553-0d18-e15b-4f3defc3a42b@profitbricks.com>

On Fri, Mar 10 2017 at 10:07am -0500,
Jack Wang <jinpu.wang@profitbricks.com> wrote:

> 
> 
> On 10.03.2017 15:55, Mikulas Patocka wrote:
> > 
> > 
> > On Fri, 10 Mar 2017, Mike Snitzer wrote:
> > 
> >> On Fri, Mar 10 2017 at  7:34am -0500,
> >> Lars Ellenberg <lars.ellenberg@linbit.com> wrote:
> >>
> >>>> --- a/block/blk-core.c
> >>>> +++ b/block/blk-core.c
> >>>> @@ -1975,7 +1975,14 @@ generic_make_request_checks(struct bio *bio)
> >>>>   */
> >>>>  blk_qc_t generic_make_request(struct bio *bio)
> >>>>  {
> >>>> -       struct bio_list bio_list_on_stack;
> >>>> +       /*
> >>>> +        * bio_list_on_stack[0] contains bios submitted by the current
> >>>> +        * make_request_fn.
> >>>> +        * bio_list_on_stack[1] contains bios that were submitted before
> >>>> +        * the current make_request_fn, but that haven't been processed
> >>>> +        * yet.
> >>>> +        */
> >>>> +       struct bio_list bio_list_on_stack[2];
> >>>>         blk_qc_t ret = BLK_QC_T_NONE;
> >>>
> >>> May I suggest that, if you intend to assign something that is not a
> >>> plain &(struct bio_list), but a &(struct bio_list[2]),
> >>> you change the task member so it is renamed (current->bio_list vs
> >>> current->bio_lists, plural, is what I did last year).
> >>> Or you will break external modules, silently, and horribly (or,
> >>> rather, they won't notice, but break the kernel).
> >>> Examples of such modules would be DRBD, ZFS, quite possibly others.
> >>
> >> drbd is upstream -- so what is the problem?  (if you are having to
> >> distribute drbd independent of the upstream drbd then why is drbd
> >> upstream?)
> >>
> >> As for ZFS, worrying about ZFS kABI breakage is the last thing we should
> >> be doing.
> > 
> > It's better to make external modules not compile than to silently 
> > introduce bugs in them. So yes, I would rename that.
> > 
> > Mikulas
> 
> Agree, better rename current->bio_list to current->bio_lists

Fine, normally wouldn't do so but I'm not so opposed that we need to get
hung up on this detail.  If Neil and Jens agree then so be it.

^ permalink raw reply

* [PATCH] raid5-ppl: two minor improvements
From: Artur Paszkiewicz @ 2017-03-10 15:40 UTC (permalink / raw)
  To: shli; +Cc: linux-raid, Artur Paszkiewicz
In-Reply-To: <20170309233240.r64itd7oreuxkumz@kernel.org>

- Remove 'page_result' parameter from ppl_xor() and always write result
  to 'page1'.
- Use blkdev_issue_flush() instead of open coding the flush bio
  submission.

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 drivers/md/raid5-ppl.c | 18 +++++-------------
 1 file changed, 5 insertions(+), 13 deletions(-)

diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
index ca1ef644351a..2419fbc6f684 100644
--- a/drivers/md/raid5-ppl.c
+++ b/drivers/md/raid5-ppl.c
@@ -524,8 +524,7 @@ void ppl_stripe_write_finished(struct stripe_head *sh)
 		ppl_io_unit_finished(io);
 }
 
-static void ppl_xor(int size, struct page *page1, struct page *page2,
-		    struct page *page_result)
+static void ppl_xor(int size, struct page *page1, struct page *page2)
 {
 	struct async_submit_ctl submit;
 	struct dma_async_tx_descriptor *tx;
@@ -533,7 +532,7 @@ static void ppl_xor(int size, struct page *page1, struct page *page2,
 
 	init_async_submit(&submit, ASYNC_TX_ACK|ASYNC_TX_XOR_DROP_DST,
 			  NULL, NULL, NULL, NULL);
-	tx = async_xor(page_result, xor_srcs, 0, 2, size, &submit);
+	tx = async_xor(page1, xor_srcs, 0, 2, size, &submit);
 
 	async_tx_quiesce(&tx);
 }
@@ -724,7 +723,7 @@ static int ppl_recover_entry(struct ppl_log *log, struct ppl_header_entry *e,
 				goto out;
 			}
 
-			ppl_xor(block_size, page1, page2, page1);
+			ppl_xor(block_size, page1, page2);
 
 			indent -= 2;
 		}
@@ -747,7 +746,7 @@ static int ppl_recover_entry(struct ppl_log *log, struct ppl_header_entry *e,
 				goto out;
 			}
 
-			ppl_xor(block_size, page1, page2, page1);
+			ppl_xor(block_size, page1, page2);
 		}
 
 		/* map raid sector to parity disk */
@@ -846,14 +845,7 @@ static int ppl_recover(struct ppl_log *log, struct ppl_header *pplhdr)
 	}
 
 	/* flush the disk cache after recovery if necessary */
-	if (test_bit(QUEUE_FLAG_WC, &bdev_get_queue(rdev->bdev)->queue_flags)) {
-		struct bio *bio = bio_alloc_bioset(GFP_KERNEL, 0, ppl_conf->bs);
-
-		bio->bi_bdev = rdev->bdev;
-		bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;
-		ret = submit_bio_wait(bio);
-		bio_put(bio);
-	}
+	ret = blkdev_issue_flush(rdev->bdev, GFP_KERNEL, NULL);
 out:
 	__free_page(page);
 	return ret;
-- 
2.11.0


^ permalink raw reply related

* Re: RAID 5 3-drive array inactive - no superblock - can anything be saved?
From: Roman Mamedov @ 2017-03-10 16:05 UTC (permalink / raw)
  To: Robert Schultz; +Cc: linux-raid
In-Reply-To: <deec02c3-26fd-3297-ec0c-5497cbb9a9e7@schultzfamily.ca>

On Fri, 10 Mar 2017 10:18:19 -0500
Robert Schultz <rob@schultzfamily.ca> wrote:

>  >sudo mdadm --assemble --force /dev/md0 /dev/sdb /dev/sdc /dev/sdd
> mdadm: Cannot assemble mbr metadata on /dev/sdb
> mdadm: /dev/sdb has no superblock - assembly aborted
> 
> Running the command with any of the disks says the same thing - no 
> superblock.

But it seems that your array was/is on partitions, not disks, so why are you
assembling with disks? Use sdb1 sdc1 sdd1 in the assemble command and try
again.

-- 
With respect,
Roman

^ permalink raw reply

* Re: [PATCH v5 3/7] raid5-ppl: Partial Parity Log write logging implementation
From: Shaohua Li @ 2017-03-10 18:15 UTC (permalink / raw)
  To: Artur Paszkiewicz, dan.j.williams; +Cc: shli, linux-raid
In-Reply-To: <12942165-232b-a078-f434-1087932ac166@intel.com>

On Fri, Mar 10, 2017 at 04:16:58PM +0100, Artur Paszkiewicz wrote:
> On 03/10/2017 12:24 AM, Shaohua Li wrote:
> > On Thu, Mar 09, 2017 at 09:59:59AM +0100, Artur Paszkiewicz wrote:
> >> Implement the calculation of partial parity for a stripe and PPL write
> >> logging functionality. The description of PPL is added to the
> >> documentation. More details can be found in the comments in raid5-ppl.c.
> >>
> >> Attach a page for holding the partial parity data to stripe_head.
> >> Allocate it only if mddev has the MD_HAS_PPL flag set.
> >>
> >> Partial parity is the xor of not modified data chunks of a stripe and is
> >> calculated as follows:
> >>
> >> - reconstruct-write case:
> >>   xor data from all not updated disks in a stripe
> >>
> >> - read-modify-write case:
> >>   xor old data and parity from all updated disks in a stripe
> >>
> >> Implement it using the async_tx API and integrate into raid_run_ops().
> >> It must be called when we still have access to old data, so do it when
> >> STRIPE_OP_BIODRAIN is set, but before ops_run_prexor5(). The result is
> >> stored into sh->ppl_page.
> >>
> >> Partial parity is not meaningful for full stripe write and is not stored
> >> in the log or used for recovery, so don't attempt to calculate it when
> >> stripe has STRIPE_FULL_WRITE.
> >>
> >> Put the PPL metadata structures to md_p.h because userspace tools
> >> (mdadm) will also need to read/write PPL.
> >>
> >> Warn about using PPL with enabled disk volatile write-back cache for
> >> now. It can be removed once disk cache flushing before writing PPL is
> >> implemented.
> >>
> >> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> > 
> > ... snip ...
> > 
> >> +struct dma_async_tx_descriptor *
> >> +ops_run_partial_parity(struct stripe_head *sh, struct raid5_percpu *percpu,
> >> +		       struct dma_async_tx_descriptor *tx)
> >> +{
> >> +	int disks = sh->disks;
> >> +	struct page **xor_srcs = flex_array_get(percpu->scribble, 0);
> >> +	int count = 0, pd_idx = sh->pd_idx, i;
> >> +	struct async_submit_ctl submit;
> >> +
> >> +	pr_debug("%s: stripe %llu\n", __func__, (unsigned long long)sh->sector);
> >> +
> >> +	/*
> >> +	 * Partial parity is the XOR of stripe data chunks that are not changed
> >> +	 * during the write request. Depending on available data
> >> +	 * (read-modify-write vs. reconstruct-write case) we calculate it
> >> +	 * differently.
> >> +	 */
> >> +	if (sh->reconstruct_state == reconstruct_state_prexor_drain_run) {
> >> +		/* rmw: xor old data and parity from updated disks */
> >> +		for (i = disks; i--;) {
> >> +			struct r5dev *dev = &sh->dev[i];
> >> +			if (test_bit(R5_Wantdrain, &dev->flags) || i == pd_idx)
> >> +				xor_srcs[count++] = dev->page;
> >> +		}
> >> +	} else if (sh->reconstruct_state == reconstruct_state_drain_run) {
> >> +		/* rcw: xor data from all not updated disks */
> >> +		for (i = disks; i--;) {
> >> +			struct r5dev *dev = &sh->dev[i];
> >> +			if (test_bit(R5_UPTODATE, &dev->flags))
> >> +				xor_srcs[count++] = dev->page;
> >> +		}
> >> +	} else {
> >> +		return tx;
> >> +	}
> >> +
> >> +	init_async_submit(&submit, ASYNC_TX_XOR_ZERO_DST, tx, NULL, sh,
> >> +			  flex_array_get(percpu->scribble, 0)
> >> +			  + sizeof(struct page *) * (sh->disks + 2));
> > 
> > Since this should be done before biodrain, should this add ASYNC_TX_FENCE flag?
> 
> The result of this calculation isn't used later by other async_tx
> operations, so it's not needed here, if I understand this correctly. But
> maybe later we could optimize and use partial parity to calculate full
> parity, then it will be necessary.

But the source pages will be overwritten soon, if no fence, I think this is a
problem. Anyway, I'll let Dan to clarify.

Dan,
We are using async API for below operations:
1. xor several source pages to a dest page
2. memcpy other data to the source pages

The two operations will be in an async chain. Should the first operation
include ASYNC_TX_FENCE flag? The API comment declares the flag is required if
the destination page is used by subsequent operations, but I suspect it should
be used too if the subsequent operations could change the source pages.

Thanks,
Shaohua

^ permalink raw reply

* Re: [PATCH] raid5-ppl: two minor improvements
From: Shaohua Li @ 2017-03-10 18:16 UTC (permalink / raw)
  To: Artur Paszkiewicz; +Cc: shli, linux-raid
In-Reply-To: <20170310154044.11198-1-artur.paszkiewicz@intel.com>

On Fri, Mar 10, 2017 at 04:40:44PM +0100, Artur Paszkiewicz wrote:
> - Remove 'page_result' parameter from ppl_xor() and always write result
>   to 'page1'.
> - Use blkdev_issue_flush() instead of open coding the flush bio
>   submission.

merged to original patch
 
> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> ---
>  drivers/md/raid5-ppl.c | 18 +++++-------------
>  1 file changed, 5 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
> index ca1ef644351a..2419fbc6f684 100644
> --- a/drivers/md/raid5-ppl.c
> +++ b/drivers/md/raid5-ppl.c
> @@ -524,8 +524,7 @@ void ppl_stripe_write_finished(struct stripe_head *sh)
>  		ppl_io_unit_finished(io);
>  }
>  
> -static void ppl_xor(int size, struct page *page1, struct page *page2,
> -		    struct page *page_result)
> +static void ppl_xor(int size, struct page *page1, struct page *page2)
>  {
>  	struct async_submit_ctl submit;
>  	struct dma_async_tx_descriptor *tx;
> @@ -533,7 +532,7 @@ static void ppl_xor(int size, struct page *page1, struct page *page2,
>  
>  	init_async_submit(&submit, ASYNC_TX_ACK|ASYNC_TX_XOR_DROP_DST,
>  			  NULL, NULL, NULL, NULL);
> -	tx = async_xor(page_result, xor_srcs, 0, 2, size, &submit);
> +	tx = async_xor(page1, xor_srcs, 0, 2, size, &submit);
>  
>  	async_tx_quiesce(&tx);
>  }
> @@ -724,7 +723,7 @@ static int ppl_recover_entry(struct ppl_log *log, struct ppl_header_entry *e,
>  				goto out;
>  			}
>  
> -			ppl_xor(block_size, page1, page2, page1);
> +			ppl_xor(block_size, page1, page2);
>  
>  			indent -= 2;
>  		}
> @@ -747,7 +746,7 @@ static int ppl_recover_entry(struct ppl_log *log, struct ppl_header_entry *e,
>  				goto out;
>  			}
>  
> -			ppl_xor(block_size, page1, page2, page1);
> +			ppl_xor(block_size, page1, page2);
>  		}
>  
>  		/* map raid sector to parity disk */
> @@ -846,14 +845,7 @@ static int ppl_recover(struct ppl_log *log, struct ppl_header *pplhdr)
>  	}
>  
>  	/* flush the disk cache after recovery if necessary */
> -	if (test_bit(QUEUE_FLAG_WC, &bdev_get_queue(rdev->bdev)->queue_flags)) {
> -		struct bio *bio = bio_alloc_bioset(GFP_KERNEL, 0, ppl_conf->bs);
> -
> -		bio->bi_bdev = rdev->bdev;
> -		bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;
> -		ret = submit_bio_wait(bio);
> -		bio_put(bio);
> -	}
> +	ret = blkdev_issue_flush(rdev->bdev, GFP_KERNEL, NULL);
>  out:
>  	__free_page(page);
>  	return ret;
> -- 
> 2.11.0
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v5 3/7] raid5-ppl: Partial Parity Log write logging implementation
From: Dan Williams @ 2017-03-10 18:42 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Artur Paszkiewicz, Shaohua Li, linux-raid
In-Reply-To: <20170310181539.67vxhhjjhpqkcjyf@kernel.org>

On Fri, Mar 10, 2017 at 10:15 AM, Shaohua Li <shli@kernel.org> wrote:
> On Fri, Mar 10, 2017 at 04:16:58PM +0100, Artur Paszkiewicz wrote:
>> On 03/10/2017 12:24 AM, Shaohua Li wrote:
>> > On Thu, Mar 09, 2017 at 09:59:59AM +0100, Artur Paszkiewicz wrote:
>> >> Implement the calculation of partial parity for a stripe and PPL write
>> >> logging functionality. The description of PPL is added to the
>> >> documentation. More details can be found in the comments in raid5-ppl.c.
>> >>
>> >> Attach a page for holding the partial parity data to stripe_head.
>> >> Allocate it only if mddev has the MD_HAS_PPL flag set.
>> >>
>> >> Partial parity is the xor of not modified data chunks of a stripe and is
>> >> calculated as follows:
>> >>
>> >> - reconstruct-write case:
>> >>   xor data from all not updated disks in a stripe
>> >>
>> >> - read-modify-write case:
>> >>   xor old data and parity from all updated disks in a stripe
>> >>
>> >> Implement it using the async_tx API and integrate into raid_run_ops().
>> >> It must be called when we still have access to old data, so do it when
>> >> STRIPE_OP_BIODRAIN is set, but before ops_run_prexor5(). The result is
>> >> stored into sh->ppl_page.
>> >>
>> >> Partial parity is not meaningful for full stripe write and is not stored
>> >> in the log or used for recovery, so don't attempt to calculate it when
>> >> stripe has STRIPE_FULL_WRITE.
>> >>
>> >> Put the PPL metadata structures to md_p.h because userspace tools
>> >> (mdadm) will also need to read/write PPL.
>> >>
>> >> Warn about using PPL with enabled disk volatile write-back cache for
>> >> now. It can be removed once disk cache flushing before writing PPL is
>> >> implemented.
>> >>
>> >> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
>> >
>> > ... snip ...
>> >
>> >> +struct dma_async_tx_descriptor *
>> >> +ops_run_partial_parity(struct stripe_head *sh, struct raid5_percpu *percpu,
>> >> +                 struct dma_async_tx_descriptor *tx)
>> >> +{
>> >> +  int disks = sh->disks;
>> >> +  struct page **xor_srcs = flex_array_get(percpu->scribble, 0);
>> >> +  int count = 0, pd_idx = sh->pd_idx, i;
>> >> +  struct async_submit_ctl submit;
>> >> +
>> >> +  pr_debug("%s: stripe %llu\n", __func__, (unsigned long long)sh->sector);
>> >> +
>> >> +  /*
>> >> +   * Partial parity is the XOR of stripe data chunks that are not changed
>> >> +   * during the write request. Depending on available data
>> >> +   * (read-modify-write vs. reconstruct-write case) we calculate it
>> >> +   * differently.
>> >> +   */
>> >> +  if (sh->reconstruct_state == reconstruct_state_prexor_drain_run) {
>> >> +          /* rmw: xor old data and parity from updated disks */
>> >> +          for (i = disks; i--;) {
>> >> +                  struct r5dev *dev = &sh->dev[i];
>> >> +                  if (test_bit(R5_Wantdrain, &dev->flags) || i == pd_idx)
>> >> +                          xor_srcs[count++] = dev->page;
>> >> +          }
>> >> +  } else if (sh->reconstruct_state == reconstruct_state_drain_run) {
>> >> +          /* rcw: xor data from all not updated disks */
>> >> +          for (i = disks; i--;) {
>> >> +                  struct r5dev *dev = &sh->dev[i];
>> >> +                  if (test_bit(R5_UPTODATE, &dev->flags))
>> >> +                          xor_srcs[count++] = dev->page;
>> >> +          }
>> >> +  } else {
>> >> +          return tx;
>> >> +  }
>> >> +
>> >> +  init_async_submit(&submit, ASYNC_TX_XOR_ZERO_DST, tx, NULL, sh,
>> >> +                    flex_array_get(percpu->scribble, 0)
>> >> +                    + sizeof(struct page *) * (sh->disks + 2));
>> >
>> > Since this should be done before biodrain, should this add ASYNC_TX_FENCE flag?
>>
>> The result of this calculation isn't used later by other async_tx
>> operations, so it's not needed here, if I understand this correctly. But
>> maybe later we could optimize and use partial parity to calculate full
>> parity, then it will be necessary.
>
> But the source pages will be overwritten soon, if no fence, I think this is a
> problem. Anyway, I'll let Dan to clarify.
>
> Dan,
> We are using async API for below operations:
> 1. xor several source pages to a dest page
> 2. memcpy other data to the source pages
>
> The two operations will be in an async chain. Should the first operation
> include ASYNC_TX_FENCE flag? The API comment declares the flag is required if
> the destination page is used by subsequent operations, but I suspect it should
> be used too if the subsequent operations could change the source pages.

I think you're right. If operation2 could change operation1 source
data we should include the fence.

^ permalink raw reply

* Re: [PATCH v2] blk: improve order of bio handling in generic_make_request()
From: Lars Ellenberg @ 2017-03-10 18:51 UTC (permalink / raw)
  To: Jack Wang
  Cc: Mikulas Patocka, Mike Snitzer, NeilBrown, Jens Axboe, LKML,
	Kent Overstreet, Pavel Machek, linux-raid,
	device-mapper development, linux-block
In-Reply-To: <153a6cff-c553-0d18-e15b-4f3defc3a42b@profitbricks.com>

On Fri, Mar 10, 2017 at 04:07:58PM +0100, Jack Wang wrote:
> On 10.03.2017 15:55, Mikulas Patocka wrote:
> > On Fri, 10 Mar 2017, Mike Snitzer wrote:
> >> On Fri, Mar 10 2017 at  7:34am -0500,
> >> Lars Ellenberg <lars.ellenberg@linbit.com> wrote:
> >>
> >>>> --- a/block/blk-core.c
> >>>> +++ b/block/blk-core.c
> >>>> @@ -1975,7 +1975,14 @@ generic_make_request_checks(struct bio *bio)
> >>>>   */
> >>>>  blk_qc_t generic_make_request(struct bio *bio)
> >>>>  {
> >>>> -       struct bio_list bio_list_on_stack;
> >>>> +       /*
> >>>> +        * bio_list_on_stack[0] contains bios submitted by the current
> >>>> +        * make_request_fn.
> >>>> +        * bio_list_on_stack[1] contains bios that were submitted before
> >>>> +        * the current make_request_fn, but that haven't been processed
> >>>> +        * yet.
> >>>> +        */
> >>>> +       struct bio_list bio_list_on_stack[2];
> >>>>         blk_qc_t ret = BLK_QC_T_NONE;
> >>>
> >>> May I suggest that, if you intend to assign something that is not a
> >>> plain &(struct bio_list), but a &(struct bio_list[2]),
> >>> you change the task member so it is renamed (current->bio_list vs
> >>> current->bio_lists, plural, is what I did last year).
> >>> Or you will break external modules, silently, and horribly (or,
> >>> rather, they won't notice, but break the kernel).
> >>> Examples of such modules would be DRBD, ZFS, quite possibly others.

> > It's better to make external modules not compile than to silently 
> > introduce bugs in them. So yes, I would rename that.
> > 
> > Mikulas
> 
> Agree, better rename current->bio_list to current->bio_lists
>
> Regards,
> Jack

Thank you.

(I don't know if some one does, but...)
Thing is: *IF* some external thing messes with
current->bio_list in "interesting" ways, and not just the
"I don't care, one level of real recursion fixes this for me"
pattern of
	struct bio_list *tmp = current->bio_list;
	current->bio_list = NULL;
	submit_bio()
	current->bio_list = tmp;
you get a chance of stack corruption,
without even as much as a compiler warning.

Which is why I wrote https://lkml.org/lkml/2016/7/8/469
...

Instead, I suggest to distinguish between recursive calls to
generic_make_request(), and pushing back the remainder part in
blk_queue_split(), by pointing current->bio_lists to a
	struct recursion_to_iteration_bio_lists {
		struct bio_list recursion;
		struct bio_list queue;
	}

By providing each q->make_request_fn() with an empty "recursion"
bio_list, then merging any recursively submitted bios to the
head of the "queue" list, we can make the recursion-to-iteration
logic in generic_make_request() process deepest level bios first,
and "sibling" bios of the same level in "natural" order.

...

Cheers,

	Lars

^ permalink raw reply

* linux-next: WARNING: CPU: 0 PID: 1 at lib/refcount.c:114 refcount_inc+0x37/0x40
From: Andrei Vagin @ 2017-03-10 20:01 UTC (permalink / raw)
  To: linux-raid, Shaohua Li

Hello,

We run CRIU tests for linux-next kernels and here is a new issue:

All logs are here: https://api.travis-ci.org/jobs/209680974/log.txt?deansi=true
The kernel version is 4.11.0-rc1-next-20170310

[    2.324763] md: Waiting for all devices to be available before autodetect
[    2.331707] md: If you don't use raid, use raid=noautodetect
[    2.338189] ------------[ cut here ]------------
[    2.342965] WARNING: CPU: 0 PID: 1 at lib/refcount.c:114
refcount_inc+0x37/0x40
[    2.350427] refcount_t: increment on 0; use-after-free.
[    2.355794] Modules linked in:
[    2.358979] CPU: 0 PID: 1 Comm: swapper/0 Not tainted
4.11.0-rc1-next-20170310 #1
[    2.362966] Hardware name: Google Google Compute Engine/Google
Compute Engine, BIOS Google 01/01/2011
[    2.362966] Call Trace:
[    2.362966]  dump_stack+0x85/0xc9
[    2.362966]  __warn+0xd1/0xf0
[    2.362966]  warn_slowpath_fmt+0x4f/0x60
[    2.362966]  refcount_inc+0x37/0x40
[    2.362966]  mddev_find+0x1f1/0x2b0
[    2.362966]  md_open+0x1a/0xd0
[    2.362966]  __blkdev_get+0x85/0x4c0
[    2.362966]  blkdev_get+0x1d3/0x340
[    2.362966]  ? _raw_spin_unlock+0x27/0x40
[    2.362966]  blkdev_open+0x5b/0x70
[    2.362966]  do_dentry_open+0x213/0x330
[    2.362966]  ? bd_acquire+0xd0/0xd0
[    2.362966]  vfs_open+0x4f/0x80
[    2.362966]  ? may_open+0x9b/0x100
[    2.362966]  path_openat+0x48a/0xd50
[    2.362966]  ? console_unlock+0x2f9/0x560
[    2.362966]  do_filp_open+0x7e/0xd0
[    2.362966]  ? _raw_spin_unlock+0x27/0x40
[    2.362966]  ? __alloc_fd+0xf7/0x210
[    2.362966]  do_sys_open+0x115/0x1f0
[    2.362966]  SyS_open+0x1e/0x20
[    2.362966]  md_run_setup+0x71/0x9a
[    2.362966]  prepare_namespace+0x36/0x1a4
[    2.362966]  kernel_init_freeable+0x254/0x269
[    2.362966]  ? set_debug_rodata+0x12/0x12
[    2.362966]  ? rest_init+0x140/0x140
[    2.362966]  kernel_init+0xe/0x100
[    2.362966]  ret_from_fork+0x31/0x40
[    2.482465] ---[ end trace a822b43a79b1f9f5 ]---
[    2.487353] md: Autodetecting RAID arrays.
[    2.491647] md: autorun ...
[    2.494592] md: ... autorun DONE.
[    2.503263] EXT4-fs (sda1): couldn't mount as ext3 due to feature
incompatibilities
[    2.511467] ------------[ cut here ]------------
[    2.511477] WARNING: CPU: 0 PID: 21 at lib/refcount.c:207
refcount_dec_not_one+0x75/0x80
[    2.511478] refcount_t: underflow; use-after-free.
[    2.511480] Modules linked in:
[    2.511485] CPU: 0 PID: 21 Comm: kworker/0:1 Tainted: G        W
   4.11.0-rc1-next-20170310 #1
[    2.511486] Hardware name: Google Google Compute Engine/Google
Compute Engine, BIOS Google 01/01/2011
[    2.511490] Workqueue: events delayed_fput
[    2.511492] Call Trace:
[    2.511496]  dump_stack+0x85/0xc9
[    2.511501]  __warn+0xd1/0xf0
[    2.511505]  warn_slowpath_fmt+0x4f/0x60
[    2.511509]  refcount_dec_not_one+0x75/0x80
[    2.511511]  refcount_dec_and_lock+0x16/0x50
[    2.511515]  mddev_put+0x22/0x150
[    2.511517]  md_release+0x21/0x30
[    2.511521]  __blkdev_put+0x2df/0x340
[    2.511526]  blkdev_put+0x50/0x150
[    2.511529]  blkdev_close+0x25/0x30
[    2.511531]  __fput+0xfa/0x230
[    2.511535]  delayed_fput+0x25/0x30
[    2.511538]  process_one_work+0x1e1/0x670
[    2.511539]  ? process_one_work+0x162/0x670
[    2.511544]  worker_thread+0x137/0x4b0
[    2.511546]  ? trace_hardirqs_on+0xd/0x10
[    2.511551]  kthread+0x10c/0x140
[    2.511552]  ? process_one_work+0x670/0x670
[    2.511554]  ? kthread_create_on_node+0x40/0x40
[    2.511558]  ret_from_fork+0x31/0x40
[    2.511566] ---[ end trace a822b43a79b1f9f6 ]---

^ permalink raw reply

* on assembly and recovery of a hardware RAID
From: Alfred Matthews @ 2017-03-10 20:37 UTC (permalink / raw)
  To: linux-raid

Hello list. I'm facing a non-redundant Western Digital hardware RAID,
for which, hardware seems to cause a kernel panic at about 3 seconds
running time.

I've assembled the customary testing. The drives appear to be striped RAID 0.

Output: http://pastebin.com/c361jGVx

Evidently WD metadata changes over time, since a new console (adding
USB) will not recognize the drives without erasing them. Files are
visible as files for the short period of controller health.

I imagine I'm trying to assemble a 2 x 3TB RAID array from the
original WD disks when mounted as SATA.

Seeking input on proper mdadm configuration for this.

Then I imagine that I may recover files-as-files from this 2 x 3TB to
standalone disks. Ultimately they would need to move to a new RAID.

Failing: WD My Book Thunderbolt Duo, 2x3TB
New, incompatible: WD My Book Pro, 2x3TB.

Thanks for any comment.

Thanks for your time.

Al Matthews

^ permalink raw reply

* Re: linux-next: WARNING: CPU: 0 PID: 1 at lib/refcount.c:114 refcount_inc+0x37/0x40
From: Shaohua Li @ 2017-03-10 20:54 UTC (permalink / raw)
  To: Andrei Vagin; +Cc: linux-raid, elena.reshetova
In-Reply-To: <CANaxB-xDKh+_0eR2p4rODObQPVFwNM1-vK=fmCV-jaBYpGkPAw@mail.gmail.com>

On Fri, Mar 10, 2017 at 12:01:06PM -0800, Andrei Vagin wrote:
> Hello,
> 
> We run CRIU tests for linux-next kernels and here is a new issue:
> 
> All logs are here: https://api.travis-ci.org/jobs/209680974/log.txt?deansi=true
> The kernel version is 4.11.0-rc1-next-20170310

Thanks for the reporting. It caused by 731d126(drivers, md: convert
mddev.active from atomic_t to refcount_t). It turns out the count doesn't match
the refcount usage. I'll drop the patch temporarily.

Thanks,
Shaohua
> 
> [    2.324763] md: Waiting for all devices to be available before autodetect
> [    2.331707] md: If you don't use raid, use raid=noautodetect
> [    2.338189] ------------[ cut here ]------------
> [    2.342965] WARNING: CPU: 0 PID: 1 at lib/refcount.c:114
> refcount_inc+0x37/0x40
> [    2.350427] refcount_t: increment on 0; use-after-free.
> [    2.355794] Modules linked in:
> [    2.358979] CPU: 0 PID: 1 Comm: swapper/0 Not tainted
> 4.11.0-rc1-next-20170310 #1
> [    2.362966] Hardware name: Google Google Compute Engine/Google
> Compute Engine, BIOS Google 01/01/2011
> [    2.362966] Call Trace:
> [    2.362966]  dump_stack+0x85/0xc9
> [    2.362966]  __warn+0xd1/0xf0
> [    2.362966]  warn_slowpath_fmt+0x4f/0x60
> [    2.362966]  refcount_inc+0x37/0x40
> [    2.362966]  mddev_find+0x1f1/0x2b0
> [    2.362966]  md_open+0x1a/0xd0
> [    2.362966]  __blkdev_get+0x85/0x4c0
> [    2.362966]  blkdev_get+0x1d3/0x340
> [    2.362966]  ? _raw_spin_unlock+0x27/0x40
> [    2.362966]  blkdev_open+0x5b/0x70
> [    2.362966]  do_dentry_open+0x213/0x330
> [    2.362966]  ? bd_acquire+0xd0/0xd0
> [    2.362966]  vfs_open+0x4f/0x80
> [    2.362966]  ? may_open+0x9b/0x100
> [    2.362966]  path_openat+0x48a/0xd50
> [    2.362966]  ? console_unlock+0x2f9/0x560
> [    2.362966]  do_filp_open+0x7e/0xd0
> [    2.362966]  ? _raw_spin_unlock+0x27/0x40
> [    2.362966]  ? __alloc_fd+0xf7/0x210
> [    2.362966]  do_sys_open+0x115/0x1f0
> [    2.362966]  SyS_open+0x1e/0x20
> [    2.362966]  md_run_setup+0x71/0x9a
> [    2.362966]  prepare_namespace+0x36/0x1a4
> [    2.362966]  kernel_init_freeable+0x254/0x269
> [    2.362966]  ? set_debug_rodata+0x12/0x12
> [    2.362966]  ? rest_init+0x140/0x140
> [    2.362966]  kernel_init+0xe/0x100
> [    2.362966]  ret_from_fork+0x31/0x40
> [    2.482465] ---[ end trace a822b43a79b1f9f5 ]---
> [    2.487353] md: Autodetecting RAID arrays.
> [    2.491647] md: autorun ...
> [    2.494592] md: ... autorun DONE.
> [    2.503263] EXT4-fs (sda1): couldn't mount as ext3 due to feature
> incompatibilities
> [    2.511467] ------------[ cut here ]------------
> [    2.511477] WARNING: CPU: 0 PID: 21 at lib/refcount.c:207
> refcount_dec_not_one+0x75/0x80
> [    2.511478] refcount_t: underflow; use-after-free.
> [    2.511480] Modules linked in:
> [    2.511485] CPU: 0 PID: 21 Comm: kworker/0:1 Tainted: G        W
>    4.11.0-rc1-next-20170310 #1
> [    2.511486] Hardware name: Google Google Compute Engine/Google
> Compute Engine, BIOS Google 01/01/2011
> [    2.511490] Workqueue: events delayed_fput
> [    2.511492] Call Trace:
> [    2.511496]  dump_stack+0x85/0xc9
> [    2.511501]  __warn+0xd1/0xf0
> [    2.511505]  warn_slowpath_fmt+0x4f/0x60
> [    2.511509]  refcount_dec_not_one+0x75/0x80
> [    2.511511]  refcount_dec_and_lock+0x16/0x50
> [    2.511515]  mddev_put+0x22/0x150
> [    2.511517]  md_release+0x21/0x30
> [    2.511521]  __blkdev_put+0x2df/0x340
> [    2.511526]  blkdev_put+0x50/0x150
> [    2.511529]  blkdev_close+0x25/0x30
> [    2.511531]  __fput+0xfa/0x230
> [    2.511535]  delayed_fput+0x25/0x30
> [    2.511538]  process_one_work+0x1e1/0x670
> [    2.511539]  ? process_one_work+0x162/0x670
> [    2.511544]  worker_thread+0x137/0x4b0
> [    2.511546]  ? trace_hardirqs_on+0xd/0x10
> [    2.511551]  kthread+0x10c/0x140
> [    2.511552]  ? process_one_work+0x670/0x670
> [    2.511554]  ? kthread_create_on_node+0x40/0x40
> [    2.511558]  ret_from_fork+0x31/0x40
> [    2.511566] ---[ end trace a822b43a79b1f9f6 ]---

^ permalink raw reply

* Re: [PATCH v3] md/r5cache: generate R5LOG_PAYLOAD_FLUSH
From: Shaohua Li @ 2017-03-10 21:01 UTC (permalink / raw)
  To: Song Liu; +Cc: linux-raid, shli, neilb, kernel-team, dan.j.williams, hch
In-Reply-To: <20170310052339.2705936-1-songliubraving@fb.com>

On Thu, Mar 09, 2017 at 09:23:39PM -0800, Song Liu wrote:
> In r5c_finish_stripe_write_out(), R5LOG_PAYLOAD_FLUSH is append to
> log->current_io.
> 
> Appending R5LOG_PAYLOAD_FLUSH in quiesce needs extra writes to
> journal. To simplify the logic, we just skip R5LOG_PAYLOAD_FLUSH in
> quiesce.
> 
> Even R5LOG_PAYLOAD_FLUSH supports multiple stripes per payload.
> However, current implementation is one stripe per R5LOG_PAYLOAD_FLUSH,
> which is simpler.

Thanks, applied the two patches.
 
> Signed-off-by: Song Liu <songliubraving@fb.com>
> ---
>  drivers/md/raid5-cache.c | 45 ++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 42 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index e69f922..e3c20c7 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -591,7 +591,7 @@ static void r5l_log_endio(struct bio *bio)
>  
>  	spin_lock_irqsave(&log->io_list_lock, flags);
>  	__r5l_set_io_unit_state(io, IO_UNIT_IO_END);
> -	if (log->need_cache_flush)
> +	if (log->need_cache_flush && !list_empty(&io->stripe_list))
>  		r5l_move_to_end_ios(log);
>  	else
>  		r5l_log_run_stripes(log);
> @@ -619,9 +619,11 @@ static void r5l_log_endio(struct bio *bio)
>  			bio_endio(bi);
>  			atomic_dec(&io->pending_stripe);
>  		}
> -		if (atomic_read(&io->pending_stripe) == 0)
> -			__r5l_stripe_write_finished(io);
>  	}
> +
> +	/* finish flush only io_unit and PAYLOAD_FLUSH only io_unit */
> +	if (atomic_read(&io->pending_stripe) == 0)
> +		__r5l_stripe_write_finished(io);
>  }
>  
>  static void r5l_do_submit_io(struct r5l_log *log, struct r5l_io_unit *io)
> @@ -843,6 +845,41 @@ static void r5l_append_payload_page(struct r5l_log *log, struct page *page)
>  	r5_reserve_log_entry(log, io);
>  }
>  
> +static void r5l_append_flush_payload(struct r5l_log *log, sector_t sect)
> +{
> +	struct mddev *mddev = log->rdev->mddev;
> +	struct r5conf *conf = mddev->private;
> +	struct r5l_io_unit *io;
> +	struct r5l_payload_flush *payload;
> +	int meta_size;
> +
> +	/*
> +	 * payload_flush requires extra writes to the journal.
> +	 * To avoid handling the extra IO in quiesce, just skip
> +	 * flush_payload
> +	 */
> +	if (conf->quiesce)
> +		return;
> +
> +	mutex_lock(&log->io_mutex);
> +	meta_size = sizeof(struct r5l_payload_flush) + sizeof(__le64);
> +
> +	if (r5l_get_meta(log, meta_size)) {
> +		mutex_unlock(&log->io_mutex);
> +		return;
> +	}
> +
> +	/* current implementation is one stripe per flush payload */
> +	io = log->current_io;
> +	payload = page_address(io->meta_page) + io->meta_offset;
> +	payload->header.type = cpu_to_le16(R5LOG_PAYLOAD_FLUSH);
> +	payload->header.flags = cpu_to_le16(0);
> +	payload->size = cpu_to_le32(sizeof(__le64));
> +	payload->flush_stripes[0] = cpu_to_le64(sect);
> +	io->meta_offset += meta_size;
> +	mutex_unlock(&log->io_mutex);
> +}
> +
>  static int r5l_log_stripe(struct r5l_log *log, struct stripe_head *sh,
>  			   int data_pages, int parity_pages)
>  {
> @@ -2784,6 +2821,8 @@ void r5c_finish_stripe_write_out(struct r5conf *conf,
>  		atomic_dec(&conf->r5c_flushing_full_stripes);
>  		atomic_dec(&conf->r5c_cached_full_stripes);
>  	}
> +
> +	r5l_append_flush_payload(log, sh->sector);
>  }
>  
>  int
> -- 
> 2.9.3
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v2] blk: improve order of bio handling in generic_make_request()
From: NeilBrown @ 2017-03-11  0:47 UTC (permalink / raw)
  To: Lars Ellenberg
  Cc: Jens Axboe, linux-block, Mike Snitzer, LKML, linux-raid,
	device-mapper development, Mikulas Patocka, Pavel Machek,
	Jack Wang, Kent Overstreet
In-Reply-To: <CANr6vz8EbRWXq7igGCzRy9JC1Nt=MMma0h8M6nxHQtwiMDa5aQ@mail.gmail.com>


[-- Attachment #1.1: Type: text/plain, Size: 3001 bytes --]

On Fri, Mar 10 2017, Lars Ellenberg wrote:

>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -1975,7 +1975,14 @@ generic_make_request_checks(struct bio *bio)
>>   */
>>  blk_qc_t generic_make_request(struct bio *bio)
>>  {
>> -       struct bio_list bio_list_on_stack;
>> +       /*
>> +        * bio_list_on_stack[0] contains bios submitted by the current
>> +        * make_request_fn.
>> +        * bio_list_on_stack[1] contains bios that were submitted before
>> +        * the current make_request_fn, but that haven't been processed
>> +        * yet.
>> +        */
>> +       struct bio_list bio_list_on_stack[2];
>>         blk_qc_t ret = BLK_QC_T_NONE;
>
> May I suggest that, if you intend to assign something that is not a
> plain &(struct bio_list), but a &(struct bio_list[2]),
> you change the task member so it is renamed (current->bio_list vs
> current->bio_lists, plural, is what I did last year).
> Or you will break external modules, silently, and horribly (or,
> rather, they won't notice, but break the kernel).
> Examples of such modules would be DRBD, ZFS, quite possibly others.
>

This is exactly what I didn't in my first draft (bio_list -> bio_lists),
but then I reverted that change because it didn't seem to be worth the
noise.
It isn't much noise, sched.h, bcache/btree.c, md/dm-bufio.c, and
md/raid1.c get minor changes.
But as I'm hoping to get rid of all of those uses, renaming before
removing seemed pointless ... though admittedly that is what I did for
bioset_create().... I wondered about that too.

The example you give later:
	struct bio_list *tmp = current->bio_list;
	current->bio_list = NULL;
	submit_bio()
	current->bio_list = tmp;

won't cause any problem.  Whatever lists the parent generic_make_request
is holding onto will be untouched during the submit_bio() call, and will
be exactly as it expects them when this caller returns.

If some out-of-tree code does anything with ->bio_list that makes sense
with the previous code, then it will still make sense with the new
code. However there will be a few bios that it didn't get too look at.
These will all be bios that were submitted by a device further up the
stack (closer to the filesystem), so they *should* be irrelevant.
I could probably come up with some weird behaviour that might have
worked before but now wouldn't quite work the same way.  But just fixing
bugs can sometimes affect an out-of-tree driver in a strange way because
it was assuming those bugs.

I hope that I'll soon be able to remove punt_bios_to_rescuer and
flush_current_bio_list, after which current->bio_list  can really be
just a list again.  I don't think it is worth changing the name for a
transient situation.

But thanks for the review - it encouraged me to think though the
consequences again and I'm now more confident.
I actually now think that change probably wasn't necessary.  It is
safer though.  It ensures that current functionality isn't removed
without a clear justification.

Thanks,
NeilBrown


[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply

* [PATCH] dm cache: handle kmalloc failure allocating background_tracker struct
From: Colin King @ 2017-03-11 19:09 UTC (permalink / raw)
  To: Alasdair Kergon, Mike Snitzer, dm-devel, Shaohua Li, linux-raid
  Cc: kernel-janitors, linux-kernel

From: Colin Ian King <colin.king@canonical.com>

currently there is no kmalloc failure check on the allocation of
the background_tracker struct variable b, and so a null return
will lead to a null pointer deference error. Add null check and move
the failure debug message and NULL return so that the two allocation
errors can share the same error exit path.

Detected by CoverityScan, CID#1416587 ("Dereference null return value")

Signed-off-by: Colin Ian King <colin.king@canonical.com>
---
 drivers/md/dm-cache-background-tracker.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/md/dm-cache-background-tracker.c b/drivers/md/dm-cache-background-tracker.c
index 9b1afdf..d27edbcc 100644
--- a/drivers/md/dm-cache-background-tracker.c
+++ b/drivers/md/dm-cache-background-tracker.c
@@ -33,6 +33,8 @@ struct background_tracker *btracker_create(unsigned max_work)
 {
 	struct background_tracker *b = kmalloc(sizeof(*b), GFP_KERNEL);
 
+	if (!b)
+		goto err;
 	b->max_work = max_work;
 	atomic_set(&b->pending_promotes, 0);
 	atomic_set(&b->pending_writebacks, 0);
@@ -44,12 +46,15 @@ struct background_tracker *btracker_create(unsigned max_work)
 	b->pending = RB_ROOT;
 	b->work_cache = KMEM_CACHE(bt_work, 0);
 	if (!b->work_cache) {
-		DMERR("couldn't create mempool for background work items");
 		kfree(b);
-		b = NULL;
+		goto err;
 	}
 
 	return b;
+err:
+	DMERR("couldn't create mempool for background work items");
+	return NULL;
+
 }
 EXPORT_SYMBOL_GPL(btracker_create);
 
-- 
2.10.2

^ permalink raw reply related

* Re: [PATCH 0/5] Updates following recent generic_make_request improvement
From: Jens Axboe @ 2017-03-11 22:32 UTC (permalink / raw)
  To: NeilBrown
  Cc: linux-block, Mike Snitzer, LKML, linux-raid,
	device-mapper development, Mikulas Patocka, Pavel Machek,
	Jack Wang, Lars Ellenberg, Kent Overstreet
In-Reply-To: <148912539296.4002.219258660543808741.stgit@noble>

On 03/09/2017 11:00 PM, NeilBrown wrote:
> This is a rebase of the series I sent earlier, based on the
> very latest from Linus, which included my first patch.
> 
> The first fixes a problem that patch introduced, and so should go to
> Linux promptly.
> The others are more general improvements and can go in the normal
> course of events.
> 
> It is possible that the changes to btrfs and xfs can just be dropped
> as a subsequent patch will be needed to revert them anyway.  They are
> there only to be able to say that "blk: make the bioset
> rescue_workqueue optional." doesn't change any functionality at all.

I have applied the first patch for this series, and added the Fixes
tag. We'll let the rest simmer a bit, then aim for 4.12 for those.

-- 
Jens Axboe


^ permalink raw reply

* Re: When will Linux support M2 on RAID ?
From: David F. @ 2017-03-12  0:52 UTC (permalink / raw)
  To: Austin S. Hemmelgarn
  Cc: Christoph Hellwig, linux-kernel, linux-raid@vger.kernel.org
In-Reply-To: <f37700c7-139d-2bf4-07da-4f0bf77c743d@gmail.com>

Very possible it affects other devices attached, but all consumer
reports and test systems here all have NVME drives on m2 and when in
RAID mode.  Listing PCI data linux will show Intel SATA controller
detected in RAID mode, but no drives detected, all you get is your
/dev/sda USB boot device.  A lot of places linux uses a table of known
ID's and if not listed doesn't support it, unlike Windows which always
supports the given generic device type when it can (keyboard, etc..)
and special drivers for special features.  RAID of course is different
an typically requires special drivers.   As mentioned, some system
don't let you change the mode, others you can't use linux as a
maintenance platform since it won't see any of the drives.  Just be
nice to have things "just work".

On Tue, Mar 7, 2017 at 7:54 AM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2017-03-07 10:15, Christoph Hellwig wrote:
>>
>> On Tue, Mar 07, 2017 at 09:50:22AM -0500, Austin S. Hemmelgarn wrote:
>>>
>>> He's referring to the RAID mode most modern Intel chipsets have, which
>>> (last
>>> I checked) Linux does not support completely and many OEM's are setting
>>> by
>>> default on new systems because it apparently provides better performance
>>> than AHCI even for a single device.
>>
>>
>> It actually provides worse performance.  What it does it that it shoves
>> up to three nvme device bars into the bar of an AHCI device, and
>> requires the OS to handle them all using a single driver.  The Money's
>> on crack at Intel decided to do that to provide their "valueable" RSTe
>> IP (which is a windows ATA + RAID driver in a blob, which now has also
>> grown a NVMe driver).  The only remotely sane thing is to disable it
>> in the bios, and burn all people involved with it.  The next best thing
>> is to provide a fake PCIe root port driver untangling this before it
>> hits the driver, but unfortunately Intel is unwilling to either do this
>> on their own or at least provide enough documentation for others to do
>> it.
>>
> For NVMe, yeah, it hurts performance horribly.  For SATA devices though,
> it's hit or miss, some setups perform better, some perform worse.
>
> It does have one advantage though, it lets you put the C drive for a Windows
> install on a soft-RAID array insanely easily compared to trying to do so
> through Windows itself (although still significantly less easily that doing
> the equivalent on Linux...).
>
> The cynic in me is tempted to believe that the OEM's who are turning it on
> by default are trying to either:
> 1. Make their low-end systems look even crappier in terms of performance
> while adding to their marketing checklist (Of the systems I've seen that
> have this on by default, most were cheap ones with really low specs).
> 2. Actively make it harder to run anything but Windows on their hardware.

^ permalink raw reply

* Re: [PATCH 2/4] mdadm:external bitmap only supports ext filesystem
From: Liu Zhilong @ 2017-03-12 14:48 UTC (permalink / raw)
  To: Jes.Sorensen, shli; +Cc: linux-raid
In-Reply-To: <20170308075144.24873-1-zlliu@suse.com>


      as this patch's purpose, just wanna improve the prompt when using the
external bitmap mode, and this patch maybe a little redundant.
      For the errno rule, RUN_ARRAY returned EINVAL indeed and the man page
has indicated that external bitmap only works with ext[2-4] file system.
      I think it would be more user-friendly if prints one prompt and 
returned
EINVAL at the same time when the bmap() got failure.
      Such as:

diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index 9fb2cca..0bff96b 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -381,6 +381,7 @@ static int read_page(struct file *file, unsigned 
long index,
                         bh->b_blocknr = bmap(inode, block);
                         if (bh->b_blocknr == 0) {
                                 /* Cannot use this file! */
+                               pr_err("The external bitmap only works 
with ext[2-4] filesystem.\n");
                                 ret = -EINVAL;
                                 goto out;
                         }
Thanks,
-Zhilong

On 03/08/2017 02:51 AM, Zhilong Liu wrote:
> mdadm: ensure that the external bitmap_file is
> stored by ext[2-4] file system, because bmap()
> of linux/driver/md/bitmap.c exits directly when
> the bitmap_file isn't suitable. mdadm should make
> users aware of this scenario and give a prompt.
>
> Signed-off-by: Zhilong Liu <zlliu@suse.com>
>
> diff --git a/Create.c b/Create.c
> index 2721884..9a951b0 100644
> --- a/Create.c
> +++ b/Create.c
> @@ -831,11 +831,6 @@ int Create(struct supertype *st, char *mddev,
>   			goto abort_locked;
>   		}
>   		bitmap_fd = open(s->bitmap_file, O_RDWR);
> -		if (bitmap_fd < 0) {
> -			pr_err("weird: %s cannot be openned\n",
> -				s->bitmap_file);
> -			goto abort_locked;
> -		}
>   		if (ioctl(mdfd, SET_BITMAP_FILE, bitmap_fd) < 0) {
>   			pr_err("Cannot set bitmap file for %s: %s\n",
>   				mddev, strerror(errno));
> diff --git a/mdadm.c b/mdadm.c
> index d6ad8dc..19a06db 100644
> --- a/mdadm.c
> +++ b/mdadm.c
> @@ -28,6 +28,7 @@
>   #include "mdadm.h"
>   #include "md_p.h"
>   #include <ctype.h>
> +#include <sys/vfs.h>
>   
>   static int scan_assemble(struct supertype *ss,
>   			 struct context *c,
> @@ -1143,6 +1144,21 @@ int main(int argc, char *argv[])
>   			    strcmp(optarg, "none") == 0 ||
>   			    strchr(optarg, '/') != NULL) {
>   				s.bitmap_file = optarg;
> +				if (strchr(s.bitmap_file, '/') != NULL) {
> +					bitmap_fd = open(s.bitmap_file, O_RDWR);
> +					if (bitmap_fd < 0) {
> +						pr_err("weird: %s cannot be openned\n", s.bitmap_file);
> +						exit(2);
> +					}
> +					close(bitmap_fd);
> +					struct statfs ext_bitmap;
> +					statfs(s.bitmap_file, &ext_bitmap);
> +					if (ext_bitmap.f_type != 0xEF53){
> +						pr_err("external bitmap only supports ext[2-4] filesystem, %s.\n",
> +							s.bitmap_file);
> +						exit(2);
> +					}
> +				}
>   				continue;
>   			}
>   			if (strcmp(optarg, "clustered") == 0) {


^ permalink raw reply related

* Re: [dm-devel] [PATCH 0/5] Updates following recent generic_make_request improvement
From: NeilBrown @ 2017-03-12 21:52 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-raid, Mike Snitzer, LKML, linux-block,
	device-mapper development, Mikulas Patocka, Pavel Machek,
	Jack Wang, Lars Ellenberg, Kent Overstreet
In-Reply-To: <3c5e786a-ba42-ada8-c943-17e8b72c63d1@kernel.dk>

[-- Attachment #1: Type: text/plain, Size: 866 bytes --]

On Sat, Mar 11 2017, Jens Axboe wrote:

> On 03/09/2017 11:00 PM, NeilBrown wrote:
>> This is a rebase of the series I sent earlier, based on the
>> very latest from Linus, which included my first patch.
>> 
>> The first fixes a problem that patch introduced, and so should go to
>> Linux promptly.
>> The others are more general improvements and can go in the normal
>> course of events.
>> 
>> It is possible that the changes to btrfs and xfs can just be dropped
>> as a subsequent patch will be needed to revert them anyway.  They are
>> there only to be able to say that "blk: make the bioset
>> rescue_workqueue optional." doesn't change any functionality at all.
>
> I have applied the first patch for this series, and added the Fixes
> tag. We'll let the rest simmer a bit, then aim for 4.12 for those.
>

Sounds good, thanks.
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [PATCH 0/4] mdadm:checking level once mode has been set
From: NeilBrown @ 2017-03-12 22:54 UTC (permalink / raw)
  To: Jes.Sorensen; +Cc: linux-raid, Zhilong Liu
In-Reply-To: <20170308074831.24683-1-zlliu@suse.com>

[-- Attachment #1: Type: text/plain, Size: 1803 bytes --]

On Wed, Mar 08 2017, Zhilong Liu wrote:

> mdadm: it would be better to check --level ealier,
> because it would fall to different prompt if user
> forgets to specify the --level. such as:
> ./mdadm -CR /dev/md0 -b internal -n2 -x1 /dev/loop[0-2]

When I run that command I get:

  mdadm: a RAID level is needed to create an array.


What do you get?

NeilBrown


>
> Signed-off-by: Zhilong Liu <zlliu@suse.com>
>
> diff --git a/Create.c b/Create.c
> index 9a951b0..beec29f 100644
> --- a/Create.c
> +++ b/Create.c
> @@ -125,10 +125,6 @@ int Create(struct supertype *st, char *mddev,
>  	memset(&info, 0, sizeof(info));
>  	if (s->level == UnSet && st && st->ss->default_geometry)
>  		st->ss->default_geometry(st, &s->level, NULL, NULL);
> -	if (s->level == UnSet) {
> -		pr_err("a RAID level is needed to create an array.\n");
> -		return 1;
> -	}
>  	if (s->raiddisks < 4 && s->level == 6) {
>  		pr_err("at least 4 raid-devices needed for level 6\n");
>  		return 1;
> diff --git a/mdadm.c b/mdadm.c
> index 19a06db..ad24bdf 100644
> --- a/mdadm.c
> +++ b/mdadm.c
> @@ -350,6 +350,12 @@ int main(int argc, char *argv[])
>  				pr_err("Must give -a/--add for devices to add: %s\n", optarg);
>  				exit(2);
>  			}
> +			if (devs_found > 0 && s.level == UnSet && !devmode) {
> +				if (mode == CREATE || mode == BUILD) {
> +					pr_err("a RAID level is needed to create or build an array.\n");
> +					exit(2);
> +				}
> +			}
>  			dv = xmalloc(sizeof(*dv));
>  			dv->devname = optarg;
>  			dv->disposition = devmode;
> -- 
> 2.10.2
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [PATCH 3/4] mdadm:triggers core dump when stat2devnm return NULL
From: NeilBrown @ 2017-03-12 23:00 UTC (permalink / raw)
  To: Jes.Sorensen; +Cc: linux-raid, Zhilong Liu
In-Reply-To: <20170308075221.24965-1-zlliu@suse.com>

[-- Attachment #1: Type: text/plain, Size: 1339 bytes --]

On Wed, Mar 08 2017, Zhilong Liu wrote:

> ensure that the device should be a block device when uses
> --wait parameter, such as the 'f' and 'd' type file would
> be triggered core dumped.
> ./mdadm --wait /dev/md/, happened core dump.
>
> Signed-off-by: Zhilong Liu <zlliu@suse.com>
>
> diff --git a/Monitor.c b/Monitor.c
> index 802a9d9..1900db3 100644
> --- a/Monitor.c
> +++ b/Monitor.c
> @@ -1002,6 +1002,10 @@ int Wait(char *dev)
>  			strerror(errno));
>  		return 2;
>  	}
> +	if ((S_IFMT & stb.st_mode) != S_IFBLK) {
> +		pr_err("%s is not a block device.\n", dev);
> +		return 2;
> +	}
>  	strcpy(devnm, stat2devnm(&stb));

Surely it would be cleaner to do something like:

  tmp = stat2devnm(&stb);
  if (!tmp) {
      pr_err("%s is not a block device.\n", dev);
      return 2;
  }
  strcpy(devnm, tmp);

This makes it more obvious how you have fixed the crash.

>  
>  	while(1) {
> diff --git a/lib.c b/lib.c
> index b640634..7116298 100644
> --- a/lib.c
> +++ b/lib.c
> @@ -89,9 +89,6 @@ char *devid2kname(int devid)
>  
>  char *stat2kname(struct stat *st)
>  {
> -	if ((S_IFMT & st->st_mode) != S_IFBLK)
> -		return NULL;
> -
>  	return devid2kname(st->st_rdev);
>  }

Why are you removing this test?  It has nothing to do with the other
part of the patch.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [PATCH 0/4] mdadm:checking level once mode has been set
From: zhilong @ 2017-03-13  3:22 UTC (permalink / raw)
  To: NeilBrown, Jes.Sorensen; +Cc: linux-raid
In-Reply-To: <878toaf6me.fsf@notabene.neil.brown.name>


On 03/13/2017 06:54 AM, NeilBrown wrote:
> On Wed, Mar 08 2017, Zhilong Liu wrote:
>
>> mdadm: it would be better to check --level ealier,
>> because it would fall to different prompt if user
>> forgets to specify the --level. such as:
>> ./mdadm -CR /dev/md0 -b internal -n2 -x1 /dev/loop[0-2]
> When I run that command I get:
>
>    mdadm: a RAID level is needed to create an array.
>
>
> What do you get?

I'm sorry I have provided the wrong command, for this scenario,
issues "mdadm --build /dev/md0 -n2 /dev/loop[0-1]" should be
proper. it's the purpose to check --level earlier.

If this patch is useful, I would send the v1 patch for it, correct the
comments.

Thanks,
-Zhilong
> NeilBrown
>
>
>> Signed-off-by: Zhilong Liu <zlliu@suse.com>
>>
>> diff --git a/Create.c b/Create.c
>> index 9a951b0..beec29f 100644
>> --- a/Create.c
>> +++ b/Create.c
>> @@ -125,10 +125,6 @@ int Create(struct supertype *st, char *mddev,
>>   	memset(&info, 0, sizeof(info));
>>   	if (s->level == UnSet && st && st->ss->default_geometry)
>>   		st->ss->default_geometry(st, &s->level, NULL, NULL);
>> -	if (s->level == UnSet) {
>> -		pr_err("a RAID level is needed to create an array.\n");
>> -		return 1;
>> -	}
>>   	if (s->raiddisks < 4 && s->level == 6) {
>>   		pr_err("at least 4 raid-devices needed for level 6\n");
>>   		return 1;
>> diff --git a/mdadm.c b/mdadm.c
>> index 19a06db..ad24bdf 100644
>> --- a/mdadm.c
>> +++ b/mdadm.c
>> @@ -350,6 +350,12 @@ int main(int argc, char *argv[])
>>   				pr_err("Must give -a/--add for devices to add: %s\n", optarg);
>>   				exit(2);
>>   			}
>> +			if (devs_found > 0 && s.level == UnSet && !devmode) {
>> +				if (mode == CREATE || mode == BUILD) {
>> +					pr_err("a RAID level is needed to create or build an array.\n");
>> +					exit(2);
>> +				}
>> +			}
>>   			dv = xmalloc(sizeof(*dv));
>>   			dv->devname = optarg;
>>   			dv->disposition = devmode;
>> -- 
>> 2.10.2
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply

* Re: [PATCH 0/4] mdadm:checking level once mode has been set
From: NeilBrown @ 2017-03-13  3:48 UTC (permalink / raw)
  To: zhilong, Jes.Sorensen; +Cc: linux-raid
In-Reply-To: <65454fe3-6165-5d02-eecd-02f8aac8f422@suse.com>

[-- Attachment #1: Type: text/plain, Size: 855 bytes --]

On Mon, Mar 13 2017, zhilong wrote:

> On 03/13/2017 06:54 AM, NeilBrown wrote:
>> On Wed, Mar 08 2017, Zhilong Liu wrote:
>>
>>> mdadm: it would be better to check --level ealier,
>>> because it would fall to different prompt if user
>>> forgets to specify the --level. such as:
>>> ./mdadm -CR /dev/md0 -b internal -n2 -x1 /dev/loop[0-2]
>> When I run that command I get:
>>
>>    mdadm: a RAID level is needed to create an array.
>>
>>
>> What do you get?
>
> I'm sorry I have provided the wrong command, for this scenario,
> issues "mdadm --build /dev/md0 -n2 /dev/loop[0-1]" should be
> proper. it's the purpose to check --level earlier.

OK, that makes sense.

>
> If this patch is useful, I would send the v1 patch for it, correct the
> comments.

My preference would be to have the check in Build.c, but Jes might have
different ideas.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: interesting case of a hung 'recovery'
From: Eyal Lebedinsky @ 2017-03-13  4:12 UTC (permalink / raw)
  To: linux-raid@vger.kernel.org
In-Reply-To: <768e3a4c-070a-a47e-e86a-479918d01c5a@eyal.emu.id.au>

Bump again...

All the data I have is at the bottom.

On 09/03/17 22:00, Eyal Lebedinsky wrote:
> On 09/03/17 20:13, Jack Wang wrote:
>> 2017-03-09 8:39 GMT+01:00 Eyal Lebedinsky <eyal@eyal.emu.id.au>:
>>> Bump.
>>>
>>> On 18/02/17 23:14, Eyal Lebedinsky wrote:
>>>>
>>>> I should start by saying that this is an old fedora 19 system
>>>>
>>>> Executive summary: after '--add'ing a new member a 'recovery' starts but
>>>> 'sync_max' is not reset.
>>>>
>>>> $ uname -a
>>>> Linux e7.eyal.emu.id.au 3.14.27-100.fc19.x86_64 #1 SMP Wed Dec 17 19:36:34
>>>> UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
>>>
>>>
>>> $ sudo mdadm --version
>>> mdadm - v4.0 - 2017-01-09
>>>
>>>> so the issue may have been fixed since.
>>>>
>>>> I had a disk fail in a raid6. After some 'pending' sectors were logged I
>>>> decided to do a 'check'
>>>> around that location (set sync_min/max and echo 'check'). Sure enough it
>>>> elicited disk errors,
>>>> but the disk did not recover and it was kicked out of the array. Moreover
>>>> it became unresponsive.
>>>> It needed a power cycle so I shutdown and rebooted the machine.
>>>>
>>>> Not one to give up easily I tried the check again, with the same result.
>>>> It was time to '--remove' this array member. I then '--add'ed a new disk
>>>> which started a recovery.
>>>>
>>>> A few hours later I noticed that it slowed down. A lot. It actually did
>>>> not progress at all for
>>>> a few hours (I was away from the machine).
>>>>
>>>> As I was staring at the screen (for a long while) I realised that it
>>>> stopped at 55.5%, and this
>>>> number is exactly where the original 'check' failed (I still do not
>>>> understand why with my bad
>>>> memory I remembered this number).
>>>>
>>>> I checked 'sync_completed' and it was proper.
>>>> I then examined 'sync_max' and it was wrong - it had the location where
>>>> the very early 'check'
>>>> failed earlier in the day. It was the same sector where it is now paused
>>>> at - looks related.
>>>>
>>>> I decided to take a (small) risk and do
>>>>     # echo 'max' >/sys/block/md127/md/sync_max
>>>> at which point the recovery moved on. It should be finished in about 5
>>>> hours.
>>>>
>>>> I do not think that it is correct for 'sync_max' to not be set to 'max'
>>>> when a new member is
>>>> added - it surely requires a full recovery.
>>>>
>>>> I really hope (and expect) that this was actually fixed, but this note may
>>>> help others facing
>>>> same predicament.
>>>>
>>>> cheers
>>>>
>>>
>>> --
>>> Eyal Lebedinsky (eyal@eyal.emu.id.au)
>>
>> You'd better offer attach much detailed information, then people can help.
>
> I know, but this is a live system and I cannot experiment with it.
>
> There were no unusual messages (see below) at all. The 'recovery' simply stopped. This is
> the usual behaviour when one sets sync_min/max and starts a (for example) 'check'.
> As I described above, I found that after the --add 'sync_max' was NOT set to 'max' and
> not to the full RAID size but was left at the value where the earlier 'check' failed.
> It failed when the disk completely disappeared.
>
> I guess the original 'check' was left pending, and when a replacement disk was added
> to the RAID it was automatically recovered (as it should) but the original 'check'
> somehow left something behind that should have been reset for the recovery.
>
>> eg:
>> https://raid.wiki.kernel.org/index.php/Asking_for_help
>>
>> For the problem you reported, better offer also kernel dmesg, output
>> of blocking tasks via "echo w >  /proc/sysrq-trigger" maybe also
>> "echo t > /proc/sysrq-trigger"
>
> There was no blocked task, just that the recovery stopped progressing when it
> reached sync_max, as it should. The problem is that sync_max had the wrong value,
> the full (newly added) disk was supposed to be synced.
>
>> Cheers,
>
> In case it helps, here is all the info I have of the event:
>
> ### I have a script that checks the RAID in a small region. I had an earlier error in
> ### on the disk and to check it my script effectively did:
>     echo 4336657408 >sys/block/md127/md/sync_min
>     echo 4339803136 >sys/block/md127/md/sync_max
>     echo check      >sys/block/md127/md/sync_action
> ### messages:
> Feb 18 13:46:31 e7 kernel: [  976.688691] md: data-check of RAID array md127
> Feb 18 13:46:31 e7 kernel: [  976.693254] md: minimum _guaranteed_  speed: 150000 KB/sec/disk.
> Feb 18 13:46:31 e7 kernel: [  976.699479] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
> Feb 18 13:46:31 e7 kernel: [  976.709420] md: using 128k window, over a total of 3906885120k.
> Feb 18 13:46:31 e7 kernel: [  976.715457] md: resuming data-check of md127 from checkpoint.
>
> ### the failure of the 'check'
> ... many i/o errors then sdf completely disappeared ... errors at sectors 4337414{000,040,168}
> Feb 18 13:47:08 e7 kernel: [ 1014.334781] md: super_written gets error=-5, uptodate=0
> Feb 18 13:47:08 e7 kernel: [ 1014.340024] md/raid:md127: Disk failure on sdf1, disabling device.
> Feb 18 13:47:08 e7 kernel: [ 1014.340024] md/raid:md127: Operation continuing on 6 devices.
> Feb 18 13:47:08 e7 kernel: [ 1014.417307] md: md127: data-check interrupted.
>
> ### a new disk (sdj) is plugged in and partitioned.
> $ sudo mdadm /dev/md127 --add /dev/sdj1
> $ cat /proc/mdstat
> Personalities : [raid6] [raid5] [raid4]
> md127 : active raid6 sdj1[11] sdf1[7](F) sdi1[8] sde1[9] sdh1[12] sdc1[0] sdg1[13] sdd1[10]
>       19534425600 blocks super 1.2 level 6, 512k chunk, algorithm 2 [7/6] [UUU_UUU]
>       [>....................]  recovery =  0.7% (29805572/3906885120) finish=509.2min speed=126880K/sec
>       bitmap: 7/30 pages [28KB], 65536KB chunk
>
> ### messages:
> Feb 18 14:23:10 e7 kernel: [ 3177.183250] md: bind<sdj1>
> Feb 18 14:23:10 e7 kernel: [ 3177.255529] md: recovery of RAID array md127
> Feb 18 14:23:10 e7 kernel: [ 3177.259894] md: minimum _guaranteed_  speed: 150000 KB/sec/disk.
> Feb 18 14:23:10 e7 kernel: [ 3177.265994] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
> Feb 18 14:23:10 e7 kernel: [ 3177.275736] md: using 128k window, over a total of 3906885120k.
>
> ### the point when the recovery paused:
> 2017-02-18 20:02:48        [===========>.........]  recovery = 55.4% (2166229192/3906885120) finish=372.8min speed=77803K/sec
> 2017-02-18 20:02:58        [===========>.........]  recovery = 55.4% (2167083344/3906885120) finish=366.2min speed=79159K/sec
> 2017-02-18 20:03:08        [===========>.........]  recovery = 55.4% (2167819876/3906885120) finish=374.8min speed=77316K/sec
> 2017-02-18 20:03:18        [===========>.........]  recovery = 55.5% (2168520428/3906885120) finish=375.4min speed=77157K/sec
> 2017-02-18 20:03:28        [===========>.........]  recovery = 55.5% (2168590848/3906885120) finish=489.4min speed=59194K/sec
> 2017-02-18 20:03:38        [===========>.........]  recovery = 55.5% (2168590848/3906885120) finish=608.7min speed=47588K/sec
> 2017-02-18 20:03:48        [===========>.........]  recovery = 55.5% (2168590848/3906885120) finish=728.1min speed=39786K/sec
> 2017-02-18 20:03:58        [===========>.........]  recovery = 55.5% (2168590848/3906885120) finish=847.5min speed=34182K/sec
> ...
> 2017-02-18 22:36:44 0       [===========>.........]  recovery = 55.5% (2168590848/3906885120) finish=110261.8min speed=262K/sec
> 2017-02-18 22:36:54 0       [===========>.........]  recovery = 55.5% (2168590848/3906885120) finish=110381.2min speed=262K/sec
> 2017-02-18 22:37:04 0       [===========>.........]  recovery = 55.5% (2168590848/3906885120) finish=110500.6min speed=262K/sec
> 2017-02-18 22:37:14 0       [===========>.........]  recovery = 55.5% (2168590848/3906885120) finish=110619.9min speed=261K/sec
>
> # echo 'max' >/sys/block/md127/md/sync_max
> ### recovery now moves on:
> 2017-02-18 22:37:24 0       [===========>.........]  recovery = 55.5% (2168938500/3906885120) finish=117500.2min speed=246K/sec
> 2017-02-18 22:37:34 0       [===========>.........]  recovery = 55.5% (2169997568/3906885120) finish=105201.7min speed=275K/sec
> 2017-02-18 22:37:44 0       [===========>.........]  recovery = 55.5% (2171066120/3906885120) finish=90962.0min speed=318K/sec
> 2017-02-18 22:37:54 0       [===========>.........]  recovery = 55.5% (2172125192/3906885120) finish=269.9min speed=107101K/sec
> 2017-02-18 22:38:04 0       [===========>.........]  recovery = 55.6% (2173114372/3906885120) finish=272.1min speed=106165K/sec
> 2017-02-18 22:38:14 0       [===========>.........]  recovery = 55.6% (2174004224/3906885120) finish=287.3min speed=100492K/sec
>
> ### and it completed over six hours later:
> Feb 19 04:49:16 e7 kernel: [55167.633100] md: md127: recovery done.
>
> cheers
>

-- 
Eyal Lebedinsky (eyal@eyal.emu.id.au)

^ permalink raw reply

* Re: [PATCH 0/4] mdadm:checking level once mode has been set
From: zhilong @ 2017-03-13  5:16 UTC (permalink / raw)
  To: NeilBrown, Jes.Sorensen; +Cc: linux-raid
In-Reply-To: <8737ehg7l8.fsf@notabene.neil.brown.name>


On 03/13/2017 11:48 AM, NeilBrown wrote:
> On Mon, Mar 13 2017, zhilong wrote:
>
>> On 03/13/2017 06:54 AM, NeilBrown wrote:
>>> On Wed, Mar 08 2017, Zhilong Liu wrote:
>>>
>>>> mdadm: it would be better to check --level ealier,
>>>> because it would fall to different prompt if user
>>>> forgets to specify the --level. such as:
>>>> ./mdadm -CR /dev/md0 -b internal -n2 -x1 /dev/loop[0-2]
>>> When I run that command I get:
>>>
>>>     mdadm: a RAID level is needed to create an array.
>>>
>>>
>>> What do you get?
>> I'm sorry I have provided the wrong command, for this scenario,
>> issues "mdadm --build /dev/md0 -n2 /dev/loop[0-1]" should be
>> proper. it's the purpose to check --level earlier.
> OK, that makes sense.
>
>> If this patch is useful, I would send the v1 patch for it, correct the
>> comments.
> My preference would be to have the check in Build.c, but Jes might have
> different ideas.

copy that, thanks very much.

Best regards,
-Zhilong

> NeilBrown


^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox