Linux RAID subsystem development
 help / color / mirror / Atom feed
* Re: [PATCH] md/raid1/10: fix potential deadlock
From: 王金浦 @ 2017-03-02 17:04 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, NeilBrown, Coly Li, v3.14+, only the raid10 part
In-Reply-To: <b5d7363c5bb03ee017d42f09cf16d9e41e0aa8eb.1488315832.git.shli@fb.com>

2017-02-28 22:08 GMT+01:00 Shaohua Li <shli@fb.com>:
> Neil Brown pointed out a potential deadlock in raid 10 code with
> bio_split/chain. The raid1 code could have the same issue, but recent
> barrier rework makes it less likely to happen. The deadlock happens in
> below sequence:
>
> 1. generic_make_request(bio), this will set current->bio_list
> 2. raid10_make_request will split bio to bio1 and bio2
> 3. __make_request(bio1), wait_barrer, add underlayer disk bio to
> current->bio_list
> 4. __make_request(bio2), wait_barrer
>
> If raise_barrier happens between 3 & 4, since wait_barrier runs at 3,
> raise_barrier waits for IO completion from 3. And since raise_barrier
> sets barrier, 4 waits for raise_barrier. But IO from 3 can't be
> dispatched because raid10_make_request() doesn't finished yet.
>
> The solution is to adjust the IO ordering. Quotes from Neil:
> "
> It is much safer to:
>
>     if (need to split) {
>         split = bio_split(bio, ...)
>         bio_chain(...)
>         make_request_fn(split);
>         generic_make_request(bio);
>    } else
>         make_request_fn(mddev, bio);
>
> This way we first process the initial section of the bio (in 'split')
> which will queue some requests to the underlying devices.  These
> requests will be queued in generic_make_request.
> Then we queue the remainder of the bio, which will be added to the end
> of the generic_make_request queue.
> Then we return.
> generic_make_request() will pop the lower-level device requests off the
> queue and handle them first.  Then it will process the remainder
> of the original bio once the first section has been fully processed.
> "
>
> Cc: Coly Li <colyli@suse.de>
> Cc: 王金浦 <jinpuwang@gmail.com>
> Cc: stable@vger.kernel.org (v3.14+, only the raid10 part)
> Suggested-by: NeilBrown <neilb@suse.com>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
>  drivers/md/raid1.c  | 25 +++++++++++++++++++++++--
>  drivers/md/raid10.c | 18 ++++++++++++++++++
>  2 files changed, 41 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 551d654..3c5933b 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -1584,9 +1584,30 @@ static void raid1_make_request(struct mddev *mddev, struct bio *bio)
>                         split = bio;
>                 }
>
> -               if (bio_data_dir(split) == READ)
> +               if (bio_data_dir(split) == READ) {
>                         raid1_read_request(mddev, split);
> -               else
> +
> +                       /*
> +                        * If a bio is splitted, the first part of bio will
> +                        * pass barrier but the bio is queued in
> +                        * current->bio_list (see generic_make_request). If
> +                        * there is a raise_barrier() called here, the second
> +                        * part of bio can't pass barrier. But since the first
> +                        * part bio isn't dispatched to underlaying disks yet,
> +                        * the barrier is never released, hence raise_barrier
> +                        * will alays wait. We have a deadlock.
> +                        * Note, this only happens in read path. For write
> +                        * path, the first part of bio is dispatched in a
> +                        * schedule() call (because of blk plug) or offloaded
> +                        * to raid10d.
> +                        * Quitting from the function immediately can change
> +                        * the bio order queued in bio_list and avoid the deadlock.
> +                        */
> +                       if (split != bio) {
> +                               generic_make_request(bio);
> +                               break;
> +                       }
> +               } else
>                         raid1_write_request(mddev, split);
>         } while (split != bio);
>  }
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index c4db6d1..b1b1f98 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -1584,7 +1584,25 @@ static void raid10_make_request(struct mddev *mddev, struct bio *bio)
>                         split = bio;
>                 }
>
> +               /*
> +                * If a bio is splitted, the first part of bio will pass
> +                * barrier but the bio is queued in current->bio_list (see
> +                * generic_make_request). If there is a raise_barrier() called
> +                * here, the second part of bio can't pass barrier. But since
> +                * the first part bio isn't dispatched to underlaying disks
> +                * yet, the barrier is never released, hence raise_barrier will
> +                * alays wait. We have a deadlock.
> +                * Note, this only happens in read path. For write path, the
> +                * first part of bio is dispatched in a schedule() call
> +                * (because of blk plug) or offloaded to raid10d.
> +                * Quitting from the function immediately can change the bio
> +                * order queued in bio_list and avoid the deadlock.
> +                */
>                 __make_request(mddev, split);
> +               if (split != bio && bio_data_dir(bio) == READ) {
> +                       generic_make_request(bio);
> +                       break;
> +               }
>         } while (split != bio);
>
>         /* In case raid10d snuck in to freeze_array */
> --
> 2.9.3
>
Looks good to me!
Reviewed-by:  Jack Wang <jinpu.wang@profitbricks.com>

Thanks!

^ permalink raw reply

* Re: [RFC PATCH v4] IV Generation algorithms for dm-crypt
From: Gilad Ben-Yossef @ 2017-03-02 14:01 UTC (permalink / raw)
  To: Ondrej Mosnacek
  Cc: Binoy Jayan, Rajendra, Herbert Xu, Oded, Mike Snitzer,
	Linux kernel mailing list, Milan Broz, linux-raid, dm-devel,
	Mark Brown, Arnd Bergmann, linux-crypto, Shaohua Li,
	David S. Miller, Alasdair Kergon, Ofir
In-Reply-To: <CAFqZXNugEK0do+BYC061i4CidMVgg8_BZue8gUVa2g7_RYsAEw@mail.gmail.com>

On Wed, Mar 1, 2017 at 3:21 PM, Ondrej Mosnacek <omosnace@redhat.com> wrote:
> 2017-03-01 13:42 GMT+01:00 Gilad Ben-Yossef <gilad@benyossef.com>:
>
> Wouldn't adopting a bulk request API (something like what I tried to
> do here [1]) that allows users to supply multiple messages, each with
> their own IV, fulfill this purpose? That way, we wouldn't need to
> introduce any new modes into Crypto API and the drivers/accelerators
> would only need to provide bulk implementations of common modes
> (xts(aes), cbc(aes), ...) to provide better performance for dm-crypt
> (and possibly other users, too).
>
> I'm not sure how exactly these crypto accelerators work, but wouldn't
> it help if the drivers simply get more messages (in our case sectors)
> in a single call? I wonder, would (efficiently) supporting such a
> scheme require changes in the HW itself or could it be achieved just
> by modifying driver code (let's say specifically for your CryptoCell
> accelerator)?
>
> [1] https://www.mail-archive.com/linux-crypto@vger.kernel.org/msg23007.html
>


From a general perspective - that is things are expect to be true not
just for CryptoCell but for most HW crypto engines,
you want two things - for the HW engine to be able to burst work for a
long time and than rest for a long time vs. a stop and go scheme
(engine utilization)
and for the average IO transaction to be relatively long (bus utilization)

So, a big cluster size i.e. Milan's proposal) works great - you get both.

Submitting a series of sequential small clusters where the HW can
calculate the IV (e.g. Binoy's proposal) works great if the HW
supports it - you get both.

A batched series of small clusters + IV is less favorable - if your HW
engines has lots of parallel context processing (this is expensive for
HW) you might enjoy engine utilization but the bus utilization will be
low - lots of small transactions.

Gilad


-- 
Gilad Ben-Yossef
Chief Coffee Drinker

"If you take a class in large-scale robotics, can you end up in a
situation where the homework eats your dog?"
 -- Jean-Baptiste Queru

^ permalink raw reply

* Re: GRUB warning after replacing disk drive in RAID1
From: Wols Lists @ 2017-03-02 13:17 UTC (permalink / raw)
  To: Phil Turmel, Peter Sangas, 'Reindl Harald', linux-raid
In-Reply-To: <5136f32b-a0bb-dc16-cae9-59946ea21622@turmel.org>

On 01/03/17 18:29, Phil Turmel wrote:
> Since this worked before, I would guess your grub was updated and its md
> support was left out.  Hopefully someone with more grub experience can
> chip in here -- I don't use any bootloader on my servers any more.

Look at the raid wiki

https://raid.wiki.kernel.org/index.php/Converting_an_existing_system

it mentions grub.

Also, look up grub2 and raid on the gentoo wiki - I wrote a lot of that,
and arch also apparently has very good documentation.

Cheers,
Wol

^ permalink raw reply

* Re: GRUB warning after replacing disk drive in RAID1
From: Wols Lists @ 2017-03-02 13:15 UTC (permalink / raw)
  To: Phil Turmel, Reindl Harald, linux-raid
In-Reply-To: <99581e49-2a6b-e8ce-fdb5-44d923b067eb@turmel.org>

On 02/03/17 02:42, Phil Turmel wrote:
> On 03/01/2017 05:13 PM, Reindl Harald wrote:
>> Am 01.03.2017 um 19:29 schrieb Phil Turmel:
>>> Hi Peter, Reindl,
>>>
>>> { Convention on kernel.org is to reply-to-all, trim unneeded quoted
>>> material, and bottom post or interleave.  Please do so. }
>>
>> why should someone reply to the list and everyboy else subscribed to the
>> list to trigger multiple copies?
> 
> Because kernel.org doesn't require subscriptions to post, and does
> expect participants to include non-subscribers.  Since one can't
> normally tell who is subscribed and who is not, reply-to-all is the rule
> here.  Other lists have other policies.

> 
fwiw, some mailing list software detects this, and if you're cc'd on a
mail the list won't send you another copy.

And some client software also de-dupes for you without asking. I run TB
so I need an add-on - especially as something in my mail setup
duplicates emails madly every now and then ... :-)

Cheers,
Wol

^ permalink raw reply

* Re: GRUB warning after replacing disk drive in RAID1
From: Reindl Harald @ 2017-03-02  9:54 UTC (permalink / raw)
  To: Peter Sangas, linux-raid
In-Reply-To: <008101d292e4$b1aeafe0$150c0fa0$@wnsdev.com>



Am 02.03.2017 um 00:36 schrieb Peter Sangas:
> Hi Reindl,
>
> My comments are interleaved. thanks:
> ___________________________________________________
>
>> #!/bin/bash
>
>> GOOD_DISK="/dev/sda"
>> BAD_DISK="/dev/sdc"
>
>> # clone MBR
>> dd if=$GOOD_DISK of=$BAD_DISK bs=512 count=1
>
> Here I run the command sfdisk -d /dev/$GOOD_DISK | sfdisk -f /dev/$BAD_DISK.
>
> I think dd and sfdisk are doing the same thing which is cloning the
> partitions and copy the MBR ?

yes

>> # force OS to read partition tables
>> partprobe $BAD_DISK
>
> Why run partprobe if the partitions have not changed?

because they *have* changed when you replace a disk with a complete 
empty one and clone the MBR and partition table with 3 partitions and 
have bootet with a empty partition table - not every hardware supports 
hotswap proper and even if it don't harm

>> # install bootloader on replacement disk grub2-install "$BAD_DISK"
>
> Here don't you mean grub-install not grub2-install?

no, i mean what i say since that script replaced multiple disks on 
multiple machines - but how does it matter which name a binary has on 
whatever distribution?

[harry@srv-rhsoft:~]$ rpm -q --filesbypkg grub2 | grep install
[harry@srv-rhsoft:~]$ rpm -q --filesbypkg grub2-tools | grep install
grub2-tools               /usr/sbin/grub2-install
grub2-tools               /usr/share/man/man8/grub2-install.8.gz
[harry@srv-rhsoft:~]$ cat /etc/redhat-release
Generic release 24 (Generic)



^ permalink raw reply

* Re: [PATCH v4 5/7] raid5-ppl: load and recover the log
From: Artur Paszkiewicz @ 2017-03-02  8:40 UTC (permalink / raw)
  To: Shaohua Li; +Cc: shli, linux-raid
In-Reply-To: <20170301175920.r3uzl3sikamvv42a@kernel.org>

On 03/01/2017 06:59 PM, Shaohua Li wrote:
> On Wed, Mar 01, 2017 at 05:14:18PM +0100, Artur Paszkiewicz wrote:
>> On 02/28/2017 02:12 AM, Shaohua Li wrote:
>>> On Tue, Feb 21, 2017 at 08:43:59PM +0100, Artur Paszkiewicz wrote:
>>>> Load the log from each disk when starting the array and recover if the
>>>> array is dirty.
>>>>
>>>> The initial empty PPL is written by mdadm. When loading the log we
>>>> verify the header checksum and signature. For external metadata arrays
>>>> the signature is verified in userspace, so here we read it from the
>>>> header, verifying only if it matches on all disks, and use it later when
>>>> writing PPL.
>>>>
>>>> In addition to the header checksum, each header entry also contains a
>>>> checksum of its partial parity data. If the header is valid, recovery is
>>>> performed for each entry until an invalid entry is found. If the array
>>>> is not degraded and recovery using PPL fully succeeds, there is no need
>>>> to resync the array because data and parity will be consistent, so in
>>>> this case resync will be disabled.
>>>>
>>>> Due to compatibility with IMSM implementations on other systems, we
>>>> can't assume that the recovery data block size is always 4K. Writes
>>>> generated by MD raid5 don't have this issue, but when recovering PPL
>>>> written in other environments it is possible to have entries with
>>>> 512-byte sector granularity. The recovery code takes this into account
>>>> and also the logical sector size of the underlying drives.
>>>>
>>>> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
>>>> ---
>>>>  drivers/md/raid5-ppl.c | 483 +++++++++++++++++++++++++++++++++++++++++++++++++
>>>>  drivers/md/raid5.c     |   5 +-
>>>>  2 files changed, 487 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
>>>> index a00cabf1adf6..a7693353243a 100644
>>>> --- a/drivers/md/raid5-ppl.c
>>>> +++ b/drivers/md/raid5-ppl.c
>>>> @@ -16,6 +16,7 @@
>>>>  #include <linux/blkdev.h>
>>>>  #include <linux/slab.h>
>>>>  #include <linux/crc32c.h>
>>>> +#include <linux/async_tx.h>
>>>>  #include <linux/raid/md_p.h>
>>>>  #include "md.h"
>>>>  #include "raid5.h"
>>>> @@ -84,6 +85,10 @@ struct ppl_conf {
>>>>  	mempool_t *io_pool;
>>>>  	struct bio_set *bs;
>>>>  	mempool_t *meta_pool;
>>>> +
>>>> +	/* used only for recovery */
>>>> +	int recovered_entries;
>>>> +	int mismatch_count;
>>>>  };
>>>>  
>>>>  struct ppl_log {
>>>> @@ -450,6 +455,467 @@ void ppl_stripe_write_finished(struct stripe_head *sh)
>>>>  		ppl_io_unit_finished(io);
>>>>  }
>>>>  
>>>> +static void ppl_xor(int size, struct page *page1, struct page *page2,
>>>> +		    struct page *page_result)
>>>> +{
>>>> +	struct async_submit_ctl submit;
>>>> +	struct dma_async_tx_descriptor *tx;
>>>> +	struct page *xor_srcs[] = { page1, page2 };
>>>> +
>>>> +	init_async_submit(&submit, ASYNC_TX_ACK|ASYNC_TX_XOR_DROP_DST,
>>>> +			  NULL, NULL, NULL, NULL);
>>>> +	tx = async_xor(page_result, xor_srcs, 0, 2, size, &submit);
>>>> +
>>>> +	async_tx_quiesce(&tx);
>>>> +}
>>>> +
>>>> +/*
>>>> + * PPL recovery strategy: xor partial parity and data from all modified data
>>>> + * disks within a stripe and write the result as the new stripe parity. If all
>>>> + * stripe data disks are modified (full stripe write), no partial parity is
>>>> + * available, so just xor the data disks.
>>>> + *
>>>> + * Recovery of a PPL entry shall occur only if all modified data disks are
>>>> + * available and read from all of them succeeds.
>>>> + *
>>>> + * A PPL entry applies to a stripe, partial parity size for an entry is at most
>>>> + * the size of the chunk. Examples of possible cases for a single entry:
>>>> + *
>>>> + * case 0: single data disk write:
>>>> + *   data0    data1    data2     ppl        parity
>>>> + * +--------+--------+--------+           +--------------------+
>>>> + * | ------ | ------ | ------ | +----+    | (no change)        |
>>>> + * | ------ | -data- | ------ | | pp | -> | data1 ^ pp         |
>>>> + * | ------ | -data- | ------ | | pp | -> | data1 ^ pp         |
>>>> + * | ------ | ------ | ------ | +----+    | (no change)        |
>>>> + * +--------+--------+--------+           +--------------------+
>>>> + * pp_size = data_size
>>>> + *
>>>> + * case 1: more than one data disk write:
>>>> + *   data0    data1    data2     ppl        parity
>>>> + * +--------+--------+--------+           +--------------------+
>>>> + * | ------ | ------ | ------ | +----+    | (no change)        |
>>>> + * | -data- | -data- | ------ | | pp | -> | data0 ^ data1 ^ pp |
>>>> + * | -data- | -data- | ------ | | pp | -> | data0 ^ data1 ^ pp |
>>>> + * | ------ | ------ | ------ | +----+    | (no change)        |
>>>> + * +--------+--------+--------+           +--------------------+
>>>> + * pp_size = data_size / modified_data_disks
>>>> + *
>>>> + * case 2: write to all data disks (also full stripe write):
>>>> + *   data0    data1    data2                parity
>>>> + * +--------+--------+--------+           +--------------------+
>>>> + * | ------ | ------ | ------ |           | (no change)        |
>>>> + * | -data- | -data- | -data- | --------> | xor all data       |
>>>> + * | ------ | ------ | ------ | --------> | (no change)        |
>>>> + * | ------ | ------ | ------ |           | (no change)        |
>>>> + * +--------+--------+--------+           +--------------------+
>>>> + * pp_size = 0
>>>> + *
>>>> + * The following cases are possible only in other implementations. The recovery
>>>> + * code can handle them, but they are not generated at runtime because they can
>>>> + * be reduced to cases 0, 1 and 2:
>>>> + *
>>>> + * case 3:
>>>> + *   data0    data1    data2     ppl        parity
>>>> + * +--------+--------+--------+ +----+    +--------------------+
>>>> + * | ------ | -data- | -data- | | pp |    | data1 ^ data2 ^ pp |
>>>> + * | ------ | -data- | -data- | | pp | -> | data1 ^ data2 ^ pp |
>>>> + * | -data- | -data- | -data- | | -- | -> | xor all data       |
>>>> + * | -data- | -data- | ------ | | pp |    | data0 ^ data1 ^ pp |
>>>> + * +--------+--------+--------+ +----+    +--------------------+
>>>> + * pp_size = chunk_size
>>>> + *
>>>> + * case 4:
>>>> + *   data0    data1    data2     ppl        parity
>>>> + * +--------+--------+--------+ +----+    +--------------------+
>>>> + * | ------ | -data- | ------ | | pp |    | data1 ^ pp         |
>>>> + * | ------ | ------ | ------ | | -- | -> | (no change)        |
>>>> + * | ------ | ------ | ------ | | -- | -> | (no change)        |
>>>> + * | -data- | ------ | ------ | | pp |    | data0 ^ pp         |
>>>> + * +--------+--------+--------+ +----+    +--------------------+
>>>> + * pp_size = chunk_size
>>>> + */
>>>> +static int ppl_recover_entry(struct ppl_log *log, struct ppl_header_entry *e,
>>>> +			     sector_t ppl_sector)
>>>> +{
>>>> +	struct ppl_conf *ppl_conf = log->ppl_conf;
>>>> +	struct mddev *mddev = ppl_conf->mddev;
>>>> +	struct r5conf *conf = mddev->private;
>>>> +	int block_size = ppl_conf->block_size;
>>>> +	struct page *pages;
>>>> +	struct page *page1;
>>>> +	struct page *page2;
>>>> +	sector_t r_sector_first;
>>>> +	sector_t r_sector_last;
>>>> +	int strip_sectors;
>>>> +	int data_disks;
>>>> +	int i;
>>>> +	int ret = 0;
>>>> +	char b[BDEVNAME_SIZE];
>>>> +	unsigned int pp_size = le32_to_cpu(e->pp_size);
>>>> +	unsigned int data_size = le32_to_cpu(e->data_size);
>>>> +
>>>> +	r_sector_first = le64_to_cpu(e->data_sector) * (block_size >> 9);
>>>> +
>>>> +	if ((pp_size >> 9) < conf->chunk_sectors) {
>>>> +		if (pp_size > 0) {
>>>> +			data_disks = data_size / pp_size;
>>>> +			strip_sectors = pp_size >> 9;
>>>> +		} else {
>>>> +			data_disks = conf->raid_disks - conf->max_degraded;
>>>> +			strip_sectors = (data_size >> 9) / data_disks;
>>>> +		}
>>>> +		r_sector_last = r_sector_first +
>>>> +				(data_disks - 1) * conf->chunk_sectors +
>>>> +				strip_sectors;
>>>> +	} else {
>>>> +		data_disks = conf->raid_disks - conf->max_degraded;
>>>> +		strip_sectors = conf->chunk_sectors;
>>>> +		r_sector_last = r_sector_first + (data_size >> 9);
>>>> +	}
>>>> +
>>>> +	pages = alloc_pages(GFP_KERNEL, 1);
>>>
>>> order != 0 allocation should be avoid if possible. Here sounds not necessary
>>
>> OK, I will allocate and free them separately.
>>
>>>> +	if (!pages)
>>>> +		return -ENOMEM;
>>>> +	page1 = pages;
>>>> +	page2 = pages + 1;
>>>> +
>>>> +	pr_debug("%s: array sector first: %llu last: %llu\n", __func__,
>>>> +		 (unsigned long long)r_sector_first,
>>>> +		 (unsigned long long)r_sector_last);
>>>> +
>>>> +	/* if start and end is 4k aligned, use a 4k block */
>>>> +	if (block_size == 512 &&
>>>> +	    (r_sector_first & (STRIPE_SECTORS - 1)) == 0 &&
>>>> +	    (r_sector_last & (STRIPE_SECTORS - 1)) == 0)
>>>> +		block_size = STRIPE_SIZE;
>>>> +
>>>> +	/* iterate through blocks in strip */
>>>> +	for (i = 0; i < strip_sectors; i += (block_size >> 9)) {
>>>> +		bool update_parity = false;
>>>> +		sector_t parity_sector;
>>>> +		struct md_rdev *parity_rdev;
>>>> +		struct stripe_head sh;
>>>> +		int disk;
>>>> +		int indent = 0;
>>>> +
>>>> +		pr_debug("%s:%*s iter %d start\n", __func__, indent, "", i);
>>>> +		indent += 2;
>>>> +
>>>> +		memset(page_address(page1), 0, PAGE_SIZE);
>>>> +
>>>> +		/* iterate through data member disks */
>>>> +		for (disk = 0; disk < data_disks; disk++) {
>>>> +			int dd_idx;
>>>> +			struct md_rdev *rdev;
>>>> +			sector_t sector;
>>>> +			sector_t r_sector = r_sector_first + i +
>>>> +					    (disk * conf->chunk_sectors);
>>>> +
>>>> +			pr_debug("%s:%*s data member disk %d start\n",
>>>> +				 __func__, indent, "", disk);
>>>> +			indent += 2;
>>>> +
>>>> +			if (r_sector >= r_sector_last) {
>>>> +				pr_debug("%s:%*s array sector %llu doesn't need parity update\n",
>>>> +					 __func__, indent, "",
>>>> +					 (unsigned long long)r_sector);
>>>> +				indent -= 2;
>>>> +				continue;
>>>> +			}
>>>> +
>>>> +			update_parity = true;
>>>> +
>>>> +			/* map raid sector to member disk */
>>>> +			sector = raid5_compute_sector(conf, r_sector, 0,
>>>> +						      &dd_idx, NULL);
>>>> +			pr_debug("%s:%*s processing array sector %llu => data member disk %d, sector %llu\n",
>>>> +				 __func__, indent, "",
>>>> +				 (unsigned long long)r_sector, dd_idx,
>>>> +				 (unsigned long long)sector);
>>>> +
>>>> +			rdev = conf->disks[dd_idx].rdev;
>>>> +			if (!rdev) {
>>>> +				pr_debug("%s:%*s data member disk %d missing\n",
>>>> +					 __func__, indent, "", dd_idx);
>>>> +				update_parity = false;
>>>> +				break;
>>>> +			}
>>>> +
>>>> +			pr_debug("%s:%*s reading data member disk %s sector %llu\n",
>>>> +				 __func__, indent, "", bdevname(rdev->bdev, b),
>>>> +				 (unsigned long long)sector);
>>>> +			if (!sync_page_io(rdev, sector, block_size, page2,
>>>> +					REQ_OP_READ, 0, false)) {
>>>> +				md_error(mddev, rdev);
>>>> +				pr_debug("%s:%*s read failed!\n", __func__,
>>>> +					 indent, "");
>>>> +				ret = -EIO;
>>>> +				goto out;
>>>> +			}
>>>> +
>>>> +			ppl_xor(block_size, page1, page2, page1);
>>>> +
>>>> +			indent -= 2;
>>>> +		}
>>>> +
>>>> +		if (!update_parity)
>>>> +			continue;
>>>> +
>>>> +		if (pp_size > 0) {
>>>> +			pr_debug("%s:%*s reading pp disk sector %llu\n",
>>>> +				 __func__, indent, "",
>>>> +				 (unsigned long long)(ppl_sector + i));
>>>> +			if (!sync_page_io(log->rdev,
>>>> +					ppl_sector - log->rdev->data_offset + i,
>>>> +					block_size, page2, REQ_OP_READ, 0,
>>>> +					false)) {
>>>> +				pr_debug("%s:%*s read failed!\n", __func__,
>>>> +					 indent, "");
>>>> +				md_error(mddev, log->rdev);
>>>> +				ret = -EIO;
>>>> +				goto out;
>>>> +			}
>>>> +
>>>> +			ppl_xor(block_size, page1, page2, page1);
>>>> +		}
>>>> +
>>>> +		/* map raid sector to parity disk */
>>>> +		parity_sector = raid5_compute_sector(conf, r_sector_first + i,
>>>> +				0, &disk, &sh);
>>>> +		BUG_ON(sh.pd_idx != le32_to_cpu(e->parity_disk));
>>>> +		parity_rdev = conf->disks[sh.pd_idx].rdev;
>>>> +
>>>> +		BUG_ON(parity_rdev->bdev->bd_dev != log->rdev->bdev->bd_dev);
>>>> +		pr_debug("%s:%*s write parity at sector %llu, disk %s\n",
>>>> +			 __func__, indent, "",
>>>> +			 (unsigned long long)parity_sector,
>>>> +			 bdevname(parity_rdev->bdev, b));
>>>> +		if (!sync_page_io(parity_rdev, parity_sector, block_size,
>>>> +				page1, REQ_OP_WRITE, 0, false)) {
>>>> +			pr_debug("%s:%*s parity write error!\n", __func__,
>>>> +				 indent, "");
>>>> +			md_error(mddev, parity_rdev);
>>>> +			ret = -EIO;
>>>> +			goto out;
>>>> +		}
>>>> +	}
>>>> +out:
>>>> +	__free_pages(pages, 1);
>>>> +	return ret;
>>>> +}
>>>> +
>>>> +static int ppl_recover(struct ppl_log *log, struct ppl_header *pplhdr)
>>>> +{
>>>> +	struct ppl_conf *ppl_conf = log->ppl_conf;
>>>> +	struct md_rdev *rdev = log->rdev;
>>>> +	struct mddev *mddev = rdev->mddev;
>>>> +	sector_t ppl_sector = rdev->ppl.sector + (PPL_HEADER_SIZE >> 9);
>>>> +	struct page *page;
>>>> +	int i;
>>>> +	int ret = 0;
>>>> +
>>>> +	page = alloc_page(GFP_KERNEL);
>>>> +	if (!page)
>>>> +		return -ENOMEM;
>>>> +
>>>> +	/* iterate through all PPL entries saved */
>>>> +	for (i = 0; i < le32_to_cpu(pplhdr->entries_count); i++) {
>>>> +		struct ppl_header_entry *e = &pplhdr->entries[i];
>>>> +		u32 pp_size = le32_to_cpu(e->pp_size);
>>>> +		sector_t sector = ppl_sector;
>>>> +		int ppl_entry_sectors = pp_size >> 9;
>>>> +		u32 crc, crc_stored;
>>>> +
>>>> +		pr_debug("%s: disk: %d entry: %d ppl_sector: %llu pp_size: %u\n",
>>>> +			 __func__, rdev->raid_disk, i,
>>>> +			 (unsigned long long)ppl_sector, pp_size);
>>>> +
>>>> +		crc = ~0;
>>>> +		crc_stored = le32_to_cpu(e->checksum);
>>>> +
>>>> +		/* read parial parity for this entry and calculate its checksum */
>>>> +		while (pp_size) {
>>>> +			int s = pp_size > PAGE_SIZE ? PAGE_SIZE : pp_size;
>>>> +
>>>> +			if (!sync_page_io(rdev, sector - rdev->data_offset,
>>>> +					s, page, REQ_OP_READ, 0, false)) {
>>>> +				md_error(mddev, rdev);
>>>> +				ret = -EIO;
>>>> +				goto out;
>>>> +			}
>>>> +
>>>> +			crc = crc32c_le(crc, page_address(page), s);
>>>> +
>>>> +			pp_size -= s;
>>>> +			sector += s >> 9;
>>>> +		}
>>>> +
>>>> +		crc = ~crc;
>>>> +
>>>> +		if (crc != crc_stored) {
>>>> +			/*
>>>> +			 * Don't recover this entry if the checksum does not
>>>> +			 * match, but keep going and try to recover other
>>>> +			 * entries.
>>>> +			 */
>>>> +			pr_debug("%s: ppl entry crc does not match: stored: 0x%x calculated: 0x%x\n",
>>>> +				 __func__, crc_stored, crc);
>>>> +			ppl_conf->mismatch_count++;
>>>> +		} else {
>>>> +			ret = ppl_recover_entry(log, e, ppl_sector);
>>>> +			if (ret)
>>>> +				goto out;
>>>> +			ppl_conf->recovered_entries++;
>>>> +		}
>>>> +
>>>> +		ppl_sector += ppl_entry_sectors;
>>>> +	}
>>>> +out:
>>>> +	__free_page(page);
>>>> +	return ret;
>>>> +}
>>>> +
>>>> +static int ppl_write_empty_header(struct ppl_log *log)
>>>> +{
>>>> +	struct page *page;
>>>> +	struct ppl_header *pplhdr;
>>>> +	struct md_rdev *rdev = log->rdev;
>>>> +	int ret = 0;
>>>> +
>>>> +	pr_debug("%s: disk: %d ppl_sector: %llu\n", __func__,
>>>> +		 rdev->raid_disk, (unsigned long long)rdev->ppl.sector);
>>>> +
>>>> +	page = alloc_page(GFP_NOIO | __GFP_ZERO);
>>>> +	if (!page)
>>>> +		return -ENOMEM;
>>>> +
>>>> +	pplhdr = page_address(page);
>>>> +	memset(pplhdr->reserved, 0xff, PPL_HDR_RESERVED);
>>>> +	pplhdr->signature = cpu_to_le32(log->ppl_conf->signature);
>>>> +	pplhdr->checksum = cpu_to_le32(~crc32c_le(~0, pplhdr, PAGE_SIZE));
>>>> +
>>>> +	if (!sync_page_io(rdev, rdev->ppl.sector - rdev->data_offset,
>>>> +			  PPL_HEADER_SIZE, page, REQ_OP_WRITE, 0, false)) {
>>>
>>> write with FUA? otherwise, the empty header might not set down in the disks.
>>
>> OK.
>>
>>>> +		md_error(rdev->mddev, rdev);
>>>> +		ret = -EIO;
>>>> +	}
>>>> +
>>>> +	__free_page(page);
>>>> +	return ret;
>>>> +}
>>>> +
>>>> +static int ppl_load_distributed(struct ppl_log *log)
>>>> +{
>>>> +	struct ppl_conf *ppl_conf = log->ppl_conf;
>>>> +	struct md_rdev *rdev = log->rdev;
>>>> +	struct mddev *mddev = rdev->mddev;
>>>> +	struct page *page;
>>>> +	struct ppl_header *pplhdr;
>>>> +	u32 crc, crc_stored;
>>>> +	u32 signature;
>>>> +	int ret = 0;
>>>> +
>>>> +	pr_debug("%s: disk: %d\n", __func__, rdev->raid_disk);
>>>> +
>>>> +	/* read PPL header */
>>>> +	page = alloc_page(GFP_KERNEL);
>>>> +	if (!page)
>>>> +		return -ENOMEM;
>>>> +
>>>> +	if (!sync_page_io(rdev, rdev->ppl.sector - rdev->data_offset,
>>>> +			  PAGE_SIZE, page, REQ_OP_READ, 0, false)) {
>>>> +		md_error(mddev, rdev);
>>>> +		ret = -EIO;
>>>> +		goto out;
>>>> +	}
>>>> +	pplhdr = page_address(page);
>>>> +
>>>> +	/* check header validity */
>>>> +	crc_stored = le32_to_cpu(pplhdr->checksum);
>>>> +	pplhdr->checksum = 0;
>>>> +	crc = ~crc32c_le(~0, pplhdr, PAGE_SIZE);
>>>> +
>>>> +	if (crc_stored != crc) {
>>>> +		pr_debug("%s: ppl header crc does not match: stored: 0x%x calculated: 0x%x\n",
>>>> +			 __func__, crc_stored, crc);
>>>> +		ppl_conf->mismatch_count++;
>>>> +		goto out;
>>>> +	}
>>>> +
>>>> +	signature = le32_to_cpu(pplhdr->signature);
>>>> +
>>>> +	if (mddev->external) {
>>>> +		/*
>>>> +		 * For external metadata the header signature is set and
>>>> +		 * validated in userspace.
>>>> +		 */
>>>> +		ppl_conf->signature = signature;
>>>> +	} else if (ppl_conf->signature != signature) {
>>>> +		pr_debug("%s: ppl header signature does not match: stored: 0x%x configured: 0x%x\n",
>>>> +			 __func__, signature, ppl_conf->signature);
>>>> +		ppl_conf->mismatch_count++;
>>>> +		goto out;
>>>> +	}
>>>> +
>>>> +	/* attempt to recover from log if we are starting a dirty array */
>>>> +	if (!mddev->pers && mddev->recovery_cp != MaxSector)
>>>> +		ret = ppl_recover(log, pplhdr);
>>>
>>> when could mddev->pers be null?
>>
>> When the array is inactive - we are starting the array with ppl, not
>> enabling ppl at runtime.
> 
> Didn't follow this. Can explain more?

mddev->pers is null if this is executed from md_run()/raid5_run().
That's when the array is just about to be started and we want to perform
the ppl recovery. If this is called from
raid5_change_consistency_policy(), mddev->pers is not null and recovery
should not be attempted, because the array is active.

Thanks,
Artur

^ permalink raw reply

* Re: [PATCH v2 09/13] md: raid1: use bio_segments_all()
From: Shaohua Li @ 2017-03-02  7:52 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, open list:SOFTWARE RAID (Multiple Disks) SUPPORT,
	linux-block, Christoph Hellwig
In-Reply-To: <CACVXFVOLLJ1Og7ezOsDTbuBsJg+qxJdqmf0MO+VDwDZiSoFn1g@mail.gmail.com>

On Thu, Mar 02, 2017 at 10:34:25AM +0800, Ming Lei wrote:
> Hi Shaohua,
> 
> On Wed, Mar 1, 2017 at 7:42 AM, Shaohua Li <shli@kernel.org> wrote:
> > On Tue, Feb 28, 2017 at 11:41:39PM +0800, Ming Lei wrote:
> >> Use this helper, instead of direct access to .bi_vcnt.
> >
> > what We really need to do for the behind IO is:
> > - allocate memory and copy bio data to the memory
> > - let behind bio do IO against the memory
> >
> > The behind bio doesn't need to have the exactly same bio_vec setting. If we
> > just track the new memory, we don't need use the bio_segments_all and access
> > bio_vec too.
> 
> But we need to figure out how many vecs(each vec store one page) to be
> allocated for the cloned/behind bio, and that is the only value of
> bio_segments_all() here. Or you have idea to avoid that?

As I said, the behind bio doesn't need to have the exactly same bio_vec
setting. We just allocate memory and copy original bio data to the memory,
then do IO against the new memory. The behind bio
segments == (bio->bi_iter.bi_size + PAGE_SIZE - 1) >> PAGE_SHIFT

Thanks,
Shaohua

^ permalink raw reply

* Re: [PATCH v2 13/13] md: raid10: avoid direct access to bvec table in handle_reshape_read_error
From: Shaohua Li @ 2017-03-02  7:47 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, open list:SOFTWARE RAID (Multiple Disks) SUPPORT,
	linux-block, Christoph Hellwig
In-Reply-To: <CACVXFVN8aE_Cd=ssBuhmVRK5YYRe2q4+jXoxsfztFLZiimH_xA@mail.gmail.com>

On Thu, Mar 02, 2017 at 10:37:49AM +0800, Ming Lei wrote:
> Hi Shaohua,
> 
> On Wed, Mar 1, 2017 at 7:46 AM, Shaohua Li <shli@kernel.org> wrote:
> > On Tue, Feb 28, 2017 at 11:41:43PM +0800, Ming Lei wrote:
> >> The cost is 128bytes(8*16) stack space in kernel thread context, and
> >> just use the bio helper to retrieve pages from bio.
> >>
> >> Signed-off-by: Ming Lei <tom.leiming@gmail.com>
> >> ---
> >>  drivers/md/raid10.c | 12 ++++++++++--
> >>  1 file changed, 10 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> >> index 0b97631e3905..6ffb64ab45f8 100644
> >> --- a/drivers/md/raid10.c
> >> +++ b/drivers/md/raid10.c
> >> @@ -4670,7 +4670,15 @@ static int handle_reshape_read_error(struct mddev *mddev,
> >>       struct r10bio *r10b = &on_stack.r10_bio;
> >>       int slot = 0;
> >>       int idx = 0;
> >> -     struct bio_vec *bvec = r10_bio->master_bio->bi_io_vec;
> >> +     struct bio_vec *bvl;
> >> +     struct page *pages[RESYNC_PAGES];
> >> +
> >> +     /*
> >> +      * This bio is allocated in reshape_request(), and size
> >> +      * is still RESYNC_PAGES
> >> +      */
> >> +     bio_for_each_segment_all(bvl, r10_bio->master_bio, idx)
> >> +             pages[idx] = bvl->bv_page;
> >
> > The reshape bio is doing IO against the memory we allocated for r10_bio, I'm
> > wondering why we can't get the pages from r10_bio. In this way, we don't need
> > access the bio_vec any more.
> 
> Reshap read is special and the bio(r10_bio->master_bio) isn't allocated from
> .r10buf_pool, please see reshape_request().

Why does this matter? The bio is still doing IO to the memory allocated in
r10_bio. bi_private->r10_bio->the pages.

My guess why you think the reshape read is special is the bio->bi_private
doesn't pointer to resync_pages. And since you let resync_pages pointer to
r10_bio and not vice versa, we can't find r10_bio, correct? I think this is
another reason why r10_bio embeds resync_pages (I mean a pointer to
resync_pages). With this way I suggested, the reshape read isn't special at
all. If we allocate memory for r10_bio/r1_bio, we shouldn't need access any bio
bvec.

Thanks,
Shaohua

^ permalink raw reply

* Re: [PATCH v1] mdadm: add checking clustered bitmap in assemble mode
From: Coly Li @ 2017-03-02  4:25 UTC (permalink / raw)
  To: Zhilong Liu, Jes.Sorensen; +Cc: linux-raid
In-Reply-To: <1488425157-24582-1-git-send-email-zlliu@suse.com>

On 2017/3/2 上午11:25, Zhilong Liu wrote:
> Both clustered and internal array don't need to specify
> --bitmap when assembling array.
> 
> Signed-off-by: Zhilong Liu <zlliu@suse.com>

Acked-by: Coly Li <colyli@suse.de>

> 
> diff --git a/mdadm.c b/mdadm.c
> index b5d89e4..4087f77 100644
> --- a/mdadm.c
> +++ b/mdadm.c
> @@ -1095,8 +1095,10 @@ int main(int argc, char *argv[])
>  				pr_err("bitmap file needed with -b in --assemble mode\n");
>  				exit(2);
>  			}
> -			if (strcmp(optarg, "internal") == 0) {
> -				pr_err("there is no need to specify --bitmap when assembling arrays with internal bitmaps\n");
> +			if (strcmp(optarg, "internal") == 0 ||
> +			    strcmp(optarg, "clustered") == 0) {
> +				pr_err("no need to specify --bitmap when assembling
> +					arrays with internal or clustered bitmap\n");
>  				continue;
>  			}
>  			bitmap_fd = open(optarg, O_RDWR);
> 


^ permalink raw reply

* [PATCH v1] mdadm: add checking clustered bitmap in assemble mode
From: Zhilong Liu @ 2017-03-02  3:25 UTC (permalink / raw)
  To: Jes.Sorensen; +Cc: linux-raid, Zhilong Liu

Both clustered and internal array don't need to specify
--bitmap when assembling array.

Signed-off-by: Zhilong Liu <zlliu@suse.com>

diff --git a/mdadm.c b/mdadm.c
index b5d89e4..4087f77 100644
--- a/mdadm.c
+++ b/mdadm.c
@@ -1095,8 +1095,10 @@ int main(int argc, char *argv[])
 				pr_err("bitmap file needed with -b in --assemble mode\n");
 				exit(2);
 			}
-			if (strcmp(optarg, "internal") == 0) {
-				pr_err("there is no need to specify --bitmap when assembling arrays with internal bitmaps\n");
+			if (strcmp(optarg, "internal") == 0 ||
+			    strcmp(optarg, "clustered") == 0) {
+				pr_err("no need to specify --bitmap when assembling
+					arrays with internal or clustered bitmap\n");
 				continue;
 			}
 			bitmap_fd = open(optarg, O_RDWR);
-- 
2.6.6


^ permalink raw reply related

* Re: GRUB warning after replacing disk drive in RAID1
From: Phil Turmel @ 2017-03-02  2:42 UTC (permalink / raw)
  To: Reindl Harald, linux-raid
In-Reply-To: <b900aca8-7f61-2516-8c6a-a9dd25ce2f17@thelounge.net>

On 03/01/2017 05:13 PM, Reindl Harald wrote:
> Am 01.03.2017 um 19:29 schrieb Phil Turmel:
>> Hi Peter, Reindl,
>>
>> { Convention on kernel.org is to reply-to-all, trim unneeded quoted
>> material, and bottom post or interleave.  Please do so. }
> 
> why should someone reply to the list and everyboy else subscribed to the
> list to trigger multiple copies?

Because kernel.org doesn't require subscriptions to post, and does
expect participants to include non-subscribers.  Since one can't
normally tell who is subscribed and who is not, reply-to-all is the rule
here.  Other lists have other policies.

^ permalink raw reply

* Re: [PATCH v2 13/13] md: raid10: avoid direct access to bvec table in handle_reshape_read_error
From: Ming Lei @ 2017-03-02  2:37 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Jens Axboe, open list:SOFTWARE RAID (Multiple Disks) SUPPORT,
	linux-block, Christoph Hellwig
In-Reply-To: <20170228234627.3h6tospuz33yhdig@kernel.org>

Hi Shaohua,

On Wed, Mar 1, 2017 at 7:46 AM, Shaohua Li <shli@kernel.org> wrote:
> On Tue, Feb 28, 2017 at 11:41:43PM +0800, Ming Lei wrote:
>> The cost is 128bytes(8*16) stack space in kernel thread context, and
>> just use the bio helper to retrieve pages from bio.
>>
>> Signed-off-by: Ming Lei <tom.leiming@gmail.com>
>> ---
>>  drivers/md/raid10.c | 12 ++++++++++--
>>  1 file changed, 10 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>> index 0b97631e3905..6ffb64ab45f8 100644
>> --- a/drivers/md/raid10.c
>> +++ b/drivers/md/raid10.c
>> @@ -4670,7 +4670,15 @@ static int handle_reshape_read_error(struct mddev *mddev,
>>       struct r10bio *r10b = &on_stack.r10_bio;
>>       int slot = 0;
>>       int idx = 0;
>> -     struct bio_vec *bvec = r10_bio->master_bio->bi_io_vec;
>> +     struct bio_vec *bvl;
>> +     struct page *pages[RESYNC_PAGES];
>> +
>> +     /*
>> +      * This bio is allocated in reshape_request(), and size
>> +      * is still RESYNC_PAGES
>> +      */
>> +     bio_for_each_segment_all(bvl, r10_bio->master_bio, idx)
>> +             pages[idx] = bvl->bv_page;
>
> The reshape bio is doing IO against the memory we allocated for r10_bio, I'm
> wondering why we can't get the pages from r10_bio. In this way, we don't need
> access the bio_vec any more.

Reshap read is special and the bio(r10_bio->master_bio) isn't allocated from
.r10buf_pool, please see reshape_request().

Thanks,
Ming Lei

^ permalink raw reply

* Re: [PATCH v2 09/13] md: raid1: use bio_segments_all()
From: Ming Lei @ 2017-03-02  2:34 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Jens Axboe, open list:SOFTWARE RAID (Multiple Disks) SUPPORT,
	linux-block, Christoph Hellwig
In-Reply-To: <20170228234212.3wjwxsegg4v57nxu@kernel.org>

Hi Shaohua,

On Wed, Mar 1, 2017 at 7:42 AM, Shaohua Li <shli@kernel.org> wrote:
> On Tue, Feb 28, 2017 at 11:41:39PM +0800, Ming Lei wrote:
>> Use this helper, instead of direct access to .bi_vcnt.
>
> what We really need to do for the behind IO is:
> - allocate memory and copy bio data to the memory
> - let behind bio do IO against the memory
>
> The behind bio doesn't need to have the exactly same bio_vec setting. If we
> just track the new memory, we don't need use the bio_segments_all and access
> bio_vec too.

But we need to figure out how many vecs(each vec store one page) to be
allocated for the cloned/behind bio, and that is the only value of
bio_segments_all() here. Or you have idea to avoid that?

Thanks,
Ming

^ permalink raw reply

* Re: [PATCH v2 06/13] md: raid1: don't use bio's vec table to manage resync pages
From: Ming Lei @ 2017-03-02  2:25 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Jens Axboe, open list:SOFTWARE RAID (Multiple Disks) SUPPORT,
	linux-block, Christoph Hellwig
In-Reply-To: <20170228233716.v3izfseg3dpdyexc@kernel.org>

Hi Shaohua,

On Wed, Mar 1, 2017 at 7:37 AM, Shaohua Li <shli@kernel.org> wrote:
> On Tue, Feb 28, 2017 at 11:41:36PM +0800, Ming Lei wrote:
>> Now we allocate one page array for managing resync pages, instead
>> of using bio's vec table to do that, and the old way is very hacky
>> and won't work any more if multipage bvec is enabled.
>>
>> The introduced cost is that we need to allocate (128 + 16) * raid_disks
>> bytes per r1_bio, and it is fine because the inflight r1_bio for
>> resync shouldn't be much, as pointed by Shaohua.
>>
>> Also the bio_reset() in raid1_sync_request() is removed because
>> all bios are freshly new now and not necessary to reset any more.
>>
>> This patch can be thought as a cleanup too
>>
>> Suggested-by: Shaohua Li <shli@kernel.org>
>> Signed-off-by: Ming Lei <tom.leiming@gmail.com>
>> ---
>>  drivers/md/raid1.c | 83 ++++++++++++++++++++++++++++++++++--------------------
>>  1 file changed, 53 insertions(+), 30 deletions(-)
>>
>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> index c442b4657e2f..900144f39630 100644
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -77,6 +77,16 @@ static void lower_barrier(struct r1conf *conf, sector_t sector_nr);
>>  #define raid1_log(md, fmt, args...)                          \
>>       do { if ((md)->queue) blk_add_trace_msg((md)->queue, "raid1 " fmt, ##args); } while (0)
>>
>> +static inline struct resync_pages *get_resync_pages(struct bio *bio)
>> +{
>> +     return bio->bi_private;
>> +}
>> +
>> +static inline struct r1bio *get_resync_r1bio(struct bio *bio)
>> +{
>> +     return get_resync_pages(bio)->raid_bio;
>> +}
>
> This is a weird between bio, r1bio and the resync_pages. I'd like the pages are

It is only a bit weird inside allocating and freeing r1bio, once all
are allocated, you
can see everthing is clean and simple:

    - r1bio includes lots of bioes,
    - and one bio is attached by one resync_pages via .bi_private

> embedded in r1bio. Maybe a pointer of r1bio to the pages. It's cleaner and more
> straightforward.

In theory, the cleanest way is to allocate one resync_pages for each resync bio,
but that isn't efficient. That is why this patch allocates all
resync_pages together
in r1buf_pool_alloc(), and split them into bio.

BTW, the only trick is just that the whole page array is stored in .bi_private
of the 1st bio, then we can avoid to add one pointer into r1bio.

>
> I think the patch 6, 7 8 should be in a same patch. Otherwise bisect will be broken.
>

No, it won't. Both patch 7 and patch 8 just replacing reading from bvec table
with reading from the pre-allocated page array.  Both should be fine, but the
latter is cleaner and simpler to do.

Thanks,
Ming

^ permalink raw reply

* Re: [PATCH v2 05/13] md: raid1: simplify r1buf_pool_free()
From: Ming Lei @ 2017-03-02  2:11 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Jens Axboe, open list:SOFTWARE RAID (Multiple Disks) SUPPORT,
	linux-block, Christoph Hellwig
In-Reply-To: <20170228233105.npftquaf3fr5sfpl@kernel.org>

Hi Shaohua,

On Wed, Mar 1, 2017 at 7:31 AM, Shaohua Li <shli@kernel.org> wrote:
> On Tue, Feb 28, 2017 at 11:41:35PM +0800, Ming Lei wrote:
>> This patch gets each page's reference of each bio for resync,
>> then r1buf_pool_free() gets simplified a lot.
>>
>> The same policy has been taken in raid10's buf pool allocation/free
>> too.
>
> We are going to delete the code, this simplify isn't really required.

Could you explain it a bit? Or I can put this patch into this patchset if you
can provide one?

Thanks,
Ming Lei

^ permalink raw reply

* Re: [PATCH v2 04/13] md: prepare for managing resync I/O pages in clean way
From: Ming Lei @ 2017-03-02  2:09 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Jens Axboe, open list:SOFTWARE RAID (Multiple Disks) SUPPORT,
	linux-block, Christoph Hellwig
In-Reply-To: <20170228233011.rdqtde22zwsimbz7@kernel.org>

Hi Shaohua,

On Wed, Mar 1, 2017 at 7:30 AM, Shaohua Li <shli@kernel.org> wrote:
> On Tue, Feb 28, 2017 at 11:41:34PM +0800, Ming Lei wrote:
>> Now resync I/O use bio's bec table to manage pages,
>> this way is very hacky, and may not work any more
>> once multipage bvec is introduced.
>>
>> So introduce helpers and new data structure for
>> managing resync I/O pages more cleanly.
>>
>> Signed-off-by: Ming Lei <tom.leiming@gmail.com>
>> ---
>>  drivers/md/md.h | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 54 insertions(+)
>>
>> diff --git a/drivers/md/md.h b/drivers/md/md.h
>> index 1d63239a1be4..b5a638d85cb4 100644
>> --- a/drivers/md/md.h
>> +++ b/drivers/md/md.h
>> @@ -720,4 +720,58 @@ static inline void mddev_check_writesame(struct mddev *mddev, struct bio *bio)
>>  #define RESYNC_BLOCK_SIZE (64*1024)
>>  #define RESYNC_PAGES ((RESYNC_BLOCK_SIZE + PAGE_SIZE-1) / PAGE_SIZE)
>>
>> +/* for managing resync I/O pages */
>> +struct resync_pages {
>> +     unsigned        idx;    /* for get/put page from the pool */
>> +     void            *raid_bio;
>> +     struct page     *pages[RESYNC_PAGES];
>> +};
>
> I'd like this to be embedded into r1bio directly. Not sure if we really need a
> structure.

There are two reasons we can't put this into r1bio:
- r1bio is used in both normal and resync I/O, not fair to allocate more
in normal I/O, and that is why this patch wouldn't like to touch r1bio or r10bio

- the count of 'struct resync_pages' instance depends on raid_disks(raid1)
or copies(raid10), both can't be decided during compiling.

>
>> +
>> +static inline int resync_alloc_pages(struct resync_pages *rp,
>> +                                  gfp_t gfp_flags)
>> +{
>> +     int i;
>> +
>> +     for (i = 0; i < RESYNC_PAGES; i++) {
>> +             rp->pages[i] = alloc_page(gfp_flags);
>> +             if (!rp->pages[i])
>> +                     goto out_free;
>> +     }
>> +
>> +     return 0;
>> +
>> + out_free:
>> +     while (--i >= 0)
>> +             __free_page(rp->pages[i]);
>> +     return -ENOMEM;
>> +}
>> +
>> +static inline void resync_free_pages(struct resync_pages *rp)
>> +{
>> +     int i;
>> +
>> +     for (i = 0; i < RESYNC_PAGES; i++)
>> +             __free_page(rp->pages[i]);
>
> Since we will use get_page, shouldn't this be put_page?

You are right, will fix in v3.

>
>> +}
>> +
>> +static inline void resync_get_all_pages(struct resync_pages *rp)
>> +{
>> +     int i;
>> +
>> +     for (i = 0; i < RESYNC_PAGES; i++)
>> +             get_page(rp->pages[i]);
>> +}
>> +
>> +static inline struct page *resync_fetch_page(struct resync_pages *rp)
>> +{
>> +     if (WARN_ON_ONCE(rp->idx >= RESYNC_PAGES))
>> +             return NULL;
>> +     return rp->pages[rp->idx++];
>
> I'd like the caller explicitly specify the index instead of a global variable
> to track it, which will make the code more understandable and less error prone.

That is fine, but the index has to be per bio, and finally the index
has to be stored
in 'struct resync_pages', so every user has to call it in the following way:

          resync_fetch_page(rp, rp->idx);

then looks no benefit to pass index explicitly.

>
>> +}
>> +
>> +static inline bool resync_page_available(struct resync_pages *rp)
>> +{
>> +     return rp->idx < RESYNC_PAGES;
>> +}
>
> Then we don't need this obscure API.

That is fine.


Thanks,
Ming Lei

^ permalink raw reply

* Re: GRUB warning after replacing disk drive in RAID1
From: Phil Turmel @ 2017-03-02  0:05 UTC (permalink / raw)
  To: Peter Sangas, 'Reindl Harald', linux-raid
In-Reply-To: <008401d292e6$b6d743e0$2485cba0$@wnsdev.com>

On 03/01/2017 06:51 PM, Peter Sangas wrote:
> 
>> { Convention on kernel.org is to reply-to-all, trim unneeded quoted material, and bottom post or interleave.  Please do so. }
> 
> Yes.  thank you.
> 
>> Grub1 needs its boot partitions to use v0.90 or v1.0 superblocks.  Grub2 needs the md module in its core to boot from v1.1 or v1.2 superblocks.
>> Anyways, because the content of a v1.2 array does not start at the beginning of the member devices, stupid grub doesn't connect sd[abc]1 with your /boot mount >and therefore delivers 'null'.
>> And then doesn't know how to link the core together.
> 
> can't say I understand all that but does this mean the server can't boot from the replacement drive?  

Correct.  But it shouldn't be able to boot from the others either, with
grub1.  Something has changed in your grub install since you set up the
original drives.  Only grub2 would be able to boot from your v1.2 md0.

{ Sorry.  I don't know how to fix grub2's md module. }

Phil


^ permalink raw reply

* RE: GRUB warning after replacing disk drive in RAID1
From: Peter Sangas @ 2017-03-01 23:51 UTC (permalink / raw)
  To: 'Phil Turmel', 'Reindl Harald', linux-raid
In-Reply-To: <5136f32b-a0bb-dc16-cae9-59946ea21622@turmel.org>


>{ Convention on kernel.org is to reply-to-all, trim unneeded quoted material, and bottom post or interleave.  Please do so. }

Yes.  thank you.

>Grub1 needs its boot partitions to use v0.90 or v1.0 superblocks.  Grub2 needs the md module in its core to boot from v1.1 or v1.2 superblocks.
>Anyways, because the content of a v1.2 array does not start at the beginning of the member devices, stupid grub doesn't connect sd[abc]1 with your /boot mount >and therefore delivers 'null'.
>And then doesn't know how to link the core together.

can't say I understand all that but does this mean the server can't boot from the replacement drive?  




^ permalink raw reply

* RE: GRUB warning after replacing disk drive in RAID1
From: Peter Sangas @ 2017-03-01 23:36 UTC (permalink / raw)
  To: 'Reindl Harald', linux-raid
In-Reply-To: <97135270-088a-69da-8789-c64f1054ee72@thelounge.net>

Hi Reindl,

My comments are interleaved. thanks:
___________________________________________________

>#!/bin/bash

>GOOD_DISK="/dev/sda"
>BAD_DISK="/dev/sdc"

># clone MBR
>dd if=$GOOD_DISK of=$BAD_DISK bs=512 count=1

Here I run the command sfdisk -d /dev/$GOOD_DISK | sfdisk -f /dev/$BAD_DISK.

I think dd and sfdisk are doing the same thing which is cloning the
partitions and copy the MBR ?

># force OS to read partition tables
>partprobe $BAD_DISK

Why run partprobe if the partitions have not changed?


># install bootloader on replacement disk grub2-install "$BAD_DISK"

Here don't you mean grub-install not grub2-install?  
___________________________________________________




^ permalink raw reply

* Re: [BUG] non-metadata arrays cannot use more than 27 component devices
From: Phil Turmel @ 2017-03-01 22:20 UTC (permalink / raw)
  To: Anthony Youngman, ian_bruce, linux-raid
In-Reply-To: <f14b00ef-17bd-d9ee-6153-2e92b0ce11ae@youngman.org.uk>

On 03/01/2017 02:50 PM, Anthony Youngman wrote:

> Because - and this is a point the kernel guys seem to forget - the
> whole point of having a computer system is TO RUN APPLICATIONS, not
> to run an OS.

If I were a kernel guy, I might be offended.  They run applications,
too, ya know.

> It's all very well saying lvm was created with this in mind, but if
> the system wasn't installed with this originally in mind, you're up a
> gum tree. My home system is raid but not lvm, for example - how do I
> back up the system while it's live? (In reality, I don't care :-)

This little tidbit would have helped.  If you can't redesign your system
to make consistent backups while running, you have to shut down to make
consistent backups.  Tautology, there.  I was arguing from the
assumption that you weren't too far gone to help.

It seems to me that if you can shut it down to make a consistent backup,
you can shut it down to re-jigger the device layering.  Possibly faster
to do that than take the backup, if you are willing take a small risk.

If you are using a database technology that supports checkpointed
backups while running, like PostgreSQL[1], you might get the backup you
need without downtime.  But fixing the device layering issue is what you
should do, imnsho.  Not taking a backup isn't really an option.

1) Create another raid, possibly degraded, and set it up as a PV in a
new volume group.

2) Stop your database and unmount home long enough to resize your
filesystem as small as practical, at least so it fits in an LV in the
new volume group with some free space for future snapshots.

3) Create an LV in the new VG big enough to hold that FS and dd it over.

4) If you need to, break redundant devices out of /dev/md/home to add to
the new array.  When you have the new raid to a comfortable level of
redundancy, mount the LV as /home and resume operations.

5) Finish breaking down the previous array and possibly using its
devices to bolster the redundancy on the new array.

If you are feeling brave, you could use --build mode instead of (3) to
resume running on /dev/md/builthome (comprised of the shrunk
/dev/md/home and the LV) instead of using dd.  When the resync is done,
you'd take a second short outage to unmount /dev/md/builthome and mount
/dev/vg0/home.

Phil

[1] https://www.postgresql.org/docs/9.5/static/continuous-archiving.html

(See the pg_basebackup utility and the description of the API it uses.)

^ permalink raw reply

* Re: GRUB warning after replacing disk drive in RAID1
From: Reindl Harald @ 2017-03-01 22:13 UTC (permalink / raw)
  To: linux-raid
In-Reply-To: <5136f32b-a0bb-dc16-cae9-59946ea21622@turmel.org>



Am 01.03.2017 um 19:29 schrieb Phil Turmel:
> Hi Peter, Reindl,
>
> { Convention on kernel.org is to reply-to-all, trim unneeded quoted
> material, and bottom post or interleave.  Please do so. }

why should someone reply to the list and everyboy else subscribed to the 
list to trigger multiple copies?

> On 02/28/2017 06:15 PM, Peter Sangas wrote:
>
>> cat /proc/mdstat
>> Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4]
>> [raid10]
>
>> md0 : active raid1 sdc1[3] sdb1[1] sda1[0]
>>       19514368 blocks super 1.2 [3/3] [UUU]
>
> Grub1 needs its boot partitions to use v0.90 or v1.0 superblocks.  Grub2
> needs the md module in its core to boot from v1.1 or v1.2 superblocks.
> Anyways, because the content of a v1.2 array does not
> start at the beginning of the member devices, stupid grub doesn't
> connect sd[abc]1 with your /boot mount and therefore delivers 'null'.
> And then doesn't know how to link the core together.
>
> Since this worked before, I would guess your grub was updated and its md
> support was left out.  Hopefully someone with more grub experience can
> chip in here -- I don't use any bootloader on my servers any more

i am not the OP *but* i can assure you that i get the same warnings when 
the /boot RAID1 is not synced, afetr hat grub2-install works just fine 
without that warning and so you can be sure grub has no problem and is 
missing nothing, otherwise the command won't work without warnings afetr 
all arrays are in ync again

[root@srv-rhsoft:~]$ cat /proc/mdstat
Personalities : [raid1] [raid10]
md2 : active raid10 sdc3[4] sdd3[7] sda3[6] sdb3[5]
       3875222528 blocks super 1.1 512K chunks 2 near-copies [4/4] [UUUU]

md0 : active raid1 sdc1[4] sdd1[7] sdb1[5] sda1[6]
       511988 blocks super 1.0 [4/4] [UUUU]

md1 : active raid10 sdc2[4] sdd2[7] sdb2[5] sda2[6]
       30716928 blocks super 1.1 512K chunks 2 near-copies [4/4] [UUUU]

^ permalink raw reply

* Re: [PATCH] Fix oddity where mdadm did not recognise a relative path
From: jes.sorensen @ 2017-03-01 22:03 UTC (permalink / raw)
  To: Wols Lists; +Cc: linux-raid
In-Reply-To: <587E6B56.8090704@youngman.org.uk>

Wols Lists <antlists@youngman.org.uk> writes:
> From 4ce784307a9004124392ce48432960d7ca94d0bf Mon Sep 17 00:00:00 2001
> From: Wol <anthony@youngman.org.uk>
> Date: Tue, 17 Jan 2017 17:47:05 +0000
> Subject: [PATCH] Fix oddity where mdadm did not recognise a relative path
>
> mdadm assumed that a pathname started with a "/", while an array
> name didn't. This alters the logic so that if the first character
> is not a "/" it tries to open an array, and if that fails it drops
> through to the pathname code rather than terminating immediately
> with an error.
>
> Signed-off-by: Wol <anthony@youngman.org.uk>
> ---
>  mdadm.c | 12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/mdadm.c b/mdadm.c
> index c3a265b..b5d89e4 100644
> --- a/mdadm.c
> +++ b/mdadm.c
> @@ -1899,12 +1899,12 @@ static int misc_list(struct mddev_dev *devlist,
>  			rv |= SetAction(dv->devname, c->action);
>  			continue;
>  		}
> -		if (dv->devname[0] == '/')
> -			mdfd = open_mddev(dv->devname, 1);
> -		else {
> -			mdfd = open_dev(dv->devname);
> -			if (mdfd < 0)
> -				pr_err("Cannot open %s\n", dv->devname);
> +		switch(dv->devname[0] == '/') {
> +			case 0:
> +				mdfd = open_dev(dv->devname);
> +				if (mdfd >= 0) break;
> +			case 1:
> +				mdfd = open_mddev(dv->devname, 1);  

I thought this was still pending but going back, it looks like I did
apply it. The problem is I noticed a major issue in the patch, which
doesn't belong in code, it is doing 'if () break' on the same line.

Mind sending me a patch to fix that.

Thanks,
Jes

^ permalink raw reply

* Re: [BUG] non-metadata arrays cannot use more than 27 component devices
From: Anthony Youngman @ 2017-03-01 19:50 UTC (permalink / raw)
  To: Phil Turmel, ian_bruce, linux-raid
In-Reply-To: <2d168cc1-cddc-5e0c-610b-97cbd08de621@turmel.org>



On 01/03/17 18:13, Phil Turmel wrote:
> On 03/01/2017 12:23 PM, Phil Turmel wrote:
>> I strongly disagree.  This procedure, as shown, is an admin cock-up:
>>
>>> mdadm --build /dev/mdbackup --device-count 2 /dev/md/home missing
>>> ... hotplug sd-big ...
>>> madam /dev/mdbackup --add /dev/sd-big
>>> ... wait for sync to finish ...
>>> mdadm --stop mdbackup
>>> ... unplug sd-big ...
>
> One more point.  The above is functionally identical in every respect
> to just:
>
> # dd if=/dev/md/home of=/dev/sd-big bs=1M
>
> Why are you bothering to --build an array?
>
Because - and this is a point the kernel guys seem to forget - the whole 
point of having a computer system is TO RUN APPLICATIONS, not to run an OS.

As it is, you picked up on the fatal flaw I'd spotted, namely that if 
"home" is mounted, "backup" is going to be corrupt :-( Defeating the 
entire purpose of my idea, which was to back up a running system without 
the need to take down the system to ensure integrity.

I work with a database that, not unreasonably, seeks to cache loads of 
stuff in RAM. I've come across far too many horror stories of corrupt 
backups because the database hadn't flushed its buffers to the OS, so 
all the database files on disk were inconsistent, giving a corrupt 
backup. So the idea was set up the mirror, flush/quiesce the database, 
break the mirror, wake up the database. System disabled for a matter of 
seconds.

It's all very well saying lvm was created with this in mind, but if the 
system wasn't installed with this originally in mind, you're up a gum 
tree. My home system is raid but not lvm, for example - how do I back up 
the system while it's live? (In reality, I don't care :-)

IF it didn't have that fatal flaw, my idea would have been able to back 
up any system. Oh well, it's flawed, time to drop it :-(

Cheers,
Wol

^ permalink raw reply

* Re: GRUB warning after replacing disk drive in RAID1
From: Phil Turmel @ 2017-03-01 18:29 UTC (permalink / raw)
  To: Peter Sangas, 'Reindl Harald', linux-raid
In-Reply-To: <009601d29218$9fd3e460$df7bad20$@wnsdev.com>

Hi Peter, Reindl,

{ Convention on kernel.org is to reply-to-all, trim unneeded quoted
material, and bottom post or interleave.  Please do so. }

On 02/28/2017 06:15 PM, Peter Sangas wrote:

> cat /proc/mdstat
> Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4]
> [raid10] 

> md0 : active raid1 sdc1[3] sdb1[1] sda1[0]
>       19514368 blocks super 1.2 [3/3] [UUU]

Grub1 needs its boot partitions to use v0.90 or v1.0 superblocks.  Grub2
needs the md module in its core to boot from v1.1 or v1.2 superblocks.
Anyways, because the content of a v1.2 array does not
start at the beginning of the member devices, stupid grub doesn't
connect sd[abc]1 with your /boot mount and therefore delivers 'null'.
And then doesn't know how to link the core together.

Since this worked before, I would guess your grub was updated and its md
support was left out.  Hopefully someone with more grub experience can
chip in here -- I don't use any bootloader on my servers any more.

Phil

^ permalink raw reply

* Re: [BUG] non-metadata arrays cannot use more than 27 component devices
From: Phil Turmel @ 2017-03-01 18:13 UTC (permalink / raw)
  To: Wols Lists, ian_bruce, linux-raid
In-Reply-To: <27373f29-a8ea-012c-102d-9fadf05cb3c2@turmel.org>

On 03/01/2017 12:23 PM, Phil Turmel wrote:
> I strongly disagree.  This procedure, as shown, is an admin cock-up:
> 
>> mdadm --build /dev/mdbackup --device-count 2 /dev/md/home missing
>> ... hotplug sd-big ...
>> madam /dev/mdbackup --add /dev/sd-big
>> ... wait for sync to finish ...
>> mdadm --stop mdbackup
>> ... unplug sd-big ...

One more point.  The above is functionally identical in every respect
to just:

# dd if=/dev/md/home of=/dev/sd-big bs=1M

Why are you bothering to --build an array?

Phil


^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox