Linux RAID subsystem development

Linux RAID subsystem development
 help / color / mirror / Atom feed

* Re: [PATCH 1/7] imsm: metadata changes for PPL
From: Artur Paszkiewicz @ 2016-11-29 10:47 UTC (permalink / raw)
  To: Jes Sorensen; +Cc: linux-raid
In-Reply-To: <wrfjpolfjibn.fsf@redhat.com>

On 11/29/2016 12:27 AM, Jes Sorensen wrote:
> Artur Paszkiewicz <artur.paszkiewicz@intel.com> writes:
>> Updates for the IMSM metadata format, including PPL header structures.
>>
>> Extend imsm_vol dirty field adding a third value, which is required to
>> enable PPL recovery in Windows and UEFI drivers.
>>
>> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
>> ---
>>  super-intel.c | 48 ++++++++++++++++++++++++++++++++++++++++--------
>>  1 file changed, 40 insertions(+), 8 deletions(-)
> 
> A couple of comments on this change:
> 
>> diff --git a/super-intel.c b/super-intel.c
>> index 5740088..69f6201 100644
>> --- a/super-intel.c
>> +++ b/super-intel.c
>> @@ -154,6 +157,9 @@ struct imsm_vol {
>>  #define MIGR_STATE_CHANGE 4
>>  #define MIGR_REPAIR 5
>>  	__u8  migr_type;	/* Initializing, Rebuilding, ... */
>> +#define RAIDVOL_CLEAN          0
>> +#define RAIDVOL_DIRTY          1
>> +#define RAIDVOL_DSRECORD_VALID 2
>>  	__u8  dirty;
>>  	__u8  fs_state;		/* fast-sync state for CnG (0xff == disabled) */
>>  	__u16 verify_errors;	/* number of mismatches */
>> @@ -189,7 +195,30 @@ struct imsm_dev {
>>  	__u16 cache_policy;
>>  	__u8  cng_state;
>>  	__u8  cng_sub_state;
>> -#define IMSM_DEV_FILLERS 10
>> +	__u16 my_vol_raid_dev_num; /* Used in Unique volume Id for this RaidDev */
>> +
>> +	/* NVM_EN */
>> +	__u8 nv_cache_mode;
>> +	union {
>> +		__u8 nv_cache_flags;
>> +		struct {
>> +		    __u8 dirty:1; /* 1 - cache is dirty, 0 - clean */
>> +		    __u8 nvc_health:2;
>> +		    __u8 expansion_bytes:5;
>> +		} nvCache;
>> +	};
> 
> This sets off alarm bells here - you declare a union of a u8 and a
> matching bitfield, but do not handle bit endian bitfields. If someone
> tried to use this on a big endian system this could get messy.

Those fields are not used in the code at all, the only thing added to
this structure that is used is 'rwh_policy'. The rest is to fill the
gaps between IMSM format in UEFI/Windows.

> 
>> @@ -7565,12 +7594,15 @@ mark_checkpoint:
>>  
>>  skip_mark_checkpoint:
>>  	/* mark dirty / clean */
>> -	if (dev->vol.dirty != !consistent) {
>> +	if ((dev->vol.dirty & RAIDVOL_DIRTY) != !consistent) {
> 
> This part makes my head spin (to be honest, the original code did
> too). I had to go write a small snipped of C to figure out what the
> compiler does with !x given x > 0, since consistent can be 0, 1, and 2
> here.
> 
> Basically for RAIDVOL_CLEAN and consistent = 0, and RAIDVOL_DIRTY and
> consistent = 1 and 2, this statement triggers. Maybe I am just terrible
> dense when it comes to reading obfuscated C, but could we change this
> round to a construct that is a little easier to read?

Yes, I admit it is confusing. Does this look better?

-       if (dev->vol.dirty != !consistent) {
+       if (((dev->vol.dirty & RAIDVOL_DIRTY) && consistent) ||
+           (!(dev->vol.dirty & RAIDVOL_DIRTY) && !consistent)) {

Thanks,
Artur


^ permalink raw reply

* Re: [RFC PATCH] crypto: Add IV generation algorithms
From: Herbert Xu @ 2016-11-29  7:50 UTC (permalink / raw)
  To: Binoy Jayan
  Cc: Oded, Ofir, David S. Miller, linux-crypto, Mark Brown,
	Arnd Bergmann, Linux kernel mailing list, Alasdair Kergon,
	Mike Snitzer, dm-devel, Shaohua Li, linux-raid
In-Reply-To: <CAHv-k___PJA9t=V8Kh1_WVq=K_yBfmD2rUzE-xAd_SCrEgOjUA@mail.gmail.com>

On Tue, Nov 29, 2016 at 01:16:46PM +0530, Binoy Jayan wrote:
> 
> No one is using it as of now. It was just a thought to pass context
> information, instead of making it part of the context which is shared
> among dm-crypt and geniv.

OK in that case we should just get rid of it until it's actually
needed.  In any case, if there was a need for such information we
should try to embed it into either the key (per-tfm) or the IV
(per-request) as appropriate.

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [RFC PATCH] crypto: Add IV generation algorithms
From: Binoy Jayan @ 2016-11-29  7:46 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Oded, Ofir, David S. Miller, linux-crypto, Mark Brown,
	Arnd Bergmann, Linux kernel mailing list, Alasdair Kergon,
	Mike Snitzer, dm-devel, Shaohua Li, linux-raid
In-Reply-To: <20161129072824.GA5240@gondor.apana.org.au>

Hi Herbert,

On 29 November 2016 at 12:58, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> But that begs the question, who is supposed to use crypto_geniv_set_ctx?
> I thought it was dm-crypt but your patch doesn't contain any uses
> of it at all.

No one is using it as of now. It was just a thought to pass context
information, instead of making it part of the context which is shared
among dm-crypt and geniv.

-Binoy

^ permalink raw reply

* Re: [RFC PATCH] crypto: Add IV generation algorithms
From: Herbert Xu @ 2016-11-29  7:28 UTC (permalink / raw)
  To: Binoy Jayan
  Cc: Oded, Ofir, David S. Miller, linux-crypto, Mark Brown,
	Arnd Bergmann, Linux kernel mailing list, Alasdair Kergon,
	Mike Snitzer, dm-devel, Shaohua Li, linux-raid
In-Reply-To: <CAHv-k_93FRosd4GeW3NB6fPmjYUNZoe+p+i4ZnwBQqB605bxUA@mail.gmail.com>

On Tue, Nov 29, 2016 at 10:15:40AM +0530, Binoy Jayan wrote:
>
> Thank you for the reply. The dm-crypt changes are also included as
> part of this patchset. It has been tested for functionality as well.
> More information can be found in the cover letter including the test
> procedure etc.
> 
> https://lkml.org/lkml/2016/11/21/168

Sorry, I don't know how I missed that :)

But that begs the question, who is supposed to use crypto_geniv_set_ctx?
I thought it was dm-crypt but your patch doesn't contain any uses
of it at all.

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [RFC PATCH] crypto: Add IV generation algorithms
From: Binoy Jayan @ 2016-11-29  4:45 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Oded, Ofir, David S. Miller, linux-crypto, Mark Brown,
	Arnd Bergmann, Linux kernel mailing list, Alasdair Kergon,
	Mike Snitzer, dm-devel, Shaohua Li, linux-raid
In-Reply-To: <20161128124722.GB2257@gondor.apana.org.au>

On 28 November 2016 at 18:17, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> On Mon, Nov 21, 2016 at 03:40:09PM +0530, Binoy Jayan wrote:
>> Currently, the iv generation algorithms are implemented in dm-crypt.c.
>> The goal is to move these algorithms from the dm layer to the kernel
>> crypto layer by implementing them as template ciphers so they can be used
>> in relation with algorithms like aes, and with multiple modes like cbc,
>> ecb etc. As part of this patchset, the iv-generation code is moved from the
>> dm layer to the crypto layer. The dm-layer can later be optimized to
>> encrypt larger block sizes in a single call to the crypto engine. The iv
>> generation algorithms implemented in geniv.c includes plain, plain64,
>> essiv, benbi, null, lmk and tcw. These templates are to be configured
>> and has to be invoked as:
>>
>> crypto_alloc_skcipher("plain(cbc(aes))", 0, 0);
>> crypto_alloc_skcipher("essiv(cbc(aes))", 0, 0);
>> ...
>>
>> from the dm layer.
>>
>> Signed-off-by: Binoy Jayan <binoy.jayan@linaro.org>
>
> Thanks a lot for working on this!
>
>> +static int crypto_geniv_set_ctx(struct crypto_skcipher *cipher,
>> +                             void *newctx, unsigned int len)
>> +{
>> +     struct geniv_ctx *ctx = crypto_skcipher_ctx(cipher);
>> +     /*
>> +      * TODO:
>> +      * Do we really need this API or can we append the context
>> +      * 'struct geniv_ctx' to the cipher from dm-crypt and use
>> +      * the same here.
>> +      */
>> +     memcpy(ctx, (char *) newctx, len);
>> +     return geniv_setkey_init_ctx(&ctx->data);
>> +}
>
> I think we should be able to do without adding another API for this.
> Can we instead put the information into the key and/or the IV?
>
> Also it would really help if you could attach a patch for dm-crypt
> that goes along with this (it doesn't have to be complete or
> functional, just showing how this is meant to be used)


Hi Herbert,

Thank you for the reply. The dm-crypt changes are also included as
part of this patchset. It has been tested for functionality as well.
More information can be found in the cover letter including the test
procedure etc.

https://lkml.org/lkml/2016/11/21/168

Thanks,
Binoy

^ permalink raw reply

* [PATCH] md/raid5-cache: do not need to set STRIPE_PREREAD_ACTIVE repeatedly
From: JackieLiu @ 2016-11-29  3:57 UTC (permalink / raw)
  To: songliubraving; +Cc: liuzhengyuan, linux-raid, JackieLiu

R5c_make_stripe_write_out has set this flag, do not need to set again.

Signed-off-by: JackieLiu <liuyun01@kylinos.cn>
---
 drivers/md/raid5-cache.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 95bd51e..aa89de0 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -1247,8 +1247,6 @@ static void r5c_flush_stripe(struct r5conf *conf, struct stripe_head *sh)
 	atomic_inc(&conf->active_stripes);
 	r5c_make_stripe_write_out(sh);
 
-	if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
-		atomic_inc(&conf->preread_active_stripes);
 	raid5_release_stripe(sh);
 }
 
-- 
2.10.2




^ permalink raw reply related

* [PATCH] md/raid5-cache: remove the unnecessary next_cp_seq field from the r5l_log
From: JackieLiu @ 2016-11-29  3:13 UTC (permalink / raw)
  To: shli; +Cc: songliubraving, liuzhengyuan, linux-raid, JackieLiu

The next_cp_seq field is useless, remove it.

Signed-off-by: JackieLiu <liuyun01@kylinos.cn>
---
 drivers/md/raid5-cache.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 5f817bd..95bd51e 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -113,7 +113,6 @@ struct r5l_log {
 	u64 seq;			/* log head sequence */
 
 	sector_t next_checkpoint;
-	u64 next_cp_seq;
 
 	struct mutex io_mutex;
 	struct r5l_io_unit *current_io;	/* current io_unit accepting new data */
@@ -1075,7 +1074,6 @@ static bool r5l_complete_finished_ios(struct r5l_log *log)
 			break;
 
 		log->next_checkpoint = io->log_start;
-		log->next_cp_seq = io->seq;
 
 		list_del(&io->log_sibling);
 		mempool_free(io, log->io_pool);
-- 
2.10.2




^ permalink raw reply related

* RE: md prefered minor has been renumbered
From: Randall C. Grimshaw @ 2016-11-29  1:12 UTC (permalink / raw)
  To: Phil Turmel; +Cc: Linux-RAID
In-Reply-To: <5b2eb59e-7447-da7f-bb0a-85866f2146ef@turmel.org>

________________________________________
From: Phil Turmel [philip@turmel.org]
Sent: Monday, November 28, 2016 6:10 PM
To: Randall C. Grimshaw
Cc: Linux-RAID
Subject: Re: md prefered minor has been renumbered

Hi Randall,

{ Convention on kernel.org is reply-to-all and to avoid top-posting. I
added the list back. }

On 11/28/2016 05:56 PM, Randall C. Grimshaw wrote:
> Thank you Phil:
>   the linux rescue mode didn't have blkid but I had the uuid's from mdadm --examine,
> so I tried uuid=xxxxx in the fstab file. no joy. it is failing in the load of the ramfs, after assembly but before fstab.
>   it appears that I am caught in the catch22 where I cannot update initramfs until I minimally fix the duplicate md122 problem so that I can assemble /var and mount it.
> Randall

You must use the form UUID="xxxx" with caps and quotes.

you can test with mount UUID="....." /path/to/mount/point to check.

Phil
----------------------------------

Phil:
  Sorry for the typos and the top posts. Even with Caps and Quotes I was still in trouble.
sequentially it complains that md122 is already running - it cannot run /sdb5 (the duplicate)
then it tries to resume from /dev/md1 -- (md1 remains as swap) then a bunch of indicators that the initramfs
is unable to mount everything... but the post I found was basically correct.
I suspect it was the conflicting devices that made my situation different but the solution was to specify the 
device to mdadm using uuid= (thank you for that tip)
so my assembly command looked like: mdadm -A --uuid=xxxx --update=super-minor /dev/mdx
it solved the conflicting devices and returned my system to running state.
[SOLVED]

^ permalink raw reply

* Re: [PATCH] RFC: dm: avoid the mutex lock in dm_bufio_shrink_count()
From: David Rientjes @ 2016-11-29  0:56 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Douglas Anderson, Mike Snitzer, shli, Dmitry Torokhov,
	linux-kernel, linux-raid, dm-devel, linux, Sonny Rao,
	Alasdair Kergon
In-Reply-To: <alpine.LRH.2.02.1611231531340.31481@file01.intranet.prod.int.rdu2.redhat.com>

On Wed, 23 Nov 2016, Mikulas Patocka wrote:

> From: Mikulas Patocka <mpatocka@redhat.com>
> 
> dm-bufio: don't take the lock in dm_bufio_shrink_count
> 
> dm_bufio_shrink_count is called from do_shrink_slab to find out how many
> freeable objects are there. The reported value doesn't have to be precise,
> so we don't need to take the dm-bufio lock.
> 
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

Acked-by: David Rientjes <rientjes@google.com>

^ permalink raw reply

* Re: Question about raid1_make_request
From: Robert LeBlanc @ 2016-11-29  0:04 UTC (permalink / raw)
  To: doug; +Cc: linux-raid
In-Reply-To: <CAFx4rwQQrETbpi-kQ56wj3izmvSpDiCQ3iD6dyK2LkkM-JZBTg@mail.gmail.com>

I think I looked at the comments to that function and then didn't read
the code too closely. A few extra tests and jumps, but in the grand
scheme of things when dealing with disks, not noticeable like I
thought before. Thanks for pointing it out.
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Mon, Nov 28, 2016 at 4:53 PM, Doug Dumitru <doug@easyco.com> wrote:
> Not the best code organization, but no real harm done.  Look at the
> front of md_write_start:
>
> void md_write_start(struct mddev *mddev, struct bio *bi)
> {
>         int did_change = 0;
>         if (bio_data_dir(bi) != WRITE)
>                 return;
>
> So writes fall thru really quickly.
>
> Doug
>
>
> On Mon, Nov 28, 2016 at 3:41 PM, Robert LeBlanc <robert@leblancnet.us> wrote:
>> I'm looking through the code and am wondering why this is at the start
>> of raid1_make_request. It seems that it is only needed for WRITE and
>> would unnecessarily delay READ requests waiting for the superblock.
>> Would it make more sense to put this at the start of the WRITE branch?
>> What am I missing?
>>
>> 1057         /*
>> 1058          * Register the new request and wait if the
>> reconstruction
>> 1059          * thread has put up a bar for new requests.
>> 1060          * Continue immediately if no resync is active currently.
>> 1061          */
>> 1062
>> 1063         md_write_start(mddev, bio); /* wait on superblock update
>> early */
>> 1064
>> 1065         if (bio_data_dir(bio) == WRITE &&
>> 1066             ((bio_end_sector(bio) > mddev->suspend_lo &&
>> 1067             bio->bi_iter.bi_sector < mddev->suspend_hi) ||
>> 1068             (mddev_is_clustered(mddev) &&
>> 1069              md_cluster_ops->area_resyncing(mddev, WRITE,
>> 1070                      bio->bi_iter.bi_sector,
>> bio_end_sector(bio))))) {
>> 1071                 /* As the suspend_* range is controlled by
>> 1072                  * userspace, we want an interruptible
>> 1073                  * wait.
>> 1074                  */
>> 1075                 DEFINE_WAIT(w);
>> 1076                 for (;;) {
>> 1077                         flush_signals(current);
>> 1078                         prepare_to_wait(&conf->wait_barrier,
>> 1079                                         &w, TASK_INTERRUPTIBLE);
>> 1080                         if (bio_end_sector(bio) <=
>> mddev->suspend_lo ||
>> 1081                             bio->bi_iter.bi_sector >=
>> mddev->suspend_hi ||
>> 1082                             (mddev_is_clustered(mddev) &&
>> 1083
>> !md_cluster_ops->area_resyncing(mddev, WRITE,
>> 1084                                      bio->bi_iter.bi_sector,
>> bio_end_sector(bio))))
>> 1085                                 break;
>> 1086                         schedule();
>> 1087                 }
>> 1088                 finish_wait(&conf->wait_barrier, &w);
>> 1089         }
>> 1090
>> 1091         start_next_window = wait_barrier(conf, bio);
>> 1092
>> 1093         bitmap = mddev->bitmap;
>> 1094
>>
>> Thanks
>>
>> ----------------
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
>
> --
> Doug Dumitru
> EasyCo LLC

^ permalink raw reply

* Re: Question about raid1_make_request
From: Doug Dumitru @ 2016-11-28 23:53 UTC (permalink / raw)
  To: Robert LeBlanc; +Cc: linux-raid
In-Reply-To: <CAANLjFpb0ZCEVtvaib8nsoKhKZs55agy35HEByeo1g1OSWBGjg@mail.gmail.com>

Not the best code organization, but no real harm done.  Look at the
front of md_write_start:

void md_write_start(struct mddev *mddev, struct bio *bi)
{
        int did_change = 0;
        if (bio_data_dir(bi) != WRITE)
                return;

So writes fall thru really quickly.

Doug


On Mon, Nov 28, 2016 at 3:41 PM, Robert LeBlanc <robert@leblancnet.us> wrote:
> I'm looking through the code and am wondering why this is at the start
> of raid1_make_request. It seems that it is only needed for WRITE and
> would unnecessarily delay READ requests waiting for the superblock.
> Would it make more sense to put this at the start of the WRITE branch?
> What am I missing?
>
> 1057         /*
> 1058          * Register the new request and wait if the
> reconstruction
> 1059          * thread has put up a bar for new requests.
> 1060          * Continue immediately if no resync is active currently.
> 1061          */
> 1062
> 1063         md_write_start(mddev, bio); /* wait on superblock update
> early */
> 1064
> 1065         if (bio_data_dir(bio) == WRITE &&
> 1066             ((bio_end_sector(bio) > mddev->suspend_lo &&
> 1067             bio->bi_iter.bi_sector < mddev->suspend_hi) ||
> 1068             (mddev_is_clustered(mddev) &&
> 1069              md_cluster_ops->area_resyncing(mddev, WRITE,
> 1070                      bio->bi_iter.bi_sector,
> bio_end_sector(bio))))) {
> 1071                 /* As the suspend_* range is controlled by
> 1072                  * userspace, we want an interruptible
> 1073                  * wait.
> 1074                  */
> 1075                 DEFINE_WAIT(w);
> 1076                 for (;;) {
> 1077                         flush_signals(current);
> 1078                         prepare_to_wait(&conf->wait_barrier,
> 1079                                         &w, TASK_INTERRUPTIBLE);
> 1080                         if (bio_end_sector(bio) <=
> mddev->suspend_lo ||
> 1081                             bio->bi_iter.bi_sector >=
> mddev->suspend_hi ||
> 1082                             (mddev_is_clustered(mddev) &&
> 1083
> !md_cluster_ops->area_resyncing(mddev, WRITE,
> 1084                                      bio->bi_iter.bi_sector,
> bio_end_sector(bio))))
> 1085                                 break;
> 1086                         schedule();
> 1087                 }
> 1088                 finish_wait(&conf->wait_barrier, &w);
> 1089         }
> 1090
> 1091         start_next_window = wait_barrier(conf, bio);
> 1092
> 1093         bitmap = mddev->bitmap;
> 1094
>
> Thanks
>
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Doug Dumitru
EasyCo LLC

^ permalink raw reply

* Re: [PATCH 3/7] imsm: PPL support
From: Jes Sorensen @ 2016-11-28 23:51 UTC (permalink / raw)
  To: Artur Paszkiewicz; +Cc: linux-raid
In-Reply-To: <20161124122952.16529-4-artur.paszkiewicz@intel.com>

Artur Paszkiewicz <artur.paszkiewicz@intel.com> writes:
> Enable creating and assembling IMSM raid5 arrays with PPL.
>
> Write the IMSM MPB location for a device to the newly added rdev
> sb_start sysfs attribute and 'journal_ppl' to 'state' attribute for
> every active member.
>
> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> ---
>  mdadm.h       |  1 +
>  super-intel.c | 33 +++++++++++++++++++++++++++++++++
>  sysfs.c       |  4 ++++
>  3 files changed, 38 insertions(+)
>
> diff --git a/mdadm.h b/mdadm.h
> index 4eabf59..5600341 100755
> --- a/mdadm.h
> +++ b/mdadm.h
> @@ -252,6 +252,7 @@ struct mdinfo {
>  	unsigned long long	custom_array_size; /* size for non-default sized
>  						    * arrays (in sectors)
>  						    */
> +	unsigned long long	sb_start;
>  #define NO_RESHAPE		0
>  #define VOLUME_RESHAPE		1
>  #define CONTAINER_RESHAPE	2
> diff --git a/super-intel.c b/super-intel.c
> index df09272..79a3d78 100644
> --- a/super-intel.c
> +++ b/super-intel.c
> @@ -1261,6 +1261,15 @@ static void print_imsm_dev(struct intel_super *super,
>  	}
>  	printf("\n");
>  	printf("    Dirty State : %s\n", dev->vol.dirty ? "dirty" : "clean");
> +	printf("     RWH Policy : ");
> +	if (dev->rwh_policy == RWH_OFF)
> +		printf("off\n");
> +	else if (dev->rwh_policy == RWH_DISTRIBUTED)
> +		printf("PPL distributed\n");
> +	else if (dev->rwh_policy == RWH_JOURNALING_DRIVE)
> +		printf("PPL journaling drive\n");
> +	else
> +		printf("<unknown:%d>\n", dev->rwh_policy);
>  }
>  
>  static void print_imsm_disk(struct imsm_disk *disk, int index, __u32 reserved)
> @@ -3043,6 +3052,15 @@ static void getinfo_super_imsm_volume(struct supertype *st, struct mdinfo *info,
>  			}
>  		}
>  	}
> +
> +	if (info->array.level == 5) {
> +		if (dev->rwh_policy == RWH_OFF)
> +			info->rwh_policy = RWH_POLICY_OFF;
> +		else if (dev->rwh_policy == RWH_DISTRIBUTED)
> +			info->rwh_policy = RWH_POLICY_PPL;
> +		else
> +			info->rwh_policy = RWH_POLICY_UNKNOWN;
> +	}
>  }
>  
>  static __u8 imsm_check_degraded(struct intel_super *super, struct imsm_dev *dev,
> @@ -3177,6 +3195,9 @@ static void getinfo_super_imsm(struct supertype *st, struct mdinfo *info, char *
>  
>  		disk = &super->disks->disk;
>  		info->data_offset = total_blocks(&super->disks->disk) - reserved;
> +		/* mpb anchor sector - see store_imsm_mpb() */
> +		info->sb_start = total_blocks(&super->disks->disk) -
> +				 ((2 * super->sector_size) >> 9);
>  		info->component_size = reserved;
>  		info->disk.state  = is_configured(disk) ? (1 << MD_DISK_ACTIVE) : 0;
>  		/* we don't change info->disk.raid_disk here because

Hi Artur,

I have been sitting staring at the above for a while, and looking at
store_imsm_mpb() it is not clear to me what is meant to happen here.

For 4k sector drives, aren't you pushing back sb_start way further than
you are for 512 byte sector drives? Ie. you are subtracting 16 sectors
for the 4k drive but only two sectors for the 512 byte sector drive?

Maybe it's because it's Monday or I lost the last of my marbles, but
could you possibly enlighten me here please?

Thanks,
Jes

^ permalink raw reply

* Question about raid1_make_request
From: Robert LeBlanc @ 2016-11-28 23:41 UTC (permalink / raw)
  To: linux-raid

I'm looking through the code and am wondering why this is at the start
of raid1_make_request. It seems that it is only needed for WRITE and
would unnecessarily delay READ requests waiting for the superblock.
Would it make more sense to put this at the start of the WRITE branch?
What am I missing?

1057         /*
1058          * Register the new request and wait if the
reconstruction
1059          * thread has put up a bar for new requests.
1060          * Continue immediately if no resync is active currently.
1061          */
1062
1063         md_write_start(mddev, bio); /* wait on superblock update
early */
1064
1065         if (bio_data_dir(bio) == WRITE &&
1066             ((bio_end_sector(bio) > mddev->suspend_lo &&
1067             bio->bi_iter.bi_sector < mddev->suspend_hi) ||
1068             (mddev_is_clustered(mddev) &&
1069              md_cluster_ops->area_resyncing(mddev, WRITE,
1070                      bio->bi_iter.bi_sector,
bio_end_sector(bio))))) {
1071                 /* As the suspend_* range is controlled by
1072                  * userspace, we want an interruptible
1073                  * wait.
1074                  */
1075                 DEFINE_WAIT(w);
1076                 for (;;) {
1077                         flush_signals(current);
1078                         prepare_to_wait(&conf->wait_barrier,
1079                                         &w, TASK_INTERRUPTIBLE);
1080                         if (bio_end_sector(bio) <=
mddev->suspend_lo ||
1081                             bio->bi_iter.bi_sector >=
mddev->suspend_hi ||
1082                             (mddev_is_clustered(mddev) &&
1083
!md_cluster_ops->area_resyncing(mddev, WRITE,
1084                                      bio->bi_iter.bi_sector,
bio_end_sector(bio))))
1085                                 break;
1086                         schedule();
1087                 }
1088                 finish_wait(&conf->wait_barrier, &w);
1089         }
1090
1091         start_next_window = wait_barrier(conf, bio);
1092
1093         bitmap = mddev->bitmap;
1094

Thanks

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

^ permalink raw reply

* Re: [PATCH 1/7] imsm: metadata changes for PPL
From: Jes Sorensen @ 2016-11-28 23:27 UTC (permalink / raw)
  To: Artur Paszkiewicz; +Cc: linux-raid
In-Reply-To: <20161124122952.16529-2-artur.paszkiewicz@intel.com>

Artur Paszkiewicz <artur.paszkiewicz@intel.com> writes:
> Updates for the IMSM metadata format, including PPL header structures.
>
> Extend imsm_vol dirty field adding a third value, which is required to
> enable PPL recovery in Windows and UEFI drivers.
>
> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> ---
>  super-intel.c | 48 ++++++++++++++++++++++++++++++++++++++++--------
>  1 file changed, 40 insertions(+), 8 deletions(-)

A couple of comments on this change:

> diff --git a/super-intel.c b/super-intel.c
> index 5740088..69f6201 100644
> --- a/super-intel.c
> +++ b/super-intel.c
> @@ -154,6 +157,9 @@ struct imsm_vol {
>  #define MIGR_STATE_CHANGE 4
>  #define MIGR_REPAIR 5
>  	__u8  migr_type;	/* Initializing, Rebuilding, ... */
> +#define RAIDVOL_CLEAN          0
> +#define RAIDVOL_DIRTY          1
> +#define RAIDVOL_DSRECORD_VALID 2
>  	__u8  dirty;
>  	__u8  fs_state;		/* fast-sync state for CnG (0xff == disabled) */
>  	__u16 verify_errors;	/* number of mismatches */
> @@ -189,7 +195,30 @@ struct imsm_dev {
>  	__u16 cache_policy;
>  	__u8  cng_state;
>  	__u8  cng_sub_state;
> -#define IMSM_DEV_FILLERS 10
> +	__u16 my_vol_raid_dev_num; /* Used in Unique volume Id for this RaidDev */
> +
> +	/* NVM_EN */
> +	__u8 nv_cache_mode;
> +	union {
> +		__u8 nv_cache_flags;
> +		struct {
> +		    __u8 dirty:1; /* 1 - cache is dirty, 0 - clean */
> +		    __u8 nvc_health:2;
> +		    __u8 expansion_bytes:5;
> +		} nvCache;
> +	};

This sets off alarm bells here - you declare a union of a u8 and a
matching bitfield, but do not handle bit endian bitfields. If someone
tried to use this on a big endian system this could get messy.

> @@ -7565,12 +7594,15 @@ mark_checkpoint:
>  
>  skip_mark_checkpoint:
>  	/* mark dirty / clean */
> -	if (dev->vol.dirty != !consistent) {
> +	if ((dev->vol.dirty & RAIDVOL_DIRTY) != !consistent) {

This part makes my head spin (to be honest, the original code did
too). I had to go write a small snipped of C to figure out what the
compiler does with !x given x > 0, since consistent can be 0, 1, and 2
here.

Basically for RAIDVOL_CLEAN and consistent = 0, and RAIDVOL_DIRTY and
consistent = 1 and 2, this statement triggers. Maybe I am just terrible
dense when it comes to reading obfuscated C, but could we change this
round to a construct that is a little easier to read?

Cheers,
Jes

^ permalink raw reply

* Re: md prefered minor has been renumbered
From: Phil Turmel @ 2016-11-28 23:10 UTC (permalink / raw)
  To: Randall C. Grimshaw; +Cc: Linux-RAID
In-Reply-To: <BC4985417D7F7B49B9AB545333BDB61B191FE29B@Exchange-MBX1.esf.edu>

Hi Randall,

{ Convention on kernel.org is reply-to-all and to avoid top-posting. I
added the list back. }

On 11/28/2016 05:56 PM, Randall C. Grimshaw wrote:
> Thank you Phil:
>   the linux rescue mode didn't have blkid but I had the uuid's from mdadm --examine,
> so I tried uuid=xxxxx in the fstab file. no joy. it is failing in the load of the ramfs, after assembly but before fstab.
>   it appears that I am caught in the catch22 where I cannot update initramfs until I minimally fix the duplicate md122 problem so that I can assemble /var and mount it.
> Randall

You must use the form UUID="xxxx" with caps and quotes.

you can test with mount UUID="....." /path/to/mount/point to check.

Phil

^ permalink raw reply

* Re: [PATCH 4/4 v4] mdmon: bad block support for external metadata - clear bad blocks
From: Jes Sorensen @ 2016-11-28 22:50 UTC (permalink / raw)
  To: Tomasz Majchrzak; +Cc: linux-raid
In-Reply-To: <1477558425-13332-4-git-send-email-tomasz.majchrzak@intel.com>

Tomasz Majchrzak <tomasz.majchrzak@intel.com> writes:
> If an update of acknowledged bad blocks file is notified, read entire
> bad block list from sysfs file and compare it against local list of bad
> blocks. If any obsolete entries are found, remove them from metadata.
>
> As mdmon cannot perform any memory allocation, new superswitch method
> get_bad_blocks is expected to return a list of bad blocks in metadata
> without allocating memory. It's up to metadata handler to allocate all
> required memory in advance.
>
> Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
> Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> ---
>  mdadm.h   |  7 ++++++
>  monitor.c | 81 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
>  2 files changed, 86 insertions(+), 2 deletions(-)

Applied!

Thanks,
Jes

^ permalink raw reply

* Re: [PATCH 3/4 v2] mdmon: bad block support for external metadata - store bad blocks
From: Jes Sorensen @ 2016-11-28 22:49 UTC (permalink / raw)
  To: Tomasz Majchrzak; +Cc: linux-raid
In-Reply-To: <1477558425-13332-3-git-send-email-tomasz.majchrzak@intel.com>

Tomasz Majchrzak <tomasz.majchrzak@intel.com> writes:
> If md has changed the state to 'blocked' and metadata handler supports
> bad blocks, try process them first. If metadata handler has successfully
> stored bad block, acknowledge it to md via 'badblocks' sysfs file. If
> metadata handler has failed to store the new bad block (ie. lack of
> space), remove bad block support for a disk by writing "-external_bbl"
> to state sysfs file. If all bad blocks have been acknowledged, request
> to unblock the array.
>
> Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
> Acked-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> ---
>  mdadm.h   |   4 +++
>  monitor.c | 112 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 116 insertions(+)

Applied!

Thanks,
Jes

^ permalink raw reply

* Re: [PATCH 2/4 v2] mdmon: bad block support for external metadata - sysfs file open
From: Jes Sorensen @ 2016-11-28 22:46 UTC (permalink / raw)
  To: Tomasz Majchrzak; +Cc: linux-raid
In-Reply-To: <1477558425-13332-2-git-send-email-tomasz.majchrzak@intel.com>

Tomasz Majchrzak <tomasz.majchrzak@intel.com> writes:
> Open 'badblocks' and 'unacknowledged_bad_blocks' sysfs files for each
> disk in the array. Add them to the list of files observed by monitor.
>
> Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
> Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> ---
>  managemon.c | 17 +++++++++++++++++
>  mdadm.h     |  2 ++
>  monitor.c   |  7 ++++++-
>  3 files changed, 25 insertions(+), 1 deletion(-)

Applied!

Thanks,
Jes

^ permalink raw reply

* Re: [PATCH 1/4] mdadm: bad block support for external metadata - initialization
From: Jes Sorensen @ 2016-11-28 22:45 UTC (permalink / raw)
  To: Tomasz Majchrzak; +Cc: linux-raid
In-Reply-To: <1480342025-19508-1-git-send-email-tomasz.majchrzak@intel.com>

Tomasz Majchrzak <tomasz.majchrzak@intel.com> writes:
> If metadata handler provides support for bad blocks, tell md by writing
> 'external_bbl' to rdev state file (both on create and assemble),
> followed by a list of known bad blocks written via sysfs 'bad_blocks'
> file.
>
> Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
> Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> ---
>  mdadm.h | 13 +++++++++++++
>  sysfs.c | 27 ++++++++++++++++++++++++++-
>  2 files changed, 39 insertions(+), 1 deletion(-)

Applied!

Thanks,
Jes

^ permalink raw reply

* Re: [PATCH 1/1] IMSM: Update num_data_stripes during migration
From: Jes Sorensen @ 2016-11-28 22:42 UTC (permalink / raw)
  To: Pawel Baldysiak; +Cc: linux-raid, Maksymilian Kunt
In-Reply-To: <20161124084824.28155-1-pawel.baldysiak@intel.com>

Pawel Baldysiak <pawel.baldysiak@intel.com> writes:
> This patch adds updataing num_data_stripes during reshape.
> Previously this field once set during creation was never updated.
> Also, num_data_strips value multipied by chunk_size is used
> for set proper component size for RAID5.
>
> Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
> Signed-off-by: Maksymilian Kunt <maksymilian.kunt@intel.com>
> ---
>  super-intel.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 46 insertions(+), 7 deletions(-)

Applied!

Thanks,
Jes

^ permalink raw reply

* Re: md prefered minor has been renumbered
From: Phil Turmel @ 2016-11-28 22:12 UTC (permalink / raw)
  To: Randall C. Grimshaw, linux-raid@vger.kernel.org
In-Reply-To: <BC4985417D7F7B49B9AB545333BDB61B191FE20C@Exchange-MBX1.esf.edu>

On 11/28/2016 03:34 PM, Randall C. Grimshaw wrote:
> Phil:
>   Thank you for responding, it is much appreciated.
>   This is a recently assigned system so I must apologize for not having all of the history re fstab.
>   But when I try to boot into the 'normal' system it continues to use the new numbering, the mounts fail, (the duplicate md122 obviously fails), and the system panics.
>   I suspect the answer will have something to do with the command I googled: mdadm -A --update=superminor /dev/mdNEWNUMBER  /dev/sd...
> .. but I have zero experience with it.
> Randall

Use 'blkid' to find out the UUIDs of your filesystems.  Replace /dev/mdX
in your fstab with UUID="......"

then "mount -av"

Then you don't care what number shows up, and can resolve it at your
leisure.

Whatever mdadm.conf you end up with, run your distro's utility to get it
into your initramfs.  On ubuntu, it is "update-initramfs".

Phil

^ permalink raw reply

* RE: md prefered minor has been renumbered
From: Randall C. Grimshaw @ 2016-11-28 22:10 UTC (permalink / raw)
  To: Phil Turmel, linux-raid@vger.kernel.org
In-Reply-To: <BC4985417D7F7B49B9AB545333BDB61B191FE20C@Exchange-MBX1.esf.edu>

Update: So I was really hopeful given the google find of this link:
http://superuser.com/questions/346719/how-to-change-the-name-of-an-md-device-mdadm
as instructed I stopped my devices, that went well.
mdadm --stop /dev/md122
mdadm --assemble --update=super-minor /dev/md5 /dev/sdb7 /dev/sda7
but the --assemble command errors with the message: no devices found for /dev/md5
note: the devices /dev/sda7 and /dev/sda7 *DO* appear with fdisk -l

do I need to --force the command or something?

Randall
________________________________________
From: linux-raid-owner@vger.kernel.org [linux-raid-owner@vger.kernel.org] on behalf of Randall C. Grimshaw [rgrimsha@esf.edu]
Sent: Monday, November 28, 2016 3:34 PM
To: Phil Turmel; linux-raid@vger.kernel.org
Subject: RE: md prefered minor has been renumbered

Phil:
  Thank you for responding, it is much appreciated.
  This is a recently assigned system so I must apologize for not having all of the history re fstab.
  But when I try to boot into the 'normal' system it continues to use the new numbering, the mounts fail, (the duplicate md122 obviously fails), and the system panics.
  I suspect the answer will have something to do with the command I googled: mdadm -A --update=superminor /dev/mdNEWNUMBER  /dev/sd...
. but I have zero experience with it.
Randall

________________________________________
From: Phil Turmel [philip@turmel.org]
Sent: Monday, November 28, 2016 2:58 PM
To: Randall C. Grimshaw; linux-raid@vger.kernel.org
Subject: Re: md prefered minor has been renumbered

On 11/28/2016 12:26 PM, Randall C. Grimshaw wrote:
> I have made the mistake of booting a centos-6.8 live cd to manipulate a centos 5 system.
> as a result md5 has become md122, md2 has become md125, md4 has become md126, md0 has become md127, and most unfortunately md3 has also become md122.

This is normal and expected for a livecd -- the numbering comes from the
mdadm.conf file in the initramfs, unless that mdadm.conf file excludes
arrays for later assembly (when the rootfs is available).

Many livecd initramfs scripts try to assemble everything, so the numbers
are allocated from 127 down.  Usually you can find a boot parameter to
suppress assembly.

> smartctl shows that the WD brand drives do support SCT
> mdadm --examine was used to reveal the mish-mash using uuid numbers compared with the file /etc/mdadm.conf also referencing /etc/fstab.

Eww.  You should not be counting on /dev/mdX numbering in your fstab.
Specifically for this reason.  Use LABEL= or UUID= syntax in fstab.

> can someone kindly tell me the mdadm command to put the correct numbers back.

No need.  When you boot back into your normal system, its initramfs will
supply this information.  If you are changing the info, make sure you
rebuild your initramfs, so the updated mdadm.conf file is included.

Phil
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: md prefered minor has been renumbered
From: Phil Turmel @ 2016-11-28 22:06 UTC (permalink / raw)
  To: Wols Lists, linux-raid@vger.kernel.org
In-Reply-To: <583C95F5.9000701@youngman.org.uk>

On 11/28/2016 03:39 PM, Wols Lists wrote:
> I did say IME :-)

(-:

>> Most modern systems have two copies of mdadm.conf.  One in the initramfs
>> so you can boot a root filesystem that is inside an array, and the other
>> in your root filesystem to assemble everything else.
> 
> Mine don't appear to have any ... and when I've tried to create one, it
> was ignored. Of course, that doesn't rule out incompetence on my part
> :-) But it means the defaults are perfectly okay for me :-) and probably
> most people most of the time ...

If you create an mdadm.conf file, and *don't* put it in your initramfs,
and are using a distro that assembles in the initramfs, your mdadm.conf
file will be ignored *during* *boot*.  But then used any other time
after boot.  If you just left it there, the next kernel upgrade or
package that triggered a new initramfs would have picked it up.  And
"magically" applied those array numbers on your next boot.

> I think I created several of my arrays as md0/1/2. And they stayed that
> way for a little while before deciding to renumber.

Sounds like your distro changed its initramfs around that time.

Phil

^ permalink raw reply

* Re: md prefered minor has been renumbered
From: Wols Lists @ 2016-11-28 20:39 UTC (permalink / raw)
  To: Phil Turmel, linux-raid@vger.kernel.org
In-Reply-To: <7b7ed841-7fe7-627d-bdb5-edb8c7d6a894@turmel.org>

On 28/11/16 20:22, Phil Turmel wrote:
> On 11/28/2016 03:02 PM, Wols Lists wrote:
> 
>> In My Experience (I'll probably get told I'm wrong :-) mdadm.conf seems
>> to be pretty much ignored by modern systems. And md numbers now by
>> default count down from 127, not up from 0.
> 
> mdadm.conf is *not* ignored.  Defaults are taken if it isn't found, or
> doesn't contain the info needed.

I did say IME :-)
> 
> Most modern systems have two copies of mdadm.conf.  One in the initramfs
> so you can boot a root filesystem that is inside an array, and the other
> in your root filesystem to assemble everything else.

Mine don't appear to have any ... and when I've tried to create one, it
was ignored. Of course, that doesn't rule out incompetence on my part
:-) But it means the defaults are perfectly okay for me :-) and probably
most people most of the time ...
> 
> Most distros copy the latter to the former when updating the initramfs.
> You may have to manually trigger this operation when you are
> re-arranging your arrays.
> 
> You also have to manually intervene if you want your initramfs to have
> a different mdadm.conf, like a two-phase assembly.
> 
> mdadm has been counting down from 127 for as long as I've used it.

I think I created several of my arrays as md0/1/2. And they stayed that
way for a little while before deciding to renumber.
> 
> Phil
> 
Cheers,
Wol

^ permalink raw reply

* RE: md prefered minor has been renumbered
From: Randall C. Grimshaw @ 2016-11-28 20:34 UTC (permalink / raw)
  To: Phil Turmel, linux-raid@vger.kernel.org
In-Reply-To: <91b64061-6026-c738-75a6-8d80a177aec8@turmel.org>

Phil:
  Thank you for responding, it is much appreciated.
  This is a recently assigned system so I must apologize for not having all of the history re fstab.
  But when I try to boot into the 'normal' system it continues to use the new numbering, the mounts fail, (the duplicate md122 obviously fails), and the system panics.
  I suspect the answer will have something to do with the command I googled: mdadm -A --update=superminor /dev/mdNEWNUMBER  /dev/sd...
.. but I have zero experience with it.
Randall

________________________________________
From: Phil Turmel [philip@turmel.org]
Sent: Monday, November 28, 2016 2:58 PM
To: Randall C. Grimshaw; linux-raid@vger.kernel.org
Subject: Re: md prefered minor has been renumbered

On 11/28/2016 12:26 PM, Randall C. Grimshaw wrote:
> I have made the mistake of booting a centos-6.8 live cd to manipulate a centos 5 system.
> as a result md5 has become md122, md2 has become md125, md4 has become md126, md0 has become md127, and most unfortunately md3 has also become md122.

This is normal and expected for a livecd -- the numbering comes from the
mdadm.conf file in the initramfs, unless that mdadm.conf file excludes
arrays for later assembly (when the rootfs is available).

Many livecd initramfs scripts try to assemble everything, so the numbers
are allocated from 127 down.  Usually you can find a boot parameter to
suppress assembly.

> smartctl shows that the WD brand drives do support SCT
> mdadm --examine was used to reveal the mish-mash using uuid numbers compared with the file /etc/mdadm.conf also referencing /etc/fstab.

Eww.  You should not be counting on /dev/mdX numbering in your fstab.
Specifically for this reason.  Use LABEL= or UUID= syntax in fstab.

> can someone kindly tell me the mdadm command to put the correct numbers back.

No need.  When you boot back into your normal system, its initramfs will
supply this information.  If you are changing the info, make sure you
rebuild your initramfs, so the updated mdadm.conf file is included.

Phil

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox