Linux RAID subsystem development
 help / color / mirror / Atom feed
* Re: [md PATCH 3/4] md/bitmap: add blktrace event for writes to the bitmap.
From: Shaohua Li @ 2016-11-16 19:31 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid
In-Reply-To: <147910142125.27168.1639109067364223445.stgit@noble>

On Mon, Nov 14, 2016 at 04:30:21PM +1100, Neil Brown wrote:
> We trace wheneven bitmap_unplug() finds that it needs to write
> to the bitmap, or when bitmap_daemon_work() find there is work
> to do.
> 
> This makes it easier to correlate bitmap updates with data writes.

Looks good. This reminds me if we should do similar thing for md_write_start if
we write superblock. Especially for raid5-cache, as we write superblock regularly.

Thanks,
Shaohua

^ permalink raw reply

* Re: [PATCH v6 04/11] md/r5cache: caching mode of r5cache
From: Song Liu @ 2016-11-16 19:55 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Shaohua Li, linux-raid@vger.kernel.org, NeilBrown, Kernel Team,
	dan.j.williams@intel.com, hch@infradead.org,
	liuzhengyuang521@gmail.com, liuzhengyuan@kylinos.cn
In-Reply-To: <20161115214919.GA28086@shli-mbp.local>


> On Nov 15, 2016, at 1:49 PM, Shaohua Li <shli@fb.com> wrote:
> 
> On Tue, Nov 15, 2016 at 07:08:29PM +0000, Song Liu wrote:
> 
>>>> 
>>>> 	might_sleep();
>>>> 
>>>> -	if (r5l_write_stripe(conf->log, sh) == 0)
>>>> -		return;
>>>> +	if (test_bit(STRIPE_R5C_WRITE_OUT, &sh->state)) {
>>>> +		/* writing out mode */
>>>> +		if (r5l_write_stripe(conf->log, sh) == 0)
>>>> +			return;
>>>> +	} else {
>>>> +		/* caching mode */
>>>> +		if (test_bit(STRIPE_LOG_TRAPPED, &sh->state)) {
>>> 
>>> The STRIPE_LOG_TRAPPED bit isn't designed for this, maybe it just works, I have
>>> to double check. but this one is a little confusing.
>> 
>> This is somehow reuse the flag in caching mode/phase. In handle_stripe(), a large part of the 
>> logic is skipped with STRIPE_LOG_TRAPPED. This behavior is the same in caching mode/phase
>> and write_out mode/phase. 
> 
> I'd think this is another abuse of the bit. Both writeout mode and the caching
> mode could set the bit. Here you are using the bit to determine if entering
> caching mode. This isn't clear at all to me. I'd add another bit to indicate
> the caching mode if necessary.

I guess the comment made it confusing... STRIPE_LOG_TRAPPED is not 
used to determine where it is in caching phase. The /* caching phase */ is 
for the "else" statement:

        if (test_bit(STRIPE_R5C_WRITE_OUT, &sh->state)) {
                /* writing out phase */
                if (r5l_write_stripe(conf->log, sh) == 0)
                        return;
        } else {  /* caching phase   i.e.  STRIPE_R5C_WRITE_OUT == 0 */
                if (test_bit(STRIPE_LOG_TRAPPED, &sh->state)) {
                        r5c_cache_data(conf->log, sh, s);
                        return;
                }
        }

In caching phase, STRIPE_LOG_TRAPPED is used as a gate of 
r5c_cache_data(). It is set in r5c_handle_dirtying() and cleared in 
endio of journal writes. It is kind similar to the name "log trapped"
meaning it is time to write data to the log. 

> 
>>>> @@ -1547,6 +1611,7 @@ ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
>>>> 
>>>> again:
>>>> 			dev = &sh->dev[i];
>>>> +			clear_bit(R5_InJournal, &dev->flags);
>>> 
>>> why clear the bit here? isn't this bit cleared when the stripe is initialized?
>> 
>> This is necessary when we rewrite a page that is already in journal. This means there is new data coming to 
>> this stripe and dev, so we should not consider the page is secured in journal. 
> 
> This sounds suggest we should clear the bit after writeout is done.

That may also work. Let me confirm. 

Thanks, 
Song

^ permalink raw reply

* Re: [PATCH v2] RAID1: Avoid unnecessary loop to decrease conf->nr_queued in raid1d()
From: Shaohua Li @ 2016-11-16 20:05 UTC (permalink / raw)
  To: Coly Li; +Cc: linux-raid, Shaohua Li, Neil Brown
In-Reply-To: <785b7474-2e3a-e423-08d7-26cb6136a235@suse.de>

On Wed, Nov 16, 2016 at 10:36:32PM +0800, Coly Li wrote:
> 在 2016/11/16 下午10:19, Coly Li 写道:
> [snip]
> > ---
> >  drivers/md/raid1.c | 9 +++++----
> >  1 file changed, 5 insertions(+), 4 deletions(-)
> > 
> > Index: linux-raid1/drivers/md/raid1.c
> > ===================================================================
> > --- linux-raid1.orig/drivers/md/raid1.c
> > +++ linux-raid1/drivers/md/raid1.c
> > @@ -2387,17 +2387,17 @@ static void raid1d(struct md_thread *thr
> [snip]
> >  		while (!list_empty(&tmp)) {
> >  			r1_bio = list_first_entry(&tmp, struct r1bio,
> >  						  retry_list);
> >  			list_del(&r1_bio->retry_list);
> > +			spin_lock_irqsave(&conf->device_lock, flags);
> > +			conf->nr_queued--;
> > +			spin_unlock_irqrestore(&conf->device_lock, flags);
> [snip]
> 
> Now I work on another 2 patches for a simpler I/O barrier, and a
> lockless I/O submit on raid1, where conf->nr_queued will be in atomic_t.
> So spin lock expense will not exist any more. Just FYI.

I'd like to hold this patch till you post the simpler I/O barrier, as the patch
itself currently doesn't make the process faster (lock/unlock is much heavier
than the loop).

Thanks,
Shaohua

^ permalink raw reply

* Just a thought - linux raid wiki - raid 5 grows getting stuck at 0%
From: Wols Lists @ 2016-11-16 23:51 UTC (permalink / raw)
  To: linux-raid

I've just been doing a bit of work on the wiki when I put 2 and 2
together, and hope I haven't made 5 ...

Bitmaps interfere with grow operations, I believe ...

And mdadm has recently been modified so that bitmaps are switched on by
default, I also believe ...

Could this be silently blocking the grow, so you don't get any error and
it just sits there doing nothing?

But that brings up two points. Should mdadm now explicitly look for a
bitmap when growing an array, and warn that the bitmap needs to be
disabled? Apparently it currently comes up with some errors that hint at
the cause rather than explicitly say so.

And secondly, I'm sure someone sent an email to the list that explained
when a bitmap was created. I can't find it in my archives. The man page
says it's created if the array is over 100G, but I'm sure the email gave
rather more detail than that and that the figure actually varied.

But I'm sure you can see from this that I'm updating this bit of the
wiki. Is there anything else I ought to know about bitmaps, so I can
document it? :-)

Cheers,
Wol

^ permalink raw reply

* Re: [PATCH v6 03/11] md/r5cache: State machine for raid5-cache write back mode
From: NeilBrown @ 2016-11-17  0:28 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-raid@vger.kernel.org, Shaohua Li, Kernel Team,
	dan.j.williams@intel.com, hch@infradead.org,
	liuzhengyuang521@gmail.com, liuzhengyuan@kylinos.cn
In-Reply-To: <0F99C371-BD11-4032-945A-4E19A80AD771@fb.com>

[-- Attachment #1: Type: text/plain, Size: 2324 bytes --]

On Wed, Nov 16 2016, Song Liu wrote:

>> This bothers me.  Why would a stripe *ever* be in "caching mode" (or
>> "caching phase") when the array is in write-through?  It doesn't seem to
>> make sense.
>
> I was thinking about replacing STRIPE_R5C_WRITE_OUT with something
> like STRIPE_R5C_CACHING. So that:
>
> caching-phase is STRIPE_R5C_CACHING == 1
> write-out phase is STRIPE_R5C_CACHING == 0 
>
> In this case, stripes in write-through mode will always have 
> STRIPE_R5C_CACHING == 0. 
>
> This requires some changes to current state machine, but it might work out. 
>
> How do you like this? 

I almost suggested that myself :-)
I'm not against it, but now that I think of "write_out" as an action, it
seems to make sense for that to be a flag.
Before we had the journal we didn't need a flag.  We just assessed the
state of the stripe and then either
 - read some blocks, or
 - calculate parity and write some blocks.

Now that we have the journal, write-out is multi-stage and
"write-to-journal" isn't always followed by 'write to RAID'.  So an extra
flag is needed.
So I'm now happy with WRITE_OUT, but I'd probably be happy with CACHING
too.


>
>
>> There are two actions that can be taken when where are ->towrite blocks
>> on a stripe.  We can enter WRITE_OUT, or they can be cached in the
>> journal.  Also we can enter WRITE_OUT when a stripe needs to be removed
>> From memory or from the journal.
>> This makes "writeout" and "cache" seem more like "actions" than states,
>> modes, or phases.  Naming is hard.
>
> Yes, it is more like action. We used to name it as "modes" as different *mode*
> handles writes with different *action*. So at end of the day, it doesn't really 
> matter?

The exact choice of word isn't vital but it is important to ensure the
code is maintainable, and that means it must be easy to understand.  The
fact that we have had trouble describing what is happening suggests that
this is an area that needs to be clearly explained.
So where the flag is declared, it would help a lot to document:

 /*  set when ...
  *  cleared when ...
  *  used for ....
  */
because whatever name we use for the flag, it won't by itself be
enough.  We need to also explain what it means.

Thanks,
NeilBrown


>
> Thanks,
> Song

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply

* Re: [PATCH v6 05/11] md/r5cache: write-out mode and reclaim support
From: NeilBrown @ 2016-11-17  0:28 UTC (permalink / raw)
  To: linux-raid
  Cc: shli, kernel-team, dan.j.williams, hch, liuzhengyuang521,
	liuzhengyuan, Song Liu
In-Reply-To: <20161110204623.3484694-6-songliubraving@fb.com>

[-- Attachment #1: Type: text/plain, Size: 5477 bytes --]

On Fri, Nov 11 2016, Song Liu wrote:

> +/*
> + * evaluate log space usage and update R5C_LOG_TIGHT and R5C_LOG_CRITICAL
> + *
> + * R5C_LOG_TIGHT is set when free space on the log device is less than 3x of
> + * reclaim_required_space. R5C_LOG_CRITICAL is set when free space on the log
> + * device is less than 2x of reclaim_required_space.
> + */
> +static inline void r5c_update_log_state(struct r5l_log *log)
> +{
> +	struct r5conf *conf = log->rdev->mddev->private;
> +	sector_t free_space;
> +	sector_t reclaim_space;
> +
> +	if (!r5c_is_writeback(log))
> +		return;
> +
> +	free_space = r5l_ring_distance(log, log->log_start,
> +				       log->last_checkpoint);
> +	reclaim_space = r5c_log_required_to_flush_cache(conf);
> +	if (free_space < 2 * reclaim_space)
> +		set_bit(R5C_LOG_CRITICAL, &conf->cache_state);
> +	else
> +		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);
> +	if (free_space < 3 * reclaim_space)
> +		set_bit(R5C_LOG_TIGHT, &conf->cache_state);
> +	else
> +		clear_bit(R5C_LOG_TIGHT, &conf->cache_state);
> +}

This code, that you rewrote as I requested (Thanks) behaves slightly
differently to the previous version.
Maybe that is intentional, but I thought I would mention it anyway.
The previous would set TIGHT when free_space dropped below
3*reclaim_space, and would only clear it when free_space when above
4*reclaim_space.  This provided some hysteresis.
Now it is cleared as soon as free_space reaches 3*reclaim_space.

Maybe this is what you want, but as the hysteresis seemed like it might
be sensible, it is worth asking.

>  
> +/*
> + * calculate new last_checkpoint
> + * for write through mode, returns log->next_checkpoint
> + * for write back, returns log_start of first sh in stripe_in_cache_list
> + */
> +static sector_t r5c_calculate_new_cp(struct r5conf *conf)
> +{
> +	struct stripe_head *sh;
> +	struct r5l_log *log = conf->log;
> +	sector_t end = MaxSector;

The value assigned here is never used.

> +
> +	if (log->r5c_journal_mode == R5C_JOURNAL_MODE_WRITE_THROUGH)
> +		return log->next_checkpoint;
> +
> +	spin_lock(&log->stripe_in_cache_lock);
> +	if (list_empty(&conf->log->stripe_in_cache_list)) {
> +		/* all stripes flushed */
> +		spin_unlock(&log->stripe_in_cache_lock);
> +		return log->next_checkpoint;
> +	}
> +	sh = list_first_entry(&conf->log->stripe_in_cache_list,
> +			      struct stripe_head, r5c);
> +	end = sh->log_start;
> +	spin_unlock(&log->stripe_in_cache_lock);
> +	return end;

Given that we only assign "log_start" to the variable "end", it is
strange that it is called "end".
"new_cp" would make sense, or "log_start", but why "end" ??


> +}
> +
>  static sector_t r5l_reclaimable_space(struct r5l_log *log)
>  {
> +	struct r5conf *conf = log->rdev->mddev->private;
> +
>  	return r5l_ring_distance(log, log->last_checkpoint,
> -				 log->next_checkpoint);
> +				 r5c_calculate_new_cp(conf));
>  }
>  
>  static void r5l_run_no_mem_stripe(struct r5l_log *log)
> @@ -776,6 +966,7 @@ static bool r5l_complete_finished_ios(struct r5l_log *log)
>  static void __r5l_stripe_write_finished(struct r5l_io_unit *io)
>  {
>  	struct r5l_log *log = io->log;
> +	struct r5conf *conf = log->rdev->mddev->private;
>  	unsigned long flags;
>  
>  	spin_lock_irqsave(&log->io_list_lock, flags);
> @@ -786,7 +977,8 @@ static void __r5l_stripe_write_finished(struct r5l_io_unit *io)
>  		return;
>  	}
>  
> -	if (r5l_reclaimable_space(log) > log->max_free_space)
> +	if (r5l_reclaimable_space(log) > log->max_free_space ||
> +	    test_bit(R5C_LOG_TIGHT, &conf->cache_state))
>  		r5l_wake_reclaim(log, 0);
>  
>  	spin_unlock_irqrestore(&log->io_list_lock, flags);
> @@ -907,14 +1099,140 @@ static void r5l_write_super_and_discard_space(struct r5l_log *log,
>  	}
>  }
>  
> +/*
> + * r5c_flush_stripe moves stripe from cached list to handle_list. When called,
> + * the stripe must be on r5c_cached_full_stripes or r5c_cached_partial_stripes.
> + *
> + * must hold conf->device_lock
> + */
> +static void r5c_flush_stripe(struct r5conf *conf, struct stripe_head *sh)
> +{
> +	BUG_ON(list_empty(&sh->lru));
> +	BUG_ON(test_bit(STRIPE_R5C_WRITE_OUT, &sh->state));
> +	BUG_ON(test_bit(STRIPE_HANDLE, &sh->state));
> +	assert_spin_locked(&conf->device_lock);
> +
> +	list_del_init(&sh->lru);
> +	atomic_inc(&sh->count);
> +
> +	set_bit(STRIPE_HANDLE, &sh->state);
> +	atomic_inc(&conf->active_stripes);
> +	r5c_make_stripe_write_out(sh);
> +
> +	if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
> +		atomic_inc(&conf->preread_active_stripes);
> +	raid5_release_stripe(sh);

This looks wrong.  raid5_release_stripe() can try to take
conf->device_lock but this function is called with ->device_lock
held. This would cause a deadlock.

It presumably doesn't deadlock because you just incremented sh->count,
so raid5_release_stripe() will probably just decrement sh->count and
that count will remain > 0.
So why are you incrementing ->count for a few instructions and then
releasing the stripe?  Either that isn't necessary, or it could
deadlock.

I guess that if we are certain that STRIPE_ON_RELEASE_LIST is clear,
then it won't deadlock as it will do a lock-less add to
conf->release_stripes.
But if that is the case, it needs to be documented, and probaby there
needs to be a WARN_ON(test_bit(STRIPE_ON_RELEASE_LIST.....));


Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply

* Re: [PATCH v6 06/11] md/r5cache: sysfs entry r5c_journal_mode
From: NeilBrown @ 2016-11-17  0:29 UTC (permalink / raw)
  To: linux-raid
  Cc: shli, kernel-team, dan.j.williams, hch, liuzhengyuang521,
	liuzhengyuan, Song Liu
In-Reply-To: <20161110204623.3484694-7-songliubraving@fb.com>

[-- Attachment #1: Type: text/plain, Size: 872 bytes --]

On Fri, Nov 11 2016, Song Liu wrote:

> +static ssize_t r5c_journal_mode_store(struct mddev *mddev,
> +				      const char *page, size_t len)
> +{
> +	struct r5conf *conf = mddev->private;
> +	struct r5l_log *log = conf->log;
> +	int val = -1, i;
> +
> +	if (!log)
> +		return -ENODEV;
> +
> +	for (i = 0; i < sizeof(r5c_journal_mode_str) / sizeof(r5c_journal_mode_str[0]); i++)
> +		if (strlen(r5c_journal_mode_str[i]) == len - 1 &&
> +		    strncmp(page, r5c_journal_mode_str[i], len - 1) == 0) {
> +			val = i;
> +			break;

I don't really like this "len - 1".
It is presumably there to skip a trailing newline, but it doesn't check
that there is a new line to skip.
Some people seem to have a habit of using "echo -n ...." to write to
sysfs.
I suggest.

 if (len && page[len-1] == '\n')
       len -= 1;

then just use "len" instead of "len - 1".


Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply

* Re: [PATCH v6 05/11] md/r5cache: write-out mode and reclaim support
From: Song Liu @ 2016-11-17  0:57 UTC (permalink / raw)
  To: NeilBrown
  Cc: linux-raid@vger.kernel.org, Shaohua Li, Kernel Team,
	dan.j.williams@intel.com, hch@infradead.org,
	liuzhengyuang521@gmail.com, liuzhengyuan@kylinos.cn
In-Reply-To: <87zikz7xva.fsf@notabene.neil.brown.name>


> On Nov 16, 2016, at 4:28 PM, NeilBrown <neilb@suse.com> wrote:
> 
> On Fri, Nov 11 2016, Song Liu wrote:
>> 
>> +
>> +	free_space = r5l_ring_distance(log, log->log_start,
>> +				       log->last_checkpoint);
>> +	reclaim_space = r5c_log_required_to_flush_cache(conf);
>> +	if (free_space < 2 * reclaim_space)
>> +		set_bit(R5C_LOG_CRITICAL, &conf->cache_state);
>> +	else
>> +		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);
>> +	if (free_space < 3 * reclaim_space)
>> +		set_bit(R5C_LOG_TIGHT, &conf->cache_state);
>> +	else
>> +		clear_bit(R5C_LOG_TIGHT, &conf->cache_state);
>> +}
> 
> This code, that you rewrote as I requested (Thanks) behaves slightly
> differently to the previous version.
> Maybe that is intentional, but I thought I would mention it anyway.
> The previous would set TIGHT when free_space dropped below
> 3*reclaim_space, and would only clear it when free_space when above
> 4*reclaim_space.  This provided some hysteresis.
> Now it is cleared as soon as free_space reaches 3*reclaim_space.
> 
> Maybe this is what you want, but as the hysteresis seemed like it might
> be sensible, it is worth asking.

Thanks for pointing this out. This part will need a little more fine-tuning. 

>> 
>> +	if (log->r5c_journal_mode == R5C_JOURNAL_MODE_WRITE_THROUGH)
>> +		return log->next_checkpoint;
>> +
>> +	spin_lock(&log->stripe_in_cache_lock);
>> +	if (list_empty(&conf->log->stripe_in_cache_list)) {
>> +		/* all stripes flushed */
>> +		spin_unlock(&log->stripe_in_cache_lock);
>> +		return log->next_checkpoint;
>> +	}
>> +	sh = list_first_entry(&conf->log->stripe_in_cache_list,
>> +			      struct stripe_head, r5c);
>> +	end = sh->log_start;
>> +	spin_unlock(&log->stripe_in_cache_lock);
>> +	return end;
> 
> Given that we only assign "log_start" to the variable "end", it is
> strange that it is called "end".
> "new_cp" would make sense, or "log_start", but why "end" ??

It is somehow like "end of what we can reclaim". I will rename it. 

> 
>> 
>> +/*
>> + * r5c_flush_stripe moves stripe from cached list to handle_list. When called,
>> + * the stripe must be on r5c_cached_full_stripes or r5c_cached_partial_stripes.
>> + *
>> + * must hold conf->device_lock
>> + */
>> +static void r5c_flush_stripe(struct r5conf *conf, struct stripe_head *sh)
>> +{
>> +	BUG_ON(list_empty(&sh->lru));
>> +	BUG_ON(test_bit(STRIPE_R5C_WRITE_OUT, &sh->state));
>> +	BUG_ON(test_bit(STRIPE_HANDLE, &sh->state));
>> +	assert_spin_locked(&conf->device_lock);
>> +
>> +	list_del_init(&sh->lru);
>> +	atomic_inc(&sh->count);
>> +
>> +	set_bit(STRIPE_HANDLE, &sh->state);
>> +	atomic_inc(&conf->active_stripes);
>> +	r5c_make_stripe_write_out(sh);
>> +
>> +	if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
>> +		atomic_inc(&conf->preread_active_stripes);
>> +	raid5_release_stripe(sh);
> 
> This looks wrong.  raid5_release_stripe() can try to take
> conf->device_lock but this function is called with ->device_lock
> held. This would cause a deadlock.
> 
> It presumably doesn't deadlock because you just incremented sh->count,
> so raid5_release_stripe() will probably just decrement sh->count and
> that count will remain > 0.
> So why are you incrementing ->count for a few instructions and then
> releasing the stripe?  Either that isn't necessary, or it could
> deadlock.
r5c_flush_stripe() will only work on stripes in r5c_cached_full_stripes or
r5c_cached_partial_stripes. So these stripes would always have count=0
(before atomic_inc). 

> I guess that if we are certain that STRIPE_ON_RELEASE_LIST is clear,
> then it won't deadlock as it will do a lock-less add to
> conf->release_stripes.
> But if that is the case, it needs to be documented, and probaby there
> needs to be a WARN_ON(test_bit(STRIPE_ON_RELEASE_LIST.....));

Since the stripe is on r5c_cached_full_stripes or r5c_cached_partial_stripes, 
it should not have STRIPE_ON_RELEASE_LIST. Let me add documents 
and check. 

Thanks,
Song


^ permalink raw reply

* [PATCH 1/2] raid5-cache: update superblock at shutdown/reboot
From: Shaohua Li @ 2016-11-17  1:20 UTC (permalink / raw)
  To: linux-raid; +Cc: songliubraving, neilb, Zhengyuan Liu

Currently raid5-cache update superblock in .quiesce. But since at
shutdown/reboot, .quiesce is called with reconfig mutex locked,
superblock isn't guaranteed to be called in reclaim thread (see
8e018c21da3). This will make assemble do unnecessary journal recovery.
It doesn't corrupt data but is annoying.  This adds an extra hook to
guarantee journal is flushed to raid disks.  And since this hook is
called before superblock update, this will guarantee we have a uptodate
superblock in shutdown/reboot

Signed-off-by: Shaohua Li <shli@fb.com>
Cc: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
Cc: NeilBrown <neilb@suse.com>
---
 drivers/md/md.c          |  3 +++
 drivers/md/md.h          |  2 ++
 drivers/md/raid5-cache.c | 19 ++++++++++++++++++-
 drivers/md/raid5.c       |  3 +++
 drivers/md/raid5.h       |  1 +
 5 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 1f1c7f0..155dd42 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5471,6 +5471,9 @@ static void __md_stop_writes(struct mddev *mddev)
 
 	del_timer_sync(&mddev->safemode_timer);
 
+	if (mddev->pers && mddev->pers->stop_writes)
+		mddev->pers->stop_writes(mddev);
+
 	bitmap_flush(mddev);
 
 	if (mddev->ro == 0 &&
diff --git a/drivers/md/md.h b/drivers/md/md.h
index af6b33c..8da6fe9 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -512,6 +512,8 @@ struct md_personality
 	/* congested implements bdi.congested_fn().
 	 * Will not be called while array is 'suspended' */
 	int (*congested)(struct mddev *mddev, int bits);
+	/* stop touching raid disks */
+	void (*stop_writes)(struct mddev *mddev);
 };
 
 struct md_sysfs_entry {
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 2e270e6..641dec8 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -738,11 +738,13 @@ static void r5l_write_super_and_discard_space(struct r5l_log *log,
 
 static void r5l_do_reclaim(struct r5l_log *log)
 {
+	static DEFINE_MUTEX(lock);
 	sector_t reclaim_target = xchg(&log->reclaim_target, 0);
 	sector_t reclaimable;
 	sector_t next_checkpoint;
 	u64 next_cp_seq;
 
+	mutex_lock(&lock);
 	spin_lock_irq(&log->io_list_lock);
 	/*
 	 * move proper io_unit to reclaim list. We should not change the order.
@@ -769,8 +771,10 @@ static void r5l_do_reclaim(struct r5l_log *log)
 	spin_unlock_irq(&log->io_list_lock);
 
 	BUG_ON(reclaimable < 0);
-	if (reclaimable == 0)
+	if (reclaimable == 0) {
+		mutex_unlock(&lock);
 		return;
+	}
 
 	/*
 	 * write_super will flush cache of each raid disk. We must write super
@@ -784,6 +788,8 @@ static void r5l_do_reclaim(struct r5l_log *log)
 	log->last_cp_seq = next_cp_seq;
 	mutex_unlock(&log->io_mutex);
 
+	mutex_unlock(&lock);
+
 	r5l_run_no_space_stripes(log);
 }
 
@@ -836,6 +842,17 @@ void r5l_quiesce(struct r5l_log *log, int state)
 	}
 }
 
+void r5l_stop_writes(struct mddev *mddev)
+{
+	struct r5conf *conf = mddev->private;
+	struct r5l_log *log = conf->log;
+
+	if (!log)
+		return;
+	r5l_wake_reclaim(log, -1L);
+	r5l_do_reclaim(log);
+}
+
 bool r5l_log_disk_error(struct r5conf *conf)
 {
 	struct r5l_log *log;
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index df88656..3d27338 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -7866,6 +7866,7 @@ static struct md_personality raid6_personality =
 	.quiesce	= raid5_quiesce,
 	.takeover	= raid6_takeover,
 	.congested	= raid5_congested,
+	.stop_writes	= r5l_stop_writes,
 };
 static struct md_personality raid5_personality =
 {
@@ -7889,6 +7890,7 @@ static struct md_personality raid5_personality =
 	.quiesce	= raid5_quiesce,
 	.takeover	= raid5_takeover,
 	.congested	= raid5_congested,
+	.stop_writes	= r5l_stop_writes,
 };
 
 static struct md_personality raid4_personality =
@@ -7913,6 +7915,7 @@ static struct md_personality raid4_personality =
 	.quiesce	= raid5_quiesce,
 	.takeover	= raid4_takeover,
 	.congested	= raid5_congested,
+	.stop_writes	= r5l_stop_writes,
 };
 
 static int __init raid5_init(void)
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 57ec49f..0643cc0 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -633,4 +633,5 @@ extern void r5l_stripe_write_finished(struct stripe_head *sh);
 extern int r5l_handle_flush_request(struct r5l_log *log, struct bio *bio);
 extern void r5l_quiesce(struct r5l_log *log, int state);
 extern bool r5l_log_disk_error(struct r5conf *conf);
+extern void r5l_stop_writes(struct mddev *mddev);
 #endif
-- 
2.9.3


^ permalink raw reply related

* [PATCH 2/2] raid5-cache: fix lockdep warning
From: Shaohua Li @ 2016-11-17  1:20 UTC (permalink / raw)
  To: linux-raid; +Cc: songliubraving, neilb
In-Reply-To: <5d6f023fb1d1398317d07f02634f90b055e26f4b.1479345454.git.shli@fb.com>

lockdep reports warning of the rcu_dereference usage. Using normal rdev
access pattern to avoid the warning.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/raid5-cache.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 641dec8..fe62578 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -987,16 +987,28 @@ static int r5l_recovery_flush_one_stripe(struct r5l_log *log,
 			continue;
 
 		/* in case device is broken */
+		rcu_read_lock();
 		rdev = rcu_dereference(conf->disks[disk_index].rdev);
-		if (rdev)
+		if (rdev) {
+			atomic_inc(&rdev->nr_pending);
+			rcu_read_unlock();
 			sync_page_io(rdev, stripe_sect, PAGE_SIZE,
 				     sh->dev[disk_index].page, REQ_OP_WRITE, 0,
 				     false);
+			rdev_dec_pending(rdev, rdev->mddev);
+			rcu_read_lock();
+		}
 		rrdev = rcu_dereference(conf->disks[disk_index].replacement);
-		if (rrdev)
+		if (rrdev) {
+			atomic_inc(&rrdev->nr_pending);
+			rcu_read_unlock();
 			sync_page_io(rrdev, stripe_sect, PAGE_SIZE,
 				     sh->dev[disk_index].page, REQ_OP_WRITE, 0,
 				     false);
+			rdev_dec_pending(rrdev, rrdev->mddev);
+			rcu_read_lock();
+		}
+		rcu_read_unlock();
 	}
 	raid5_release_stripe(sh);
 	return 0;
-- 
2.9.3


^ permalink raw reply related

* Newly-created arrays don't auto-assemble - related to hostname change?
From: Andy Smith @ 2016-11-17  3:52 UTC (permalink / raw)
  To: linux-raid

Hi,

I feel I am missing something very simple here, as I usually don't
have this issue, but here goes…

I've got a Debian jessie host on which I created four arrays during
install (md{0,1,2,3}), using the Debian installer and partman. These
auto-assemble fine.

After install the name of the server was changed from "tbd" to
"jfd". Another array was then created (md5), added to
/etc/mdadm/mdadm.conf and the initramfs was rebuilt
(update-initramfs -u).

md5 does not auto-assemble. It can be started fine after boot with:

    # mdadm --assemble /dev/md5

or:

    # mdadm --incremental /dev/sdc
    # mdadm --incremental /dev/sdd

/etc/mdadm/mdadm.conf:

    DEVICE /dev/sd*
    CREATE owner=root group=disk mode=0660 auto=yes
    HOMEHOST <ignore>
    MAILADDR root
    ARRAY /dev/md/0  metadata=1.2 UUID=400bac1d:e2c5d6ef:fea3b8c8:bcb70f8f
    ARRAY /dev/md/1  metadata=1.2 UUID=e29c8b89:705f0116:d888f77e:2b6e32f5
    ARRAY /dev/md/2  metadata=1.2 UUID=039b3427:4be5157a:6e2d53bd:fe898803
    ARRAY /dev/md/3  metadata=1.2 UUID=30f745ce:7ed41b53:4df72181:7406ea1d
    ARRAY /dev/md/5  metadata=1.2 UUID=957030cf:c09f023d:ceaebb27:e546f095

I've unpacked the initramfs and looked at the mdadm.conf in there
and it is identical.

Initially HOMEHOST was set to <system> (the default), but I noticed
when looking at --detail that md5 has:

           Name : jfd:5  (local to host jfd)

whereas the others have:

           Name : tbd:0

…so I changed it to <ignore> to see if that would help. It didn't.

So, I'd really appreciate any hints as to what I've missed here!

Here follows --detail and --examine of md5 and its members, then the
contents of /proc/mdstat after I have manually assembled md5.

$ sudo mdadm --detail /dev/md5
/dev/md5:
        Version : 1.2
  Creation Time : Thu Nov 17 02:35:15 2016
     Raid Level : raid10
     Array Size : 1875243008 (1788.37 GiB 1920.25 GB)
  Used Dev Size : 1875243008 (1788.37 GiB 1920.25 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Thu Nov 17 02:35:15 2016
          State : clean 
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

         Layout : far=2
     Chunk Size : 512K

           Name : jfd:5  (local to host jfd)
           UUID : 957030cf:c09f023d:ceaebb27:e546f095
         Events : 0

    Number   Major   Minor   RaidDevice State
       0       8       48        0      active sync   /dev/sdd
       1       8       32        1      active sync   /dev/sdc

$ sudo mdadm --examine /dev/sd{c,d}
/dev/sdc:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 957030cf:c09f023d:ceaebb27:e546f095
           Name : jfd:5  (local to host jfd)
  Creation Time : Thu Nov 17 02:35:15 2016
     Raid Level : raid10
   Raid Devices : 2

 Avail Dev Size : 3750486704 (1788.37 GiB 1920.25 GB)
     Array Size : 1875243008 (1788.37 GiB 1920.25 GB)
  Used Dev Size : 3750486016 (1788.37 GiB 1920.25 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
   Unused Space : before=262056 sectors, after=688 sectors
          State : clean
    Device UUID : 4ac82c29:2d109465:7fff9b22:8c411c1e

Internal Bitmap : 8 sectors from superblock
    Update Time : Thu Nov 17 02:35:15 2016
  Bad Block Log : 512 entries available at offset 72 sectors
       Checksum : 96d669f1 - correct
         Events : 0

         Layout : far=2
     Chunk Size : 512K

   Device Role : Active device 1
   Array State : AA ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdd:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 957030cf:c09f023d:ceaebb27:e546f095
           Name : jfd:5  (local to host jfd)
  Creation Time : Thu Nov 17 02:35:15 2016
     Raid Level : raid10
   Raid Devices : 2
 Avail Dev Size : 3750486704 (1788.37 GiB 1920.25 GB)
     Array Size : 1875243008 (1788.37 GiB 1920.25 GB)
  Used Dev Size : 3750486016 (1788.37 GiB 1920.25 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
   Unused Space : before=262056 sectors, after=688 sectors
          State : clean
    Device UUID : 3a067652:6e88afae:82722342:0036bae0

Internal Bitmap : 8 sectors from superblock
    Update Time : Thu Nov 17 02:35:15 2016
  Bad Block Log : 512 entries available at offset 72 sectors
       Checksum : eb608799 - correct
         Events : 0

         Layout : far=2
     Chunk Size : 512K

   Device Role : Active device 0
   Array State : AA ('A' == active, '.' == missing, 'R' == replacing)

$ cat /proc/mdstat 
Personalities : [raid1] [raid10] 
md5 : active (auto-read-only) raid10 sdd[0] sdc[1]
      1875243008 blocks super 1.2 512K chunks 2 far-copies [2/2] [UU]
      bitmap: 0/14 pages [0KB], 65536KB chunk

md3 : active raid10 sda5[0] sdb5[1]
      12199936 blocks super 1.2 512K chunks 2 far-copies [2/2] [UU]
      
md2 : active (auto-read-only) raid10 sda3[0] sdb3[1]
      975872 blocks super 1.2 512K chunks 2 far-copies [2/2] [UU]
      
md1 : active raid10 sda2[0] sdb2[1]
      1951744 blocks super 1.2 512K chunks 2 far-copies [2/2] [UU]
      
md0 : active raid1 sda1[0] sdb1[1]
      498368 blocks super 1.2 [2/2] [UU]
      
unused devices: <none>

Cheers,
Andy

^ permalink raw reply

* Re: [PATCH 1/2] raid5-cache: update superblock at shutdown/reboot
From: NeilBrown @ 2016-11-17  5:18 UTC (permalink / raw)
  To: Shaohua Li, linux-raid; +Cc: songliubraving, Zhengyuan Liu
In-Reply-To: <5d6f023fb1d1398317d07f02634f90b055e26f4b.1479345454.git.shli@fb.com>

[-- Attachment #1: Type: text/plain, Size: 5486 bytes --]

On Thu, Nov 17 2016, Shaohua Li wrote:

> Currently raid5-cache update superblock in .quiesce. But since at
> shutdown/reboot, .quiesce is called with reconfig mutex locked,
> superblock isn't guaranteed to be called in reclaim thread (see
> 8e018c21da3). This will make assemble do unnecessary journal recovery.
> It doesn't corrupt data but is annoying.  This adds an extra hook to
> guarantee journal is flushed to raid disks.  And since this hook is
> called before superblock update, this will guarantee we have a uptodate
> superblock in shutdown/reboot

Hi.
I don't quite follow some of the reasoning here.
In particular, the ->stop_writes() that you have implemented
does almost exactly the same thing as r5l_quiesce(1).
So why not simply call ->quiesce(mddev, 1) in __md_stop_writes()??
You probably need to also call ->quiesce(mddev, 0) to keep things
balanced.

Also you have introduced a static mutex (which isn't my favourite sort
of thing) without giving any explanation why in the changelog comment.
So I cannot easily see if that addition is at all justified.

Thanks,
NeilBrown

>
> Signed-off-by: Shaohua Li <shli@fb.com>
> Cc: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
> Cc: NeilBrown <neilb@suse.com>
> ---
>  drivers/md/md.c          |  3 +++
>  drivers/md/md.h          |  2 ++
>  drivers/md/raid5-cache.c | 19 ++++++++++++++++++-
>  drivers/md/raid5.c       |  3 +++
>  drivers/md/raid5.h       |  1 +
>  5 files changed, 27 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 1f1c7f0..155dd42 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -5471,6 +5471,9 @@ static void __md_stop_writes(struct mddev *mddev)
>  
>  	del_timer_sync(&mddev->safemode_timer);
>  
> +	if (mddev->pers && mddev->pers->stop_writes)
> +		mddev->pers->stop_writes(mddev);
> +
>  	bitmap_flush(mddev);
>  
>  	if (mddev->ro == 0 &&
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index af6b33c..8da6fe9 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -512,6 +512,8 @@ struct md_personality
>  	/* congested implements bdi.congested_fn().
>  	 * Will not be called while array is 'suspended' */
>  	int (*congested)(struct mddev *mddev, int bits);
> +	/* stop touching raid disks */
> +	void (*stop_writes)(struct mddev *mddev);
>  };
>  
>  struct md_sysfs_entry {
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index 2e270e6..641dec8 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -738,11 +738,13 @@ static void r5l_write_super_and_discard_space(struct r5l_log *log,
>  
>  static void r5l_do_reclaim(struct r5l_log *log)
>  {
> +	static DEFINE_MUTEX(lock);
>  	sector_t reclaim_target = xchg(&log->reclaim_target, 0);
>  	sector_t reclaimable;
>  	sector_t next_checkpoint;
>  	u64 next_cp_seq;
>  
> +	mutex_lock(&lock);
>  	spin_lock_irq(&log->io_list_lock);
>  	/*
>  	 * move proper io_unit to reclaim list. We should not change the order.
> @@ -769,8 +771,10 @@ static void r5l_do_reclaim(struct r5l_log *log)
>  	spin_unlock_irq(&log->io_list_lock);
>  
>  	BUG_ON(reclaimable < 0);
> -	if (reclaimable == 0)
> +	if (reclaimable == 0) {
> +		mutex_unlock(&lock);
>  		return;
> +	}
>  
>  	/*
>  	 * write_super will flush cache of each raid disk. We must write super
> @@ -784,6 +788,8 @@ static void r5l_do_reclaim(struct r5l_log *log)
>  	log->last_cp_seq = next_cp_seq;
>  	mutex_unlock(&log->io_mutex);
>  
> +	mutex_unlock(&lock);
> +
>  	r5l_run_no_space_stripes(log);
>  }
>  
> @@ -836,6 +842,17 @@ void r5l_quiesce(struct r5l_log *log, int state)
>  	}
>  }
>  
> +void r5l_stop_writes(struct mddev *mddev)
> +{
> +	struct r5conf *conf = mddev->private;
> +	struct r5l_log *log = conf->log;
> +
> +	if (!log)
> +		return;
> +	r5l_wake_reclaim(log, -1L);
> +	r5l_do_reclaim(log);
> +}
> +
>  bool r5l_log_disk_error(struct r5conf *conf)
>  {
>  	struct r5l_log *log;
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index df88656..3d27338 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -7866,6 +7866,7 @@ static struct md_personality raid6_personality =
>  	.quiesce	= raid5_quiesce,
>  	.takeover	= raid6_takeover,
>  	.congested	= raid5_congested,
> +	.stop_writes	= r5l_stop_writes,
>  };
>  static struct md_personality raid5_personality =
>  {
> @@ -7889,6 +7890,7 @@ static struct md_personality raid5_personality =
>  	.quiesce	= raid5_quiesce,
>  	.takeover	= raid5_takeover,
>  	.congested	= raid5_congested,
> +	.stop_writes	= r5l_stop_writes,
>  };
>  
>  static struct md_personality raid4_personality =
> @@ -7913,6 +7915,7 @@ static struct md_personality raid4_personality =
>  	.quiesce	= raid5_quiesce,
>  	.takeover	= raid4_takeover,
>  	.congested	= raid5_congested,
> +	.stop_writes	= r5l_stop_writes,
>  };
>  
>  static int __init raid5_init(void)
> diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
> index 57ec49f..0643cc0 100644
> --- a/drivers/md/raid5.h
> +++ b/drivers/md/raid5.h
> @@ -633,4 +633,5 @@ extern void r5l_stripe_write_finished(struct stripe_head *sh);
>  extern int r5l_handle_flush_request(struct r5l_log *log, struct bio *bio);
>  extern void r5l_quiesce(struct r5l_log *log, int state);
>  extern bool r5l_log_disk_error(struct r5conf *conf);
> +extern void r5l_stop_writes(struct mddev *mddev);
>  #endif
> -- 
> 2.9.3

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply

* Re: mdadm I/O error with Ddf RAID
From: Arka Sharma @ 2016-11-17  5:21 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid
In-Reply-To: <874m3ak3ci.fsf@notabene.neil.brown.name>

Thanks Neil for you response. The devices we are using have physical
sector size of 512 bytes. We are writing the anchor header at last LBA
of the devices. We also written the controller data, physical records,
virtual records, physical data, config data to both primary and
secondary. We have also dumped 32 MB of metadata for RAID created by
mdadm and our tool. The guid's, crc's, timestamps are different as
expected, and the other fields are mostly similar, except in between
some header mdadm is writing 'FF' and we are writing 0's. We have also
dumped the call trace in block layer and put following prints,

[   10.312362] generic_make_request: bio->bi_iter.bi_sector=18446744073709551615
[   10.312363] generic_make_request_checks:
bio->bi_iter.bi_sector=18446744073709551615
[   10.312364] bio_check_eod: nr_sector=8, max_sector=1000215216,
bio->bi_iter.bi_sector=18446744073709551615

And following is the call trace

[   10.312372]  [<ffffffff983d4b5c>] dump_stack+0x63/0x87
[   10.312374]  [<ffffffff983a06c1>] generic_make_request_checks+0x2a1/0x550
[   10.312375]  [<ffffffff980d54a9>] ? vprintk_default+0x29/0x40
[   10.312377]  [<ffffffff983a2cf2>] generic_make_request+0x52/0x210
[   10.312378]  [<ffffffff9839c0bf>] ? bio_clone_bioset+0x12f/0x320
[   10.312380]  [<ffffffffc031ed6a>] raid1_make_request+0xa2a/0xe00 [raid1]
[   10.312382]  [<ffffffff98192fcc>] ? get_page_from_freelist+0x36c/0xa50
[   10.312383]  [<ffffffff9822f8d5>] ? __d_alloc+0x25/0x1d0
[   10.312385]  [<ffffffff986739f2>] md_make_request+0xe2/0x230
[   10.312387]  [<ffffffff980be164>] ? __wake_up+0x44/0x50
[   10.312391]  [<ffffffff983a2dc6>] generic_make_request+0x126/0x210
[   10.312395]  [<ffffffff98325e3b>] ? d_lookup_done.part.24+0x26/0x2b
[   10.312398]  [<ffffffff983a2f3b>] submit_bio+0x8b/0x1a0
[   10.312401]  [<ffffffff9839ab48>] ? bio_alloc_bioset+0x168/0x2a0
[   10.312405]  [<ffffffff9824d9ac>] submit_bh_wbc+0x15c/0x1a0
[   10.312408]  [<ffffffff9824e26c>] block_read_full_page+0x1ec/0x370
[   10.312411]  [<ffffffff98250230>] ? I_BDEV+0x20/0x20
[   10.312414]  [<ffffffff9818ad60>] ? find_get_entry+0x20/0x140
[   10.312417]  [<ffffffff98250a18>] blkdev_readpage+0x18/0x20
[   10.312420]  [<ffffffff9818cc7b>] generic_file_read_iter+0x1ab/0x850
[   10.312423]  [<ffffffff98250c77>] blkdev_read_iter+0x37/0x40
[   10.312427]  [<ffffffff9821571e>] __vfs_read+0xbe/0x130
[   10.312430]  [<ffffffff9821674e>] vfs_read+0x8e/0x140
[   10.312433]  [<ffffffff98217b56>] SyS_read+0x46/0xa0
[   10.312437]  [<ffffffff98216437>] ? SyS_lseek+0x87/0xb0
[   10.312440]  [<ffffffff988106f6>] entry_SYSCALL_64_fastpath+0x1e/0xa8

Is it some header field our application is not writing correctly for
which mdadm is reading wrong data and invalid sector value is formed
in bio ?

Regards,
Arka

On Mon, Nov 14, 2016 at 11:30 AM, NeilBrown <neilb@suse.com> wrote:
> On Fri, Nov 11 2016, Arka Sharma wrote:
>
>> Hi All,
>>
>> We have developed a RAID creation application which create RAID with
>> Ddf RAID metadata. We are using PCIe ssd as physical disks. We are
>> writing the anchor, primary, secondary headers, virtual and physical
>> records, configuration record and physical disk data. The offsets of
>> the headers are updated in the primary, secondary and anchor headers
>> correctly. The problem is when we try to boot to Ubuntu server and we
>> observe that mdadm is throwing a disk failure error message and from
>> block layer we are getting rw=0, want=7, limit=1000215216. We also
>> confirmed using there is no I/O error is coming from the PCIe ssd,
>> using a logic analyzer. Also the limit value 1000215216 is the
>> capacity of the ssd in 512 byte blocks. Any insight will be highly
>> appreciated.
>>
>
> It looks like mdadm is attempting a 4K read starting at the last sector.
>
> Possibly the ssd's report a physical sector size of 4K.
>
> I don't know how DDF is supposed to work on a device like that.
> Should the anchor be at the start of the last 4K block,
> or in the last 512byte virtual block?
>
> DDF support in mdadm was written with the assumption of 512 byte blocks.
>
> I'm not at all certain this is the cause of the problem though.
>
> I would suggest starting by finding out which READ request in mdadm is
> causing the error.
>
> NeilBrown

^ permalink raw reply

* Re: [md PATCH 1/4] md: add block tracing for bio_remapping
From: NeilBrown @ 2016-11-17  5:33 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid
In-Reply-To: <20161116192922.ylqx32zocag4xp4b@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 1557 bytes --]

On Thu, Nov 17 2016, Shaohua Li wrote:

> On Mon, Nov 14, 2016 at 04:30:21PM +1100, Neil Brown wrote:
>> The block tracing infrastructure (accessed with blktrace/blkparse)
>> supports the tracing of mapping bios from one device to another.
>> This is currently used when a bio in a partition is mapped to the
>> whole device, when bios are mapped by dm, and for mapping in md/raid5.
>> Other md personalities do not include this tracing yet, so add it.
>> 
>> When a read-error is detected we redirect the request to a different device.
>> This could justifiably be seen as a new mapping for the originial bio,
>> or a secondary mapping for the bio that errors.  This patch uses
>> the second option.
>> 
>> When md is used under dm-raid, the mappings are not traced as we do
>> not have access to the block device number of the parent.
>
> Looks the the original sector (the last parameter of trace_block_bio_remap)
> isn't correct.
> - in linear/raid0, bio_split already updated bio->bi_iter.bi_sector

Oh yes, of course.  in the common case 'split == bio' so when
split->bi_iter.bi_sector  is adjusted, bio->.... is as well.
I'll fix that, and also add calls to trace_block_split() as appropriate.


> - in raid1/raid10, r1_bio->sector is updated before the bio is sent.

Here I really think my code is correct.  r1_bio->sector is always the
address in the array of the request.  It is only set once for each
r1_bio, and that is before the call to trace_block_io_remap().

Thanks,
NeilBrown


>
> Thanks,
> Shaohua

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply

* Re: [md PATCH 2/4] md: add bio completion tracing for raid1/raid10
From: NeilBrown @ 2016-11-17  5:35 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Shaohua Li, linux-raid
In-Reply-To: <20161116143233.GA28615@infradead.org>

[-- Attachment #1: Type: text/plain, Size: 992 bytes --]

On Thu, Nov 17 2016, Christoph Hellwig wrote:

> On Mon, Nov 14, 2016 at 04:30:21PM +1100, NeilBrown wrote:
>> raid5 already has this, as does dm.
>> linear and raid0 do no see completions, only bio_chain_end() or bio_endio()
>> see those.
>> So just add it for raid1 and raid10.
>> 
>> Between
>>  Commit: 3a366e614d08 ("block: add missing block_bio_complete() tracepoint")
>> and
>>  Commit: 0a82a8d132b2 ("Revert "block: add missing block_bio_complete() tracepoint"")
>> in the 3.9-rc series, this was done centrally in bio_endio().
>> Until/unless that is resurected, do the tracing in the md/raid code.
>
> We're working on getting it back for 4.10, so please don't add these
> tracepoints to the MD driver for now.  Next time please also ask
> linux-block first and/or Cc the list on a patch like this.

Oh good, thanks for letting me know. I'll drop that one.
Do you know if there is any plan to include trace_block_split() in
bio_split()??

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply

* Re: mdadm I/O error with Ddf RAID
From: NeilBrown @ 2016-11-17  5:49 UTC (permalink / raw)
  To: Arka Sharma; +Cc: linux-raid
In-Reply-To: <CAPO=kN3psOzSYzbKZ5dAGK5uEk3JfTSf-Ec+S_UTL0iaD78Pdg@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 5274 bytes --]

On Thu, Nov 17 2016, Arka Sharma wrote:

> Thanks Neil for you response. The devices we are using have physical
> sector size of 512 bytes. We are writing the anchor header at last LBA
> of the devices. We also written the controller data, physical records,
> virtual records, physical data, config data to both primary and
> secondary. We have also dumped 32 MB of metadata for RAID created by
> mdadm and our tool. The guid's, crc's, timestamps are different as
> expected, and the other fields are mostly similar, except in between
> some header mdadm is writing 'FF' and we are writing 0's. We have also
> dumped the call trace in block layer and put following prints,
>
> [   10.312362] generic_make_request: bio->bi_iter.bi_sector=18446744073709551615
> [   10.312363] generic_make_request_checks:
> bio->bi_iter.bi_sector=18446744073709551615
> [   10.312364] bio_check_eod: nr_sector=8, max_sector=1000215216,
> bio->bi_iter.bi_sector=18446744073709551615

It might help if you provide A LOT more details.  Don't leave me
guessing.

You have shown me the output of some printks, but you haven't shown me
the patch which added the printks, so I cannot be sure how to interpret
them.

You originally said that mdadm was causing the error, but now it seems
that the error is coming after mdadm has already assembled the array and
something else is accessing it.

Have you tried running "mdadm --exmaine" on a component device?  I
previously assumed this would just crash because your original problem
report seemed to suggest that mdadm was getting a read error, but now
that doesn't seem to be the case.

Please be accurate and provide details.  Don't be afraid to send several
kilobytes of logs.  Err on the side of sending too much, not too
little.  Don't ever trim logs.

NeilBrown


>
> and following is the call trace
>
> [   10.312372]  [<ffffffff983d4b5c>] dump_stack+0x63/0x87

> [   10.312374]  [<ffffffff983a06c1>] generic_make_request_checks+0x2a1/0x550
> [   10.312375]  [<ffffffff980d54a9>] ? vprintk_default+0x29/0x40
> [   10.312377]  [<ffffffff983a2cf2>] generic_make_request+0x52/0x210
> [   10.312378]  [<ffffffff9839c0bf>] ? bio_clone_bioset+0x12f/0x320
> [   10.312380]  [<ffffffffc031ed6a>] raid1_make_request+0xa2a/0xe00 [raid1]
> [   10.312382]  [<ffffffff98192fcc>] ? get_page_from_freelist+0x36c/0xa50
> [   10.312383]  [<ffffffff9822f8d5>] ? __d_alloc+0x25/0x1d0
> [   10.312385]  [<ffffffff986739f2>] md_make_request+0xe2/0x230
> [   10.312387]  [<ffffffff980be164>] ? __wake_up+0x44/0x50
> [   10.312391]  [<ffffffff983a2dc6>] generic_make_request+0x126/0x210
> [   10.312395]  [<ffffffff98325e3b>] ? d_lookup_done.part.24+0x26/0x2b
> [   10.312398]  [<ffffffff983a2f3b>] submit_bio+0x8b/0x1a0
> [   10.312401]  [<ffffffff9839ab48>] ? bio_alloc_bioset+0x168/0x2a0
> [   10.312405]  [<ffffffff9824d9ac>] submit_bh_wbc+0x15c/0x1a0
> [   10.312408]  [<ffffffff9824e26c>] block_read_full_page+0x1ec/0x370
> [   10.312411]  [<ffffffff98250230>] ? I_BDEV+0x20/0x20
> [   10.312414]  [<ffffffff9818ad60>] ? find_get_entry+0x20/0x140
> [   10.312417]  [<ffffffff98250a18>] blkdev_readpage+0x18/0x20
> [   10.312420]  [<ffffffff9818cc7b>] generic_file_read_iter+0x1ab/0x850
> [   10.312423]  [<ffffffff98250c77>] blkdev_read_iter+0x37/0x40
> [   10.312427]  [<ffffffff9821571e>] __vfs_read+0xbe/0x130
> [   10.312430]  [<ffffffff9821674e>] vfs_read+0x8e/0x140
> [   10.312433]  [<ffffffff98217b56>] SyS_read+0x46/0xa0
> [   10.312437]  [<ffffffff98216437>] ? SyS_lseek+0x87/0xb0
> [   10.312440]  [<ffffffff988106f6>] entry_SYSCALL_64_fastpath+0x1e/0xa8
>
> Is it some header field our application is not writing correctly for
> which mdadm is reading wrong data and invalid sector value is formed
> in bio ?
>
> Regards,
> Arka
>
> On Mon, Nov 14, 2016 at 11:30 AM, NeilBrown <neilb@suse.com> wrote:
>> On Fri, Nov 11 2016, Arka Sharma wrote:
>>
>>> Hi All,
>>>
>>> We have developed a RAID creation application which create RAID with
>>> Ddf RAID metadata. We are using PCIe ssd as physical disks. We are
>>> writing the anchor, primary, secondary headers, virtual and physical
>>> records, configuration record and physical disk data. The offsets of
>>> the headers are updated in the primary, secondary and anchor headers
>>> correctly. The problem is when we try to boot to Ubuntu server and we
>>> observe that mdadm is throwing a disk failure error message and from
>>> block layer we are getting rw=0, want=7, limit=1000215216. We also
>>> confirmed using there is no I/O error is coming from the PCIe ssd,
>>> using a logic analyzer. Also the limit value 1000215216 is the
>>> capacity of the ssd in 512 byte blocks. Any insight will be highly
>>> appreciated.
>>>
>>
>> It looks like mdadm is attempting a 4K read starting at the last sector.
>>
>> Possibly the ssd's report a physical sector size of 4K.
>>
>> I don't know how DDF is supposed to work on a device like that.
>> Should the anchor be at the start of the last 4K block,
>> or in the last 512byte virtual block?
>>
>> DDF support in mdadm was written with the assumption of 512 byte blocks.
>>
>> I'm not at all certain this is the cause of the problem though.
>>
>> I would suggest starting by finding out which READ request in mdadm is
>> causing the error.
>>
>> NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply

* Re: Newly-created arrays don't auto-assemble - related to hostname change?
From: NeilBrown @ 2016-11-17  6:09 UTC (permalink / raw)
  To: Andy Smith, linux-raid
In-Reply-To: <20161117035230.GG21587@bitfolk.com>

[-- Attachment #1: Type: text/plain, Size: 6347 bytes --]

On Thu, Nov 17 2016, Andy Smith wrote:

> Hi,
>
> I feel I am missing something very simple here, as I usually don't
> have this issue, but here goes…
>
> I've got a Debian jessie host on which I created four arrays during
> install (md{0,1,2,3}), using the Debian installer and partman. These
> auto-assemble fine.
>
> After install the name of the server was changed from "tbd" to
> "jfd". Another array was then created (md5), added to
> /etc/mdadm/mdadm.conf and the initramfs was rebuilt
> (update-initramfs -u).
>
> md5 does not auto-assemble. It can be started fine after boot with:
>
>     # mdadm --assemble /dev/md5
>
> or:
>
>     # mdadm --incremental /dev/sdc
>     # mdadm --incremental /dev/sdd

This is almost exactly what udev does when the devices are discovered,
so if it works here, it should work when udev does it.

My only guess is that maybe the "DEVICE /dev/sd*" line in the mdadm.conf
is causing confusion.  udev might be using a different name, though that
would be odd.

Can you try removing that line and see if it makes a difference?

NeilBrown


>
> /etc/mdadm/mdadm.conf:
>
>     DEVICE /dev/sd*
>     CREATE owner=root group=disk mode=0660 auto=yes
>     HOMEHOST <ignore>
>     MAILADDR root
>     ARRAY /dev/md/0  metadata=1.2 UUID=400bac1d:e2c5d6ef:fea3b8c8:bcb70f8f
>     ARRAY /dev/md/1  metadata=1.2 UUID=e29c8b89:705f0116:d888f77e:2b6e32f5
>     ARRAY /dev/md/2  metadata=1.2 UUID=039b3427:4be5157a:6e2d53bd:fe898803
>     ARRAY /dev/md/3  metadata=1.2 UUID=30f745ce:7ed41b53:4df72181:7406ea1d
>     ARRAY /dev/md/5  metadata=1.2 UUID=957030cf:c09f023d:ceaebb27:e546f095
>
> I've unpacked the initramfs and looked at the mdadm.conf in there
> and it is identical.
>
> Initially HOMEHOST was set to <system> (the default), but I noticed
> when looking at --detail that md5 has:
>
>            Name : jfd:5  (local to host jfd)
>
> whereas the others have:
>
>            Name : tbd:0
>
> …so I changed it to <ignore> to see if that would help. It didn't.
>
> So, I'd really appreciate any hints as to what I've missed here!
>
> Here follows --detail and --examine of md5 and its members, then the
> contents of /proc/mdstat after I have manually assembled md5.
>
> $ sudo mdadm --detail /dev/md5
> /dev/md5:
>         Version : 1.2
>   Creation Time : Thu Nov 17 02:35:15 2016
>      Raid Level : raid10
>      Array Size : 1875243008 (1788.37 GiB 1920.25 GB)
>   Used Dev Size : 1875243008 (1788.37 GiB 1920.25 GB)
>    Raid Devices : 2
>   Total Devices : 2
>     Persistence : Superblock is persistent
>
>   Intent Bitmap : Internal
>
>     Update Time : Thu Nov 17 02:35:15 2016
>           State : clean 
>  Active Devices : 2
> Working Devices : 2
>  Failed Devices : 0
>   Spare Devices : 0
>
>          Layout : far=2
>      Chunk Size : 512K
>
>            Name : jfd:5  (local to host jfd)
>            UUID : 957030cf:c09f023d:ceaebb27:e546f095
>          Events : 0
>
>     Number   Major   Minor   RaidDevice State
>        0       8       48        0      active sync   /dev/sdd
>        1       8       32        1      active sync   /dev/sdc
>
> $ sudo mdadm --examine /dev/sd{c,d}
> /dev/sdc:
>           Magic : a92b4efc
>         Version : 1.2
>     Feature Map : 0x1
>      Array UUID : 957030cf:c09f023d:ceaebb27:e546f095
>            Name : jfd:5  (local to host jfd)
>   Creation Time : Thu Nov 17 02:35:15 2016
>      Raid Level : raid10
>    Raid Devices : 2
>
>  Avail Dev Size : 3750486704 (1788.37 GiB 1920.25 GB)
>      Array Size : 1875243008 (1788.37 GiB 1920.25 GB)
>   Used Dev Size : 3750486016 (1788.37 GiB 1920.25 GB)
>     Data Offset : 262144 sectors
>    Super Offset : 8 sectors
>    Unused Space : before=262056 sectors, after=688 sectors
>           State : clean
>     Device UUID : 4ac82c29:2d109465:7fff9b22:8c411c1e
>
> Internal Bitmap : 8 sectors from superblock
>     Update Time : Thu Nov 17 02:35:15 2016
>   Bad Block Log : 512 entries available at offset 72 sectors
>        Checksum : 96d669f1 - correct
>          Events : 0
>
>          Layout : far=2
>      Chunk Size : 512K
>
>    Device Role : Active device 1
>    Array State : AA ('A' == active, '.' == missing, 'R' == replacing)
> /dev/sdd:
>           Magic : a92b4efc
>         Version : 1.2
>     Feature Map : 0x1
>      Array UUID : 957030cf:c09f023d:ceaebb27:e546f095
>            Name : jfd:5  (local to host jfd)
>   Creation Time : Thu Nov 17 02:35:15 2016
>      Raid Level : raid10
>    Raid Devices : 2
>  Avail Dev Size : 3750486704 (1788.37 GiB 1920.25 GB)
>      Array Size : 1875243008 (1788.37 GiB 1920.25 GB)
>   Used Dev Size : 3750486016 (1788.37 GiB 1920.25 GB)
>     Data Offset : 262144 sectors
>    Super Offset : 8 sectors
>    Unused Space : before=262056 sectors, after=688 sectors
>           State : clean
>     Device UUID : 3a067652:6e88afae:82722342:0036bae0
>
> Internal Bitmap : 8 sectors from superblock
>     Update Time : Thu Nov 17 02:35:15 2016
>   Bad Block Log : 512 entries available at offset 72 sectors
>        Checksum : eb608799 - correct
>          Events : 0
>
>          Layout : far=2
>      Chunk Size : 512K
>
>    Device Role : Active device 0
>    Array State : AA ('A' == active, '.' == missing, 'R' == replacing)
>
> $ cat /proc/mdstat 
> Personalities : [raid1] [raid10] 
> md5 : active (auto-read-only) raid10 sdd[0] sdc[1]
>       1875243008 blocks super 1.2 512K chunks 2 far-copies [2/2] [UU]
>       bitmap: 0/14 pages [0KB], 65536KB chunk
>
> md3 : active raid10 sda5[0] sdb5[1]
>       12199936 blocks super 1.2 512K chunks 2 far-copies [2/2] [UU]
>       
> md2 : active (auto-read-only) raid10 sda3[0] sdb3[1]
>       975872 blocks super 1.2 512K chunks 2 far-copies [2/2] [UU]
>       
> md1 : active raid10 sda2[0] sdb2[1]
>       1951744 blocks super 1.2 512K chunks 2 far-copies [2/2] [UU]
>       
> md0 : active raid1 sda1[0] sdb1[1]
>       498368 blocks super 1.2 [2/2] [UU]
>       
> unused devices: <none>
>
> Cheers,
> Andy
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply

* Re: Just a thought - linux raid wiki - raid 5 grows getting stuck at 0%
From: NeilBrown @ 2016-11-17  6:18 UTC (permalink / raw)
  To: Wols Lists, linux-raid
In-Reply-To: <582CF0E5.2030605@youngman.org.uk>

[-- Attachment #1: Type: text/plain, Size: 2323 bytes --]

On Thu, Nov 17 2016, Wols Lists wrote:

> I've just been doing a bit of work on the wiki when I put 2 and 2
> together, and hope I haven't made 5 ...
>
> Bitmaps interfere with grow operations, I believe ...

Used to, yes.  Don't any more.

>
> And mdadm has recently been modified so that bitmaps are switched on by
> default, I also believe ...
>
> Could this be silently blocking the grow, so you don't get any error and
> it just sits there doing nothing?

Nope.  If it was a problem it would explicitly fail.

>
> But that brings up two points. Should mdadm now explicitly look for a
> bitmap when growing an array, and warn that the bitmap needs to be
> disabled? Apparently it currently comes up with some errors that hint at
> the cause rather than explicitly say so.

If you try to reshape an array which has a bitmap, on a kernel that
doesn't support reshaping arrays with bitmaps, mdadm hits this line of
code:

			if (err == EBUSY &&
			    (array.state & (1<<MD_SB_BITMAP_PRESENT)))
				cont_err("Bitmap must be removed before size can be changed\n");

or maybe this one

		pr_err("Cannot set array shape for %s\n",
		       devname);
		if (err == EBUSY &&
		    (info->array.state & (1<<MD_SB_BITMAP_PRESENT)))
			cont_err("       Bitmap must be removed before shape can be changed\n");

or

			if (err == EBUSY &&
			    (array.state & (1<<MD_SB_BITMAP_PRESENT)))
				cont_err("Bitmap must be removed before level can be changed\n");



>
> And secondly, I'm sure someone sent an email to the list that explained
> when a bitmap was created. I can't find it in my archives. The man page
> says it's created if the array is over 100G, but I'm sure the email gave
> rather more detail than that and that the figure actually varied.

	if (!s->bitmap_file &&
	    s->level >= 1 &&
	    st->ss->add_internal_bitmap &&
	    (s->write_behind || s->size > 100*1024*1024ULL)) {
		if (c->verbose > 0)
			pr_err("automatically enabling write-intent bitmap on large array\n");
		s->bitmap_file = "internal";
	}

So it looks like "over 100G" is the only test.

>
> But I'm sure you can see from this that I'm updating this bit of the
> wiki. Is there anything else I ought to know about bitmaps, so I can
> document it? :-)

If only I could run "diff" between my brain and the wiki....

Thanks for doing this!

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply

* Re: Just a thought - linux raid wiki - raid 5 grows getting stuck at 0%
From: Wols Lists @ 2016-11-17  9:14 UTC (permalink / raw)
  To: NeilBrown, linux-raid
In-Reply-To: <87inrmhbnh.fsf@notabene.neil.brown.name>

On 17/11/16 06:18, NeilBrown wrote:
>> > But I'm sure you can see from this that I'm updating this bit of the
>> > wiki. Is there anything else I ought to know about bitmaps, so I can
>> > document it? :-)

> If only I could run "diff" between my brain and the wiki....

:-)
> 
> Thanks for doing this!
> 
> NeilBrown

I'll update the wiki in the next day or so.

I'm a bit worried I'll spread myself too thin, but I've been following
the threads on the recent patch sets, and I've also picked up on the new
Sphinx kernel documentation stuff.

So my aim now is, on top of generally updating the wiki, to link it in
to the kernel doc, and update a lot of that doc using re-structured text
in the md source files. That's not going to be a problem, I hope?

I'm going to dig and find out more, but if anybody can point me to some
kerneldoc code in the md source, and the corresponding links to the doc
on the kernel.org website, that'll be wonderful. It'll just give me a
leg up on the hardest part - actually getting started.

And if there isn't any at present, it'll just tell me it's needed even
more :-)

Cheers,
Wol

^ permalink raw reply

* Re: [PATCH 1/2] raid5-cache: update superblock at shutdown/reboot
From: Wols Lists @ 2016-11-17  9:44 UTC (permalink / raw)
  To: NeilBrown, Shaohua Li, linux-raid; +Cc: songliubraving, Zhengyuan Liu
In-Reply-To: <87wpg2heg8.fsf@notabene.neil.brown.name>

On 17/11/16 05:18, NeilBrown wrote:
> On Thu, Nov 17 2016, Shaohua Li wrote:
> 
>> Currently raid5-cache update superblock in .quiesce. But since at
>> shutdown/reboot, .quiesce is called with reconfig mutex locked,
>> superblock isn't guaranteed to be called in reclaim thread (see
>> 8e018c21da3). This will make assemble do unnecessary journal recovery.
>> It doesn't corrupt data but is annoying.  This adds an extra hook to
>> guarantee journal is flushed to raid disks.  And since this hook is
>> called before superblock update, this will guarantee we have a uptodate
>> superblock in shutdown/reboot
> 
> Hi.
> I don't quite follow some of the reasoning here.
> In particular, the ->stop_writes() that you have implemented
> does almost exactly the same thing as r5l_quiesce(1).
> So why not simply call ->quiesce(mddev, 1) in __md_stop_writes()??
> You probably need to also call ->quiesce(mddev, 0) to keep things
> balanced.
> 
> Also you have introduced a static mutex (which isn't my favourite sort
> of thing) without giving any explanation why in the changelog comment.
> So I cannot easily see if that addition is at all justified.
> 
> Thanks,
> NeilBrown
> 
I need to be careful I don't ruffle any feathers here ...

But this is saying to me this is a nice feature that hasn't been
properly spec'd and thought through. Don't get me wrong, I know that -
in typical linux fashion - people have been adding things, and raid has
"just growed" topsy fashion. So it's incredibly difficult to spec a new
feature when you don't have a spec for the stuff you're building it on.

Anyways, what I'm saying is, it seems to me this caching stuff (it's a
new feature, iirc) would be great for trying to write out a proper spec
of what's meant to be going on. It'll roll over into spec'ing the stuff
it relies on ...

And yes, I *AM* volunteering to do the work - as I said elsewhere, I
want to put a load of kerneldoc into the raid source, and get to
understand it all, but the downside is you'll get a lot of newbie-ish
questions from me trying to get to grips with what's going on. I'm an
experienced C programmer but kernel style is alien to me - you know the
disconnect when you're reading something, you can read the words easily,
but you can't decipher the meaning. That's how I feel reading the kernel
source at the moment.

Are we up for it?

Cheers,
Wol


^ permalink raw reply

* Re: [md PATCH 2/4] md: add bio completion tracing for raid1/raid10
From: Christoph Hellwig @ 2016-11-17 12:51 UTC (permalink / raw)
  To: NeilBrown; +Cc: Christoph Hellwig, Shaohua Li, linux-raid
In-Reply-To: <87r36ahdn0.fsf@notabene.neil.brown.name>

On Thu, Nov 17, 2016 at 04:35:47PM +1100, NeilBrown wrote:
> Oh good, thanks for letting me know. I'll drop that one.
> Do you know if there is any plan to include trace_block_split() in
> bio_split()??

Not that I know off.  But doing so seems sensible, and I'm pretty
sure patches would be welcome.

^ permalink raw reply

* [PATCH v2 0/4] IMSM: Add support for 4Kn sector size drives
From: Pawel Baldysiak @ 2016-11-17 13:58 UTC (permalink / raw)
  To: jes.sorensen; +Cc: linux-raid, Pawel Baldysiak

This patch set adds support for IMSM with 4Kn sector size drives
First patch adds the generic function for receiving sector size,
rest are IMSM specific.
Internal calculation are still based on 512-bytes sector,
variables are converted during read/write from/to member drive.
Mixing of devices with different sector size is not allowed.

v2: Adjustments according to Jes' comments:
Patch 1/4: Fixed typo in original and copy-pasted function ("\b"->"\n");
Patch 2/4: Added missing "\n" at the end of error msg;
Patch 3/4, 4/4: Switched hardcoded 4k to #define;

Pawel Baldysiak (4):
  Add function for getting member drive sector size
  IMSM: Read and store device sector size
  IMSM: Add support for 4Kn sector size drives
  IMSM: 4Kn drives support - adapt general migration record

 mdadm.h       |   1 +
 super-intel.c | 323 +++++++++++++++++++++++++++++++++++++++++++++-------------
 super1.c      |   3 +-
 util.c        |  18 +++-
 4 files changed, 271 insertions(+), 74 deletions(-)

-- 
2.7.4


^ permalink raw reply

* [PATCH v2 1/4] Add function for getting member drive sector size
From: Pawel Baldysiak @ 2016-11-17 13:58 UTC (permalink / raw)
  To: jes.sorensen; +Cc: linux-raid, Pawel Baldysiak
In-Reply-To: <1479391118-31600-1-git-send-email-pawel.baldysiak@intel.com>

This patch introduces the function for getting sector size of
given device (fd).

Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
---
 mdadm.h  |  1 +
 super1.c |  3 +--
 util.c   | 18 +++++++++++++++++-
 3 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/mdadm.h b/mdadm.h
index 0516c82..1aeb232 100755
--- a/mdadm.h
+++ b/mdadm.h
@@ -1112,6 +1112,7 @@ static inline struct supertype *guess_super(int fd) {
 }
 extern struct supertype *dup_super(struct supertype *st);
 extern int get_dev_size(int fd, char *dname, unsigned long long *sizep);
+extern int get_dev_sector_size(int fd, char *dname, unsigned int *sectsizep);
 extern int must_be_container(int fd);
 extern int dev_size_from_id(dev_t id, unsigned long long *size);
 void wait_for(char *dev, int fd);
diff --git a/super1.c b/super1.c
index 4fef378..8f800b5 100644
--- a/super1.c
+++ b/super1.c
@@ -212,8 +212,7 @@ struct align_fd {
 static void init_afd(struct align_fd *afd, int fd)
 {
 	afd->fd = fd;
-
-	if (ioctl(afd->fd, BLKSSZGET, &afd->blk_sz) != 0)
+	if (!get_dev_sector_size(afd->fd, NULL, (unsigned int *)&afd->blk_sz))
 		afd->blk_sz = 512;
 }
 
diff --git a/util.c b/util.c
index 9e4718f..883eaa4 100644
--- a/util.c
+++ b/util.c
@@ -1324,7 +1324,7 @@ int get_dev_size(int fd, char *dname, unsigned long long *sizep)
 			ldsize <<= 9;
 		} else {
 			if (dname)
-				pr_err("Cannot get size of %s: %s\b",
+				pr_err("Cannot get size of %s: %s\n",
 					dname, strerror(errno));
 			return 0;
 		}
@@ -1333,6 +1333,22 @@ int get_dev_size(int fd, char *dname, unsigned long long *sizep)
 	return 1;
 }
 
+/* Return sector size of device in bytes */
+int get_dev_sector_size(int fd, char *dname, unsigned int *sectsizep)
+{
+	unsigned int sectsize;
+
+	if (ioctl(fd, BLKSSZGET, &sectsize) != 0) {
+		if (dname)
+			pr_err("Cannot get sector size of %s: %s\n",
+				dname, strerror(errno));
+		return 0;
+	}
+
+	*sectsizep = sectsize;
+	return 1;
+}
+
 /* Return true if this can only be a container, not a member device.
  * i.e. is and md device and size is zero
  */
-- 
2.7.4


^ permalink raw reply related

* [PATCH v2 2/4] IMSM: Read and store device sector size
From: Pawel Baldysiak @ 2016-11-17 13:58 UTC (permalink / raw)
  To: jes.sorensen; +Cc: linux-raid, Pawel Baldysiak
In-Reply-To: <1479391118-31600-1-git-send-email-pawel.baldysiak@intel.com>

This patch adds retriving device sector size at startup
and set it in intel_super, so it can be used in other places.

Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
---
 super-intel.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/super-intel.c b/super-intel.c
index 21e8532..a1b5cfb 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -366,6 +366,7 @@ struct intel_super {
 	unsigned long long create_offset; /* common start for 'current_vol' */
 	__u32 random; /* random data for seeding new family numbers */
 	struct intel_dev *devlist;
+	unsigned int sector_size; /* sector size of used member drives */
 	struct dl {
 		struct dl *next;
 		int index;
@@ -4491,6 +4492,7 @@ static int get_super_block(struct intel_super **super_list, char *devnm, char *d
 		goto error;
 	}
 
+	get_dev_sector_size(dfd, NULL, &s->sector_size);
 	find_intel_hba_capability(dfd, s, devname);
 	err = load_and_parse_mpb(dfd, s, NULL, keep_fd);
 
@@ -4570,6 +4572,7 @@ static int load_super_imsm(struct supertype *st, int fd, char *devname)
 	free_super_imsm(st);
 
 	super = alloc_super();
+	get_dev_sector_size(fd, NULL, &super->sector_size);
 	/* Load hba and capabilities if they exist.
 	 * But do not preclude loading metadata in case capabilities or hba are
 	 * non-compliant and ignore_hw_compat is set.
@@ -5102,6 +5105,7 @@ static int add_to_super_imsm(struct supertype *st, mdu_disk_info_t *dk,
 	struct intel_super *super = st->sb;
 	struct dl *dd;
 	unsigned long long size;
+	unsigned int member_sector_size;
 	__u32 id;
 	int rv;
 	struct stat stb;
@@ -5182,6 +5186,19 @@ static int add_to_super_imsm(struct supertype *st, mdu_disk_info_t *dk,
 	}
 
 	get_dev_size(fd, NULL, &size);
+	get_dev_sector_size(fd, NULL, &member_sector_size);
+
+	if (super->sector_size == 0) {
+		/* this a first device, so sector_size is not set yet */
+		super->sector_size = member_sector_size;
+	} else if (member_sector_size != super->sector_size) {
+		pr_err("Mixing between different sector size is forbidden, aborting...\n");
+		if (dd->devname)
+			free(dd->devname);
+		free(dd);
+		return 1;
+	}
+
 	/* clear migr_rec when adding disk to container */
 	memset(super->migr_rec_buf, 0, MIGR_REC_BUF_SIZE);
 	if (lseek64(fd, size - MIGR_REC_POSITION, SEEK_SET) >= 0) {
@@ -5529,6 +5546,12 @@ static int validate_geometry_imsm_container(struct supertype *st, int level,
 	 * note that there is no fd for the disks in array.
 	 */
 	super = alloc_super();
+	if (!get_dev_sector_size(fd, NULL, &super->sector_size)) {
+		close(fd);
+		free_imsm(super);
+		return 0;
+	}
+
 	rv = find_intel_hba_capability(fd, super, verbose > 0 ? dev : NULL);
 	if (rv != 0) {
 #if DEBUG
-- 
2.7.4


^ permalink raw reply related

* [PATCH v2 3/4] IMSM: Add support for 4Kn sector size drives
From: Pawel Baldysiak @ 2016-11-17 13:58 UTC (permalink / raw)
  To: jes.sorensen; +Cc: linux-raid, Pawel Baldysiak, Tomasz Majchrzak
In-Reply-To: <1479391118-31600-1-git-send-email-pawel.baldysiak@intel.com>

This patch adds support for drives with 4Kn sector size
for IMSM metadata. Mixing member drives with 4kn and 512
is not allowed. Some offsets were aligned with sector size.
Internal metadata representation and all calculations
are still based on 512-byte sector sizes. This
implementation converts only sector based values
when reading/writing to drive, because they needs to be
stored in metadata according to accual member drive sector size.

Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
---
 super-intel.c | 201 +++++++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 156 insertions(+), 45 deletions(-)

diff --git a/super-intel.c b/super-intel.c
index a1b5cfb..a1114ac 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -90,6 +90,7 @@
 #define IMSM_RESERVED_SECTORS 4096
 #define NUM_BLOCKS_DIRTY_STRIPE_REGION 2056
 #define SECT_PER_MB_SHIFT 11
+#define MAX_SECTOR_SIZE 4096
 
 /* Disk configuration info. */
 #define IMSM_MAX_DEVICES 255
@@ -318,14 +319,15 @@ static void set_migr_type(struct imsm_dev *dev, __u8 migr_type)
 	}
 }
 
-static unsigned int sector_count(__u32 bytes)
+static unsigned int sector_count(__u32 bytes, unsigned int sector_size)
 {
-	return ROUND_UP(bytes, 512) / 512;
+	return ROUND_UP(bytes, sector_size) / sector_size;
 }
 
-static unsigned int mpb_sectors(struct imsm_super *mpb)
+static unsigned int mpb_sectors(struct imsm_super *mpb,
+					unsigned int sector_size)
 {
-	return sector_count(__le32_to_cpu(mpb->mpb_size));
+	return sector_count(__le32_to_cpu(mpb->mpb_size), sector_size);
 }
 
 struct intel_dev {
@@ -915,12 +917,12 @@ static unsigned long long num_data_stripes(struct imsm_map *map)
 		return 0;
 	return join_u32(map->num_data_stripes_lo, map->num_data_stripes_hi);
 }
+#endif
 
 static void set_total_blocks(struct imsm_disk *disk, unsigned long long n)
 {
 	split_ull(n, &disk->total_blocks_lo, &disk->total_blocks_hi);
 }
-#endif
 
 static void set_pba_of_lba0(struct imsm_map *map, unsigned long long n)
 {
@@ -1122,6 +1124,8 @@ static unsigned long long min_acceptable_spare_size_imsm(struct supertype *st)
 
 static int is_gen_migration(struct imsm_dev *dev);
 
+#define IMSM_4K_DIV 8
+
 #ifndef MDASSEMBLE
 static __u64 blocks_per_migr_unit(struct intel_super *super,
 				  struct imsm_dev *dev);
@@ -1253,6 +1257,48 @@ static void print_imsm_disk(struct imsm_disk *disk, int index, __u32 reserved)
 	       human_size(sz * 512));
 }
 
+void convert_to_4k_imsm_disk(struct imsm_disk *disk)
+{
+	set_total_blocks(disk, (total_blocks(disk)/IMSM_4K_DIV));
+}
+
+void convert_to_4k(struct intel_super *super)
+{
+	struct imsm_super *mpb = super->anchor;
+	struct imsm_disk *disk;
+	int i;
+
+	for (i = 0; i < mpb->num_disks ; i++) {
+		disk = __get_imsm_disk(mpb, i);
+		/* disk */
+		convert_to_4k_imsm_disk(disk);
+	}
+	for (i = 0; i < mpb->num_raid_devs; i++) {
+		struct imsm_dev *dev = __get_imsm_dev(mpb, i);
+		struct imsm_map *map = get_imsm_map(dev, MAP_0);
+		/* dev */
+		split_ull((join_u32(dev->size_low, dev->size_high)/IMSM_4K_DIV),
+				 &dev->size_low, &dev->size_high);
+		dev->vol.curr_migr_unit /= IMSM_4K_DIV;
+
+		/* map0 */
+		set_blocks_per_member(map, blocks_per_member(map)/IMSM_4K_DIV);
+		map->blocks_per_strip /= IMSM_4K_DIV;
+		set_pba_of_lba0(map, pba_of_lba0(map)/IMSM_4K_DIV);
+
+		if (dev->vol.migr_state) {
+			/* map1 */
+			map = get_imsm_map(dev, MAP_1);
+			set_blocks_per_member(map,
+			    blocks_per_member(map)/IMSM_4K_DIV);
+			map->blocks_per_strip /= IMSM_4K_DIV;
+			set_pba_of_lba0(map, pba_of_lba0(map)/IMSM_4K_DIV);
+		}
+	}
+
+	mpb->check_sum = __gen_imsm_checksum(mpb);
+}
+
 void examine_migr_rec_imsm(struct intel_super *super)
 {
 	struct migr_record *migr_rec = super->migr_rec;
@@ -1310,6 +1356,45 @@ void examine_migr_rec_imsm(struct intel_super *super)
 	}
 }
 #endif /* MDASSEMBLE */
+
+void convert_from_4k(struct intel_super *super)
+{
+	struct imsm_super *mpb = super->anchor;
+	struct imsm_disk *disk;
+	int i;
+
+	for (i = 0; i < mpb->num_disks ; i++) {
+		disk = __get_imsm_disk(mpb, i);
+		/* disk */
+		set_total_blocks(disk, (total_blocks(disk)*IMSM_4K_DIV));
+	}
+
+	for (i = 0; i < mpb->num_raid_devs; i++) {
+		struct imsm_dev *dev = __get_imsm_dev(mpb, i);
+		struct imsm_map *map = get_imsm_map(dev, MAP_0);
+		/* dev */
+		split_ull((join_u32(dev->size_low, dev->size_high)*IMSM_4K_DIV),
+				 &dev->size_low, &dev->size_high);
+		dev->vol.curr_migr_unit *= IMSM_4K_DIV;
+
+		/* map0 */
+		set_blocks_per_member(map, blocks_per_member(map)*IMSM_4K_DIV);
+		map->blocks_per_strip *= IMSM_4K_DIV;
+		set_pba_of_lba0(map, pba_of_lba0(map)*IMSM_4K_DIV);
+
+		if (dev->vol.migr_state) {
+			/* map1 */
+			map = get_imsm_map(dev, MAP_1);
+			set_blocks_per_member(map,
+			    blocks_per_member(map)*IMSM_4K_DIV);
+			map->blocks_per_strip *= IMSM_4K_DIV;
+			set_pba_of_lba0(map, pba_of_lba0(map)*IMSM_4K_DIV);
+		}
+	}
+
+	mpb->check_sum = __gen_imsm_checksum(mpb);
+}
+
 /*******************************************************************************
  * function: imsm_check_attributes
  * Description: Function checks if features represented by attributes flags
@@ -1430,7 +1515,7 @@ static void examine_super_imsm(struct supertype *st, char *homehost)
 	sum = __le32_to_cpu(mpb->check_sum);
 	printf("       Checksum : %08x %s\n", sum,
 		__gen_imsm_checksum(mpb) == sum ? "correct" : "incorrect");
-	printf("    MPB Sectors : %d\n", mpb_sectors(mpb));
+	printf("    MPB Sectors : %d\n", mpb_sectors(mpb, super->sector_size));
 	printf("          Disks : %d\n", mpb->num_disks);
 	printf("   RAID Devices : %d\n", mpb->num_raid_devs);
 	print_imsm_disk(__get_imsm_disk(mpb, super->disks->index), super->disks->index, reserved);
@@ -1527,7 +1612,7 @@ static void export_examine_super_imsm(struct supertype *st)
 
 static int copy_metadata_imsm(struct supertype *st, int from, int to)
 {
-	/* The second last 512byte sector of the device contains
+	/* The second last sector of the device contains
 	 * the "struct imsm_super" metadata.
 	 * This contains mpb_size which is the size in bytes of the
 	 * extended metadata.  This is located immediately before
@@ -1540,7 +1625,9 @@ static int copy_metadata_imsm(struct supertype *st, int from, int to)
 	unsigned long long dsize, offset;
 	int sectors;
 	struct imsm_super *sb;
-	int written = 0;
+	struct intel_super *super = st->sb;
+	unsigned int sector_size = super->sector_size;
+	unsigned int written = 0;
 
 	if (posix_memalign(&buf, 4096, 4096) != 0)
 		return 1;
@@ -1548,21 +1635,21 @@ static int copy_metadata_imsm(struct supertype *st, int from, int to)
 	if (!get_dev_size(from, NULL, &dsize))
 		goto err;
 
-	if (lseek64(from, dsize-1024, 0) < 0)
+	if (lseek64(from, dsize-(2*sector_size), 0) < 0)
 		goto err;
-	if (read(from, buf, 512) != 512)
+	if (read(from, buf, sector_size) != sector_size)
 		goto err;
 	sb = buf;
 	if (strncmp((char*)sb->sig, MPB_SIGNATURE, MPB_SIG_LEN) != 0)
 		goto err;
 
-	sectors = mpb_sectors(sb) + 2;
-	offset = dsize - sectors * 512;
+	sectors = mpb_sectors(sb, sector_size) + 2;
+	offset = dsize - sectors * sector_size;
 	if (lseek64(from, offset, 0) < 0 ||
 	    lseek64(to, offset, 0) < 0)
 		goto err;
-	while (written < sectors * 512) {
-		int n = sectors*512 - written;
+	while (written < sectors * sector_size) {
+		int n = sectors*sector_size - written;
 		if (n > 4096)
 			n = 4096;
 		if (read(from, buf, n) != n)
@@ -2678,13 +2765,14 @@ int imsm_reshape_blocks_arrays_changes(struct intel_super *super)
 }
 static unsigned long long imsm_component_size_aligment_check(int level,
 					      int chunk_size,
+					      unsigned int sector_size,
 					      unsigned long long component_size)
 {
 	unsigned int component_size_alligment;
 
 	/* check component size aligment
 	*/
-	component_size_alligment = component_size % (chunk_size/512);
+	component_size_alligment = component_size % (chunk_size/sector_size);
 
 	dprintf("(Level: %i, chunk_size = %i, component_size = %llu), component_size_alligment = %u\n",
 		level, chunk_size, component_size,
@@ -2795,6 +2883,7 @@ static void getinfo_super_imsm_volume(struct supertype *st, struct mdinfo *info,
 	info->component_size = imsm_component_size_aligment_check(
 							info->array.level,
 							info->array.chunk_size,
+							super->sector_size,
 							info->component_size);
 
 	memset(info->uuid, 0, sizeof(info->uuid));
@@ -3615,8 +3704,9 @@ static int parse_raid_devices(struct intel_super *super)
 	if (__le32_to_cpu(mpb->mpb_size) + space_needed > super->len) {
 		void *buf;
 
-		len = ROUND_UP(__le32_to_cpu(mpb->mpb_size) + space_needed, 512);
-		if (posix_memalign(&buf, 512, len) != 0)
+		len = ROUND_UP(__le32_to_cpu(mpb->mpb_size) + space_needed,
+			      super->sector_size);
+		if (posix_memalign(&buf, MAX_SECTOR_SIZE, len) != 0)
 			return 1;
 
 		memcpy(buf, super->buf, super->len);
@@ -3689,31 +3779,32 @@ static int load_imsm_mpb(int fd, struct intel_super *super, char *devname)
 {
 	unsigned long long dsize;
 	unsigned long long sectors;
+	unsigned int sector_size = super->sector_size;
 	struct stat;
 	struct imsm_super *anchor;
 	__u32 check_sum;
 
 	get_dev_size(fd, NULL, &dsize);
-	if (dsize < 1024) {
+	if (dsize < 2*sector_size) {
 		if (devname)
 			pr_err("%s: device to small for imsm\n",
 			       devname);
 		return 1;
 	}
 
-	if (lseek64(fd, dsize - (512 * 2), SEEK_SET) < 0) {
+	if (lseek64(fd, dsize - (sector_size * 2), SEEK_SET) < 0) {
 		if (devname)
 			pr_err("Cannot seek to anchor block on %s: %s\n",
 			       devname, strerror(errno));
 		return 1;
 	}
 
-	if (posix_memalign((void**)&anchor, 512, 512) != 0) {
+	if (posix_memalign((void **)&anchor, sector_size, sector_size) != 0) {
 		if (devname)
 			pr_err("Failed to allocate imsm anchor buffer on %s\n", devname);
 		return 1;
 	}
-	if (read(fd, anchor, 512) != 512) {
+	if (read(fd, anchor, sector_size) != sector_size) {
 		if (devname)
 			pr_err("Cannot read anchor block on %s: %s\n",
 			       devname, strerror(errno));
@@ -3733,17 +3824,17 @@ static int load_imsm_mpb(int fd, struct intel_super *super, char *devname)
 
 	/* capability and hba must be updated with new super allocation */
 	find_intel_hba_capability(fd, super, devname);
-	super->len = ROUND_UP(anchor->mpb_size, 512);
-	if (posix_memalign(&super->buf, 512, super->len) != 0) {
+	super->len = ROUND_UP(anchor->mpb_size, sector_size);
+	if (posix_memalign(&super->buf, MAX_SECTOR_SIZE, super->len) != 0) {
 		if (devname)
 			pr_err("unable to allocate %zu byte mpb buffer\n",
 			       super->len);
 		free(anchor);
 		return 2;
 	}
-	memcpy(super->buf, anchor, 512);
+	memcpy(super->buf, anchor, sector_size);
 
-	sectors = mpb_sectors(anchor) - 1;
+	sectors = mpb_sectors(anchor, sector_size) - 1;
 	free(anchor);
 
 	if (posix_memalign(&super->migr_rec_buf, 512, MIGR_REC_BUF_SIZE) != 0) {
@@ -3768,14 +3859,15 @@ static int load_imsm_mpb(int fd, struct intel_super *super, char *devname)
 	}
 
 	/* read the extended mpb */
-	if (lseek64(fd, dsize - (512 * (2 + sectors)), SEEK_SET) < 0) {
+	if (lseek64(fd, dsize - (sector_size * (2 + sectors)), SEEK_SET) < 0) {
 		if (devname)
 			pr_err("Cannot seek to extended mpb on %s: %s\n",
 			       devname, strerror(errno));
 		return 1;
 	}
 
-	if ((unsigned)read(fd, super->buf + 512, super->len - 512) != super->len - 512) {
+	if ((unsigned int)read(fd, super->buf + sector_size,
+		    super->len - sector_size) != super->len - sector_size) {
 		if (devname)
 			pr_err("Cannot read extended mpb on %s: %s\n",
 			       devname, strerror(errno));
@@ -3836,6 +3928,8 @@ load_and_parse_mpb(int fd, struct intel_super *super, char *devname, int keep_fd
 	err = load_imsm_mpb(fd, super, devname);
 	if (err)
 		return err;
+	if (super->sector_size == 4096)
+		convert_from_4k(super);
 	err = load_imsm_disk(fd, super, devname, keep_fd);
 	if (err)
 		return err;
@@ -4733,6 +4827,7 @@ static int init_super_imsm_volume(struct supertype *st, mdu_array_info_t *info,
 	 * so st->sb is already set.
 	 */
 	struct intel_super *super = st->sb;
+	unsigned int sector_size = super->sector_size;
 	struct imsm_super *mpb = super->anchor;
 	struct intel_dev *dv;
 	struct imsm_dev *dev;
@@ -4754,9 +4849,9 @@ static int init_super_imsm_volume(struct supertype *st, mdu_array_info_t *info,
 	size_new = disks_to_mpb_size(info->nr_disks);
 	if (size_new > size_old) {
 		void *mpb_new;
-		size_t size_round = ROUND_UP(size_new, 512);
+		size_t size_round = ROUND_UP(size_new, sector_size);
 
-		if (posix_memalign(&mpb_new, 512, size_round) != 0) {
+		if (posix_memalign(&mpb_new, sector_size, size_round) != 0) {
 			pr_err("could not allocate new mpb\n");
 			return 0;
 		}
@@ -4911,10 +5006,11 @@ static int init_super_imsm(struct supertype *st, mdu_array_info_t *info,
 	if (info)
 		mpb_size = disks_to_mpb_size(info->nr_disks);
 	else
-		mpb_size = 512;
+		mpb_size = MAX_SECTOR_SIZE;
 
 	super = alloc_super();
-	if (super && posix_memalign(&super->buf, 512, mpb_size) != 0) {
+	if (super &&
+	    posix_memalign(&super->buf, MAX_SECTOR_SIZE, mpb_size) != 0) {
 		free(super);
 		super = NULL;
 	}
@@ -5261,9 +5357,9 @@ static int remove_from_super_imsm(struct supertype *st, mdu_disk_info_t *dk)
 static int store_imsm_mpb(int fd, struct imsm_super *mpb);
 
 static union {
-	char buf[512];
+	char buf[MAX_SECTOR_SIZE];
 	struct imsm_super anchor;
-} spare_record __attribute__ ((aligned(512)));
+} spare_record __attribute__ ((aligned(MAX_SECTOR_SIZE)));
 
 /* spare records have their own family number and do not have any defined raid
  * devices
@@ -5294,6 +5390,9 @@ static int write_super_imsm_spares(struct intel_super *super, int doclose)
 		if (__le32_to_cpu(d->disk.total_blocks_hi) > 0)
 			spare->attributes |= MPB_ATTRIB_2TB_DISK;
 
+		if (super->sector_size == 4096)
+			convert_to_4k_imsm_disk(&spare->disk[0]);
+
 		sum = __gen_imsm_checksum(spare);
 		spare->family_num = __cpu_to_le32(sum);
 		spare->orig_family_num = 0;
@@ -5317,6 +5416,7 @@ static int write_super_imsm_spares(struct intel_super *super, int doclose)
 static int write_super_imsm(struct supertype *st, int doclose)
 {
 	struct intel_super *super = st->sb;
+	unsigned int sector_size = super->sector_size;
 	struct imsm_super *mpb = super->anchor;
 	struct dl *d;
 	__u32 generation;
@@ -5377,6 +5477,9 @@ static int write_super_imsm(struct supertype *st, int doclose)
 	if (clear_migration_record)
 		memset(super->migr_rec_buf, 0, MIGR_REC_BUF_SIZE);
 
+	if (sector_size == 4096)
+		convert_to_4k(super);
+
 	/* write the mpb for disks that compose raid devices */
 	for (d = super->disks; d ; d = d->next) {
 		if (d->index < 0 || is_failed(&d->disk))
@@ -5500,6 +5603,8 @@ static int store_super_imsm(struct supertype *st, int fd)
 		return 1;
 
 #ifndef MDASSEMBLE
+	if (super->sector_size == 4096)
+		convert_to_4k(super);
 	return store_imsm_mpb(fd, mpb);
 #else
 	return 1;
@@ -7574,27 +7679,30 @@ static int store_imsm_mpb(int fd, struct imsm_super *mpb)
 	__u32 mpb_size = __le32_to_cpu(mpb->mpb_size);
 	unsigned long long dsize;
 	unsigned long long sectors;
+	unsigned int sector_size;
 
+	get_dev_sector_size(fd, NULL, &sector_size);
 	get_dev_size(fd, NULL, &dsize);
 
-	if (mpb_size > 512) {
+	if (mpb_size > sector_size) {
 		/* -1 to account for anchor */
-		sectors = mpb_sectors(mpb) - 1;
+		sectors = mpb_sectors(mpb, sector_size) - 1;
 
 		/* write the extended mpb to the sectors preceeding the anchor */
-		if (lseek64(fd, dsize - (512 * (2 + sectors)), SEEK_SET) < 0)
+		if (lseek64(fd, dsize - (sector_size * (2 + sectors)),
+		   SEEK_SET) < 0)
 			return 1;
 
-		if ((unsigned long long)write(fd, buf + 512, 512 * sectors)
-		    != 512 * sectors)
+		if ((unsigned long long)write(fd, buf + sector_size,
+		   sector_size * sectors) != sector_size * sectors)
 			return 1;
 	}
 
 	/* first block is stored on second to last sector of the disk */
-	if (lseek64(fd, dsize - (512 * 2), SEEK_SET) < 0)
+	if (lseek64(fd, dsize - (sector_size * 2), SEEK_SET) < 0)
 		return 1;
 
-	if (write(fd, buf, 512) != 512)
+	if (write(fd, buf, sector_size) != sector_size)
 		return 1;
 
 	return 0;
@@ -8830,6 +8938,7 @@ static int imsm_prepare_update(struct supertype *st,
 	 */
 	enum imsm_update_type type;
 	struct intel_super *super = st->sb;
+	unsigned int sector_size = super->sector_size;
 	struct imsm_super *mpb = super->anchor;
 	size_t buf_len;
 	size_t len = 0;
@@ -9066,12 +9175,13 @@ static int imsm_prepare_update(struct supertype *st,
 		 * if this allocation fails process_update will notice that
 		 * ->next_len is set and ->next_buf is NULL
 		 */
-		buf_len = ROUND_UP(__le32_to_cpu(mpb->mpb_size) + len, 512);
+		buf_len = ROUND_UP(__le32_to_cpu(mpb->mpb_size) + len,
+				  sector_size);
 		if (super->next_buf)
 			free(super->next_buf);
 
 		super->next_len = buf_len;
-		if (posix_memalign(&super->next_buf, 512, buf_len) == 0)
+		if (posix_memalign(&super->next_buf, sector_size, buf_len) == 0)
 			memset(super->next_buf, 0, buf_len);
 		else
 			super->next_buf = NULL;
@@ -9566,6 +9676,7 @@ int recover_backup_imsm(struct supertype *st, struct mdinfo *info)
 	int new_disks, i, err;
 	char *buf = NULL;
 	int retval = 1;
+	unsigned int sector_size = super->sector_size;
 	unsigned long curr_migr_unit = __le32_to_cpu(migr_rec->curr_migr_unit);
 	unsigned long num_migr_units = __le32_to_cpu(migr_rec->num_migr_units);
 	char buffer[20];
@@ -9602,7 +9713,7 @@ int recover_backup_imsm(struct supertype *st, struct mdinfo *info)
 			pba_of_lba0(map_dest)) * 512;
 
 	unit_len = __le32_to_cpu(migr_rec->dest_depth_per_unit) * 512;
-	if (posix_memalign((void **)&buf, 512, unit_len) != 0)
+	if (posix_memalign((void **)&buf, sector_size, unit_len) != 0)
 		goto abort;
 	targets = xcalloc(new_disks, sizeof(int));
 
@@ -10148,7 +10259,7 @@ enum imsm_reshape_type imsm_analyze_change(struct supertype *st,
 		 */
 		geo->size = imsm_component_size_aligment_check(
 				    get_imsm_raid_level(dev->vol.map),
-				    chunk * 1024,
+				    chunk * 1024, super->sector_size,
 				    geo->size * 2);
 		if (geo->size == 0) {
 			pr_err("Error. Size expansion is supported only (current size is %llu, requested size /rounded/ is 0).\n",
@@ -10182,7 +10293,7 @@ enum imsm_reshape_type imsm_analyze_change(struct supertype *st,
 			 */
 			max_size = imsm_component_size_aligment_check(
 					get_imsm_raid_level(dev->vol.map),
-					chunk * 1024,
+					chunk * 1024, super->sector_size,
 					max_size);
 		}
 		if (geo->size == MAX_SIZE) {
-- 
2.7.4


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox