Linux RAID subsystem development

Linux RAID subsystem development
 help / color / mirror / Atom feed

* Re: Superblock of raid5 log can't be updated when array stoped
From: Shaohua Li @ 2016-10-21 22:43 UTC (permalink / raw)
  To: Zhengyuan Liu; +Cc: shli, Song Liu, linux-raid
In-Reply-To: <tencent_5F7A406104D26FAD7587F1C4@qq.com>

On Tue, Oct 18, 2016 at 01:51:57PM +0800, Zhengyuan Liu wrote:
> If that's the problem, I think there is another problem about next_checkpoint initialization.
> No initial operation was done to this field when we loaded/recovery  the log , it got
> assignment  only when IO to raid disk was finished.  So r5l_quiesce may use wrong 
> next_checkpoint to reclaim log space, that would confuse log recovery.  Bellow patch
> may help the expression.

Good catch. This doesn't confuse log recovery but maks reclaimable space
calculation confused. Could you please send a formal patch?

Thanks,
Shaohua 
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index 1b1ab4a..998ea00 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -1096,6 +1096,8 @@ static int r5l_recovery_log(struct r5l_log *log)
>                 log->seq = ctx.seq + 11;
>                 log->log_start = r5l_ring_add(log, ctx.pos, BLOCK_SECTORS);
>                 r5l_write_super(log, ctx.pos);
> +               log->last_checkpoint = ctx.pos;
> +               log->next_checkpoint = ctx.pos;
>         } else {
>                 log->log_start = ctx.pos;
>                 log->seq = ctx.seq;
> @@ -1168,6 +1170,7 @@ create:
>         if (log->max_free_space > RECLAIM_MAX_FREE_SPACE)
>                 log->max_free_space = RECLAIM_MAX_FREE_SPACE;
>         log->last_checkpoint = cp;
> +       log->next_checkpoint = cp;
>  
>         __free_page(page);
> 
> Thanks,
> --Zhengyuan
> 
> ------------------ Original ------------------
> From:  "Shaohua Li"<shli@kernel.org>;
> Date:  Tue, Oct 18, 2016 08:28 AM
> To:  "Zhengyuan Liu"<liuzhengyuan@kylinos.cn>;
> Cc:  "shli"<shli@fb.com>; "Song Liu"<songliubraving@fb.com>; "linux-raid"<linux-raid@vger.kernel.org>;
> Subject:  Re: Superblock of raid5 log can't be updated when array stoped
>  
> On Sat, Oct 15, 2016 at 10:19:36AM +0800, liuzhengyuan wrote:
> > Hi, Shaohua.
> > 
> > when we stop raid5 array with "mdadm -S" or reboot the system,  md module will 
> > call raid5_quiesce and r5l_quiesce to do some clean work. Some code of r5l_quiesce
> > was pasted bellow.
> > 
> >                 /* make sure r5l_write_super_and_discard_space exits */
> >                 mddev = log->rdev->mddev;
> >                 wake_up(&mddev->sb_wait);
> >                 r5l_wake_reclaim(log, -1L);
> >                 md_unregister_thread(&log->reclaim_thread);
> >                 r5l_do_reclaim(log);
> > +                md_update_sb(mddev, 1);
> > 
> > It will reclaim all used space of log and call r5l_write_super to reset rdev->journal_tail
> > to log->next_checkpoint . However, new rdev->journal_tail would not be written to 
> > journal device for persistent because journal device may not support discard operation
> > or due to mddev_trylock fail (this trylock should always get failed since raid5_quiesce 
> > was called with  reconfig_mutex hold, isn't it?). As a result, it will take a long time to
> >  recovery the log when the arrary was restarted. Should r5l_quiesce call md_update_sb
> >  directly to guarantee  superblock  update?
> 
> Yep, that's problem here. Unfortunately we can't call md_update_sb here,
> because we might not hold the mddev lock. I think we should call it at do_md_stop.
> 
> Thanks,
> Shaohua

^ permalink raw reply

* Re: How to fix mistake on raid: mdadm create instead of assemble?
From: Shaohua Li @ 2016-10-21 22:35 UTC (permalink / raw)
  To: Santiago DIEZ; +Cc: Linux Raid LIST, songliubraving, Jes.Sorensen
In-Reply-To: <CAJh8RqU7kcTrp_1Rgorx02OR5H5bm-a8=DSvQz-Bkrg7S3357w@mail.gmail.com>

On Fri, Oct 21, 2016 at 10:45:10AM +0200, Santiago DIEZ wrote:
> Hi,
> 
> Thanks Andreas,
> 
> Yes apparently, 3/4 of the original disks seem to be safe. But I'm
> terrified at the idea of doing something wrong assembling them.
> Incidentally, I indeed did a mistake trying to assemble the ddrescue
> images of the 3 safe disks. I tried to create again with proper
> metadata and chunck but it did not work. I'm still scared at the idea
> of restarting the original raid. I'm currently ddrescuing again the 3
> partitions to then try and *assemble* them rather than *create*.
> 
> 
> Thanks Wol,
> 
> I use loop devices because I work on partition images, not on actual partitions:
> I use ddrescue to copy data from /dev/sd[abc]10 to
> some.other.server:/home/sd[abc].img
> Then I go to some.other.server and turn the images into loop devices :
> losetup /dev/loop0 /home/sda10.img
> losetup /dev/loop1 /home/sdb10.img
> losetup /dev/loop2 /home/sdc10.img
> Then I tried to created the raid, it worked but as I said, the
> filesystem was unreadable.
> I know the idea of using loop devices works because I tested it before.
> I'm doing the whole procedure all over again (takes 5 days to ddrescue
> the 3 partitions to another server) and then I will use the command
> you recommended :
> mdadm --assemble /dev/md0 /dev/loop0 /dev/loop1 /dev/loop2 --force
> 
> 
> Will keep you posted
> 
> -------------------------
> Santiago DIEZ
> Quark Systems & CAOBA
> 23 rue du Buisson Saint-Louis, 75010 Paris
> -------------------------
> 
> On Mon, Oct 10, 2016 at 12:39 AM, Wols Lists <antlists@youngman.org.uk> wrote:
> >
> > On 08/10/16 13:30, Andreas Klauer wrote:
> > > On Fri, Oct 07, 2016 at 05:37:32PM +0200, Santiago DIEZ wrote:
> > >> > First thing I did is ddrescue the remaining partitions sd[abc]10 .
> > >> > ddrescue did not stumble into any read error so I assume all remaining
> > >> > partitions are perfectly safe.
> > > So ... don't you still have a good copy?
> > >
> > > You only killed one of them, right? Did not make same mistake twice?
> > >
> > >> > There comes my mistake: I ran the --create command instead of --assemble :
> > >> >
> > >> > ================================================================================
> > >> > # mdadm --create --verbose /dev/md1 --raid-devices=4 --level=raid5
> > >> > --run --readonly /dev/loop0 /dev/loop1 /dev/loop2 missing
> >
> > One oddity I've noticed. You've created the array using loop devices.
> > What are these?
> >
> > The reason I ask is that using loopback devices is a standard technique
> > for rebuilding a damaged array, specifically to prevent md from actually
> > writing to the drive. So is it possible that "mdadm --create" only wrote
> > to ram, and a reboot will recover your ddrescue copies untouched?
> >
> > My raid-fu isn't enough to tell me whether I'm right or not ... :-)
> >
> > If necessary you'll have to do another ddrescue from the original
> > drives, and you should then be able to assemble the array from the
> > copies. Don't use "missing", use "--force" and you should get a working,
> > degraded, array to which you can add a new drive and rebuild the array.
> >
> > mdadm --assemble /dev/md0 /dev/sd[efg]10 --force
> >
> > if I'm right ... so long as it's the copies, you can always recover
> > again from the original disks, and if there's a problem with the copies
> > mdadm should complain when it assembles the array.

Hmm, those commands work for me. I'm adding Song and Jes if they have ideas.

Thanks,
Shaohua

^ permalink raw reply

* Re: [PATCH 1/2 v3] md: add bad block support for external metadata
From: Shaohua Li @ 2016-10-21 22:30 UTC (permalink / raw)
  To: Tomasz Majchrzak; +Cc: linux-raid
In-Reply-To: <1477060017-8217-1-git-send-email-tomasz.majchrzak@intel.com>

On Fri, Oct 21, 2016 at 04:26:57PM +0200, Tomasz Majchrzak wrote:
> Add new rdev flag which external metadata handler can use to switch
> on/off bad block support. If new bad block is encountered, notify it via
> rdev 'unacknowledged_bad_blocks' sysfs file. If bad block has been
> cleared, notify update to rdev 'bad_blocks' sysfs file.
> 
> When bad blocks support is being removed, just clear rdev flag. It is
> not necessary to reset badblocks->shift field. If there are bad blocks
> cleared or added at the same time, it is ok for those changes to be
> applied to the structure. The array is in blocked state and the drive
> which cannot handle bad blocks any more will be removed from the array
> before it is unlocked.
> 
> Simplify state_show function by adding a separator at the end of each
> string and overwrite last separator with new line.
> 
> Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
> Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>

Thanks, applied. After the badblocks patch upstream, I'll push these two.

Thanks,
Shaohua

^ permalink raw reply

* Re: [PATCH v5 6/8] md/r5cache: sysfs entry r5c_state
From: Song Liu @ 2016-10-21 21:20 UTC (permalink / raw)
  To: NeilBrown
  Cc: linux-raid@vger.kernel.org, Shaohua Li, Kernel Team,
	dan.j.williams@intel.com, hch@infradead.org,
	liuzhengyuang521@gmail.com, liuzhengyuan@kylinos.cn
In-Reply-To: <87lgxlxeq5.fsf@notabene.neil.brown.name>


> On Oct 18, 2016, at 7:19 PM, NeilBrown <neilb@suse.com> wrote:
> 
> On Thu, Oct 13 2016, Song Liu wrote:
> 
>> r5c_state have 4 states:
>> * no-cache;
>> * write-through (write journal only);
>> * write-back (w/ write cache);
>> * cache-broken (journal missing or Faulty)
> 
> I don't like this.
> 
> We already have a mechanism for removing a failed device from
> an array: write 'remove' to the 'state' file in the relevant dev-*
> subdirectory.
> You can also use that file to tell if the journal has failed.

I think we would like an separate mechanism to remove "journal feature",
not "journal device".  Maybe we should put it in an different sysfs file. 


> Please call the file something a little more obvious that 'r5c_state',
> maybe 'journal_mode', and use it only to switch between write-through
> and write-back.
> If there is no journal, then either remove the file, or have writes fail
> and reads return something obvious ... maybe ENODEV.

Thanks,
Song


^ permalink raw reply

* Re: [PATCH v5 7/8] md/r5cache: r5c recovery
From: Song Liu @ 2016-10-21 21:12 UTC (permalink / raw)
  To: NeilBrown
  Cc: linux-raid@vger.kernel.org, Shaohua Li, Kernel Team,
	dan.j.williams@intel.com, hch@infradead.org,
	liuzhengyuang521@gmail.com, liuzhengyuan@kylinos.cn
In-Reply-To: <87vawnwwlq.fsf@notabene.neil.brown.name>


> On Oct 19, 2016, at 8:03 PM, NeilBrown <neilb@suse.com> wrote:
> 
> On Thu, Oct 13 2016, Song Liu wrote:
> 
>> This is the recovery part of raid5-cache.
>> 
>> With cache feature, there are 2 different scenarios of recovery:
>> 1. Data-Parity stripe: a stripe with complete parity in journal.
>> 2. Data-Only stripe: a stripe with only data in journal (or partial
>>   parity).
>> 
>> The code differentiate Data-Parity stripe from Data-Only stripe with
>> flag (STRIPE_R5C_WRITTEN).
>> 
>> For Data-Parity stripes, we use the same procedure as raid5 journal,
>> where all the data and parity are replayed to the RAID devices.
>> 
>> For Data-Only strips, we need to finish complete calculate parity and
>> finish the full reconstruct write or RMW write. For simplicity, in
>> the recovery, we load the stripe to stripe cache. Once the array is
>> started, the stripe cache state machine will handle these stripes
>> through normal write path.
>> 
>> r5c_recovery_flush_log contains the main procedure of recovery. The
>> recovery code first scans through the journal and loads data to
>> stripe cache. The code keeps tracks of all these stripes in a list
>> (use sh->lru and ctx->cached_list), stripes in the list are
>> organized in the order of its first appearance on the journal.
>> During the scan, the recovery code assesses each stripe as
>> Data-Parity or Data-Only.
>> 
>> During scan, the array may run out of stripe cache. In these cases,
>> the recovery code will also call raid5_set_cache_size to increase
>> stripe cache size.
>> 
>> At the end of scan, the recovery code replays all Data-Parity
>> stripes, and sets proper states for Data-Only stripes. The recovery
>> code also increases seq number by 10 and rewrites all Data-Only
>> stripes to journal. This is to avoid confusion after repeated
>> crashes. More details is explained in raid5-cache.c before
>> r5c_recovery_rewrite_data_only_stripes().
>> 
>> Signed-off-by: Song Liu <songliubraving@fb.com>
> 
> 
> This patch seems to do a number of different things.
> I think it re-factors the journal reading code.
> It adds code to write a new "empty" journal metadata block
> and it adds support for recovery of cached data.
> 
> All this together makes it hard to review.  I'd rather more smaller
> patches.
> 

I will try to split the patch into meaningful smaller patches. 
> 
>> +	/* stripes only have parity are already flushed to RAID */
>> +	if (data_count == 0)
>> +		goto out;
> 
> Can you explain why that is?  When were they flushed to the RAID, and
> how was the parity determined?

It happens like this: say two stripes on journal: 100 and 200. The data (D)
and parity (P) pages are store in journal as:

      --->  D100   D200   P100  P200 ---->  newer data

Before we flush D100, journal_start points as D100. Then we flush D100, 
and new journal_start points as D200. Now the system fails, so next 
recovery starts from D200. Recovery code will find stripe 100 only has
parity. This means, stripe 100 is already flushed to raid. so we can ignore it. 

> 
>> +/* returns 0 for match; 1 for mismtach */
> 
> No, please don't do that.
> You can return an negative error if you like, and call it as
>   function_name() < 0
> or
>   function_name() == 0
> 
> or give the function a name that describes what it tests
> i.e.
>   r5l_data_checksum_is_correct()
> or
>   r5l_data_checksum_does_not_match()
> 
> and then return 0 if the test fails and 1 if it succeeds.
> 
>> +static int
>> +r5l_recovery_verify_data_checksum(struct r5l_log *log, struct page *page,
>> +				  sector_t log_offset, __le32 log_checksum)

I will fix this. 

Thanks,
Song

^ permalink raw reply

* Re: [PATCH v5 5/8] md/r5cache: reclaim support
From: Song Liu @ 2016-10-21 21:04 UTC (permalink / raw)
  To: NeilBrown
  Cc: linux-raid@vger.kernel.org, Shaohua Li, Kernel Team,
	dan.j.williams@intel.com, hch@infradead.org,
	liuzhengyuang521@gmail.com, liuzhengyuan@kylinos.cn
In-Reply-To: <87oa2hxfg6.fsf@notabene.neil.brown.name>


> On Oct 18, 2016, at 7:03 PM, NeilBrown <neilb@suse.com> wrote:
> 
>> AL, 5 seconds) reclaim
>>   if there is no reclaim in the past 5 seconds.
> 
> 5 seconds is an arbitrary number.  I don't think it needs to be made
> configurable, but it might help to justify it.
> I would probably make it closer to 30 seconds.  There really isn't any
> rush.  If there is much load, it will never wait this long.  If there is
> not much load ... then it probably doesn't matter.

Yeah, 30 second should work here. 
> 
> 
>> 2. when there are R5C_FULL_STRIPE_FLUSH_BATCH (32) cached full stripes
>>   (r5c_check_cached_full_stripe)
> 
> This is another arbitrary number, but I think it needs more
> justification.  Why 32?  That's 128K per device, which isn't very much.
> It might make sense to set the batch size based on the stripe size
> (e.g. at least on stripe?) or on the cache size.

How about we use something like min(stripe_cache_size / 8, 256)? Journal space
will trigger reclaim in other conditions (#4 below). 

> 
>> 3. when there is pressure on stripe cache (r5c_check_stripe_cache_usage)
>> 4. when there is pressure on journal space (r5l_write_stripe, r5c_cache_data)
>> 
>> r5c_do_reclaim() contains new logic of reclaim.
>> 
>> For stripe cache:
>> 
>> When stripe cache pressure is high (more than 3/4 stripes are cached,
> 
> You say "3/4" here bit I think the code says "1/2"
>> +	if (total_cached > conf->min_nr_stripes * 1 / 2 ||
> But there is also
>> +	if (total_cached > conf->min_nr_stripes * 3 / 4 ||
>> +	    atomic_read(&conf->empty_inactive_list_nr) > 0)
> 
> so maybe you do mean 3/4..  Maybe some comments closer to the different
> fractions would help.

There are level of cache pressure:
high pressure: total_cached > 3/4 || empty_inactivelist_nr > 0
moderate pressure: total_cached > 1/2. 

The condition:
+	if (total_cached > conf->min_nr_stripes * 1 / 2 ||
+	    atomic_read(&conf->empty_inactive_list_nr) > 0)

covers "either high or moderate pressure". 


>> @@ -141,6 +150,12 @@ struct r5l_log {
>> 
>> 	/* for r5c_cache */
>> 	enum r5c_state r5c_state;
>> +
>> +	/* all stripes in r5cache, in the order of seq at sh->log_start */
>> +	struct list_head stripe_in_cache_list;
>> +
>> +	spinlock_t stripe_in_cache_lock;
> 
> You use the _irqsave / _irqrestore versions of spinlock for this lock,
> but I cannot see where it is takes from interrupt-context.
> What did I miss?

Let me fix them. 

> 
>> +/* evaluate log space usage and update R5C_LOG_TIGHT and R5C_LOG_CRITICAL */
>> +static inline void r5c_update_log_state(struct r5l_log *log)
>> +{
>> +	struct r5conf *conf = log->rdev->mddev->private;
>> +	sector_t free_space = r5l_ring_distance(log, log->log_start,
>> +						log->last_checkpoint);
>> +	sector_t reclaim_space = r5c_log_required_to_flush_cache(conf);
>> +
>> +	if (!r5c_is_writeback(log))
>> +		return;
>> +	else if (free_space < 2 * reclaim_space) {
>> +		set_bit(R5C_LOG_CRITICAL, &conf->cache_state);
>> +		set_bit(R5C_LOG_TIGHT, &conf->cache_state);
>> +	} else if (free_space < 3 * reclaim_space) {
>> +		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);
>> +		set_bit(R5C_LOG_TIGHT, &conf->cache_state);
>> +	} else if (free_space > 4 * reclaim_space) {
>> +		clear_bit(R5C_LOG_TIGHT, &conf->cache_state);
>> +		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);
>> +	} else
>> +		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);
> 
> Hmmm...
> We set LOG_CRITICAL if free_space < 2*reclaim_space, else we clear it.
> We set LOG_TIGHT if free_space < 3*reclaim_space and clear it if > 4*reclaim_space
> 
> Would the code be clearer if you just made that explicit.
> 
> At the very least, change
>> +	} else if (free_space > 4 * reclaim_space) {
>> +		clear_bit(R5C_LOG_TIGHT, &conf->cache_state);
>> +		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);
>> +	} else
>> +		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);
> 
> to
> 
>> +	} else if (free_space < 4 * reclaim_space) {
>> +		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);
>> +	} else {
>> +		clear_bit(R5C_LOG_TIGHT, &conf->cache_state);
>> +		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);
>> +     }
> 
> so the order of the branches is consistent and all the tests a '<'.

I will revise the code and the comments. 

> 
>> 
>> 	BUG_ON(reclaimable < 0);
>> @@ -941,7 +1131,7 @@ static void r5l_do_reclaim(struct r5l_log *log)
>> 
>> 	mutex_lock(&log->io_mutex);
>> 	log->last_checkpoint = next_checkpoint;
>> -	log->last_cp_seq = next_cp_seq;
> 
> Why don't we update last_cp_seq here any more?

There are two reasons: 
1. We are not using last_cp_seq after recovery is done. So it is not necessary
    to update it. Shaohua had some idea to use it for reclaim (keep ordering of 
    stripes). But my initial implementation uses sh->log_start and 
    r5l_ring_distance() instead. 
2. With write back cache, last_cp_seq will be the seq of first stripe in 
    stripe_in_cache_list. To track last_cp_seq, we will need an extra first_seq_in_cache
    in stripe_head. I feel it is a waste of memory. 
   
> 
>> +
>> +/*
>> + * if num < 0, flush all stripes
>> + * if num == 0, flush all full stripes
>> + * if num > 0, flush all full stripes. If less than num full stripes are
>> + *             flushed, flush some partial stripes until totally num stripes are
>> + *             flushed or there is no more cached stripes.
>> + */
> 
> I'm not very keen on the idea of having "num < 0".
> 
> I'd rather the meaning was:
>  flush all full stripes, and a total of at least num stripes.
> 
> To flush all stripes, pass MAX_INT for num (or similar).

This is a great idea. I will fix it. 

Thanks,
Song


^ permalink raw reply

* Re: [PATCH v5 4/8] md/r5cache: write part of r5cache
From: Song Liu @ 2016-10-21 20:42 UTC (permalink / raw)
  To: NeilBrown
  Cc: linux-raid@vger.kernel.org, Shaohua Li, Kernel Team,
	dan.j.williams@intel.com, hch@infradead.org,
	liuzhengyuang521@gmail.com, liuzhengyuan@kylinos.cn
In-Reply-To: <87r37dxipq.fsf@notabene.neil.brown.name>


> On Oct 18, 2016, at 5:53 PM, NeilBrown <neilb@suse.com> wrote:
> 
> On Fri, Oct 14 2016, Song Liu wrote:
> 
>>> On Oct 1st of the work. 
>> 
>> Or we can force the code to rcw, which does not need extra page. 
>> But rcw, does not always work in degraded mode. So, this is a good 
>> reason not to do write-back in degraded mode...
> 
> Prohibiting write-back in degraded mode would not be enough to ensure
> that you can always use rcw.  The array can become degraded after you
> make the decision to use caching, and before to need to read old data
> for rmw.
> 
> I would suggest a small (2 entry?) mempool where each entry in the
> mempool holds enough pages to complete an rmw.  Only use the mempool if
> an alloc_page() fails.
> 
This is a great idea. I will try it. 


>>> 
>>>> +
>>>> +void r5c_handle_cached_data_endio(struct r5conf *conf,
>>>> +	  struct stripe_head *sh, int disks, struct bio_list *return_bi)
>>>> +{
>>>> +	int i;
>>>> +
>>>> +	for (i = sh->disks; i--; ) {
>>>> +		if (test_bit(R5_InCache, &sh->dev[i].flags) &&
>>>> +		    sh->dev[i].written) {
>>> 
>>> Is it possible for R5_InCache to be set, but 'written' to be NULL ???
>> 
>> Yes, it is possible. A stripe may go through "write data to journal, return IO" multiple
>> times before parity calculation. When it comes here the second time, dev written in the 
>> first time will have R5_InCache set, but its written will be NULL. 
> 
> OK, that makes sense.
> So is it possible for 'written' to be set, but R5_InCache to be clear?
> i.e. do we really need to test R5_InCache here?

In current implementation, there is only one log_io per sh, so we should be fine just check 
'written'. Let me double check. 

> 
>>>> 
>>>> static void r5l_io_run_stripes(struct r5l_io_unit *io)
>>>> @@ -483,7 +566,8 @@ static int r5l_log_stripe(struct r5l_log *log, struct stripe_head *sh,
>>>> 	io = log->current_io;
>>>> 
>>>> 	for (i = 0; i < sh->disks; i++) {
>>>> -		if (!test_bit(R5_Wantwrite, &sh->dev[i].flags))
>>>> +		if (!test_bit(R5_Wantwrite, &sh->dev[i].flags) &&
>>>> +		    !test_bit(R5_Wantcache, &sh->dev[i].flags))
>>>> 			continue;
>>> 
>>> If changed R5_Wantcache to R5_Wantjournal, and always set it on blocks
>>> that were destined for the journal, then this would just be
>>> 
>>> 		if (!test_bit(R5_Wantjournal, &sh->dev[i].flags))
>>> 
>>> which would make lots of sense...  Just a thought.
>> 
>> We set R5_Wantwrite in multiple places. If we simplify the code here, we will need to make
>> those places aware of journal. I guess that is not ideal either? 
> 
> Maybe...
> We have so many state flags that I like to be very cautious about adding
> more, and to make sure they have a very well defined meaning that
> doesn't overlap with other flags too much.
> The above code suggests that Wantwrite and Wantcache overlap to some
> extent.
> 
> Could we discard Wantcache and just use Wantwrite combined with InCache?
> Wantwrite means that the block needed to be written to the RAID.
> If InCache isn't set, it also needs to be written to the cache (if the
> cache is being used).
> Then the above code would be
>   if (!test_bit(R5_Wantwrite) || test_bit(R5_InCache))
>      continue;
> 
> which means "if we don't want to write this, or if it is already in the
> cache, then nothing to do here".
> 
> Maybe.

I am not sure whether we can totally remove Wantcache. Let me try it. 

Thanks,
Song

^ permalink raw reply

* how to assemble an array with journal??
From: Carlos Carvalho @ 2016-10-21 20:10 UTC (permalink / raw)
  To: linux-raid

I've created an array with journal but after stopping it I'm stuck to assemble
it back, mdadm refuses to assemble it because of stale journal device:

First create journal device, it's a 2-disk raid0:

foobar# mdadm -C /dev/md0 -l 0 -n 2 -e 1.1 /dev/sd[fg]
mdadm: array /dev/md0 started.
foobar# cat /proc/mdstat
Personalities : [raid0] [raid1] [raid6] [raid5] [raid4] 
md0 : active raid0 sdg[1] sdf[0]
      312318976 blocks super 1.1 512k chunks

Now create the raid6 array with journal:

foobar# mdadm -C /dev/md1 -l 6 -n 20 -e 1.1 --assume-clean --write-journal=/dev/md0 /dev/sd[de] /dev/sd[h-y]
mdadm: array /dev/md1 started.

foobar# cat /proc/mdstat
Personalities : [raid0] [raid1] [raid6] [raid5] [raid4] 
md1 : active raid6 sdy[20] sdx[19] sdw[18] sdv[17] sdu[16] sdt[15] sds[14] sdr[13] sdq[12] sdp[11] sdo[10] sdn[9] sdm[8] sdl[7] sdk[6] sdj[5] sdi[4] sdh[3] sde[2] sdd[1] md0[0](J)
      105487045632 blocks super 1.1 level 6, 512k chunk, algorithm 2 [20/20] [UUUUUUUUUUUUUUUUUUUU]
      bitmap: 0/44 pages [0KB], 65536KB chunk

md0 : active raid0 sdg[1] sdf[0]
      312318976 blocks super 1.1 512k chunks

Now stop it:
foobar#~[17:56]  mdadm -S /dev/md1
mdadm: stopped /dev/md1

I put the uuid in /etc/mdadm/mdadm.conf to make it easier to assemble, then

foobar#~[17:58]  mdadm -vv -A /dev/md1 
mdadm: /dev/md0 is identified as a member of /dev/md1, slot 65533.
mdadm: /dev/sdk is identified as a member of /dev/md1, slot 5.
mdadm: /dev/sdt is identified as a member of /dev/md1, slot 14.
mdadm: /dev/sdj is identified as a member of /dev/md1, slot 4.
mdadm: /dev/sdu is identified as a member of /dev/md1, slot 15.
mdadm: /dev/sdo is identified as a member of /dev/md1, slot 9.
mdadm: /dev/sdm is identified as a member of /dev/md1, slot 7.
mdadm: /dev/sdl is identified as a member of /dev/md1, slot 6.
mdadm: /dev/sdx is identified as a member of /dev/md1, slot 18.
mdadm: /dev/sdw is identified as a member of /dev/md1, slot 17.
mdadm: /dev/sdy is identified as a member of /dev/md1, slot 19.
mdadm: /dev/sdn is identified as a member of /dev/md1, slot 8.
mdadm: /dev/sds is identified as a member of /dev/md1, slot 13.
mdadm: /dev/sdv is identified as a member of /dev/md1, slot 16.
mdadm: /dev/sdq is identified as a member of /dev/md1, slot 11.
mdadm: /dev/sdd is identified as a member of /dev/md1, slot 0.
mdadm: /dev/sdp is identified as a member of /dev/md1, slot 10.
mdadm: /dev/sdr is identified as a member of /dev/md1, slot 12.
mdadm: /dev/sdi is identified as a member of /dev/md1, slot 3.
mdadm: /dev/sdh is identified as a member of /dev/md1, slot 2.
mdadm: /dev/sde is identified as a member of /dev/md1, slot 1.
mdadm: Not safe to assemble with missing or stale journal device, consider --force.

Note that all slots were found, including the journal one. Why does it complain
it's stale??

Trying to assemble listing the devices directly gives the same result:
foobar#~[17:58]  mdadm -A /dev/md1 /dev/sd[de] /dev/sd[h-y]                     mdadm: Not safe to assemble with missing or stale journal device, consider --force.

This is with 4.7.9. How can I reassemble it?

^ permalink raw reply

* Re: [PATCH] badblocks: badblocks_set/clear update unacked_exist
From: Jens Axboe @ 2016-10-21 16:15 UTC (permalink / raw)
  To: Shaohua Li, linux-raid, linux-block; +Cc: tomasz.majchrzak
In-Reply-To: <39f303dcb7bdf2958b6f82a0646467dfbc8d1abd.1476999400.git.shli@fb.com>

On 10/20/2016 03:40 PM, Shaohua Li wrote:
> When bandblocks_set acknowledges a range or badblocks_clear a range,
> it's possible all badblocks are acknowledged. We should update
> unacked_exist if this occurs.

Applied, thanks.

-- 
Jens Axboe


^ permalink raw reply

* Resync issue in RAID1
From: V @ 2016-10-21 15:53 UTC (permalink / raw)
  To: linux-raid

Hi,

I am facing an issue during RAID1 resync. I have an ubuntu
4.4.0-31-generic running with raid1 configured with 2 disks as active
and 2 as spares. On the first powercycle, after installing RAID, i see
the following messages in kern.log


My disks are configured with 4K sector size (both logical and
physical) (sda and sdb are active disks for this raid)


===========
Oct 18 03:52:56  kernel: [   52.869113] md: using 128k window, over a
total of 51167104k.
Oct 18 03:52:56  kernel: [   52.869114] md: resuming resync of md2
from checkpoint.
Oct 18 03:52:56  kernel: [   52.869378] sd 0:0:0:0: [sda] Bad block
number requested
Oct 18 03:52:56  kernel: [   52.869414] sd 0:0:0:0: [sda] Bad block
number requested
Oct 18 03:52:56  kernel: [   52.869436] sd 0:0:0:0: [sda] Bad block
number requested
Oct 18 03:52:56  kernel: [   52.869465] sd 0:0:0:0: [sda] Bad block
number requested
Oct 18 03:52:56  kernel: [   52.869503] sd 0:0:1:0: [sdb] Bad block
number requested
Oct 18 03:52:56  kernel: [   52.869536] md/raid1:md2: sda:
unrecoverable I/O read error for block 3
Oct 18 03:52:56  kernel: [   52.869581] md: md2: resync interrupted.
Oct 18 03:52:56  kernel: [   52.869584] sd 0:0:0:0: [sda] Bad block
number requested
Oct 18 03:52:56  kernel: [   52.869609] sd 0:0:0:0: [sda] Bad block
number requested
Oct 18 03:52:56  kernel: [   52.869633] sd 0:0:1:0: [sdb] Bad block
number requested
Oct 18 03:52:56  kernel: [   52.869692] md/raid1:md2: sda:
unrecoverable I/O read error for block 131
Oct 18 03:52:56  kernel: [   52.869735] sd 0:0:0:0: [sda] Bad block
number requested
Oct 18 03:52:56  kernel: [   52.869808] sd 0:0:1:0: [sdb] Bad block
number requested
Oct 18 03:52:56  kernel: [   52.869837] md/raid1:md2: sda:
unrecoverable I/O read error for block 259
Oct 18 03:52:56  kernel: [   52.869908] sd 0:0:0:0: [sda] Bad block
number requested
Oct 18 03:52:56  kernel: [   52.869958] sd 0:0:1:0: [sdb] Bad block
number requested
Oct 18 03:52:56  kernel: [   52.870022] md/raid1:md2: sda:
unrecoverable I/O read error for block 387

===========


Seems to be an unaligned access from mdraid1 during resync. Do you
have any idea, why we are seeing it. How do i debug this issue.


I was looking at the following patch (
http://marc.info/?l=linux-raid&m=142405887609959&w=2 ), by Nate in the
function narrow_write_error(), where the alignment was made correctly
to disk sector size. Do you think, similar roundups are required in
resync path as well.

More info:

1) We have checks for previous raid configurations present in device
partitions ( we do mdadm --examine and if present, we zero out
superblocks)
2) Each device is partitioned into 4 and used for 4 different raid devices.
3) Say sda is partitioned into 4 (sda1,sda2,sda3,sda4) and each
partition is part of one raid device.
4) Raid1 create command.
     /sbin/mdadm --quiet --create /dev/md1 --force --level=1
--metadata=default --raid-devices=2 /dev/sda1 /dev/sdb1
--spare-devices=2 /dev/sdc1/dev/sdd1


Thanks in advance,
Viswesh

^ permalink raw reply

* Re: [PATCH] badblocks: badblocks_set/clear update unacked_exist
From: Tomasz Majchrzak @ 2016-10-21 14:35 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, linux-block
In-Reply-To: <39f303dcb7bdf2958b6f82a0646467dfbc8d1abd.1476999400.git.shli@fb.com>

On Thu, Oct 20, 2016 at 02:40:06PM -0700, Shaohua Li wrote:
> When bandblocks_set acknowledges a range or badblocks_clear a range,
> it's possible all badblocks are acknowledged. We should update
> unacked_exist if this occurs.
> 
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
>  block/badblocks.c | 23 +++++++++++++++++++++++
>  1 file changed, 23 insertions(+)
> 
> diff --git a/block/badblocks.c b/block/badblocks.c
> index 7be53cb..139115d2 100644
> --- a/block/badblocks.c
> +++ b/block/badblocks.c
> @@ -133,6 +133,26 @@ int badblocks_check(struct badblocks *bb, sector_t s, int sectors,
>  }
>  EXPORT_SYMBOL_GPL(badblocks_check);
>  
> +static void badblocks_update_acked(struct badblocks *bb)
> +{
> +	u64 *p = bb->page;
> +	int i;
> +	bool unacked = false;
> +
> +	if (!bb->unacked_exist)
> +		return;
> +
> +	for (i = 0; i < bb->count ; i++) {
> +		if (!BB_ACK(p[i])) {
> +			unacked = true;
> +			break;
> +		}
> +	}
> +
> +	if (!unacked)
> +		bb->unacked_exist = 0;
> +}
> +
>  /**
>   * badblocks_set() - Add a range of bad blocks to the table.
>   * @bb:		the badblocks structure that holds all badblock information
> @@ -294,6 +314,8 @@ int badblocks_set(struct badblocks *bb, sector_t s, int sectors,
>  	bb->changed = 1;
>  	if (!acknowledged)
>  		bb->unacked_exist = 1;
> +	else
> +		badblocks_update_acked(bb);
>  	write_sequnlock_irqrestore(&bb->lock, flags);
>  
>  	return rv;
> @@ -399,6 +421,7 @@ int badblocks_clear(struct badblocks *bb, sector_t s, int sectors)
>  		}
>  	}
>  
> +	badblocks_update_acked(bb);
>  	bb->changed = 1;
>  out:
>  	write_sequnlock_irq(&bb->lock);
> -- 
> 2.9.3

Reviewed-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Tested-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>

^ permalink raw reply

* Re: [PATCH 3/3 v2] md: unblock array if bad blocks have been acknowledged
From: Tomasz Majchrzak @ 2016-10-21 14:33 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid
In-Reply-To: <20161020215515.GA97678@kernel.org>

On Thu, Oct 20, 2016 at 02:55:15PM -0700, Shaohua Li wrote:
> On Thu, Oct 20, 2016 at 02:09:15PM +0200, Tomasz Majchrzak wrote:
> > On Wed, Oct 19, 2016 at 10:28:18PM -0700, Shaohua Li wrote:
> > > On Tue, Oct 18, 2016 at 04:10:24PM +0200, Tomasz Majchrzak wrote:
> > > > Once external metadata handler acknowledges all bad blocks (by writing
> > > > to rdev 'bad_blocks' sysfs file), it requests to unblock the array.
> > > > Check if all bad blocks are actually acknowledged as there might be a
> > > > race if new bad blocks are notified at the same time. If all bad blocks
> > > > are acknowledged, just unblock the array and continue. If not, ignore
> > > > the request to unblock (do not fail an array). External metadata handler
> > > > is expected to either process remaining bad blocks and try to unblock
> > > > again or remove bad block support for a disk (which will cause disk to
> > > > fail as in no-support case).
> > > > 
> > > > Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
> > > > Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> > > > ---
> > > >  drivers/md/md.c | 24 +++++++++++++++++-------
> > > >  1 file changed, 17 insertions(+), 7 deletions(-)
> > > > 
> > > > diff --git a/drivers/md/md.c b/drivers/md/md.c
> > > > index cc05236..ce585b7 100644
> > > > --- a/drivers/md/md.c
> > > > +++ b/drivers/md/md.c
> > > > @@ -2612,19 +2612,29 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len)
> > > >  		set_bit(Blocked, &rdev->flags);
> > > >  		err = 0;
> > > >  	} else if (cmd_match(buf, "-blocked")) {
> > > > -		if (!test_bit(Faulty, &rdev->flags) &&
> > > > +		int unblock = 1;
> > > > +		int acked = !rdev->badblocks.unacked_exist;
> > > > +
> > > > +		if ((test_bit(ExternalBbl, &rdev->flags) &&
> > > > +		     rdev->badblocks.changed))
> > > > +			acked = check_if_badblocks_acked(&rdev->badblocks);
> > > > +
> > > > +		if (test_bit(ExternalBbl, &rdev->flags) && !acked) {
> > > > +			unblock = 0;
> > > > +		} else if (!test_bit(Faulty, &rdev->flags) &&
> > > 
> > > I missed one thing in last review. writing to bad_blocks sysfs file already
> > > clears the BlockedBadBlocks bit and wakeup the thread sleeping at blocked_wait,
> > > so the array can continue. Why do we need to fix state_store here?
> > 
> > We cannot unblock the rdev until all bad blocks are acknowledged. The problem is
> > mdadm cannot be sure it has stored all bad blocks in the first pass (read of
> > unacknowledged_bad_blocks sysfs file). When bad block is encountered, rdev
> > enters Blocked, Faulty state (unacked_exists is non-zero in state_show). Then
> > mdadm reads the bad block, stores it in metadata and acknowledges it to the
> > kernel. Initially I have tried to call ack_all_badblocks in bb_store or in
> > state_store("-blocked") but there was a race. If other requests (the ones that
> > had started before array got into blocked state) notified bad blocks after sysfs
> > file was read by mdadm but before ack_all_badblocks call, ack_all_badblocks call
> > was also acknowledging bad blocks not stored (and never to be as a result) in
> > metadata. That's why I have introduced a new function
> > check_if_all_badblocks_acked to close this race.
> > 
> > I'm not sure if native bad block support is not prone to the similar problem -
> > bad block structure modified between metadata sync and ack_all_badblocks call.
> > 
> 
> Yep, we always have the race here. Fortunately we don't need to wait all
> badblocks acknowledged, the user of md_wait_for_blocked_rdev will retry. In the
> retry, we will check if the badblock is acknowledged.
> 
> The native bad block support doesn't have the race. We copy badblocks to a new
> page, clear badblocks->changed and then write the new page to disks.
> ack_all_badblocks will check the ->changed, and do nothing if it's set. So if
> something happens in between, ack_all_badblocks will do nothing.
> 
> While the external metadata array hasn't such mechanism to avoid race, I still
> thought changing state_store isn't a good idea.
> 
> I just sent a patch to fix badblocks_set() and make it clear unacked_exists.
> bb_store shouldn't call ack_all_badblocks in your case, but we don't need to.
> As long as mdadm uses bb_store to acknowledge the ranges, the array can
> continue. And if badblocks_set() can clear unacked_exists, the array will not
> be reported as Blocked.
> 
> > As for BlockedBadBlocks flag cleared in bb_store, commit de393cdea66c ("md: make
> > it easier to wait for bad blocks to be acknowledged") explains this flag is only
> > an advisory. All awaiting requests are woken up and check if bad block that
> > interests them is already acknowledged. If so, then can continue, and if not,
> > they set the flag again to check in a while. It is just a useful optimization.
> 
> I think 'advisory' means the driver should retry
>  
> > Please note that rdev with unacknowledged bad block is reported as Blocked via
> > sysfs state (non-zero unacked_exists), even though the corresponding rdev kernel
> > flag is not set. It is the reason why mdadm calls state_store("-blocked").
> 
> If badblocks_set() can clear unacked_exists, this isn't required.

Thank you for your hints. I have resent my patches. They come on top of your patch.

Tomek

^ permalink raw reply

* [PATCH 2/2 v3] md: don't fail an array if there are unacknowledged bad blocks
From: Tomasz Majchrzak @ 2016-10-21 14:27 UTC (permalink / raw)
  To: linux-raid; +Cc: shli, Tomasz Majchrzak

If external metadata handler supports bad blocks and unacknowledged bad
blocks are present, don't report disk via sysfs as faulty. Such
situation can be still handled so disk just has to be blocked for a
moment. It makes it consistent with kernel state as corresponding rdev
flag is also not set.

When the disk in being unblocked there are few cases:
1. Disk has been in blocked and faulty state, it is being unblocked but
it still remains in faulty state. Metadata handler will remove it from
array in the next call.
2. There is no bad block support in external metadata handler and bad
blocks are present - put the disk in blocked and faulty state (see
case 1).
3. There is bad block support in external metadata handler and all bad
blocks are acknowledged - clear all flags, continue.
4. There is bad block support in external metadata handler but there are
still unacknowledged bad blocks - clear all flags, continue. It is fine
to clear Blocked flag because it was probably not set anyway (if it was
it is case 1). BlockedBadBlocks flag can also be cleared because the
request waiting for it will set it again when it finds out that some bad
block is still not acknowledged. Recovery is not necessary but there are
no problems if the flag is set. Sysfs rdev state is still reported as
blocked (due to unacknowledged bad blocks) so metadata handler will
process remaining bad blocks and unblock disk again.

Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
---
 drivers/md/md.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index b661a45..734edb2 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -2528,7 +2528,8 @@ state_show(struct md_rdev *rdev, char *page)
 	unsigned long flags = ACCESS_ONCE(rdev->flags);

 	if (test_bit(Faulty, &flags) ||
-	    rdev->badblocks.unacked_exist)
+	    (!test_bit(ExternalBbl, &flags) &&
+	    rdev->badblocks.unacked_exist))
 		len += sprintf(page+len, "faulty%s", sep);
 	if (test_bit(In_sync, &flags))
 		len += sprintf(page+len, "in_sync%s", sep);
@@ -2613,6 +2614,7 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len)
 		err = 0;
 	} else if (cmd_match(buf, "-blocked")) {
 		if (!test_bit(Faulty, &rdev->flags) &&
+		    !test_bit(ExternalBbl, &rdev->flags) &&
 		    rdev->badblocks.unacked_exist) {
 			/* metadata handler doesn't understand badblocks,
 			 * so we need to fail the device
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH 1/2 v3] md: add bad block support for external metadata
From: Tomasz Majchrzak @ 2016-10-21 14:26 UTC (permalink / raw)
  To: linux-raid; +Cc: shli, Tomasz Majchrzak

Add new rdev flag which external metadata handler can use to switch
on/off bad block support. If new bad block is encountered, notify it via
rdev 'unacknowledged_bad_blocks' sysfs file. If bad block has been
cleared, notify update to rdev 'bad_blocks' sysfs file.

When bad blocks support is being removed, just clear rdev flag. It is
not necessary to reset badblocks->shift field. If there are bad blocks
cleared or added at the same time, it is ok for those changes to be
applied to the structure. The array is in blocked state and the drive
which cannot handle bad blocks any more will be removed from the array
before it is unlocked.

Simplify state_show function by adding a separator at the end of each
string and overwrite last separator with new line.

Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 drivers/md/md.c | 78 ++++++++++++++++++++++++++++-----------------------------
 drivers/md/md.h |  3 +++
 2 files changed, 42 insertions(+), 39 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 0695013..b661a45 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -2523,51 +2523,38 @@ struct rdev_sysfs_entry {
 static ssize_t
 state_show(struct md_rdev *rdev, char *page)
 {
-	char *sep = "";
+	char *sep = ",";
 	size_t len = 0;
 	unsigned long flags = ACCESS_ONCE(rdev->flags);
 
 	if (test_bit(Faulty, &flags) ||
-	    rdev->badblocks.unacked_exist) {
-		len+= sprintf(page+len, "%sfaulty",sep);
-		sep = ",";
-	}
-	if (test_bit(In_sync, &flags)) {
-		len += sprintf(page+len, "%sin_sync",sep);
-		sep = ",";
-	}
-	if (test_bit(Journal, &flags)) {
-		len += sprintf(page+len, "%sjournal",sep);
-		sep = ",";
-	}
-	if (test_bit(WriteMostly, &flags)) {
-		len += sprintf(page+len, "%swrite_mostly",sep);
-		sep = ",";
-	}
+	    rdev->badblocks.unacked_exist)
+		len += sprintf(page+len, "faulty%s", sep);
+	if (test_bit(In_sync, &flags))
+		len += sprintf(page+len, "in_sync%s", sep);
+	if (test_bit(Journal, &flags))
+		len += sprintf(page+len, "journal%s", sep);
+	if (test_bit(WriteMostly, &flags))
+		len += sprintf(page+len, "write_mostly%s", sep);
 	if (test_bit(Blocked, &flags) ||
 	    (rdev->badblocks.unacked_exist
-	     && !test_bit(Faulty, &flags))) {
-		len += sprintf(page+len, "%sblocked", sep);
-		sep = ",";
-	}
+	     && !test_bit(Faulty, &flags)))
+		len += sprintf(page+len, "blocked%s", sep);
 	if (!test_bit(Faulty, &flags) &&
 	    !test_bit(Journal, &flags) &&
-	    !test_bit(In_sync, &flags)) {
-		len += sprintf(page+len, "%sspare", sep);
-		sep = ",";
-	}
-	if (test_bit(WriteErrorSeen, &flags)) {
-		len += sprintf(page+len, "%swrite_error", sep);
-		sep = ",";
-	}
-	if (test_bit(WantReplacement, &flags)) {
-		len += sprintf(page+len, "%swant_replacement", sep);
-		sep = ",";
-	}
-	if (test_bit(Replacement, &flags)) {
-		len += sprintf(page+len, "%sreplacement", sep);
-		sep = ",";
-	}
+	    !test_bit(In_sync, &flags))
+		len += sprintf(page+len, "spare%s", sep);
+	if (test_bit(WriteErrorSeen, &flags))
+		len += sprintf(page+len, "write_error%s", sep);
+	if (test_bit(WantReplacement, &flags))
+		len += sprintf(page+len, "want_replacement%s", sep);
+	if (test_bit(Replacement, &flags))
+		len += sprintf(page+len, "replacement%s", sep);
+	if (test_bit(ExternalBbl, &flags))
+		len += sprintf(page+len, "external_bbl%s", sep);
+
+	if (len)
+		len -= strlen(sep);
 
 	return len+sprintf(page+len, "\n");
 }
@@ -2708,6 +2695,13 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len)
 			}
 		} else
 			err = -EBUSY;
+	} else if (cmd_match(buf, "external_bbl") && (rdev->mddev->external)) {
+		set_bit(ExternalBbl, &rdev->flags);
+		rdev->badblocks.shift = 0;
+		err = 0;
+	} else if (cmd_match(buf, "-external_bbl") && (rdev->mddev->external)) {
+		clear_bit(ExternalBbl, &rdev->flags);
+		err = 0;
 	}
 	if (!err)
 		sysfs_notify_dirent_safe(rdev->sysfs_state);
@@ -8615,6 +8609,9 @@ int rdev_set_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
 	rv = badblocks_set(&rdev->badblocks, s, sectors, 0);
 	if (rv == 0) {
 		/* Make sure they get written out promptly */
+		if (test_bit(ExternalBbl, &rdev->flags))
+			sysfs_notify(&rdev->kobj, NULL,
+				     "unacknowledged_bad_blocks");
 		sysfs_notify_dirent_safe(rdev->sysfs_state);
 		set_mask_bits(&mddev->flags, 0,
 			      BIT(MD_CHANGE_CLEAN) | BIT(MD_CHANGE_PENDING));
@@ -8628,12 +8625,15 @@ EXPORT_SYMBOL_GPL(rdev_set_badblocks);
 int rdev_clear_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
 			 int is_new)
 {
+	int rv;
 	if (is_new)
 		s += rdev->new_data_offset;
 	else
 		s += rdev->data_offset;
-	return badblocks_clear(&rdev->badblocks,
-				  s, sectors);
+	rv = badblocks_clear(&rdev->badblocks, s, sectors);
+	if ((rv == 0) && test_bit(ExternalBbl, &rdev->flags))
+		sysfs_notify(&rdev->kobj, NULL, "bad_blocks");
+	return rv;
 }
 EXPORT_SYMBOL_GPL(rdev_clear_badblocks);
 
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 2b20417..21bd94f 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -168,6 +168,9 @@ enum flag_bits {
 				 * so it is safe to remove without
 				 * another synchronize_rcu() call.
 				 */
+	ExternalBbl,            /* External metadata provides bad
+				 * block management for a disk
+				 */
 };
 
 static inline int is_badblock(struct md_rdev *rdev, sector_t s, int sectors,
-- 
1.8.3.1


^ permalink raw reply related

* Re: [PATCH] IMSM: Add warning message when x8-type device is used
From: Jes Sorensen @ 2016-10-21 14:22 UTC (permalink / raw)
  To: Baldysiak, Pawel; +Cc: linux-raid@vger.kernel.org
In-Reply-To: <1477059381.15890.9.camel@intel.com>

"Baldysiak, Pawel" <pawel.baldysiak@intel.com> writes:
> On Fri, 2016-10-21 at 08:47 -0400, Jes Sorensen wrote:
>> Pawel Baldysiak <pawel.baldysiak@intel.com> writes:
>> > This patch adds the warning message when x8-type device
>> > is used with IMSM metadata. x8 device is a special
>> > NVMe drive - two of them on a single PCIe card.
>> > This card could be a single point of failure for
>> > RAID levels different than RAID0. x8 devices have
>> > serial number ending with "-A/-B" or "-1/-2".
>> >
>> > Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
>> > Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
>> > ---
>> >  super-intel.c | 42 ++++++++++++++++++++++++++++++++++++++++++
>> >  1 file changed, 42 insertions(+)
>> Hi Pawel,
>> 
>> When posting a multi-patch series, it would be helpful to have it sent
>> with appropriate headers staging the set is X/Y patches and a cover
>> letter.
>> 
>> I normally use -n --cover-letter
> Hi Jes,
>
> Sorry about that. Those patches supposed to be sent separately, but I use
> one git send-mail command to sent them all. Now I know that this was not
> the most clever thing to do :) Please treat them as separate patches.

Gotcha! No worries, I can do that!

Would you mind fixing up the first one as per the comment below?

Thanks,
Jes

>> 
>> > diff --git a/super-intel.c b/super-intel.c
>> > index 5c6ab05..e860e75 100644
>> > --- a/super-intel.c
>> > +++ b/super-intel.c
>> > @@ -5139,9 +5139,51 @@ static int add_to_super_imsm(struct
>> > supertype *st, mdu_disk_info_t *dk,
>> >  	rv = imsm_read_serial(fd, devname, dd->serial);
>> >  	if (rv) {
>> >  		pr_err("failed to retrieve scsi serial, aborting\n");
>> > +		if (dd->devname)
>> > +			free(dd->devname);
>> >  		free(dd);
>> >  		abort();
>> >  	}
>> > +	if (super->hba && ((super->hba->type == SYS_DEV_NVME) ||
>> > +	   (super->hba->type == SYS_DEV_VMD))) {
>> > +		int i;
>> > +		char *devpath = diskfd_to_devpath(fd);
>> > +		char controller_path[PATH_MAX];
>> > +
>> > +		if (!devpath) {
>> > +			pr_err("failed to get devpath, aborting\n");
>> > +			if (dd->devname)
>> > +				free(dd->devname);
>> > +			free(dd);
>> > +			return 1;
>> > +		}
>> > +
>> > +		snprintf(controller_path, PATH_MAX-1, "%s/device", devpath);
>> > +		free(devpath);
>> > +
>> > +		if (devpath_to_vendor(controller_path) == 0x8086) {
>> > +			/*
>> > +			 * If Intel's NVMe drive has serial ended with
>> > +			 * "-A","-B","-1" or "-2" it means that this is "x8"
>> > +			 * device (double drive on single PCIe card).
>> > +			 * User should be warned about potential data loss.
>> > +			 */
>> > +			for (i = MAX_RAID_SERIAL_LEN-1; i > 0; i--) {
>> > +				/* Skip empty character at the end */
>> > +				if (dd->serial[i] == 0)
>> > +					continue;
>> > +
>> > + if (((dd->serial[i] == 'A') || (dd->serial[i] == 'B') ||
>> > +    (dd->serial[i] == '1') || (dd->serial[i] == '2')) &&
>> > +				   (dd->serial[i-1] == '-'))
>> 
>> Minor nit here - please keep this within 80 characters.
>
> Would you like me to resend corrected version?
> I saw that those lines are over 80 chars, but I kept it this way.
> I think that its more readable right now, but if you prefer I can
> break those lines.
>
> Thanks
> Pawel
>> 
>> Thanks,
>> Jes

^ permalink raw reply

* Re: [PATCH] IMSM: Add warning message when x8-type device is used
From: Baldysiak, Pawel @ 2016-10-21 14:17 UTC (permalink / raw)
  To: Jes.Sorensen@redhat.com; +Cc: linux-raid@vger.kernel.org
In-Reply-To: <wrfj4m45q36e.fsf@redhat.com>

On Fri, 2016-10-21 at 08:47 -0400, Jes Sorensen wrote:
> Pawel Baldysiak <pawel.baldysiak@intel.com> writes:
> > This patch adds the warning message when x8-type device
> > is used with IMSM metadata. x8 device is a special
> > NVMe drive - two of them on a single PCIe card.
> > This card could be a single point of failure for
> > RAID levels different than RAID0. x8 devices have
> > serial number ending with "-A/-B" or "-1/-2".
> >
> > Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
> > Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> > ---
> >  super-intel.c | 42 ++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 42 insertions(+)
> Hi Pawel,
> 
> When posting a multi-patch series, it would be helpful to have it sent
> with appropriate headers staging the set is X/Y patches and a cover
> letter.
> 
> I normally use -n --cover-letter
Hi Jes,

Sorry about that. Those patches supposed to be sent separately, but I use
one git send-mail command to sent them all. Now I know that this was not
the most clever thing to do :) Please treat them as separate patches.

> 
> > diff --git a/super-intel.c b/super-intel.c
> > index 5c6ab05..e860e75 100644
> > --- a/super-intel.c
> > +++ b/super-intel.c
> > @@ -5139,9 +5139,51 @@ static int add_to_super_imsm(struct supertype *st, mdu_disk_info_t *dk,
> >  	rv = imsm_read_serial(fd, devname, dd->serial);
> >  	if (rv) {
> >  		pr_err("failed to retrieve scsi serial, aborting\n");
> > +		if (dd->devname)
> > +			free(dd->devname);
> >  		free(dd);
> >  		abort();
> >  	}
> > +	if (super->hba && ((super->hba->type == SYS_DEV_NVME) ||
> > +	   (super->hba->type == SYS_DEV_VMD))) {
> > +		int i;
> > +		char *devpath = diskfd_to_devpath(fd);
> > +		char controller_path[PATH_MAX];
> > +
> > +		if (!devpath) {
> > +			pr_err("failed to get devpath, aborting\n");
> > +			if (dd->devname)
> > +				free(dd->devname);
> > +			free(dd);
> > +			return 1;
> > +		}
> > +
> > +		snprintf(controller_path, PATH_MAX-1, "%s/device", devpath);
> > +		free(devpath);
> > +
> > +		if (devpath_to_vendor(controller_path) == 0x8086) {
> > +			/*
> > +			 * If Intel's NVMe drive has serial ended with
> > +			 * "-A","-B","-1" or "-2" it means that this is "x8"
> > +			 * device (double drive on single PCIe card).
> > +			 * User should be warned about potential data loss.
> > +			 */
> > +			for (i = MAX_RAID_SERIAL_LEN-1; i > 0; i--) {
> > +				/* Skip empty character at the end */
> > +				if (dd->serial[i] == 0)
> > +					continue;
> > +
> > +				if (((dd->serial[i] == 'A') || (dd->serial[i] == 'B') ||
> > +				   (dd->serial[i] == '1') || (dd->serial[i] == '2')) &&
> > +				   (dd->serial[i-1] == '-'))
> 
> Minor nit here - please keep this within 80 characters.

Would you like me to resend corrected version?
I saw that those lines are over 80 chars, but I kept it this way.
I think that its more readable right now, but if you prefer I can break those lines.

Thanks
Pawel
> 
> Thanks,
> Jes

^ permalink raw reply

* Re: [PATCH] IMSM: Add warning message when x8-type device is used
From: Jes Sorensen @ 2016-10-21 12:47 UTC (permalink / raw)
  To: Pawel Baldysiak; +Cc: linux-raid
In-Reply-To: <1477042671-12934-1-git-send-email-pawel.baldysiak@intel.com>

Pawel Baldysiak <pawel.baldysiak@intel.com> writes:
> This patch adds the warning message when x8-type device
> is used with IMSM metadata. x8 device is a special
> NVMe drive - two of them on a single PCIe card.
> This card could be a single point of failure for
> RAID levels different than RAID0. x8 devices have
> serial number ending with "-A/-B" or "-1/-2".
>
> Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
> Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> ---
>  super-intel.c | 42 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 42 insertions(+)
Hi Pawel,

When posting a multi-patch series, it would be helpful to have it sent
with appropriate headers staging the set is X/Y patches and a cover
letter.

I normally use -n --cover-letter

> diff --git a/super-intel.c b/super-intel.c
> index 5c6ab05..e860e75 100644
> --- a/super-intel.c
> +++ b/super-intel.c
> @@ -5139,9 +5139,51 @@ static int add_to_super_imsm(struct supertype *st, mdu_disk_info_t *dk,
>  	rv = imsm_read_serial(fd, devname, dd->serial);
>  	if (rv) {
>  		pr_err("failed to retrieve scsi serial, aborting\n");
> +		if (dd->devname)
> +			free(dd->devname);
>  		free(dd);
>  		abort();
>  	}
> +	if (super->hba && ((super->hba->type == SYS_DEV_NVME) ||
> +	   (super->hba->type == SYS_DEV_VMD))) {
> +		int i;
> +		char *devpath = diskfd_to_devpath(fd);
> +		char controller_path[PATH_MAX];
> +
> +		if (!devpath) {
> +			pr_err("failed to get devpath, aborting\n");
> +			if (dd->devname)
> +				free(dd->devname);
> +			free(dd);
> +			return 1;
> +		}
> +
> +		snprintf(controller_path, PATH_MAX-1, "%s/device", devpath);
> +		free(devpath);
> +
> +		if (devpath_to_vendor(controller_path) == 0x8086) {
> +			/*
> +			 * If Intel's NVMe drive has serial ended with
> +			 * "-A","-B","-1" or "-2" it means that this is "x8"
> +			 * device (double drive on single PCIe card).
> +			 * User should be warned about potential data loss.
> +			 */
> +			for (i = MAX_RAID_SERIAL_LEN-1; i > 0; i--) {
> +				/* Skip empty character at the end */
> +				if (dd->serial[i] == 0)
> +					continue;
> +
> +				if (((dd->serial[i] == 'A') || (dd->serial[i] == 'B') ||
> +				   (dd->serial[i] == '1') || (dd->serial[i] == '2')) &&
> +				   (dd->serial[i-1] == '-'))

Minor nit here - please keep this within 80 characters.

Thanks,
Jes

^ permalink raw reply

* [PATCH] Lib.c: Fix geting devname for devices with long path
From: Pawel Baldysiak @ 2016-10-21  9:37 UTC (permalink / raw)
  To: jes.sorensen; +Cc: linux-raid, Pawel Baldysiak
In-Reply-To: <1477042671-12934-1-git-send-email-pawel.baldysiak@intel.com>

In scenario where VMD is enabled, and "x8" type of NVMe drive is
plugged into PCIe switch - the path will be longer than 200 chars
(additional VMD domain + 2 level of PCIe switches).
This patch makes the buffer big enough to handle this kind of
configurations.

Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
---
 lib.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib.c b/lib.c
index 057ee52..b640634 100644
--- a/lib.c
+++ b/lib.c
@@ -64,7 +64,7 @@ int get_mdp_major(void)
 char *devid2kname(int devid)
 {
 	char path[30];
-	char link[200];
+	char link[PATH_MAX];
 	static char devnm[32];
 	char *cp;
 	int n;
-- 
2.7.4


^ permalink raw reply related

* [PATCH] IMSM: Enable spanning between VMD domains
From: Pawel Baldysiak @ 2016-10-21  9:37 UTC (permalink / raw)
  To: jes.sorensen; +Cc: linux-raid, Pawel Baldysiak
In-Reply-To: <1477042671-12934-1-git-send-email-pawel.baldysiak@intel.com>

Each VMD domain adds additional PCI domain. This patch
enables RAID creation with NVMe drives from different
VMD domains.

Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
---
 super-intel.c | 19 +++----------------
 1 file changed, 3 insertions(+), 16 deletions(-)

diff --git a/super-intel.c b/super-intel.c
index f52e3d4..6fc9cb1 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -567,10 +567,6 @@ static int attach_hba_to_super(struct intel_super *super, struct sys_dev *device
 	if (device->type != hba->type)
 		return 2;
 
-	/* Always forbid spanning between VMD domains (seen as different controllers by mdadm) */
-	if (device->type == SYS_DEV_VMD && !path_attached_to_hba(device->path, hba->path))
-		return 2;
-
 	/* Multiple same type HBAs can be used if they share the same OROM */
 	const struct imsm_orom *device_orom = get_orom_by_device_id(device->dev_id);
 
@@ -2023,10 +2019,10 @@ static int detail_platform_imsm(int verbose, int enumerate_only, char *controlle
 
 	for (entry = orom_entries; entry; entry = entry->next) {
 		if (entry->type == SYS_DEV_VMD) {
+			print_imsm_capability(&entry->orom);
 			for (hba = list; hba; hba = hba->next) {
 				if (hba->type == SYS_DEV_VMD) {
 					char buf[PATH_MAX];
-					print_imsm_capability(&entry->orom);
 					printf(" I/O Controller : %s (%s)\n",
 						vmd_domain_to_controller(hba, buf), get_sys_dev_type(hba->type));
 					if (print_vmd_attached_devs(hba)) {
@@ -2034,9 +2030,9 @@ static int detail_platform_imsm(int verbose, int enumerate_only, char *controlle
 							pr_err("failed to get devices attached to VMD domain.\n");
 						result |= 2;
 					}
-					printf("\n");
 				}
 			}
+			printf("\n");
 			continue;
 		}
 
@@ -5999,14 +5995,6 @@ count_volumes(struct intel_hba *hba, int dpa, int verbose)
 		else
 			return 0;
 
-		/* VMD has one orom entry for all domain, but spanning is not allowed.
-		 * VMD arrays should be counted per domain (controller), so skip
-		 * domains that are not the given one.
-		 */
-		if (hba->type == SYS_DEV_VMD &&
-		   (strncmp(device->path, hba->path, strlen(device->path)) != 0))
-			continue;
-
 		devlist = get_devices(hba_path);
 		/* if no intel devices return zero volumes */
 		if (devlist == NULL)
@@ -9268,8 +9256,7 @@ int validate_container_imsm(struct mdinfo *info)
 			return 1;
 		}
 
-		if (orom != orom2 ||
-		    (hba->type == SYS_DEV_VMD && hba != hba2)) {
+		if (orom != orom2) {
 			pr_err("WARNING - IMSM container assembled with disks under different HBAs!\n"
 				"       This operation is not supported and can lead to data loss.\n");
 			return 1;
-- 
2.7.4


^ permalink raw reply related

* [PATCH] IMSM: Do not update metadata if not able to migrate
From: Pawel Baldysiak @ 2016-10-21  9:37 UTC (permalink / raw)
  To: jes.sorensen; +Cc: linux-raid, Pawel Baldysiak
In-Reply-To: <1477042671-12934-1-git-send-email-pawel.baldysiak@intel.com>

This patch prevents mdadm from updating metadata if migration is
not possible. The same check is done in analyse_change(),
but in that place - metadata is already modified.

Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
---
 super-intel.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/super-intel.c b/super-intel.c
index 6fc9cb1..5c6ab05 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -10067,6 +10067,11 @@ enum imsm_reshape_type imsm_analyze_change(struct supertype *st,
 			pr_err("Error. Chunk size change for RAID 10 is not supported.\n");
 			change = -1;
 			goto analyse_change_exit;
+		} else if (info.component_size % (geo->chunksize/512)) {
+			pr_err("New chunk size (%dK) does not evenly divide device size (%lluk). Aborting...\n",
+			       geo->chunksize/1024, info.component_size/2);
+			change = -1;
+			goto analyse_change_exit;
 		}
 		change = CH_MIGRATION;
 	} else {
-- 
2.7.4


^ permalink raw reply related

* [PATCH] IMSM: Add warning message when x8-type device is used
From: Pawel Baldysiak @ 2016-10-21  9:37 UTC (permalink / raw)
  To: jes.sorensen; +Cc: linux-raid, Pawel Baldysiak

This patch adds the warning message when x8-type device
is used with IMSM metadata. x8 device is a special
NVMe drive - two of them on a single PCIe card.
This card could be a single point of failure for
RAID levels different than RAID0. x8 devices have
serial number ending with "-A/-B" or "-1/-2".

Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 super-intel.c | 42 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/super-intel.c b/super-intel.c
index 5c6ab05..e860e75 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -5139,9 +5139,51 @@ static int add_to_super_imsm(struct supertype *st, mdu_disk_info_t *dk,
 	rv = imsm_read_serial(fd, devname, dd->serial);
 	if (rv) {
 		pr_err("failed to retrieve scsi serial, aborting\n");
+		if (dd->devname)
+			free(dd->devname);
 		free(dd);
 		abort();
 	}
+	if (super->hba && ((super->hba->type == SYS_DEV_NVME) ||
+	   (super->hba->type == SYS_DEV_VMD))) {
+		int i;
+		char *devpath = diskfd_to_devpath(fd);
+		char controller_path[PATH_MAX];
+
+		if (!devpath) {
+			pr_err("failed to get devpath, aborting\n");
+			if (dd->devname)
+				free(dd->devname);
+			free(dd);
+			return 1;
+		}
+
+		snprintf(controller_path, PATH_MAX-1, "%s/device", devpath);
+		free(devpath);
+
+		if (devpath_to_vendor(controller_path) == 0x8086) {
+			/*
+			 * If Intel's NVMe drive has serial ended with
+			 * "-A","-B","-1" or "-2" it means that this is "x8"
+			 * device (double drive on single PCIe card).
+			 * User should be warned about potential data loss.
+			 */
+			for (i = MAX_RAID_SERIAL_LEN-1; i > 0; i--) {
+				/* Skip empty character at the end */
+				if (dd->serial[i] == 0)
+					continue;
+
+				if (((dd->serial[i] == 'A') || (dd->serial[i] == 'B') ||
+				   (dd->serial[i] == '1') || (dd->serial[i] == '2')) &&
+				   (dd->serial[i-1] == '-'))
+					pr_err("\tThe action you are about to take may put your data at risk.\n"
+						"\tPlease note that x8 devices may consist of two separate x4 devices "
+						"located on a single PCIe port.\n"
+						"\tRAID 0 is the only supported configuration for this type of x8 device.\n");
+				break;
+			}
+		}
+	}
 
 	get_dev_size(fd, NULL, &size);
 	/* clear migr_rec when adding disk to container */
-- 
2.7.4


^ permalink raw reply related

* Re: How to fix mistake on raid: mdadm create instead of assemble?
From: Santiago DIEZ @ 2016-10-21  8:45 UTC (permalink / raw)
  To: Linux Raid LIST
In-Reply-To: <57FAC73A.6020603@youngman.org.uk>

Hi,

Thanks Andreas,

Yes apparently, 3/4 of the original disks seem to be safe. But I'm
terrified at the idea of doing something wrong assembling them.
Incidentally, I indeed did a mistake trying to assemble the ddrescue
images of the 3 safe disks. I tried to create again with proper
metadata and chunck but it did not work. I'm still scared at the idea
of restarting the original raid. I'm currently ddrescuing again the 3
partitions to then try and *assemble* them rather than *create*.

Thanks Wol,

I use loop devices because I work on partition images, not on actual partitions:
I use ddrescue to copy data from /dev/sd[abc]10 to
some.other.server:/home/sd[abc].img
Then I go to some.other.server and turn the images into loop devices :
losetup /dev/loop0 /home/sda10.img
losetup /dev/loop1 /home/sdb10.img
losetup /dev/loop2 /home/sdc10.img
Then I tried to created the raid, it worked but as I said, the
filesystem was unreadable.
I know the idea of using loop devices works because I tested it before.
I'm doing the whole procedure all over again (takes 5 days to ddrescue
the 3 partitions to another server) and then I will use the command
you recommended :
mdadm --assemble /dev/md0 /dev/loop0 /dev/loop1 /dev/loop2 --force

Will keep you posted

-------------------------
Santiago DIEZ
Quark Systems & CAOBA
23 rue du Buisson Saint-Louis, 75010 Paris
-------------------------

On Mon, Oct 10, 2016 at 12:39 AM, Wols Lists <antlists@youngman.org.uk> wrote:
>
> On 08/10/16 13:30, Andreas Klauer wrote:
> > On Fri, Oct 07, 2016 at 05:37:32PM +0200, Santiago DIEZ wrote:
> >> > First thing I did is ddrescue the remaining partitions sd[abc]10 .
> >> > ddrescue did not stumble into any read error so I assume all remaining
> >> > partitions are perfectly safe.
> > So ... don't you still have a good copy?
> >
> > You only killed one of them, right? Did not make same mistake twice?
> >
> >> > There comes my mistake: I ran the --create command instead of --assemble :
> >> >
> >> > ================================================================================
> >> > # mdadm --create --verbose /dev/md1 --raid-devices=4 --level=raid5
> >> > --run --readonly /dev/loop0 /dev/loop1 /dev/loop2 missing
>
> One oddity I've noticed. You've created the array using loop devices.
> What are these?
>
> The reason I ask is that using loopback devices is a standard technique
> for rebuilding a damaged array, specifically to prevent md from actually
> writing to the drive. So is it possible that "mdadm --create" only wrote
> to ram, and a reboot will recover your ddrescue copies untouched?
>
> My raid-fu isn't enough to tell me whether I'm right or not ... :-)
>
> If necessary you'll have to do another ddrescue from the original
> drives, and you should then be able to assemble the array from the
> copies. Don't use "missing", use "--force" and you should get a working,
> degraded, array to which you can add a new drive and rebuild the array.
>
> mdadm --assemble /dev/md0 /dev/sd[efg]10 --force
>
> if I'm right ... so long as it's the copies, you can always recover
> again from the original disks, and if there's a problem with the copies
> mdadm should complain when it assembles the array.
>
> Cheers,
> Wol

^ permalink raw reply

* Re: [PATCH 3/3 v2] md: unblock array if bad blocks have been acknowledged
From: Shaohua Li @ 2016-10-20 21:55 UTC (permalink / raw)
  To: Tomasz Majchrzak; +Cc: linux-raid
In-Reply-To: <20161020120915.GA8711@proton.igk.intel.com>

On Thu, Oct 20, 2016 at 02:09:15PM +0200, Tomasz Majchrzak wrote:
> On Wed, Oct 19, 2016 at 10:28:18PM -0700, Shaohua Li wrote:
> > On Tue, Oct 18, 2016 at 04:10:24PM +0200, Tomasz Majchrzak wrote:
> > > Once external metadata handler acknowledges all bad blocks (by writing
> > > to rdev 'bad_blocks' sysfs file), it requests to unblock the array.
> > > Check if all bad blocks are actually acknowledged as there might be a
> > > race if new bad blocks are notified at the same time. If all bad blocks
> > > are acknowledged, just unblock the array and continue. If not, ignore
> > > the request to unblock (do not fail an array). External metadata handler
> > > is expected to either process remaining bad blocks and try to unblock
> > > again or remove bad block support for a disk (which will cause disk to
> > > fail as in no-support case).
> > > 
> > > Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
> > > Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> > > ---
> > >  drivers/md/md.c | 24 +++++++++++++++++-------
> > >  1 file changed, 17 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/drivers/md/md.c b/drivers/md/md.c
> > > index cc05236..ce585b7 100644
> > > --- a/drivers/md/md.c
> > > +++ b/drivers/md/md.c
> > > @@ -2612,19 +2612,29 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len)
> > >  		set_bit(Blocked, &rdev->flags);
> > >  		err = 0;
> > >  	} else if (cmd_match(buf, "-blocked")) {
> > > -		if (!test_bit(Faulty, &rdev->flags) &&
> > > +		int unblock = 1;
> > > +		int acked = !rdev->badblocks.unacked_exist;
> > > +
> > > +		if ((test_bit(ExternalBbl, &rdev->flags) &&
> > > +		     rdev->badblocks.changed))
> > > +			acked = check_if_badblocks_acked(&rdev->badblocks);
> > > +
> > > +		if (test_bit(ExternalBbl, &rdev->flags) && !acked) {
> > > +			unblock = 0;
> > > +		} else if (!test_bit(Faulty, &rdev->flags) &&
> > 
> > I missed one thing in last review. writing to bad_blocks sysfs file already
> > clears the BlockedBadBlocks bit and wakeup the thread sleeping at blocked_wait,
> > so the array can continue. Why do we need to fix state_store here?
> 
> We cannot unblock the rdev until all bad blocks are acknowledged. The problem is
> mdadm cannot be sure it has stored all bad blocks in the first pass (read of
> unacknowledged_bad_blocks sysfs file). When bad block is encountered, rdev
> enters Blocked, Faulty state (unacked_exists is non-zero in state_show). Then
> mdadm reads the bad block, stores it in metadata and acknowledges it to the
> kernel. Initially I have tried to call ack_all_badblocks in bb_store or in
> state_store("-blocked") but there was a race. If other requests (the ones that
> had started before array got into blocked state) notified bad blocks after sysfs
> file was read by mdadm but before ack_all_badblocks call, ack_all_badblocks call
> was also acknowledging bad blocks not stored (and never to be as a result) in
> metadata. That's why I have introduced a new function
> check_if_all_badblocks_acked to close this race.
> 
> I'm not sure if native bad block support is not prone to the similar problem -
> bad block structure modified between metadata sync and ack_all_badblocks call.
> 

Yep, we always have the race here. Fortunately we don't need to wait all
badblocks acknowledged, the user of md_wait_for_blocked_rdev will retry. In the
retry, we will check if the badblock is acknowledged.

The native bad block support doesn't have the race. We copy badblocks to a new
page, clear badblocks->changed and then write the new page to disks.
ack_all_badblocks will check the ->changed, and do nothing if it's set. So if
something happens in between, ack_all_badblocks will do nothing.

While the external metadata array hasn't such mechanism to avoid race, I still
thought changing state_store isn't a good idea.

I just sent a patch to fix badblocks_set() and make it clear unacked_exists.
bb_store shouldn't call ack_all_badblocks in your case, but we don't need to.
As long as mdadm uses bb_store to acknowledge the ranges, the array can
continue. And if badblocks_set() can clear unacked_exists, the array will not
be reported as Blocked.

> As for BlockedBadBlocks flag cleared in bb_store, commit de393cdea66c ("md: make
> it easier to wait for bad blocks to be acknowledged") explains this flag is only
> an advisory. All awaiting requests are woken up and check if bad block that
> interests them is already acknowledged. If so, then can continue, and if not,
> they set the flag again to check in a while. It is just a useful optimization.

I think 'advisory' means the driver should retry
 
> Please note that rdev with unacknowledged bad block is reported as Blocked via
> sysfs state (non-zero unacked_exists), even though the corresponding rdev kernel
> flag is not set. It is the reason why mdadm calls state_store("-blocked").

If badblocks_set() can clear unacked_exists, this isn't required.

Thanks,
Shaohua

^ permalink raw reply

* [PATCH] badblocks: badblocks_set/clear update unacked_exist
From: Shaohua Li @ 2016-10-20 21:40 UTC (permalink / raw)
  To: linux-raid, linux-block; +Cc: tomasz.majchrzak

When bandblocks_set acknowledges a range or badblocks_clear a range,
it's possible all badblocks are acknowledged. We should update
unacked_exist if this occurs.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/badblocks.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/block/badblocks.c b/block/badblocks.c
index 7be53cb..139115d2 100644
--- a/block/badblocks.c
+++ b/block/badblocks.c
@@ -133,6 +133,26 @@ int badblocks_check(struct badblocks *bb, sector_t s, int sectors,
 }
 EXPORT_SYMBOL_GPL(badblocks_check);
 
+static void badblocks_update_acked(struct badblocks *bb)
+{
+	u64 *p = bb->page;
+	int i;
+	bool unacked = false;
+
+	if (!bb->unacked_exist)
+		return;
+
+	for (i = 0; i < bb->count ; i++) {
+		if (!BB_ACK(p[i])) {
+			unacked = true;
+			break;
+		}
+	}
+
+	if (!unacked)
+		bb->unacked_exist = 0;
+}
+
 /**
  * badblocks_set() - Add a range of bad blocks to the table.
  * @bb:		the badblocks structure that holds all badblock information
@@ -294,6 +314,8 @@ int badblocks_set(struct badblocks *bb, sector_t s, int sectors,
 	bb->changed = 1;
 	if (!acknowledged)
 		bb->unacked_exist = 1;
+	else
+		badblocks_update_acked(bb);
 	write_sequnlock_irqrestore(&bb->lock, flags);
 
 	return rv;
@@ -399,6 +421,7 @@ int badblocks_clear(struct badblocks *bb, sector_t s, int sectors)
 		}
 	}
 
+	badblocks_update_acked(bb);
 	bb->changed = 1;
 out:
 	write_sequnlock_irq(&bb->lock);
-- 
2.9.3


^ permalink raw reply related

* Re: [PATCH 1/4] mdadm: bad block support for external metadata - initialization
From: Tomasz Majchrzak @ 2016-10-20 14:26 UTC (permalink / raw)
  To: keld; +Cc: linux-raid
In-Reply-To: <20161020140220.GB17867@www5.open-std.org>

I cannot see how badblocks program is related to this patch. It is a generic
code for bad blocks support in IMSM metadata. It introduces 64-bit value for
sector address, the same size as in kernel. All it does is syncing kernel bad
block list with raid metadata.

Tomek

On Thu, Oct 20, 2016 at 04:02:20PM +0200, keld@keldix.com wrote:
> Is this safe for 2TB+ arrays?
> I think the badblocks program does not handle 2TB+ partitions?
> 
> best regards
> Keld
> 
> Best regards
> Keld
> 
> On Thu, Oct 20, 2016 at 03:03:46PM +0200, Tomasz Majchrzak wrote:
> > If metadata handler provides support for bad blocks, tell md by writing
> > 'external_bbl' to rdev state file (both on create and assemble),
> > followed by a list of known bad blocks written via sysfs 'bad_blocks'
> > file.
> > 
> > Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
> > Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> > ---
> >  mdadm.h | 13 +++++++++++++
> >  sysfs.c | 29 ++++++++++++++++++++++++++++-
> >  2 files changed, 41 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mdadm.h b/mdadm.h
> > index 0516c82..5156ea4 100755
> > --- a/mdadm.h
> > +++ b/mdadm.h
> > @@ -237,6 +237,17 @@ struct dlm_lksb {
> >  
> >  extern const char Name[];
> >  
> > +struct md_bb_entry {
> > +	unsigned long long sector;
> > +	int length;
> > +};
> > +
> > +struct md_bb {
> > +	int supported;
> > +	int count;
> > +	struct md_bb_entry *entries;
> > +};
> > +
> >  /* general information that might be extracted from a superblock */
> >  struct mdinfo {
> >  	mdu_array_info_t	array;
> > @@ -311,6 +322,8 @@ struct mdinfo {
> >  
> >  	/* info read from sysfs */
> >  	char		sysfs_array_state[20];
> > +
> > +	struct md_bb bb;
> >  };
> >  
> >  struct createinfo {
> > diff --git a/sysfs.c b/sysfs.c
> > index d28e21a..c7a8e66 100644
> > --- a/sysfs.c
> > +++ b/sysfs.c
> > @@ -50,8 +50,12 @@ void sysfs_free(struct mdinfo *sra)
> >  		while (sra->devs) {
> >  			struct mdinfo *d = sra->devs;
> >  			sra->devs = d->next;
> > +			if (d->bb.entries)
> > +				free(d->bb.entries);
> >  			free(d);
> >  		}
> > +		if (sra->bb.entries)
> > +			free(sra->bb.entries);
> >  		free(sra);
> >  		sra = sra2;
> >  	}
> > @@ -259,7 +263,7 @@ struct mdinfo *sysfs_read(int fd, char *devnm, unsigned long options)
> >  		dbase = base + strlen(base);
> >  		*dbase++ = '/';
> >  
> > -		dev = xmalloc(sizeof(*dev));
> > +		dev = xcalloc(1, sizeof(*dev));
> >  
> >  		/* Always get slot, major, minor */
> >  		strcpy(dbase, "slot");
> > @@ -687,6 +691,7 @@ int sysfs_add_disk(struct mdinfo *sra, struct mdinfo *sd, int resume)
> >  	char nm[PATH_MAX];
> >  	char *dname;
> >  	int rv;
> > +	int i;
> >  
> >  	sprintf(dv, "%d:%d", sd->disk.major, sd->disk.minor);
> >  	rv = sysfs_set_str(sra, NULL, "new_dev", dv);
> > @@ -718,6 +723,28 @@ int sysfs_add_disk(struct mdinfo *sra, struct mdinfo *sd, int resume)
> >  		if (resume)
> >  			sysfs_set_num(sra, sd, "recovery_start", sd->recovery_start);
> >  	}
> > +	if (sd->bb.supported) {
> > +		if (sysfs_set_str(sra, sd, "state", "external_bbl")) {
> > +			/*
> > +			 * backward compatibility - if kernel doesn't support
> > +			 * bad blocks for external metadata, let it continue
> > +			 * as long as there are none known so far
> > +			 */
> > +			if (sd->bb.count) {
> > +				pr_err("The kernel has no support for bad blocks in external metadata\n");
> > +				return -1;
> > +			}
> > +		}
> > +
> > +		for (i = 0; i < sd->bb.count; i++) {
> > +			char s[30];
> > +			const struct md_bb_entry *entry = &sd->bb.entries[i];
> > +
> > +			snprintf(s, sizeof(s) - 1, "%llu %d\n", entry->sector,
> > +				 entry->length);
> > +			rv |= sysfs_set_str(sra, sd, "bad_blocks", s);
> > +		}
> > +	}
> >  	return rv;
> >  }
> >  
> > -- 
> > 1.8.3.1
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox