Linux RAID subsystem development

Linux RAID subsystem development
 help / color / mirror / Atom feed

* Re: [PATCH 3/7] imsm: PPL support
From: Artur Paszkiewicz @ 2016-11-30  8:07 UTC (permalink / raw)
  To: Jes Sorensen; +Cc: linux-raid
In-Reply-To: <wrfjzikigvj8.fsf@redhat.com>

On 11/29/2016 04:23 PM, Jes Sorensen wrote:
> Artur Paszkiewicz <artur.paszkiewicz@intel.com> writes:
>> On 11/29/2016 12:51 AM, Jes Sorensen wrote:
>>>> @@ -3177,6 +3195,9 @@ static void getinfo_super_imsm(struct supertype *st, struct mdinfo *info, char *
>>>>  
>>>>  		disk = &super->disks->disk;
>>>>  		info->data_offset = total_blocks(&super->disks->disk) - reserved;
>>>> +		/* mpb anchor sector - see store_imsm_mpb() */
>>>> +		info->sb_start = total_blocks(&super->disks->disk) -
>>>> +				 ((2 * super->sector_size) >> 9);
>>>>  		info->component_size = reserved;
>>>>  		info->disk.state  = is_configured(disk) ? (1 << MD_DISK_ACTIVE) : 0;
>>>>  		/* we don't change info->disk.raid_disk here because
>>>
>>> Hi Artur,
>>>
>>> I have been sitting staring at the above for a while, and looking at
>>> store_imsm_mpb() it is not clear to me what is meant to happen here.
>>>
>>> For 4k sector drives, aren't you pushing back sb_start way further than
>>> you are for 512 byte sector drives? Ie. you are subtracting 16 sectors
>>> for the 4k drive but only two sectors for the 512 byte sector drive?
>>>
>>> Maybe it's because it's Monday or I lost the last of my marbles, but
>>> could you possibly enlighten me here please?
>>
>> Jes,
>>
>> You read it correctly. The reason for this is that the IMSM anchor is
>> located in the second _logical_ sector from the end of the drive. So for
>> 4k drives this will be 16 512-byte sectors from the end.
> 
> I see, so the IMSM implementation uses 512 byte logical sectors on top
> of 4k drives? Could I ask you to add a note explaining this in the code?

IMSM uses logical (4k or 512b) sector sizes in the metadata, but mdadm
implementation uses just 512 byte sectors. This is how it works since
Pawel's 4k support patches - it converts 4k metadata internally to 512b
sector values. Plus here the sb_start value is passed to the kernel, so
it must be in 512 byte sectors. Sure, I can add a comment about this.

Thanks,
Artur

^ permalink raw reply

* Re: Is trim/discard supported on LVM RAID logical volume?
From: Patrick Dung @ 2016-11-30  8:25 UTC (permalink / raw)
  To: Andreas Klauer; +Cc: Chris Murphy, Linux-RAID
In-Reply-To: <20161129233251.GA4404@metamorpher.de>

Hello,

I got a reply in linux-lvm that there should be problem, and Heinz
from Redhat is looking into it.
related message:
https://www.redhat.com/archives/linux-lvm/2016-November/msg00041.html

For your information, I have issue_discards set to 1.
If I create a non-RAID LVM, the fstrim works.

Thanks and regards,
Patrick

On Wed, Nov 30, 2016 at 7:32 AM, Andreas Klauer
<Andreas.Klauer@metamorpher.de> wrote:
> On Tue, Nov 29, 2016 at 03:21:36PM -0700, Chris Murphy wrote:
>> Reading the documentation for both, they're functionally the same
>> thing in that with these options the particular layer will pass on a
>> trim to the next layer underneath it.
>
> discard/fstrim/blkdiscard works fine with issue_discards=0.
>
> If it doesn't work, setting issue_discards=1 won't do a thing for you.
> That flag only affects the lvm tool itself, discard freed LVM extents
> after lvreduce or pvmove. Can't undo botched resizes anymore, too bad.
>
> Most storage layers simply don't have any setting for this at all.
> It just works. Or it does not (in many cases, old/stable kernel/distro).
> cryptsetup's allow-discards is the exception, the one layer where you
> have to manually enable something to make it work.
>
> Regards
> Andreas Klauer

^ permalink raw reply

* [PATCH v3 3/9] imsm: give md list of known bad blocks on startup
From: Tomasz Majchrzak @ 2016-11-30  8:41 UTC (permalink / raw)
  To: linux-raid; +Cc: Jes.Sorensen, Tomasz Majchrzak
In-Reply-To: <wrfjzikh7whw.fsf@redhat.com>

On create set bad block support flag for each drive. On assmble also
provide a list of known bad blocks. Bad blocks are stored in metadata
per disk so they have to be checked against volume boundaries
beforehand.

Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
---
 super-intel.c | 59 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 59 insertions(+)

diff --git a/super-intel.c b/super-intel.c
index 20144fe..9720ced 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -865,6 +865,56 @@ static int load_bbm_log(struct intel_super *super)
 	return 0;
 }
 
+/* checks if bad block is within volume boundaries */
+static int is_bad_block_in_volume(const struct bbm_log_entry *entry,
+			const unsigned long long start_sector,
+			const unsigned long long size)
+{
+	unsigned long long bb_start;
+	unsigned long long bb_end;
+
+	bb_start = __le48_to_cpu(&entry->defective_block_start);
+	bb_end = bb_start + (entry->marked_count + 1);
+
+	if (((bb_start >= start_sector) && (bb_start < start_sector + size)) ||
+	    ((bb_end >= start_sector) && (bb_end <= start_sector + size)))
+		return 1;
+
+	return 0;
+}
+
+/* get list of bad blocks on a drive for a volume */
+static void get_volume_badblocks(const struct bbm_log *log, const __u8 idx,
+			const unsigned long long start_sector,
+			const unsigned long long size,
+			struct md_bb *bbs)
+{
+	__u32 count = 0;
+	__u32 i;
+
+	for (i = 0; i < log->entry_count; i++) {
+		const struct bbm_log_entry *ent =
+			&log->marked_block_entries[i];
+		struct md_bb_entry *bb;
+
+		if ((ent->disk_ordinal == idx) &&
+		    is_bad_block_in_volume(ent, start_sector, size)) {
+
+			if (!bbs->entries) {
+				bbs->entries = xmalloc(BBM_LOG_MAX_ENTRIES *
+						     sizeof(*bb));
+				if (!bbs->entries)
+					break;
+			}
+
+			bb = &bbs->entries[count++];
+			bb->sector = __le48_to_cpu(&ent->defective_block_start);
+			bb->length = ent->marked_count + 1;
+		}
+	}
+	bbs->count = count;
+}
+
 /*
  * for second_map:
  *  == MAP_0 get first map
@@ -3001,6 +3051,7 @@ static void getinfo_super_imsm_volume(struct supertype *st, struct mdinfo *info,
 							info->array.chunk_size,
 							super->sector_size,
 							info->component_size);
+	info->bb.supported = 0;
 
 	memset(info->uuid, 0, sizeof(info->uuid));
 	info->recovery_start = MaxSector;
@@ -3167,6 +3218,7 @@ static void getinfo_super_imsm(struct supertype *st, struct mdinfo *info, char *
 	info->name[0] = 0;
 	info->recovery_start = MaxSector;
 	info->recovery_blocked = imsm_reshape_blocks_arrays_changes(st->sb);
+	info->bb.supported = 0;
 
 	/* do we have the all the insync disks that we expect? */
 	mpb = super->anchor;
@@ -7152,6 +7204,12 @@ static struct mdinfo *container_content_imsm(struct supertype *st, char *subarra
 			} else {
 				info_d->component_size = blocks_per_member(map);
 			}
+
+			info_d->bb.supported = 0;
+			get_volume_badblocks(super->bbm_log, ord_to_idx(ord),
+					     info_d->data_offset,
+					     info_d->component_size,
+					     &info_d->bb);
 		}
 		/* now that the disk list is up-to-date fixup recovery_start */
 		update_recovery_start(super, dev, this);
@@ -8158,6 +8216,7 @@ static struct mdinfo *imsm_activate_spare(struct active_array *a,
 		di->data_offset = pba_of_lba0(map);
 		di->component_size = a->info.component_size;
 		di->container_member = inst;
+		di->bb.supported = 0;
 		super->random = random32();
 		di->next = rv;
 		rv = di;
-- 
1.8.3.1


^ permalink raw reply related

* Re: [PATCH v2 3/9] imsm: give md list of known bad blocks on startup
From: Tomasz Majchrzak @ 2016-11-30  8:51 UTC (permalink / raw)
  To: Jes Sorensen; +Cc: linux-raid
In-Reply-To: <wrfjzikh7whw.fsf@redhat.com>

On Tue, Nov 29, 2016 at 05:27:07PM -0500, Jes Sorensen wrote:
> Tomasz Majchrzak <tomasz.majchrzak@intel.com> writes:
> > On create set bad block support flag for each drive. On assmble also
> > provide a list of known bad blocks. Bad blocks are stored in metadata
> > per disk so they have to be checked against volume boundaries
> > beforehand.
> >
> > Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
> > ---
> >  super-intel.c | 59 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 59 insertions(+)
> 
> Tomasz,
> 
> Thanks for posting 1/9 v3 - I was in the process of applying this set,
> but something is causing a conflict.
> 
> > diff --git a/super-intel.c b/super-intel.c
> > index 421bfbc..b2afdff 100644
> > --- a/super-intel.c
> > +++ b/super-intel.c
> [snip]
> > @@ -7160,6 +7212,12 @@ static struct mdinfo *container_content_imsm(struct supertype *st, char *subarra
> >  			info_d->events = __le32_to_cpu(mpb->generation_num);
> >  			info_d->data_offset = pba_of_lba0(map);
> >  			info_d->component_size = blocks_per_member(map);
> > +
> > +			info_d->bb.supported = 0;
> > +			get_volume_badblocks(super->bbm_log, ord_to_idx(ord),
> > +					     info_d->data_offset,
> > +					     info_d->component_size,
> > +					     &info_d->bb);
> >  		}
> >  		/* now that the disk list is up-to-date fixup recovery_start */
> >  		update_recovery_start(super, dev, this);
> 
> This hunk is failing as my tree doesn't have the line above:
> 			info_d->component_size = blocks_per_member(map);
> 
> I can merge this manually, but I prefer to be sure we are in sync just
> in case. Where did that line come from - did I miss an earlier patch?

My bad. I have just sent new version of the patch. Previously I had tested
it against my local mirror of your repository which happened to run
out-of-sync few days ago. Please keep telling me when patches do not apply
cleanly, it would help me to spot my working environment issues.

Thanks,

Tomek

^ permalink raw reply

* Re: [BUG] MD/RAID1 hung forever on bitmap_startwrite+0x122
From: Jinpu Wang @ 2016-11-30  9:29 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, NeilBrown
In-Reply-To: <20161130000838.ahkicokl6ccbrf5x@kernel.org>

On Wed, Nov 30, 2016 at 1:08 AM, Shaohua Li <shli@kernel.org> wrote:
> On Mon, Nov 28, 2016 at 09:45:07AM +0100, Jinpu Wang wrote:
>> Hi folks,
>>
>> We hit another hung task on bitmap_startwrite with our test machines, this time
>> it's hung in bitmap_startwrite.
>>
>> We build MD/RAID1 over 2 block devices exported via IB,
>> bitmap=internal. KVM build on top of
>> RAID1 on compute node, disks are on remote storage node, one storage
>> node crash/reboot, multiple raid1 on multiple compute node KVM run
>> into hung task:
>>
>> [106204.343870] INFO: task kvm:37669 blocked for more than 180 seconds.
>>
>> [106204.344138]       Tainted: G          IO    4.4.28-1-pserver #1
>>
>> [106204.344385] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> disables this message.
>>
>> [106204.344798] kvm             D ffff882037723710     0 37669      1 0x00000000
>>
>> [106204.344805]  ffff882037723710 ffff882038f08d00 ffff882029770d00
>> ffff8820377236d8
>>
>> [106204.344809]  ffff8820377236d8 ffff882037724000 0000000000308648
>> 0000000000000008
>>
>> [106204.344813]  ffff880f9bd9e8c0 ffff882037723768 ffff882037723728
>> ffffffff81811c60
>>
>> [106204.344818] Call Trace:
>>
>> [106204.344831]  [<ffffffff81811c60>] schedule+0x30/0x80
>>
>> [106204.344841]  [<ffffffffa09d31a2>] bitmap_startwrite+0x122/0x190 [md_mod]
>>
>> [106204.344848]  [<ffffffff813f660b>] ? bio_clone_bioset+0x11b/0x310
>>
>> [106204.344853]  [<ffffffff810956b0>] ? wait_woken+0x80/0x80
>>
>> [106204.344859]  [<ffffffffa0cc5127>] 0xffffffffa0cc5127
>>
>> [106204.344865]  [<ffffffffa09c4863>] md_set_array_sectors+0xac3/0xe20 [md_mod]
>>
>> [106204.344871]  [<ffffffff813faa94>] ? generic_make_request_checks+0x234/0x4c0
>>
>> [106204.344875]  [<ffffffff813fdb91>] blk_prologue_bio+0x91/0xc0
>>
>> [106204.344879]  [<ffffffff813fd54e>] generic_make_request+0xfe/0x1e0
>>
>> [106204.344883]  [<ffffffff813fd692>] submit_bio+0x62/0x150
>>
>> [106204.344892]  [<ffffffff811d3257>] do_blockdev_direct_IO+0x2317/0x2ba0
>>
>> [106204.344897]  [<ffffffff810b9999>] ? __remove_hrtimer+0x89/0xa0
>>
>> [106204.344903]  [<ffffffff8173c08f>] ? udp_poll+0x1f/0xb0
>>
>> [106204.344908]  [<ffffffff816b71c7>] ? sock_poll+0x57/0x120
>>
>> [106204.344913]  [<ffffffff811cdbf0>] ? I_BDEV+0x10/0x10
>>
>> [106204.344918]  [<ffffffff811d3b1e>] __blockdev_direct_IO+0x3e/0x40
>>
>> [106204.344922]  [<ffffffff811ce287>] blkdev_direct_IO+0x47/0x50
>>
>> [106204.344930]  [<ffffffff81132c60>] generic_file_direct_write+0xb0/0x170
>>
>> [106204.344934]  [<ffffffff81132ded>] __generic_file_write_iter+0xcd/0x1f0
>>
>> [106204.344943]  [<ffffffff81184ff8>] ? kmem_cache_free+0x78/0x190
>>
>> [106204.344948]  [<ffffffff811ce4c0>] ? bd_unlink_disk_holder+0xf0/0xf0
>>
>> [106204.344952]  [<ffffffff811ce547>] blkdev_write_iter+0x87/0x110
>>
>> [106204.344956]  [<ffffffff811ce4c0>] ? bd_unlink_disk_holder+0xf0/0xf0
>>
>> [106204.344962]  [<ffffffff811dec56>] aio_run_iocb+0x236/0x2a0
>>
>> [106204.344966]  [<ffffffff811dd183>] ? eventfd_ctx_read+0x53/0x200
>>
>> [106204.344973]  [<ffffffff811b3bbf>] ? __fget_light+0x1f/0x60
>>
>> [106204.344976]  [<ffffffff811b3c0e>] ? __fdget+0xe/0x10
>>
>> [106204.344980]  [<ffffffff811dfb5a>] do_io_submit+0x23a/0x4d0
>>
>> [106204.344985]  [<ffffffff811dfdfb>] SyS_io_submit+0xb/0x10
>>
>> [106204.344989]  [<ffffffff818154d7>] entry_SYSCALL_64_fastpath+0x12/0x6a
>>
>> [106384.345330] INFO: task kvm:37669 blocked for more than 180 seconds.
>>
>> [106384.345621]       Tainted: G          IO    4.4.28-1-pserver #1
>>
>> [106384.345866] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> disables this message.
>>
>> [106384.346275] kvm             D ffff882037723710     0 37669      1 0x00000000
>>
>> [106384.346282]  ffff882037723710 ffff882038f08d00 ffff882029770d00
>> ffff8820377236d8
>>
>> [106384.346286]  ffff8820377236d8 ffff882037724000 0000000000308648
>> 0000000000000008
>>
>> [106384.346290]  ffff880f9bd9e8c0 ffff882037723768 ffff882037723728
>> ffffffff81811c60
>>
>> [106384.346294] Call Trace:
>>
>> [106384.346308]  [<ffffffff81811c60>] schedule+0x30/0x80
>>
>> [106384.346317]  [<ffffffffa09d31a2>] bitmap_startwrite+0x122/0x190 [md_mod]
>>
>> [106384.346325]  [<ffffffff813f660b>] ? bio_clone_bioset+0x11b/0x310
>>
>> [106384.346330]  [<ffffffff810956b0>] ? wait_woken+0x80/0x80
>>
>> [106384.346336]  [<ffffffffa0cc5127>] 0xffffffffa0cc5127
>>
>> [106384.346341]  [<ffffffffa09c4863>] md_set_array_sectors+0xac3/0xe20 [md_mod]
>>
>> [106384.346347]  [<ffffffff813faa94>] ? generic_make_request_checks+0x234/0x4c0
>>
>> [106384.346352]  [<ffffffff813fdb91>] blk_prologue_bio+0x91/0xc0
>>
>> [106384.346356]  [<ffffffff813fd54e>] generic_make_request+0xfe/0x1e0
>>
>> [106384.346360]  [<ffffffff813fd692>] submit_bio+0x62/0x150
>>
>> [106384.346369]  [<ffffffff811d3257>] do_blockdev_direct_IO+0x2317/0x2ba0
>>
>>
>> (gdb) l *bitmap_startwrite+0x122
>>
>> 0x121d2 is in bitmap_startwrite (drivers/md/bitmap.c:1396).
>>
>>
>>
>> 1394                 if (unlikely(COUNTER(*bmc) == COUNTER_MAX)) {
>>
>> 1395                         DEFINE_WAIT(__wait);
>>
>> 1396                         /* note that it is safe to do the prepare_to_wait
>>
>> 1397                          * after the test as long as we do it
>> before dropping
>>
>> 1398                          * the spinlock.
>>
>> 1399                          */
>>
>> 1400                         prepare_to_wait(&bitmap->overflow_wait, &__wait,
>>
>> 1401                                         TASK_UNINTERRUPTIBLE);
>>
>> 1402                         spin_unlock_irq(&bitmap->counts.lock);
>>
>> 1403                         schedule();
>>
>> 1404                         finish_wait(&bitmap->overflow_wait, &__wait);
>>
>> 1405                         continue;
>>
>> 1406                 }
>>
>> so seem KVM is waiting on overflow_wait queue, but somehow no one wake
>> him up. During reboot one storage, raid1 has a lot IO errors in that
>> time, I guess some error handle part is broken.
>>
>> I haven't have a reproducer, just want to report to community, in case
>> this is known bug, or anyone has patch already :)
>
> Does the kernel report the raid disk is faulty and gets removed? Is this a
> real hang, eg maybe we are waitting for IO error reported.


Thanks Shaohua for reply.

I checked the log again, the hung task was there for 10+ hours.
I found something wrong with the testcase, it create MD/RAID1 on two
drives from 2 remote storage servers, and
reboot one storage before some MDs finished recovery, and then the
other storage also rebooted, this led to both legs
were broken, and somehow led to the hung tasks.

As RAID1 can't handle 2 broken legs at same time, I think we will
change our test case for more practical case.


-- 
Jinpu Wang
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin

Tel:       +49 30 577 008  042
Fax:      +49 30 577 008 299
Email:    jinpu.wang@profitbricks.com
URL:      https://www.profitbricks.de

Sitz der Gesellschaft: Berlin
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B
Geschäftsführer: Achim Weiss

^ permalink raw reply

* Feature request, resumable raid check action
From: Patrick Dung @ 2016-11-30 12:21 UTC (permalink / raw)
  To: linux-raid

Hello,

As I know if MD raid is using the newer metadata version, it support
resumable raid rebuild/sync. (that is, if a server is rebooted during
rebuild, it would resume from last position after reboot, instead of
starting from beginning).

In my recently testing (a few months ago):
I sometimes use the mdadm 'check' action for doing the disk scrubbing
of a MD raid.
After I rebooted the server, the 'check' operation is forgotten and is
not resumable.

I think resumable 'check' operation is useful as the array size would
become bigger in the future.

Thanks and regards,
Patrick

^ permalink raw reply

* Re: [PATCH 3/7] imsm: PPL support
From: Jes Sorensen @ 2016-11-30 15:40 UTC (permalink / raw)
  To: Artur Paszkiewicz; +Cc: linux-raid
In-Reply-To: <10603129-62b3-6342-bfc9-095c5b43fb4c@intel.com>

Artur Paszkiewicz <artur.paszkiewicz@intel.com> writes:
> On 11/29/2016 04:23 PM, Jes Sorensen wrote:
>> Artur Paszkiewicz <artur.paszkiewicz@intel.com> writes:
>>> On 11/29/2016 12:51 AM, Jes Sorensen wrote:
>>>>> @@ -3177,6 +3195,9 @@ static void getinfo_super_imsm(struct
>>>>> supertype *st, struct mdinfo *info, char *
>>>>>  
>>>>>  		disk = &super->disks->disk;
>>>>>  		info->data_offset = total_blocks(&super->disks->disk)
>>>>> - reserved;
>>>>> +		/* mpb anchor sector - see store_imsm_mpb() */
>>>>> +		info->sb_start = total_blocks(&super->disks->disk) -
>>>>> +				 ((2 * super->sector_size) >> 9);
>>>>>  		info->component_size = reserved;
>>>>>  		info->disk.state = is_configured(disk) ? (1 <<
>>>>> MD_DISK_ACTIVE) : 0;
>>>>>  		/* we don't change info->disk.raid_disk here because
>>>>
>>>> Hi Artur,
>>>>
>>>> I have been sitting staring at the above for a while, and looking at
>>>> store_imsm_mpb() it is not clear to me what is meant to happen here.
>>>>
>>>> For 4k sector drives, aren't you pushing back sb_start way further than
>>>> you are for 512 byte sector drives? Ie. you are subtracting 16 sectors
>>>> for the 4k drive but only two sectors for the 512 byte sector drive?
>>>>
>>>> Maybe it's because it's Monday or I lost the last of my marbles, but
>>>> could you possibly enlighten me here please?
>>>
>>> Jes,
>>>
>>> You read it correctly. The reason for this is that the IMSM anchor is
>>> located in the second _logical_ sector from the end of the drive. So for
>>> 4k drives this will be 16 512-byte sectors from the end.
>> 
>> I see, so the IMSM implementation uses 512 byte logical sectors on top
>> of 4k drives? Could I ask you to add a note explaining this in the code?
>
> IMSM uses logical (4k or 512b) sector sizes in the metadata, but mdadm
> implementation uses just 512 byte sectors. This is how it works since
> Pawel's 4k support patches - it converts 4k metadata internally to 512b
> sector values. Plus here the sb_start value is passed to the kernel, so
> it must be in 512 byte sectors. Sure, I can add a comment about this.

Great, I prefer to have it documented so nobody else tries to pull all
their hairs out trying to understand it :)

Thanks,
Jes

^ permalink raw reply

* Re: Is trim/discard supported on LVM RAID logical volume?
From: Phil Turmel @ 2016-11-30 16:18 UTC (permalink / raw)
  To: Patrick Dung, linux-raid
In-Reply-To: <CAEtPA0ByX8yYx8xNc_=4SP0bqXVz-D4o-bM5XZF8Y6NcnqsEKQ@mail.gmail.com>

On 11/29/2016 07:59 AM, Patrick Dung wrote:
> Hello,
> 
> Sorry if it is a cross posting.
> I had post my question in linux-lvm mailing list but did not have reply.
> 
> In my old setting, fstrim is supported:
> ext4 over LVM over MD software raid 1 (mdadm) > SSD.
> 
> After reading the recommendation in RHEL 6
> https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Storage_Administration_Guide/ch-ssd.html
> 
> I changed to use raid 1 mirror using 'LVM RAID logical volume'. It
> make use of MD raid internally.
> But I found fstrim does not work anymore. Is the behavior expected?

I can't remember when it was changed, but md raid is only trustworthy
for the mirroring types, and only if the underlying devices return
zeroes when reading trimmed blocks.  Since this isn't guaranteed, and I
recall reading that some devices that claim to return zeroes don't, md
raid cannot determine automatically if it is safe.

Theoretically, raid5 could be compatible if the underlying device
returns zeros, and trims always cover entire stripes.  But that would
require code within mdadm to track stripe-fragment trims that have been
received.  Blegh.

Phil

^ permalink raw reply

* Re: [PATCH] Reorganize make_request to clean up code.
From: Christoph Hellwig @ 2016-11-30 16:24 UTC (permalink / raw)
  To: Robert LeBlanc; +Cc: linux-raid
In-Reply-To: <20161129230711.9987-1-robert@leblancnet.us>

On Tue, Nov 29, 2016 at 04:07:11PM -0700, Robert LeBlanc wrote:
> This code only runs during a write, so move it to the write section to
> clean up the code.
> 
> Signed-off-by: Robert LeBlanc <robert@leblancnet.us>

Can you please just split make_request into one function for reads
and one for writes?  There is no shared code at all, so this should be
much cleaner.

^ permalink raw reply

* [PATCH 0/2] Reorganize raid*_make_request to clean up code
From: Robert LeBlanc @ 2016-11-30 21:20 UTC (permalink / raw)
  To: linux-raid; +Cc: Robert LeBlanc

In response to Christoph, I've broken the read and writes into their own
functions to make the code even cleaner. Since it is such a big change, I broke
up the commits into this series instead of creating a v2 of the previous patch.

Robert LeBlanc (2):
  md/raid1: Refactor raid1_make_request
  md/raid10: Refactor raid10_make_request

 drivers/md/raid1.c  | 244 +++++++++++++++++++++++-----------------------
 drivers/md/raid10.c | 271 +++++++++++++++++++++++++++-------------------------
 2 files changed, 265 insertions(+), 250 deletions(-)

-- 
2.10.2

^ permalink raw reply

* [PATCH 1/2] md/raid1: Refactor raid1_make_request
From: Robert LeBlanc @ 2016-11-30 21:20 UTC (permalink / raw)
  To: linux-raid; +Cc: Robert LeBlanc
In-Reply-To: <20161130212020.15762-1-robert@leblancnet.us>

Refactor raid1_make_request to make read and write code in their own
functions to clean up the code.

Signed-off-by: Robert LeBlanc <robert@leblancnet.us>
---
 drivers/md/raid1.c | 244 +++++++++++++++++++++++++++--------------------------
 1 file changed, 126 insertions(+), 118 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 21dc00e..5aa0f89 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1032,17 +1032,96 @@ static void raid1_unplug(struct blk_plug_cb *cb, bool from_schedule)
 	kfree(plug);
 }
 
-static void raid1_make_request(struct mddev *mddev, struct bio * bio)
+static void raid1_read_request(struct mddev *mddev, struct bio *bio,
+				 struct r1bio *r1_bio)
 {
 	struct r1conf *conf = mddev->private;
 	struct raid1_info *mirror;
-	struct r1bio *r1_bio;
 	struct bio *read_bio;
+	struct bitmap *bitmap = mddev->bitmap;
+	const int op = bio_op(bio);
+	const unsigned long do_sync = (bio->bi_opf & REQ_SYNC);
+	int sectors_handled;
+	int max_sectors;
+	int rdisk;
+
+read_again:
+	rdisk = read_balance(conf, r1_bio, &max_sectors);
+
+	if (rdisk < 0) {
+		/* couldn't find anywhere to read from */
+		raid_end_bio_io(r1_bio);
+		return;
+	}
+	mirror = conf->mirrors + rdisk;
+
+	if (test_bit(WriteMostly, &mirror->rdev->flags) &&
+	    bitmap) {
+		/* Reading from a write-mostly device must
+		 * take care not to over-take any writes
+		 * that are 'behind'
+		 */
+		wait_event(bitmap->behind_wait,
+			   atomic_read(&bitmap->behind_writes) == 0);
+	}
+	r1_bio->read_disk = rdisk;
+	r1_bio->start_next_window = 0;
+
+	read_bio = bio_clone_mddev(bio, GFP_NOIO, mddev);
+	bio_trim(read_bio, r1_bio->sector - bio->bi_iter.bi_sector,
+		 max_sectors);
+
+	r1_bio->bios[rdisk] = read_bio;
+
+	read_bio->bi_iter.bi_sector = r1_bio->sector +
+		mirror->rdev->data_offset;
+	read_bio->bi_bdev = mirror->rdev->bdev;
+	read_bio->bi_end_io = raid1_end_read_request;
+	bio_set_op_attrs(read_bio, op, do_sync);
+	read_bio->bi_private = r1_bio;
+
+	if (max_sectors < r1_bio->sectors) {
+		/* could not read all from this device, so we will
+		 * need another r1_bio.
+		 */
+
+		sectors_handled = (r1_bio->sector + max_sectors
+				   - bio->bi_iter.bi_sector);
+		r1_bio->sectors = max_sectors;
+		spin_lock_irq(&conf->device_lock);
+		if (bio->bi_phys_segments == 0)
+			bio->bi_phys_segments = 2;
+		else
+			bio->bi_phys_segments++;
+		spin_unlock_irq(&conf->device_lock);
+		/* Cannot call generic_make_request directly
+		 * as that will be queued in __make_request
+		 * and subsequent mempool_alloc might block waiting
+		 * for it.  So hand bio over to raid1d.
+		 */
+		reschedule_retry(r1_bio);
+
+		r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
+
+		r1_bio->master_bio = bio;
+		r1_bio->sectors = bio_sectors(bio) - sectors_handled;
+		r1_bio->state = 0;
+		r1_bio->mddev = mddev;
+		r1_bio->sector = bio->bi_iter.bi_sector + sectors_handled;
+		goto read_again;
+	} else
+		generic_make_request(read_bio);
+}
+
+static void raid1_write_request(struct mddev *mddev, struct bio *bio,
+				struct r1bio *r1_bio,
+				sector_t start_next_window)
+{
+	struct r1conf *conf = mddev->private;
 	int i, disks;
-	struct bitmap *bitmap;
+	struct bitmap *bitmap = mddev->bitmap;
 	unsigned long flags;
 	const int op = bio_op(bio);
-	const int rw = bio_data_dir(bio);
 	const unsigned long do_sync = (bio->bi_opf & REQ_SYNC);
 	const unsigned long do_flush_fua = (bio->bi_opf &
 						(REQ_PREFLUSH | REQ_FUA));
@@ -1052,7 +1131,6 @@ static void raid1_make_request(struct mddev *mddev, struct bio * bio)
 	int first_clone;
 	int sectors_handled;
 	int max_sectors;
-	sector_t start_next_window;
 
 	/*
 	 * Register the new request and wait if the reconstruction
@@ -1062,12 +1140,11 @@ static void raid1_make_request(struct mddev *mddev, struct bio * bio)
 
 	md_write_start(mddev, bio); /* wait on superblock update early */
 
-	if (bio_data_dir(bio) == WRITE &&
-	    ((bio_end_sector(bio) > mddev->suspend_lo &&
+	if ((bio_end_sector(bio) > mddev->suspend_lo &&
 	    bio->bi_iter.bi_sector < mddev->suspend_hi) ||
 	    (mddev_is_clustered(mddev) &&
 	     md_cluster_ops->area_resyncing(mddev, WRITE,
-		     bio->bi_iter.bi_sector, bio_end_sector(bio))))) {
+		     bio->bi_iter.bi_sector, bio_end_sector(bio)))) {
 		/* As the suspend_* range is controlled by
 		 * userspace, we want an interruptible
 		 * wait.
@@ -1081,119 +1158,14 @@ static void raid1_make_request(struct mddev *mddev, struct bio * bio)
 			    bio->bi_iter.bi_sector >= mddev->suspend_hi ||
 			    (mddev_is_clustered(mddev) &&
 			     !md_cluster_ops->area_resyncing(mddev, WRITE,
-				     bio->bi_iter.bi_sector, bio_end_sector(bio))))
+				     bio->bi_iter.bi_sector,
+				     bio_end_sector(bio))))
 				break;
 			schedule();
 		}
 		finish_wait(&conf->wait_barrier, &w);
 	}
 
-	start_next_window = wait_barrier(conf, bio);
-
-	bitmap = mddev->bitmap;
-
-	/*
-	 * make_request() can abort the operation when read-ahead is being
-	 * used and no empty request is available.
-	 *
-	 */
-	r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
-
-	r1_bio->master_bio = bio;
-	r1_bio->sectors = bio_sectors(bio);
-	r1_bio->state = 0;
-	r1_bio->mddev = mddev;
-	r1_bio->sector = bio->bi_iter.bi_sector;
-
-	/* We might need to issue multiple reads to different
-	 * devices if there are bad blocks around, so we keep
-	 * track of the number of reads in bio->bi_phys_segments.
-	 * If this is 0, there is only one r1_bio and no locking
-	 * will be needed when requests complete.  If it is
-	 * non-zero, then it is the number of not-completed requests.
-	 */
-	bio->bi_phys_segments = 0;
-	bio_clear_flag(bio, BIO_SEG_VALID);
-
-	if (rw == READ) {
-		/*
-		 * read balancing logic:
-		 */
-		int rdisk;
-
-read_again:
-		rdisk = read_balance(conf, r1_bio, &max_sectors);
-
-		if (rdisk < 0) {
-			/* couldn't find anywhere to read from */
-			raid_end_bio_io(r1_bio);
-			return;
-		}
-		mirror = conf->mirrors + rdisk;
-
-		if (test_bit(WriteMostly, &mirror->rdev->flags) &&
-		    bitmap) {
-			/* Reading from a write-mostly device must
-			 * take care not to over-take any writes
-			 * that are 'behind'
-			 */
-			wait_event(bitmap->behind_wait,
-				   atomic_read(&bitmap->behind_writes) == 0);
-		}
-		r1_bio->read_disk = rdisk;
-		r1_bio->start_next_window = 0;
-
-		read_bio = bio_clone_mddev(bio, GFP_NOIO, mddev);
-		bio_trim(read_bio, r1_bio->sector - bio->bi_iter.bi_sector,
-			 max_sectors);
-
-		r1_bio->bios[rdisk] = read_bio;
-
-		read_bio->bi_iter.bi_sector = r1_bio->sector +
-			mirror->rdev->data_offset;
-		read_bio->bi_bdev = mirror->rdev->bdev;
-		read_bio->bi_end_io = raid1_end_read_request;
-		bio_set_op_attrs(read_bio, op, do_sync);
-		read_bio->bi_private = r1_bio;
-
-		if (max_sectors < r1_bio->sectors) {
-			/* could not read all from this device, so we will
-			 * need another r1_bio.
-			 */
-
-			sectors_handled = (r1_bio->sector + max_sectors
-					   - bio->bi_iter.bi_sector);
-			r1_bio->sectors = max_sectors;
-			spin_lock_irq(&conf->device_lock);
-			if (bio->bi_phys_segments == 0)
-				bio->bi_phys_segments = 2;
-			else
-				bio->bi_phys_segments++;
-			spin_unlock_irq(&conf->device_lock);
-			/* Cannot call generic_make_request directly
-			 * as that will be queued in __make_request
-			 * and subsequent mempool_alloc might block waiting
-			 * for it.  So hand bio over to raid1d.
-			 */
-			reschedule_retry(r1_bio);
-
-			r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
-
-			r1_bio->master_bio = bio;
-			r1_bio->sectors = bio_sectors(bio) - sectors_handled;
-			r1_bio->state = 0;
-			r1_bio->mddev = mddev;
-			r1_bio->sector = bio->bi_iter.bi_sector +
-				sectors_handled;
-			goto read_again;
-		} else
-			generic_make_request(read_bio);
-		return;
-	}
-
-	/*
-	 * WRITE:
-	 */
 	if (conf->pending_count >= max_queued_requests) {
 		md_wakeup_thread(mddev->thread);
 		wait_event(conf->wait_barrier,
@@ -1236,8 +1208,7 @@ read_again:
 			int bad_sectors;
 			int is_bad;
 
-			is_bad = is_badblock(rdev, r1_bio->sector,
-					     max_sectors,
+			is_bad = is_badblock(rdev, r1_bio->sector, max_sectors,
 					     &first_bad, &bad_sectors);
 			if (is_bad < 0) {
 				/* mustn't write here until the bad block is
@@ -1325,7 +1296,8 @@ read_again:
 			continue;
 
 		mbio = bio_clone_mddev(bio, GFP_NOIO, mddev);
-		bio_trim(mbio, r1_bio->sector - bio->bi_iter.bi_sector, max_sectors);
+		bio_trim(mbio, r1_bio->sector - bio->bi_iter.bi_sector,
+			 max_sectors);
 
 		if (first_clone) {
 			/* do behind I/O ?
@@ -1408,6 +1380,42 @@ read_again:
 	wake_up(&conf->wait_barrier);
 }
 
+static void raid1_make_request(struct mddev *mddev, struct bio *bio)
+{
+	struct r1conf *conf = mddev->private;
+	struct r1bio *r1_bio;
+	sector_t start_next_window = wait_barrier(conf, bio);
+
+	/*
+	 * make_request() can abort the operation when read-ahead is being
+	 * used and no empty request is available.
+	 *
+	 */
+	r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
+
+	r1_bio->master_bio = bio;
+	r1_bio->sectors = bio_sectors(bio);
+	r1_bio->state = 0;
+	r1_bio->mddev = mddev;
+	r1_bio->sector = bio->bi_iter.bi_sector;
+
+	/* We might need to issue multiple reads to different
+	 * devices if there are bad blocks around, so we keep
+	 * track of the number of reads in bio->bi_phys_segments.
+	 * If this is 0, there is only one r1_bio and no locking
+	 * will be needed when requests complete.  If it is
+	 * non-zero, then it is the number of not-completed requests.
+	 */
+	bio->bi_phys_segments = 0;
+	bio_clear_flag(bio, BIO_SEG_VALID);
+
+	if (bio_data_dir(bio) == READ) {
+		raid1_read_request(mddev, bio, r1_bio);
+		return;
+	}
+	raid1_write_request(mddev, bio, r1_bio, start_next_window);
+}
+
 static void raid1_status(struct seq_file *seq, struct mddev *mddev)
 {
 	struct r1conf *conf = mddev->private;
-- 
2.10.2


^ permalink raw reply related

* [PATCH 2/2] md/raid10: Refactor raid10_make_request
From: Robert LeBlanc @ 2016-11-30 21:20 UTC (permalink / raw)
  To: linux-raid; +Cc: Robert LeBlanc
In-Reply-To: <20161130212020.15762-1-robert@leblancnet.us>

Refactor raid10_make_request into seperate read and write functions to
clean up the code.

Signed-off-by: Robert LeBlanc <robert@leblancnet.us>
---
 drivers/md/raid10.c | 271 +++++++++++++++++++++++++++-------------------------
 1 file changed, 139 insertions(+), 132 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index be1a9fc..d3bd756 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1046,150 +1046,89 @@ static void raid10_unplug(struct blk_plug_cb *cb, bool from_schedule)
 	kfree(plug);
 }
 
-static void __make_request(struct mddev *mddev, struct bio *bio)
+static void raid10_read_request(struct mddev *mddev, struct bio *bio,
+				struct r10bio *r10_bio)
 {
 	struct r10conf *conf = mddev->private;
-	struct r10bio *r10_bio;
 	struct bio *read_bio;
-	int i;
 	const int op = bio_op(bio);
-	const int rw = bio_data_dir(bio);
 	const unsigned long do_sync = (bio->bi_opf & REQ_SYNC);
-	const unsigned long do_fua = (bio->bi_opf & REQ_FUA);
-	unsigned long flags;
-	struct md_rdev *blocked_rdev;
-	struct blk_plug_cb *cb;
-	struct raid10_plug_cb *plug = NULL;
 	int sectors_handled;
 	int max_sectors;
-	int sectors;
-
-	md_write_start(mddev, bio);
-
-	/*
-	 * Register the new request and wait if the reconstruction
-	 * thread has put up a bar for new requests.
-	 * Continue immediately if no resync is active currently.
-	 */
-	wait_barrier(conf);
-
-	sectors = bio_sectors(bio);
-	while (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
-	    bio->bi_iter.bi_sector < conf->reshape_progress &&
-	    bio->bi_iter.bi_sector + sectors > conf->reshape_progress) {
-		/* IO spans the reshape position.  Need to wait for
-		 * reshape to pass
-		 */
-		allow_barrier(conf);
-		wait_event(conf->wait_barrier,
-			   conf->reshape_progress <= bio->bi_iter.bi_sector ||
-			   conf->reshape_progress >= bio->bi_iter.bi_sector +
-			   sectors);
-		wait_barrier(conf);
-	}
-	if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
-	    bio_data_dir(bio) == WRITE &&
-	    (mddev->reshape_backwards
-	     ? (bio->bi_iter.bi_sector < conf->reshape_safe &&
-		bio->bi_iter.bi_sector + sectors > conf->reshape_progress)
-	     : (bio->bi_iter.bi_sector + sectors > conf->reshape_safe &&
-		bio->bi_iter.bi_sector < conf->reshape_progress))) {
-		/* Need to update reshape_position in metadata */
-		mddev->reshape_position = conf->reshape_progress;
-		set_mask_bits(&mddev->flags, 0,
-			      BIT(MD_CHANGE_DEVS) | BIT(MD_CHANGE_PENDING));
-		md_wakeup_thread(mddev->thread);
-		wait_event(mddev->sb_wait,
-			   !test_bit(MD_CHANGE_PENDING, &mddev->flags));
+	struct md_rdev *rdev;
+	int slot;
 
-		conf->reshape_safe = mddev->reshape_position;
+read_again:
+	rdev = read_balance(conf, r10_bio, &max_sectors);
+	if (!rdev) {
+		raid_end_bio_io(r10_bio);
+		return;
 	}
+	slot = r10_bio->read_slot;
 
-	r10_bio = mempool_alloc(conf->r10bio_pool, GFP_NOIO);
-
-	r10_bio->master_bio = bio;
-	r10_bio->sectors = sectors;
+	read_bio = bio_clone_mddev(bio, GFP_NOIO, mddev);
+	bio_trim(read_bio, r10_bio->sector - bio->bi_iter.bi_sector,
+		 max_sectors);
 
-	r10_bio->mddev = mddev;
-	r10_bio->sector = bio->bi_iter.bi_sector;
-	r10_bio->state = 0;
+	r10_bio->devs[slot].bio = read_bio;
+	r10_bio->devs[slot].rdev = rdev;
 
-	/* We might need to issue multiple reads to different
-	 * devices if there are bad blocks around, so we keep
-	 * track of the number of reads in bio->bi_phys_segments.
-	 * If this is 0, there is only one r10_bio and no locking
-	 * will be needed when the request completes.  If it is
-	 * non-zero, then it is the number of not-completed requests.
-	 */
-	bio->bi_phys_segments = 0;
-	bio_clear_flag(bio, BIO_SEG_VALID);
+	read_bio->bi_iter.bi_sector = r10_bio->devs[slot].addr +
+		choose_data_offset(r10_bio, rdev);
+	read_bio->bi_bdev = rdev->bdev;
+	read_bio->bi_end_io = raid10_end_read_request;
+	bio_set_op_attrs(read_bio, op, do_sync);
+	read_bio->bi_private = r10_bio;
 
-	if (rw == READ) {
-		/*
-		 * read balancing logic:
+	if (max_sectors < r10_bio->sectors) {
+		/* Could not read all from this device, so we will
+		 * need another r10_bio.
 		 */
-		struct md_rdev *rdev;
-		int slot;
-
-read_again:
-		rdev = read_balance(conf, r10_bio, &max_sectors);
-		if (!rdev) {
-			raid_end_bio_io(r10_bio);
-			return;
-		}
-		slot = r10_bio->read_slot;
-
-		read_bio = bio_clone_mddev(bio, GFP_NOIO, mddev);
-		bio_trim(read_bio, r10_bio->sector - bio->bi_iter.bi_sector,
-			 max_sectors);
-
-		r10_bio->devs[slot].bio = read_bio;
-		r10_bio->devs[slot].rdev = rdev;
-
-		read_bio->bi_iter.bi_sector = r10_bio->devs[slot].addr +
-			choose_data_offset(r10_bio, rdev);
-		read_bio->bi_bdev = rdev->bdev;
-		read_bio->bi_end_io = raid10_end_read_request;
-		bio_set_op_attrs(read_bio, op, do_sync);
-		read_bio->bi_private = r10_bio;
+		sectors_handled = (r10_bio->sector + max_sectors
+				   - bio->bi_iter.bi_sector);
+		r10_bio->sectors = max_sectors;
+		spin_lock_irq(&conf->device_lock);
+		if (bio->bi_phys_segments == 0)
+			bio->bi_phys_segments = 2;
+		else
+			bio->bi_phys_segments++;
+		spin_unlock_irq(&conf->device_lock);
+		/* Cannot call generic_make_request directly
+		 * as that will be queued in __generic_make_request
+		 * and subsequent mempool_alloc might block
+		 * waiting for it.  so hand bio over to raid10d.
+		 */
+		reschedule_retry(r10_bio);
 
-		if (max_sectors < r10_bio->sectors) {
-			/* Could not read all from this device, so we will
-			 * need another r10_bio.
-			 */
-			sectors_handled = (r10_bio->sector + max_sectors
-					   - bio->bi_iter.bi_sector);
-			r10_bio->sectors = max_sectors;
-			spin_lock_irq(&conf->device_lock);
-			if (bio->bi_phys_segments == 0)
-				bio->bi_phys_segments = 2;
-			else
-				bio->bi_phys_segments++;
-			spin_unlock_irq(&conf->device_lock);
-			/* Cannot call generic_make_request directly
-			 * as that will be queued in __generic_make_request
-			 * and subsequent mempool_alloc might block
-			 * waiting for it.  so hand bio over to raid10d.
-			 */
-			reschedule_retry(r10_bio);
+		r10_bio = mempool_alloc(conf->r10bio_pool, GFP_NOIO);
 
-			r10_bio = mempool_alloc(conf->r10bio_pool, GFP_NOIO);
+		r10_bio->master_bio = bio;
+		r10_bio->sectors = bio_sectors(bio) - sectors_handled;
+		r10_bio->state = 0;
+		r10_bio->mddev = mddev;
+		r10_bio->sector = bio->bi_iter.bi_sector + sectors_handled;
+		goto read_again;
+	} else
+		generic_make_request(read_bio);
+	return;
+}
 
-			r10_bio->master_bio = bio;
-			r10_bio->sectors = bio_sectors(bio) - sectors_handled;
-			r10_bio->state = 0;
-			r10_bio->mddev = mddev;
-			r10_bio->sector = bio->bi_iter.bi_sector +
-				sectors_handled;
-			goto read_again;
-		} else
-			generic_make_request(read_bio);
-		return;
-	}
+static void raid10_write_request(struct mddev *mddev, struct bio *bio,
+				 struct r10bio *r10_bio)
+{
+	struct r10conf *conf = mddev->private;
+	int i;
+	const int op = bio_op(bio);
+	const unsigned long do_sync = (bio->bi_opf & REQ_SYNC);
+	const unsigned long do_fua = (bio->bi_opf & REQ_FUA);
+	unsigned long flags;
+	struct md_rdev *blocked_rdev;
+	struct blk_plug_cb *cb;
+	struct raid10_plug_cb *plug = NULL;
+	int sectors_handled;
+	int max_sectors;
+	md_write_start(mddev, bio);
 
-	/*
-	 * WRITE:
-	 */
 	if (conf->pending_count >= max_queued_requests) {
 		md_wakeup_thread(mddev->thread);
 		wait_event(conf->wait_barrier,
@@ -1249,8 +1188,7 @@ retry_write:
 			int bad_sectors;
 			int is_bad;
 
-			is_bad = is_badblock(rdev, dev_sector,
-					     max_sectors,
+			is_bad = is_badblock(rdev, dev_sector, max_sectors,
 					     &first_bad, &bad_sectors);
 			if (is_bad < 0) {
 				/* Mustn't write here until the bad block
@@ -1353,8 +1291,7 @@ retry_write:
 			r10_bio->devs[i].bio = mbio;
 
 			mbio->bi_iter.bi_sector	= (r10_bio->devs[i].addr+
-					   choose_data_offset(r10_bio,
-							      rdev));
+					   choose_data_offset(r10_bio, rdev));
 			mbio->bi_bdev = rdev->bdev;
 			mbio->bi_end_io	= raid10_end_write_request;
 			bio_set_op_attrs(mbio, op, do_sync | do_fua);
@@ -1395,8 +1332,7 @@ retry_write:
 			r10_bio->devs[i].repl_bio = mbio;
 
 			mbio->bi_iter.bi_sector	= (r10_bio->devs[i].addr +
-					   choose_data_offset(
-						   r10_bio, rdev));
+					   choose_data_offset(r10_bio, rdev));
 			mbio->bi_bdev = rdev->bdev;
 			mbio->bi_end_io	= raid10_end_write_request;
 			bio_set_op_attrs(mbio, op, do_sync | do_fua);
@@ -1434,6 +1370,77 @@ retry_write:
 	one_write_done(r10_bio);
 }
 
+static void __make_request(struct mddev *mddev, struct bio *bio)
+{
+	struct r10conf *conf = mddev->private;
+	struct r10bio *r10_bio;
+	int sectors;
+
+	/*
+	 * Register the new request and wait if the reconstruction
+	 * thread has put up a bar for new requests.
+	 * Continue immediately if no resync is active currently.
+	 */
+	wait_barrier(conf);
+
+	sectors = bio_sectors(bio);
+	while (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
+	    bio->bi_iter.bi_sector < conf->reshape_progress &&
+	    bio->bi_iter.bi_sector + sectors > conf->reshape_progress) {
+		/* IO spans the reshape position.  Need to wait for
+		 * reshape to pass
+		 */
+		allow_barrier(conf);
+		wait_event(conf->wait_barrier,
+			   conf->reshape_progress <= bio->bi_iter.bi_sector ||
+			   conf->reshape_progress >= bio->bi_iter.bi_sector +
+			   sectors);
+		wait_barrier(conf);
+	}
+	if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
+	    bio_data_dir(bio) == WRITE &&
+	    (mddev->reshape_backwards
+	     ? (bio->bi_iter.bi_sector < conf->reshape_safe &&
+		bio->bi_iter.bi_sector + sectors > conf->reshape_progress)
+	     : (bio->bi_iter.bi_sector + sectors > conf->reshape_safe &&
+		bio->bi_iter.bi_sector < conf->reshape_progress))) {
+		/* Need to update reshape_position in metadata */
+		mddev->reshape_position = conf->reshape_progress;
+		set_mask_bits(&mddev->flags, 0,
+			      BIT(MD_CHANGE_DEVS) | BIT(MD_CHANGE_PENDING));
+		md_wakeup_thread(mddev->thread);
+		wait_event(mddev->sb_wait,
+			   !test_bit(MD_CHANGE_PENDING, &mddev->flags));
+
+		conf->reshape_safe = mddev->reshape_position;
+	}
+
+	r10_bio = mempool_alloc(conf->r10bio_pool, GFP_NOIO);
+
+	r10_bio->master_bio = bio;
+	r10_bio->sectors = sectors;
+
+	r10_bio->mddev = mddev;
+	r10_bio->sector = bio->bi_iter.bi_sector;
+	r10_bio->state = 0;
+
+	/* We might need to issue multiple reads to different
+	 * devices if there are bad blocks around, so we keep
+	 * track of the number of reads in bio->bi_phys_segments.
+	 * If this is 0, there is only one r10_bio and no locking
+	 * will be needed when the request completes.  If it is
+	 * non-zero, then it is the number of not-completed requests.
+	 */
+	bio->bi_phys_segments = 0;
+	bio_clear_flag(bio, BIO_SEG_VALID);
+
+	if (bio_data_dir(bio) == READ) {
+		raid10_read_request(mddev, bio, r10_bio);
+		return;
+	}
+	raid10_write_request(mddev, bio, r10_bio);
+}
+
 static void raid10_make_request(struct mddev *mddev, struct bio *bio)
 {
 	struct r10conf *conf = mddev->private;
-- 
2.10.2


^ permalink raw reply related

* [PATCH 1/2] md: disable WRITE SAME if it fails for linear/raid0
From: Shaohua Li @ 2016-12-01  0:39 UTC (permalink / raw)
  To: linux-raid; +Cc: sitsofe, neilb

This makes md do the same thing as dm for write same IO failure. Please
see 7eee4ae(dm: disable WRITE SAME if it fails) for details why we need
this.

Also reported here: https://bugzilla.kernel.org/show_bug.cgi?id=118581

Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/linear.c |  2 ++
 drivers/md/md.c     | 42 ++++++++++++++++++++++++++++++++++++++++++
 drivers/md/md.h     |  2 ++
 drivers/md/raid0.c  |  2 ++
 4 files changed, 48 insertions(+)

diff --git a/drivers/md/linear.c b/drivers/md/linear.c
index 5975c99..d3c7b4d 100644
--- a/drivers/md/linear.c
+++ b/drivers/md/linear.c
@@ -262,6 +262,8 @@ static void linear_make_request(struct mddev *mddev, struct bio *bio)
 				trace_block_bio_remap(bdev_get_queue(split->bi_bdev),
 						      split, disk_devt(mddev->gendisk),
 						      bio_sector);
+			if (bio_op(split) == REQ_OP_WRITE_SAME)
+				md_writesame_setup(mddev, split);
 			generic_make_request(split);
 		}
 	} while (split != bio);
diff --git a/drivers/md/md.c b/drivers/md/md.c
index c7894fb..5e6efcd 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -312,6 +312,48 @@ static blk_qc_t md_make_request(struct request_queue *q, struct bio *bio)
 	return BLK_QC_T_NONE;
 }
 
+struct md_writesame_data {
+	bio_end_io_t *orig_endio;
+	void *orig_private;
+	struct mddev *mddev;
+};
+
+static void md_writesame_endio(struct bio *bio)
+{
+	struct md_writesame_data *data = bio->bi_private;
+
+	if (bio->bi_error == -EREMOTEIO &&
+	    !bdev_get_queue(bio->bi_bdev)->limits.max_write_same_sectors)
+		data->mddev->queue->limits.max_write_same_sectors = 0;
+
+	bio->bi_private = data->orig_private;
+	bio->bi_end_io = data->orig_endio;
+	bio_endio(bio);
+
+	kfree(data);
+}
+
+void md_writesame_setup(struct mddev *mddev, struct bio *bio)
+{
+	struct md_writesame_data *data;
+
+	/*
+	 * this failure means we ignore a chance to handle writesame failure,
+	 * which isn't critcal, we can handle the failure if new writesame IO
+	 * comes
+	 */
+	data = kmalloc(sizeof(*data), GFP_NOIO | __GFP_NORETRY);
+	if (!data)
+		return;
+	data->orig_endio = bio->bi_end_io;
+	data->orig_private = bio->bi_private;
+	data->mddev = mddev;
+
+	bio->bi_private = data;
+	bio->bi_end_io = md_writesame_endio;
+}
+EXPORT_SYMBOL_GPL(md_writesame_setup);
+
 /* mddev_suspend makes sure no new requests are submitted
  * to the device, and that any requests that have been submitted
  * are completely handled.
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 5c08f84..2d1556b 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -700,4 +700,6 @@ static inline int mddev_is_clustered(struct mddev *mddev)
 {
 	return mddev->cluster_info && mddev->bitmap_info.nodes > 1;
 }
+
+extern void md_writesame_setup(struct mddev *mddev, struct bio *bio);
 #endif /* _MD_MD_H */
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index e628f18..4811116 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -498,6 +498,8 @@ static void raid0_make_request(struct mddev *mddev, struct bio *bio)
 				trace_block_bio_remap(bdev_get_queue(split->bi_bdev),
 						      split, disk_devt(mddev->gendisk),
 						      bio_sector);
+			if (bio_op(split) == REQ_OP_WRITE_SAME)
+				md_writesame_setup(mddev, split);
 			generic_make_request(split);
 		}
 	} while (split != bio);
-- 
2.9.3


^ permalink raw reply related

* [PATCH 2/2] md/multipath: disable WRITE SAME if it fails for multipath
From: Shaohua Li @ 2016-12-01  0:39 UTC (permalink / raw)
  To: linux-raid; +Cc: sitsofe, neilb
In-Reply-To: <35d7516cdfdcaa734e5b8cc90a8dbac8e3d201e0.1480552575.git.shli@fb.com>

This is the part for multipath. Since multipatch already attaches
private data into original bio, we just disable write same there.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/multipath.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/md/multipath.c b/drivers/md/multipath.c
index 589b807..42ddabc 100644
--- a/drivers/md/multipath.c
+++ b/drivers/md/multipath.c
@@ -89,6 +89,10 @@ static void multipath_end_request(struct bio *bio)
 	struct mpconf *conf = mp_bh->mddev->private;
 	struct md_rdev *rdev = conf->multipaths[mp_bh->path].rdev;
 
+	if (bio_op(bio) == REQ_OP_WRITE_SAME && bio->bi_error == -EREMOTEIO &&
+	    !bdev_get_queue(bio->bi_bdev)->limits.max_write_same_sectors)
+		mp_bh->mddev->queue->limits.max_write_same_sectors = 0;
+
 	if (!bio->bi_error)
 		multipath_end_bh_io(mp_bh, 0);
 	else if (!(bio->bi_opf & REQ_RAHEAD)) {
-- 
2.9.3


^ permalink raw reply related

* [PATCH v2] md/r5cache: run_no_space_stripes() when R5C_LOG_CRITICAL == 0
From: Song Liu @ 2016-12-01  0:57 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb, shli, kernel-team, dan.j.williams, hch, Song Liu

With writeback cache, we define log space critical as

   free_space < 2 * reclaim_required_space

So the deassert of R5C_LOG_CRITICAL could happen when
  1. free_space increases
  2. reclaim_required_space decreases

Currently, run_no_space_stripes() is called when 1 happens, but
not (always) when 2 happens.

With this patch, run_no_space_stripes() is call when
R5C_LOG_CRITICAL is cleared.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/raid5-cache.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index e786d4e..c36f86b 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -370,6 +370,7 @@ static inline void r5c_update_log_state(struct r5l_log *log)
 	struct r5conf *conf = log->rdev->mddev->private;
 	sector_t free_space;
 	sector_t reclaim_space;
+	bool wake_reclaim = false;
 
 	if (!r5c_is_writeback(log))
 		return;
@@ -379,12 +380,18 @@ static inline void r5c_update_log_state(struct r5l_log *log)
 	reclaim_space = r5c_log_required_to_flush_cache(conf);
 	if (free_space < 2 * reclaim_space)
 		set_bit(R5C_LOG_CRITICAL, &conf->cache_state);
-	else
+	else {
+		if (test_bit(R5C_LOG_CRITICAL, &conf->cache_state))
+			wake_reclaim = true;
 		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);
+	}
 	if (free_space < 3 * reclaim_space)
 		set_bit(R5C_LOG_TIGHT, &conf->cache_state);
 	else
 		clear_bit(R5C_LOG_TIGHT, &conf->cache_state);
+
+	if (wake_reclaim)
+		r5l_wake_reclaim(log, 0);
 }
 
 /*
@@ -1345,6 +1352,10 @@ static void r5c_do_reclaim(struct r5conf *conf)
 		spin_unlock(&conf->device_lock);
 		spin_unlock_irqrestore(&log->stripe_in_journal_lock, flags);
 	}
+
+	if (!test_bit(R5C_LOG_CRITICAL, &conf->cache_state))
+		r5l_run_no_space_stripes(log);
+
 	md_wakeup_thread(conf->mddev->thread);
 }
 
@@ -2401,6 +2412,7 @@ void r5c_finish_stripe_write_out(struct r5conf *conf,
 	spin_unlock_irq(&conf->log->stripe_in_journal_lock);
 	sh->log_start = MaxSector;
 	atomic_dec(&conf->log->stripe_in_journal_count);
+	r5c_update_log_state(conf->log);
 }
 
 int
-- 
2.9.3


^ permalink raw reply related

* Re: MD Remnants After --stop
From: NeilBrown @ 2016-12-01  2:52 UTC (permalink / raw)
  To: Marc Smith; +Cc: linux-raid
In-Reply-To: <CAHkw+Lc9YEdbJcRu2bOLU-gex9UYSXB32iRKD8TPS7JpjSOdQA@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 1311 bytes --]

On Mon, Nov 28 2016, Marc Smith wrote:

>
> # find /sys/block/md127/md
> /sys/block/md127/md
> /sys/block/md127/md/reshape_position
> /sys/block/md127/md/layout
> /sys/block/md127/md/raid_disks
> /sys/block/md127/md/bitmap
> /sys/block/md127/md/bitmap/chunksize

This tells me that:
	sysfs_remove_group(&mddev->kobj, &md_bitmap_group);
hasn't been run, so mddev_delayed_delete() hasn't run.
That suggests the final mddev_put() hsn't run. i.e. mddev->active is > 0

Everything else suggests that array has been stopped and cleaned and
should be gone...

This seems to suggest that there is an unbalanced mddev_get() without a
matching mddev_put().  I cannot find it though.

If I could reproduce it, I would try to see what is happening by:

- putting
   printk("mddev->active = %d\n", atomic_read(&mddev->active));
 in the top of mddev_put().  That shouldn't be *too* noisy.

- putting
   printk("rd=%d empty=%d ctime=%d hold=%d\n", mddev->raid_disks,
   	list_empty(&mddev->disks), mddev->ctime, mddev->hold_active);

  in mddev_put() just before those values are tested.

 - putting
     printk("queue_work\n");
   just before the 'queue_work()' call in mddev_put.

 - putting
     printk("mddev_delayed_delete\n");
   in mddev_delayed_delete()

Then see what gets printed when you stop the array.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: Testing RAID 1 redundancy
From: Arka Sharma @ 2016-12-01 11:19 UTC (permalink / raw)
  To: linux-raid
In-Reply-To: <CAPO=kN2kcPA++DXw3H=S0+PrcVhJptBqAhX39uY7ZY-Q0VaJHQ@mail.gmail.com>

I have done the following steps

1. Created a raid 1 with two PCIe SSD's say /dev/ssd1 /dev/ssd2 in
mdadm with 1.2
2. Created ext4
3. Mounted and created a file
4. Poweroff the system
5. Disconnect one PCIe SSD
6. Boot the system and mdadm --detail shows the status of array as FAILED
7. Stopped the array
8. Did mdadm --assemble --force /dev/md127 /dev/ssd1
9. Mounted the array and the file I created was intact
10. Issued mdadm --detail and status is reported as degraded

For the second case created raid 1 with ddf
Till step 7 everything was same as in with superblock 1.2
But when issued mdadm --assemble --force /dev/md126 /dev/ssd1 it shows
that /dev/ssd1 is busy skipping.
Checked in dmesg nothing is captured there

Is there any mistake in the procedure ?

Regards,
Arka

On Mon, Nov 28, 2016 at 2:50 PM, Arka Sharma <arka.sw1988@gmail.com> wrote:
> Hello,
>
> I want test a redundancy scenario with RAID 1. I have crated an array
> with mdadm and formatted ext4 and after mounting I have written a text
> file. Now I want to switch off one device, and with another remaining
> device I expect to view the text file. When I issue mdadm --detail it
> shows the State as active,FAILED,Not Started. But in dmesg I can't
> find any error message from md related to one of the physical disk
> missing. I was searching about this and I found some references of
> creating degraded RAID 1 with missing parameter in mdadm, but what I
> want to simulate is that let's say one of the disk has gone bad and
> not yet being detected under /dev and in this case we want to verify
> that if data is intact.
>
> Regards,
> Arka

^ permalink raw reply

* Re: [PATCH 1/2] md/raid1: Refactor raid1_make_request
From: John Stoffel @ 2016-12-01 16:04 UTC (permalink / raw)
  To: Robert LeBlanc; +Cc: linux-raid
In-Reply-To: <20161130212020.15762-2-robert@leblancnet.us>

>>>>> "Robert" == Robert LeBlanc <robert@leblancnet.us> writes:

Robert> Refactor raid1_make_request to make read and write code in their own
Robert> functions to clean up the code.

Robert> Signed-off-by: Robert LeBlanc <robert@leblancnet.us>
Robert> ---
Robert>  drivers/md/raid1.c | 244 +++++++++++++++++++++++++++--------------------------
Robert>  1 file changed, 126 insertions(+), 118 deletions(-)

Robert> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
Robert> index 21dc00e..5aa0f89 100644
Robert> --- a/drivers/md/raid1.c
Robert> +++ b/drivers/md/raid1.c
Robert> @@ -1032,17 +1032,96 @@ static void raid1_unplug(struct blk_plug_cb *cb, bool from_schedule)
Robert>  	kfree(plug);
Robert>  }
 
Robert> -static void raid1_make_request(struct mddev *mddev, struct bio * bio)
Robert> +static void raid1_read_request(struct mddev *mddev, struct bio *bio,
Robert> +				 struct r1bio *r1_bio)
Robert>  {
Robert>  	struct r1conf *conf = mddev->private;
Robert>  	struct raid1_info *mirror;
Robert> -	struct r1bio *r1_bio;
Robert>  	struct bio *read_bio;
Robert> +	struct bitmap *bitmap = mddev->bitmap;
Robert> +	const int op = bio_op(bio);
Robert> +	const unsigned long do_sync = (bio->bi_opf & REQ_SYNC);
Robert> +	int sectors_handled;
Robert> +	int max_sectors;
Robert> +	int rdisk;
Robert> +
Robert> +read_again:
Robert> +	rdisk = read_balance(conf, r1_bio, &max_sectors);
Robert> +
Robert> +	if (rdisk < 0) {
Robert> +		/* couldn't find anywhere to read from */
Robert> +		raid_end_bio_io(r1_bio);
Robert> +		return;
Robert> +	}
Robert> +	mirror = conf->mirrors + rdisk;
Robert> +
Robert> +	if (test_bit(WriteMostly, &mirror->rdev->flags) &&
Robert> +	    bitmap) {
Robert> +		/* Reading from a write-mostly device must
Robert> +		 * take care not to over-take any writes
Robert> +		 * that are 'behind'
Robert> +		 */
Robert> +		wait_event(bitmap->behind_wait,
Robert> +			   atomic_read(&bitmap->behind_writes) == 0);
Robert> +	}
Robert> +	r1_bio->read_disk = rdisk;
Robert> +	r1_bio->start_next_window = 0;
Robert> +
Robert> +	read_bio = bio_clone_mddev(bio, GFP_NOIO, mddev);
Robert> +	bio_trim(read_bio, r1_bio->sector - bio->bi_iter.bi_sector,
Robert> +		 max_sectors);
Robert> +
Robert> +	r1_bio->bios[rdisk] = read_bio;
Robert> +
Robert> +	read_bio->bi_iter.bi_sector = r1_bio->sector +
Robert> +		mirror->rdev->data_offset;
Robert> +	read_bio->bi_bdev = mirror->rdev->bdev;
Robert> +	read_bio->bi_end_io = raid1_end_read_request;
Robert> +	bio_set_op_attrs(read_bio, op, do_sync);
Robert> +	read_bio->bi_private = r1_bio;
Robert> +
Robert> +	if (max_sectors < r1_bio->sectors) {
Robert> +		/* could not read all from this device, so we will
Robert> +		 * need another r1_bio.
Robert> +		 */
Robert> +
Robert> +		sectors_handled = (r1_bio->sector + max_sectors
Robert> +				   - bio->bi_iter.bi_sector);
Robert> +		r1_bio->sectors = max_sectors;
Robert> +		spin_lock_irq(&conf->device_lock);
Robert> +		if (bio->bi_phys_segments == 0)
Robert> +			bio->bi_phys_segments = 2;
Robert> +		else
Robert> +			bio->bi_phys_segments++;
Robert> +		spin_unlock_irq(&conf->device_lock);
Robert> +		/* Cannot call generic_make_request directly
Robert> +		 * as that will be queued in __make_request
Robert> +		 * and subsequent mempool_alloc might block waiting
Robert> +		 * for it.  So hand bio over to raid1d.
Robert> +		 */
Robert> +		reschedule_retry(r1_bio);
Robert> +
Robert> +		r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
Robert> +
Robert> +		r1_bio->master_bio = bio;
Robert> +		r1_bio->sectors = bio_sectors(bio) - sectors_handled;
Robert> +		r1_bio->state = 0;
Robert> +		r1_bio->mddev = mddev;
Robert> +		r1_bio->sector = bio->bi_iter.bi_sector + sectors_handled;
Robert> +		goto read_again;
Robert> +	} else
Robert> +		generic_make_request(read_bio);
Robert> +}
Robert> +
Robert> +static void raid1_write_request(struct mddev *mddev, struct bio *bio,
Robert> +				struct r1bio *r1_bio,
Robert> +				sector_t start_next_window)
Robert> +{
Robert> +	struct r1conf *conf = mddev->private;
Robert>  	int i, disks;
Robert> -	struct bitmap *bitmap;
Robert> +	struct bitmap *bitmap = mddev->bitmap;
Robert>  	unsigned long flags;
Robert>  	const int op = bio_op(bio);
Robert> -	const int rw = bio_data_dir(bio);
Robert>  	const unsigned long do_sync = (bio->bi_opf & REQ_SYNC);
Robert>  	const unsigned long do_flush_fua = (bio->bi_opf &
Robert>  						(REQ_PREFLUSH | REQ_FUA));
Robert> @@ -1052,7 +1131,6 @@ static void raid1_make_request(struct mddev *mddev, struct bio * bio)
Robert>  	int first_clone;
Robert>  	int sectors_handled;
Robert>  	int max_sectors;
Robert> -	sector_t start_next_window;
 
Robert>  	/*
Robert>  	 * Register the new request and wait if the reconstruction
Robert> @@ -1062,12 +1140,11 @@ static void raid1_make_request(struct mddev *mddev, struct bio * bio)
 
Robert>  	md_write_start(mddev, bio); /* wait on superblock update early */
 
Robert> -	if (bio_data_dir(bio) == WRITE &&
Robert> -	    ((bio_end_sector(bio) > mddev->suspend_lo &&
Robert> +	if ((bio_end_sector(bio) > mddev->suspend_lo &&
bio-> bi_iter.bi_sector < mddev->suspend_hi) ||
Robert>  	    (mddev_is_clustered(mddev) &&
md_cluster_ops-> area_resyncing(mddev, WRITE,
Robert> -		     bio->bi_iter.bi_sector, bio_end_sector(bio))))) {
Robert> +		     bio->bi_iter.bi_sector, bio_end_sector(bio)))) {
Robert>  		/* As the suspend_* range is controlled by
Robert>  		 * userspace, we want an interruptible
Robert>  		 * wait.
Robert> @@ -1081,119 +1158,14 @@ static void raid1_make_request(struct mddev *mddev, struct bio * bio)
bio-> bi_iter.bi_sector >= mddev->suspend_hi ||
Robert>  			    (mddev_is_clustered(mddev) &&
Robert>  			     !md_cluster_ops->area_resyncing(mddev, WRITE,
Robert> -				     bio->bi_iter.bi_sector, bio_end_sector(bio))))
Robert> +				     bio->bi_iter.bi_sector,
Robert> +				     bio_end_sector(bio))))
Robert>  				break;
Robert>  			schedule();
Robert>  		}
Robert>  		finish_wait(&conf->wait_barrier, &w);
Robert>  	}
 
Robert> -	start_next_window = wait_barrier(conf, bio);
Robert> -
Robert> -	bitmap = mddev->bitmap;
Robert> -
Robert> -	/*
Robert> -	 * make_request() can abort the operation when read-ahead is being
Robert> -	 * used and no empty request is available.
Robert> -	 *
Robert> -	 */
Robert> -	r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
Robert> -
Robert> -	r1_bio->master_bio = bio;
Robert> -	r1_bio->sectors = bio_sectors(bio);
Robert> -	r1_bio->state = 0;
Robert> -	r1_bio->mddev = mddev;
Robert> -	r1_bio->sector = bio->bi_iter.bi_sector;
Robert> -
Robert> -	/* We might need to issue multiple reads to different
Robert> -	 * devices if there are bad blocks around, so we keep
Robert> -	 * track of the number of reads in bio->bi_phys_segments.
Robert> -	 * If this is 0, there is only one r1_bio and no locking
Robert> -	 * will be needed when requests complete.  If it is
Robert> -	 * non-zero, then it is the number of not-completed requests.
Robert> -	 */
Robert> -	bio->bi_phys_segments = 0;
Robert> -	bio_clear_flag(bio, BIO_SEG_VALID);
Robert> -
Robert> -	if (rw == READ) {
Robert> -		/*
Robert> -		 * read balancing logic:
Robert> -		 */
Robert> -		int rdisk;
Robert> -
Robert> -read_again:
Robert> -		rdisk = read_balance(conf, r1_bio, &max_sectors);
Robert> -
Robert> -		if (rdisk < 0) {
Robert> -			/* couldn't find anywhere to read from */
Robert> -			raid_end_bio_io(r1_bio);
Robert> -			return;
Robert> -		}
Robert> -		mirror = conf->mirrors + rdisk;
Robert> -
Robert> -		if (test_bit(WriteMostly, &mirror->rdev->flags) &&
Robert> -		    bitmap) {
Robert> -			/* Reading from a write-mostly device must
Robert> -			 * take care not to over-take any writes
Robert> -			 * that are 'behind'
Robert> -			 */
Robert> -			wait_event(bitmap->behind_wait,
Robert> -				   atomic_read(&bitmap->behind_writes) == 0);
Robert> -		}
Robert> -		r1_bio->read_disk = rdisk;
Robert> -		r1_bio->start_next_window = 0;
Robert> -
Robert> -		read_bio = bio_clone_mddev(bio, GFP_NOIO, mddev);
Robert> -		bio_trim(read_bio, r1_bio->sector - bio->bi_iter.bi_sector,
Robert> -			 max_sectors);
Robert> -
Robert> -		r1_bio->bios[rdisk] = read_bio;
Robert> -
Robert> -		read_bio->bi_iter.bi_sector = r1_bio->sector +
Robert> -			mirror->rdev->data_offset;
Robert> -		read_bio->bi_bdev = mirror->rdev->bdev;
Robert> -		read_bio->bi_end_io = raid1_end_read_request;
Robert> -		bio_set_op_attrs(read_bio, op, do_sync);
Robert> -		read_bio->bi_private = r1_bio;
Robert> -
Robert> -		if (max_sectors < r1_bio->sectors) {
Robert> -			/* could not read all from this device, so we will
Robert> -			 * need another r1_bio.
Robert> -			 */
Robert> -
Robert> -			sectors_handled = (r1_bio->sector + max_sectors
Robert> -					   - bio->bi_iter.bi_sector);
Robert> -			r1_bio->sectors = max_sectors;
Robert> -			spin_lock_irq(&conf->device_lock);
Robert> -			if (bio->bi_phys_segments == 0)
Robert> -				bio->bi_phys_segments = 2;
Robert> -			else
Robert> -				bio->bi_phys_segments++;
Robert> -			spin_unlock_irq(&conf->device_lock);
Robert> -			/* Cannot call generic_make_request directly
Robert> -			 * as that will be queued in __make_request
Robert> -			 * and subsequent mempool_alloc might block waiting
Robert> -			 * for it.  So hand bio over to raid1d.
Robert> -			 */
Robert> -			reschedule_retry(r1_bio);
Robert> -
Robert> -			r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
Robert> -
Robert> -			r1_bio->master_bio = bio;
Robert> -			r1_bio->sectors = bio_sectors(bio) - sectors_handled;
Robert> -			r1_bio->state = 0;
Robert> -			r1_bio->mddev = mddev;
Robert> -			r1_bio->sector = bio->bi_iter.bi_sector +
Robert> -				sectors_handled;
Robert> -			goto read_again;
Robert> -		} else
Robert> -			generic_make_request(read_bio);
Robert> -		return;
Robert> -	}
Robert> -
Robert> -	/*
Robert> -	 * WRITE:
Robert> -	 */
Robert>  	if (conf->pending_count >= max_queued_requests) {
Robert>  		md_wakeup_thread(mddev->thread);
Robert>  		wait_event(conf->wait_barrier,
Robert> @@ -1236,8 +1208,7 @@ read_again:
Robert>  			int bad_sectors;
Robert>  			int is_bad;
 
Robert> -			is_bad = is_badblock(rdev, r1_bio->sector,
Robert> -					     max_sectors,
Robert> +			is_bad = is_badblock(rdev, r1_bio->sector, max_sectors,
Robert>  					     &first_bad, &bad_sectors);
Robert>  			if (is_bad < 0) {
Robert>  				/* mustn't write here until the bad block is
Robert> @@ -1325,7 +1296,8 @@ read_again:
Robert>  			continue;
 
Robert>  		mbio = bio_clone_mddev(bio, GFP_NOIO, mddev);
Robert> -		bio_trim(mbio, r1_bio->sector - bio->bi_iter.bi_sector, max_sectors);
Robert> +		bio_trim(mbio, r1_bio->sector - bio->bi_iter.bi_sector,
Robert> +			 max_sectors);
 
Robert>  		if (first_clone) {
Robert>  			/* do behind I/O ?
Robert> @@ -1408,6 +1380,42 @@ read_again:
Robert>  	wake_up(&conf->wait_barrier);
Robert>  }
 
Robert> +static void raid1_make_request(struct mddev *mddev, struct bio *bio)
Robert> +{
Robert> +	struct r1conf *conf = mddev->private;
Robert> +	struct r1bio *r1_bio;
Robert> +	sector_t start_next_window = wait_barrier(conf, bio);
Robert> +
Robert> +	/*
Robert> +	 * make_request() can abort the operation when read-ahead is being
Robert> +	 * used and no empty request is available.
Robert> +	 *
Robert> +	 */
Robert> +	r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
Robert> +
Robert> +	r1_bio->master_bio = bio;
Robert> +	r1_bio->sectors = bio_sectors(bio);
Robert> +	r1_bio->state = 0;
Robert> +	r1_bio->mddev = mddev;
Robert> +	r1_bio->sector = bio->bi_iter.bi_sector;
Robert> +
Robert> +	/* We might need to issue multiple reads to different
Robert> +	 * devices if there are bad blocks around, so we keep
Robert> +	 * track of the number of reads in bio->bi_phys_segments.
Robert> +	 * If this is 0, there is only one r1_bio and no locking
Robert> +	 * will be needed when requests complete.  If it is
Robert> +	 * non-zero, then it is the number of not-completed requests.
Robert> +	 */
Robert> +	bio->bi_phys_segments = 0;
Robert> +	bio_clear_flag(bio, BIO_SEG_VALID);
Robert> +
Robert> +	if (bio_data_dir(bio) == READ) {
Robert> +		raid1_read_request(mddev, bio, r1_bio);
Robert> +		return;
Robert> +	}
Robert> +	raid1_write_request(mddev, bio, r1_bio, start_next_window);
Robert> +}

Maybe this is pedantic, but why not have an 'else' instead of the
return to make it more clear?  If someone (like me :-) missed the
return on the quick read through, it's going to be some confusion.

if (bio_data_dir(bio) == READ)
   raid1_read_request(mddev, bui, r1_bio);
else
   raid1_write_request(mddev, bio, r1_bio, start_next_window);

seems cleaner.

Also, why are the calls different with the start_next_window parameter
added in?  Sorry for being dense, trying to understand the code
better...

John

^ permalink raw reply

* Re: [PATCH 1/2] md/raid1: Refactor raid1_make_request
From: Robert LeBlanc @ 2016-12-01 17:40 UTC (permalink / raw)
  To: John Stoffel; +Cc: linux-raid
In-Reply-To: <22592.18979.608624.661894@quad.stoffel.home>

On Thu, Dec 1, 2016 at 9:04 AM, John Stoffel <john@stoffel.org> wrote:
> Maybe this is pedantic, but why not have an 'else' instead of the
> return to make it more clear?  If someone (like me :-) missed the
> return on the quick read through, it's going to be some confusion.
>
> if (bio_data_dir(bio) == READ)
>    raid1_read_request(mddev, bui, r1_bio);
> else
>    raid1_write_request(mddev, bio, r1_bio, start_next_window);
>
> seems cleaner.

Yeah, I agree that is a bit cleaner, I was just following the previous
style. I'll change it to what you propose.

> Also, why are the calls different with the start_next_window parameter
> added in?  Sorry for being dense, trying to understand the code
> better...

start_next_windows is set by wait_barrier() and is only non-zero for
writes and the return value of zero is never used for reads. I tried
to move that code into the write branch, but I could not unload the
module due to conf->nr_pending == conf->queued+extra failing in
wait_event_lock_irq_cmd() in freeze_array(). Each read was
decrementing conf->nr_pending and since wait_barrier() was not being
called on reads, conf->nr_pending was very negative and would not
return to zero. In order to refactor that out completely would require
a lot more work and changes. I'd need to do more research into how
much the spin_locks are needed for reading when there is recovery
going on before I could start to try and refactor that. The other
option is to break wait_barrier() into two functions, one that gets
called for both read and write and the other for just the writes. I
can do the latter pretty easily if that is a desired direction.

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

^ permalink raw reply

* Re: Help: RAID5 - Disk failure during upgrade
From: Wols Lists @ 2016-12-01 18:32 UTC (permalink / raw)
  To: Thomas Büschgens, linux-raid
In-Reply-To: <CAPqb7wrkYKVMKoje7rpZmhd+3BFautqmhugFTbMvsuHxVoazSA@mail.gmail.com>

On 29/11/16 22:22, Thomas Büschgens wrote:
> Hi there,
> 
> 
> kind of "cry for help" mail to the list.
> 
> 
> I am running a Thecus N7510 NAS with 7 * 4TB diskd (Western Digital)
> in a RAID5 setup. This config was running "smoothly" for about 3 years
> now. Couple of days ago I decided to upgrade to 8TB disks instead.
> 
What sort of WD disk? Reds?
> 
> Following the recommended Thecus procedure I did the following:
> 
> 
> 1. Check SMART on all disks. Fine
> 2. Pull Disk No. 1
> 3. Re-assemble HD-Case with new 8TB disk
> 4. Put new Disk into slot 1
> 
> 
> So far, so good. The array immediatly started the rebuild... and a
> couple of minutes later disk No. 5 failed.
> 
> 
> Hiere the excerpt from the Thecus log:
> 
> 2016-11-28 23:13:09 [N7510] : User admin logged in from 192.168.7.29
> 2016-11-28 22:30:36 [N7510] : The RAID [RAID] on system [N7510] change
> to degrade mode.
> 2016-11-28 22:29:57 [N7510] : Disk 5 on [N7510] has failed.
> 2016-11-28 22:29:56 [N7510] : Disk 5 on [N7510] has failed.
> 2016-11-28 22:29:56 [N7510] : Disk 5 on [N7510] has failed.
> 2016-11-28 22:23:52 [N7510] : The RAID [RAID] on system [N7510] is
> recovering the RAID and rebuilding is in progress.
> 2016-11-28 22:23:43 [N7510] : Disk 1 on [N7510] has been added.
> 2016-11-28 22:17:06 [N7510] : The RAID [RAID] on system [N7510] change
> to degrade mode.
> 2016-11-28 22:17:05 [N7510] : Disk 1 on [N7510] has been removed.
> 
> 
> Disk No. 5 is now marked as a potential spare. The output from "mdadm
> --examine" is attached to the email.
> 
> 
> My basic question is the following: How to proceed.
> 
> 
> Currently I am considering the following options:
> 
> 
> 1. Change back to Disk No. 1 (4TB) - the original one. The disk was
> running smoothly when I changed it

Has the array been "live" while you've been upgrading it - in other
words has the data on it been updated? That'll put a spanner in the
works for this option.

> 2. Option No. 1 - but shutting the system down while doing this
> 3. Pull/Plug Disk No. 5 and see what happens
> 4. Reboot?
> 
Two disks out? The array won't come back after a reboot :-( I notice
however that mdadm says you still have 6 drives, so something doesn't
add up ... 7 drives, no 1 has been removed, no 5 has failed, 6 left???
> 
> I don't think this is a Thecus specific question - rather a
> mdraid-related issue - in terms of finding the correct procedure.
> 
> 
> Any advice / guidance will be appreciated. In case someone needs more
> detailed information I am happy to provide this.
> 
Does the array have space to slot an 8th drive in? Pulling a drive and
putting a replacement in does NOT sound sensible to me - for exactly
this reason! It kills redundancy while the array rebuilds :-(

I'll step back and let the experts tell you how to recover the array (if
you haven't modified the data, sticking the old drive 1 back in should
work), but once you've done that, if you can cope with the downtime I'd
dd the old drives to the new ones, then expand the partitions and raid
after the fact.

Or better, if you can add an eighth disk to the running array, move them
across one by one with an "mdadm --replace". You might need a new mdadm
for that, but it's a LOT safer!
> 
> Thx,
> 
> 
> Tom
> 
Cheers,
Wol

^ permalink raw reply

* Re: MD Remnants After --stop
From: Marc Smith @ 2016-12-01 19:40 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid
In-Reply-To: <87shq8743j.fsf@notabene.neil.brown.name>

On Wed, Nov 30, 2016 at 9:52 PM, NeilBrown <neilb@suse.com> wrote:
> On Mon, Nov 28 2016, Marc Smith wrote:
>
>>
>> # find /sys/block/md127/md
>> /sys/block/md127/md
>> /sys/block/md127/md/reshape_position
>> /sys/block/md127/md/layout
>> /sys/block/md127/md/raid_disks
>> /sys/block/md127/md/bitmap
>> /sys/block/md127/md/bitmap/chunksize
>
> This tells me that:
>         sysfs_remove_group(&mddev->kobj, &md_bitmap_group);
> hasn't been run, so mddev_delayed_delete() hasn't run.
> That suggests the final mddev_put() hsn't run. i.e. mddev->active is > 0
>
> Everything else suggests that array has been stopped and cleaned and
> should be gone...
>
> This seems to suggest that there is an unbalanced mddev_get() without a
> matching mddev_put().  I cannot find it though.
>
> If I could reproduce it, I would try to see what is happening by:
>
> - putting
>    printk("mddev->active = %d\n", atomic_read(&mddev->active));
>  in the top of mddev_put().  That shouldn't be *too* noisy.
>
> - putting
>    printk("rd=%d empty=%d ctime=%d hold=%d\n", mddev->raid_disks,
>         list_empty(&mddev->disks), mddev->ctime, mddev->hold_active);
>
>   in mddev_put() just before those values are tested.
>
>  - putting
>      printk("queue_work\n");
>    just before the 'queue_work()' call in mddev_put.
>
>  - putting
>      printk("mddev_delayed_delete\n");
>    in mddev_delayed_delete()
>
> Then see what gets printed when you stop the array.

I made those modifications to md.c and here is the kernel log when stopping:

--snip--
[ 3937.233487] mddev->active = 2
[ 3937.233503] mddev->active = 2
[ 3937.233509] mddev->active = 2
[ 3937.233516] mddev->active = 1
[ 3937.233516] rd=2 empty=0 ctime=1480617270 hold=0
[ 3937.233679] udevd[492]: inotify event: 8 for /dev/md127
[ 3937.241489] md127: detected capacity change from 73340747776 to 0
[ 3937.241493] md: md127 stopped.
[ 3937.241665] udevd[492]: device /dev/md127 closed, synthesising 'change'
[ 3937.241726] udevd[492]: seq 3631 queued, 'change' 'block'
[ 3937.241829] udevd[492]: seq 3631 forked new worker [4991]
[ 3937.241989] udevd[4991]: seq 3631 running
[ 3937.242002] dlm: dc18e34c-b136-1964-1c34-4509a7c60a19: leaving the
lockspace group...
[ 3937.242039] udevd[4991]: removing watch on '/dev/md127'
[ 3937.242068] mddev->active = 3
[ 3937.242069] udevd[492]: seq 3632 queued, 'offline' 'dlm'
[ 3937.242080] mddev->active = 3
[ 3937.242104] udevd[4991]: IMPORT 'probe-bcache -o udev /dev/md127'
/usr/lib/udev/rules.d/69-bcache.rules:16
[ 3937.242161] udevd[492]: seq 3632 forked new worker [4992]
[ 3937.242259] udevd[4993]: starting 'probe-bcache -o udev /dev/md127'
[ 3937.242753] dlm: dc18e34c-b136-1964-1c34-4509a7c60a19: group event done 0 0
[ 3937.242847] dlm: dc18e34c-b136-1964-1c34-4509a7c60a19:
release_lockspace final free
[ 3937.242861] md: unbind<dm-1>
[ 3937.256606] md: export_rdev(dm-1)
[ 3937.256612] md: unbind<dm-0>
[ 3937.263601] md: export_rdev(dm-0)
[ 3937.263688] mddev->active = 4
[ 3937.263751] mddev->active = 3
--snip--

I didn't use my modified mdadm which stops the synthesized CHANGE from
occurring, but if needed, I can re-run the test using that.

I don't see any of the "queue_work" or "mddev_delayed_delete" messages
anywhere in the kernel logs. Here is how those monitoring lines are
set in md.c:

--snip--
static void mddev_put(struct mddev *mddev)
{
        struct bio_set *bs = NULL;
        printk("mddev->active = %d\n", atomic_read(&mddev->active));
        if (!atomic_dec_and_lock(&mddev->active, &all_mddevs_lock))
                return;
        printk("rd=%d empty=%d ctime=%d hold=%d\n", mddev->raid_disks,
                list_empty(&mddev->disks), mddev->ctime, mddev->hold_active);
        if (!mddev->raid_disks && list_empty(&mddev->disks) &&
            mddev->ctime == 0 && !mddev->hold_active) {
                /* Array is not configured at all, and not held active,
                 * so destroy it */
                list_del_init(&mddev->all_mddevs);
                bs = mddev->bio_set;
                mddev->bio_set = NULL;
                if (mddev->gendisk) {
                        /* We did a probe so need to clean up.  Call
                         * queue_work inside the spinlock so that
                         * flush_workqueue() after mddev_find will
                         * succeed in waiting for the work to be done.
                         */
                        INIT_WORK(&mddev->del_work, mddev_delayed_delete);
                        printk("queue_work\n");
                        queue_work(md_misc_wq, &mddev->del_work);
                } else
                        kfree(mddev);
        }
        spin_unlock(&all_mddevs_lock);
        if (bs)
                bioset_free(bs);
}
--snip--

--snip--
static void mddev_delayed_delete(struct work_struct *ws)
{
        struct mddev *mddev = container_of(ws, struct mddev, del_work);
        printk("mddev_delayed_delete\n");
        sysfs_remove_group(&mddev->kobj, &md_bitmap_group);
        kobject_del(&mddev->kobj);
        kobject_put(&mddev->kobj);
}
--snip--

Let me know if the printk() lines weren't placed in the proper spots
and I'll fix and re-run the test. Thanks for your time.


--Marc


>
> NeilBrown

^ permalink raw reply

* Re: [PATCH 1/2] md/raid1: Refactor raid1_make_request
From: John Stoffel @ 2016-12-01 22:00 UTC (permalink / raw)
  To: Robert LeBlanc; +Cc: John Stoffel, linux-raid
In-Reply-To: <CAANLjFq2eezGRded8uDk6U0si_gt9j250qR4VQW+WFUeYJ70OA@mail.gmail.com>

>>>>> "Robert" == Robert LeBlanc <robert@leblancnet.us> writes:

Robert> On Thu, Dec 1, 2016 at 9:04 AM, John Stoffel <john@stoffel.org> wrote:
>> Maybe this is pedantic, but why not have an 'else' instead of the
>> return to make it more clear?  If someone (like me :-) missed the
>> return on the quick read through, it's going to be some confusion.
>> 
>> if (bio_data_dir(bio) == READ)
>> raid1_read_request(mddev, bui, r1_bio);
>> else
>> raid1_write_request(mddev, bio, r1_bio, start_next_window);
>> 
>> seems cleaner.

Robert> Yeah, I agree that is a bit cleaner, I was just following the
Robert> previous style. I'll change it to what you propose.

It's not a big deal, whatever makes you happy.  

>> Also, why are the calls different with the start_next_window parameter
>> added in?  Sorry for being dense, trying to understand the code
>> better...

Robert> start_next_windows is set by wait_barrier() and is only
Robert> non-zero for writes and the return value of zero is never used
Robert> for reads. I tried to move that code into the write branch,
Robert> but I could not unload the module due to conf->nr_pending ==
Robert> conf->queued+extra failing in wait_event_lock_irq_cmd() in
Robert> freeze_array(). Each read was decrementing conf->nr_pending
Robert> and since wait_barrier() was not being called on reads,
Robert> conf->nr_pending was very negative and would not return to
Robert> zero. In order to refactor that out completely would require a
Robert> lot more work and changes. I'd need to do more research into
Robert> how much the spin_locks are needed for reading when there is
Robert> recovery going on before I could start to try and refactor
Robert> that. The other option is to break wait_barrier() into two
Robert> functions, one that gets called for both read and write and
Robert> the other for just the writes. I can do the latter pretty
Robert> easily if that is a desired direction.

That second option might be the way to move forward.  But it sounds
like a bigger refactoring than is really needed now.

John

^ permalink raw reply

* Re: [PATCH v2 3/9] imsm: give md list of known bad blocks on startup
From: Jes Sorensen @ 2016-12-01 22:33 UTC (permalink / raw)
  To: Tomasz Majchrzak; +Cc: linux-raid
In-Reply-To: <20161130085134.GA29667@proton.igk.intel.com>

Tomasz Majchrzak <tomasz.majchrzak@intel.com> writes:
> On Tue, Nov 29, 2016 at 05:27:07PM -0500, Jes Sorensen wrote:
>> Tomasz Majchrzak <tomasz.majchrzak@intel.com> writes:
>> > On create set bad block support flag for each drive. On assmble also
>> > provide a list of known bad blocks. Bad blocks are stored in metadata
>> > per disk so they have to be checked against volume boundaries
>> > beforehand.
>> >
>> > Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
>> > ---
>> >  super-intel.c | 59
>> > +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> >  1 file changed, 59 insertions(+)
>> 
>> Tomasz,
>> 
>> Thanks for posting 1/9 v3 - I was in the process of applying this set,
>> but something is causing a conflict.
>> 
>> > diff --git a/super-intel.c b/super-intel.c
>> > index 421bfbc..b2afdff 100644
>> > --- a/super-intel.c
>> > +++ b/super-intel.c
>> [snip]
>> > @@ -7160,6 +7212,12 @@ static struct mdinfo *container_content_imsm(struct supertype *st, char *subarra
>> >  			info_d->events = __le32_to_cpu(mpb->generation_num);
>> >  			info_d->data_offset = pba_of_lba0(map);
>> >  			info_d->component_size = blocks_per_member(map);
>> > +
>> > +			info_d->bb.supported = 0;
>> > +			get_volume_badblocks(super->bbm_log, ord_to_idx(ord),
>> > +					     info_d->data_offset,
>> > +					     info_d->component_size,
>> > +					     &info_d->bb);
>> >  		}
>> >  		/* now that the disk list is up-to-date fixup recovery_start */
>> >  		update_recovery_start(super, dev, this);
>> 
>> This hunk is failing as my tree doesn't have the line above:
>> 			info_d->component_size = blocks_per_member(map);
>> 
>> I can merge this manually, but I prefer to be sure we are in sync just
>> in case. Where did that line come from - did I miss an earlier patch?
>
> My bad. I have just sent new version of the patch. Previously I had tested
> it against my local mirror of your repository which happened to run
> out-of-sync few days ago. Please keep telling me when patches do not apply
> cleanly, it would help me to spot my working environment issues.

Hi Tomek,

All good, thanks for the quick update. I'm looking through the patches
now, I did find a minor nit in patch 7 as well - I'll respond to that
one.

Cheers,
Jes

^ permalink raw reply

* Re: [PATCH v2 7/9] imsm: provide list of bad blocks for an array
From: Jes Sorensen @ 2016-12-01 22:34 UTC (permalink / raw)
  To: Tomasz Majchrzak; +Cc: linux-raid
In-Reply-To: <1480424555-31509-8-git-send-email-tomasz.majchrzak@intel.com>

Tomasz Majchrzak <tomasz.majchrzak@intel.com> writes:
> Provide list of bad blocks using memory allocated in advance so it's
> safe to call it from monitor.
>
> Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
> ---
>  super-intel.c | 39 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 39 insertions(+)
>
> diff --git a/super-intel.c b/super-intel.c
> index 731878c..fd2145c 100644
> --- a/super-intel.c
> +++ b/super-intel.c
> @@ -393,6 +393,7 @@ struct intel_super {
>  	struct intel_hba *hba; /* device path of the raid controller for this metadata */
>  	const struct imsm_orom *orom; /* platform firmware support */
>  	struct intel_super *next; /* (temp) list for disambiguating family_num */
> +	struct md_bb bb;	/* memory for get_bad_blocks call */
>  };
>  
>  struct intel_disk {
> @@ -4298,6 +4299,7 @@ static void __free_imsm(struct intel_super *super, int free_disks)
>  static void free_imsm(struct intel_super *super)
>  {
>  	__free_imsm(super, 1);
> +	free(super->bb.entries);
>  	free(super);
>  }
>  
> @@ -4318,6 +4320,14 @@ static struct intel_super *alloc_super(void)
>  
>  	super->current_vol = -1;
>  	super->create_offset = ~((unsigned long long) 0);
> +
> +	super->bb.entries = malloc(BBM_LOG_MAX_ENTRIES *
> +				   sizeof(struct md_bb_entry));

Hi Tomek,

Another nit - you are using plain malloc() here, while every other part
of super-intel.c uses xmalloc().

Cheers,
Jes

^ permalink raw reply

* Re: MD Remnants After --stop
From: NeilBrown @ 2016-12-01 22:35 UTC (permalink / raw)
  To: Marc Smith; +Cc: linux-raid
In-Reply-To: <CAHkw+LfmEyY8CSumeeftOrcYKzbkumHwtfeTuT0DhHZbaxdUSw@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 4565 bytes --]

On Fri, Dec 02 2016, Marc Smith wrote:

> On Wed, Nov 30, 2016 at 9:52 PM, NeilBrown <neilb@suse.com> wrote:
>> On Mon, Nov 28 2016, Marc Smith wrote:
>>
>>>
>>> # find /sys/block/md127/md
>>> /sys/block/md127/md
>>> /sys/block/md127/md/reshape_position
>>> /sys/block/md127/md/layout
>>> /sys/block/md127/md/raid_disks
>>> /sys/block/md127/md/bitmap
>>> /sys/block/md127/md/bitmap/chunksize
>>
>> This tells me that:
>>         sysfs_remove_group(&mddev->kobj, &md_bitmap_group);
>> hasn't been run, so mddev_delayed_delete() hasn't run.
>> That suggests the final mddev_put() hsn't run. i.e. mddev->active is > 0
>>
>> Everything else suggests that array has been stopped and cleaned and
>> should be gone...
>>
>> This seems to suggest that there is an unbalanced mddev_get() without a
>> matching mddev_put().  I cannot find it though.
>>
>> If I could reproduce it, I would try to see what is happening by:
>>
>> - putting
>>    printk("mddev->active = %d\n", atomic_read(&mddev->active));
>>  in the top of mddev_put().  That shouldn't be *too* noisy.
>>
>> - putting
>>    printk("rd=%d empty=%d ctime=%d hold=%d\n", mddev->raid_disks,
>>         list_empty(&mddev->disks), mddev->ctime, mddev->hold_active);
>>
>>   in mddev_put() just before those values are tested.
>>
>>  - putting
>>      printk("queue_work\n");
>>    just before the 'queue_work()' call in mddev_put.
>>
>>  - putting
>>      printk("mddev_delayed_delete\n");
>>    in mddev_delayed_delete()
>>
>> Then see what gets printed when you stop the array.
>
> I made those modifications to md.c and here is the kernel log when stopping:
>
> --snip--
> [ 3937.233487] mddev->active = 2
> [ 3937.233503] mddev->active = 2
> [ 3937.233509] mddev->active = 2
> [ 3937.233516] mddev->active = 1
> [ 3937.233516] rd=2 empty=0 ctime=1480617270 hold=0

At this point, mdadm has opened the /dev/md127 device, accessed a few
attributes via sysfs just to check on the status, and then closed it
again.
The array is still active, but we know that no other process has it
open.


> [ 3937.233679] udevd[492]: inotify event: 8 for /dev/md127
> [ 3937.241489] md127: detected capacity change from 73340747776 to 0
> [ 3937.241493] md: md127 stopped.

Now mdadm has opened the array again and issued the STOP_ARRAY ioctl.
Still nothing else has the array open.

> [ 3937.241665] udevd[492]: device /dev/md127 closed, synthesising 'change'
> [ 3937.241726] udevd[492]: seq 3631 queued, 'change' 'block'
> [ 3937.241829] udevd[492]: seq 3631 forked new worker [4991]
> [ 3937.241989] udevd[4991]: seq 3631 running
> [ 3937.242002] dlm: dc18e34c-b136-1964-1c34-4509a7c60a19: leaving the
> lockspace group...
> [ 3937.242039] udevd[4991]: removing watch on '/dev/md127'
> [ 3937.242068] mddev->active = 3

But somehow the ->active count got up to 3.
mdadm probably still has it open, but two other things do too.
If you have "mdadm --monitor" running in the background (which is good)
it will temporarily increase, then decrease the count.
udevd opens the device temporarily too.
So this isn't necessarily a problem.

> [ 3937.242069] udevd[492]: seq 3632 queued, 'offline' 'dlm'
> [ 3937.242080] mddev->active = 3
> [ 3937.242104] udevd[4991]: IMPORT 'probe-bcache -o udev /dev/md127'
> /usr/lib/udev/rules.d/69-bcache.rules:16
> [ 3937.242161] udevd[492]: seq 3632 forked new worker [4992]
> [ 3937.242259] udevd[4993]: starting 'probe-bcache -o udev /dev/md127'
> [ 3937.242753] dlm: dc18e34c-b136-1964-1c34-4509a7c60a19: group event done 0 0
> [ 3937.242847] dlm: dc18e34c-b136-1964-1c34-4509a7c60a19:
> release_lockspace final free
> [ 3937.242861] md: unbind<dm-1>
> [ 3937.256606] md: export_rdev(dm-1)
> [ 3937.256612] md: unbind<dm-0>
> [ 3937.263601] md: export_rdev(dm-0)
> [ 3937.263688] mddev->active = 4
> [ 3937.263751] mddev->active = 3

But here, the active count only drops down to 2. (it is decremented
after it is printed).  Assuming there really were no more messages like
this, there are two active references to the md device, and we don't
know what they are.

>
> I didn't use my modified mdadm which stops the synthesized CHANGE from
> occurring, but if needed, I can re-run the test using that.

It would be good to use the modified mdadm, if only to reduce the
noise.  It won't change the end result, but might make it easier to see
what is happening.
Also please add
   WARN_ON(1);

in the start of mddev_get() and mddev_put().
That will provide a stack trace whenever either of these are called, so
we can see who takes a references, and who doesn't release it.

Thanks,
NeilBrown


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox