Linux RAID subsystem development


* Re: recovering failed raid5
From: Alexander Shenkin @ 2016-10-31 10:44 UTC
  To: linux-raid, Andreas Klauer, rm, robin
In-Reply-To: <20161028133626.GA27462@cthulhu.home.robinhill.me.uk>

Thanks to everyone for their input.  I need to get a new, non-horrible 
3TB drive to ddrescue to.  The question: can I get a different 3TB drive 
(e.g. Toshiba P300 3TB, 
https://www.amazon.co.uk/Toshiba-P300-7200RPM-SATA-Drive/dp/B0151KM6F0), 
or are sizes of 3TB slightly different enough for that to cause me 
headaches when adding it back into the array?  If the latter is the 
case, then perhaps I need to aim for a 4TB drive replacement...

Thanks,
Allie

On 10/28/2016 2:36 PM, Robin Hill wrote:
> On Fri Oct 28, 2016 at 01:22:31PM +0100, Alexander Shenkin wrote:
>
>> Thanks Andreas, much appreciated.  Your points about selftests and smart
>> are well taken, and i'll implement them once i get this back up.  I'll
>> buy yet another new, non drive-from-hell (yes Roman, I did buy the same
>> damn drive again.  Will try to return it, thanks for the heads up...)
>> and follow your instructions below.
>>
>> One remaining question: is sdc definitely toast?  Or, is it possible
>> that the Timeout Mismatch (as mentioned by Robin Hill; thanks Robin) is
>> flagging the drive as failed, when something else is at play and perhaps
>> the drive is actually fine?
>>
> It's not definitely toast, no (but this is unrelated to the Timeout
> mismatches). It has some pending reallocations, which means the drive
> was unable to read from some blocks - if a write to the blocks fails
> then one of the spare blocks will be reallocated instead, but a write
> will often succeed and the pending reallocation will just be cleared.
>
> Unfortunately, reconstruction of the array depends on this data being
> readable, so the fact the drive isn't toast doesn't necessarily help.
> I'd suggest replicating (using ddrescue) that drive to the new one (when
> it arrives) as a first step. It's possible ddrescue will manage to read
> the data (it'll make several attempts, so can sometimes read data that
> fails initially), otherwise you'll end up with some missing data
> (possibly corrupt files, possibly corrupt filesystem metadata, possibly
> just a bit of extra noise in an audio/video file). Once that's done, you
> can do a proper check on sdc (e.g. a badblocks read/write test), which
> will either lead to sectors actually being reallocated, or to clearing
> the pending reallocations. Unless you get a lot more reallocated sectors
> than are currently pending, you can put the drive back into use if you
> like (bearing in mind the reputation of these drives and weighing the
> replacement cost against the value of your data).
>
> If you run a regular selftest on the array, these sorts of issues would
> be picked up and repaired automatically (the read errors will trigger
> rewrites and either reallocate blocks, clear the pending reallocations,
> or fail the drive). Otherwise they're liable to come back to bite you
> when you're trying to recover from a different failure.
>
> Timeout Mismatches will lead to drives being failed from an otherwise
> healthy array - a read failure on the drive can't be corrected as the
> drive is still busy trying when the write request goes through, so the
> drive gets kicked out of the array. You didn't say what the issue was
> with your original sdb, but if it wasn't a definite fault then it may
> have been affected by a timeout mismatch.
>
> Cheers,
>     Robin
>
>> To everyone: sorry for the multiple posts.  Was having majordomo issues...
>>
>> On 10/27/2016 5:04 PM, Andreas Klauer wrote:
>>> On Thu, Oct 27, 2016 at 04:06:14PM +0100, Alexander Shenkin wrote:
>>>> md2: raid5 mounted on /, via sd[abcd]3
>>>
>>> Two failed disks...
>>>
>>>> md0: raid1 mounted on /boot, via sd[abcd]1
>>>
>>> Actually only two disks active in that one, the other two are spares.
>>> It hardly matters for /boot, but you could grow it to a 4 disk raid1.
>>> Spares are not useful.
>>>
>>>> My sdb was recently reporting problems.  Instead of second guessing
>>>> those problems, I just got a new disk, replaced it, and added it to
>>>> the arrays.
>>>
>>> Replacing right away is the right thing to do.
>>> Unfortunately it seems you have another disk that is broken too.
>>>
>>>> 2) smartctl (disabled on drives - can enable once back up.  should I?)
>>>> note: SMART only enabled after problems started cropping up.
>>>
>>> But... why? Why disable smart? And if you do, is it a surprise that you
>>> only notice disk failures when it's already too late?
>>
>> yeah, i asked myself that same question.  there was probably some reason
>> I did, but i don't remember what it was.  i'll keep smart enabled from
>> now on...
>>
>>> You should enable smart, and not only that, also run regular selftests,
>>> and have smartd running, and have it send you mail when something happens.
>>> Same with raid checks: raid checks are at least something, but they won't
>>> tell you how many reallocated sectors your drive has.
>>
>> will do
>>
>>>> root@machinename:/home/username# smartctl --xall /dev/sda
>>>
>>> Looks fine but never ran a selftest.
>>>
>>>> root@machinename:/home/username# smartctl --xall /dev/sdb
>>>
>>> Looks new. (New drives need selftests too.)
>>>
>>>> root@machinename:/home/username# smartctl --xall /dev/sdc
>>>> smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.19.0-39-generic] (local build)
>>>> Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
>>>>
>>>> === START OF INFORMATION SECTION ===
>>>> Model Family:     Seagate Barracuda 7200.14 (AF)
>>>> Device Model:     ST3000DM001-1CH166
>>>> Serial Number:    W1F1N909
>>>>
>>>> 197 Current_Pending_Sector  -O--C-   100   100   000    -    8
>>>> 198 Offline_Uncorrectable   ----C-   100   100   000    -    8
>>>
>>> This one is faulty and probably the reason why your resync failed.
>>> You have no redundancy left, so an option here would be to get a
>>> new drive and ddrescue it over.
>>>
>>> That's exactly the kind of thing you should be notified instantly
>>> about via mail. And it should be discovered when running selftests.
>>> Without full surface scan of the media, the disk itself won't know.
>>>
>>>> ==> WARNING: A firmware update for this drive may be available,
>>>> see the following Seagate web pages:
>>>> http://knowledge.seagate.com/articles/en_US/FAQ/207931en
>>>> http://knowledge.seagate.com/articles/en_US/FAQ/223651en
>>>
>>> About this, *shrug*
>>> I don't have these drives, you might want to check that out.
>>> But it probably won't fix bad sectors.
>>>
>>>> root@machinename:/home/username# smartctl --xall /dev/sdd
>>>
>>> Some strange things in the error log here, but old.
>>> Still, same as for all others - selftest.
>>>
>>>> ################### mdadm --examine ###########################
>>>>
>>>> /dev/sda1:
>>>>      Raid Level : raid1
>>>>    Raid Devices : 2
>>>
>>> A RAID 1 with two drives, could be four.
>>>
>>>> /dev/sdb1:
>>>> /dev/sdc1:
>>>
>>> So these would also have data instead of being spare.
>>>
>>>> /dev/sda3:
>>>>      Raid Level : raid5
>>>>    Raid Devices : 4
>>>>
>>>>     Update Time : Mon Oct 24 09:02:52 2016
>>>>          Events : 53547
>>>>
>>>>    Device Role : Active device 0
>>>>    Array State : A..A ('A' == active, '.' == missing)
>>>
>>> RAID-5 with two failed disks.
>>>
>>>> /dev/sdc3:
>>>>      Raid Level : raid5
>>>>    Raid Devices : 4
>>>>
>>>>     Update Time : Mon Oct 24 08:53:57 2016
>>>>          Events : 53539
>>>>
>>>>    Device Role : Active device 2
>>>>    Array State : AAAA ('A' == active, '.' == missing)
>>>
>>> This one failed, 8:53.
>>>
>>>> ############ /proc/mdstat ############################################
>>>>
>>>> md2 : active raid5 sda3[0] sdc3[2](F) sdd3[3]
>>>>       8760565248 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/2]
>>>> [U__U]
>>>
>>> [U__U] refers to device roles as in [0123],
>>> so device roles 0 and 3 are okay, 1 and 2 are missing.
>>>
>>>> md0 : active raid1 sdb1[4](S) sdc1[2](S) sda1[0] sdd1[3]
>>>>       1950656 blocks super 1.2 [2/2] [UU]
>>>
>>> Those two spares again, could be [UUUU] instead.
>>>
>>> tl;dr
>>> stop it all,
>>> ddrescue /dev/sdc to your new disk,
>>> try your luck with --assemble --force (not using /dev/sdc!),
>>> get yet another new disk, add, sync, cross fingers.
>>>
>>> There's also mdadm --replace instead of --remove, --add,
>>> that sometimes helps if there are only a few bad sectors
>>> on each disk. If the disk you already removed hadn't
>>> already been kicked from the array by the time you replaced it,
>>> maybe it would have avoided this problem.
>>>
>>> But good disk monitoring and testing is even more important.
>>
>> thanks a bunch, Andreas.  I'll monitor and test from now on...
>>
>>> Regards
>>> Andreas Klauer
>>
>
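
The reading of `[U__U]` in the quoted reply (one character per device role, in role order) can be sketched in C. This is only an illustration: `missing_roles` is a hypothetical helper, not part of mdadm, and it assumes the status string has already been cut out of /proc/mdstat.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not an mdadm function.
 * Decode an mdstat status string such as "[U__U]". Each character
 * between the brackets stands for one device role, in role order:
 * 'U' means the role is up, '_' means it is missing.
 * Stores up to `max` missing role numbers in `missing` and returns the
 * total number of missing roles, or -1 if the string is malformed. */
static int missing_roles(const char *status, int *missing, size_t max)
{
	size_t len;
	size_t role;
	int count = 0;

	assert(status != NULL && missing != NULL);

	len = strlen(status);
	if (len < 3 || status[0] != '[' || status[len - 1] != ']')
		return -1;

	for (role = 0; role + 2 < len; role++) {
		char c = status[role + 1];

		if (c == '_') {
			if ((size_t)count < max)
				missing[count] = (int)role;
			count++;
		} else if (c != 'U') {
			return -1;
		}
	}
	return count;
}
```

For the md2 line in this thread, the helper would report roles 1 and 2 as missing, which matches the two failed members of the four-disk RAID-5.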


* Re: recovering failed raid5
From: Andreas Klauer @ 2016-10-31 11:09 UTC
  To: Alexander Shenkin; +Cc: linux-raid
In-Reply-To: <af2bf92d-c944-2269-1925-50baa44755a2@shenkin.org>

On Mon, Oct 31, 2016 at 10:44:38AM +0000, Alexander Shenkin wrote:
> or are sizes of 3TB slightly different enough for that to cause me 
> headaches when adding it back into the array?

This is usually not much to worry about; worst case, you tell mdadm to
shrink the size a little. But I don't see how it applies in your case.
The partition you're interested in is not 3TB, and there is a swap 
partition you could skip if there's really no other way...

You can ddrescue partitions too, doesn't have to be the whole disk.

You posted parted output earlier; parted uses a stupid unit by default,
so I usually prefer 'parted /dev/disk unit s print free'.

Regards
Andreas Klauer


* [PATCH 1/8] imsm: parse bad block log in metadata on startup
From: Tomasz Majchrzak @ 2016-10-31 14:50 UTC
  To: linux-raid; +Cc: Jes.Sorensen, Tomasz Majchrzak

Always allocate memory for all log entries to avoid the need for memory
allocation when the monitor requests that a bad block be recorded.

Also add some extra checks to keep the static code analyzer happy.

Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
---
 super-intel.c | 158 +++++++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 112 insertions(+), 46 deletions(-)
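
The size checks that load_bbm_log() performs can be sketched on their own. This is a minimal illustration, assuming the packed layout introduced by this patch (a 2-byte plus 4-byte block address, so 8 bytes per entry) and values already converted to host byte order. `bbm_log_bytes` and `check_bbm_log` are hypothetical names; the return codes simply mirror the ones used in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BBM_LOG_MAX_ENTRIES 254
#define BBM_LOG_SIGNATURE 0xABADB10C
/* signature + entry_count, both 32-bit */
#define BBM_LOG_HEADER_BYTES (2 * sizeof(uint32_t))

/* Illustrative copies of the packed on-disk structures (GCC/Clang syntax). */
struct bbm_log_block_addr {
	uint16_t w1;
	uint32_t dw1;
} __attribute__ ((__packed__));

struct bbm_log_entry {
	uint8_t marked_count;	/* number of blocks marked - 1 */
	uint8_t disk_ordinal;
	struct bbm_log_block_addr defective_block_start;
} __attribute__ ((__packed__));

/* Bytes a log with entry_count entries occupies; an empty log takes none. */
static size_t bbm_log_bytes(uint32_t entry_count)
{
	assert(entry_count <= BBM_LOG_MAX_ENTRIES);
	if (entry_count == 0)
		return 0;
	return BBM_LOG_HEADER_BYTES +
		(size_t)entry_count * sizeof(struct bbm_log_entry);
}

/* Validate a stored log header; returns 0 when consistent. */
static int check_bbm_log(uint32_t stored_size, uint32_t signature,
			 uint32_t entry_count)
{
	if (stored_size < BBM_LOG_HEADER_BYTES)
		return 2;	/* too small to hold the header */
	if (signature != BBM_LOG_SIGNATURE ||
	    entry_count > BBM_LOG_MAX_ENTRIES)
		return 3;	/* wrong signature or too many entries */
	if (stored_size != BBM_LOG_HEADER_BYTES +
	    entry_count * sizeof(struct bbm_log_entry))
		return 4;	/* size does not match entry count */
	return 0;
}
```

With 8-byte entries, a full log of 254 entries comes to 2040 bytes, which is the amount the series reserves up front.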

diff --git a/super-intel.c b/super-intel.c
index c146bbd..5d6d534 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -217,22 +217,24 @@ struct imsm_super {
 } __attribute__ ((packed));
 
 #define BBM_LOG_MAX_ENTRIES 254
+#define BBM_LOG_MAX_LBA_ENTRY_VAL 256		/* Represents 256 LBAs */
+#define BBM_LOG_SIGNATURE 0xABADB10C
+
+struct bbm_log_block_addr {
+	__u16 w1;
+	__u32 dw1;
+} __attribute__ ((__packed__));
 
 struct bbm_log_entry {
-	__u64 defective_block_start;
-#define UNREADABLE 0xFFFFFFFF
-	__u32 spare_block_offset;
-	__u16 remapped_marked_count;
-	__u16 disk_ordinal;
+	__u8 marked_count;		/* Number of blocks marked - 1 */
+	__u8 disk_ordinal;		/* Disk entry within the imsm_super */
+	struct bbm_log_block_addr defective_block_start;
 } __attribute__ ((__packed__));
 
 struct bbm_log {
 	__u32 signature; /* 0xABADB10C */
 	__u32 entry_count;
-	__u32 reserved_spare_block_count; /* 0 */
-	__u32 reserved; /* 0xFFFF */
-	__u64 first_spare_lba;
-	struct bbm_log_entry mapped_block_entries[BBM_LOG_MAX_ENTRIES];
+	struct bbm_log_entry marked_block_entries[BBM_LOG_MAX_ENTRIES];
 } __attribute__ ((__packed__));
 
 #ifndef MDASSEMBLE
@@ -785,6 +787,92 @@ static struct imsm_dev *get_imsm_dev(struct intel_super *super, __u8 index)
 	return NULL;
 }
 
+#if BYTE_ORDER == LITTLE_ENDIAN
+static inline unsigned long long __le48_to_cpu(const struct bbm_log_block_addr
+					       *addr)
+{
+	return ((((__u64)addr->dw1) << 16) | addr->w1);
+}
+
+static inline struct bbm_log_block_addr __cpu_to_le48(unsigned long long sec)
+{
+	struct bbm_log_block_addr addr;
+
+	addr.w1 =  (__u16)(sec & 0xFFFF);
+	addr.dw1 = (__u32)((sec >> 16) & 0xFFFFFFFF);
+	return addr;
+}
+#elif BYTE_ORDER == BIG_ENDIAN
+static inline unsigned long long __le48_to_cpu(const struct bbm_log_block_addr
+					       *addr)
+{
+	return ((((__u64)__le32_to_cpu(addr->dw1)) << 16) |
+		__le16_to_cpu(addr->w1));
+}
+
+static inline struct bbm_log_block_addr __cpu_to_le48(unsigned long long sec)
+{
+	struct bbm_log_block_addr addr;
+
+	addr.w1 =  __cpu_to_le16((__u16)(sec & 0xFFFF));
+	addr.dw1 = __cpu_to_le32((__u32)(sec >> 16) & 0xFFFFFFFF);
+	return addr;
+}
+#else
+#  error "unknown endianness."
+#endif
+
+#ifndef MDASSEMBLE
+/* get size of the bbm log */
+static __u32 get_imsm_bbm_log_size(struct bbm_log *log)
+{
+	if (!log || log->entry_count == 0)
+		return 0;
+
+	return sizeof(log->signature) +
+		sizeof(log->entry_count) +
+		log->entry_count * sizeof(struct bbm_log_entry);
+}
+#endif /* MDASSEMBLE */
+
+/* allocate and load BBM log from metadata */
+static int load_bbm_log(struct intel_super *super)
+{
+	struct imsm_super *mpb = super->anchor;
+	__u32 bbm_log_size =  __le32_to_cpu(mpb->bbm_log_size);
+
+	super->bbm_log = malloc(sizeof(struct bbm_log));
+	if (!super->bbm_log)
+		return 1;
+
+	if (bbm_log_size) {
+		struct bbm_log *log = (void *)mpb +
+			__le32_to_cpu(mpb->mpb_size) - bbm_log_size;
+		__u32 entry_count;
+
+		if (bbm_log_size < sizeof(log->signature) +
+		    sizeof(log->entry_count))
+			return 2;
+
+		entry_count = __le32_to_cpu(log->entry_count);
+		if ((__le32_to_cpu(log->signature) != BBM_LOG_SIGNATURE) ||
+		    (entry_count > BBM_LOG_MAX_ENTRIES))
+			return 3;
+
+		if (bbm_log_size !=
+		    sizeof(log->signature) + sizeof(log->entry_count) +
+		    entry_count * sizeof(struct bbm_log_entry))
+			return 4;
+
+		memcpy(super->bbm_log, log, bbm_log_size);
+	} else {
+		super->bbm_log->signature = __cpu_to_le32(BBM_LOG_SIGNATURE);
+		super->bbm_log->entry_count = 0;
+	}
+
+	return 0;
+}
+
 /*
  * for second_map:
  *  == MAP_0 get first map
@@ -1433,7 +1521,7 @@ static void examine_super_imsm(struct supertype *st, char *homehost)
 	printf("          Disks : %d\n", mpb->num_disks);
 	printf("   RAID Devices : %d\n", mpb->num_raid_devs);
 	print_imsm_disk(__get_imsm_disk(mpb, super->disks->index), super->disks->index, reserved);
-	if (super->bbm_log) {
+	if (get_imsm_bbm_log_size(super->bbm_log)) {
 		struct bbm_log *log = super->bbm_log;
 
 		printf("\n");
@@ -1441,9 +1529,6 @@ static void examine_super_imsm(struct supertype *st, char *homehost)
 		printf("       Log Size : %d\n", __le32_to_cpu(mpb->bbm_log_size));
 		printf("      Signature : %x\n", __le32_to_cpu(log->signature));
 		printf("    Entry Count : %d\n", __le32_to_cpu(log->entry_count));
-		printf("   Spare Blocks : %d\n",  __le32_to_cpu(log->reserved_spare_block_count));
-		printf("    First Spare : %llx\n",
-		       (unsigned long long) __le64_to_cpu(log->first_spare_lba));
 	}
 	for (i = 0; i < mpb->num_raid_devs; i++) {
 		struct mdinfo info;
@@ -3628,19 +3713,6 @@ static int parse_raid_devices(struct intel_super *super)
 	return 0;
 }
 
-/* retrieve a pointer to the bbm log which starts after all raid devices */
-struct bbm_log *__get_imsm_bbm_log(struct imsm_super *mpb)
-{
-	void *ptr = NULL;
-
-	if (__le32_to_cpu(mpb->bbm_log_size)) {
-		ptr = mpb;
-		ptr += mpb->mpb_size - __le32_to_cpu(mpb->bbm_log_size);
-	}
-
-	return ptr;
-}
-
 /*******************************************************************************
  * Function:	check_mpb_migr_compatibility
  * Description:	Function checks for unsupported migration features:
@@ -3790,12 +3862,6 @@ static int load_imsm_mpb(int fd, struct intel_super *super, char *devname)
 		return 3;
 	}
 
-	/* FIXME the BBM log is disk specific so we cannot use this global
-	 * buffer for all disks.  Ok for now since we only look at the global
-	 * bbm_log_size parameter to gate assembly
-	 */
-	super->bbm_log = __get_imsm_bbm_log(super->anchor);
-
 	return 0;
 }
 
@@ -3839,6 +3905,9 @@ load_and_parse_mpb(int fd, struct intel_super *super, char *devname, int keep_fd
 	if (err)
 		return err;
 	err = parse_raid_devices(super);
+	if (err)
+		return err;
+	err = load_bbm_log(super);
 	clear_hi(super);
 	return err;
 }
@@ -3903,6 +3972,8 @@ static void __free_imsm(struct intel_super *super, int free_disks)
 		free(elem);
 		elem = next;
 	}
+	if (super->bbm_log)
+		free(super->bbm_log);
 	super->hba = NULL;
 }
 
@@ -4508,7 +4579,7 @@ static int get_super_block(struct intel_super **super_list, char *devnm, char *d
 		*super_list = s;
 	} else {
 		if (s)
-			free(s);
+			free_imsm(s);
 		if (dfd >= 0)
 			close(dfd);
 	}
@@ -4570,6 +4641,8 @@ static int load_super_imsm(struct supertype *st, int fd, char *devname)
 	free_super_imsm(st);
 
 	super = alloc_super();
+	if (!super)
+		return 1;
 	/* Load hba and capabilities if they exist.
 	 * But do not preclude loading metadata in case capabilities or hba are
 	 * non-compliant and ignore_hw_compat is set.
@@ -4912,7 +4985,7 @@ static int init_super_imsm(struct supertype *st, mdu_array_info_t *info,
 
 	super = alloc_super();
 	if (super && posix_memalign(&super->buf, 512, mpb_size) != 0) {
-		free(super);
+		free_imsm(super);
 		super = NULL;
 	}
 	if (!super) {
@@ -4922,7 +4995,7 @@ static int init_super_imsm(struct supertype *st, mdu_array_info_t *info,
 	if (posix_memalign(&super->migr_rec_buf, 512, MIGR_REC_BUF_SIZE) != 0) {
 		pr_err("could not allocate migr_rec buffer\n");
 		free(super->buf);
-		free(super);
+		free_imsm(super);
 		return 0;
 	}
 	memset(super->buf, 0, mpb_size);
@@ -5489,11 +5562,6 @@ static int store_super_imsm(struct supertype *st, int fd)
 #endif
 }
 
-static int imsm_bbm_log_size(struct imsm_super *mpb)
-{
-	return __le32_to_cpu(mpb->bbm_log_size);
-}
-
 #ifndef MDASSEMBLE
 static int validate_geometry_imsm_container(struct supertype *st, int level,
 					    int layout, int raiddisks, int chunk,
@@ -5529,6 +5597,10 @@ static int validate_geometry_imsm_container(struct supertype *st, int level,
 	 * note that there is no fd for the disks in array.
 	 */
 	super = alloc_super();
+	if (!super) {
+		close(fd);
+		return 0;
+	}
 	rv = find_intel_hba_capability(fd, super, verbose > 0 ? dev : NULL);
 	if (rv != 0) {
 #if DEBUG
@@ -6760,12 +6832,6 @@ static struct mdinfo *container_content_imsm(struct supertype *st, char *subarra
 		pr_err("Unsupported attributes in IMSM metadata.Arrays activation is blocked.\n");
 	}
 
-	/* check for bad blocks */
-	if (imsm_bbm_log_size(super->anchor)) {
-		pr_err("BBM log found in IMSM metadata.Arrays activation is blocked.\n");
-		sb_errors = 1;
-	}
-
 	/* count spare devices, not used in maps
 	 */
 	for (d = super->disks; d; d = d->next)
-- 
1.8.3.1
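
The 48-bit little-endian block address used above can also be packed byte by byte, which avoids the BYTE_ORDER branches. This is a sketch of that alternative, not what the patch does: `put_le48` and `get_le48` are hypothetical names, and the six bytes are assumed to have the same layout as `w1` followed by `dw1` on disk.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers, not part of the patch.
 * Store the low 48 bits of `sector` as six little-endian bytes. */
static void put_le48(uint8_t out[6], uint64_t sector)
{
	int i;

	/* the on-disk field cannot hold more than 48 bits */
	assert((sector >> 48) == 0);

	for (i = 0; i < 6; i++)
		out[i] = (uint8_t)(sector >> (8 * i));
}

/* Read six little-endian bytes back into a host-order value. */
static uint64_t get_le48(const uint8_t in[6])
{
	uint64_t sector = 0;
	int i;

	for (i = 5; i >= 0; i--)
		sector = (sector << 8) | in[i];
	return sector;
}
```

Because the shifts operate on values rather than memory, the same code gives the same byte layout on little-endian and big-endian hosts.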



* [PATCH 2/8] imsm: write bad block log on metadata sync
From: Tomasz Majchrzak @ 2016-10-31 14:50 UTC
  To: linux-raid; +Cc: Jes.Sorensen, Tomasz Majchrzak
In-Reply-To: <1477925454-16809-1-git-send-email-tomasz.majchrzak@intel.com>

Pre-allocate memory for the largest possible bad block section when the
monitor is being opened, to avoid the need for memory allocation on
metadata sync.

If memory for a structure has been allocated in the mpb buffer but hasn't
been used yet, it will be taken by the next buffer grow, leading to
insufficient memory on metadata flush. Start tracking such memory and
take it into account when growing a buffer. Also, an assert has been
added in debug mode to warn when more metadata has been written than
memory allocated.

Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
---
 super-intel.c | 49 +++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 45 insertions(+), 4 deletions(-)
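
The buffer-growth rule described above can be sketched in isolation: reserved-but-unused space counts toward the required size, and the result is rounded up to a 512-byte multiple. `grown_buf_len` and `round_up` are hypothetical names used only for this illustration; the real logic lives in imsm_prepare_update().

```c
#include <assert.h>
#include <stddef.h>

#define SECTOR_BYTES 512

/* Round val up to the next multiple of base. */
static size_t round_up(size_t val, size_t base)
{
	assert(base != 0);
	return ((val + base - 1) / base) * base;
}

/* Hypothetical helper: return the new buffer length needed, or 0 when
 * the current buffer already fits the metadata, the reserved extra
 * space and the pending update. */
static size_t grown_buf_len(size_t mpb_size, size_t extra_space,
			    size_t update_len, size_t buf_len)
{
	size_t needed = mpb_size + extra_space + update_len;

	if (needed <= buf_len)
		return 0;
	return round_up(needed, SECTOR_BYTES);
}
```

For example, 1000 bytes of metadata plus a 2040-byte reservation for a full bad block log no longer fit in a 1024-byte buffer, so the buffer would grow to 3072 bytes.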

diff --git a/super-intel.c b/super-intel.c
index 5d6d534..0591c55 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -81,7 +81,8 @@
 					MPB_ATTRIB_RAID1           | \
 					MPB_ATTRIB_RAID10          | \
 					MPB_ATTRIB_RAID5           | \
-					MPB_ATTRIB_EXP_STRIPE_SIZE)
+					MPB_ATTRIB_EXP_STRIPE_SIZE | \
+					MPB_ATTRIB_BBM)
 
 /* Define attributes that are unused but not harmful */
 #define MPB_ATTRIB_IGNORED		(MPB_ATTRIB_NEVER_USE)
@@ -361,6 +362,7 @@ struct intel_super {
 		array, it indicates that mdmon is allowed to clean migration
 		record */
 	size_t len; /* size of the 'buf' allocation */
+	size_t extra_space; /* extra space in 'buf' that is not used yet */
 	void *next_buf; /* for realloc'ing buf from the manager */
 	size_t next_len;
 	int updates_pending; /* count of pending updates for mdmon */
@@ -420,6 +422,7 @@ enum imsm_update_type {
 	update_takeover,
 	update_general_migration_checkpoint,
 	update_size_change,
+	update_prealloc_badblocks_mem,
 };
 
 struct imsm_update_activate_spare {
@@ -508,6 +511,10 @@ struct imsm_update_add_remove_disk {
 	enum imsm_update_type type;
 };
 
+struct imsm_update_prealloc_bb_mem {
+	enum imsm_update_type type;
+};
+
 static const char *_sys_dev_type[] = {
 	[SYS_DEV_UNKNOWN] = "Unknown",
 	[SYS_DEV_SAS] = "SAS",
@@ -3250,6 +3257,8 @@ static size_t disks_to_mpb_size(int disks)
 	size += (4 - 2) * sizeof(struct imsm_map);
 	/* 4 possible disk_ord_tbl's */
 	size += 4 * (disks - 1) * sizeof(__u32);
+	/* maximum bbm log */
+	size += sizeof(struct bbm_log);
 
 	return size;
 }
@@ -3710,6 +3719,8 @@ static int parse_raid_devices(struct intel_super *super)
 		super->len = len;
 	}
 
+	super->extra_space += space_needed;
+
 	return 0;
 }
 
@@ -4844,6 +4855,7 @@ static int init_super_imsm_volume(struct supertype *st, mdu_array_info_t *info,
 		super->anchor = mpb_new;
 		mpb->mpb_size = __cpu_to_le32(size_new);
 		memset(mpb_new + size_old, 0, size_round - size_old);
+		super->len = size_round;
 	}
 	super->current_vol = idx;
 
@@ -5382,6 +5394,7 @@ static int write_super_imsm(struct supertype *st, int doclose)
 	__u32 mpb_size = sizeof(struct imsm_super) - sizeof(struct imsm_disk);
 	int num_disks = 0;
 	int clear_migration_record = 1;
+	__u32 bbm_log_size;
 
 	/* 'generation' is incremented everytime the metadata is written */
 	generation = __le32_to_cpu(mpb->generation_num);
@@ -5419,9 +5432,23 @@ static int write_super_imsm(struct supertype *st, int doclose)
 		if (is_gen_migration(dev2))
 			clear_migration_record = 0;
 	}
-	mpb_size += __le32_to_cpu(mpb->bbm_log_size);
+
+	bbm_log_size = get_imsm_bbm_log_size(super->bbm_log);
+
+	if (bbm_log_size) {
+		memcpy((void *)mpb + mpb_size, super->bbm_log, bbm_log_size);
+		mpb->attributes |= MPB_ATTRIB_BBM;
+	} else
+		mpb->attributes &= ~MPB_ATTRIB_BBM;
+
+	super->anchor->bbm_log_size = __cpu_to_le32(bbm_log_size);
+	mpb_size += bbm_log_size;
 	mpb->mpb_size = __cpu_to_le32(mpb_size);
 
+#ifdef DEBUG
+	assert(super->len == 0 || mpb_size <= super->len);
+#endif
+
 	/* recalculate checksum */
 	sum = __gen_imsm_checksum(mpb);
 	mpb->check_sum = __cpu_to_le32(sum);
@@ -7104,6 +7131,7 @@ static int imsm_open_new(struct supertype *c, struct active_array *a,
 {
 	struct intel_super *super = c->sb;
 	struct imsm_super *mpb = super->anchor;
+	struct imsm_update_prealloc_bb_mem u;
 
 	if (atoi(inst) >= mpb->num_raid_devs) {
 		pr_err("subarry index %d, out of range\n", atoi(inst));
@@ -7112,6 +7140,10 @@ static int imsm_open_new(struct supertype *c, struct active_array *a,
 
 	dprintf("imsm: open_new %s\n", inst);
 	a->info.container_member = atoi(inst);
+
+	u.type = update_prealloc_badblocks_mem;
+	imsm_update_metadata_locally(c, &u, sizeof(u));
+
 	return 0;
 }
 
@@ -8854,6 +8886,9 @@ static void imsm_process_update(struct supertype *st,
 		}
 		break;
 	}
+	case update_prealloc_badblocks_mem: {
+		break;
+	}
 	default:
 		pr_err("error: unsuported process update type:(type: %d)\n",	type);
 	}
@@ -9094,6 +9129,11 @@ static int imsm_prepare_update(struct supertype *st,
 	case update_add_remove_disk:
 		/* no update->len needed */
 		break;
+	case update_prealloc_badblocks_mem: {
+		super->extra_space += sizeof(struct bbm_log) -
+			get_imsm_bbm_log_size(super->bbm_log);
+		break;
+	}
 	default:
 		return 0;
 	}
@@ -9104,12 +9144,13 @@ static int imsm_prepare_update(struct supertype *st,
 	else
 		buf_len = super->len;
 
-	if (__le32_to_cpu(mpb->mpb_size) + len > buf_len) {
+	if (__le32_to_cpu(mpb->mpb_size) + super->extra_space + len > buf_len) {
 		/* ok we need a larger buf than what is currently allocated
 		 * if this allocation fails process_update will notice that
 		 * ->next_len is set and ->next_buf is NULL
 		 */
-		buf_len = ROUND_UP(__le32_to_cpu(mpb->mpb_size) + len, 512);
+		buf_len = ROUND_UP(__le32_to_cpu(mpb->mpb_size) +
+				   super->extra_space + len, 512);
 		if (super->next_buf)
 			free(super->next_buf);
 
-- 
1.8.3.1



* [PATCH 3/8] imsm: give md list of known bad blocks on startup
From: Tomasz Majchrzak @ 2016-10-31 14:50 UTC
  To: linux-raid; +Cc: Jes.Sorensen, Tomasz Majchrzak
In-Reply-To: <1477925454-16809-1-git-send-email-tomasz.majchrzak@intel.com>

On create, set the bad block support flag for each drive. On assemble,
also provide a list of known bad blocks. Bad blocks are stored in
metadata per disk, so they have to be checked against volume boundaries
beforehand.

Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
---
 super-intel.c | 59 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 59 insertions(+)
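
Checking a bad block range against volume boundaries is an interval-overlap question. The sketch below uses the textbook half-open test, which is not the exact condition in is_bad_block_in_volume(): the patch tests the start and end points separately, so results can differ when a range merely touches a volume boundary or spans the whole volume. `ranges_overlap` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not the patch's function.
 * Treat both ranges as half-open: [start, start + len).
 * Returns 1 when they share at least one sector, 0 otherwise. */
static int ranges_overlap(uint64_t a_start, uint64_t a_len,
			  uint64_t b_start, uint64_t b_len)
{
	assert(a_len > 0 && b_len > 0);

	return a_start < b_start + b_len && b_start < a_start + a_len;
}
```

Here the first range would be the bad block (start sector and marked_count + 1 sectors) and the second the volume (data offset and component size).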

diff --git a/super-intel.c b/super-intel.c
index 0591c55..77a2ca9 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -880,6 +880,56 @@ static int load_bbm_log(struct intel_super *super)
 	return 0;
 }
 
+/* checks if bad block is within volume boundaries */
+static int is_bad_block_in_volume(const struct bbm_log_entry *entry,
+			const unsigned long long start_sector,
+			const unsigned long long size)
+{
+	unsigned long long bb_start;
+	unsigned long long bb_end;
+
+	bb_start = __le48_to_cpu(&entry->defective_block_start);
+	bb_end = bb_start + (entry->marked_count + 1);
+
+	if (((bb_start >= start_sector) && (bb_start < start_sector + size)) ||
+	    ((bb_end >= start_sector) && (bb_end <= start_sector + size)))
+		return 1;
+
+	return 0;
+}
+
+/* get list of bad blocks on a drive for a volume */
+static void get_volume_badblocks(const struct bbm_log *log, const __u8 idx,
+			const unsigned long long start_sector,
+			const unsigned long long size,
+			struct md_bb *bbs)
+{
+	__u32 count = 0;
+	__u32 i;
+
+	for (i = 0; i < log->entry_count; i++) {
+		const struct bbm_log_entry *ent =
+			&log->marked_block_entries[i];
+		struct md_bb_entry *bb;
+
+		if ((ent->disk_ordinal == idx) &&
+		    is_bad_block_in_volume(ent, start_sector, size)) {
+
+			if (!bbs->entries) {
+				bbs->entries = xmalloc(BBM_LOG_MAX_ENTRIES *
+						     sizeof(*bb));
+				if (!bbs->entries)
+					break;
+			}
+
+			bb = &bbs->entries[count++];
+			bb->sector = __le48_to_cpu(&ent->defective_block_start);
+			bb->length = ent->marked_count + 1;
+		}
+	}
+	bbs->count = count;
+}
+
 /*
  * for second_map:
  *  == MAP_0 get first map
@@ -2887,6 +2937,7 @@ static void getinfo_super_imsm_volume(struct supertype *st, struct mdinfo *info,
 							info->array.level,
 							info->array.chunk_size,
 							info->component_size);
+	info->bb.supported = 0;
 
 	memset(info->uuid, 0, sizeof(info->uuid));
 	info->recovery_start = MaxSector;
@@ -3053,6 +3104,7 @@ static void getinfo_super_imsm(struct supertype *st, struct mdinfo *info, char *
 	info->name[0] = 0;
 	info->recovery_start = MaxSector;
 	info->recovery_blocked = imsm_reshape_blocks_arrays_changes(st->sb);
+	info->bb.supported = 0;
 
 	/* do we have the all the insync disks that we expect? */
 	mpb = super->anchor;
@@ -6986,6 +7038,12 @@ static struct mdinfo *container_content_imsm(struct supertype *st, char *subarra
 			info_d->events = __le32_to_cpu(mpb->generation_num);
 			info_d->data_offset = pba_of_lba0(map);
 			info_d->component_size = blocks_per_member(map);
+
+			info_d->bb.supported = 0;
+			get_volume_badblocks(super->bbm_log, ord_to_idx(ord),
+					     info_d->data_offset,
+					     info_d->component_size,
+					     &info_d->bb);
 		}
 		/* now that the disk list is up-to-date fixup recovery_start */
 		update_recovery_start(super, dev, this);
@@ -7989,6 +8047,7 @@ static struct mdinfo *imsm_activate_spare(struct active_array *a,
 		di->data_offset = pba_of_lba0(map);
 		di->component_size = a->info.component_size;
 		di->container_member = inst;
+		di->bb.supported = 0;
 		super->random = random32();
 		di->next = rv;
 		rv = di;
-- 
1.8.3.1



* [PATCH 4/8] imsm: record new bad block in bad block log
From: Tomasz Majchrzak @ 2016-10-31 14:50 UTC
  To: linux-raid; +Cc: Jes.Sorensen, Tomasz Majchrzak
In-Reply-To: <1477925454-16809-1-git-send-email-tomasz.majchrzak@intel.com>

Check for a duplicate first, or try to merge it with an existing bad
block. If the block range exceeds BBM_LOG_MAX_LBA_ENTRY_VAL (256) blocks,
it must be split into multiple ranges. Fail if the maximum number of bad
blocks has already been reached.

Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
---
 super-intel.c | 142 ++++++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 134 insertions(+), 8 deletions(-)
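
The splitting rule can be sketched separately from the merge logic: a range is cut into pieces of at most 256 blocks, each stored with a count of "blocks - 1" so that it fits in one byte, and nothing is written if the pieces would not fit. `struct bb_chunk`, `chunks_needed` and `split_bad_range` are hypothetical names for this illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_BLOCKS_PER_ENTRY 256u

/* Hypothetical stand-in for one log entry. */
struct bb_chunk {
	uint64_t sector;	/* first bad sector of this piece */
	uint8_t marked_count;	/* number of blocks in the piece - 1 */
};

/* Number of entries a range of `length` blocks needs. */
static size_t chunks_needed(uint64_t length)
{
	return (size_t)((length + MAX_BLOCKS_PER_ENTRY - 1) /
			MAX_BLOCKS_PER_ENTRY);
}

/* Split [sector, sector + length) into pieces of at most 256 blocks.
 * Returns the number of pieces written, or 0 if they do not fit in
 * `max_out` entries (in which case nothing is written). */
static size_t split_bad_range(uint64_t sector, uint64_t length,
			      struct bb_chunk *out, size_t max_out)
{
	size_t n = 0;

	assert(out != NULL);

	if (chunks_needed(length) > max_out)
		return 0;

	while (length > 0) {
		uint64_t cnt = length < MAX_BLOCKS_PER_ENTRY ?
			length : MAX_BLOCKS_PER_ENTRY;

		out[n].sector = sector;
		out[n].marked_count = (uint8_t)(cnt - 1);
		n++;

		sector += cnt;
		length -= cnt;
	}
	return n;
}
```

Checking the required count before writing anything mirrors the patch's behaviour of failing outright when the log would exceed BBM_LOG_MAX_ENTRIES, rather than recording a partial range.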

diff --git a/super-intel.c b/super-intel.c
index 77a2ca9..5c54b8c 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -840,6 +840,86 @@ static __u32 get_imsm_bbm_log_size(struct bbm_log *log)
 		sizeof(log->entry_count) +
 		log->entry_count * sizeof(struct bbm_log_entry);
 }
+
+/* check if bad block is not partially stored in bbm log */
+static int is_stored_in_bbm(struct bbm_log *log, const __u8 idx, const unsigned
+			    long long sector, const int length, __u32 *pos)
+{
+	__u32 i;
+
+	for (i = *pos; i < log->entry_count; i++) {
+		struct bbm_log_entry *entry = &log->marked_block_entries[i];
+		unsigned long long bb_start;
+		unsigned long long bb_end;
+
+		bb_start = __le48_to_cpu(&entry->defective_block_start);
+		bb_end = bb_start + (entry->marked_count + 1);
+
+		if ((entry->disk_ordinal == idx) && (bb_start >= sector) &&
+		    (bb_end <= sector + length)) {
+			*pos = i;
+			return 1;
+		}
+	}
+	return 0;
+}
+
+/* record new bad block in bbm log */
+static int record_new_badblock(struct bbm_log *log, const __u8 idx, unsigned
+			       long long sector, int length)
+{
+	int new_bb = 0;
+	__u32 pos = 0;
+	struct bbm_log_entry *entry = NULL;
+
+	while (is_stored_in_bbm(log, idx, sector, length, &pos)) {
+		struct bbm_log_entry *e = &log->marked_block_entries[pos];
+
+		if ((e->marked_count + 1 == BBM_LOG_MAX_LBA_ENTRY_VAL) &&
+		    (__le48_to_cpu(&e->defective_block_start) == sector)) {
+			sector += BBM_LOG_MAX_LBA_ENTRY_VAL;
+			length -= BBM_LOG_MAX_LBA_ENTRY_VAL;
+			pos = pos + 1;
+			continue;
+		}
+		entry = e;
+		break;
+	}
+
+	if (entry) {
+		int cnt = (length <= BBM_LOG_MAX_LBA_ENTRY_VAL) ? length :
+			BBM_LOG_MAX_LBA_ENTRY_VAL;
+		entry->defective_block_start = __cpu_to_le48(sector);
+		entry->marked_count = cnt - 1;
+		if (cnt == length)
+			return 1;
+		sector += cnt;
+		length -= cnt;
+	}
+
+	new_bb = ROUND_UP(length, BBM_LOG_MAX_LBA_ENTRY_VAL) /
+		BBM_LOG_MAX_LBA_ENTRY_VAL;
+	if (log->entry_count + new_bb > BBM_LOG_MAX_ENTRIES)
+		return 0;
+
+	while (length > 0) {
+		int cnt = (length <= BBM_LOG_MAX_LBA_ENTRY_VAL) ? length :
+			BBM_LOG_MAX_LBA_ENTRY_VAL;
+		struct bbm_log_entry *entry =
+			&log->marked_block_entries[log->entry_count];
+
+		entry->defective_block_start = __cpu_to_le48(sector);
+		entry->marked_count = cnt - 1;
+		entry->disk_ordinal = idx;
+
+		sector += cnt;
+		length -= cnt;
+
+		log->entry_count++;
+	}
+
+	return new_bb;
+}
 #endif /* MDASSEMBLE */
 
 /* allocate and load BBM log from metadata */
@@ -7560,6 +7640,25 @@ skip_mark_checkpoint:
 	return consistent;
 }
 
+static int imsm_disk_slot_to_ord(struct active_array *a, int slot)
+{
+	int inst = a->info.container_member;
+	struct intel_super *super = a->container->sb;
+	struct imsm_dev *dev = get_imsm_dev(super, inst);
+	struct imsm_map *map = get_imsm_map(dev, MAP_0);
+
+	if (slot > map->num_members) {
+		pr_err("imsm: imsm_disk_slot_to_ord %d out of range 0..%d\n",
+		       slot, map->num_members - 1);
+		return -1;
+	}
+
+	if (slot < 0)
+		return -1;
+
+	return get_imsm_ord_tbl_ent(dev, slot, MAP_0);
+}
+
 static void imsm_set_disk(struct active_array *a, int n, int state)
 {
 	int inst = a->info.container_member;
@@ -7570,19 +7669,14 @@ static void imsm_set_disk(struct active_array *a, int n, int state)
 	struct mdinfo *mdi;
 	int recovery_not_finished = 0;
 	int failed;
-	__u32 ord;
+	int ord;
 	__u8 map_state;
 
-	if (n > map->num_members)
-		pr_err("imsm: set_disk %d out of range 0..%d\n",
-			n, map->num_members - 1);
-
-	if (n < 0)
+	ord = imsm_disk_slot_to_ord(a, n);
+	if (ord < 0)
 		return;
 
 	dprintf("imsm: set_disk %d:%x\n", n, state);
-
-	ord = get_imsm_ord_tbl_ent(dev, n, MAP_0);
 	disk = get_imsm_disk(super, ord_to_idx(ord));
 
 	/* check for new failures */
@@ -9477,6 +9571,37 @@ int validate_container_imsm(struct mdinfo *info)
 }
 #ifndef MDASSEMBLE
 /*******************************************************************************
+* Function:   imsm_record_badblock
+* Description: This routine stores new bad block record in BBM log
+*
+* Parameters:
+*     a		: array containing a bad block
+*     slot	: disk number containing a bad block
+*     sector	: bad block sector
+*     length	: bad block sectors range
+* Returns:
+*     1 : Success
+*     0 : Error
+******************************************************************************/
+static int imsm_record_badblock(struct active_array *a, int slot,
+			  unsigned long long sector, int length)
+{
+	struct intel_super *super = a->container->sb;
+	int ord;
+	int ret;
+
+	ord = imsm_disk_slot_to_ord(a, slot);
+	if (ord < 0)
+		return 0;
+
+	ret = record_new_badblock(super->bbm_log, ord_to_idx(ord), sector,
+				   length);
+	if (ret)
+		super->updates_pending++;
+
+	return ret;
+}
+/*******************************************************************************
  * Function:	init_migr_record_imsm
  * Description:	Function inits imsm migration record
  * Parameters:
@@ -11026,5 +11151,6 @@ struct superswitch super_imsm = {
 	.activate_spare = imsm_activate_spare,
 	.process_update = imsm_process_update,
 	.prepare_update = imsm_prepare_update,
+	.record_bad_block = imsm_record_badblock,
 #endif /* MDASSEMBLE */
 };
-- 
1.8.3.1
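
The range-splitting arithmetic in record_new_badblock() above can be tried in isolation. What follows is only a minimal sketch, not the mdadm code: struct bb_chunk, chunks_needed() and split_range() are hypothetical names, and the only thing taken from the series is the 256-sector limit per entry (BBM_LOG_MAX_LBA_ENTRY_VAL).

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SECTORS_PER_ENTRY 256	/* mirrors BBM_LOG_MAX_LBA_ENTRY_VAL */

/* hypothetical stand-in for one log entry: start sector and sector count */
struct bb_chunk {
	unsigned long long start;
	int count;
};

/* number of entries needed to cover 'length' sectors (rounded up) */
int chunks_needed(int length)
{
	if (length <= 0)
		return 0;
	return (length + MAX_SECTORS_PER_ENTRY - 1) / MAX_SECTORS_PER_ENTRY;
}

/*
 * Split [sector, sector + length) into chunks of at most 256 sectors.
 * Returns the number of chunks written, or -1 if 'out' is too small,
 * in which case nothing is written (all-or-nothing, as in the patch).
 */
int split_range(unsigned long long sector, int length,
		struct bb_chunk *out, size_t out_len)
{
	int n = chunks_needed(length);
	int i;

	assert(out != NULL || out_len == 0);
	if ((size_t)n > out_len)
		return -1;

	for (i = 0; i < n; i++) {
		int cnt = (length <= MAX_SECTORS_PER_ENTRY) ? length :
			MAX_SECTORS_PER_ENTRY;

		out[i].start = sector;
		out[i].count = cnt;
		sector += cnt;
		length -= cnt;
	}
	return n;
}
```

A 600-sector range starting at sector 1000 comes out as three chunks of 256, 256 and 88 sectors. Checking the capacity before writing anything is what keeps the log from being left half-updated when it is nearly full.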


^ permalink raw reply related

* [PATCH 5/8] imsm: clear bad block from bad block log
From: Tomasz Majchrzak @ 2016-10-31 14:50 UTC (permalink / raw)
  To: linux-raid; +Cc: Jes.Sorensen, Tomasz Majchrzak
In-Reply-To: <1477925454-16809-1-git-send-email-tomasz.majchrzak@intel.com>

Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
---
 super-intel.c | 53 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 53 insertions(+)

diff --git a/super-intel.c b/super-intel.c
index 5c54b8c..0b84012 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -920,6 +920,28 @@ static int record_new_badblock(struct bbm_log *log, const __u8 idx, unsigned
 
 	return new_bb;
 }
+
+/* clear given bad block */
+static int clear_badblock(struct bbm_log *log, const __u8 idx, const unsigned
+			  long long sector, const int length) {
+	__u32 i = 0;
+
+	while (i < log->entry_count) {
+		struct bbm_log_entry *entries = log->marked_block_entries;
+
+		if ((entries[i].disk_ordinal == idx) &&
+		    (__le48_to_cpu(&entries[i].defective_block_start) ==
+		     sector) && (entries[i].marked_count + 1 == length)) {
+			if (i < log->entry_count - 1)
+				entries[i] = entries[log->entry_count - 1];
+			log->entry_count--;
+			break;
+		}
+		i++;
+	}
+
+	return 1;
+}
 #endif /* MDASSEMBLE */
 
 /* allocate and load BBM log from metadata */
@@ -9602,6 +9624,36 @@ static int imsm_record_badblock(struct active_array *a, int slot,
 	return ret;
 }
 /*******************************************************************************
+* Function:   imsm_clear_badblock
+* Description: This routine clears bad block record from BBM log
+*
+* Parameters:
+*     a		: array containing a bad block
+*     slot	: disk number containing a bad block
+*     sector	: bad block sector
+*     length	: bad block sectors range
+* Returns:
+*     1 : Success
+*     0 : Error
+******************************************************************************/
+static int imsm_clear_badblock(struct active_array *a, int slot,
+			unsigned long long sector, int length)
+{
+	struct intel_super *super = a->container->sb;
+	int ord;
+	int ret;
+
+	ord = imsm_disk_slot_to_ord(a, slot);
+	if (ord < 0)
+		return 0;
+
+	ret = clear_badblock(super->bbm_log, ord_to_idx(ord), sector, length);
+	if (ret)
+		super->updates_pending++;
+
+	return ret;
+}
+/*******************************************************************************
  * Function:	init_migr_record_imsm
  * Description:	Function inits imsm migration record
  * Parameters:
@@ -11152,5 +11204,6 @@ struct superswitch super_imsm = {
 	.process_update = imsm_process_update,
 	.prepare_update = imsm_prepare_update,
 	.record_bad_block = imsm_record_badblock,
+	.clear_bad_block  = imsm_clear_badblock,
 #endif /* MDASSEMBLE */
 };
-- 
1.8.3.1


^ permalink raw reply related

* [PATCH 6/8] imsm: clear bad blocks if disk becomes unavailable
From: Tomasz Majchrzak @ 2016-10-31 14:50 UTC (permalink / raw)
  To: linux-raid; +Cc: Jes.Sorensen, Tomasz Majchrzak
In-Reply-To: <1477925454-16809-1-git-send-email-tomasz.majchrzak@intel.com>

If a disk fails or goes missing, clear the bad blocks associated with it
from metadata. If necessary, update disk ordinals.

Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
---
 super-intel.c | 46 +++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 39 insertions(+), 7 deletions(-)

diff --git a/super-intel.c b/super-intel.c
index 0b84012..efcad01 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -921,6 +921,24 @@ static int record_new_badblock(struct bbm_log *log, const __u8 idx, unsigned
 	return new_bb;
 }
 
+/* clear all bad blocks for given disk */
+static void clear_disk_badblocks(struct bbm_log *log, const __u8 idx)
+{
+	__u32 i = 0;
+
+	while (i < log->entry_count) {
+		struct bbm_log_entry *entries = log->marked_block_entries;
+
+		if (entries[i].disk_ordinal == idx) {
+			if (i < log->entry_count - 1)
+				entries[i] = entries[log->entry_count - 1];
+			log->entry_count--;
+		} else {
+			i++;
+		}
+	}
+}
+
 /* clear given bad block */
 static int clear_badblock(struct bbm_log *log, const __u8 idx, const unsigned
 			  long long sector, const int length) {
@@ -7331,7 +7349,8 @@ static int is_resyncing(struct imsm_dev *dev)
 }
 
 /* return true if we recorded new information */
-static int mark_failure(struct imsm_dev *dev, struct imsm_disk *disk, int idx)
+static int mark_failure(struct intel_super *super,
+			struct imsm_dev *dev, struct imsm_disk *disk, int idx)
 {
 	__u32 ord;
 	int slot;
@@ -7373,12 +7392,16 @@ static int mark_failure(struct imsm_dev *dev, struct imsm_disk *disk, int idx)
 	}
 	if (map->failed_disk_num == 0xff)
 		map->failed_disk_num = slot;
+
+	clear_disk_badblocks(super->bbm_log, ord_to_idx(ord));
+
 	return 1;
 }
 
-static void mark_missing(struct imsm_dev *dev, struct imsm_disk *disk, int idx)
+static void mark_missing(struct intel_super *super,
+			 struct imsm_dev *dev, struct imsm_disk *disk, int idx)
 {
-	mark_failure(dev, disk, idx);
+	mark_failure(super, dev, disk, idx);
 
 	if (disk->scsi_id == __cpu_to_le32(~(__u32)0))
 		return;
@@ -7414,7 +7437,7 @@ static void handle_missing(struct intel_super *super, struct imsm_dev *dev)
 			end_migration(dev, super, map_state);
 	}
 	for (dl = super->missing; dl; dl = dl->next)
-		mark_missing(dev, &dl->disk, dl->index);
+		mark_missing(super, dev, &dl->disk, dl->index);
 	super->updates_pending++;
 }
 
@@ -7703,7 +7726,7 @@ static void imsm_set_disk(struct active_array *a, int n, int state)
 
 	/* check for new failures */
 	if (state & DS_FAULTY) {
-		if (mark_failure(dev, disk, ord_to_idx(ord)))
+		if (mark_failure(super, dev, disk, ord_to_idx(ord)))
 			super->updates_pending++;
 	}
 
@@ -8770,7 +8793,7 @@ static int apply_takeover_update(struct imsm_update_takeover *u,
 	for (du = super->missing; du; du = du->next)
 		if (du->index >= 0) {
 			set_imsm_ord_tbl_ent(map, du->index, du->index);
-			mark_missing(dv->dev, &du->disk, du->index);
+			mark_missing(super, dv->dev, &du->disk, du->index);
 		}
 
 	return 1;
@@ -9345,8 +9368,9 @@ static void imsm_delete(struct intel_super *super, struct dl **dlp, unsigned ind
 	struct dl *iter;
 	struct imsm_dev *dev;
 	struct imsm_map *map;
-	int i, j, num_members;
+	unsigned int i, j, num_members;
 	__u32 ord;
+	struct bbm_log *log = super->bbm_log;
 
 	dprintf("deleting device[%d] from imsm_super\n", index);
 
@@ -9379,6 +9403,14 @@ static void imsm_delete(struct intel_super *super, struct dl **dlp, unsigned ind
 		}
 	}
 
+	for (i = 0; i < log->entry_count; i++) {
+		struct bbm_log_entry *entry = &log->marked_block_entries[i];
+
+		if (entry->disk_ordinal <= index)
+			continue;
+		entry->disk_ordinal--;
+	}
+
 	mpb->num_disks--;
 	super->updates_pending++;
 	if (*dlp) {
-- 
1.8.3.1
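
The two operations this patch performs on the log (dropping all entries of one disk, and shifting ordinals down after a disk is deleted) can be sketched on a simplified array. This is an illustration under assumptions: struct entry and both function names are hypothetical and carry far less state than struct bbm_log_entry.

```c
#include <assert.h>
#include <stddef.h>

/* hypothetical, simplified log entry: owning disk and a start sector */
struct entry {
	unsigned char disk;
	unsigned long long start;
};

/*
 * Drop every entry that belongs to 'disk'. A matching entry is overwritten
 * with the last one and the count shrinks; 'i' is not advanced in that case
 * because the entry moved into slot i has not been examined yet.
 * Returns the new entry count. Entry order is not preserved.
 */
size_t drop_disk_entries(struct entry *e, size_t count, unsigned char disk)
{
	size_t i = 0;

	assert(e != NULL || count == 0);
	while (i < count) {
		if (e[i].disk == disk) {
			e[i] = e[count - 1];
			count--;
		} else {
			i++;
		}
	}
	return count;
}

/*
 * After the disk at index 'removed' is deleted from the disk table, every
 * higher index moves down by one, so the entries have to follow.
 */
void renumber_after_delete(struct entry *e, size_t count, unsigned char removed)
{
	size_t i;

	assert(e != NULL || count == 0);
	for (i = 0; i < count; i++)
		if (e[i].disk > removed)
			e[i].disk--;
}
```

Swapping in the last entry keeps removal O(1) per entry at the cost of ordering, which seems acceptable here since lookups scan the whole log anyway.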


^ permalink raw reply related

* [PATCH 7/8] imsm: provide list of bad blocks for an array
From: Tomasz Majchrzak @ 2016-10-31 14:50 UTC (permalink / raw)
  To: linux-raid; +Cc: Jes.Sorensen, Tomasz Majchrzak
In-Reply-To: <1477925454-16809-1-git-send-email-tomasz.majchrzak@intel.com>

Provide list of bad blocks using memory allocated in advance so it's
safe to call it from monitor.

Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
---
 super-intel.c | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/super-intel.c b/super-intel.c
index efcad01..e795730 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -390,6 +390,7 @@ struct intel_super {
 	struct intel_hba *hba; /* device path of the raid controller for this metadata */
 	const struct imsm_orom *orom; /* platform firmware support */
 	struct intel_super *next; /* (temp) list for disambiguating family_num */
+	struct md_bb bb;	/* memory for get_bad_blocks call */
 };
 
 struct intel_disk {
@@ -4163,6 +4164,7 @@ static void __free_imsm(struct intel_super *super, int free_disks)
 static void free_imsm(struct intel_super *super)
 {
 	__free_imsm(super, 1);
+	free(super->bb.entries);
 	free(super);
 }
 
@@ -4183,6 +4185,14 @@ static struct intel_super *alloc_super(void)
 
 	super->current_vol = -1;
 	super->create_offset = ~((unsigned long long) 0);
+
+	super->bb.entries = malloc(BBM_LOG_MAX_ENTRIES *
+				   sizeof(struct md_bb_entry));
+	if (!super->bb.entries) {
+		free(super);
+		return NULL;
+	}
+
 	return super;
 }
 
@@ -9686,6 +9696,34 @@ static int imsm_clear_badblock(struct active_array *a, int slot,
 	return ret;
 }
 /*******************************************************************************
+* Function:   imsm_get_badblocks
+* Description: This routine gets the list of bad blocks for an array
+*
+* Parameters:
+*     a		: array
+*     slot	: disk number
+* Returns:
+*     bb	: structure containing bad blocks
+*     NULL	: error
+******************************************************************************/
+static struct md_bb *imsm_get_badblocks(struct active_array *a, int slot)
+{
+	int inst = a->info.container_member;
+	struct intel_super *super = a->container->sb;
+	struct imsm_dev *dev = get_imsm_dev(super, inst);
+	struct imsm_map *map = get_imsm_map(dev, MAP_0);
+	int ord;
+
+	ord = imsm_disk_slot_to_ord(a, slot);
+	if (ord < 0)
+		return NULL;
+
+	get_volume_badblocks(super->bbm_log, ord_to_idx(ord), pba_of_lba0(map),
+			     blocks_per_member(map), &super->bb);
+
+	return &super->bb;
+}
+/*******************************************************************************
  * Function:	init_migr_record_imsm
  * Description:	Function inits imsm migration record
  * Parameters:
@@ -11237,5 +11275,6 @@ struct superswitch super_imsm = {
 	.prepare_update = imsm_prepare_update,
 	.record_bad_block = imsm_record_badblock,
 	.clear_bad_block  = imsm_clear_badblock,
+	.get_bad_blocks   = imsm_get_badblocks,
 #endif /* MDASSEMBLE */
 };
-- 
1.8.3.1
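
The "allocate in advance" idea from the commit message can be shown as a small pattern: one allocation at setup time, and a query path that only refills the existing buffer. This is a sketch under assumptions, not the patch itself. The range and range_list types and all three functions are hypothetical, and the containment filter is only a guess at the kind of per-volume filtering get_volume_badblocks() does.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_ENTRIES 254	/* mirrors BBM_LOG_MAX_ENTRIES */

/* hypothetical result types, loosely modelled on md_bb / md_bb_entry */
struct range {
	unsigned long long sector;
	int length;
};

struct range_list {
	struct range *entries;	/* allocated once, reused for every query */
	int count;
};

/* setup path: the only place that allocates; returns 0 on success */
int range_list_init(struct range_list *rl)
{
	assert(rl != NULL);
	rl->count = 0;
	rl->entries = malloc(MAX_ENTRIES * sizeof(*rl->entries));
	return rl->entries ? 0 : -1;
}

/*
 * query path: refill the preallocated buffer with the ranges from 'src'
 * that lie entirely inside [first, first + size). No allocation happens
 * here, so it cannot fail or block on memory.
 */
int range_list_fill(struct range_list *rl, const struct range *src, int n,
		    unsigned long long first, unsigned long long size)
{
	int i;

	assert(rl != NULL && rl->entries != NULL);
	rl->count = 0;
	for (i = 0; i < n && rl->count < MAX_ENTRIES; i++) {
		if (src[i].sector >= first &&
		    src[i].sector + src[i].length <= first + size)
			rl->entries[rl->count++] = src[i];
	}
	return rl->count;
}

/* teardown path: release the buffer */
void range_list_free(struct range_list *rl)
{
	free(rl->entries);
	rl->entries = NULL;
	rl->count = 0;
}
```

The caller gets back a pointer into storage it does not own, so the result is only valid until the next query. That is the trade-off for keeping the monitor path free of malloc().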


^ permalink raw reply related

* [PATCH 8/8] imsm: implement "--examine-badblocks" command
From: Tomasz Majchrzak @ 2016-10-31 14:50 UTC (permalink / raw)
  To: linux-raid; +Cc: Jes.Sorensen, Tomasz Majchrzak
In-Reply-To: <1477925454-16809-1-git-send-email-tomasz.majchrzak@intel.com>

Implement "--examine-badblocks" command to provide list of bad blocks in
metadata for a disk.

Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
---
 super-intel.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 56 insertions(+)

diff --git a/super-intel.c b/super-intel.c
index e795730..c534afd 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -9724,6 +9724,61 @@ static struct md_bb *imsm_get_badblocks(struct active_array *a, int slot)
 	return &super->bb;
 }
 /*******************************************************************************
+* Function:   examine_badblocks_imsm
+* Description: Prints list of bad blocks on a disk to the standard output
+*
+* Parameters:
+*     st	: metadata handler
+*     fd	: open file descriptor for device
+*     devname	: device name
+* Returns:
+*     0 : Success
+*     1 : Error
+******************************************************************************/
+static int examine_badblocks_imsm(struct supertype *st, int fd, char *devname)
+{
+	struct intel_super *super = st->sb;
+	struct bbm_log *log = super->bbm_log;
+	struct dl *d = NULL;
+	int any = 0;
+
+	for (d = super->disks; d ; d = d->next) {
+		if (strcmp(d->devname, devname) == 0)
+			break;
+	}
+
+	if ((d == NULL) || (d->index < 0)) { /* serial mismatch probably */
+		pr_err("%s doesn't appear to be part of a raid array\n",
+		       devname);
+		return 1;
+	}
+
+	if (log != NULL) {
+		unsigned int i;
+		struct bbm_log_entry *entry = &log->marked_block_entries[0];
+
+		for (i = 0; i < log->entry_count; i++) {
+			if (entry[i].disk_ordinal == d->index) {
+				unsigned long long sector = __le48_to_cpu(
+					&entry[i].defective_block_start);
+				int cnt = entry[i].marked_count + 1;
+
+				if (!any) {
+					printf("Bad-blocks on %s:\n", devname);
+					any = 1;
+				}
+
+				printf("%20llu for %d sectors\n", sector, cnt);
+			}
+		}
+	}
+
+	if (!any)
+		printf("No bad-blocks list configured on %s\n", devname);
+
+	return 0;
+}
+/*******************************************************************************
  * Function:	init_migr_record_imsm
  * Description:	Function inits imsm migration record
  * Parameters:
@@ -11241,6 +11296,7 @@ struct superswitch super_imsm = {
 	.manage_reshape = imsm_manage_reshape,
 	.recover_backup = recover_backup_imsm,
 	.copy_metadata = copy_metadata_imsm,
+	.examine_badblocks = examine_badblocks_imsm,
 #endif
 	.match_home	= match_home_imsm,
 	.uuid_from_super= uuid_from_super_imsm,
-- 
1.8.3.1
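
For readers who want to check the output by hand: the start sector is stored as a 48-bit little-endian value and the count as "sectors - 1". The sketch below decodes such a value byte by byte, which works on either host endianness, and formats a line with the same printf format as examine_badblocks_imsm(). The function names are hypothetical and this is not how the series implements __le48_to_cpu().

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* store the low 48 bits of 'sector' as six little-endian bytes */
void put_le48(uint8_t out[6], uint64_t sector)
{
	int i;

	assert(out != NULL);
	for (i = 0; i < 6; i++)
		out[i] = (uint8_t)(sector >> (8 * i));
}

/* read six little-endian bytes back into a host-order integer */
uint64_t get_le48(const uint8_t in[6])
{
	uint64_t v = 0;
	int i;

	assert(in != NULL);
	for (i = 0; i < 6; i++)
		v |= (uint64_t)in[i] << (8 * i);
	return v;
}

/*
 * Format one bad-block line with the format string used in the patch.
 * 'marked_count' holds "sectors - 1", so 0 means one sector.
 * Returns the snprintf() result.
 */
int format_bad_range(char *buf, size_t len, const uint8_t start[6],
		     uint8_t marked_count)
{
	return snprintf(buf, len, "%20llu for %d sectors\n",
			(unsigned long long)get_le48(start),
			marked_count + 1);
}
```

Because a count byte of 255 means 256 sectors, a single entry can describe at most 256 consecutive sectors, which is why longer ranges show up as several lines.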


^ permalink raw reply related

* Re: recovering failed raid5
From: Robin Hill @ 2016-10-31 15:19 UTC (permalink / raw)
  To: Alexander Shenkin; +Cc: linux-raid, Andreas Klauer, rm, robin
In-Reply-To: <af2bf92d-c944-2269-1925-50baa44755a2@shenkin.org>

On Mon Oct 31, 2016 at 10:44:38AM +0000, Alexander Shenkin wrote:

> Thanks to everyone for their input.  I need to get a new, non-horrible 
> 3TB drive to ddrescue to.  The question: can I get a different 3TB drive 
> (e.g. Toshiba P300 3TB, 
> https://www.amazon.co.uk/Toshiba-P300-7200RPM-SATA-Drive/dp/B0151KM6F0), 
> or are sizes of 3TB slightly different enough for that to cause me 
> headaches when adding it back into the array?  If the latter is the 
> case, then perhaps I need to aim for a 4TB drive replacement...
> 
> Thanks,
> Allie
> 

Any 3TB drive should be exactly the same size. Somewhere around the 1TB
drive size they stopped using different sizes and standardised across
the industry.
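
As a rough cross-check, the standardisation I have in mind is the IDEMA LBA1-03 capacity formula, which fixes the sector count for a given advertised capacity. A minimal sketch for 512-byte logical sectors follows; the function name is made up, and whether a particular drive follows the formula is an assumption worth verifying against the real device.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sector count (512-byte logical sectors) that the IDEMA LBA1-03 formula
 * assigns to a drive advertised as 'gb' gigabytes (1 GB = 10^9 bytes).
 */
uint64_t idema_lba_count(uint64_t gb)
{
	assert(gb >= 50);	/* the (gb - 50) term would underflow below 50 */
	return 97696368ULL + 1953504ULL * (gb - 50);
}
```

For 3000 GB this gives 5860533168 sectors. Comparing that with the output of "blockdev --getsz" on the existing array members should show whether a replacement 3TB drive will match.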

Cheers,
    Robin
-- 
     ___        
    ( ' }     |       Robin Hill        <robin@robinhill.me.uk> |
   / / )      | Little Jim says ....                            |
  // !!       |      "He fallen in de water !!"                 |

^ permalink raw reply

* Re: [PATCH 00/60] block: support multipage bvec
From: Christoph Hellwig @ 2016-10-31 15:25 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-kernel, linux-block, linux-fsdevel,
	Christoph Hellwig, Kirill A . Shutemov, Al Viro, Andrew Morton,
	Bart Van Assche, open list:GFS2 FILE SYSTEM, Coly Li,
	Dan Williams, open list:DEVICE-MAPPER  (LVM),
	open list:DRBD DRIVER, Eric Wheeler, Guoqing Jiang,
	Hannes Reinecke, Hannes Reinecke, Jiri Kosina, Joe Perches,
	Johannes Berg, Johannes Thumshirn, Keith Busch
In-Reply-To: <1477728600-12938-1-git-send-email-tom.leiming@gmail.com>

Hi Ming,

can you send a first patch just doing the obvious cleanups like
converting to bio_add_page and replacing direct poking into the
bio with the proper accessors?  That should help reduce the
actual series to a sane size, and it should also help to cut
down the Cc list.


^ permalink raw reply

* Re: [PATCH 09/60] dm: dm.c: replace 'bio->bi_vcnt == 1' with !bio_multiple_segments
From: Christoph Hellwig @ 2016-10-31 15:29 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-kernel, linux-block, linux-fsdevel,
	Christoph Hellwig, Kirill A . Shutemov, Alasdair Kergon,
	Mike Snitzer, maintainer:DEVICE-MAPPER (LVM), Shaohua Li,
	open list:SOFTWARE RAID (Multiple Disks) SUPPORT
In-Reply-To: <1477728600-12938-10-git-send-email-tom.leiming@gmail.com>

On Sat, Oct 29, 2016 at 04:08:08PM +0800, Ming Lei wrote:
> Avoid accessing .bi_vcnt directly, because it may no longer be what
> the driver expects once multipage bvecs are supported.
> 
> Signed-off-by: Ming Lei <tom.leiming@gmail.com>

It would be really nice to have a comment in the code explaining why
it's even checking for multiple segments.


^ permalink raw reply

* Re: Fail to assemble raid4 with replaced disk
From: Wols Lists @ 2016-10-31 15:57 UTC (permalink / raw)
  To: Santiago DIEZ; +Cc: Linux Raid LIST
In-Reply-To: <CAJh8RqXV4_v-zAu9Us_K6oTS_N_bR6vd97G4msN00qXn_vtXEQ@mail.gmail.com>

On 27/10/16 15:11, Santiago DIEZ wrote:
> Hi,
> 
> Indeed, here is what I had in terms of event count:
> /dev/sda10: 81589
> /dev/sdb10: 81626
> /dev/sdc10: 81589
> 
> Then the following procedure worked quite straightforward:
> --------------------------------------------------------------------------------
> # mdadm --assemble /dev/md10 --verbose --force /dev/sda10 /dev/sdb10 /dev/sdc10
> # mdadm --manage /dev/md10 --add /dev/sdd10
> --------------------------------------------------------------------------------
> 
> And 6h+ later:
> --------------------------------------------------------------------------------
> # cat /proc/mdstat
> Personalities : [raid1] [raid6] [raid5] [raid4]
> md10 : active raid5 sdd10[3] sda10[0] sdc10[2] sdb10[1]
>       5778741888 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
> --------------------------------------------------------------------------------
> 
> Then I ran:
> --------------------------------------------------------------------------------
> # e2fsck -f -n -t -v /dev/md10
> e2fsck 1.42.5 (29-Jul-2012)
> Pass 1: Checking inodes, blocks, and sizes
> Pass 2: Checking directory structure
> Pass 3: Checking directory connectivity
> Pass 4: Checking reference counts
> Pass 5: Checking group summary information
> 
>     15675837 inodes used (4.34%, out of 361177088)
>       188798 non-contiguous files (1.2%)
>        14751 non-contiguous directories (0.1%)
>              # of inodes with ind/dind/tind blocks: 0/0/0
>              Extent depth histogram: 15626455/47037/15
>   1281308341 blocks used (88.69%, out of 1444685472)
>            0 bad blocks
>          101 large files
> 
>     15311457 regular files
>       361754 directories
>            0 character device files
>            0 block device files
>            0 fifos
>            0 links
>         2607 symbolic links (2310 fast symbolic links)
>           10 sockets
> ------------
>     15675828 files
> Memory used: 50976k/1912k (20541k/30436k), time: 1304.00/334.06/ 8.00
> I/O read: 4891MB, write: 0MB, rate: 3.75MB/s
> --------------------------------------------------------------------------------
> 
> Does it look OK enough to launch the mount?
> 
sorry - I've been away for the weekend - daughter's wedding :-)

But yes, that looks great. No errors on fsck either, I think :-)

I think your array looks fine. Just look at the output from smartctl for
your old drives and make sure that it doesn't look like another drive is
going to fail soon. I'm not quite sure what to look for, mostly bad
blocks and reallocated sectors, I think, but if you compare it with your
new drive and stuff looks dodgy, you can always ask for help.

Cheers,
Wol


^ permalink raw reply

* "SCT Error Recovery Control" disabled by suspend.
From: Doug Herr @ 2016-10-31 16:16 UTC (permalink / raw)
  To: linux-raid

Mostly just a note for others like myself who are using software RAID in
a more "desktop" setting. Short version: "Oops, I forgot that the
drives needed to have volatile settings reset on wake up..."

Some months back I learned about setting smartctl -l scterc /dev/sd[ab]
in rc.local. This is a home desktop system with Fedora 24 using two
1TB drives in RAID 1.

The other day I checked in on this group to see if there is anything new 
I should be aware of and found a thread that had me double checking my 
work and I found:

>sudo smartctl -l scterc /dev/sda
[sudo] password for doug: 
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.7.9-200.fc24.x86_64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

SCT Error Recovery Control:
           Read: Disabled
          Write: Disabled

>sudo smartctl -l scterc /dev/sdb
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.7.9-200.fc24.x86_64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

SCT Error Recovery Control:
           Read: Disabled
          Write: Disabled

I verified that the rc.local entries worked correctly and I rebooted to 
make sure that the settings were going in on reboot.  I am using:

>grep smartctl /etc/rc.d/rc.local 
/sbin/smartctl -l scterc,70,70 /dev/sda
/sbin/smartctl -l scterc,70,70 /dev/sdb

When I discovered them being "Disabled" I had a six day uptime so I let 
it sit with a note to recheck periodically and this morning I found that 
it is disabled again. After getting a Raspberry Pi I no longer run my 
desktop like a server and am currently using "systemctl suspend" every 
day. I just confirmed that the scterc setting gets lost during suspend.
I also realize that this should have been obvious since the drive is 
fully powered off by suspend and only the RAM is powered (at least that 
is my understanding).

I already added something to be run at wake up to "tickle" the screen 
saver so I will see about adding this.

I would be happy to learn about any best practices regarding this.

-- 
Doug Herr 

^ permalink raw reply

* Re: recovering failed raid5
From: Wols Lists @ 2016-10-31 16:26 UTC (permalink / raw)
  To: Alexander Shenkin, linux-raid, Andreas Klauer, rm, robin
In-Reply-To: <af2bf92d-c944-2269-1925-50baa44755a2@shenkin.org>

On 31/10/16 10:44, Alexander Shenkin wrote:
> Thanks to everyone for their input.  I need to get a new, non-horrible
> 3TB drive to ddrescue to.  The question: can I get a different 3TB drive
> (e.g. Toshiba P300 3TB,
> https://www.amazon.co.uk/Toshiba-P300-7200RPM-SATA-Drive/dp/B0151KM6F0),
> or are sizes of 3TB slightly different enough for that to cause me
> headaches when adding it back into the array?  If the latter is the
> case, then perhaps I need to aim for a 4TB drive replacement...

Can't speak for that drive, but I recently got myself a laptop 2TB
Toshiba drive. Was pleasantly surprised to discover it supported SCT/ERC
(and they make a 3TB 2.5" drive too, just don't try sticking it in a
laptop as it won't fit :-)

Cheers,
Wol

^ permalink raw reply

* Re: recovering failed raid5
From: Wols Lists @ 2016-10-31 16:28 UTC (permalink / raw)
  To: Alexander Shenkin, linux-raid, Andreas Klauer, rm, robin
In-Reply-To: <20161028133626.GA27462@cthulhu.home.robinhill.me.uk>

On 28/10/16 14:36, Robin Hill wrote:
> Unfortunately, reconstruction of the array depends on this data being
> readable, so the fact the drive isn't toast doesn't necessarily help.
> I'd suggest replicating (using ddrescue) that drive to the new one (when
> it arrives) as a first step. It's possible ddrescue will manage to read
> the data (it'll make several attempts, so can sometimes read data that
> fails initially), otherwise you'll end up with some missing data
> (possibly corrupt files, possibly corrupt filesystem metadata, possibly
> just a bit of extra noise in an audio/video file). Once that's done, you
> can do a proper check on sdc (e.g. a badblocks read/write test), which
> will either lead to sector actually being reallocated, or to clearing
> the pending reallocations. Unless you get a lot more reallocated sectors
> than are currently pending, you can put the drive back into use if you
> like (bearing in mind the reputation of these drives and weighing the
> replacement cost against the value of your data).

Read the linux raid wiki - the page about programming projects at the
bottom.

If ddrescue fails to do a complete, okay copy, then maybe you or someone
near you has the smarts to do that little project. Then you can stick
your newly copied drive back knowing that the raid at least has a chance
of reconstructing your data without error.

Cheers,
Wol

^ permalink raw reply

* Re: recovering failed raid5
From: Wols Lists @ 2016-10-31 16:31 UTC (permalink / raw)
  To: Alexander Shenkin, linux-raid; +Cc: Andreas Klauer, rm, robin
In-Reply-To: <715b259f-1e56-9606-edc4-3e5c4d57744b@shenkin.org>

On 28/10/16 13:22, Alexander Shenkin wrote:
>> But... why? Why disable smart? And if you do, is it a surprise that you
>> only notice disk failures when it's already too late?
> 
> yeah, i asked myself that same question.  there was probably some reason
> I did, but i don't remember what it was.  i'll keep smart enabled from
> now on...

I bet he didn't disable SMART - bear in mind I also have two 3TB
Barracudas ... and they lose their settings at power-off/on.

If he didn't set the boot process to explicitly turn smart on, EVERY
BOOT, then by default it's off.

Cheers,
Wol

^ permalink raw reply

* Re: [PATCH 1/8] imsm: parse bad block log in metadata on startup
From: Jes Sorensen @ 2016-10-31 18:02 UTC (permalink / raw)
  To: Tomasz Majchrzak; +Cc: linux-raid
In-Reply-To: <1477925454-16809-1-git-send-email-tomasz.majchrzak@intel.com>

Tomasz Majchrzak <tomasz.majchrzak@intel.com> writes:
> Always allocate memory for all log entries to avoid a need for memory
> allocation when monitor requests to record a bad block.
>
> Also some extra checks added to make static code analyzer happy.
>
> Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
> ---
>  super-intel.c | 158 +++++++++++++++++++++++++++++++++++++++++-----------------
>  1 file changed, 112 insertions(+), 46 deletions(-)

Thanks - just a note, I am at Linux Plumbers in Santa Fe this week, and
on PTO next week, so it'll probably be two weeks before I will have
time to look at this.

If you haven't heard from me by three weeks from now, please feel free
to turn on all caps and yell at me.

Cheers,
Jes

> diff --git a/super-intel.c b/super-intel.c
> index c146bbd..5d6d534 100644
> --- a/super-intel.c
> +++ b/super-intel.c
> @@ -217,22 +217,24 @@ struct imsm_super {
>  } __attribute__ ((packed));
>  
>  #define BBM_LOG_MAX_ENTRIES 254
> +#define BBM_LOG_MAX_LBA_ENTRY_VAL 256		/* Represents 256 LBAs */
> +#define BBM_LOG_SIGNATURE 0xABADB10C
> +
> +struct bbm_log_block_addr {
> +	__u16 w1;
> +	__u32 dw1;
> +} __attribute__ ((__packed__));
>  
>  struct bbm_log_entry {
> -	__u64 defective_block_start;
> -#define UNREADABLE 0xFFFFFFFF
> -	__u32 spare_block_offset;
> -	__u16 remapped_marked_count;
> -	__u16 disk_ordinal;
> +	__u8 marked_count;		/* Number of blocks marked - 1 */
> +	__u8 disk_ordinal;		/* Disk entry within the imsm_super */
> +	struct bbm_log_block_addr defective_block_start;
>  } __attribute__ ((__packed__));
>  
>  struct bbm_log {
>  	__u32 signature; /* 0xABADB10C */
>  	__u32 entry_count;
> -	__u32 reserved_spare_block_count; /* 0 */
> -	__u32 reserved; /* 0xFFFF */
> -	__u64 first_spare_lba;
> -	struct bbm_log_entry mapped_block_entries[BBM_LOG_MAX_ENTRIES];
> +	struct bbm_log_entry marked_block_entries[BBM_LOG_MAX_ENTRIES];
>  } __attribute__ ((__packed__));
>  
>  #ifndef MDASSEMBLE
> @@ -785,6 +787,92 @@ static struct imsm_dev *get_imsm_dev(struct intel_super *super, __u8 index)
>  	return NULL;
>  }
>  
> +#if BYTE_ORDER == LITTLE_ENDIAN
> +static inline unsigned long long __le48_to_cpu(const struct bbm_log_block_addr
> +					       *addr)
> +{
> +	return ((((__u64)addr->dw1) << 16) | addr->w1);
> +}
> +
> +static inline struct bbm_log_block_addr __cpu_to_le48(unsigned long long sec)
> +{
> +	struct bbm_log_block_addr addr;
> +
> +	addr.w1 =  (__u16)(sec & 0xFFFF);
> +	addr.dw1 = (__u32)((sec >> 16) & 0xFFFFFFFF);
> +	return addr;
> +}
> +#elif BYTE_ORDER == BIG_ENDIAN
> +static inline unsigned long long __le48_to_cpu(const struct bbm_log_block_addr
> +					       *addr)
> +{
> +	return ((((__u64)__le32_to_cpu(addr->dw1)) << 16) |
> +		__le16_to_cpu(addr->w1));
> +}
> +
> +static inline struct bbm_log_block_addr __cpu_to_le48(unsigned long long sec)
> +{
> +	struct bbm_log_block_addr addr;
> +
> +	addr.w1 =  __cpu_to_le16((__u16)(sec & 0xFFFF));
> +	addr.dw1 = __cpu_to_le32((__u32)(sec >> 16) & 0xFFFFFFFF);
> +	return addr;
> +}
> +#else
> +#  error "unknown endianness."
> +#endif
> +
> +#ifndef MDASSEMBLE
> +/* get size of the bbm log */
> +static __u32 get_imsm_bbm_log_size(struct bbm_log *log)
> +{
> +	if (!log || log->entry_count == 0)
> +		return 0;
> +
> +	return sizeof(log->signature) +
> +		sizeof(log->entry_count) +
> +		log->entry_count * sizeof(struct bbm_log_entry);
> +}
> +#endif /* MDASSEMBLE */
> +
> +/* allocate and load BBM log from metadata */
> +static int load_bbm_log(struct intel_super *super)
> +{
> +	struct imsm_super *mpb = super->anchor;
> +	__u32 bbm_log_size =  __le32_to_cpu(mpb->bbm_log_size);
> +
> +	super->bbm_log = malloc(sizeof(struct bbm_log));
> +	if (!super->bbm_log)
> +		return 1;
> +
> +	if (bbm_log_size) {
> +		struct bbm_log *log = (void *)mpb +
> +			__le32_to_cpu(mpb->mpb_size) - bbm_log_size;
> +		__u32 entry_count;
> +
> +		if (bbm_log_size < sizeof(log->signature) +
> +		    sizeof(log->entry_count))
> +			return 2;
> +
> +		entry_count = __le32_to_cpu(log->entry_count);
> +		if ((__le32_to_cpu(log->signature) != BBM_LOG_SIGNATURE) ||
> +		    (entry_count > BBM_LOG_MAX_ENTRIES))
> +			return 3;
> +
> +		if (bbm_log_size !=
> +		    sizeof(log->signature) + sizeof(log->entry_count) +
> +		    entry_count * sizeof(struct bbm_log_entry))
> +			return 4;
> +
> +		memcpy(super->bbm_log, log, bbm_log_size);
> +	} else {
> +		super->bbm_log->signature = __cpu_to_le32(BBM_LOG_SIGNATURE);
> +		super->bbm_log->entry_count = 0;
> +	}
> +
> +	return 0;
> +}
> +
>  /*
>   * for second_map:
>   *  == MAP_0 get first map
> @@ -1433,7 +1521,7 @@ static void examine_super_imsm(struct supertype *st, char *homehost)
>  	printf("          Disks : %d\n", mpb->num_disks);
>  	printf("   RAID Devices : %d\n", mpb->num_raid_devs);
>  	print_imsm_disk(__get_imsm_disk(mpb, super->disks->index), super->disks->index, reserved);
> -	if (super->bbm_log) {
> +	if (get_imsm_bbm_log_size(super->bbm_log)) {
>  		struct bbm_log *log = super->bbm_log;
>  
>  		printf("\n");
> @@ -1441,9 +1529,6 @@ static void examine_super_imsm(struct supertype *st, char *homehost)
>  		printf("       Log Size : %d\n", __le32_to_cpu(mpb->bbm_log_size));
>  		printf("      Signature : %x\n", __le32_to_cpu(log->signature));
>  		printf("    Entry Count : %d\n", __le32_to_cpu(log->entry_count));
> -		printf("   Spare Blocks : %d\n",  __le32_to_cpu(log->reserved_spare_block_count));
> -		printf("    First Spare : %llx\n",
> -		       (unsigned long long) __le64_to_cpu(log->first_spare_lba));
>  	}
>  	for (i = 0; i < mpb->num_raid_devs; i++) {
>  		struct mdinfo info;
> @@ -3628,19 +3713,6 @@ static int parse_raid_devices(struct intel_super *super)
>  	return 0;
>  }
>  
> -/* retrieve a pointer to the bbm log which starts after all raid devices */
> -struct bbm_log *__get_imsm_bbm_log(struct imsm_super *mpb)
> -{
> -	void *ptr = NULL;
> -
> -	if (__le32_to_cpu(mpb->bbm_log_size)) {
> -		ptr = mpb;
> -		ptr += mpb->mpb_size - __le32_to_cpu(mpb->bbm_log_size);
> -	}
> -
> -	return ptr;
> -}
> -
>  /*******************************************************************************
>   * Function:	check_mpb_migr_compatibility
>   * Description:	Function checks for unsupported migration features:
> @@ -3790,12 +3862,6 @@ static int load_imsm_mpb(int fd, struct intel_super *super, char *devname)
>  		return 3;
>  	}
>  
> -	/* FIXME the BBM log is disk specific so we cannot use this global
> -	 * buffer for all disks.  Ok for now since we only look at the global
> -	 * bbm_log_size parameter to gate assembly
> -	 */
> -	super->bbm_log = __get_imsm_bbm_log(super->anchor);
> -
>  	return 0;
>  }
>  
> @@ -3839,6 +3905,9 @@ load_and_parse_mpb(int fd, struct intel_super *super, char *devname, int keep_fd
>  	if (err)
>  		return err;
>  	err = parse_raid_devices(super);
> +	if (err)
> +		return err;
> +	err = load_bbm_log(super);
>  	clear_hi(super);
>  	return err;
>  }
> @@ -3903,6 +3972,8 @@ static void __free_imsm(struct intel_super *super, int free_disks)
>  		free(elem);
>  		elem = next;
>  	}
> +	if (super->bbm_log)
> +		free(super->bbm_log);
>  	super->hba = NULL;
>  }
>  
> @@ -4508,7 +4579,7 @@ static int get_super_block(struct intel_super **super_list, char *devnm, char *d
>  		*super_list = s;
>  	} else {
>  		if (s)
> -			free(s);
> +			free_imsm(s);
>  		if (dfd >= 0)
>  			close(dfd);
>  	}
> @@ -4570,6 +4641,8 @@ static int load_super_imsm(struct supertype *st, int fd, char *devname)
>  	free_super_imsm(st);
>  
>  	super = alloc_super();
> +	if (!super)
> +		return 1;
>  	/* Load hba and capabilities if they exist.
>  	 * But do not preclude loading metadata in case capabilities or hba are
>  	 * non-compliant and ignore_hw_compat is set.
> @@ -4912,7 +4985,7 @@ static int init_super_imsm(struct supertype *st, mdu_array_info_t *info,
>  
>  	super = alloc_super();
>  	if (super && posix_memalign(&super->buf, 512, mpb_size) != 0) {
> -		free(super);
> +		free_imsm(super);
>  		super = NULL;
>  	}
>  	if (!super) {
> @@ -4922,7 +4995,7 @@ static int init_super_imsm(struct supertype *st, mdu_array_info_t *info,
>  	if (posix_memalign(&super->migr_rec_buf, 512, MIGR_REC_BUF_SIZE) != 0) {
>  		pr_err("could not allocate migr_rec buffer\n");
>  		free(super->buf);
> -		free(super);
> +		free_imsm(super);
>  		return 0;
>  	}
>  	memset(super->buf, 0, mpb_size);
> @@ -5489,11 +5562,6 @@ static int store_super_imsm(struct supertype *st, int fd)
>  #endif
>  }
>  
> -static int imsm_bbm_log_size(struct imsm_super *mpb)
> -{
> -	return __le32_to_cpu(mpb->bbm_log_size);
> -}
> -
>  #ifndef MDASSEMBLE
>  static int validate_geometry_imsm_container(struct supertype *st, int level,
>  					    int layout, int raiddisks, int chunk,
> @@ -5529,6 +5597,10 @@ static int validate_geometry_imsm_container(struct supertype *st, int level,
>  	 * note that there is no fd for the disks in array.
>  	 */
>  	super = alloc_super();
> +	if (!super) {
> +		close(fd);
> +		return 0;
> +	}
>  	rv = find_intel_hba_capability(fd, super, verbose > 0 ? dev : NULL);
>  	if (rv != 0) {
>  #if DEBUG
> @@ -6760,12 +6832,6 @@ static struct mdinfo *container_content_imsm(struct supertype *st, char *subarra
>  		pr_err("Unsupported attributes in IMSM metadata.Arrays activation is blocked.\n");
>  	}
>  
> -	/* check for bad blocks */
> -	if (imsm_bbm_log_size(super->anchor)) {
> -		pr_err("BBM log found in IMSM metadata.Arrays activation is blocked.\n");
> -		sb_errors = 1;
> -	}
> -
>  	/* count spare devices, not used in maps
>  	 */
>  	for (d = super->disks; d; d = d->next)

^ permalink raw reply

* Re: Buffer I/O error on dev md5, logical block 7073536, async page read
From: Wols Lists @ 2016-10-31 19:24 UTC (permalink / raw)
  To: Marc MERLIN, Andreas Klauer; +Cc: linux-raid
In-Reply-To: <20161030164342.GC28648@merlins.org>

On 30/10/16 16:43, Marc MERLIN wrote:
> And there isn't one good drive between the 2, the bad blocks are identical on
> both drives and must have happened at the same time due to those
> cable-induced IO errors I mentioned.
> Too bad that mdadm doesn't seem to account for the fact that it could be
> wrong when marking blocks as bad and does not seem to give a way to recover
> from this easily....
> I'll do more reading, thanks.

Reading the list, I've picked up that somehow badblocks seem to get
propagated from one drive to another. So if one drive gets a badblock,
that seems to get marked as bad on other drives too :-(

Oh - and as for badblocks being obsolete, isn't there a load of work
being done on it at the moment? For hardware raid I believe, which
presumably does not handle badblocks the way Phil thinks all modern
drives do? (Not surprising - hardware raid is regularly slated for being
buggy and not a good idea, this is probably more of the same...)

Cheers,
Wol

^ permalink raw reply

* Re: [ LR] Kernel 4.8.4: INFO: task kworker/u16:8:289 blocked for more than 120 seconds.
From: Wols Lists @ 2016-10-31 19:29 UTC (permalink / raw)
  To: TomK, linux-raid
In-Reply-To: <73e35e17-80aa-c7e6-535c-3665d9789e16@mdevsys.com>

On 30/10/16 18:56, TomK wrote:
> 
> We did not do a thorough R/W test to see how the error and bad disk
> affected the data stored on the array, but we did notice pauses,
> slowdowns and general difficulty reading data on the CIFS share
> presented from it; there were no data errors that we could see, however.
> Since then we replaced the 2TB Seagate with a new 2TB WD and everything
> is fine even if the array is degraded.  But as soon as we put in this
> bad disk, it degraded to its previous behaviour.  Yet the array didn't
> catch it as a failed disk until the disk was nearly completely
> inaccessible.

What is this 2TB Seagate? A Barracuda? There's your problem, quite
possibly. Sounds like you've got your timeouts correctly matched, so
this drive is responding, but taking ages to do so. And that's why it
doesn't get kicked, but it knackers system response times - the kernel
is correctly configured to wait for the geriatric to respond.
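
A minimal sketch of how the "timeouts matched" condition could be checked by
hand. It assumes the disk exposes the usual sysfs attribute
/sys/block/<disk>/device/timeout, and that the drive's error recovery limit
(the value `smartctl -l scterc` reports, in tenths of a second) is read
separately and passed in. The function names and the SYSFS_ROOT override are
hypothetical, introduced only for illustration.

```shell
#!/bin/sh
# Print the kernel's command timeout in seconds for a disk such as "sda".
# SYSFS_ROOT is a hypothetical override, present only so the function can be
# tried against a fake directory tree instead of real hardware.
kernel_timeout() {
    cat "${SYSFS_ROOT:-/sys}/block/$1/device/timeout"
}

# Succeed if the kernel waits longer than the drive's error recovery limit.
# $1: drive recovery limit in tenths of a second (the unit used by SCT ERC)
# $2: kernel timeout in seconds
timeouts_matched() {
    [ $(( $2 * 10 )) -gt "$1" ]
}

# Example: the drive gives up after 7.0 s and the kernel timeout is read
# from sysfs.
#   timeouts_matched 70 "$(kernel_timeout sda)" && echo matched
```

When the check succeeds for a drive that still takes ages to answer, the
drive is not kicked and the array simply waits for it, which would be
consistent with the behaviour described above.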

Cheers,
Wol

^ permalink raw reply

* Re: Panicked and deleted superblock
From: Peter Hoffmann @ 2016-10-31 22:36 UTC (permalink / raw)
  To: linux-raid
In-Reply-To: <20161030211103.GA7196@metamorpher.de>

I'm a bit confused using the overlay function.

(A) If I follow the manual on the wiki [1], the new raid device md0
seemingly just contains random chunks of bytes, nothing like the header of
the original decrypted device.
(B) But if I copy, say, 400M of each original drive, at least the hexdump
looks like the head of an ext4 file system, which is what I expected from
looking at the original decrypted device.

Steps for (A):

    for ((i=0; $i < 3; i++ )); do
      dev=/dev/mapper/HDD_$i

      dd if=/dev/zero of=overlay-$i bs=4M count=100 # alternative 1
      # truncate -s1850G overlay-$i # alternative 2

      loop=$(losetup -f --show -- overlay-$i)
      echo 0 $(blockdev --getsize $dev) snapshot $dev $loop P 8 | \
        dmsetup create HDD_overlay_0_$i
    done
    mdadm --create --assume-clean --level=5 --raid-devices=4 /dev/md0 \
      /dev/mapper/HDD_overlay_0_[012] missing

Steps for (B):

    for ((i=0; $i < 3; i++ )); do
      dd if=/dev/mapper/HDD_$i of=copy-$i bs=4M count=100
      loops="$loops $(losetup -f --show -- copy-$i)"
    done
    mdadm --create --assume-clean --level=5 --raid-devices=4 /dev/md1 \
      $loops missing
	
Both ways should look exactly the same at least for the first 1200M,
shouldn't they?
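
A minimal sketch of how that expectation could be tested directly instead of
eyeballing hexdumps. It assumes GNU cmp (for the -n byte limit); the function
name same_prefix is hypothetical, and /dev/md0 and /dev/md1 are simply the two
arrays created in the steps for (A) and (B).

```shell
#!/bin/sh
# Compare the first LIMIT bytes of two devices or files.
# Exit status 0: the prefixes are identical; 1: they differ; 2: cmp failed
# (for example, a file could not be opened).
# $1, $2: paths to compare   $3: LIMIT in bytes
same_prefix() {
    # -s: print nothing, only set the exit status
    # -n: stop after LIMIT bytes (GNU cmp)
    cmp -s -n "$3" -- "$1" "$2"
}

# Example (reading the md devices needs root):
#   same_prefix /dev/md0 /dev/md1 $((1200 * 1024 * 1024)) && echo same || echo differ
```

If the prefixes differ, one thing that may be worth checking (an assumption,
not a diagnosis) is whether `mdadm --examine` reports the same Data Offset for
the members of both arrays, since the overlays and the 400M copies are
different sizes.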

Greetings
P. Hoffmann


[1]
https://raid.wiki.kernel.org/index.php/Recovering_a_failed_software_RAID#Making_the_harddisks_read-only_using_an_overlay_file

Am 30.10.2016 um 22:11 schrieb Andreas Klauer:
> On Sun, Oct 30, 2016 at 09:45:27PM +0100, Peter Hoffmann wrote:
>> nothing should be lost, as growing consumes more
>> than it writes, stripe wise speaking
> 
> That's what I meant by 'overlap' - it's the wrong word I guess.
> 
>> /dev/sda2 --luks--> /dev/mapper/HDD_0 \
>> /dev/sdb2 --luks--> /dev/mapper/HDD_1 --raid--> /dev/md127 -ext4-> /raid
>> /dev/sdc2 --luks--> /dev/mapper/HDD_2 /
> 
> You're hoping it'll be faster with three threads instead of one?
> Adds the overhead of encrypting parity. Not sure if worth it.
> This idea belongs to another era (before AES-NI).
> 
> But it's good, that way, you have "unencrypted" data on your RAID and can 
> make deductions from that raw data as to chunk size and such things. 
> 
>> * anything else?
> 
> This is where I don't know how to provide specific help.
> Since you did not provide specific data I can work with.
> Your data offset sounds strange to me but with overlay, 
> it's faster to just go ahead and try.
> 
> You'll have to figure out the details by yourself, pretty much.
> 
> Once you have the correct offset you might be able to deduce the other
> offset. Create 4 loop devices size of your disks (sparse files in tmpfs, 
> truncate -s thefile, losetup), create a 3 disk raid, grow to 4 disks, 
> check with mdadm --examine if & how the data offset changed.
> 
>> So I'm looking for a sequence of bytes that is duplicated on both
>> overlays. This way I find the border between both parts.
> 
> Yes, there should be an identical region (let's hope not zeroes)
> and you should roughly determine the end of that region and that's 
> your entry point for a linear device mapping.
> 
> Regards
> Andreas Klauer
> 

^ permalink raw reply

* Re: [PATCH 00/60] block: support multipage bvec
From: Ming Lei @ 2016-10-31 22:52 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Linux Kernel Mailing List, linux-block,
	Linux FS Devel, Kirill A . Shutemov, Al Viro, Andrew Morton,
	Bart Van Assche, open list:GFS2 FILE SYSTEM, Coly Li,
	Dan Williams, open list:DEVICE-MAPPER (LVM),
	open list:DRBD DRIVER, Eric Wheeler, Guoqing Jiang,
	Hannes Reinecke, Hannes Reinecke, Jiri Kosina, Joe Perches,
	Johannes Berg, Johannes Thumshirn, Kei
In-Reply-To: <20161031152519.GA25791@infradead.org>

On Mon, Oct 31, 2016 at 11:25 PM, Christoph Hellwig <hch@infradead.org> wrote:
> Hi Ming,
>
> can you send a first patch just doing the obvious cleanups like
> converting to bio_add_page and replacing direct poking into the
> bio with the proper accessors?  That should help reducing the

OK, that is just the 1st part of the patchset.

> actual series to a sane size, and it should also help to cut
> down the Cc list.
>



Thanks,
Ming Lei


^ permalink raw reply

* Re: [PATCH 09/60] dm: dm.c: replace 'bio->bi_vcnt == 1' with !bio_multiple_segments
From: Ming Lei @ 2016-10-31 22:59 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Linux Kernel Mailing List, linux-block,
	Linux FS Devel, Kirill A . Shutemov, Alasdair Kergon,
	Mike Snitzer, maintainer:DEVICE-MAPPER (LVM), Shaohua Li,
	open list:SOFTWARE RAID (Multiple Disks) SUPPORT
In-Reply-To: <20161031152901.GD30919@infradead.org>

On Mon, Oct 31, 2016 at 11:29 PM, Christoph Hellwig <hch@infradead.org> wrote:
> On Sat, Oct 29, 2016 at 04:08:08PM +0800, Ming Lei wrote:
>> Avoid accessing .bi_vcnt directly, because it may no longer be what
>> the driver expects once multipage bvecs are supported.
>>
>> Signed-off-by: Ming Lei <tom.leiming@gmail.com>
>
> It would be really nice to have a comment in the code why it's
> even checking for multiple segments.
>

OK, I will add a comment explaining why !bio_multiple_segments(rq->bio)
is used in place of 'rq->bio->bi_vcnt == 1'.


Thanks,
Ming Lei

^ permalink raw reply
