* Re: Fail to assemble raid4 with replaced disk
From: Mikael Abrahamsson @ 2016-10-25 17:35 UTC
To: Santiago DIEZ; +Cc: Linux Raid LIST
In-Reply-To: <CAJh8RqUjr7L_Of0fbW_mAXshmmgcPdXvhtsvhFXKF+XxjOTrFw@mail.gmail.com>
On Tue, 25 Oct 2016, Santiago DIEZ wrote:
> # mdadm --assemble /dev/md10 --verbose /dev/sda10 /dev/sdb10 /dev/sdc10
> mdadm: looking for devices for /dev/md10
> mdadm: /dev/sda10 is identified as a member of /dev/md10, slot 0.
> mdadm: /dev/sdb10 is identified as a member of /dev/md10, slot 1.
> mdadm: /dev/sdc10 is identified as a member of /dev/md10, slot 2.
> mdadm: added /dev/sda10 to /dev/md10 as 0 (possibly out of date)
> mdadm: added /dev/sdc10 to /dev/md10 as 2 (possibly out of date)
> mdadm: no uptodate device for slot 3 of /dev/md10
> mdadm: added /dev/sdb10 to /dev/md10 as 1
> mdadm: /dev/md10 assembled from 1 drive - not enough to start the array.
This means sda10 and sdc10 most likely have a lower event count than
sdb10.
> I examined the status again with:
> # cat /proc/mdstat
> md10 : inactive sdb10[1](S) sdc10[2](S) sda10[0](S)
> 5778741888 blocks
>
> Now I'm SCARED!
> What does the (S) mean?
> How do I reassemble my array and add the new sdd10 partition?
Check with mdadm -E /dev/sd[abc]10 and compare the event counts. If they
differ just a little (5-10 perhaps), you can use --assemble --force to start
the array even though the event count is not exactly the same on each drive.
The event count is increased whenever the array metadata is updated (broadly,
on state changes rather than on every data write). After an unclean shutdown
that leaves the counts out of step, mdadm won't auto-assemble the drives; it
needs operator intervention to understand the situation and act accordingly.
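As a rough illustration of the event-count check described above, the
comparison could be sketched in C like this. The function names and the
cut-off of 10 are assumptions made for this sketch (the cut-off is only the
"5-10 perhaps" rule of thumb); the counts themselves would be read off the
Events field that mdadm -E prints for each member.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: gap between the highest and lowest event count. */
static uint64_t event_count_spread(const uint64_t *events, size_t n)
{
    uint64_t lo, hi;
    size_t i;

    assert(events != NULL && n > 0);
    lo = hi = events[0];
    for (i = 1; i < n; i++) {
        if (events[i] < lo)
            lo = events[i];
        if (events[i] > hi)
            hi = events[i];
    }
    return hi - lo;
}

/* Assumed cut-off of 10, taken from the "5-10 perhaps" rule of thumb. */
static int force_assemble_plausible(const uint64_t *events, size_t n)
{
    return event_count_spread(events, n) <= 10;
}
```

A small spread only suggests that --assemble --force is worth considering;
it is not a guarantee that the array contents are consistent.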
--
Mikael Abrahamsson email: swmike@swm.pp.se
* Re: Fail to assemble raid4 with replaced disk
From: Wols Lists @ 2016-10-25 17:50 UTC
To: Santiago DIEZ, Linux Raid LIST
In-Reply-To: <CAJh8RqUjr7L_Of0fbW_mAXshmmgcPdXvhtsvhFXKF+XxjOTrFw@mail.gmail.com>
On 25/10/16 18:08, Santiago DIEZ wrote:
> Hi Raiders,
This looks like a fairly simple recovery job - but you will probably
lose a little data - fsck will moan about a few new files being corrupted.
Firstly, DON'T DO ANYTHING WITH THE RAID.
Secondly, go to the linux raid wiki
https://raid.wiki.kernel.org/index.php/Linux_Raid and read section 4
"When things go wrogn". You've messed up replacing the failed drive, and
are now at "My raid won't assemble/run". But as I say, it doesn't look
particularly serious.
>
> I had a raid5 array md10 with sd[abcd]10.
> Eventually, sdd10 failed.
>
> I did NOT do any mdadm --fail NOR mdadm --remove command.
> What I did is comment out the line "ARRAY /dev/md10 ..." in
> /etc/mdadm/mdadm.conf.
mdadm.conf is somewhat of a relic from a bygone age, I believe. It used
to be necessary; in the new world of raid superblocks it is mostly
ignored and redundant.
>
> Then I powered off the server, replaced the disk sdd with a new one
> and booted the system.
>
> I examined the status with:
> # cat /proc/mdstat
> md10 : inactive sdb10[1]
> 1926247296 blocks
>
> I stopped the array with:
> # mdadm --stop /dev/md10
>
> I tried to assemble the array with the 3 original disks like this
> # mdadm --assemble /dev/md10 --verbose /dev/sda10 /dev/sdb10 /dev/sdc10
> mdadm: looking for devices for /dev/md10
> mdadm: /dev/sda10 is identified as a member of /dev/md10, slot 0.
> mdadm: /dev/sdb10 is identified as a member of /dev/md10, slot 1.
> mdadm: /dev/sdc10 is identified as a member of /dev/md10, slot 2.
> mdadm: added /dev/sda10 to /dev/md10 as 0 (possibly out of date)
> mdadm: added /dev/sdc10 to /dev/md10 as 2 (possibly out of date)
> mdadm: no uptodate device for slot 3 of /dev/md10
> mdadm: added /dev/sdb10 to /dev/md10 as 1
> mdadm: /dev/md10 assembled from 1 drive - not enough to start the array.
Okay. It's got three drives. When you've done what "Asking for help"
says, you should have event counts for all those three drives -
sd[abc]10. Hopefully they're all pretty much the same. If they are, a
simple "--assemble --force" should get your array up and running again.
The complaint about slot 3 is because you haven't removed the old sdd10,
and the new sdd10 isn't part of the array, it has no superblock.
>
> I examined the status again with:
> # cat /proc/mdstat
> md10 : inactive sdb10[1](S) sdc10[2](S) sda10[0](S)
> 5778741888 blocks
>
> Now I'm SCARED!
> What does the (S) mean?
> How do I reassemble my array and add the new sdd10 partition?
>
> Thanks for your help
>
Okay. That leaves your recovery path neatly mapped out. Get the event
count of the three remaining drives and post them here. Wait for an
expert to muck in and say it all looks good. Then:
1. Assemble the array with --force
2. Remove the old sdd10
3. Add the new sdd10
4. Run a fsck.
And your array should all be back fine. One thing - the wiki bangs on
about the timeout problem. Is that your problem? Because if it is you
will have grief trying to get the array back unless you fix that as your
very first step.
Cheers,
Wol
* Re: Problems with a RAID5 array
From: Andreas Klauer @ 2016-10-25 17:54 UTC
To: Nicolas Nicolaou; +Cc: linux-raid
In-Reply-To: <5C95474C-8A2A-4472-9629-2DD6B2143D28@GMAIL.COM>
On Tue, Oct 25, 2016 at 10:45:12AM +0300, Nicolas Nicolaou wrote:
> Below you can see the mdadm --examine information for the 4 drives.
The output doesn't seem to be formatted correctly, hard to read.
Also you posted only 2 of 4 disks, for two different RAIDs,
with different metadata (1.0 vs. 0.90), creation times...
Basically this is too confusing to say anything about it.
Did you pick the correct devices?
> /dev/sda3:
> Magic : a92b4efc
> Version : 1.0
> Array UUID : 11d32674:4247f385:74ee352b:5e4c22c7
> Creation Time : Wed Jan 9 02:29:02 2013
> Raid Level : raid5
> Raid Devices : 4
> /dev/sdb3:
> Magic : a92b4efc
> Version : 00.90.00
> UUID : 3fe6d5d4:5b9d61f2:4f7ddb81:e4ae2138
> Creation Time : Thu Jun 7 19:50:56 2012
> Raid Level : raid5
Regards
Andreas Klauer
* Re: dm block manager: use do/while(0) for empty macros
From: Mike Snitzer @ 2016-10-25 18:20 UTC
To: Arnd Bergmann
Cc: Joe Thornber, linux-kernel, linux-raid, dm-devel, Mikulas Patocka,
Shaohua Li, Alasdair Kergon
In-Reply-To: <20161025155434.21727-1-arnd@arndb.de>
On Tue, Oct 25 2016 at 11:54am -0400,
Arnd Bergmann <arnd@arndb.de> wrote:
> make W=1 reports a new warning for the dm-block-manager:
>
> drivers/md/persistent-data/dm-block-manager.c: In function ‘dm_bm_unlock’:
> drivers/md/persistent-data/dm-block-manager.c:598:3: error: suggest braces around empty body in an ‘else’ statement [-Werror=empty-body]
>
> This is completely harmless, but generally speaking it's a good idea to
> address this warning as it can often detect nasty bugs, and replacing
> empty macros with "do { } while (0)" is generally considered good style
> to make code more robust anyway.
>
> Fixes: f94bdb2e26b6 ("dm block manager: make block locking optional")
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Thanks Arnd, I'll fold this in. I suspected this was the right thing to
do. Strange that my build didn't say anything though, I used:
make C=1 CF=-D__CHECK_ENDIAN__ M=drivers/md
I'll be sure to use make W=1 in the future.
> ---
> drivers/md/persistent-data/dm-block-manager.c | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/md/persistent-data/dm-block-manager.c b/drivers/md/persistent-data/dm-block-manager.c
> index b619c383d88d..a6dde7cab458 100644
> --- a/drivers/md/persistent-data/dm-block-manager.c
> +++ b/drivers/md/persistent-data/dm-block-manager.c
> @@ -306,13 +306,13 @@ static void report_recursive_bug(dm_block_t b, int r)
>
> #else /* !CONFIG_DM_DEBUG_BLOCK_MANAGER_LOCKING */
>
> -#define bl_init(x)
> +#define bl_init(x) do { } while (0)
> #define bl_down_read(x) 0
> #define bl_down_read_nonblock(x) 0
> -#define bl_up_read(x)
> +#define bl_up_read(x) do { } while (0)
> #define bl_down_write(x) 0
> -#define bl_up_write(x)
> -#define report_recursive_bug(x, y)
> +#define bl_up_write(x) do { } while (0)
> +#define report_recursive_bug(x, y) do { } while (0)
>
> #endif /* CONFIG_DM_DEBUG_BLOCK_MANAGER_LOCKING */
>
> --
> 2.9.0
>
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel
* Re: [PATCH mdadm] raid6check.c: fix "misleading-indentation" error
From: NeilBrown @ 2016-10-25 19:50 UTC
To: Jes Sorensen, renyl; +Cc: LKP, linux-raid
In-Reply-To: <wrfjmvhsskou.fsf@redhat.com>
On Tue, Oct 25 2016, Jes Sorensen wrote:
> renyl <yilongx.ren@intel.com> writes:
>> From: Yilong Ren <yilongx.ren@intel.com>
>>
>> To fix the following error info:
>>
>> root@vm-lkp-nex04-8G-7 /tmp/mdadm# make test
>> cc -Wall -Werror -Wstrict-prototypes -Wextra -Wno-unused-parameter -ggdb -DSendmail=\""/usr/sbin/sendmail -t"\" -DCONFFILE=\"/etc/mdadm.conf\" -DCONFFILE2=\"/etc/mdadm/mdadm.conf\" -DMAP_DIR=\"/run/mdadm\" -DMAP_FILE=\"map\" -DMDMON_DIR=\"/run/mdadm\" -DFAILED_SLOTS_DIR=\"/run/mdadm/failed-slots\" -DNO_COROSYNC -DNO_DLM -DVERSION=\"3.4-43-g1dcee1c\" -DVERS_DATE="\"06th April 2016\"" -DUSE_PTHREADS -DBINDIR=\"/sbin\" -c -o raid6check.o raid6check.c
>> raid6check.c: In function 'manual_repair':
>> raid6check.c:267:4: error: this 'else' clause does not guard... [-Werror=misleading-indentation]
>> else
>> ^~~~
>> raid6check.c:269:5: note: ...this statement, but the latter is misleadingly indented as if it is guarded by the 'else'
>> printf("Repairing D(%d) and P\n", failed_data);
>> ^~~~~~
>> cc1: all warnings being treated as errors
>> <builtin>: recipe for target 'raid6check.o' failed
>> make: *** [raid6check.o] Error 1
>> root@vm-lkp-nex04-8G-7 /tmp/mdadm#
>>
>>
>> Cc: NeilBrown <neilb@suse.com>
>> Cc: linux-raid <linux-raid@vger.kernel.org>
>> Cc: LKP <lkp@eclists.intel.com>
>> Signed-off-by: Yilong Ren <yilongx.ren@intel.com>
>> ---
>> raid6check.c | 3 ++-
>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/raid6check.c b/raid6check.c
>> index ad7ffe7..acfc9a3 100644
>> --- a/raid6check.c
>> +++ b/raid6check.c
>> @@ -264,9 +264,10 @@ int manual_repair(int chunk_size, int syndrome_disks,
>> int failed_data;
>> if (failed_slot1 == -1)
>> failed_data = failed_slot2;
>> - else
>> + else {
>> failed_data = failed_slot1;
>> printf("Repairing D(%d) and P\n", failed_data);
>> + }
>> raid6_datap_recov(syndrome_disks+2, chunk_size,
>> failed_data, (uint8_t**)blocks, 1);
>> } else {
>
> Hi,
>
> I suspect this patch is wrong and the code is meant to print in either
> case.
>
> Neil?
Looking at the whole
if (failed_slot1 == -2 || failed_slot2 == -2) {
statement it looks like every path through should print a message, and
this section would be more similar to the previous section if, as you
say, the code prints in either case.
So I agree: original code is correct, indentation is wrong.
NeilBrown
>
> Cheers,
> Jes
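The behaviour Jes and Neil settle on (report on both paths, with the
statement indented to match) can be sketched as below. This is a
simplified stand-in rather than the real manual_repair(): the function
name is hypothetical, a counter replaces the printf so the effect is easy
to check, and the assert encodes an assumption made only for this sketch.

```c
#include <assert.h>

static int reports;   /* counts "Repairing ..." reports instead of printing */

/* Hypothetical reduction of the if/else under discussion. */
static int pick_failed_data(int failed_slot1, int failed_slot2)
{
    int failed_data;

    /* Sketch-only assumption: at least one slot is a real data index. */
    assert(failed_slot1 >= 0 || failed_slot2 >= 0);

    if (failed_slot1 == -1)
        failed_data = failed_slot2;
    else
        failed_data = failed_slot1;

    reports++;   /* outside the if/else, so it runs on either path */
    return failed_data;
}
```

Adding braces around the else branch, as the first patch did, would have
silenced the warning but changed behaviour: the report would then be
skipped whenever failed_slot1 is -1.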
* Re: [PATCH v5 7/8] md/r5cache: r5c recovery
From: NeilBrown @ 2016-10-26 1:18 UTC
To: Song Liu
Cc: linux-raid@vger.kernel.org, Shaohua Li, Kernel Team,
dan.j.williams@intel.com, hch@infradead.org,
liuzhengyuang521@gmail.com, liuzhengyuan@kylinos.cn
In-Reply-To: <3F2550E2-584B-495F-BD65-F25BC1C18BFC@fb.com>
On Sat, Oct 22 2016, Song Liu wrote:
>>> + /* stripes only have parity are already flushed to RAID */
>>> + if (data_count == 0)
>>> + goto out;
>>
>> Can you explain why that is? When were they flushed to the RAID, and
>> how was the parity determined?
>
> It happens like this: say two stripes on journal: 100 and 200. The data (D)
> and parity (P) pages are stored in journal as:
>
> ---> D100 D200 P100 P200 ----> newer data
>
> Before we flush D100, journal_start points at D100. Then we flush D100,
> and new journal_start points at D200. Now the system fails, so next
> recovery starts from D200. Recovery code will find stripe 100 only has
> parity. This means stripe 100 is already flushed to raid, so we can ignore it.
OK, I see. Thanks.
So the data for the stripe had previously been flushed before the crash
that is currently being recovered from. It might help to make the time
of the flush more explicit in the comment:
/* stripes that only have parity must have been flushed
* before the crash that we are now recovering from, so
* there is nothing more to recover.
*/
Something like that.
Thanks,
NeilBrown
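Song's D100 D200 P100 P200 example can be modelled with a short sketch.
The types and function names here are hypothetical and much simpler than
the real r5cache structures; the sketch only shows why a stripe that has
parity but no data pages after journal_start needs no replay.

```c
#include <assert.h>
#include <stddef.h>

enum page_kind { PAGE_DATA, PAGE_PARITY };

struct journal_entry {          /* hypothetical, simplified journal record */
    int stripe;
    enum page_kind kind;
};

/* Count the data pages seen for `stripe` when scanning from journal_start. */
static int data_pages_seen(const struct journal_entry *journal, size_t len,
                           size_t journal_start, int stripe)
{
    int data_count = 0;
    size_t i;

    assert(journal != NULL && journal_start <= len);
    for (i = journal_start; i < len; i++)
        if (journal[i].stripe == stripe && journal[i].kind == PAGE_DATA)
            data_count++;
    return data_count;
}

/* A stripe with no data pages in the scanned range was flushed earlier. */
static int needs_replay(const struct journal_entry *journal, size_t len,
                        size_t journal_start, int stripe)
{
    return data_pages_seen(journal, len, journal_start, stripe) > 0;
}
```

With journal_start at D200 (index 1 in the example), stripe 100 shows up
with parity only and is skipped, while stripe 200 still has its data page
and is replayed.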
* [PATCH v2] raid5: revert commit 11367799f3d1
From: Tomasz Majchrzak @ 2016-10-26 7:20 UTC
To: linux-raid; +Cc: shli, Tomasz Majchrzak
Revert commit 11367799f3d1 ("md: Prevent IO hold during accessing to faulty
raid5 array") as it doesn't comply with commit c3cce6cda162 ("md/raid5:
ensure device failure recorded before write request returns."). That change
is not required anymore as the problem is resolved by commit 16f889499a52
("md: report 'write_pending' state when array in sync") - read request is
stuck as array state is not reported correctly via sysfs attribute.
Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
---
drivers/md/raid5.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
v2:
The previous patch was corrupted and didn't compile
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index f94472d..323d3c7 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4645,9 +4645,7 @@ finish:
}
if (!bio_list_empty(&s.return_bi)) {
- if (test_bit(MD_CHANGE_PENDING, &conf->mddev->flags) &&
- (s.failed <= conf->max_degraded ||
- conf->mddev->external == 0)) {
+ if (test_bit(MD_CHANGE_PENDING, &conf->mddev->flags)) {
spin_lock_irq(&conf->device_lock);
bio_list_merge(&conf->return_bi, &s.return_bi);
spin_unlock_irq(&conf->device_lock);
--
1.8.3.1
* [PATCH v2 mdadm] raid6check.c: fix "misleading-indentation" error
From: Yilong Ren @ 2016-10-26 8:10 UTC
To: Jes Sorensen; +Cc: LKP, Yilong Ren, NeilBrown, linux-raid
To fix the following error info:
root@vm-lkp-nex04-8G-7 /tmp/mdadm# make test
cc -Wall -Werror -Wstrict-prototypes -Wextra -Wno-unused-parameter -ggdb -DSendmail=\""/usr/sbin/sendmail -t"\" -DCONFFILE=\"/etc/mdadm.conf\" -DCONFFILE2=\"/etc/mdadm/mdadm.conf\" -DMAP_DIR=\"/run/mdadm\" -DMAP_FILE=\"map\" -DMDMON_DIR=\"/run/mdadm\" -DFAILED_SLOTS_DIR=\"/run/mdadm/failed-slots\" -DNO_COROSYNC -DNO_DLM -DVERSION=\"3.4-43-g1dcee1c\" -DVERS_DATE="\"06th April 2016\"" -DUSE_PTHREADS -DBINDIR=\"/sbin\" -c -o raid6check.o raid6check.c
raid6check.c: In function 'manual_repair':
raid6check.c:267:4: error: this 'else' clause does not guard... [-Werror=misleading-indentation]
else
^~~~
raid6check.c:269:5: note: ...this statement, but the latter is misleadingly indented as if it is guarded by the 'else'
printf("Repairing D(%d) and P\n", failed_data);
^~~~~~
cc1: all warnings being treated as errors
<builtin>: recipe for target 'raid6check.o' failed
make: *** [raid6check.o] Error 1
root@vm-lkp-nex04-8G-7 /tmp/mdadm#
Cc: NeilBrown <neilb@suse.com>
Cc: linux-raid <linux-raid@vger.kernel.org>
Cc: LKP <lkp@eclists.intel.com>
Reviewed-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: Yilong Ren <yilongx.ren@intel.com>
---
raid6check.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/raid6check.c b/raid6check.c
index ad7ffe7..551f835 100644
--- a/raid6check.c
+++ b/raid6check.c
@@ -266,7 +266,8 @@ int manual_repair(int chunk_size, int syndrome_disks,
failed_data = failed_slot2;
else
failed_data = failed_slot1;
- printf("Repairing D(%d) and P\n", failed_data);
+
+ printf("Repairing D(%d) and P\n", failed_data);
raid6_datap_recov(syndrome_disks+2, chunk_size,
failed_data, (uint8_t**)blocks, 1);
} else {
--
2.1.4
* Re: [PATCH v2 mdadm] raid6check.c: fix "misleading-indentation" error
From: Jes Sorensen @ 2016-10-26 16:00 UTC
To: Yilong Ren; +Cc: LKP, NeilBrown, linux-raid
In-Reply-To: <1477469438-62688-1-git-send-email-yilongx.ren@intel.com>
Yilong Ren <yilongx.ren@intel.com> writes:
> To fix the following error info:
>
> root@vm-lkp-nex04-8G-7 /tmp/mdadm# make test
> cc -Wall -Werror -Wstrict-prototypes -Wextra -Wno-unused-parameter -ggdb -DSendmail=\""/usr/sbin/sendmail -t"\" -DCONFFILE=\"/etc/mdadm.conf\" -DCONFFILE2=\"/etc/mdadm/mdadm.conf\" -DMAP_DIR=\"/run/mdadm\" -DMAP_FILE=\"map\" -DMDMON_DIR=\"/run/mdadm\" -DFAILED_SLOTS_DIR=\"/run/mdadm/failed-slots\" -DNO_COROSYNC -DNO_DLM -DVERSION=\"3.4-43-g1dcee1c\" -DVERS_DATE="\"06th April 2016\"" -DUSE_PTHREADS -DBINDIR=\"/sbin\" -c -o raid6check.o raid6check.c
> raid6check.c: In function 'manual_repair':
> raid6check.c:267:4: error: this 'else' clause does not guard... [-Werror=misleading-indentation]
> else
> ^~~~
> raid6check.c:269:5: note: ...this statement, but the latter is misleadingly indented as if it is guarded by the 'else'
> printf("Repairing D(%d) and P\n", failed_data);
> ^~~~~~
> cc1: all warnings being treated as errors
> <builtin>: recipe for target 'raid6check.o' failed
> make: *** [raid6check.o] Error 1
> root@vm-lkp-nex04-8G-7 /tmp/mdadm#
>
>
> Cc: NeilBrown <neilb@suse.com>
> Cc: linux-raid <linux-raid@vger.kernel.org>
> Cc: LKP <lkp@eclists.intel.com>
> Reviewed-by: Jes Sorensen <Jes.Sorensen@redhat.com>
> Signed-off-by: Yilong Ren <yilongx.ren@intel.com>
> ---
> raid6check.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
Applied!
However, in the future I would much prefer a sane commit message rather
than this auto generated noise which doesn't do a good job describing
the change.
Jes
* Re: [PATCH] imsm: load migration record from right disk
From: Jes Sorensen @ 2016-10-26 16:01 UTC
To: Tomasz Majchrzak; +Cc: linux-raid
In-Reply-To: <1477303217-7968-1-git-send-email-tomasz.majchrzak@intel.com>
Tomasz Majchrzak <tomasz.majchrzak@intel.com> writes:
> Migration record is only stored on disks in first and second metadata
> slot. The function to load the record incorrectly passes disk slot as
> disk index. If a rebuild has taken place for a container, disk slot
> doesn't match disk index so it causes migration record to be read from a
> disk it has not been written to. As a result reshape operation fails.
>
> Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
> ---
> super-intel.c | 12 +++---------
> 1 file changed, 3 insertions(+), 9 deletions(-)
Applied!
Thanks,
Jes
* Re: [PATCH v2] IMSM: Add warning message when x8-type device is used
From: Jes Sorensen @ 2016-10-26 16:02 UTC
To: Pawel Baldysiak; +Cc: linux-raid
In-Reply-To: <1477297192-5646-1-git-send-email-pawel.baldysiak@intel.com>
Pawel Baldysiak <pawel.baldysiak@intel.com> writes:
> This patch adds the warning message when x8-type device
> is used with IMSM metadata. x8 device is a special
> NVMe drive - two of them on a single PCIe card.
> This card could be a single point of failure for
> RAID levels different than RAID0. x8 devices have
> serial number ending with "-A/-B" or "-1/-2".
>
> Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
> Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> ---
> super-intel.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 44 insertions(+)
Applied!
Thanks,
Jes
* Re: [PATCH] IMSM: Enable spanning between VMD domains
From: Jes Sorensen @ 2016-10-26 16:03 UTC
To: Pawel Baldysiak; +Cc: linux-raid
In-Reply-To: <1477042671-12934-3-git-send-email-pawel.baldysiak@intel.com>
Pawel Baldysiak <pawel.baldysiak@intel.com> writes:
> Each VMD domain adds additional PCI domain. This patch
> enables RAID creation with NVMe drives from different
> VMD domains.
>
> Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
> ---
> super-intel.c | 19 +++----------------
> 1 file changed, 3 insertions(+), 16 deletions(-)
Applied!
Thanks,
Jes
* Re: [PATCH] Lib.c: Fix geting devname for devices with long path
From: Jes Sorensen @ 2016-10-26 16:03 UTC
To: Pawel Baldysiak; +Cc: linux-raid
In-Reply-To: <1477042671-12934-4-git-send-email-pawel.baldysiak@intel.com>
Pawel Baldysiak <pawel.baldysiak@intel.com> writes:
> In scenario where VMD is enabled, and "x8" type of NVMe drive is
> plugged into PCIe switch - the path will be longer than 200 chars
> (additional VMD domain + 2 level of PCIe switches).
> This patch makes the buffer big enough to handle this kind of
> configurations.
>
> Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
> ---
> lib.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
Applied!
Thanks,
Jes
* Re: [PATCH 1/4] mdadm: bad block support for external metadata - initialization
From: Jes Sorensen @ 2016-10-26 18:00 UTC
To: Tomasz Majchrzak; +Cc: keld, linux-raid
In-Reply-To: <20161020142611.GA23120@proton.igk.intel.com>
Tomasz Majchrzak <tomasz.majchrzak@intel.com> writes:
> I cannot see how badblocks program is related to this patch. It is a generic
> code for bad blocks support in IMSM metadata. It introduces 64-bit value for
> sector address, the same size as in kernel. All it does is syncing kernel bad
> block list with raid metadata.
>
> Tomek
I was waiting for this response, but you cut me off the CC list so I
missed it.
In this case I'll go ahead and apply these patches to mdadm.
Cheers,
Jes
>
> On Thu, Oct 20, 2016 at 04:02:20PM +0200, keld@keldix.com wrote:
>> Is this safe for 2TB+ arrays?
>> I think the badblocks program does not handle 2TB+ partitions?
>>
>> best regards
>> Keld
>>
>> On Thu, Oct 20, 2016 at 03:03:46PM +0200, Tomasz Majchrzak wrote:
>> > If metadata handler provides support for bad blocks, tell md by writing
>> > 'external_bbl' to rdev state file (both on create and assemble),
>> > followed by a list of known bad blocks written via sysfs 'bad_blocks'
>> > file.
>> >
>> > Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
>> > Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
>> > ---
>> > mdadm.h | 13 +++++++++++++
>> > sysfs.c | 29 ++++++++++++++++++++++++++++-
>> > 2 files changed, 41 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/mdadm.h b/mdadm.h
>> > index 0516c82..5156ea4 100755
>> > --- a/mdadm.h
>> > +++ b/mdadm.h
>> > @@ -237,6 +237,17 @@ struct dlm_lksb {
>> >
>> > extern const char Name[];
>> >
>> > +struct md_bb_entry {
>> > + unsigned long long sector;
>> > + int length;
>> > +};
>> > +
>> > +struct md_bb {
>> > + int supported;
>> > + int count;
>> > + struct md_bb_entry *entries;
>> > +};
>> > +
>> > /* general information that might be extracted from a superblock */
>> > struct mdinfo {
>> > mdu_array_info_t array;
>> > @@ -311,6 +322,8 @@ struct mdinfo {
>> >
>> > /* info read from sysfs */
>> > char sysfs_array_state[20];
>> > +
>> > + struct md_bb bb;
>> > };
>> >
>> > struct createinfo {
>> > diff --git a/sysfs.c b/sysfs.c
>> > index d28e21a..c7a8e66 100644
>> > --- a/sysfs.c
>> > +++ b/sysfs.c
>> > @@ -50,8 +50,12 @@ void sysfs_free(struct mdinfo *sra)
>> > while (sra->devs) {
>> > struct mdinfo *d = sra->devs;
>> > sra->devs = d->next;
>> > + if (d->bb.entries)
>> > + free(d->bb.entries);
>> > free(d);
>> > }
>> > + if (sra->bb.entries)
>> > + free(sra->bb.entries);
>> > free(sra);
>> > sra = sra2;
>> > }
>> > @@ -259,7 +263,7 @@ struct mdinfo *sysfs_read(int fd, char *devnm, unsigned long options)
>> > dbase = base + strlen(base);
>> > *dbase++ = '/';
>> >
>> > - dev = xmalloc(sizeof(*dev));
>> > + dev = xcalloc(1, sizeof(*dev));
>> >
>> > /* Always get slot, major, minor */
>> > strcpy(dbase, "slot");
>> > @@ -687,6 +691,7 @@ int sysfs_add_disk(struct mdinfo *sra, struct mdinfo *sd, int resume)
>> > char nm[PATH_MAX];
>> > char *dname;
>> > int rv;
>> > + int i;
>> >
>> > sprintf(dv, "%d:%d", sd->disk.major, sd->disk.minor);
>> > rv = sysfs_set_str(sra, NULL, "new_dev", dv);
>> > @@ -718,6 +723,28 @@ int sysfs_add_disk(struct mdinfo *sra, struct mdinfo *sd, int resume)
>> > if (resume)
>> > sysfs_set_num(sra, sd, "recovery_start", sd->recovery_start);
>> > }
>> > + if (sd->bb.supported) {
>> > + if (sysfs_set_str(sra, sd, "state", "external_bbl")) {
>> > + /*
>> > + * backward compatibility - if kernel doesn't support
>> > + * bad blocks for external metadata, let it continue
>> > + * as long as there are none known so far
>> > + */
>> > + if (sd->bb.count) {
>> > + pr_err("The kernel has no support for bad blocks in external metadata\n");
>> > + return -1;
>> > + }
>> > + }
>> > +
>> > + for (i = 0; i < sd->bb.count; i++) {
>> > + char s[30];
>> > + const struct md_bb_entry *entry = &sd->bb.entries[i];
>> > +
>> > + snprintf(s, sizeof(s) - 1, "%llu %d\n", entry->sector,
>> > + entry->length);
>> > + rv |= sysfs_set_str(sra, sd, "bad_blocks", s);
>> > + }
>> > + }
>> > return rv;
>> > }
>> >
>> > --
>> > 1.8.3.1
>> >
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>> > the body of a message to majordomo@vger.kernel.org
>> > More majordomo info at http://vger.kernel.org/majordomo-info.html
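On the 2TB question, a small sketch of how one bad-block entry is turned
into the "sector length" text written to the sysfs bad_blocks file. The
struct mirrors the one added to mdadm.h in the patch; the function name
and its truncation check are assumptions added for this sketch. Because
the sector is an unsigned long long printed with %llu, values past the
32-bit range (2TB at 512-byte sectors is 4294967296 sectors) come through
intact.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct md_bb_entry {    /* same fields as the struct added to mdadm.h */
    unsigned long long sector;
    int length;
};

/* Hypothetical helper: returns 0 on success, -1 if the text was truncated. */
static int format_bad_block(char *buf, size_t size,
                            const struct md_bb_entry *entry)
{
    int n;

    assert(buf != NULL && entry != NULL && size > 0);
    n = snprintf(buf, size, "%llu %d\n", entry->sector, entry->length);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

snprintf() always NUL-terminates when the size is non-zero, so passing the
full buffer size is enough; checking the return value catches the case
where the buffer is too small.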
* Re: [PATCH] md/raid5: write an empty meta-block when creating logsuper-block
From: Shaohua Li @ 2016-10-26 18:35 UTC
To: Zhengyuan Liu; +Cc: Song Liu, linux-raid, liuzhengyuang521
In-Reply-To: <tencent_048678816CFA01ED3101C078@qq.com>
On Tue, Oct 25, 2016 at 08:43:50PM +0800, Zhengyuan Liu wrote:
> After discussion with my colleague, I think there is still a problem,
> although it is very unlikely to happen. The superblock should point to the
> last meta block we have written after log reclaim, or point to the empty meta
> block after log recovery. Just consider the case where we write some meta
> blocks behind the superblock position and suppose a crash happens. If the first
> meta block we have written next to the superblock position is invalid, ctx.seq
> would also equal last_cp_seq+1 after we did a recovery. So the safest way is
> to always write an empty meta block at ctx.pos, no matter how much
> ctx.req is more than last_cp_seq after we did a recovery.
> What do you think, Shaohua? If it is necessary, I'd revert this patch and
> resend one.
I didn't get the point. Could you please elaborate it again?
Thanks,
Shaohua
>
> ------------------ Original ------------------
> From: "Shaohua Li"<shli@kernel.org>;
> Date: Tue, Oct 25, 2016 05:23 AM
> To: "Zhengyuan Liu"<liuzhengyuan@kylinos.cn>;
> Cc: "shli"<shli@fb.com>; "Song Liu"<songliubraving@fb.com>; "linux-raid"<linux-raid@vger.kernel.org>; "liuzhengyuang521"<liuzhengyuang521@gmail.com>;
> Subject: Re: [PATCH] md/raid5: write an empty meta-block when creating logsuper-block
>
> On Mon, Oct 24, 2016 at 04:15:59PM +0800, Zhengyuan Liu wrote:
> > If superblock points to an invalid meta block, r5l_load_log will set
> > create_super to true and create a new superblock. This runtime path
> > would always happen if we do no write I/O to this array since it was
> > created. Writing an empty meta block when we first create the log
> > superblock avoids this unnecessary action.
> >
> > Another reason is the correctness of log recovery. Currently we have
> > the code below to guarantee log recovery is correct.
> >
> > if (ctx.seq > log->last_cp_seq + 1) {
> > int ret;
> >
> > ret = r5l_log_write_empty_meta_block(log, ctx.pos, ctx.seq + 10);
> > if (ret)
> > return ret;
> > log->seq = ctx.seq + 11;
> > log->log_start = r5l_ring_add(log, ctx.pos, BLOCK_SECTORS);
> > r5l_write_super(log, ctx.pos);
> > } else {
> > log->log_start = ctx.pos;
> > log->seq = ctx.seq;
> > }
> >
> > If we just created an array with a journal device, log->log_start and
> > log->last_checkpoint should both be 0. Suppose we then write three meta
> > blocks, all valid except the middle one, and a crash happens. The ctx.seq
> > would equal log->last_cp_seq + 1 and log->log_start would be set to the
> > position of the invalid middle meta block after we did a recovery; this
> > will lead to problems which could be avoided with this patch.
>
> This would be very unlikely, but better to fix. Applied, thanks!
* Re: data loss+inode recovery using RAID6 write journal
From: Shaohua Li @ 2016-10-26 18:43 UTC
To: Nick Black; +Cc: linux-raid
In-Reply-To: <20161024235505.rb4fucq24ybbn5aq@schwarzgerat.orthanc>
On Mon, Oct 24, 2016 at 07:55:05PM -0400, Nick Black wrote:
> Hey there, everyone! I've been using and admiring mdadm for over a decade;
> thanks for all the awesome work.
>
> I recently put together a new build, and wanted to try out the
> --write-journal capability of recent Linux md. My write journal is a
> Samsung SSD 840 PRO SSD, atop a RAID6 of 8 4TB spinning disks. All 9 SATA3
> devices are plugged into the onboard SATA3 ports of my ASUS X-99 Deluxe II
> motherboard. Summary description:
>
> md126 : active raid6 sde1[4] sdg1[6] sdd1[3] sdc1[2] sdf1[5] sdi1[8] sdh1[7] sdb1[1] sda1[0](J)
> 23441316864 blocks super 1.2 level 6, 512k chunk, algorithm 2 [8/8] [UUUUUUUU]
> bitmap: 0/30 pages [0KB], 65536KB chunk
>
> All filesystems are ext4. ~14TB of ~22TB are in use on the filesystem built
> directly atop md126:
>
> /dev/md126 22T 14T 7.4T 65% /media/trap
>
> Kernel version is 4.8.3 (the array was built under 4.7.5), and mdadm reports
> v3.4. Distro is debian unstable, running a custom (but fairly orthodox)
> kernel.
>
> I moved a ~20GB tarball from my home directory (located on another device, a
> NVMe md RAID1) to /media/trap/backups. The mv completed successfully. A
> short time after that, I hard rebooted the machine due to X lockup (I'm
> experimenting with compiz). By "short time", I mean "possibly within the
> time window before 20GB could be written out to the backing store, but I'm
> unsure about that". Upon restart, the machine engaged in minutes of disk
> activity, spat out some fsck inode recovery messages (I'm trying to find
> these in my logs), and finally mounted the filesystem. The moved file is
> nowhere to be found.
>
> It's no big loss to me -- I can recreate that data -- but I thought I'd
> report this. As said, I'm looking for logs or other hard details, but not
> seeing them in journalctl output. I can probably reproduce the problem if
> someone needs me to, though otherwise I will likely disable the write
> journal for now (I've not yet done so). Please let me know how I might help
> you track this problem down, if a problem does indeed exist. Thanks!
Thanks for the testing. We can't improve the quality of the new feature if
nobody tests it. Yep, the write journal isn't mature yet, but I can't see how
it would lose data. With the write journal, data is written to the ssd first,
then to the raid disks, and IO is finished at that time. So if IO is finished,
the data should be on the raid disks. The only place data could be lost is the
recovery. But it's also possible the filesystem/writeback hasn't flushed data
to disk yet. I'm wondering if you can reproduce it with/without the journal, so
we can narrow it down a bit.
Thanks,
Shaohua
* Re: data loss+inode recovery using RAID6 write journal
From: Nick Black @ 2016-10-26 18:51 UTC
To: Shaohua Li; +Cc: linux-raid
In-Reply-To: <20161026184359.u7pwbxzz7i5mcbfm@kernel.org>
Shaohua Li left as an exercise for the reader:
> Thanks for the testing. We can't improve the quality of the new feature if
> nobody tests it. Yep, the write journal isn't mature yet, but I can't see how
> it would lose data. With the write journal, data is written to the SSD first,
> then to the raid disks, and the IO is only finished at that point. So if the IO
> has finished, the data should be on the raid disks. The only path that could
> lose data is the recovery. But it's also possible the filesystem/writeback
> hadn't flushed the data to disk yet. I'm wondering if you can reproduce it
> with and without the journal, so we can narrow it down a bit.
I doubt it can be reproduced without the journal -- like I said, I've been
using mdadm RAID[56] for over a decade, and never seen such a problem.
I'll attempt to reproduce with the journal enabled. Assuming I can, I can
try to reproduce without, but I doubt it'll be fruitful. Are there any
debugging options / flags I should enable prior to reproducing in order to get
a more complete report? Some state I should dump from my array and
filesystems? Feel free to be technical.
I owe a lot to Linux MD RAID, and am happy to put some effort into running
this down.
I'll report whether I can at least reproduce ASAP.
--nick
--
nick black -=- http://www.nick-black.com
to make an apple pie from scratch, you need first invent a universe.
^ permalink raw reply
* Re: [PATCH] md: wake up personality thread after array state update
From: Shaohua Li @ 2016-10-26 19:14 UTC (permalink / raw)
To: Tomasz Majchrzak; +Cc: linux-raid
In-Reply-To: <1477408028-3694-1-git-send-email-tomasz.majchrzak@intel.com>
On Tue, Oct 25, 2016 at 05:07:08PM +0200, Tomasz Majchrzak wrote:
> When raid1/raid10 array fails to write to one of the drives, the request
> is added to bio_end_io_list and finished by personality thread. The
> thread doesn't handle it as long as MD_CHANGE_PENDING flag is set. In
> case of external metadata this flag is cleared, however the thread is
> not woken up. It causes the request to be blocked for a few seconds (until
> another action on the array wakes up the thread) or to get stuck
> indefinitely.
>
> Wake up personality thread once MD_CHANGE_PENDING has been cleared.
> Moving 'restart_array' call after the flag is cleared is not a solution
> because in read-write mode the call doesn't wake up the thread.
The patch looks good. However, can you elaborate on how userspace handles this
case? I'd like to understand what the user interface should be to support
external metadata arrays.
Thanks,
Shaohua
^ permalink raw reply
* Re: [PATCH 1/4] mdadm: bad block support for external metadata - initialization
From: Shaohua Li @ 2016-10-26 19:23 UTC (permalink / raw)
To: Jes Sorensen; +Cc: Tomasz Majchrzak, keld, linux-raid
In-Reply-To: <wrfj4m3zkn1s.fsf@redhat.com>
On Wed, Oct 26, 2016 at 02:00:47PM -0400, Jes Sorensen wrote:
> Tomasz Majchrzak <tomasz.majchrzak@intel.com> writes:
> > I cannot see how badblocks program is related to this patch. It is a generic
> > code for bad blocks support in IMSM metadata. It introduces 64-bit value for
> > sector address, the same size as in kernel. All it does is syncing kernel bad
> > block list with raid metadata.
> >
> > Tomek
>
> I was waiting for this response, but you cut me off the CC list so
> missed it.
>
> In this case I'll go ahead and apply these patches to mdadm.
Tomasz,
So your original kernel patch to support bad blocks for external metadata writes
'-blocked' to state. We later agreed it's not required, and the applied kernel
patches don't support that interface. Don't the mdadm patches need to change
accordingly?
Thanks,
Shaohua
^ permalink raw reply
* Re: [PATCH 1/4] mdadm: bad block support for external metadata - initialization
From: Jes Sorensen @ 2016-10-26 19:57 UTC (permalink / raw)
To: Shaohua Li; +Cc: Tomasz Majchrzak, keld, linux-raid
In-Reply-To: <20161026192316.wntpx6jcyay756j7@kernel.org>
Shaohua Li <shli@kernel.org> writes:
> On Wed, Oct 26, 2016 at 02:00:47PM -0400, Jes Sorensen wrote:
>> Tomasz Majchrzak <tomasz.majchrzak@intel.com> writes:
>> > I cannot see how badblocks program is related to this patch. It is a generic
>> > code for bad blocks support in IMSM metadata. It introduces 64-bit value for
>> > sector address, the same size as in kernel. All it does is syncing
>> > kernel bad
>> > block list with raid metadata.
>> >
>> > Tomek
>>
>> I was waiting for this response, but you cut me off the CC list so
>> missed it.
>>
>> In this case I'll go ahead and apply these patches to mdadm.
>
> Tomasz,
>
> So your original kernel patch to support bad blocks for external metadata writes
> '-blocked' to state. We later agreed it's not required, and the applied kernel
> patches don't support that interface. Don't the mdadm patches need to change
> accordingly?
Well I'll wait until this is resolved then :)
Cheers,
Jes
^ permalink raw reply
* Re: [PATCH] md: wake up personality thread after array state update
From: Tomasz Majchrzak @ 2016-10-27 8:52 UTC (permalink / raw)
To: Shaohua Li; +Cc: linux-raid, Jes.Sorensen
In-Reply-To: <20161026191455.wja223ageiwfnqiu@kernel.org>
On Wed, Oct 26, 2016 at 12:14:55PM -0700, Shaohua Li wrote:
> On Tue, Oct 25, 2016 at 05:07:08PM +0200, Tomasz Majchrzak wrote:
> > When raid1/raid10 array fails to write to one of the drives, the request
> > is added to bio_end_io_list and finished by personality thread. The
> > thread doesn't handle it as long as MD_CHANGE_PENDING flag is set. In
> > case of external metadata this flag is cleared, however the thread is
> > not woken up. It causes the request to be blocked for a few seconds (until
> > another action on the array wakes up the thread) or to get stuck
> > indefinitely.
> >
> > Wake up personality thread once MD_CHANGE_PENDING has been cleared.
> > Moving 'restart_array' call after the flag is cleared is not a solution
> > because in read-write mode the call doesn't wake up the thread.
>
> The patch looks good. However, can you elaborate on how userspace handles this
> case? I'd like to understand what the user interface should be to support
> external metadata arrays.
1. Kernel encounters new bad block that needs to be acknowledged.
sysfs array state == "write-pending" (as MD_CHANGE_PENDING set)
sysfs rdev state == "blocked" (as unacked_exists + external_bbl set)
2. mdmon wakes up as there is an update to sysfs array state and unacknowledged
bad blocks list.
3. mdmon checks the state of each disk. If any is 'blocked' and there is
support for bad blocks in metadata, it reads the unacknowledged bad block list
and records new bad blocks in metadata. If successful, it acknowledges bad
blocks by writing to the sysfs bad block file. If all bad blocks have been
acknowledged, it schedules disk unblock.
As soon as kernel marks all bad blocks as acknowledged, it will clear
unacked_exists flag.
4. mdmon checks 'faulty' flag for each disk. If it is set, the disk is removed
from array and unblock is scheduled.
5. mdmon requests to unblock the array by writing '-blocked' to sysfs disk
state.
Requests waiting for bad block confirmation are woken up in kernel.
6. mdmon writes 'active' to sysfs array state.
MD_CHANGE_PENDING flag is cleared by this step but personality thread is not
woken up. The patch resolves this problem.
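The sequence above can be sketched as a toy state model. This is only an illustration of the ordering of the flags named in the steps, not kernel or mdmon code; the struct and function names are all hypothetical:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the flags named in the steps above; not kernel code. */
struct bb_model {
	bool change_pending;	/* stands in for MD_CHANGE_PENDING */
	bool rdev_blocked;	/* rdev state shows "blocked" */
	int unacked;		/* bad blocks not yet acknowledged */
	bool thread_woken;	/* personality thread has been woken */
};

/* Step 1: kernel hits a new bad block that needs acknowledging. */
void kernel_new_bad_block(struct bb_model *m)
{
	m->unacked++;
	m->change_pending = true;
	m->rdev_blocked = true;
	m->thread_woken = false;
}

/* Step 3: mdmon records one bad block and acknowledges it.
 * Returns the number of bad blocks still unacknowledged. */
int mdmon_ack_one(struct bb_model *m)
{
	assert(m->unacked > 0);
	return --m->unacked;
}

/* Step 5: mdmon writes "-blocked". In this model the unblock only
 * succeeds once every bad block has been acknowledged. */
bool mdmon_unblock(struct bb_model *m)
{
	if (m->unacked > 0)
		return false;
	m->rdev_blocked = false;
	return true;
}

/* Step 6: mdmon writes "active", which clears the pending flag.
 * with_patch models the extra wake-up added by the patch. */
void mdmon_set_active(struct bb_model *m, bool with_patch)
{
	m->change_pending = false;
	if (with_patch)
		m->thread_woken = true;
}

/* A queued request finishes only if nothing blocks it and the
 * personality thread actually runs. */
bool request_can_complete(const struct bb_model *m)
{
	return !m->change_pending && !m->rdev_blocked && m->thread_woken;
}
```

In this model, running steps 1, 3, 5 and 6 with `with_patch` set to false leaves `request_can_complete()` false even though both flags are clear, which is the stuck request described in the commit message.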
I hope it answers your question.
Tomek
^ permalink raw reply
* [PATCH 1/4 v2] mdadm: bad block support for external metadata - initialization
From: Tomasz Majchrzak @ 2016-10-27 8:53 UTC (permalink / raw)
To: linux-raid; +Cc: Jes.Sorensen, Tomasz Majchrzak
If metadata handler provides support for bad blocks, tell md by writing
'external_bbl' to rdev state file (both on create and assemble),
followed by a list of known bad blocks written via sysfs 'bad_blocks'
file.
Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
mdadm.h | 13 +++++++++++++
sysfs.c | 29 ++++++++++++++++++++++++++++-
2 files changed, 41 insertions(+), 1 deletion(-)
diff --git a/mdadm.h b/mdadm.h
index 0516c82..5156ea4 100755
--- a/mdadm.h
+++ b/mdadm.h
@@ -237,6 +237,17 @@ struct dlm_lksb {
extern const char Name[];
+struct md_bb_entry {
+ unsigned long long sector;
+ int length;
+};
+
+struct md_bb {
+ int supported;
+ int count;
+ struct md_bb_entry *entries;
+};
+
/* general information that might be extracted from a superblock */
struct mdinfo {
mdu_array_info_t array;
@@ -311,6 +322,8 @@ struct mdinfo {
/* info read from sysfs */
char sysfs_array_state[20];
+
+ struct md_bb bb;
};
struct createinfo {
diff --git a/sysfs.c b/sysfs.c
index d28e21a..c7a8e66 100644
--- a/sysfs.c
+++ b/sysfs.c
@@ -50,8 +50,12 @@ void sysfs_free(struct mdinfo *sra)
while (sra->devs) {
struct mdinfo *d = sra->devs;
sra->devs = d->next;
+ if (d->bb.entries)
+ free(d->bb.entries);
free(d);
}
+ if (sra->bb.entries)
+ free(sra->bb.entries);
free(sra);
sra = sra2;
}
@@ -259,7 +263,7 @@ struct mdinfo *sysfs_read(int fd, char *devnm, unsigned long options)
dbase = base + strlen(base);
*dbase++ = '/';
- dev = xmalloc(sizeof(*dev));
+ dev = xcalloc(1, sizeof(*dev));
/* Always get slot, major, minor */
strcpy(dbase, "slot");
@@ -687,6 +691,7 @@ int sysfs_add_disk(struct mdinfo *sra, struct mdinfo *sd, int resume)
char nm[PATH_MAX];
char *dname;
int rv;
+ int i;
sprintf(dv, "%d:%d", sd->disk.major, sd->disk.minor);
rv = sysfs_set_str(sra, NULL, "new_dev", dv);
@@ -718,6 +723,28 @@ int sysfs_add_disk(struct mdinfo *sra, struct mdinfo *sd, int resume)
if (resume)
sysfs_set_num(sra, sd, "recovery_start", sd->recovery_start);
}
+ if (sd->bb.supported) {
+ if (sysfs_set_str(sra, sd, "state", "external_bbl")) {
+ /*
+ * backward compatibility - if kernel doesn't support
+ * bad blocks for external metadata, let it continue
+ * as long as there are none known so far
+ */
+ if (sd->bb.count) {
+ pr_err("The kernel has no support for bad blocks in external metadata\n");
+ return -1;
+ }
+ }
+
+ for (i = 0; i < sd->bb.count; i++) {
+ char s[30];
+ const struct md_bb_entry *entry = &sd->bb.entries[i];
+
+ snprintf(s, sizeof(s) - 1, "%llu %d\n", entry->sector,
+ entry->length);
+ rv |= sysfs_set_str(sra, sd, "bad_blocks", s);
+ }
+ }
return rv;
}
--
1.8.3.1
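
For readers following the sysfs side of this patch, the entry format written to 'bad_blocks' can be sketched in isolation. This is a minimal sketch, assuming only the "sector length\n" format used above; `struct bb_entry` and `format_bb_entry` are hypothetical names, and unlike the patch this version reports truncation:

```c
#include <assert.h>
#include <stdio.h>

/* Illustrative stand-in for struct md_bb_entry from the patch. */
struct bb_entry {
	unsigned long long sector;
	int length;
};

/*
 * Format one entry as "sector length\n" into buf.
 * Returns the number of characters written (excluding the NUL),
 * or -1 if the buffer is too small for the whole entry.
 */
int format_bb_entry(char *buf, size_t size, const struct bb_entry *e)
{
	int n;

	assert(buf && e && size > 0);

	/* snprintf always NUL-terminates when size > 0 */
	n = snprintf(buf, size, "%llu %d\n", e->sector, e->length);
	if (n < 0 || (size_t)n >= size)
		return -1;
	return n;
}
```

By my count a 64-bit sector plus a 32-bit length can need up to 34 bytes including the terminator, so a 30-byte buffer could in principle truncate for extreme values; realistic sector numbers are far shorter.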
^ permalink raw reply related
* [PATCH 2/4 v2] mdmon: bad block support for external metadata - sysfs file open
From: Tomasz Majchrzak @ 2016-10-27 8:53 UTC (permalink / raw)
To: linux-raid; +Cc: Jes.Sorensen, Tomasz Majchrzak
In-Reply-To: <1477558425-13332-1-git-send-email-tomasz.majchrzak@intel.com>
Open 'bad_blocks' and 'unacknowledged_bad_blocks' sysfs files for each
disk in the array. Add them to the list of files observed by monitor.
Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
managemon.c | 17 +++++++++++++++++
mdadm.h | 2 ++
monitor.c | 7 ++++++-
3 files changed, 25 insertions(+), 1 deletion(-)
diff --git a/managemon.c b/managemon.c
index 6d1b3d8..3c1d4cb 100644
--- a/managemon.c
+++ b/managemon.c
@@ -115,6 +115,8 @@ static void close_aa(struct active_array *aa)
for (d = aa->info.devs; d; d = d->next) {
close(d->recovery_fd);
close(d->state_fd);
+ close(d->bb_fd);
+ close(d->ubb_fd);
}
if (aa->action_fd >= 0)
@@ -433,6 +435,21 @@ static int disk_init_and_add(struct mdinfo *disk, struct mdinfo *clone,
close(disk->recovery_fd);
return -1;
}
+ disk->bb_fd = sysfs_open2(aa->info.sys_name, disk->sys_name,
+ "bad_blocks");
+ if (disk->bb_fd < 0) {
+ close(disk->recovery_fd);
+ close(disk->state_fd);
+ return -1;
+ }
+ disk->ubb_fd = sysfs_open2(aa->info.sys_name, disk->sys_name,
+ "unacknowledged_bad_blocks");
+ if (disk->ubb_fd < 0) {
+ close(disk->recovery_fd);
+ close(disk->state_fd);
+ close(disk->bb_fd);
+ return -1;
+ }
disk->prev_state = read_dev_state(disk->state_fd);
disk->curr_state = disk->prev_state;
disk->next = aa->info.devs;
diff --git a/mdadm.h b/mdadm.h
index 5156ea4..1a1c7af 100755
--- a/mdadm.h
+++ b/mdadm.h
@@ -311,6 +311,8 @@ struct mdinfo {
/* Device info for mdmon: */
int recovery_fd;
int state_fd;
+ int bb_fd;
+ int ubb_fd;
#define DS_FAULTY 1
#define DS_INSYNC 2
#define DS_WRITE_MOSTLY 4
diff --git a/monitor.c b/monitor.c
index 4c79ce2..56c6500 100644
--- a/monitor.c
+++ b/monitor.c
@@ -454,6 +454,8 @@ static int read_and_act(struct active_array *a)
dprintf_cont(" %d:removed", mdi->disk.raid_disk);
close(mdi->state_fd);
close(mdi->recovery_fd);
+ close(mdi->bb_fd);
+ close(mdi->ubb_fd);
mdi->state_fd = -1;
} else
ret |= ARRAY_BUSY;
@@ -583,8 +585,11 @@ static int wait_and_act(struct supertype *container, int nowait)
add_fd(&rfds, &maxfd, a->info.state_fd);
add_fd(&rfds, &maxfd, a->action_fd);
add_fd(&rfds, &maxfd, a->sync_completed_fd);
- for (mdi = a->info.devs ; mdi ; mdi = mdi->next)
+ for (mdi = a->info.devs ; mdi ; mdi = mdi->next) {
add_fd(&rfds, &maxfd, mdi->state_fd);
+ add_fd(&rfds, &maxfd, mdi->bb_fd);
+ add_fd(&rfds, &maxfd, mdi->ubb_fd);
+ }
ap = &(*ap)->next;
}
--
1.8.3.1
^ permalink raw reply related
* [PATCH 3/4 v2] mdmon: bad block support for external metadata - store bad blocks
From: Tomasz Majchrzak @ 2016-10-27 8:53 UTC (permalink / raw)
To: linux-raid; +Cc: Jes.Sorensen, Tomasz Majchrzak
In-Reply-To: <1477558425-13332-1-git-send-email-tomasz.majchrzak@intel.com>
If md has changed the state to 'blocked' and metadata handler supports
bad blocks, try to process them first. If metadata handler has successfully
stored a bad block, acknowledge it to md via 'bad_blocks' sysfs file. If
metadata handler has failed to store the new bad block (e.g. lack of
space), remove bad block support for a disk by writing "-external_bbl"
to state sysfs file. If all bad blocks have been acknowledged, request
to unblock the array.
Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Acked-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
mdadm.h | 4 +++
monitor.c | 112 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 116 insertions(+)
diff --git a/mdadm.h b/mdadm.h
index 1a1c7af..05a2e3e 100755
--- a/mdadm.h
+++ b/mdadm.h
@@ -1050,6 +1050,10 @@ extern struct superswitch {
/* validate container after assemble */
int (*validate_container)(struct mdinfo *info);
+ /* records new bad block in metadata */
+ int (*record_bad_block)(struct active_array *a, int n,
+ unsigned long long sector, int length);
+
int swapuuid; /* true if uuid is bigending rather than hostendian */
int external;
const char *name; /* canonical metadata name */
diff --git a/monitor.c b/monitor.c
index 56c6500..9de9c8d 100644
--- a/monitor.c
+++ b/monitor.c
@@ -31,6 +31,10 @@ static char *sync_actions[] = {
"idle", "reshape", "resync", "recover", "check", "repair", NULL
};
+enum bb_action {
+ RECORD_BB = 1,
+};
+
static int write_attr(char *attr, int fd)
{
return write(fd, attr, strlen(attr));
@@ -158,6 +162,104 @@ int read_dev_state(int fd)
return rv;
}
+int process_ubb(struct active_array *a, struct mdinfo *mdi, const unsigned long
+ long sector, const int length, const char *buf,
+ const int buf_len)
+{
+ struct superswitch *ss = a->container->ss;
+
+ /*
+ * record bad block in metadata first, then acknowledge it to the driver
+ * via sysfs file
+ */
+ if ((ss->record_bad_block(a, mdi->disk.raid_disk, sector, length)) &&
+ (write(mdi->bb_fd, buf, buf_len) == buf_len))
+ return 1;
+
+ /*
+ * failed to store or acknowledge bad block, switch off bad block support
+ * to get it out of blocked state
+ */
+ sysfs_set_str(&a->info, mdi, "state", "-external_bbl");
+ return -1;
+}
+
+static int read_bb_file(int fd, struct active_array *a, struct mdinfo *mdi,
+ enum bb_action action, void *arg)
+{
+ char buf[30];
+ int n = 0;
+ int ret = 0;
+ int read_again = 0;
+ int off = 0;
+ int pos = 0;
+ int preserve_pos = (action == RECORD_BB ? 0 : 1);
+
+ if (lseek(fd, 0, SEEK_SET) == (off_t) -1)
+ return -1;
+
+ do {
+ read_again = 0;
+ n = read(fd, buf + pos, sizeof(buf) - 1 - pos);
+ if (n < 0)
+ return -1;
+ n += pos;
+
+ buf[n] = '\0';
+ off = 0;
+
+ while (off < n) {
+ unsigned long long sector;
+ int length;
+ char newline;
+ int consumed;
+ int matched;
+ int rc;
+
+ /* kernel sysfs file format: "sector length\n" */
+ matched = sscanf(buf + off, "%llu %d%c%n", &sector,
+ &length, &newline, &consumed);
+ if ((matched != 3) && (off > 0)) {
+ /* truncated entry, read again */
+ if (preserve_pos) {
+ pos = sizeof(buf) - off - 1;
+ memmove(buf, buf + off, pos);
+ } else {
+ if (lseek(fd, 0, SEEK_SET) ==
+ (off_t) -1)
+ return -1;
+ }
+ read_again = 1;
+ break;
+ }
+ if (matched != 3)
+ return -1;
+ if (newline != '\n')
+ return -1;
+ if (length <= 0)
+ return -1;
+
+ if (action == RECORD_BB)
+ rc = process_ubb(a, mdi, sector, length,
+ buf + off, consumed);
+ else
+ rc = -1;
+
+ if (rc < 0)
+ return rc;
+ ret += rc;
+ off += consumed;
+ }
+ } while (read_again);
+
+ return ret;
+}
+
+static int process_dev_ubb(struct active_array *a, struct mdinfo *mdi)
+{
+ return read_bb_file(mdi->ubb_fd, a, mdi, RECORD_BB, NULL);
+}
+
static void signal_manager(void)
{
/* tgkill(getpid(), mon_tid, SIGUSR1); */
@@ -256,6 +358,16 @@ static int read_and_act(struct active_array *a)
&mdi->recovery_start);
mdi->curr_state = read_dev_state(mdi->state_fd);
}
+ /*
+ * If array is blocked and metadata handler is able to handle
+ * BB, check if you can acknowledge them to md driver. If
+ * successful, clear faulty state and unblock the array.
+ */
+ if ((mdi->curr_state & DS_BLOCKED) &&
+ a->container->ss->record_bad_block &&
+ (process_dev_ubb(a, mdi) > 0)) {
+ mdi->next_state |= DS_UNBLOCK;
+ }
}
gettimeofday(&tv, NULL);
--
1.8.3.1
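
The record parsing in read_bb_file() can be tried out on an in-memory buffer. The following is a minimal sketch of the same sscanf() pattern, assuming the "sector length\n" format noted in the patch; `parse_bb_entry` and `count_bb_entries` are hypothetical helpers, and the re-read handling for truncated entries is left out:

```c
#include <assert.h>
#include <stdio.h>

/*
 * Parse one "sector length\n" record at s.
 * Returns the number of bytes consumed, or 0 if the record is
 * incomplete or malformed.
 */
int parse_bb_entry(const char *s, unsigned long long *sector, int *length)
{
	char newline;
	int consumed = 0;
	int matched;

	assert(s && sector && length);

	/* %n stores the bytes read so far and is not counted in matched */
	matched = sscanf(s, "%llu %d%c%n", sector, length, &newline,
			 &consumed);
	if (matched != 3 || newline != '\n' || *length <= 0)
		return 0;
	return consumed;
}

/* Count the complete records at the start of a NUL-terminated buffer. */
int count_bb_entries(const char *buf)
{
	unsigned long long sector;
	int length;
	int count = 0;
	int n;

	while (*buf && (n = parse_bb_entry(buf, &sector, &length)) > 0) {
		buf += n;
		count++;
	}
	return count;
}
```

A record cut off before its newline fails the `matched != 3` check, which is the case the patch handles by reading again.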
^ permalink raw reply related
* [PATCH 4/4 v4] mdmon: bad block support for external metadata - clear bad blocks
From: Tomasz Majchrzak @ 2016-10-27 8:53 UTC (permalink / raw)
To: linux-raid; +Cc: Jes.Sorensen, Tomasz Majchrzak
In-Reply-To: <1477558425-13332-1-git-send-email-tomasz.majchrzak@intel.com>
If mdmon is notified of an update to the acknowledged bad blocks file, read
the entire bad block list from the sysfs file and compare it against the local
list of bad blocks. If any obsolete entries are found, remove them from
metadata.
As mdmon cannot perform any memory allocation, the new superswitch method
get_bad_blocks is expected to return a list of bad blocks in metadata
without allocating memory. It's up to the metadata handler to allocate all
required memory in advance.
Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
mdadm.h | 7 ++++++
monitor.c | 81 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
2 files changed, 86 insertions(+), 2 deletions(-)
diff --git a/mdadm.h b/mdadm.h
index 05a2e3e..7f1a1b8 100755
--- a/mdadm.h
+++ b/mdadm.h
@@ -1054,6 +1054,13 @@ extern struct superswitch {
int (*record_bad_block)(struct active_array *a, int n,
unsigned long long sector, int length);
+ /* clears bad block from metadata */
+ int (*clear_bad_block)(struct active_array *a, int n,
+ unsigned long long sector, int length);
+
+ /* get list of bad blocks from metadata */
+ struct md_bb *(*get_bad_blocks)(struct active_array *a, int n);
+
int swapuuid; /* true if uuid is bigending rather than hostendian */
int external;
const char *name; /* canonical metadata name */
diff --git a/monitor.c b/monitor.c
index 9de9c8d..3d60fef 100644
--- a/monitor.c
+++ b/monitor.c
@@ -33,6 +33,7 @@ static char *sync_actions[] = {
enum bb_action {
RECORD_BB = 1,
+ COMPARE_BB,
};
static int write_attr(char *attr, int fd)
@@ -184,6 +185,49 @@ int process_ubb(struct active_array *a, struct mdinfo *mdi, const unsigned long
return -1;
}
+int compare_bb(struct active_array *a, struct mdinfo *mdi, const unsigned long
+ long sector, const unsigned int length, void *arg)
+{
+ struct superswitch *ss = a->container->ss;
+ struct md_bb *bb = (struct md_bb *) arg;
+ int record = 1;
+ int i;
+
+ for (i = 0; i < bb->count; i++) {
+ unsigned long long start = bb->entries[i].sector;
+ unsigned long long len = bb->entries[i].length;
+
+ /*
+ * bad block in metadata exactly matches bad block in kernel
+ * list, just remove it from a list
+ */
+ if ((start == sector) && (len == length)) {
+ if (i < bb->count - 1)
+ bb->entries[i] = bb->entries[bb->count - 1];
+ bb->count -= 1;
+ record = 0;
+ break;
+ }
+ /*
+ * bad block in metadata spans bad block in kernel list,
+ * clear it and record new bad block
+ */
+ if ((sector >= start) && (sector + length <= start + len)) {
+ ss->clear_bad_block(a, mdi->disk.raid_disk, start, len);
+ break;
+ }
+ }
+
+ /* record all bad blocks not in metadata list */
+ if (record && (ss->record_bad_block(a, mdi->disk.raid_disk, sector,
+ length) <= 0)) {
+ sysfs_set_str(&a->info, mdi, "state", "-external_bbl");
+ return -1;
+ }
+
+ return 1;
+}
+
static int read_bb_file(int fd, struct active_array *a, struct mdinfo *mdi,
enum bb_action action, void *arg)
{
@@ -242,6 +286,8 @@ static int read_bb_file(int fd, struct active_array *a, struct mdinfo *mdi,
if (action == RECORD_BB)
rc = process_ubb(a, mdi, sector, length,
buf + off, consumed);
+ else if (action == COMPARE_BB)
+ rc = compare_bb(a, mdi, sector, length, arg);
else
rc = -1;
@@ -260,6 +306,34 @@ static int process_dev_ubb(struct active_array *a, struct mdinfo *mdi)
return read_bb_file(mdi->ubb_fd, a, mdi, RECORD_BB, NULL);
}
+static int check_for_cleared_bb(struct active_array *a, struct mdinfo *mdi)
+{
+ struct superswitch *ss = a->container->ss;
+ struct md_bb *bb;
+ int i;
+
+ /*
+ * Get a list of bad blocks for an array, then read list of
+ * acknowledged bad blocks from kernel and compare it against metadata
+ * list, clear all bad blocks remaining in metadata list
+ */
+ bb = ss->get_bad_blocks(a, mdi->disk.raid_disk);
+ if (!bb)
+ return -1;
+
+ if (read_bb_file(mdi->bb_fd, a, mdi, COMPARE_BB, bb) < 0)
+ return -1;
+
+ for (i = 0; i < bb->count; i++) {
+ unsigned long long sector = bb->entries[i].sector;
+ int length = bb->entries[i].length;
+
+ ss->clear_bad_block(a, mdi->disk.raid_disk, sector, length);
+ }
+
+ return 0;
+}
+
static void signal_manager(void)
{
/* tgkill(getpid(), mon_tid, SIGUSR1); */
@@ -326,7 +400,7 @@ static void signal_manager(void)
#define ARRAY_DIRTY 1
#define ARRAY_BUSY 2
-static int read_and_act(struct active_array *a)
+static int read_and_act(struct active_array *a, fd_set *fds)
{
unsigned long long sync_completed;
int check_degraded = 0;
@@ -368,6 +442,8 @@ static int read_and_act(struct active_array *a)
(process_dev_ubb(a, mdi) > 0)) {
mdi->next_state |= DS_UNBLOCK;
}
+ if (FD_ISSET(mdi->bb_fd, fds))
+ check_for_cleared_bb(a, mdi);
}
gettimeofday(&tv, NULL);
@@ -754,6 +830,7 @@ static int wait_and_act(struct supertype *container, int nowait)
if (rv == -1) {
if (errno == EINTR) {
rv = 0;
+ FD_ZERO(&rfds);
dprintf("monitor: caught signal\n");
} else
dprintf("monitor: error %d in pselect\n",
@@ -795,7 +872,7 @@ static int wait_and_act(struct supertype *container, int nowait)
signal_manager();
}
if (a->container && !a->to_remove) {
- int ret = read_and_act(a);
+ int ret = read_and_act(a, &rfds);
rv |= 1;
dirty_arrays += !!(ret & ARRAY_DIRTY);
/* when terminating stop manipulating the array after it
--
1.8.3.1
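
The exact-match branch of compare_bb() amounts to a swap-remove on the metadata list: every entry the kernel still reports is taken out, and whatever remains afterwards is stale. A minimal sketch of that idea on its own, with hypothetical names (`struct bb_list`, `bb_list_take`) and without the overlapping-range case or any metadata calls:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for struct md_bb_entry / struct md_bb. */
struct bb_entry {
	unsigned long long sector;
	int length;
};

struct bb_list {
	int count;
	struct bb_entry *entries;
};

/*
 * Remove the entry that exactly matches (sector, length) by moving
 * the last entry into its slot. No allocation, order not preserved.
 * Returns true if a matching entry was found and removed.
 */
bool bb_list_take(struct bb_list *bb, unsigned long long sector, int length)
{
	int i;

	assert(bb && bb->count >= 0);

	for (i = 0; i < bb->count; i++) {
		if (bb->entries[i].sector == sector &&
		    bb->entries[i].length == length) {
			bb->entries[i] = bb->entries[bb->count - 1];
			bb->count--;
			return true;
		}
	}
	return false;
}
```

After calling this once for every entry read from the kernel list, the entries left in the list are the ones check_for_cleared_bb() would clear from metadata. Swap-remove fits the constraint that mdmon must not allocate memory.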
^ permalink raw reply related