* Bug#837964: 95a05b3 broke mdadm --add on my superblock 1.0 array
From: Anthony DeRobertis @ 2016-09-19 16:32 UTC (permalink / raw)
To: linux-raid; +Cc: 837964

(please cc me, I'm not subscribed.)

mdadm 3.4 cannot manage to add a spare to my array; it fails like this:

# mdadm -a /dev/md/pv0 /dev/sdc3
mdadm: add new device failed for /dev/sdc3 as 8: Invalid argument

and the kernel logs:

md: sdc3 does not have a valid v1.0 superblock, not importing!
md: md_import_device returned -22

This worked in 3.3.4. I performed two git bisects and found that:

a) it was broken by 95a05b37e8eb2bc0803b1a0298fce6adc60eff16
b) it is sort-of fixed by 81306e021ebdcc4baef866da82d25c3f0a415d2d
   (which AFAIK isn't yet released)

I say sort-of fixed in that it adds the disk to the array, but spits out
some worrying errors (and I have no idea if it'd actually work, e.g.,
whether it'd assemble again):

# ./mdadm -a /dev/md/pv0 /dev/sdc3
mdadm: Warning: cluster md only works with superblock 1.2
mdadm: Failed to write metadata to /dev/sdc3

I'm not using cluster md (or at least not on purpose!), as these are
internal SATA drives accessible by only one machine.

I wasn't able to reproduce this on a new (small) test array, so it might
be something specific to this array.

I've reported this bug to Debian; that bug report contains a lot of
system information that I won't repeat here (because it's quite long):

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=837964

It includes mdadm -E output, the mdadm.conf, etc.

^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: 95a05b3 broke mdadm --add on my superblock 1.0 array
From: Guoqing Jiang @ 2016-09-20 5:38 UTC (permalink / raw)
To: Anthony DeRobertis, linux-raid, 837964

On 09/19/2016 12:32 PM, Anthony DeRobertis wrote:
> (please cc me, I'm not subscribed.)
>
> mdadm 3.4 can not manage to add a spare to my array, it fails like:
>
> # mdadm -a /dev/md/pv0 /dev/sdc3
> mdadm: add new device failed for /dev/sdc3 as 8: Invalid argument
>
> and the kernel logs:
>
> md: sdc3 does not have a valid v1.0 superblock, not importing!
> md: md_import_device returned -22
>
> This worked in 3.3.4. I performed two git bisects and found that:
>
> a) it was broken by 95a05b37e8eb2bc0803b1a0298fce6adc60eff16
> b) it is sort-of fixed by 81306e021ebdcc4baef866da82d25c3f0a415d2d
>    (which AFAIK isn't yet released)
>
> I say sort of fixed in that it adds it to the array, but spits out some
> worrying errors (and I have no idea if it'd actually work, e.g., if it'd
> assemble again):
>
> # ./mdadm -a /dev/md/pv0 /dev/sdc3
> mdadm: Warning: cluster md only works with superblock 1.2
> mdadm: Failed to write metadata to /dev/sdc3

Thanks for the report. Could you try the latest tree,
git://git.kernel.org/pub/scm/utils/mdadm/mdadm.git?
I guess 45a87c2f31335a759190dff663a881bc78ca5443 should resolve it, and
I can add a spare disk to a native raid (internal bitmap) with different
metadata versions (0.9, 1.0 and 1.2).

Please let me know the result; I will look into it if the issue still
exists.

Thanks,
Guoqing
* Re: 95a05b3 broke mdadm --add on my superblock 1.0 array
From: Anthony DeRobertis @ 2016-09-20 7:02 UTC (permalink / raw)
To: Guoqing Jiang, linux-raid, 837964

On 09/20/2016 01:38 AM, Guoqing Jiang wrote:
>
> Thanks for report, could you try the latest tree
> git://git.kernel.org/pub/scm/utils/mdadm/mdadm.git?
> I guess 45a87c2f31335a759190dff663a881bc78ca5443 should resolve it ,
> and I can add a spare disk
> to native raid (internal bitmap) with different metadatas (0.9, 1.0
> to 1.2).

(please keep me cc'd, I'm not subscribed)

$ git rev-parse --short HEAD
676e87a
$ make -j4
...

# ./mdadm -a /dev/md/pv0 /dev/sdc3
mdadm: add new device failed for /dev/sdc3 as 8: Invalid argument

[375036.613907] md: sdc3 does not have a valid v1.0 superblock, not importing!
[375036.613926] md: md_import_device returned -22

So current master seems to be completely broken again. It's 3 AM here;
I'll check more (and do another bisect, to find when it went from
weird-errors-but-works to broken) tomorrow.
* Re: 95a05b3 broke mdadm --add on my superblock 1.0 array
From: Guoqing Jiang @ 2016-09-20 9:36 UTC (permalink / raw)
To: Anthony DeRobertis, linux-raid, 837964

On 09/20/2016 03:02 AM, Anthony DeRobertis wrote:
> On 09/20/2016 01:38 AM, Guoqing Jiang wrote:
>>
>> Thanks for report, could you try the latest tree
>> git://git.kernel.org/pub/scm/utils/mdadm/mdadm.git?
>> I guess 45a87c2f31335a759190dff663a881bc78ca5443 should resolve it ,
>> and I can add a spare disk
>> to native raid (internal bitmap) with different metadatas (0.9, 1.0
>> to 1.2).
>
> (please keep me cc'd, I'm not subscribed)
>
> $ git rev-parse --short HEAD
> 676e87a
> $ make -j4
> ...
>
> # ./mdadm -a /dev/md/pv0 /dev/sdc3
> mdadm: add new device failed for /dev/sdc3 as 8: Invalid argument
>
> [375036.613907] md: sdc3 does not have a valid v1.0 superblock, not
> importing!
> [375036.613926] md: md_import_device returned -22

The md-cluster code should only apply to raid1 with 1.2 metadata, and
your array (/dev/md/pv0) is raid10 with 1.0 metadata (if I read the bug
correctly), so it is weird that your array invokes the md-cluster code
at all.

I assume it only happens with an existing array, and a newly created one
doesn't have the problem, right? I can't reproduce it on my side.

Which kernel version did you use to create the array, in case the kernel
was updated since? Also, please show the output of "mdadm -X $DISK".
Your bitmap location is a little odd (though I haven't tried level 10
before, so maybe it is correct):

Internal Bitmap : -234 sectors from superblock

Thanks,
Guoqing
* Re: 95a05b3 broke mdadm --add on my superblock 1.0 array
From: Anthony DeRobertis @ 2016-09-20 17:12 UTC (permalink / raw)
To: Guoqing Jiang; +Cc: linux-raid, 837964

On Tue, Sep 20, 2016 at 05:36:17AM -0400, Guoqing Jiang wrote:
> The md-cluster code should only work raid1 with 1.2 metadata, and your
> array (/dev/md/pv0) is raid10 with 1.0 metadata (if I read bug
> correctly),

Correct, it's 1.0 metadata and raid10.

> I assume it only happens with existed array, a new created one doesn't
> have the problem, right? And I can't reproduce it from my side.

Yeah, I made a 1.0 raid10 with 4 LVs and couldn't trigger the bug. Of
course, that array was also much smaller.

> Which kernel version are you used to created the array in case the
> kernel was updated?

I've had the array for a while (the superblocks with -E show a creation
time of Wed Jun 16 14:25:08 2010). If I had to take a guess, I'd guess
it was created with the Debian squeeze alpha1 installer... so probably
2.6.30 or 2.6.32.

> Also pls show the output of "mdadm -X $DISK", and your bitmap is a
> little weird (but I don't try with 10 level before, so maybe it is
> correct).

I'd guess it was created by doing 'mdadm --grow --bitmap internal' at
some point after the array was created, but I'm not sure. It could also
have been created with the array.
# ./mdadm -X /dev/sd[abde]3
        Filename : /dev/sda3
           Magic : 6d746962
         Version : 4
            UUID : c840d0de:0626d783:3f1b28dc:c5ec649a
          Events : 1124478
  Events Cleared : 1124478
           State : OK
       Chunksize : 2 MB
          Daemon : 5s flush period
      Write Mode : Normal
       Sync Size : 1953005568 (1862.53 GiB 1999.88 GB)
          Bitmap : 953616 bits (chunks), 54 dirty (0.0%)

        Filename : /dev/sdb3
           Magic : 6d746962
         Version : 4
            UUID : c840d0de:0626d783:3f1b28dc:c5ec649a
          Events : 1124478
  Events Cleared : 1124478
           State : OK
       Chunksize : 2 MB
          Daemon : 5s flush period
      Write Mode : Normal
       Sync Size : 1953005568 (1862.53 GiB 1999.88 GB)
          Bitmap : 953616 bits (chunks), 54 dirty (0.0%)

        Filename : /dev/sdd3
           Magic : 6d746962
         Version : 4
            UUID : c840d0de:0626d783:3f1b28dc:c5ec649a
          Events : 1124478
  Events Cleared : 1124478
           State : OK
       Chunksize : 2 MB
          Daemon : 5s flush period
      Write Mode : Normal
       Sync Size : 1953005568 (1862.53 GiB 1999.88 GB)
          Bitmap : 953616 bits (chunks), 54 dirty (0.0%)

        Filename : /dev/sde3
           Magic : 6d746962
         Version : 4
            UUID : c840d0de:0626d783:3f1b28dc:c5ec649a
          Events : 1124478
  Events Cleared : 1124478
           State : OK
       Chunksize : 2 MB
          Daemon : 5s flush period
      Write Mode : Normal
       Sync Size : 1953005568 (1862.53 GiB 1999.88 GB)
          Bitmap : 953616 bits (chunks), 84 dirty (0.0%)
* Re: 95a05b3 broke mdadm --add on my superblock 1.0 array
From: Guoqing Jiang @ 2016-09-21 6:40 UTC (permalink / raw)
To: Anthony DeRobertis, linux-raid, 837964

On 09/20/2016 01:12 PM, Anthony DeRobertis wrote:
>
>> Which kernel version are you used to created the array in case the
>> kernel was updated?
>
> I've had the array for a while (the superblocks with -E show a creation
> time of Wed Jun 16 14:25:08 2010). If I had to take a guess, I'd guess
> it was created with the Debian squeeze alpha1 installer... So probably
> 2.6.30 or 2.6.32.

Hmm, a lot has changed since 2.6.30, so it is possible that the latest
mdadm can't work well with an array that was created with such an old
kernel.

Thanks,
Guoqing
* Re: 95a05b3 broke mdadm --add on my superblock 1.0 array
From: Anthony DeRobertis @ 2016-09-20 17:52 UTC (permalink / raw)
To: Guoqing Jiang, linux-raid, 837964

BTW: the change from apparently working but spewing errors back to not
working is:

bbc24bb35020793b9e6fa2111b15882f0dbfe36e is the first broken commit
commit bbc24bb35020793b9e6fa2111b15882f0dbfe36e
Author: Guoqing Jiang <gqjiang@suse.com>
Date:   Mon May 9 10:22:58 2016 +0800

    super1: make the check for NodeNumUpdate more accurate

    We missed to check the version is BITMAP_MAJOR_CLUSTERED or not,
    otherwise mdadm can't create array with other 1.x metadatas
    (1.0 and 1.1).

    Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
    Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>

:100644 100644 972b4700455426d47f52141416d873b6c745fa07 fa933676621f6431398192b1c0b26f3ce53deac3 M super1.c
* Re: 95a05b3 broke mdadm --add on my superblock 1.0 array
From: Anthony DeRobertis @ 2016-09-20 18:31 UTC (permalink / raw)
To: Guoqing Jiang, linux-raid, 837964

[-- Attachment #1: Type: text/plain, Size: 1212 bytes --]

Sorry for the number of emails I'm sending, but I noticed something
that's probably important. I'm also appending some gdb log from tracing
through the function (trying to answer why it's doing cluster-mode stuff
at all).

While tracing through, I noticed that *before* the write-bitmap loop,
mdadm -E considers the superblock valid. That agrees with what I saw
from strace, I suppose. At first glance, it figures out how much to
write by calling this function:

static unsigned int calc_bitmap_size(bitmap_super_t *bms, unsigned int boundary)
{
	unsigned long long bits, bytes;

	bits = __le64_to_cpu(bms->sync_size) / (__le32_to_cpu(bms->chunksize)>>9);
	bytes = (bits+7) >> 3;
	bytes += sizeof(bitmap_super_t);
	bytes = ROUND_UP(bytes, boundary);

	return bytes;
}

That code looked familiar, and I figured out where from: it's also in
95a05b37e8eb2bc0803b1a0298fce6adc60eff16, the commit that I found
originally broke it. But that commit makes a change to it: it changed
the ROUND_UP boundary from 512 to 4096 (and from the gdb trace,
boundary==4096).

I tested changing that line to "bytes = ROUND_UP(bytes, 512);", and it
works: it adds the new disk to the array and produces no warnings or
errors.
[-- Attachment #2: gdb.txt --]
[-- Type: text/plain, Size: 13228 bytes --]

Starting program: /var/tmp/mdadm/mdadm/mdadm -a /dev/md/pv0 /dev/sdc3

Breakpoint 1, write_bitmap1 (st=0x6b0780, fd=5, update=NodeNumUpdate) at super1.c:2351
2351		struct mdp_superblock_1 *sb = st->sb;
st = 0x6b0780
fd = 5
update = NodeNumUpdate
$1 = (struct supertype *) 0x6b0780
$2 = {ss = 0x69c060 <super1>, minor_version = 0, max_devs = 1920, container_devnm = '\000' <repeats 31 times>, sb = 0x6c7000, info = 0x6c6450, other = 0x0, devsize = 0, data_offset = 0, ignore_hw_compat = 0, updates = 0x0, update_tail = 0x0, arrays = 0x0, sock = 0, devnm = "md127", '\000' <repeats 26 times>, devcnt = 0, retry_soon = 0, nodes = 0, cluster_name = 0x0, devs = 0x0}
#0  write_bitmap1 (st=0x6b0780, fd=5, update=NodeNumUpdate) at super1.c:2351
        sb = 0x6c8000
        bms = 0x8e6492c800000400
        rv = 0
        buf = 0x15250a2b
        towrite = 1953005968
        n = 0
        len = 0
        afd = {fd = 243328694, blk_sz = 5}
        i = 7106560
        total_bm_space = 2199023255557
        bm_space_per_node = 7110656
#1  0x000000000044530c in write_init_super1 (st=0x6b0780) at super1.c:1851
        sb = 0x6c7000
        refst = 0x6c6490
        rv = 0
        bm_space = 264
        di = 0x6c6450
        dsize = 1953005985
        array_size = 1953005568
        sb_offset = 1953005968
        data_offset = 0
#2  0x00000000004169d0 in Manage_add (fd=3, tfd=4, dv=0x6b0040, tst=0x6b0780, array=0x7fffffffda40, force=0, verbose=0, devname=0x7fffffffe4b7 "/dev/md/pv0", update=0x0, rdev=2083, array_size=1953005568, raid_slot=-1) at Manage.c:971
        dfd = 5
        ldsize = 999939064320
        dev_st = 0x6c6390
        j = 8
        disc = {number = 8, major = 8, minor = 35, raid_disk = -1, state = 0}
#3  0x00000000004183f5 in Manage_subdevs (devname=0x7fffffffe4b7 "/dev/md/pv0", fd=3, devlist=0x6b0040, verbose=0, test=0, update=0x0, force=0) at Manage.c:1617
        rdev = 2083
        rv = 0
        mj = -142377600
        mn = 32767
        array = {major_version = 1, minor_version = 0, patch_version = 3, ctime = 1276712708, level = 10, size = 976502784, nr_disks = 4, raid_disks = 4, md_minor = 127, not_persistent = 0, utime = 1474393877, state = 256, active_disks = 4, working_disks = 4, failed_disks = 0, spare_disks = 0, layout = 513, chunk_size = 524288}
        array_size = 1953005568
        dv = 0x6b0040
        tfd = 4
        tst = 0x6b0780
        subarray = 0x0
        sysfd = -1
        count = 0
        info = {array = {major_version = -9784, minor_version = 32767, patch_version = -136434289, ctime = 32767, level = 2, size = 0, nr_disks = -134254776, raid_disks = 32767, md_minor = 1, not_persistent = 0, utime = 0, state = 0, active_disks = 1, working_disks = 0, failed_disks = -134225560, spare_disks = 32767, layout = -7824, chunk_size = 32767}, disk = {number = -10032, major = 32767, minor = -117177849, raid_disk = 0, state = 0}, events = 140737354130624, uuid = {-9968, 32767, 0, 1}, name = "\000\331\377\377\377\177\000\000\354\222s\360\000\000\000\000\223\024@\000\000\000\000\000\377\377\377\377\000\000\000\000@", data_offset = 140737346016776, new_data_offset = 140737354099120, component_size = 140737488345760, custom_array_size = 140737351942788, reshape_active = 1, reshape_progress = 140737354129344, recovery_blocked = 0, journal_device_required = 0, journal_clean = -136478512, space_before = 140737351876824, space_after = 140737351876808, {resync_start = 140737349770912, recovery_start = 140737349770912}, bitmap_offset = 140737488345760, safe_mode_delay = 0, new_level = 6905808, delta_disks = 0, new_layout = 4206336, new_chunk = 0, errors = -7872, cache_size = 0, mismatch_cnt = 0, text_version = "\000\000\000\000`\340\377\377\377\177\000\000\326w\336\367\377\177\000\000\001", '\000' <repeats 23 times>, "\b\026\204\367\377\177", container_member = -9504, container_enough = 32767, sys_name = "md127", '\000' <repeats 26 times>, devs = 0xff000000000000, next = 0x0, recovery_fd = -16777216, state_fd = -65536, prev_state = 0, curr_state = 0, next_state = 0, sysfs_array_state = "\000\000\377\377", '\000' <repeats 15 times>}
        devinfo = {array = {major_version = -142323768, minor_version = 32767, patch_version = 0, ctime = 0, level = -2147483646, size = 0, nr_disks = 4706142, raid_disks = 0, md_minor = 4, not_persistent = 0, utime = 4272203, state = 0, active_disks = -10000, working_disks = 32767, failed_disks = -136412540, spare_disks = 32767, layout = -142323768, chunk_size = 32767}, disk = {number = -134225984, major = 32767, minor = -9856, raid_disk = 32767, state = -10144}, events = 140737488345192, uuid = {-10145, 32767, -136395088, 32767}, name = "p\330\377\377\377\177\000\000\310d\377\367\377\177\000\000\225W\275\367\002\000\000\000`\330\377\377\377\177\000\000t", data_offset = 140737488345183, new_data_offset = 1627, component_size = 140737354099120, custom_array_size = 140737345977728, reshape_active = -142323768, reshape_progress = 140737351919787, recovery_blocked = 1627, journal_device_required = 0, journal_clean = -142323768, space_before = 140737354099120, space_after = 140737488345144, {resync_start = 140737488345140, recovery_start = 140737488345140}, bitmap_offset = 140737351918145, safe_mode_delay = 7, new_level = 4199571, delta_disks = 0, new_layout = 4196120, new_chunk = 0, errors = -10184, cache_size = 4034106092, mismatch_cnt = 63032907, text_version = "\000\000\000\000,\000\000\000\000\000\000\000\020\331\377\377\377\177\000\000\310O\204\367\377\177\000\000\200}\203\367\377\177\000\000\064\330\377\377\377\177\000\000\000\331\377\377\377\177", container_member = -134254856, container_enough = 32767, sys_name = "\004\000\000\000\000\000\000\000ibcm\000\000\000\000o.4\000\377\177\000\000\376\377\377\377\000\000\000", devs = 0x0, next = 0x7fffffffd998, recovery_fd = -134224704, state_fd = 32767, prev_state = -9824, curr_state = 32767, next_state = -134254776, sysfs_array_state = "\377\177\000\000\000\000\000\000\000\000\000\000h\341\377\367\377\177\000"}
        frozen = 1
        busy = 0
        raid_slot = -1
#4  0x0000000000406948 in main (argc=4, argv=0x7fffffffe148) at mdadm.c:1368
        mode = 4
        opt = -1
        option_index = -1
        rv = 0
        i = 0
        array_size = 0
        data_offset = 1
        ident = {devname = 0x7fffffffdff8 "\340C\204", <incomplete sequence \367>, uuid_set = 0, uuid = {32767, 2, 0, -134254776}, name = "\000\177\000\000\001", '\000' <repeats 15 times>, "\001\000\000\000\000\000\000\000h\341\377\367\377", super_minor = 65534, devices = 0x0, level = 65534, raid_disks = 65534, spare_disks = 0, st = 0x0, autof = 0, spare_group = 0x0, bitmap_file = 0x0, bitmap_fd = -1, container = 0x0, member = 0x0, next = 0x7ffff7ffe168, {assembled = -142326816}}
        configfile = 0x0
        devmode = 97
        bitmap_fd = -1
        devlist = 0x6b0010
        devlistend = 0x6b0060
        dv = 0x6b0040
        devs_found = 2
        symlinks = 0x0
        grow_continue = 0
        c = {readonly = 0, runstop = 0, verbose = 0, brief = 0, force = 0, homehost = 0x7fffffffdcd0 "Zia", require_homehost = 1, prefer = 0x0, export = 0, test = 0, subarray = 0x0, update = 0x0, scan = 0, SparcAdjust = 0, autof = 0, delay = 0, freeze_reshape = 0, backup_file = 0x0, invalid_backup = 0, action = 0x0, nodes = 0, homecluster = 0x0}
        s = {raiddisks = 0, sparedisks = 0, journaldisks = 0, level = 65534, layout = 65534, layout_str = 0x0, chunk = 0, bitmap_chunk = 65534, bitmap_file = 0x0, assume_clean = 0, write_behind = 0, size = 0}
        sys_hostname = "Zia\000\377\177\000\000\360\303\373\367\377\177\000\000\000\000\000\000\000\000\000\000\330\331\377\367\377\177\000\000\340\336\377\377\377\177\000\000\217-\336\367\377\177\000\000\002\000\000\000\000\000\000\000\360\303\373\367\377\177\000\000\001", '\000' <repeats 15 times>, "\001\000\000\000\000\000\000\000\330\331\377\367\377\177\000\000\000\000 \271\377\377\377\377\000\000\342\004\275\357\377\377`\\i", '\000' <repeats 13 times>, "\300\344\377\367\377\177\000\000\220\335\377\377\377\177\000\000\000\000\200\271\001\000\000\000\200\335\377\377\377\177\000\000\307\016\340=\000\000\000\000t \336\367\377\177\000\000\377\377\377\377\000\000\000\000D\b\000\000\000\000\000\000\260i\377\367\377\177\000\000"...
        mailaddr = 0x0
        program = 0x0
        increments = 20
        daemonise = 0
        pidfile = 0x0
        oneshot = 0
        spare_sharing = 1
        ss = 0x0
        writemostly = 0
        shortopt = 0x6965a0 <short_bitmap_options> "-ABCDEFGIQhVXYWZ:vqb:c:i:l:p:m:n:x:u:c:d:z:U:N:sarfRSow1tye:"
        dosyslog = 0
        rebuild_map = 0
        remove_path = 0x0
        udev_filename = 0x0
        dump_directory = 0x0
        print_help = 0
        outf = 0x0
        mdfd = 3
2352		bitmap_super_t *bms = (bitmap_super_t*)(((char*)sb)+MAX_SB_SIZE);
2353		int rv = 0;
$3 = {magic = 1836345698, version = 4, uuid = "\310@\320\336\006&׃?\033(\334\305\354d\232", events = 1124486, events_cleared = 1124486, sync_size = 3906011136, state = 0, chunksize = 2097152, daemon_sleep = 5, write_behind = 0, sectors_reserved = 0, nodes = 0, cluster_name = '\000' <repeats 63 times>, pad = '\000' <repeats 119 times>}
$4 = (void *) 0x6c7000
2357		unsigned int i = 0;
2360		switch (update) {
2373		if (st->minor_version != 2 && bms->version == BITMAP_MAJOR_CLUSTERED) {
2378		if (bms->version == BITMAP_MAJOR_CLUSTERED) {
2394		if (st->nodes)
No symbol "BITMAP_MAJOR_CLUSTERED" in current context.
$5 = 4
2396			break;
2419		init_afd(&afd, fd);
2421		locate_bitmap1(st, fd, 0);
$6 = {fd = 5, blk_sz = 512}
2423		if (posix_memalign(&buf, 4096, 4096))
$7 = (struct supertype *) 0x6b0780
$8 = {ss = 0x69c060 <super1>, minor_version = 0, max_devs = 1920, container_devnm = '\000' <repeats 31 times>, sb = 0x6c7000, info = 0x6c6450, other = 0x0, devsize = 0, data_offset = 0, ignore_hw_compat = 0, updates = 0x0, update_tail = 0x0, arrays = 0x0, sock = 0, devnm = "md127", '\000' <repeats 26 times>, devcnt = 0, retry_soon = 0, nodes = 0, cluster_name = 0x0, devs = 0x0}
2430		if (i)
2433		memset(buf, 0xff, 4096);
2434		memcpy(buf, (char *)bms, sizeof(bitmap_super_t));
2436		towrite = calc_bitmap_size(bms, 4096);
2437		while (towrite > 0) {
$9 = 122880
2438			n = towrite;
2439			if (n > 4096)
2440				n = 4096;
2441			n = awrite(&afd, buf, n);
2442			if (n > 0)
2443				towrite -= n;
2446			if (i)
2449			memset(buf, 0xff, 4096);
2437		while (towrite > 0) {
2438			n = towrite;
2439			if (n > 4096)
2440				n = 4096;
2441			n = awrite(&afd, buf, n);
2442			if (n > 0)
2443				towrite -= n;
2446			if (i)
2449			memset(buf, 0xff, 4096);
2437		while (towrite > 0) {
2438			n = towrite;
2439			if (n > 4096)
2440				n = 4096;
2441			n = awrite(&afd, buf, n);
2442			if (n > 0)
2443			towrite -= n;
2446			if (i)
2449			memset(buf, 0xff, 4096);
2437		while (towrite > 0) {
2438			n = towrite;
2439			if (n > 4096)
$10 = 110592
Continue program being debugged, after signal or breakpoint.
Usage: continue [N]
If proceeding from breakpoint, a number N may be used as an argument,
which means to set the ignore count of that breakpoint to N - 1 (so that
the breakpoint won't break until the Nth time it is reached).
If non-stop mode is enabled, continue only the current thread, otherwise
all the threads in the program are continued.
To continue all stopped threads in non-stop mode, use the -a option.
Specifying -a and an ignore count simultaneously is an error.
Execute until the program reaches a source line greater than the current
or a specified location (same args as break command) within the current frame.
write_bitmap1 (st=0x6b0780, fd=5, update=NodeNumUpdate) at super1.c:2451
2451		fsync(fd);
Continuing.
[Inferior 1 (process 23866) exited with code 01]
Breakpoint 2 at 0x440d25: file super1.c, line 165.
Starting program: /var/tmp/mdadm/mdadm/mdadm -a /dev/md/pv0 /dev/sdc3

Breakpoint 1, write_bitmap1 (st=0x6b0780, fd=5, update=NodeNumUpdate) at super1.c:2351
2351		struct mdp_superblock_1 *sb = st->sb;
Continuing.

Breakpoint 2, calc_bitmap_size (bms=0x6c8000, boundary=4096) at super1.c:165
165		bits = __le64_to_cpu(bms->sync_size) / (__le32_to_cpu(bms->chunksize)>>9);
bms = 0x6c8000
boundary = 4096
$11 = {magic = 1836345698, version = 4, uuid = "\310@\320\336\006&׃?\033(\334\305\354d\232", events = 1124486, events_cleared = 1124486, sync_size = 3906011136, state = 0, chunksize = 2097152, daemon_sleep = 5, write_behind = 0, sectors_reserved = 0, nodes = 0, cluster_name = '\000' <repeats 63 times>, pad = '\000' <repeats 119 times>}
166		bytes = (bits+7) >> 3;
167		bytes += sizeof(bitmap_super_t);
168		bytes = ROUND_UP(bytes, boundary);
$12 = 119458
170		return bytes;
$13 = 122880
Continuing.
[Inferior 1 (process 25040) exited with code 01]
quit
* Re: 95a05b3 broke mdadm --add on my superblock 1.0 array
From: Guoqing Jiang @ 2016-09-21 6:45 UTC (permalink / raw)
To: Anthony DeRobertis, linux-raid, 837964

On 09/20/2016 02:31 PM, Anthony DeRobertis wrote:
> Sorry for the amount of emails I'm sending, but I noticed something
> that's probably important. I'm also appending some gdb log from
> tracing through the function (trying to answer why it's doing cluster
> mode stuff at all).
>
> While tracing through, I noticed that *before* the write-bitmap loop,
> mdadm -E considers the superblock valid. That agrees with what I saw
> from strace, I suppose. To my first glance, it figures out how much
> to write by calling this function:
>
> static unsigned int calc_bitmap_size(bitmap_super_t *bms, unsigned
> int boundary)
> {
>         unsigned long long bits, bytes;
>
>         bits = __le64_to_cpu(bms->sync_size) /
> (__le32_to_cpu(bms->chunksize)>>9);
>         bytes = (bits+7) >> 3;
>         bytes += sizeof(bitmap_super_t);
>         bytes = ROUND_UP(bytes, boundary);
>
>         return bytes;
> }
>
> That code looked familiar, and I figured out where—it's also in
> 95a05b37e8eb2bc0803b1a0298fce6adc60eff16, the commit that I found
> originally broke it. But that commit is making a change to it: it
> changed the ROUND_UP line from 512 to 4096 (and from the gdb trace,
> boundary==4096).
>
> I tested changing that line to "bytes = ROUND_UP(bytes, 512);", and it
> works. Adds the new disk to the array and produces no warnings or errors.

I think it is a coincidence that the above change works; commit 4a3d29e
made that change, but it didn't change the logic at all.

Also, the problem seems not to be related to the md-cluster code, since
your gdb debug shows it ran into the part below because the version is 4.

/* no need to change bms->nodes for other bitmap types */

Thanks,
Guoqing
* Re: 95a05b3 broke mdadm --add on my superblock 1.0 array
From: Guoqing Jiang @ 2016-09-22 2:40 UTC (permalink / raw)
To: Anthony DeRobertis, linux-raid, 837964

On 09/21/2016 02:45 AM, Guoqing Jiang wrote:
>
> On 09/20/2016 02:31 PM, Anthony DeRobertis wrote:
>> Sorry for the amount of emails I'm sending, but I noticed something
>> that's probably important. I'm also appending some gdb log from
>> tracing through the function (trying to answer why it's doing cluster
>> mode stuff at all).
>>
>> While tracing through, I noticed that *before* the write-bitmap loop,
>> mdadm -E considers the superblock valid. That agrees with what I saw
>> from strace, I suppose. To my first glance, it figures out how much
>> to write by calling this function:
>>
>> static unsigned int calc_bitmap_size(bitmap_super_t *bms, unsigned
>> int boundary)
>> {
>>         unsigned long long bits, bytes;
>>
>>         bits = __le64_to_cpu(bms->sync_size) /
>> (__le32_to_cpu(bms->chunksize)>>9);
>>         bytes = (bits+7) >> 3;
>>         bytes += sizeof(bitmap_super_t);
>>         bytes = ROUND_UP(bytes, boundary);
>>
>>         return bytes;
>> }
>>
>> That code looked familiar, and I figured out where—it's also in
>> 95a05b37e8eb2bc0803b1a0298fce6adc60eff16, the commit that I found
>> originally broke it. But that commit is making a change to it: it
>> changed the ROUND_UP line from 512 to 4096 (and from the gdb trace,
>> boundary==4096).
>>
>> I tested changing that line to "bytes = ROUND_UP(bytes, 512);", and
>> it works. Adds the new disk to the array and produces no warnings or
>> errors.
>
> I think it is is a coincidence that above change works, 4a3d29e
> commit made the change but it didn't change the logic at all.

Hmm, it seems the bitmap was aligned to 512 bytes by previous mdadm, but
with commit 95a05b3 we made it aligned to 4K, so the latest mdadm can't
work with a previously created array.

Does the change below work? Thanks.

diff --git a/super1.c b/super1.c
index 9f62d23..6a0b075 100644
--- a/super1.c
+++ b/super1.c
@@ -2433,7 +2433,10 @@ static int write_bitmap1(struct supertype *st, int fd, enum bitmap_update update
 			memset(buf, 0xff, 4096);
 		memcpy(buf, (char *)bms, sizeof(bitmap_super_t));
 
-		towrite = calc_bitmap_size(bms, 4096);
+		if (__le32_to_cpu(bms->nodes) == 0)
+			towrite = calc_bitmap_size(bms, 512);
+		else
+			towrite = calc_bitmap_size(bms, 4096);
 		while (towrite > 0) {

Regards,
Guoqing
* Re: 95a05b3 broke mdadm --add on my superblock 1.0 array
From: Anthony DeRobertis @ 2016-09-22 16:53 UTC (permalink / raw)
To: Guoqing Jiang; +Cc: linux-raid, 837964

On Wed, Sep 21, 2016 at 10:40:22PM -0400, Guoqing Jiang wrote:
>
> Does the below change work? Thanks.

Yes, this patch (after un-email-mangling it) fixes it.

> diff --git a/super1.c b/super1.c
> index 9f62d23..6a0b075 100644
> --- a/super1.c
> +++ b/super1.c
> @@ -2433,7 +2433,10 @@ static int write_bitmap1(struct supertype *st, int fd, enum bitmap_update update
>  			memset(buf, 0xff, 4096);
>  		memcpy(buf, (char *)bms, sizeof(bitmap_super_t));
>
> -		towrite = calc_bitmap_size(bms, 4096);
> +		if (__le32_to_cpu(bms->nodes) == 0)
> +			towrite = calc_bitmap_size(bms, 512);
> +		else
> +			towrite = calc_bitmap_size(bms, 4096);
>  		while (towrite > 0) {
>
> Regards,
> Guoqing
* Re: Bug#837964: 95a05b3 broke mdadm --add on my superblock 1.0 array
From: NeilBrown @ 2016-10-06 1:26 UTC (permalink / raw)
To: Guoqing Jiang, 837964, Anthony DeRobertis, linux-raid

[-- Attachment #1: Type: text/plain, Size: 4054 bytes --]

On Thu, Sep 22 2016, Guoqing Jiang wrote:

> On 09/21/2016 02:45 AM, Guoqing Jiang wrote:
>>
>> On 09/20/2016 02:31 PM, Anthony DeRobertis wrote:
>>> Sorry for the amount of emails I'm sending, but I noticed something
>>> that's probably important. I'm also appending some gdb log from
>>> tracing through the function (trying to answer why it's doing cluster
>>> mode stuff at all).
>>>
>>> While tracing through, I noticed that *before* the write-bitmap loop,
>>> mdadm -E considers the superblock valid. That agrees with what I saw
>>> from strace, I suppose. To my first glance, it figures out how much
>>> to write by calling this function:
>>>
>>> static unsigned int calc_bitmap_size(bitmap_super_t *bms, unsigned
>>> int boundary)
>>> {
>>>         unsigned long long bits, bytes;
>>>
>>>         bits = __le64_to_cpu(bms->sync_size) /
>>> (__le32_to_cpu(bms->chunksize)>>9);
>>>         bytes = (bits+7) >> 3;
>>>         bytes += sizeof(bitmap_super_t);
>>>         bytes = ROUND_UP(bytes, boundary);
>>>
>>>         return bytes;
>>> }
>>>
>>> That code looked familiar, and I figured out where—it's also in
>>> 95a05b37e8eb2bc0803b1a0298fce6adc60eff16, the commit that I found
>>> originally broke it. But that commit is making a change to it: it
>>> changed the ROUND_UP line from 512 to 4096 (and from the gdb trace,
>>> boundary==4096).
>>>
>>> I tested changing that line to "bytes = ROUND_UP(bytes, 512);", and
>>> it works. Adds the new disk to the array and produces no warnings or
>>> errors.
>>
>> I think it is is a coincidence that above change works, 4a3d29e
>> commit made the change but it didn't change the logic at all.
>
> Hmm, seems bitmap is aligned to 512 in previous mdadm, but with commit
> 95a05b3 we made it aligned to 4k, so it causes the latest mdadm can't
> work with previous created array.
>
> Does the below change work? Thanks.
>
> diff --git a/super1.c b/super1.c
> index 9f62d23..6a0b075 100644
> --- a/super1.c
> +++ b/super1.c
> @@ -2433,7 +2433,10 @@ static int write_bitmap1(struct supertype *st, int fd, enum bitmap_update update
>  			memset(buf, 0xff, 4096);
>  		memcpy(buf, (char *)bms, sizeof(bitmap_super_t));
>
> -		towrite = calc_bitmap_size(bms, 4096);
> +		if (__le32_to_cpu(bms->nodes) == 0)
> +			towrite = calc_bitmap_size(bms, 512);
> +		else
> +			towrite = calc_bitmap_size(bms, 4096);
>  		while (towrite > 0) {

(sorry for the late reply ... travel, jetlag, ....)

I think a better, simpler fix is:

> -		towrite = calc_bitmap_size(bms, 4096);
> +		towrite = calc_bitmap_size(bms, 512);

The only reason we are rounding up here is that we are using O_DIRECT
writes, and they require 512-byte alignment. Any bytes beyond the end of
the actual bitmap will be ignored, so it doesn't matter whether they are
written or not.

Current mdadm always aligns bitmaps on a 4K boundary, but older versions
of mdadm didn't. If the bitmap was less than 4K before the superblock
(quite possible), writing 4K for the bitmap would corrupt the
superblock. This can certainly happen with 1.0 metadata.

However ... the reason that everything is now 4K-aligned is that some
drives use a 4K block size. For those, we really should be doing 4K
writes, not 512-byte writes. So it would make sense to round up to 4K
sometimes and use 512 at other times. However, the correct test isn't
whether cluster raid is in use. The metadata has always been aligned on
a 4K boundary. If data_offset and bblog_offset and bitmap_offset all
have 4K alignment, then rounding up to 4K for the bitmap writes would be
correct. If anything has a smaller alignment, then it isn't necessary
and so should be avoided.

So the best fix would be to test those 3 offsets, and round up to a
multiple of 4096 only if all of them are on a 4K boundary.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]
Thread overview: 12+ messages (newest: 2016-10-06 1:26 UTC)

2016-09-19 16:32 Bug#837964: 95a05b3 broke mdadm --add on my superblock 1.0 array  Anthony DeRobertis
2016-09-20  5:38 ` Guoqing Jiang
2016-09-20  7:02   ` Anthony DeRobertis
2016-09-20  9:36     ` Guoqing Jiang
2016-09-20 17:12       ` Anthony DeRobertis
2016-09-21  6:40         ` Guoqing Jiang
2016-09-20 17:52       ` Anthony DeRobertis
2016-09-20 18:31       ` Anthony DeRobertis
2016-09-21  6:45         ` Guoqing Jiang
2016-09-22  2:40           ` Guoqing Jiang
2016-09-22 16:53             ` Anthony DeRobertis
2016-10-06  1:26             ` Bug#837964: NeilBrown