[BISECT] Kernel panic, RIP bitmap_create

linux-raid.vger.kernel.org archive mirror

* [BISECT] Kernel panic, RIP bitmap_create
       [not found] <CAOOwNtJhFa67EFTs5AdgSHzFseBr9xJGTsaEOyYnaYYNCeUMAQ@mail.gmail.com>
@ 2012-05-03  5:05 ` Karl Newman
  2012-05-03  5:58   ` NeilBrown
  0 siblings, 1 reply; 8+ messages in thread
From: Karl Newman @ 2012-05-03  5:05 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb

Hi,

I'm attempting to use kernel 3.4-rc? but keep running into a kernel panic on
boot, with RIP pointing to bitmap_create. I tried 3.4-rc1, 3.4-rc4 and
3.4-rc5 and they all have the kernel panic, while 3.3.4 boots fine. I have
my root on raid 5 with an internal bitmap, and the kernel panic occurs if I
use the built-in kernel autodetect or during the root array assembly via
mdadm inside a dracut-generated initramfs. I bisected it down to the
following commit:
61a0d80ce4ab5b4fb9ecb38f1fb19654778b71ed

md/bitmap: discard CHUNK_BLOCK_SHIFT macro

By redefining ->chunkshift as the shift from sectors to chunks rather than
bytes to chunks, we can just use "bitmap->chunkshift" which is shorter than
the macro call, and less indirect.

Signed-off-by: NeilBrown <neilb@suse.de>

My bisect testing included a scary commit where 2 of 3 drives had their
UUIDs zeroed when I booted with it! Fortunately I found the mailing list
archives with the solution and I was able to recover everything and keep
bisecting (although I was tempted to quit and just give the range of
commits...).

I hope this fix can make it into the next 3.4-rc kernel.

Thanks,

Karl Newman

P.S. Sorry for the possible repeat, apparently Gmail's default HTML
format is unacceptable to this list.


* Re: [BISECT] Kernel panic, RIP bitmap_create
  2012-05-03  5:05 ` [BISECT] Kernel panic, RIP bitmap_create Karl Newman
@ 2012-05-03  5:58   ` NeilBrown
  2012-05-03  6:14     ` Karl Newman
  0 siblings, 1 reply; 8+ messages in thread
From: NeilBrown @ 2012-05-03  5:58 UTC (permalink / raw)
  To: Karl Newman; +Cc: linux-raid

On Wed, 2 May 2012 22:05:44 -0700 Karl Newman <siliconfiend@gmail.com> wrote:

> Hi,
> 
> I'm attempting to use kernel 3.4-rc? but keep running into a kernel panic on
> boot, with RIP pointing to bitmap_create. I tried 3.4-rc1, 3.4-rc4 and
> 3.4-rc5 and they all have the kernel panic, while 3.3.4 boots fine. I have
> my root on raid 5 with an internal bitmap, and the kernel panic occurs if I
> use the built-in kernel autodetect or during the root array assembly via
> mdadm inside a dracut-generated initramfs. I bisected it down to the
> following commit:
> 61a0d80ce4ab5b4fb9ecb38f1fb19654778b71ed
> 
> md/bitmap: discard CHUNK_BLOCK_SHIFT macro
> 
> By redefining ->chunkshift as the shift from sectors to chunks rather than
> bytes to chunks, we can just use "bitmap->chunkshift" which is shorter than
> the macro call, and less indirect.
> 
> Signed-off-by: NeilBrown <neilb@suse.de>
> 
> My bisect testing included a scary commit where 2 of 3 drives had their
> UUIDs zeroed when I booted with it! Fortunately I found the mailing list
> archives with the solution and I was able to recover everything and keep
> bisecting (although I was tempted to quit and just give the range of
> commits...).
> 
> I hope this fix can make it into the next 3.4-rc kernel.

I do too, but first I would need to know what the fix is, and I cannot see
anything in that commit that would change the behaviour of md at all.

Do you have a copy of the full stack trace provided when Linux crashed?  That
could be useful.
Also what bitmap chunk size are you using? Maybe the output of
  mdadm -X
and
  mdadm -E

of one of the devices in the array would help.

Thanks a lot for the report and going to the trouble of bisecting, it is
really appreciated.

NeilBrown


* Re: [BISECT] Kernel panic, RIP bitmap_create
  2012-05-03  5:58   ` NeilBrown
@ 2012-05-03  6:14     ` Karl Newman
  2012-05-03  6:25       ` NeilBrown
  2012-05-03  6:50       ` NeilBrown
  0 siblings, 2 replies; 8+ messages in thread
From: Karl Newman @ 2012-05-03  6:14 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On Wed, May 2, 2012 at 10:58 PM, NeilBrown <neilb@suse.de> wrote:
> On Wed, 2 May 2012 22:05:44 -0700 Karl Newman <siliconfiend@gmail.com> wrote:
>
>> Hi,
>>
>> I'm attempting to use kernel 3.4-rc? but keep running into a kernel panic on
>> boot, with RIP pointing to bitmap_create. I tried 3.4-rc1, 3.4-rc4 and
>> 3.4-rc5 and they all have the kernel panic, while 3.3.4 boots fine. I have
>> my root on raid 5 with an internal bitmap, and the kernel panic occurs if I
>> use the built-in kernel autodetect or during the root array assembly via
>> mdadm inside a dracut-generated initramfs. I bisected it down to the
>> following commit:
>> 61a0d80ce4ab5b4fb9ecb38f1fb19654778b71ed
>>
>> md/bitmap: discard CHUNK_BLOCK_SHIFT macro
>>
>> By redefining ->chunkshift as the shift from sectors to chunks rather than
>> bytes to chunks, we can just use "bitmap->chunkshift" which is shorter than
>> the macro call, and less indirect.
>>
>> Signed-off-by: NeilBrown <neilb@suse.de>
>>
>> My bisect testing included a scary commit where 2 of 3 drives had their
>> UUIDs zeroed when I booted with it! Fortunately I found the mailing list
>> archives with the solution and I was able to recover everything and keep
>> bisecting (although I was tempted to quit and just give the range of
>> commits...).
>>
>> I hope this fix can make it into the next 3.4-rc kernel.
>
> I do too, but first I would need to know what the fix is, and I cannot see
> anything in that commit that would change the behaviour of md at all.
>
> Do you have a copy of the full stack trace provided when Linux crashed?  That
> could be useful.
> Also what bitmap chunk size are you using? Maybe the output of
>  mdadm -X
> and
>  mdadm -E
>
> of one of the devices in the array would help.
>
> Thanks a lot for the report and going to the trouble of bisecting, it is
> really appreciated.
>
> NeilBrown

I'm not sure how to go about getting the full stack trace. The
motherboard has no serial port, so that's not an option. Unless the
kernel supports USB to serial adapters for that purpose, in which case
I might be able to borrow a couple Keyspans. Or I could sit and try
and transcribe the whole thing...(!) I'm a little nervous about
tripping the all-zeros UUID bug again, although it only happened once
and it doesn't seem to be related to that commit. Anyway, here's some
data from the array:

# mdadm -E /dev/sda3
/dev/sda3:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 60bf4ee8:e6e3e14f:073e21cd:ed2abb54
  Creation Time : Wed May  2 20:22:47 2012
     Raid Level : raid5
  Used Dev Size : 29302400 (27.94 GiB 30.01 GB)
     Array Size : 58604800 (55.89 GiB 60.01 GB)
   Raid Devices : 3
  Total Devices : 3
Preferred Minor : 127

    Update Time : Wed May  2 23:01:45 2012
          State : active
Internal Bitmap : present
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0
       Checksum : 863c968f - correct
         Events : 2

         Layout : left-symmetric
     Chunk Size : 128K

      Number   Major   Minor   RaidDevice State
this     0       8        3        0      active sync   /dev/sda3

   0     0       8        3        0      active sync   /dev/sda3
   1     1       8       19        1      active sync   /dev/sdb3
   2     2       8       35        2      active sync   /dev/sdc3

# mdadm -X /dev/sda3
        Filename : /dev/sda3
           Magic : 6d746962
         Version : 4
            UUID : 60bf4ee8:e6e3e14f:073e21cd:ed2abb54
          Events : 1
  Events Cleared : 1
           State : OK
       Chunksize : 64 MB
          Daemon : 5s flush period
      Write Mode : Normal
       Sync Size : 29302400 (27.94 GiB 30.01 GB)
          Bitmap : 448 bits (chunks), 0 dirty (0.0%)
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [BISECT] Kernel panic, RIP bitmap_create
  2012-05-03  6:14     ` Karl Newman
@ 2012-05-03  6:25       ` NeilBrown
  2012-05-03  6:50       ` NeilBrown
  1 sibling, 0 replies; 8+ messages in thread
From: NeilBrown @ 2012-05-03  6:25 UTC (permalink / raw)
  To: Karl Newman; +Cc: linux-raid

On Wed, 2 May 2012 23:14:09 -0700 Karl Newman <siliconfiend@gmail.com> wrote:

> I'm not sure how to go about getting the full stack trace. The
> motherboard has no serial port, so that's not an option. Unless the
> kernel supports USB to serial adapters for that purpose, in which case
> I might be able to borrow a couple Keyspans. Or I could sit and try
> and transcribe the whole thing...(!) 

A photo with a digital camera is usually easiest.   If you have wired
ethernet you could possibly set up netconsole.
Add something like

netconsole=@192.168.1.8/eth0,6666@192.168.1.3/00:14:85:fc:3b:de
           ^my address            ^other host IP / ethernet

Then on other-host run

  nc -u -l -p 6666 | tee -a /tmp/log


> I'm a little nervous about
> tripping the all-zeros UUID bug again, although it only happened once
> and it doesn't seem to be related to that commit.

If you apply


--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8140,7 +8140,8 @@ static int md_notify_reboot(struct notifier_block *this,
 
        for_each_mddev(mddev, tmp) {
                if (mddev_trylock(mddev)) {
-                       __md_stop_writes(mddev);
+                       if (mddev->pers)
+                               __md_stop_writes(mddev);
                        mddev->safemode = 2;
                        mddev_unlock(mddev);
                }


to the kernel before you build it, that bug should not happen again.


>    Anyway, here's some
> data from the array:

Thanks.  Nothing jumps out at me, but I'll ponder it some more.

Thanks,
NeilBrown



* Re: [BISECT] Kernel panic, RIP bitmap_create
  2012-05-03  6:14     ` Karl Newman
  2012-05-03  6:25       ` NeilBrown
@ 2012-05-03  6:50       ` NeilBrown
  2012-05-04  6:37         ` Karl Newman
  1 sibling, 1 reply; 8+ messages in thread
From: NeilBrown @ 2012-05-03  6:50 UTC (permalink / raw)
  To: Karl Newman; +Cc: linux-raid


I've managed to find a bug, but it is fairly minor and I cannot see how
it would cause a crash.

The calculation of bitmap->chunks is wrong and will usually be 1 too small.

Does it make a difference for you?  I tend to doubt it.

Thanks,
NeilBrown


diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index 97e73e5..17e2b47 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -1727,8 +1727,7 @@ int bitmap_create(struct mddev *mddev)
 	bitmap->chunkshift = (ffz(~mddev->bitmap_info.chunksize)
 			      - BITMAP_BLOCK_SHIFT);
 
-	/* now that chunksize and chunkshift are set, we can use these macros */
-	chunks = (blocks + bitmap->chunkshift - 1) >>
+	chunks = (blocks + (1 << bitmap->chunkshift) - 1) >>
 			bitmap->chunkshift;
 	pages = (chunks + PAGE_COUNTER_RATIO - 1) / PAGE_COUNTER_RATIO;
 
diff --git a/drivers/md/bitmap.h b/drivers/md/bitmap.h
index 55ca5ae..b44b0aba 100644
--- a/drivers/md/bitmap.h
+++ b/drivers/md/bitmap.h
@@ -101,9 +101,6 @@ typedef __u16 bitmap_counter_t;
 
 #define BITMAP_BLOCK_SHIFT 9
 
-/* how many blocks per chunk? (this is variable) */
-#define CHUNK_BLOCK_RATIO(bitmap) ((bitmap)->mddev->bitmap_info.chunksize >> BITMAP_BLOCK_SHIFT)
-
 #endif
 
 /*
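
The one-line change to "chunks" above can be checked against the array
reported earlier in this thread. What follows is only a sketch under some
assumptions, not the kernel code: it assumes "blocks" is counted in 512-byte
sectors (consistent with BITMAP_BLOCK_SHIFT 9), that the "Sync Size" printed
by mdadm -X is in KiB, and that the chunk size is a power of two. The helper
names chunkshift_for(), chunks_buggy() and chunks_fixed() are hypothetical
and exist only for this illustration.

```c
#include <stdint.h>

/* 512-byte sectors, matching BITMAP_BLOCK_SHIFT in drivers/md/bitmap.h. */
#define BITMAP_BLOCK_SHIFT 9

/*
 * Hypothetical stand-in for "ffz(~chunksize) - BITMAP_BLOCK_SHIFT":
 * log2 of a power-of-two chunk size in bytes, converted to a shift
 * from sectors to chunks.
 */
static unsigned int chunkshift_for(uint64_t chunksize_bytes)
{
	unsigned int shift = 0;

	while ((chunksize_bytes >> shift) > 1)
		shift++;
	return shift - BITMAP_BLOCK_SHIFT;
}

/* Expression before the patch: adds the shift count, not the chunk size. */
static uint64_t chunks_buggy(uint64_t blocks, unsigned int chunkshift)
{
	return (blocks + chunkshift - 1) >> chunkshift;
}

/* Expression after the patch: rounds up to a whole number of chunks. */
static uint64_t chunks_fixed(uint64_t blocks, unsigned int chunkshift)
{
	return (blocks + ((uint64_t)1 << chunkshift) - 1) >> chunkshift;
}
```

With the values from the mdadm -X output (64 MB chunks, Sync Size 29302400
KiB, i.e. 58604800 sectors), chunkshift_for() gives 17, chunks_buggy() gives
447 and chunks_fixed() gives 448. The latter matches the "448 bits (chunks)"
line that mdadm reports. This only illustrates the off-by-one; it does not
by itself explain why the smaller count led to a panic.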


* Re: [BISECT] Kernel panic, RIP bitmap_create
  2012-05-03  6:50       ` NeilBrown
@ 2012-05-04  6:37         ` Karl Newman
  2012-05-04  6:47           ` NeilBrown
  0 siblings, 1 reply; 8+ messages in thread
From: Karl Newman @ 2012-05-04  6:37 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On Wed, May 2, 2012 at 11:50 PM, NeilBrown <neilb@suse.de> wrote:
>
> I've managed to find a bug, but it is fairly minor and I cannot see how
> it would cause a crash.
>
> The calculation of bitmap->chunks is wrong and will usually be 1 too small.
>
> Does it make a difference for you?  I tend to doubt it.
>
> Thanks,
> NeilBrown
>
>
> diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
> index 97e73e5..17e2b47 100644
> --- a/drivers/md/bitmap.c
> +++ b/drivers/md/bitmap.c
> @@ -1727,8 +1727,7 @@ int bitmap_create(struct mddev *mddev)
>        bitmap->chunkshift = (ffz(~mddev->bitmap_info.chunksize)
>                              - BITMAP_BLOCK_SHIFT);
>
> -       /* now that chunksize and chunkshift are set, we can use these macros */
> -       chunks = (blocks + bitmap->chunkshift - 1) >>
> +       chunks = (blocks + (1 << bitmap->chunkshift) - 1) >>
>                        bitmap->chunkshift;
>        pages = (chunks + PAGE_COUNTER_RATIO - 1) / PAGE_COUNTER_RATIO;
>
> diff --git a/drivers/md/bitmap.h b/drivers/md/bitmap.h
> index 55ca5ae..b44b0aba 100644
> --- a/drivers/md/bitmap.h
> +++ b/drivers/md/bitmap.h
> @@ -101,9 +101,6 @@ typedef __u16 bitmap_counter_t;
>
>  #define BITMAP_BLOCK_SHIFT 9
>
> -/* how many blocks per chunk? (this is variable) */
> -#define CHUNK_BLOCK_RATIO(bitmap) ((bitmap)->mddev->bitmap_info.chunksize >> BITMAP_BLOCK_SHIFT)
> -
>  #endif
>
>  /*

Somehow gmail marked this email as read, too, so I missed it. Anyway,
that did it! With this patch applied I can successfully boot! I tested
the offending commit by itself first with the all-zeros uuid patch
applied and confirmed the bug was still present, then applied this
patch and the bug was gone. I also applied this patch to 3.4-rc5 and
confirmed that it was still good.

Thank you for your help on this issue, and thank you for your work as
a kernel developer and supporting this crucial component.

Sincerely,

Karl Newman

* Re: [BISECT] Kernel panic, RIP bitmap_create
  2012-05-04  6:37         ` Karl Newman
@ 2012-05-04  6:47           ` NeilBrown
  2012-05-04 13:54             ` Karl Newman
  0 siblings, 1 reply; 8+ messages in thread
From: NeilBrown @ 2012-05-04  6:47 UTC (permalink / raw)
  To: Karl Newman; +Cc: linux-raid

On Thu, 3 May 2012 23:37:38 -0700 Karl Newman <siliconfiend@gmail.com> wrote:


> Somehow gmail marked this email as read, too, so I missed it. Anyway,
> that did it! With this patch applied I can successfully boot! I tested
> the offending commit by itself first with the all-zeros uuid patch
> applied and confirmed the bug was still present, then applied this
> patch and the bug was gone. I also applied this patch to 3.4-rc5 and
> confirmed that it was still good.

Thanks - and good news.

I'd still like to know how this bug manages to cause a crash (I created an
array that has identical "mdadm -E" and "mdadm -X" output on an x86_64
machine, and couldn't make it crash).

I'll add a Reported-by: and Tested-by: for you and submit to Linus shortly.


> 
> Thank you for your help on this issue, and thank you for your work as
> a kernel developer and supporting this crucial component.

A pleasure - especially when I get to work with helpful and responsive
people :-)

NeilBrown


> 
> Sincerely,
> 
> Karl Newman



* Re: [BISECT] Kernel panic, RIP bitmap_create
  2012-05-04  6:47           ` NeilBrown
@ 2012-05-04 13:54             ` Karl Newman
  0 siblings, 0 replies; 8+ messages in thread
From: Karl Newman @ 2012-05-04 13:54 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On Thu, May 3, 2012 at 11:47 PM, NeilBrown <neilb@suse.de> wrote:
> On Thu, 3 May 2012 23:37:38 -0700 Karl Newman <siliconfiend@gmail.com> wrote:
>
>
>> Somehow gmail marked this email as read, too, so I missed it. Anyway,
>> that did it! With this patch applied I can successfully boot! I tested
>> the offending commit by itself first with the all-zeros uuid patch
>> applied and confirmed the bug was still present, then applied this
>> patch and the bug was gone. I also applied this patch to 3.4-rc5 and
>> confirmed that it was still good.
>
> Thanks - and good news.
>
> I'd still like to know how this bug manages to cause a crash (I created an
> array that has identical "mdadm -E" and "mdadm -X" output on an x86_64
> machine, and couldn't make it crash).
>
> I'll add a Reported-by: and Tested-by: for you and submit to Linus shortly.
>
>
>>
>> Thank you for your help on this issue, and thank you for your work as
>> a kernel developer and supporting this crucial component.
>
> A pleasure - especially when I get to work with helpful and responsive
> people :-)
>
> NeilBrown
>

Well, if it helps any, here's some history: This array dates back to
early 2006 and was created with the Gentoo mdadm tools available at
that time. I had one hard drive fail about 2 years ago and replaced it
with an identical model. During this recent testing I noticed that one
of the array devices had a metadata version of 0.90.02 where the
others were 0.90.00, so possibly that was a side effect of the
replacement. A few weeks ago I had the motherboard or CPU or something
fail on the machine, so I bought replacement hardware and am trying to
bring it up on the old array (which is why I'm using rc kernels--I
need the driver support introduced in 3.4). It was during this rebuild
that I learned about bitmaps and thought it would be a good idea to
add one to the array, so I did. So, the array has had its metadata
written by at least 3 different versions of mdadm scattered over 6-1/2
years. Thus, it may be impossible (or at least extremely difficult)
for you to exactly re-create my situation unless you can scrounge the
old versions and simulate it. I'm suspecting my condition is an
oddball one, which is probably why nobody else has experienced it (or
at least google didn't find anyone talking about it).

Sincerely,

Karl

