Panic doing BLKDISCARD on a raid 5 array on linux 3.17.3

* Panic doing BLKDISCARD on a raid 5 array on linux 3.17.3
@ 2014-12-17 12:00 Anthony Wright
  2014-12-18  5:28 ` NeilBrown
  0 siblings, 1 reply; 5+ messages in thread
From: Anthony Wright @ 2014-12-17 12:00 UTC (permalink / raw)
  To: linux-raid

I've hit a panic bug on stock linux 3.17.3 (which includes the recent
commit on BLKDISCARD in md/raid5.c) running in Dom0 under Xen 4.1.0 that
I've isolated to a BLKDISCARD ioctl issued by mkfs.ext3 and that only
happens on a raid 5 array (it doesn't happen on a raid 1 array).

The system it happens on is remote and I don't have physical access to
it, but the system administrator there is fairly helpful. We're in the
process of commissioning the system which needs to be done tomorrow
(Thursday), so I've only got 24 hours in which I can run any tests you
may want. If necessary I can arrange remote access, but it's a little
complex.

We have 3 512GB SSDs on the system, all with a GPT partition table and
the same partition layout. All the partitions have optimal alignment
according to parted. One of the partitions on each SSD is assembled into
a raid 1 array, another partition is assembled into a raid 5 array. Each
array is then used as the only physical volume in an LVM volume group. I
then create a logical volume on each array and format the logical volume
with mkfs.ext3. I ran mkfs.ext3 in verbose mode and also ran strace on
it in a separate session (though it was over a network) so it's possible
I lost the last few packets of data.

/dev/Test/Test - 400MB LV on raid 1
/dev/Master/Test - 400MB LV on raid 5

A) mkfs.ext3 -E nodiscard -v /dev/Test/Test - succeeds
B) mkfs.ext3 -v /dev/Test/Test - succeeds
C) mkfs.ext3 -E nodiscard -v /dev/Master/Test - succeeds
D) mkfs.ext3 -v /dev/Master/Test - panics

mkfs.ext3 output from (B)
-------------------------
mke2fs 1.42.9 (28-Dec-2013)
fs_types for mke2fs.conf resolution: 'ext3', 'small'
Discarding device blocks: done
Discard succeeded and will return 0s - skipping inode table wipe
Filesystem label=
OS type: Linux
Block size=1024 (log=0)
Fragment size=1024 (log=0)
Stride=4 blocks, Stripe width=4 blocks
51200 inodes, 204800 blocks
10240 blocks (5.00%) reserved for the super user
First data block=1
Maximum filesystem blocks=67371008
25 block groups
8192 blocks per group, 8192 fragments per group
2048 inodes per group
Superblock backups stored on blocks:
    8193, 24577, 40961, 57345, 73729

Allocating group tables: done
Writing inode tables: done
Creating journal (4096 blocks): done
Writing superblocks and filesystem accounting information: done

strace output from (B) around the BLKDISCARD
--------------------------------------------
gettimeofday({1418806647, 890754}, NULL) = 0
gettimeofday({1418806647, 890814}, NULL) = 0
ioctl(3, BLKDISCARD, {0, 3000000010})   = 0
write(1, "Discarding device blocks: ", 26) = 26
write(1, "  1024/204800", 13)           = 13
write(1, "\10\10\10\10\10\10\10\10\10\10\10\10\10", 13) = 13
ioctl(3, BLKDISCARD, {100000, 3000000010}) = 0
write(1, "             ", 13)           = 13
write(1, "\10\10\10\10\10\10\10\10\10\10\10\10\10", 13) = 13
write(1, "done                            "..., 33) = 33
write(1, "Discard succeeded and will retur"..., 65) = 65

mkfs.ext3 output from (D)
-------------------------
mke2fs 1.42.9 (28-Dec-2013)
fs_types for mke2fs.conf resolution: 'ext3', 'small'
<Panic>

strace output from (D) around the BLKDISCARD
--------------------------------------------
gettimeofday({1418809706, 244197}, NULL) = 0
gettimeofday({1418809706, 244259}, NULL) = 0
ioctl(3, BLKDISCARD, {0, 3000000010}
<Panic>
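
For anyone who wants to trigger the same ioctl without going through
mke2fs, the call seen in the strace can be issued directly. The following
is only a sketch under stated assumptions, not what mke2fs does
internally: it packs a (start, length) pair of unsigned 64-bit byte
values, which is the argument layout BLKDISCARD takes, and hands it to
the ioctl. The function names are hypothetical, and blkdiscard_range()
destroys the data in the given range.

```python
import fcntl
import os
import struct

# BLKDISCARD is _IO(0x12, 119) in <linux/fs.h>, i.e. (0x12 << 8) | 119.
BLKDISCARD = (0x12 << 8) | 119


def pack_discard_range(start: int, length: int) -> bytes:
    """Pack the uint64_t range[2] = {start, length} argument (both in bytes)."""
    return struct.pack("=QQ", start, length)


def blkdiscard_range(device: str, start: int, length: int) -> None:
    """Discard `length` bytes starting at byte offset `start` on `device`.

    WARNING: this destroys data in that range and normally requires root.
    """
    fd = os.open(device, os.O_WRONLY)
    try:
        fcntl.ioctl(fd, BLKDISCARD, pack_discard_range(start, length))
    finally:
        os.close(fd)
```

Calling blkdiscard_range("/dev/Master/Test", 0, 1024 * 1024) would, under
these assumptions, discard only the first MiB of the raid 5 logical
volume, which may help narrow down whether the size of the request
matters.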

I have a photograph of the panic output from a previous session which
includes raid5d and blk_finish_plug in the stack trace. Unfortunately I
don't have the top part of the panic, and vger won't accept the
attachment. I also have a photograph of the console output from the
crash at (D); in that case the following is printed to the console every
180 seconds:

INFO: rcu_sched self-detected stall on CPU { 1}
sending NMI to all CPUs:
xen: vector 0x2 is not implemented

thanks,

Anthony Wright

* Re: Panic doing BLKDISCARD on a raid 5 array on linux 3.17.3
  2014-12-17 12:00 Panic doing BLKDISCARD on a raid 5 array on linux 3.17.3 Anthony Wright
@ 2014-12-18  5:28 ` NeilBrown
  2014-12-18 10:21   ` Anthony Wright
  2014-12-18 10:58   ` Anthony Wright
  0 siblings, 2 replies; 5+ messages in thread
From: NeilBrown @ 2014-12-18  5:28 UTC (permalink / raw)
  To: Anthony Wright; +Cc: linux-raid

Presumably you have deliberately enabled DISCARD support by setting the
  raid456.devices_handle_discard_safely

module parameter?  Otherwise the DISCARD should be a no-op.

It is very hard to deduce anything without the full Oops.  Do you have access
to another machine on the same subnet?  If so you could enable netconsole and
capture the full oops on the other machine (all console messages are sent
via UDP at a very low level).
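
On the receiving machine, anything that listens for UDP datagrams will
do. As a rough sketch (assuming the sender's netconsole target is pointed
at this machine and at port 6666, the port netconsole targets by default;
the function names and the log path are hypothetical), a small receiver
that appends every datagram to a file could look like this:

```python
import socket


def open_listener(host: str = "0.0.0.0", port: int = 6666) -> socket.socket:
    """Bind a UDP socket on which netconsole datagrams can be received."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def capture(sock: socket.socket, log_path: str, max_datagrams=None) -> int:
    """Append each received datagram to log_path, unmodified.

    Runs until max_datagrams have been received, or forever if it is None.
    Returns the number of datagrams written.
    """
    count = 0
    with open(log_path, "ab") as log:
        while max_datagrams is None or count < max_datagrams:
            data, _sender = sock.recvfrom(65535)
            log.write(data)
            log.flush()  # keep the file current in case the receiver is interrupted
            count += 1
    return count
```

Running capture(open_listener(), "oops.log") before starting the test
keeps everything the crashing machine manages to send, including the
messages that precede the oops.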


I suspect md/raid5 is sending down a discard request in some way that the
scsi/sata layer or driver doesn't like, but without the full oops, I really
cannot guess what it might be.

NeilBrown

* Re: Panic doing BLKDISCARD on a raid 5 array on linux 3.17.3
  2014-12-18  5:28 ` NeilBrown
@ 2014-12-18 10:21   ` Anthony Wright
  2014-12-18 10:58   ` Anthony Wright
  1 sibling, 0 replies; 5+ messages in thread
From: Anthony Wright @ 2014-12-18 10:21 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On 18/12/2014 05:28, NeilBrown wrote:
> Presumably you have deliberately enabled DISCARD support by setting the
>   raid456.devices_handle_discard_safely
>
> module parameter?  Otherwise the DISCARD should be a no-op.
I haven't touched the raid456.devices_handle_discard_safely setting; I
only learnt about it when I discovered your patch while investigating
the crash. I'm presuming it's at the default value, but if there's a way
to confirm that please let me know.
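
One way to check, assuming the parameter is exported under /sys/module
(the usual place for readable module parameters; the exact path below is
an assumption and the helper name is hypothetical), is to read it back
from sysfs. Kernel boolean parameters read as "Y" or "N":

```python
from pathlib import Path

# Assumed location of the parameter; present only while raid456 is loaded.
PARAM = "/sys/module/raid456/parameters/devices_handle_discard_safely"


def read_bool_param(path: str = PARAM):
    """Return True or False for a boolean module parameter, None if absent."""
    try:
        text = Path(path).read_text().strip()
    except FileNotFoundError:
        return None
    return text in ("Y", "y", "1")
```

A result of False would mean the parameter is off, None that the file
does not exist on this kernel or the module is not loaded.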
> It is very hard to deduce anything without the full Oops.  Do you have access
> to another machine on the same subnet?  If so you could enable netconsole and
> capture the full oops on the other machine (all console messages are sent
> via UDP at a very low level).
I've got netconsole working, but it doesn't always panic and it takes a
while to get the system reset. Below is the output I got from the most
recent crash:

[63207.177400] BUG: unable to handle kernel paging request at 0000001e00008000

Anthony.

* Re: Panic doing BLKDISCARD on a raid 5 array on linux 3.17.3
  2014-12-18  5:28 ` NeilBrown
  2014-12-18 10:21   ` Anthony Wright
@ 2014-12-18 10:58   ` Anthony Wright
  2014-12-18 18:05     ` Chris Murphy
  1 sibling, 1 reply; 5+ messages in thread
From: Anthony Wright @ 2014-12-18 10:58 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On 18/12/2014 05:28, NeilBrown wrote:
> I suspect md/raid5 is sending down a discard request in some way that the
> scsi/sata layer or driver doesn't like, but without the full oops, I really
> cannot guess what it might be.
We've tried 4 times to reproduce the panic we got originally, but
unfortunately with no luck. Below are the outputs from all four crashes
as captured by netconsole, in case they are any help.

Crash #1
[63207.177400] BUG: unable to handle kernel paging request at 0000001e00008000

Crash #2
[  531.210340] BUG: unable to handle kernel paging request at 0000000100000000
[  531.210514] IP: [<ffffffff8128788e>] __blk_segment_map_sg+0x5e/0x1b0
[  531.210632] PGD 20187f067 PUD 0
[  531.210783] Oops: 0000 [#1] SMP
[  531.210932] Modules linked in: eql netconsole configfs raid456
async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq
xt_multiport xt_tcpudp iptable_filter ip_tables x_tables aesni_intel
aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd ppdev

Crash #3
[  268.115094] general protection fault: 0000 [#1] SMP
[  268.115263] Modules linked in:

Crash #4
[  276.325157] general protection fault: 0000 [#1] SMP




* Re: Panic doing BLKDISCARD on a raid 5 array on linux 3.17.3
  2014-12-18 10:58   ` Anthony Wright
@ 2014-12-18 18:05     ` Chris Murphy
  0 siblings, 0 replies; 5+ messages in thread
From: Chris Murphy @ 2014-12-18 18:05 UTC (permalink / raw)
  Cc: linux-raid

Are you trimming these? Or do they literally end with nothing else
reported? In any case you must be trimming what comes before what
you've posted and I don't think that's a good idea, often there's
something wrong happening well before an oops gets reported.


-- 
Chris Murphy
