* What the heck happened to my array? (No apparent data loss).
@ 2011-04-03 13:32 Brad Campbell
2011-04-03 15:47 ` Roberto Spadim
0 siblings, 1 reply; 12+ messages in thread
From: Brad Campbell @ 2011-04-03 13:32 UTC (permalink / raw)
To: linux-raid
2.6.38.2
x86_64
10 x 1TB SATA drives in a single RAID-6
Here is the chain of events.
Saturday morning I started a reshape on a 10 element RAID-6. Simply
changing the Chunk size from 512k to 64k. This was going to take about
4.5 days according to the initial estimates.
I then went away for the weekend and came home to a wedged array.
Here is the chain of events that caused it.
This occurred about 1 minute after my scheduled morning SMART long (it
is Sunday after all) began.
Apr 3 03:19:08 srv kernel: [288180.455339] sd 0:0:12:0: [sdd] Unhandled
error code
Apr 3 03:19:08 srv kernel: [288180.455359] sd 0:0:12:0: [sdd] Result:
hostbyte=0x04 driverbyte=0x00
Apr 3 03:19:08 srv kernel: [288180.455377] sd 0:0:12:0: [sdd] CDB:
cdb[0]=0x2a: 2a 00 00 00 00 08 00 00 02 00
Apr 3 03:19:08 srv kernel: [288180.455415] end_request: I/O error, dev
sdd, sector 8
Apr 3 03:19:08 srv kernel: [288180.455449] end_request: I/O error, dev
sdd, sector 8
Apr 3 03:19:08 srv kernel: [288180.455462] md: super_written gets
error=-5, uptodate=0
Apr 3 03:19:08 srv kernel: [288180.455477] md/raid:md0: Disk failure on
sdd, disabling device.
Apr 3 03:19:08 srv kernel: [288180.455480] md/raid:md0: Operation
continuing on 9 devices.
Apr 3 03:19:08 srv kernel: [288180.472914] md: md0: reshape done.
Apr 3 03:19:08 srv kernel: [288180.472983] md: delaying data-check of
md5 until md3 has finished (they share one or more physical units)
Apr 3 03:19:08 srv kernel: [288180.473002] md: delaying data-check of
md4 until md6 has finished (they share one or more physical units)
Apr 3 03:19:08 srv kernel: [288180.473030] md: delaying data-check of
md6 until md5 has finished (they share one or more physical units)
Apr 3 03:19:08 srv kernel: [288180.473047] md: delaying data-check of
md3 until md1 has finished (they share one or more physical units)
Apr 3 03:19:08 srv kernel: [288180.551450] md: reshape of RAID array md0
Apr 3 03:19:08 srv kernel: [288180.551468] md: minimum _guaranteed_
speed: 200000 KB/sec/disk.
Apr 3 03:19:08 srv kernel: [288180.551483] md: using maximum available
idle IO bandwidth (but not more than 200000 KB/sec) for reshape.
Apr 3 03:19:08 srv kernel: [288180.551514] md: using 128k window, over
a total of 976759808 blocks.
Apr 3 03:19:08 srv kernel: [288180.620089] sd 0:0:12:0: [sdd]
Synchronizing SCSI cache
Apr 3 03:19:08 srv mdadm[4803]: RebuildFinished event detected on md
device /dev/md0
Apr 3 03:19:08 srv mdadm[4803]: Fail event detected on md device
/dev/md0, component device /dev/sdd
Apr 3 03:19:08 srv mdadm[4803]: RebuildStarted event detected on md
device /dev/md0
Apr 3 03:19:10 srv kernel: [288182.614918] scsi 0:0:12:0: Direct-Access
ATA MAXTOR STM310003 MX1A PQ: 0 ANSI: 5
Apr 3 03:19:10 srv kernel: [288182.615312] sd 0:0:12:0: Attached scsi
generic sg3 type 0
Apr 3 03:19:10 srv kernel: [288182.618262] sd 0:0:12:0: [sdq]
1953525168 512-byte logical blocks: (1.00 TB/931 GiB)
Apr 3 03:19:10 srv kernel: [288182.736998] sd 0:0:12:0: [sdq] Write
Protect is off
Apr 3 03:19:10 srv kernel: [288182.737019] sd 0:0:12:0: [sdq] Mode
Sense: 73 00 00 08
Apr 3 03:19:10 srv kernel: [288182.740521] sd 0:0:12:0: [sdq] Write
cache: enabled, read cache: enabled, doesn't support DPO or FUA
Apr 3 03:19:10 srv kernel: [288182.848999] sdq: unknown partition table
Apr 3 03:19:10 srv ata_id[28453]: HDIO_GET_IDENTITY failed for '/dev/sdq'
Apr 3 03:19:10 srv kernel: [288182.970091] sd 0:0:12:0: [sdq] Attached
SCSI disk
Apr 3 03:20:01 srv /USR/SBIN/CRON[28624]: (brad) CMD ([ -z
"`/usr/bin/pgrep -u brad collect`" ] && /usr/bin/screen -X -S brad-bot
screen /home/brad/bin/collect-thermostat)
Apr 3 03:20:01 srv /USR/SBIN/CRON[28625]: (root) CMD ([ -z
`/usr/bin/pgrep -u root keepalive` ] && /home/brad/bin/launch-keepalive)
Apr 3 03:20:01 srv /USR/SBIN/CRON[28626]: (brad) CMD ([ -z "`screen
-list | grep brad-bot`" ] && /home/brad/bin/botstart)
Apr 3 03:20:01 srv /USR/SBIN/CRON[28628]: (root) CMD (if [ -x
/usr/bin/mrtg ] && [ -r /etc/mrtg.cfg ]; then mkdir -p /var/log/mrtg ;
env LANG=C /usr/bin/mrtg /etc/mrtg.cfg 2>&1 | tee -a
/var/log/mrtg/mrtg.log ; fi)
Apr 3 03:20:01 srv /USR/SBIN/CRON[28627]: (brad) CMD
(/home/brad/rrd/rrd-create-graphs)
Apr 3 03:20:01 srv /USR/SBIN/CRON[28590]: (CRON) error (grandchild
#28625 failed with exit status 1)
Apr 3 03:20:01 srv /USR/SBIN/CRON[28589]: (CRON) error (grandchild
#28626 failed with exit status 1)
Apr 3 03:20:01 srv /USR/SBIN/CRON[28587]: (CRON) error (grandchild
#28624 failed with exit status 1)
Apr 3 03:22:10 srv kernel: [288363.070094] INFO: task jbd2/md0-8:2647
blocked for more than 120 seconds.
Apr 3 03:22:10 srv kernel: [288363.070114] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 3 03:22:10 srv kernel: [288363.070132] jbd2/md0-8 D
ffff88041aa52948 0 2647 2 0x00000000
Apr 3 03:22:10 srv kernel: [288363.070154] ffff88041aa526f0
0000000000000046 0000000000000000 ffff8804196769b0
Apr 3 03:22:10 srv kernel: [288363.070178] 0000000000011180
ffff88041bdc5fd8 0000000000004000 0000000000011180
Apr 3 03:22:10 srv kernel: [288363.070201] ffff88041bdc4010
ffff88041aa52950 ffff88041bdc5fd8 ffff88041aa52948
Apr 3 03:22:10 srv kernel: [288363.070224] Call Trace:
Apr 3 03:22:10 srv kernel: [288363.070246] [<ffffffff8104e4c6>] ?
queue_work_on+0x16/0x20
Apr 3 03:22:10 srv kernel: [288363.070266] [<ffffffff812e6bfd>] ?
md_write_start+0xad/0x190
Apr 3 03:22:10 srv kernel: [288363.070283] [<ffffffff81052b90>] ?
autoremove_wake_function+0x0/0x30
Apr 3 03:22:10 srv kernel: [288363.070299] [<ffffffff812e16f5>] ?
make_request+0x35/0x600
Apr 3 03:22:10 srv kernel: [288363.070317] [<ffffffff8108463b>] ?
__alloc_pages_nodemask+0x10b/0x810
Apr 3 03:22:10 srv kernel: [288363.070335] [<ffffffff81142042>] ?
T.1015+0x32/0x90
Apr 3 03:22:10 srv kernel: [288363.070350] [<ffffffff812e6a24>] ?
md_make_request+0xd4/0x200
Apr 3 03:22:10 srv kernel: [288363.070366] [<ffffffff81142218>] ?
ext4_map_blocks+0x178/0x210
Apr 3 03:22:10 srv kernel: [288363.070382] [<ffffffff811b6e84>] ?
generic_make_request+0x144/0x2f0
Apr 3 03:22:10 srv kernel: [288363.070397] [<ffffffff8116e89d>] ?
jbd2_journal_file_buffer+0x3d/0x70
Apr 3 03:22:10 srv kernel: [288363.070413] [<ffffffff811b708c>] ?
submit_bio+0x5c/0xd0
Apr 3 03:22:10 srv kernel: [288363.070430] [<ffffffff810e61d5>] ?
submit_bh+0xe5/0x120
Apr 3 03:22:10 srv kernel: [288363.070445] [<ffffffff811709b1>] ?
jbd2_journal_commit_transaction+0x441/0x1180
Apr 3 03:22:10 srv kernel: [288363.070466] [<ffffffff81044893>] ?
lock_timer_base+0x33/0x70
Apr 3 03:22:10 srv kernel: [288363.070480] [<ffffffff81052b90>] ?
autoremove_wake_function+0x0/0x30
Apr 3 03:22:10 srv kernel: [288363.070498] [<ffffffff81174871>] ?
kjournald2+0xb1/0x1e0
Apr 3 03:22:10 srv kernel: [288363.070511] [<ffffffff81052b90>] ?
autoremove_wake_function+0x0/0x30
Apr 3 03:22:10 srv kernel: [288363.070527] [<ffffffff811747c0>] ?
kjournald2+0x0/0x1e0
Apr 3 03:22:10 srv kernel: [288363.070544] [<ffffffff811747c0>] ?
kjournald2+0x0/0x1e0
Apr 3 03:22:10 srv kernel: [288363.070557] [<ffffffff81052716>] ?
kthread+0x96/0xa0
Apr 3 03:22:10 srv kernel: [288363.070573] [<ffffffff81003154>] ?
kernel_thread_helper+0x4/0x10
Apr 3 03:22:10 srv kernel: [288363.070588] [<ffffffff81052680>] ?
kthread+0x0/0xa0
Apr 3 03:22:10 srv kernel: [288363.070602] [<ffffffff81003150>] ?
kernel_thread_helper+0x0/0x10
So apparently sdd suffered an unknown failure (it happens) and the array
kicked it out (as it should). But 120 seconds later all tasks accessing
that array triggered their 120 second hung-task warning and were all stuck
in the D state.
At the time the array was 12.1% of the way through a reshape. I had to
reboot the machine to get it back up and it's now continuing the reshape
on 9 drives.
brad@srv:~$ cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md0 : active raid6 sdc[0] sdh[9] sda[8] sde[7] sdg[5] sdb[4] sdf[3]
sdm[2] sdl[1]
7814078464 blocks super 1.2 level 6, 512k chunk, algorithm 2
[10/9] [UUUUUU_UUU]
[===>.................] reshape = 16.5% (162091008/976759808)
finish=5778.6min speed=2349K/sec
To make matters more confusing, the other arrays on the machine were in
the middle of their "Debian's first Sunday of every month" "check" scrub.
I have the full syslog and can probably procure any other information
that might be useful. I don't think I've lost any data, the machine
continued reshaping and we're all moving along nicely. I just wanted to
report it and offer assistance in diagnosing it should that be requested.
Regards,
Brad
* Re: What the heck happened to my array? (No apparent data loss).
2011-04-03 13:32 What the heck happened to my array? (No apparent data loss) Brad Campbell
@ 2011-04-03 15:47 ` Roberto Spadim
2011-04-04 5:59 ` Brad Campbell
0 siblings, 1 reply; 12+ messages in thread
From: Roberto Spadim @ 2011-04-03 15:47 UTC (permalink / raw)
To: Brad Campbell; +Cc: linux-raid
What kernel version? More information about your Linux box?
2011/4/3 Brad Campbell <lists2009@fnarfbargle.com>:
> 2.6.38.2
> x86_64
> 10 x 1TB SATA drives in a single RAID-6
>
> Here is the chain of events.
>
> Saturday morning I started a reshape on a 10 element RAID-6. Simply changing
> the Chunk size from 512k to 64k. This was going to take about 4.5 days
> according to the initial estimates.
>
> I then went away for the weekend and came home to a wedged array.
> Here is the chain of events that caused it.
>
> This occurred about 1 minute after my scheduled morning SMART long (it is
> Sunday after all) began.
>
> Apr 3 03:19:08 srv kernel: [288180.455339] sd 0:0:12:0: [sdd] Unhandled
> error code
> Apr 3 03:19:08 srv kernel: [288180.455359] sd 0:0:12:0: [sdd] Result:
> hostbyte=0x04 driverbyte=0x00
> Apr 3 03:19:08 srv kernel: [288180.455377] sd 0:0:12:0: [sdd] CDB:
> cdb[0]=0x2a: 2a 00 00 00 00 08 00 00 02 00
> Apr 3 03:19:08 srv kernel: [288180.455415] end_request: I/O error, dev sdd,
> sector 8
> Apr 3 03:19:08 srv kernel: [288180.455449] end_request: I/O error, dev sdd,
> sector 8
> Apr 3 03:19:08 srv kernel: [288180.455462] md: super_written gets error=-5,
> uptodate=0
> Apr 3 03:19:08 srv kernel: [288180.455477] md/raid:md0: Disk failure on
> sdd, disabling device.
> Apr 3 03:19:08 srv kernel: [288180.455480] md/raid:md0: Operation
> continuing on 9 devices.
> Apr 3 03:19:08 srv kernel: [288180.472914] md: md0: reshape done.
> Apr 3 03:19:08 srv kernel: [288180.472983] md: delaying data-check of md5
> until md3 has finished (they share one or more physical units)
> Apr 3 03:19:08 srv kernel: [288180.473002] md: delaying data-check of md4
> until md6 has finished (they share one or more physical units)
> Apr 3 03:19:08 srv kernel: [288180.473030] md: delaying data-check of md6
> until md5 has finished (they share one or more physical units)
> Apr 3 03:19:08 srv kernel: [288180.473047] md: delaying data-check of md3
> until md1 has finished (they share one or more physical units)
> Apr 3 03:19:08 srv kernel: [288180.551450] md: reshape of RAID array md0
> Apr 3 03:19:08 srv kernel: [288180.551468] md: minimum _guaranteed_ speed:
> 200000 KB/sec/disk.
> Apr 3 03:19:08 srv kernel: [288180.551483] md: using maximum available idle
> IO bandwidth (but not more than 200000 KB/sec) for reshape.
> Apr 3 03:19:08 srv kernel: [288180.551514] md: using 128k window, over a
> total of 976759808 blocks.
> Apr 3 03:19:08 srv kernel: [288180.620089] sd 0:0:12:0: [sdd] Synchronizing
> SCSI cache
> Apr 3 03:19:08 srv mdadm[4803]: RebuildFinished event detected on md device
> /dev/md0
> Apr 3 03:19:08 srv mdadm[4803]: Fail event detected on md device /dev/md0,
> component device /dev/sdd
> Apr 3 03:19:08 srv mdadm[4803]: RebuildStarted event detected on md device
> /dev/md0
> Apr 3 03:19:10 srv kernel: [288182.614918] scsi 0:0:12:0: Direct-Access
> ATA MAXTOR STM310003 MX1A PQ: 0 ANSI: 5
> Apr 3 03:19:10 srv kernel: [288182.615312] sd 0:0:12:0: Attached scsi
> generic sg3 type 0
> Apr 3 03:19:10 srv kernel: [288182.618262] sd 0:0:12:0: [sdq] 1953525168
> 512-byte logical blocks: (1.00 TB/931 GiB)
> Apr 3 03:19:10 srv kernel: [288182.736998] sd 0:0:12:0: [sdq] Write Protect
> is off
> Apr 3 03:19:10 srv kernel: [288182.737019] sd 0:0:12:0: [sdq] Mode Sense:
> 73 00 00 08
> Apr 3 03:19:10 srv kernel: [288182.740521] sd 0:0:12:0: [sdq] Write cache:
> enabled, read cache: enabled, doesn't support DPO or FUA
> Apr 3 03:19:10 srv kernel: [288182.848999] sdq: unknown partition table
> Apr 3 03:19:10 srv ata_id[28453]: HDIO_GET_IDENTITY failed for '/dev/sdq'
> Apr 3 03:19:10 srv kernel: [288182.970091] sd 0:0:12:0: [sdq] Attached SCSI
> disk
> Apr 3 03:20:01 srv /USR/SBIN/CRON[28624]: (brad) CMD ([ -z "`/usr/bin/pgrep
> -u brad collect`" ] && /usr/bin/screen -X -S brad-bot screen
> /home/brad/bin/collect-thermostat)
> Apr 3 03:20:01 srv /USR/SBIN/CRON[28625]: (root) CMD ([ -z `/usr/bin/pgrep
> -u root keepalive` ] && /home/brad/bin/launch-keepalive)
> Apr 3 03:20:01 srv /USR/SBIN/CRON[28626]: (brad) CMD ([ -z "`screen -list |
> grep brad-bot`" ] && /home/brad/bin/botstart)
> Apr 3 03:20:01 srv /USR/SBIN/CRON[28628]: (root) CMD (if [ -x /usr/bin/mrtg
> ] && [ -r /etc/mrtg.cfg ]; then mkdir -p /var/log/mrtg ; env LANG=C
> /usr/bin/mrtg /etc/mrtg.cfg 2>&1 | tee -a /var/log/mrtg/mrtg.log ; fi)
> Apr 3 03:20:01 srv /USR/SBIN/CRON[28627]: (brad) CMD
> (/home/brad/rrd/rrd-create-graphs)
> Apr 3 03:20:01 srv /USR/SBIN/CRON[28590]: (CRON) error (grandchild #28625
> failed with exit status 1)
> Apr 3 03:20:01 srv /USR/SBIN/CRON[28589]: (CRON) error (grandchild #28626
> failed with exit status 1)
> Apr 3 03:20:01 srv /USR/SBIN/CRON[28587]: (CRON) error (grandchild #28624
> failed with exit status 1)
> Apr 3 03:22:10 srv kernel: [288363.070094] INFO: task jbd2/md0-8:2647
> blocked for more than 120 seconds.
> Apr 3 03:22:10 srv kernel: [288363.070114] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Apr 3 03:22:10 srv kernel: [288363.070132] jbd2/md0-8 D
> ffff88041aa52948 0 2647 2 0x00000000
> Apr 3 03:22:10 srv kernel: [288363.070154] ffff88041aa526f0
> 0000000000000046 0000000000000000 ffff8804196769b0
> Apr 3 03:22:10 srv kernel: [288363.070178] 0000000000011180
> ffff88041bdc5fd8 0000000000004000 0000000000011180
> Apr 3 03:22:10 srv kernel: [288363.070201] ffff88041bdc4010
> ffff88041aa52950 ffff88041bdc5fd8 ffff88041aa52948
> Apr 3 03:22:10 srv kernel: [288363.070224] Call Trace:
> Apr 3 03:22:10 srv kernel: [288363.070246] [<ffffffff8104e4c6>] ?
> queue_work_on+0x16/0x20
> Apr 3 03:22:10 srv kernel: [288363.070266] [<ffffffff812e6bfd>] ?
> md_write_start+0xad/0x190
> Apr 3 03:22:10 srv kernel: [288363.070283] [<ffffffff81052b90>] ?
> autoremove_wake_function+0x0/0x30
> Apr 3 03:22:10 srv kernel: [288363.070299] [<ffffffff812e16f5>] ?
> make_request+0x35/0x600
> Apr 3 03:22:10 srv kernel: [288363.070317] [<ffffffff8108463b>] ?
> __alloc_pages_nodemask+0x10b/0x810
> Apr 3 03:22:10 srv kernel: [288363.070335] [<ffffffff81142042>] ?
> T.1015+0x32/0x90
> Apr 3 03:22:10 srv kernel: [288363.070350] [<ffffffff812e6a24>] ?
> md_make_request+0xd4/0x200
> Apr 3 03:22:10 srv kernel: [288363.070366] [<ffffffff81142218>] ?
> ext4_map_blocks+0x178/0x210
> Apr 3 03:22:10 srv kernel: [288363.070382] [<ffffffff811b6e84>] ?
> generic_make_request+0x144/0x2f0
> Apr 3 03:22:10 srv kernel: [288363.070397] [<ffffffff8116e89d>] ?
> jbd2_journal_file_buffer+0x3d/0x70
> Apr 3 03:22:10 srv kernel: [288363.070413] [<ffffffff811b708c>] ?
> submit_bio+0x5c/0xd0
> Apr 3 03:22:10 srv kernel: [288363.070430] [<ffffffff810e61d5>] ?
> submit_bh+0xe5/0x120
> Apr 3 03:22:10 srv kernel: [288363.070445] [<ffffffff811709b1>] ?
> jbd2_journal_commit_transaction+0x441/0x1180
> Apr 3 03:22:10 srv kernel: [288363.070466] [<ffffffff81044893>] ?
> lock_timer_base+0x33/0x70
> Apr 3 03:22:10 srv kernel: [288363.070480] [<ffffffff81052b90>] ?
> autoremove_wake_function+0x0/0x30
> Apr 3 03:22:10 srv kernel: [288363.070498] [<ffffffff81174871>] ?
> kjournald2+0xb1/0x1e0
> Apr 3 03:22:10 srv kernel: [288363.070511] [<ffffffff81052b90>] ?
> autoremove_wake_function+0x0/0x30
> Apr 3 03:22:10 srv kernel: [288363.070527] [<ffffffff811747c0>] ?
> kjournald2+0x0/0x1e0
> Apr 3 03:22:10 srv kernel: [288363.070544] [<ffffffff811747c0>] ?
> kjournald2+0x0/0x1e0
> Apr 3 03:22:10 srv kernel: [288363.070557] [<ffffffff81052716>] ?
> kthread+0x96/0xa0
> Apr 3 03:22:10 srv kernel: [288363.070573] [<ffffffff81003154>] ?
> kernel_thread_helper+0x4/0x10
> Apr 3 03:22:10 srv kernel: [288363.070588] [<ffffffff81052680>] ?
> kthread+0x0/0xa0
> Apr 3 03:22:10 srv kernel: [288363.070602] [<ffffffff81003150>] ?
> kernel_thread_helper+0x0/0x10
>
> So apparently sdd suffered an unknown failure (it happens) and the array
> kicked it out (as it should). But 120 seconds later all tasks accessing that
> array triggered their 120 second hung-task warning and were all stuck in the D
> state.
>
> At the time the array was 12.1% of the way through a reshape. I had to
> reboot the machine to get it back up and it's now continuing the reshape on
> 9 drives.
>
> brad@srv:~$ cat /proc/mdstat
> Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
> md0 : active raid6 sdc[0] sdh[9] sda[8] sde[7] sdg[5] sdb[4] sdf[3] sdm[2]
> sdl[1]
> 7814078464 blocks super 1.2 level 6, 512k chunk, algorithm 2 [10/9]
> [UUUUUU_UUU]
> [===>.................] reshape = 16.5% (162091008/976759808)
> finish=5778.6min speed=2349K/sec
>
>
>
> To make matters more confusing, the other arrays on the machine were in the
> middle of their "Debian's first Sunday of every month" "check" scrub.
>
> I have the full syslog and can probably procure any other information that
> might be useful. I don't think I've lost any data, the machine continued
> reshaping and we're all moving along nicely. I just wanted to report it and
> offer assistance in diagnosing it should that be requested.
>
> Regards,
> Brad
--
Roberto Spadim
Spadim Technology / SPAEmpresarial
* Re: What the heck happened to my array? (No apparent data loss).
2011-04-03 15:47 ` Roberto Spadim
@ 2011-04-04 5:59 ` Brad Campbell
2011-04-04 16:49 ` Roberto Spadim
0 siblings, 1 reply; 12+ messages in thread
From: Brad Campbell @ 2011-04-04 5:59 UTC (permalink / raw)
To: Roberto Spadim; +Cc: linux-raid
On 03/04/11 23:47, Roberto Spadim wrote:
> What kernel version? More information about your Linux box?
The kernel version and architecture were the first 2 lines of the E-mail
you top posted over.
What would you like to know about the box? It's a 6 core Phenom-II with
16G of ram. 2 LSI SAS 9240 controllers configured with 10 x 1TB SATA
Drives in a RAID-6(md0) & 3 x 750GB SATA drives in a RAID-5(md2).
The boot drives are a pair of 1TB SATA drives in multiple RAID-1's using
the on-board AMD chipset controller and there is a 64GB SSD on a
separate PCI-E Marvell 7042m Controller.
The array in question is :
root@srv:~# mdadm --detail /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Sat Jan 8 11:25:17 2011
Raid Level : raid6
Array Size : 7814078464 (7452.09 GiB 8001.62 GB)
Used Dev Size : 976759808 (931.51 GiB 1000.20 GB)
Raid Devices : 10
Total Devices : 9
Persistence : Superblock is persistent
Update Time : Mon Apr 4 13:53:59 2011
State : clean, degraded, recovering
Active Devices : 9
Working Devices : 9
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 512K
Reshape Status : 29% complete
New Chunksize : 64K
Name : srv:server (local to host srv)
UUID : d00a11d7:fe0435af:07c8d4d6:e3b8e34e
Events : 429198
Number Major Minor RaidDevice State
0 8 32 0 active sync /dev/sdc
1 8 176 1 active sync /dev/sdl
2 8 192 2 active sync /dev/sdm
3 8 80 3 active sync /dev/sdf
4 8 16 4 active sync /dev/sdb
5 8 96 5 active sync /dev/sdg
6 0 0 6 removed
7 8 64 7 active sync /dev/sde
8 8 0 8 active sync /dev/sda
9 8 112 9 active sync /dev/sdh
root@srv:~#
Subsequent investigation has shown sdd has a pending reallocation and I
can only assume the unidentified IO error was as a result of tripping up
on that. It still does not explain why all IO to the array froze after
the drive was kicked.
* Re: What the heck happened to my array? (No apparent data loss).
2011-04-04 5:59 ` Brad Campbell
@ 2011-04-04 16:49 ` Roberto Spadim
2011-04-05 0:47 ` What the heck happened to my array? Brad Campbell
0 siblings, 1 reply; 12+ messages in thread
From: Roberto Spadim @ 2011-04-04 16:49 UTC (permalink / raw)
To: Brad Campbell; +Cc: linux-raid
I don't know, but this happened to me on an HP server with Linux
2.6.37; I changed to an older kernel release and the problem went away.
Check with Neil and the other md guys what the real problem is;
maybe the realtime module and other changes inside the kernel are the
problem, maybe not...
Just a quick idea for a workaround: try an older kernel.
2011/4/4 Brad Campbell <lists2009@fnarfbargle.com>:
> On 03/04/11 23:47, Roberto Spadim wrote:
>>
>> What kernel version? More information about your Linux box?
>
> The kernel version and architecture were the first 2 lines of the E-mail you
> top posted over.
>
> What would you like to know about the box? It's a 6 core Phenom-II with 16G
> of ram. 2 LSI SAS 9240 controllers configured with 10 x 1TB SATA Drives in a
> RAID-6(md0) & 3 x 750GB SATA drives in a RAID-5(md2).
>
> The boot drives are a pair of 1TB SATA drives in multiple RAID-1's using the
> on-board AMD chipset controller and there is a 64GB SSD on a separate PCI-E
> Marvell 7042m Controller.
>
> The array in question is :
>
> root@srv:~# mdadm --detail /dev/md0
> /dev/md0:
> Version : 1.2
> Creation Time : Sat Jan 8 11:25:17 2011
> Raid Level : raid6
> Array Size : 7814078464 (7452.09 GiB 8001.62 GB)
> Used Dev Size : 976759808 (931.51 GiB 1000.20 GB)
> Raid Devices : 10
> Total Devices : 9
> Persistence : Superblock is persistent
>
> Update Time : Mon Apr 4 13:53:59 2011
> State : clean, degraded, recovering
> Active Devices : 9
> Working Devices : 9
> Failed Devices : 0
> Spare Devices : 0
>
> Layout : left-symmetric
> Chunk Size : 512K
>
> Reshape Status : 29% complete
> New Chunksize : 64K
>
> Name : srv:server (local to host srv)
> UUID : d00a11d7:fe0435af:07c8d4d6:e3b8e34e
> Events : 429198
>
> Number Major Minor RaidDevice State
> 0 8 32 0 active sync /dev/sdc
> 1 8 176 1 active sync /dev/sdl
> 2 8 192 2 active sync /dev/sdm
> 3 8 80 3 active sync /dev/sdf
> 4 8 16 4 active sync /dev/sdb
> 5 8 96 5 active sync /dev/sdg
> 6 0 0 6 removed
> 7 8 64 7 active sync /dev/sde
> 8 8 0 8 active sync /dev/sda
> 9 8 112 9 active sync /dev/sdh
> root@srv:~#
>
> Subsequent investigation has shown sdd has a pending reallocation and I can
> only assume the unidentified IO error was as a result of tripping up on
> that. It still does not explain why all IO to the array froze after the
> drive was kicked.
>
--
Roberto Spadim
Spadim Technology / SPAEmpresarial
* Re: What the heck happened to my array?
2011-04-04 16:49 ` Roberto Spadim
@ 2011-04-05 0:47 ` Brad Campbell
2011-04-05 6:10 ` NeilBrown
0 siblings, 1 reply; 12+ messages in thread
From: Brad Campbell @ 2011-04-05 0:47 UTC (permalink / raw)
To: linux-raid; +Cc: neilb
On 05/04/11 00:49, Roberto Spadim wrote:
> I don't know, but this happened to me on an HP server with Linux
> 2.6.37; I changed to an older kernel release and the problem went away.
> Check with Neil and the other md guys what the real problem is;
> maybe the realtime module and other changes inside the kernel are the
> problem, maybe not...
> Just a quick idea for a workaround: try an older kernel.
>
Quick precis:
- Started reshape 512k to 64k chunk size.
- sdd got bad sector and was kicked.
- Array froze all IO.
- Reboot required to get system back.
- Restarted reshape with 9 drives.
- sdl suffered IO error and was kicked
- Array froze all IO.
- Reboot required to get system back.
- Array will no longer mount with 8/10 drives.
- Mdadm 3.1.5 segfaults when trying to start reshape.
Naively tried to run it under gdb to get a backtrace but was unable
to stop it forking
- Got array started with mdadm 3.2.1
- Attempted to re-add sdd/sdl (now marked as spares)
root@srv:~/mdadm-3.1.5# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md0 : active raid6 sdl[1](S) sdd[6](S) sdc[0] sdh[9] sda[8] sde[7]
sdg[5] sdb[4] sdf[3] sdm[2]
7814078464 blocks super 1.2 level 6, 512k chunk, algorithm 2
[10/8] [U_UUUU_UUU]
resync=DELAYED
md2 : active raid5 sdi[0] sdk[3] sdj[1]
1465146368 blocks super 1.2 level 5, 64k chunk, algorithm 2 [3/3]
[UUU]
md6 : active raid1 sdo6[0] sdn6[1]
821539904 blocks [2/2] [UU]
md5 : active raid1 sdo5[0] sdn5[1]
104864192 blocks [2/2] [UU]
md4 : active raid1 sdo3[0] sdn3[1]
20980800 blocks [2/2] [UU]
md3 : active (auto-read-only) raid1 sdo2[0] sdn2[1]
8393856 blocks [2/2] [UU]
md1 : active raid1 sdo1[0] sdn1[1]
20980736 blocks [2/2] [UU]
unused devices: <none>
[ 303.640776] md: bind<sdl>
[ 303.677461] md: bind<sdm>
[ 303.837358] md: bind<sdf>
[ 303.846291] md: bind<sdb>
[ 303.851476] md: bind<sdg>
[ 303.860725] md: bind<sdd>
[ 303.861055] md: bind<sde>
[ 303.861982] md: bind<sda>
[ 303.862830] md: bind<sdh>
[ 303.863128] md: bind<sdc>
[ 303.863306] md: kicking non-fresh sdd from array!
[ 303.863353] md: unbind<sdd>
[ 303.900207] md: export_rdev(sdd)
[ 303.900260] md: kicking non-fresh sdl from array!
[ 303.900306] md: unbind<sdl>
[ 303.940100] md: export_rdev(sdl)
[ 303.942181] md/raid:md0: reshape will continue
[ 303.942242] md/raid:md0: device sdc operational as raid disk 0
[ 303.942285] md/raid:md0: device sdh operational as raid disk 9
[ 303.942327] md/raid:md0: device sda operational as raid disk 8
[ 303.942368] md/raid:md0: device sde operational as raid disk 7
[ 303.942409] md/raid:md0: device sdg operational as raid disk 5
[ 303.942449] md/raid:md0: device sdb operational as raid disk 4
[ 303.942490] md/raid:md0: device sdf operational as raid disk 3
[ 303.942531] md/raid:md0: device sdm operational as raid disk 2
[ 303.943733] md/raid:md0: allocated 10572kB
[ 303.943866] md/raid:md0: raid level 6 active with 8 out of 10
devices, algorithm 2
[ 303.943912] RAID conf printout:
[ 303.943916] --- level:6 rd:10 wd:8
[ 303.943920] disk 0, o:1, dev:sdc
[ 303.943924] disk 2, o:1, dev:sdm
[ 303.943927] disk 3, o:1, dev:sdf
[ 303.943931] disk 4, o:1, dev:sdb
[ 303.943934] disk 5, o:1, dev:sdg
[ 303.943938] disk 7, o:1, dev:sde
[ 303.943941] disk 8, o:1, dev:sda
[ 303.943945] disk 9, o:1, dev:sdh
[ 303.944061] md0: detected capacity change from 0 to 8001616347136
[ 303.944366] md: md0 switched to read-write mode.
[ 303.944427] md: reshape of RAID array md0
[ 303.944469] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
[ 303.944511] md: using maximum available idle IO bandwidth (but not
more than 200000 KB/sec) for reshape.
[ 303.944573] md: using 128k window, over a total of 976759808 blocks.
[ 304.054875] md0: unknown partition table
[ 304.393245] mdadm[5940]: segfault at 7f2000 ip 00000000004480d2 sp
00007fffa04777b8 error 4 in mdadm[400000+64000]
root@srv:~# mdadm --detail /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Sat Jan 8 11:25:17 2011
Raid Level : raid6
Array Size : 7814078464 (7452.09 GiB 8001.62 GB)
Used Dev Size : 976759808 (931.51 GiB 1000.20 GB)
Raid Devices : 10
Total Devices : 10
Persistence : Superblock is persistent
Update Time : Tue Apr 5 07:54:30 2011
State : active, degraded
Active Devices : 8
Working Devices : 10
Failed Devices : 0
Spare Devices : 2
Layout : left-symmetric
Chunk Size : 512K
New Chunksize : 64K
Name : srv:server (local to host srv)
UUID : d00a11d7:fe0435af:07c8d4d6:e3b8e34e
Events : 633835
Number Major Minor RaidDevice State
0 8 32 0 active sync /dev/sdc
1 0 0 1 removed
2 8 192 2 active sync /dev/sdm
3 8 80 3 active sync /dev/sdf
4 8 16 4 active sync /dev/sdb
5 8 96 5 active sync /dev/sdg
6 0 0 6 removed
7 8 64 7 active sync /dev/sde
8 8 0 8 active sync /dev/sda
9 8 112 9 active sync /dev/sdh
1 8 176 - spare /dev/sdl
6 8 48 - spare /dev/sdd
root@srv:~# for i in /dev/sd? ; do mdadm --examine $i ; done
/dev/sda:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x4
Array UUID : d00a11d7:fe0435af:07c8d4d6:e3b8e34e
Name : srv:server (local to host srv)
Creation Time : Sat Jan 8 11:25:17 2011
Raid Level : raid6
Raid Devices : 10
Avail Dev Size : 1953523120 (931.51 GiB 1000.20 GB)
Array Size : 15628156928 (7452.09 GiB 8001.62 GB)
Used Dev Size : 1953519616 (931.51 GiB 1000.20 GB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : active
Device UUID : 9beb9a0f:2a73328c:f0c17909:89da70fd
Reshape pos'n : 3437035520 (3277.81 GiB 3519.52 GB)
New Chunksize : 64K
Update Time : Tue Apr 5 07:54:30 2011
Checksum : c58ed095 - correct
Events : 633835
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 8
Array State : A.AAAA.AAA ('A' == active, '.' == missing)
/dev/sdb:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x4
Array UUID : d00a11d7:fe0435af:07c8d4d6:e3b8e34e
Name : srv:server (local to host srv)
Creation Time : Sat Jan 8 11:25:17 2011
Raid Level : raid6
Raid Devices : 10
Avail Dev Size : 1953523120 (931.51 GiB 1000.20 GB)
Array Size : 15628156928 (7452.09 GiB 8001.62 GB)
Used Dev Size : 1953519616 (931.51 GiB 1000.20 GB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : active
Device UUID : 75d997f8:d9372d90:c068755b:81c8206b
Reshape pos'n : 3437035520 (3277.81 GiB 3519.52 GB)
New Chunksize : 64K
Update Time : Tue Apr 5 07:54:30 2011
Checksum : 72321703 - correct
Events : 633835
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 4
Array State : A.AAAA.AAA ('A' == active, '.' == missing)
/dev/sdc:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x4
Array UUID : d00a11d7:fe0435af:07c8d4d6:e3b8e34e
Name : srv:server (local to host srv)
Creation Time : Sat Jan 8 11:25:17 2011
Raid Level : raid6
Raid Devices : 10
Avail Dev Size : 1953523120 (931.51 GiB 1000.20 GB)
Array Size : 15628156928 (7452.09 GiB 8001.62 GB)
Used Dev Size : 1953519616 (931.51 GiB 1000.20 GB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : active
Device UUID : 5738a232:85f23a16:0c7a9454:d770199c
Reshape pos'n : 3437035520 (3277.81 GiB 3519.52 GB)
New Chunksize : 64K
Update Time : Tue Apr 5 07:54:30 2011
Checksum : 5c61ea2e - correct
Events : 633835
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 0
Array State : A.AAAA.AAA ('A' == active, '.' == missing)
/dev/sdd:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x4
Array UUID : d00a11d7:fe0435af:07c8d4d6:e3b8e34e
Name : srv:server (local to host srv)
Creation Time : Sat Jan 8 11:25:17 2011
Raid Level : raid6
Raid Devices : 10
Avail Dev Size : 1953523120 (931.51 GiB 1000.20 GB)
Array Size : 15628156928 (7452.09 GiB 8001.62 GB)
Used Dev Size : 1953519616 (931.51 GiB 1000.20 GB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : active
Device UUID : 83a2c731:ba2846d0:2ce97d83:de624339
Reshape pos'n : 3437035520 (3277.81 GiB 3519.52 GB)
New Chunksize : 64K
Update Time : Tue Apr 5 07:54:30 2011
Checksum : e1a5ebbc - correct
Events : 633835
Layout : left-symmetric
Chunk Size : 512K
Device Role : spare
Array State : A.AAAA.AAA ('A' == active, '.' == missing)
/dev/sde:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x4
Array UUID : d00a11d7:fe0435af:07c8d4d6:e3b8e34e
Name : srv:server (local to host srv)
Creation Time : Sat Jan 8 11:25:17 2011
Raid Level : raid6
Raid Devices : 10
Avail Dev Size : 1953523120 (931.51 GiB 1000.20 GB)
Array Size : 15628156928 (7452.09 GiB 8001.62 GB)
Used Dev Size : 1953519616 (931.51 GiB 1000.20 GB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : active
Device UUID : f1e3c1d3:ea9dc52e:a4e6b70e:e25a0321
Reshape pos'n : 3437035520 (3277.81 GiB 3519.52 GB)
New Chunksize : 64K
Update Time : Tue Apr 5 07:54:30 2011
Checksum : 551997d7 - correct
Events : 633835
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 7
Array State : A.AAAA.AAA ('A' == active, '.' == missing)
/dev/sdf:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x4
Array UUID : d00a11d7:fe0435af:07c8d4d6:e3b8e34e
Name : srv:server (local to host srv)
Creation Time : Sat Jan 8 11:25:17 2011
Raid Level : raid6
Raid Devices : 10
Avail Dev Size : 1953523120 (931.51 GiB 1000.20 GB)
Array Size : 15628156928 (7452.09 GiB 8001.62 GB)
Used Dev Size : 1953519616 (931.51 GiB 1000.20 GB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : active
Device UUID : c32dff71:0b8c165c:9f589b0f:bcbc82da
Reshape pos'n : 3437035520 (3277.81 GiB 3519.52 GB)
New Chunksize : 64K
Update Time : Tue Apr 5 07:54:30 2011
Checksum : db0aa39b - correct
Events : 633835
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 3
Array State : A.AAAA.AAA ('A' == active, '.' == missing)
/dev/sdg:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x4
Array UUID : d00a11d7:fe0435af:07c8d4d6:e3b8e34e
Name : srv:server (local to host srv)
Creation Time : Sat Jan 8 11:25:17 2011
Raid Level : raid6
Raid Devices : 10
Avail Dev Size : 1953523120 (931.51 GiB 1000.20 GB)
Array Size : 15628156928 (7452.09 GiB 8001.62 GB)
Used Dev Size : 1953519616 (931.51 GiB 1000.20 GB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : active
Device UUID : 194bc75c:97d3f507:4915b73a:51a50172
Reshape pos'n : 3437035520 (3277.81 GiB 3519.52 GB)
New Chunksize : 64K
Update Time : Tue Apr 5 07:54:30 2011
Checksum : 344cadbe - correct
Events : 633835
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 5
Array State : A.AAAA.AAA ('A' == active, '.' == missing)
/dev/sdh:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x4
Array UUID : d00a11d7:fe0435af:07c8d4d6:e3b8e34e
Name : srv:server (local to host srv)
Creation Time : Sat Jan 8 11:25:17 2011
Raid Level : raid6
Raid Devices : 10
Avail Dev Size : 1953523120 (931.51 GiB 1000.20 GB)
Array Size : 15628156928 (7452.09 GiB 8001.62 GB)
Used Dev Size : 1953519616 (931.51 GiB 1000.20 GB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : active
Device UUID : 1326457e:4fc0a6be:0073ccae:398d5c7f
Reshape pos'n : 3437035520 (3277.81 GiB 3519.52 GB)
New Chunksize : 64K
Update Time : Tue Apr 5 07:54:30 2011
Checksum : 8debbb14 - correct
Events : 633835
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 9
Array State : A.AAAA.AAA ('A' == active, '.' == missing)
/dev/sdi:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : e39d73c3:75be3b52:44d195da:b240c146
Name : srv:2 (local to host srv)
Creation Time : Sat Jul 10 21:14:29 2010
Raid Level : raid5
Raid Devices : 3
Avail Dev Size : 1465147120 (698.64 GiB 750.16 GB)
Array Size : 2930292736 (1397.27 GiB 1500.31 GB)
Used Dev Size : 1465146368 (698.64 GiB 750.15 GB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : clean
Device UUID : b577b308:56f2e4c9:c78175f4:cf10c77f
Update Time : Tue Apr 5 07:46:18 2011
Checksum : 57ee683f - correct
Events : 455775
Layout : left-symmetric
Chunk Size : 64K
Device Role : Active device 0
Array State : AAA ('A' == active, '.' == missing)
/dev/sdj:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : e39d73c3:75be3b52:44d195da:b240c146
Name : srv:2 (local to host srv)
Creation Time : Sat Jul 10 21:14:29 2010
Raid Level : raid5
Raid Devices : 3
Avail Dev Size : 1465147120 (698.64 GiB 750.16 GB)
Array Size : 2930292736 (1397.27 GiB 1500.31 GB)
Used Dev Size : 1465146368 (698.64 GiB 750.15 GB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : clean
Device UUID : b127f002:a4aa8800:735ef8d7:6018564e
Update Time : Tue Apr 5 07:46:18 2011
Checksum : 3ae0b4c6 - correct
Events : 455775
Layout : left-symmetric
Chunk Size : 64K
Device Role : Active device 1
Array State : AAA ('A' == active, '.' == missing)
/dev/sdk:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : e39d73c3:75be3b52:44d195da:b240c146
Name : srv:2 (local to host srv)
Creation Time : Sat Jul 10 21:14:29 2010
Raid Level : raid5
Raid Devices : 3
Avail Dev Size : 1465147120 (698.64 GiB 750.16 GB)
Array Size : 2930292736 (1397.27 GiB 1500.31 GB)
Used Dev Size : 1465146368 (698.64 GiB 750.15 GB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : clean
Device UUID : 90fddf63:03d5dba4:3fcdc476:9ce3c44c
Update Time : Tue Apr 5 07:46:18 2011
Checksum : dd5eef0e - correct
Events : 455775
Layout : left-symmetric
Chunk Size : 64K
Device Role : Active device 2
Array State : AAA ('A' == active, '.' == missing)
/dev/sdl:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x4
Array UUID : d00a11d7:fe0435af:07c8d4d6:e3b8e34e
Name : srv:server (local to host srv)
Creation Time : Sat Jan 8 11:25:17 2011
Raid Level : raid6
Raid Devices : 10
Avail Dev Size : 1953523120 (931.51 GiB 1000.20 GB)
Array Size : 15628156928 (7452.09 GiB 8001.62 GB)
Used Dev Size : 1953519616 (931.51 GiB 1000.20 GB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : active
Device UUID : 769940af:66733069:37cea27d:7fb28a23
Reshape pos'n : 3437035520 (3277.81 GiB 3519.52 GB)
New Chunksize : 64K
Update Time : Tue Apr 5 07:54:30 2011
Checksum : dc756202 - correct
Events : 633835
Layout : left-symmetric
Chunk Size : 512K
Device Role : spare
Array State : A.AAAA.AAA ('A' == active, '.' == missing)
/dev/sdm:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x4
Array UUID : d00a11d7:fe0435af:07c8d4d6:e3b8e34e
Name : srv:server (local to host srv)
Creation Time : Sat Jan 8 11:25:17 2011
Raid Level : raid6
Raid Devices : 10
Avail Dev Size : 1953523120 (931.51 GiB 1000.20 GB)
Array Size : 15628156928 (7452.09 GiB 8001.62 GB)
Used Dev Size : 1953519616 (931.51 GiB 1000.20 GB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : active
Device UUID : 7e564e2c:7f21125b:c3b1907a:b640178f
Reshape pos'n : 3437035520 (3277.81 GiB 3519.52 GB)
New Chunksize : 64K
Update Time : Tue Apr 5 07:54:30 2011
Checksum : b3df3ee7 - correct
Events : 633835
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 2
Array State : A.AAAA.AAA ('A' == active, '.' == missing)
root@srv:~/mdadm-3.1.5# ./mdadm --version
mdadm - v3.1.5 - 23rd March 2011
root@srv:~/mdadm-3.1.5# uname -a
Linux srv 2.6.38 #19 SMP Wed Mar 23 09:57:05 WST 2011 x86_64 GNU/Linux
Now. The array restarted with mdadm 3.2.1, but of course it's now
reshaping 8 out of 10 disks, has no redundancy and is going at 600k/s
which will take over 10 days. Is there anything I can do to give it some
redundancy while it completes, or am I better off copying the data off,
blowing it away and starting again? All the important stuff is backed up
anyway, I just wanted to avoid restoring 8TB from backup if I could.
Regards,
Brad
* Re: What the heck happened to my array?
2011-04-05 0:47 ` What the heck happened to my array? Brad Campbell
@ 2011-04-05 6:10 ` NeilBrown
2011-04-05 9:02 ` Brad Campbell
2011-04-08 1:19 ` Brad Campbell
0 siblings, 2 replies; 12+ messages in thread
From: NeilBrown @ 2011-04-05 6:10 UTC (permalink / raw)
To: Brad Campbell; +Cc: linux-raid
On Tue, 05 Apr 2011 08:47:16 +0800 Brad Campbell <lists2009@fnarfbargle.com>
wrote:
> On 05/04/11 00:49, Roberto Spadim wrote:
> > I don't know, but this happened to me on an HP server with Linux
> > 2.6.37; I changed to an older kernel release and the problem went away.
> > Check with Neil and the other md guys what the real problem is;
> > maybe the realtime module and other changes inside the kernel are the
> > problem, maybe not...
> > Just a quick idea for a workaround: try an older kernel.
> >
>
> Quick precis:
> - Started reshape 512k to 64k chunk size.
> - sdd got bad sector and was kicked.
> - Array froze all IO.
That .... shouldn't happen. But I know why it did.
mdadm forks and runs in the background monitoring the reshape.
It suspends IO to a region of the array, backs up the data, then lets the
reshape progress over that region, then invalidates the backup and allows IO
to resume, then moves on to the next region (it actually has two regions in
different states at the same time, but you get the idea).
When the device failed, the reshape in the kernel aborted and then restarted.
It is meant to do this - restore to a known state, then decide if there is
anything useful to do. It restarts exactly where it left off so all should
be fine.
mdadm periodically checks the value in 'sync_completed' to see how far the
reshape has progressed to know if it can move on.
If it checks while the reshape is temporarily aborted it sees 'none', which
is not a number, so it aborts. That should be fixed.
It aborts with IO to a region still suspended so it is very possible for IO
to freeze if anything is destined for that region.
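To illustrate the sort of handling that is missing (this is only a
simplified sketch, not the actual mdadm code - the helper name, path and
error handling are made up for the example), the monitor needs to treat
'none' as "no progress value right now, try again later" rather than as
a number:

#include <stdio.h>
#include <string.h>

/* sync_completed normally reads "<done> / <total>" (in sectors), but
 * reads "none" while the reshape is momentarily aborted after a device
 * failure.  Return 1 with the numbers filled in, or 0 meaning "nothing
 * usable yet, retry later". */
static int read_sync_completed(const char *path,
                               unsigned long long *done,
                               unsigned long long *total)
{
        char buf[64];
        FILE *f = fopen(path, "r");

        if (!f)
                return 0;
        if (!fgets(buf, sizeof(buf), f)) {
                fclose(f);
                return 0;
        }
        fclose(f);

        if (strncmp(buf, "none", 4) == 0)
                return 0;               /* transient - not an error */

        return sscanf(buf, "%llu / %llu", done, total) == 2;
}

int main(void)
{
        unsigned long long done, total;

        if (read_sync_completed("/sys/block/md0/md/sync_completed",
                                &done, &total))
                printf("%llu / %llu sectors done\n", done, total);
        else
                printf("no progress value right now, try again later\n");
        return 0;
}

The important point is just that 'none' is an expected, transient value
while the reshape is restarting, not a reason for the monitor to give up.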
> - Reboot required to get system back.
> - Restarted reshape with 9 drives.
> - sdl suffered IO error and was kicked
Very sad.
> - Array froze all IO.
Same thing...
> - Reboot required to get system back.
> - Array will no longer mount with 8/10 drives.
> - Mdadm 3.1.5 segfaults when trying to start reshape.
Don't know why it would have done that... I cannot reproduce it easily.
> Naively tried to run it under gdb to get a backtrace but was unable
> to stop it forking
Yes, tricky .... an "strace -o /tmp/file -f mdadm ...." might have been
enough, but too late to worry about that now.
> - Got array started with mdadm 3.2.1
> - Attempted to re-add sdd/sdl (now marked as spares)
Hmm... it isn't meant to do that any more. I thought I fixed it so that
if a device looked like part of the array it wouldn't add it as a spare...
Obviously that didn't work. I'd better look into it again.
> [ 304.393245] mdadm[5940]: segfault at 7f2000 ip 00000000004480d2 sp
> 00007fffa04777b8 error 4 in mdadm[400000+64000]
>
If you have the exact mdadm binary that caused this segfault we should be
able to figure out what instruction was at 0004480d2. If you don't feel up
to it, could you please email me the file privately and I'll have a look.
> root@srv:~/mdadm-3.1.5# uname -a
> Linux srv 2.6.38 #19 SMP Wed Mar 23 09:57:05 WST 2011 x86_64 GNU/Linux
>
> Now. The array restarted with mdadm 3.2.1, but of course it's now
> reshaping 8 out of 10 disks, has no redundancy and is going at 600k/s
> which will take over 10 days. Is there anything I can do to give it some
> redundancy while it completes, or am I better off copying the data off,
> blowing it away and starting again? All the important stuff is backed up
> anyway, I just wanted to avoid restoring 8TB from backup if I could.
No, you cannot give it extra redundancy.
I would suggest:
copy anything that you need off, just in case - if you can.
Kill the mdadm that is running in the background. This will mean that
if the machine crashes your array will be corrupted, but you are thinking
of rebuilding it anyway, so that isn't the end of the world.
In /sys/block/md0/md
cat suspend_hi > suspend_lo
cat component_size > sync_max
That will allow the reshape to continue without any backup. It will be
much faster (but less safe, as I said).
If the reshape completes without incident, it will start recovering to the
two 'spares' - and then you will have a happy array again.
If something goes wrong, you will need to scrap the array, recreate it, and
copy data back from wherever you copied it to (or backups).
If anything there doesn't make sense, or doesn't seem to work - please ask.
Thanks for the report. I'll try to get those mdadm issues addressed -
particularly if you can get me the mdadm file which caused the segfault.
NeilBrown
* Re: What the heck happened to my array?
2011-04-05 6:10 ` NeilBrown
@ 2011-04-05 9:02 ` Brad Campbell
2011-04-05 11:31 ` NeilBrown
2011-04-08 1:19 ` Brad Campbell
1 sibling, 1 reply; 12+ messages in thread
From: Brad Campbell @ 2011-04-05 9:02 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
On 05/04/11 14:10, NeilBrown wrote:
>> - Reboot required to get system back.
>> - Restarted reshape with 9 drives.
>> - sdl suffered IO error and was kicked
>
> Very sad.
I'd say pretty damn unlucky actually.
>> - Array froze all IO.
>
> Same thing...
>
>> - Reboot required to get system back.
>> - Array will no longer mount with 8/10 drives.
>> - Mdadm 3.1.5 segfaults when trying to start reshape.
>
> Don't know why it would have done that... I cannot reproduce it easily.
No. I tried numerous incantations. The system version of mdadm is Debian
3.1.4. This segfaulted so I downloaded and compiled 3.1.5 which did the
same thing. I then composed most of this E-mail, made *really* sure my
backups were up to date and tried 3.2.1 which to my astonishment worked.
It's been ticking along _slowly_ ever since.
>> Naively tried to run it under gdb to get a backtrace but was unable
>> to stop it forking
>
> Yes, tricky .... an "strace -o /tmp/file -f mdadm ...." might have been
> enough, but too late to worry about that now.
I wondered about using strace but for some reason got it into my head
that a gdb backtrace would be more useful. Then of course I got it
started with 3.2.1 and have not tried again.
>> - Got array started with mdadm 3.2.1
>> - Attempted to re-add sdd/sdl (now marked as spares)
>
> Hmm... it isn't meant to do that any more. I thought I fixed it so that
> if a device looked like part of the array it wouldn't add it as a spare...
> Obviously that didn't work. I'd better look into it again.
Now the chain of events that led up to this was along these lines.
- Rebooted machine.
- Tried to --assemble with 3.1.4
- mdadm told me it did not really want to continue with 8/10 devices and
I should use --force if I really wanted it to try.
- I used --force
- I did a mdadm --add /dev/md0 /dev/sdd and the same for sdl
- I checked and they were listed as spares.
So this was all done with Debian's mdadm 3.1.4, *not* 3.1.5
>
> No, you cannot give it extra redundancy.
> I would suggest:
> copy anything that you need off, just in case - if you can.
>
> Kill the mdadm that is running in the background. This will mean that
> if the machine crashes your array will be corrupted, but you are thinking
> of rebuilding it anyway, so that isn't the end of the world.
> In /sys/block/md0/md
> cat suspend_hi > suspend_lo
> cat component_size > sync_max
>
> That will allow the reshape to continue without any backup. It will be
> much faster (but less safe, as I said).
Well, I have nothing to lose, but I've just picked up some extra drives
so I'll make second backups and then give this a whirl.
> If something goes wrong, you will need to scrap the array, recreate it, and
> copy data back from wherever you copied it to (or backups).
I did go into this with the niggling feeling that something bad might
happen, so I made sure all my backups were up to date before I started.
No biggie if it does die.
The very odd thing is I did a complete array check, plus SMART long
tests on all drives literally hours before I started the reshape. Goes
to show how ropey these large drives can be in big(ish) arrays.
> If anything there doesn't make sense, or doesn't seem to work -
> please ask.
>
> Thanks for the report. I'll try to get those mdadm issues addressed -
> particularly if you can get me the mdadm file which caused the segfault.
>
Well, luckily I preserved the entire build tree then. I was planning on
running nm over the binary and have a two thumbs type of look into it
with gdb, but seeing as you probably have a much better idea what you
are looking for I'll just send you the binary!
Thanks for the help Neil. Much appreciated.
* Re: What the heck happened to my array?
2011-04-05 9:02 ` Brad Campbell
@ 2011-04-05 11:31 ` NeilBrown
2011-04-05 11:47 ` Brad Campbell
0 siblings, 1 reply; 12+ messages in thread
From: NeilBrown @ 2011-04-05 11:31 UTC (permalink / raw)
To: Brad Campbell; +Cc: linux-raid
On Tue, 05 Apr 2011 17:02:43 +0800 Brad Campbell <lists2009@fnarfbargle.com>
wrote:
> Well, luckily I preserved the entire build tree then. I was planning on
> running nm over the binary and have a two thumbs type of look into it
> with gdb, but seeing as you probably have a much better idea what you
> are looking for I'll just send you the binary!
Thanks. It took me a little while, but I've found the problem.
The code was failing at
wd0 = sources[z][d];
in qsyndrome in restripe.c.
It is looking up 'd' in 'sources[z]' and having problems.
The error address (from dmesg) is 0x7f2000 so it isn't a
NULL pointer, but rather it is falling off the end of an allocation.
When doing qsyndrome calculations we often need a block full of zeros, so
restripe.c allocates one and stores it in a global pointer.
You were restriping from 512K to 64K chunk size.
The first thing restripe.c was called on to do was to restore data from the
backup file into the array. This uses the new chunk size - 64K. So the
'zero' buffer was allocated at 64K and cleared.
The next thing it does is read the next section of the array and write it to
the backup. As the array was missing 2 devices it needed to do a qsyndrome
calculation to get the missing data block(s). This was a calculation done on
old-style chunks so it needed a 512K zero block.
However as a zero block had already been allocated it didn't bother to
allocate another one. It just used what it had, which was too small.
So it fell off the end and got the result we saw.
I don't know why this works in 3.2.1 where it didn't work in 3.1.4.
However when it successfully recovers from the backup it should update the
metadata so that it knows it has successfully recovered and doesn't need to
recover any more. So maybe the time it worked, it found there wasn't any
recovery needed and so didn't allocate a 'zero' buffer until it was working
with the old, bigger, chunk size.
Anyway, this is easy to fix which I will do.
It only affects restarting a reshape of a double-degraded RAID6 which reduced
the chunksize.
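To illustrate the shape of it (a simplified sketch of the pattern only,
not the actual restripe.c code - the helper is made up for the example),
the cached zero buffer just has to grow whenever a caller asks for more
than has ever been allocated, rather than being handed back at whatever
size it was first created at:

#include <stdlib.h>
#include <string.h>

static char *zero;
static size_t zero_size;

/* Return a zeroed scratch block of at least 'size' bytes.  The buffer
 * is cached across calls; the bug is skipping this size check and
 * handing a 64K buffer to a caller that needs 512K. */
static char *get_zero_block(size_t size)
{
        if (size > zero_size) {
                char *p = realloc(zero, size);
                if (!p)
                        return NULL;
                zero = p;
                zero_size = size;
        }
        memset(zero, 0, size);
        return zero;
}

int main(void)
{
        /* First the new 64K chunk size, then the old 512K one. */
        if (!get_zero_block(64 * 1024))
                return 1;
        if (!get_zero_block(512 * 1024))
                return 1;
        free(zero);
        return 0;
}

With a check like that, the 64K buffer allocated for the backup-restore
step would simply be grown to 512K when the old-chunk-size qsyndrome
calculation comes along.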
Thanks,
NeilBrown
* Re: What the heck happened to my array?
2011-04-05 11:31 ` NeilBrown
@ 2011-04-05 11:47 ` Brad Campbell
0 siblings, 0 replies; 12+ messages in thread
From: Brad Campbell @ 2011-04-05 11:47 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
On 05/04/11 19:31, NeilBrown wrote:
> It only affects restarting a reshape of a double-degraded RAID6 which reduced
> the chunksize.
Talk about the planets falling into alignment! I was obviously holding
my mouth wrong when I pushed the enter key.
Thanks for the explanation. Much appreciated (again!)
* Re: What the heck happened to my array?
2011-04-05 6:10 ` NeilBrown
2011-04-05 9:02 ` Brad Campbell
@ 2011-04-08 1:19 ` Brad Campbell
2011-04-08 9:52 ` NeilBrown
1 sibling, 1 reply; 12+ messages in thread
From: Brad Campbell @ 2011-04-08 1:19 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
On 05/04/11 14:10, NeilBrown wrote:
> I would suggest:
> copy anything that you need off, just in case - if you can.
>
> Kill the mdadm that is running in the background. This will mean that
> if the machine crashes your array will be corrupted, but you are thinking
> of rebuilding it anyway, so that isn't the end of the world.
> In /sys/block/md0/md
> cat suspend_hi > suspend_lo
> cat component_size > sync_max
>
root@srv:/sys/block/md0/md# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md0 : active raid6 sdc[0] sdd[6](S) sdl[1](S) sdh[9] sda[8] sde[7]
sdg[5] sdb[4] sdf[3] sdm[2]
7814078464 blocks super 1.2 level 6, 512k chunk, algorithm 2
[10/8] [U_UUUU_UUU]
[=================>...] reshape = 88.2% (861696000/976759808)
finish=3713.3min speed=516K/sec
md2 : active raid5 sdi[0] sdk[3] sdj[1]
1465146368 blocks super 1.2 level 5, 64k chunk, algorithm 2 [3/3]
[UUU]
md6 : active raid1 sdp6[0] sdo6[1]
821539904 blocks [2/2] [UU]
md5 : active raid1 sdp5[0] sdo5[1]
104864192 blocks [2/2] [UU]
md4 : active raid1 sdp3[0] sdo3[1]
20980800 blocks [2/2] [UU]
md3 : active raid1 sdp2[0] sdo2[1]
8393856 blocks [2/2] [UU]
md1 : active raid1 sdp1[0] sdo1[1]
20980736 blocks [2/2] [UU]
unused devices: <none>
root@srv:/sys/block/md0/md# cat component_size > sync_max
cat: write error: Device or resource busy
root@srv:/sys/block/md0/md# cat suspend_hi suspend_lo
13788774400
13788774400
root@srv:/sys/block/md0/md# grep . sync_*
sync_action:reshape
sync_completed:1723392000 / 1953519616
sync_force_parallel:0
sync_max:1723392000
sync_min:0
sync_speed:281
sync_speed_max:200000 (system)
sync_speed_min:200000 (local)
So I killed mdadm, then did the cat suspend_hi > suspend_lo.. but as you
can see it won't let me change sync_max. The array above reports
516K/sec, but that was just on its way down to 0 on a time based
average. It was not moving at all.
I then tried stopping the array, restarting it with mdadm 3.1.4 which
immediately segfaulted and left the array in state resync=DELAYED.
I issued the above commands again, which succeeded this time but while
the array looked good, it was not resyncing :
root@srv:/sys/block/md0/md# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md0 : active raid6 sdc[0] sdd[6](S) sdl[1](S) sdh[9] sda[8] sde[7]
sdg[5] sdb[4] sdf[3] sdm[2]
7814078464 blocks super 1.2 level 6, 512k chunk, algorithm 2
[10/8] [U_UUUU_UUU]
[=================>...] reshape = 88.2% (861698048/976759808)
finish=30203712.0min speed=0K/sec
md2 : active raid5 sdi[0] sdk[3] sdj[1]
1465146368 blocks super 1.2 level 5, 64k chunk, algorithm 2 [3/3]
[UUU]
md6 : active raid1 sdp6[0] sdo6[1]
821539904 blocks [2/2] [UU]
md5 : active raid1 sdp5[0] sdo5[1]
104864192 blocks [2/2] [UU]
md4 : active raid1 sdp3[0] sdo3[1]
20980800 blocks [2/2] [UU]
md3 : active raid1 sdp2[0] sdo2[1]
8393856 blocks [2/2] [UU]
md1 : active raid1 sdp1[0] sdo1[1]
20980736 blocks [2/2] [UU]
unused devices: <none>
root@srv:/sys/block/md0/md# grep . sync*
sync_action:reshape
sync_completed:1723396096 / 1953519616
sync_force_parallel:0
sync_max:976759808
sync_min:0
sync_speed:0
sync_speed_max:200000 (system)
sync_speed_min:200000 (local)
I stopped the array and restarted it with mdadm 3.2.1 and it continued
along its merry way.
Not an issue, and I don't much care if it blew something up, but I
thought it worthy of a follow up.
If there is anything you need tested while it's in this state I've got ~
1000 minutes of resync time left and I'm happy to damage it if requested.
Regards,
Brad
* Re: What the heck happened to my array?
2011-04-08 1:19 ` Brad Campbell
@ 2011-04-08 9:52 ` NeilBrown
2011-04-08 15:27 ` Roberto Spadim
0 siblings, 1 reply; 12+ messages in thread
From: NeilBrown @ 2011-04-08 9:52 UTC (permalink / raw)
To: Brad Campbell; +Cc: linux-raid
On Fri, 08 Apr 2011 09:19:01 +0800 Brad Campbell <lists2009@fnarfbargle.com>
wrote:
> On 05/04/11 14:10, NeilBrown wrote:
>
> > I would suggest:
> > copy anything that you need off, just in case - if you can.
> >
> > Kill the mdadm that is running in the background. This will mean that
> > if the machine crashes your array will be corrupted, but you are thinking
> > of rebuilding it anyway, so that isn't the end of the world.
> > In /sys/block/md0/md
> > cat suspend_hi > suspend_lo
> > cat component_size > sync_max
> >
>
> root@srv:/sys/block/md0/md# cat /proc/mdstat
> Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
> md0 : active raid6 sdc[0] sdd[6](S) sdl[1](S) sdh[9] sda[8] sde[7]
> sdg[5] sdb[4] sdf[3] sdm[2]
> 7814078464 blocks super 1.2 level 6, 512k chunk, algorithm 2
> [10/8] [U_UUUU_UUU]
> [=================>...] reshape = 88.2% (861696000/976759808)
> finish=3713.3min speed=516K/sec
>
> md2 : active raid5 sdi[0] sdk[3] sdj[1]
> 1465146368 blocks super 1.2 level 5, 64k chunk, algorithm 2 [3/3]
> [UUU]
>
> md6 : active raid1 sdp6[0] sdo6[1]
> 821539904 blocks [2/2] [UU]
>
> md5 : active raid1 sdp5[0] sdo5[1]
> 104864192 blocks [2/2] [UU]
>
> md4 : active raid1 sdp3[0] sdo3[1]
> 20980800 blocks [2/2] [UU]
>
> md3 : active raid1 sdp2[0] sdo2[1]
> 8393856 blocks [2/2] [UU]
>
> md1 : active raid1 sdp1[0] sdo1[1]
> 20980736 blocks [2/2] [UU]
>
> unused devices: <none>
> root@srv:/sys/block/md0/md# cat component_size > sync_max
> cat: write error: Device or resource busy
Sorry, I should have checked the source code.
echo max > sync_max
is what you want.
Or just a much bigger number.
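e.g. from /sys/block/md0/md either of these should work - the second
number just needs to be comfortably larger than the device size in
sectors:

echo max > sync_max
echo 2000000000 > sync_max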
>
> root@srv:/sys/block/md0/md# cat suspend_hi suspend_lo
> 13788774400
> 13788774400
They are the same, so that is good - nothing will be suspended.
>
> root@srv:/sys/block/md0/md# grep . sync_*
> sync_action:reshape
> sync_completed:1723392000 / 1953519616
> sync_force_parallel:0
> sync_max:1723392000
> sync_min:0
> sync_speed:281
> sync_speed_max:200000 (system)
> sync_speed_min:200000 (local)
>
> So I killed mdadm, then did the cat suspend_hi > suspend_lo, but as
> you can see it wouldn't let me change sync_max. The array above
> reports 516K/sec, but that was just the time-based average on its way
> down to 0; it was not actually moving at all.
>
> I then tried stopping the array and restarting it with mdadm 3.1.4,
> which immediately segfaulted and left the array in the state
> resync=DELAYED.
>
> I issued the above commands again, and this time they succeeded, but
> while the array looked good, it was not resyncing:
> root@srv:/sys/block/md0/md# cat /proc/mdstat
> Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
> md0 : active raid6 sdc[0] sdd[6](S) sdl[1](S) sdh[9] sda[8] sde[7]
> sdg[5] sdb[4] sdf[3] sdm[2]
> 7814078464 blocks super 1.2 level 6, 512k chunk, algorithm 2
> [10/8] [U_UUUU_UUU]
> [=================>...] reshape = 88.2% (861698048/976759808)
> finish=30203712.0min speed=0K/sec
>
> md2 : active raid5 sdi[0] sdk[3] sdj[1]
> 1465146368 blocks super 1.2 level 5, 64k chunk, algorithm 2 [3/3]
> [UUU]
>
> md6 : active raid1 sdp6[0] sdo6[1]
> 821539904 blocks [2/2] [UU]
>
> md5 : active raid1 sdp5[0] sdo5[1]
> 104864192 blocks [2/2] [UU]
>
> md4 : active raid1 sdp3[0] sdo3[1]
> 20980800 blocks [2/2] [UU]
>
> md3 : active raid1 sdp2[0] sdo2[1]
> 8393856 blocks [2/2] [UU]
>
> md1 : active raid1 sdp1[0] sdo1[1]
> 20980736 blocks [2/2] [UU]
>
> unused devices: <none>
>
> root@srv:/sys/block/md0/md# grep . sync*
> sync_action:reshape
> sync_completed:1723396096 / 1953519616
> sync_force_parallel:0
> sync_max:976759808
> sync_min:0
> sync_speed:0
> sync_speed_max:200000 (system)
> sync_speed_min:200000 (local)
>
> I stopped the array and restarted it with mdadm 3.2.1, and it
> continued along its merry way.
>
> Not an issue, and I don't much care if it blew something up, but I
> thought it worthy of a follow-up.
>
> If there is anything you need tested while it's in this state, I've
> got ~1000 minutes of resync time left and I'm happy to damage it if
> requested.
No thanks - I think I know what happened. The main problem is that
there is confusion between 'k' and 'sectors', there are random other
values that sometimes work (like 'max'), and I never remember which is
which. sysfs in md is a bit of a mess... one day I hope to replace it
completely (with backwards compatibility, of course).
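To illustrate with the numbers above (my reading of the sysfs files,
not re-checked against the code): component_size here is 976759808,
i.e. the per-device size in 1K blocks, while sync_max and
sync_completed count 512-byte sectors, so the same size is

  976759808 * 2 = 1953519616 sectors

Writing the 1K figure straight into sync_max therefore imposes a limit
of 976759808 sectors, which is below the sync_completed position of
1723396096 sectors shown above, so the reshape just sits at speed=0
until the limit is lifted with 'max'.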
Thanks for the feedback.
NeilBrown
>
> Regards,
> Brad
* Re: What the heck happened to my array?
2011-04-08 9:52 ` NeilBrown
@ 2011-04-08 15:27 ` Roberto Spadim
0 siblings, 0 replies; 12+ messages in thread
From: Roberto Spadim @ 2011-04-08 15:27 UTC (permalink / raw)
To: NeilBrown; +Cc: Brad Campbell, linux-raid
Hi Neil, given some time I could help change the sysfs information
(add unit information to the output, and strip the unit text from the
input).
What is the current kernel version for development?
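Just to sketch the idea (hypothetical, not how the files behave
today): reads would always show the unit, and writes could carry an
explicit suffix, e.g.

cat sync_max
1953519616 sectors

echo 976759808K > sync_max    # same limit, given in 1K blocks
echo max > sync_max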
2011/4/8 NeilBrown <neilb@suse.de>:
> On Fri, 08 Apr 2011 09:19:01 +0800 Brad Campbell <lists2009@fnarfbargle.com>
> wrote:
>
>> On 05/04/11 14:10, NeilBrown wrote:
>>
>> > I would suggest:
>> > copy anything that you need off, just in case - if you can.
>> >
>> > Kill the mdadm that is running in the background. This will mean
>> > that if the machine crashes your array will be corrupted, but you
>> > are thinking of rebuilding it anyway, so that isn't the end of the
>> > world.
>> > In /sys/block/md0/md
>> > cat suspend_hi > suspend_lo
>> > cat component_size > sync_max
>> >
>>
>> root@srv:/sys/block/md0/md# cat /proc/mdstat
>> Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
>> md0 : active raid6 sdc[0] sdd[6](S) sdl[1](S) sdh[9] sda[8] sde[7]
>> sdg[5] sdb[4] sdf[3] sdm[2]
>> 7814078464 blocks super 1.2 level 6, 512k chunk, algorithm 2
>> [10/8] [U_UUUU_UUU]
>> [=================>...] reshape = 88.2% (861696000/976759808)
>> finish=3713.3min speed=516K/sec
>>
>> md2 : active raid5 sdi[0] sdk[3] sdj[1]
>> 1465146368 blocks super 1.2 level 5, 64k chunk, algorithm 2 [3/3]
>> [UUU]
>>
>> md6 : active raid1 sdp6[0] sdo6[1]
>> 821539904 blocks [2/2] [UU]
>>
>> md5 : active raid1 sdp5[0] sdo5[1]
>> 104864192 blocks [2/2] [UU]
>>
>> md4 : active raid1 sdp3[0] sdo3[1]
>> 20980800 blocks [2/2] [UU]
>>
>> md3 : active raid1 sdp2[0] sdo2[1]
>> 8393856 blocks [2/2] [UU]
>>
>> md1 : active raid1 sdp1[0] sdo1[1]
>> 20980736 blocks [2/2] [UU]
>>
>> unused devices: <none>
>> root@srv:/sys/block/md0/md# cat component_size > sync_max
>> cat: write error: Device or resource busy
>
> Sorry, I should have checked the source code.
>
>
> echo max > sync_max
>
> is what you want.
> Or just a much bigger number.
>
>>
>> root@srv:/sys/block/md0/md# cat suspend_hi suspend_lo
>> 13788774400
>> 13788774400
>
> They are the same, so that is good - nothing will be suspended.
>
>>
>> root@srv:/sys/block/md0/md# grep . sync_*
>> sync_action:reshape
>> sync_completed:1723392000 / 1953519616
>> sync_force_parallel:0
>> sync_max:1723392000
>> sync_min:0
>> sync_speed:281
>> sync_speed_max:200000 (system)
>> sync_speed_min:200000 (local)
>>
>> So I killed mdadm, then did the cat suspend_hi > suspend_lo, but as
>> you can see it wouldn't let me change sync_max. The array above
>> reports 516K/sec, but that was just the time-based average on its way
>> down to 0; it was not actually moving at all.
>>
>> I then tried stopping the array and restarting it with mdadm 3.1.4,
>> which immediately segfaulted and left the array in the state
>> resync=DELAYED.
>>
>> I issued the above commands again, and this time they succeeded, but
>> while the array looked good, it was not resyncing:
>> root@srv:/sys/block/md0/md# cat /proc/mdstat
>> Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
>> md0 : active raid6 sdc[0] sdd[6](S) sdl[1](S) sdh[9] sda[8] sde[7]
>> sdg[5] sdb[4] sdf[3] sdm[2]
>> 7814078464 blocks super 1.2 level 6, 512k chunk, algorithm 2
>> [10/8] [U_UUUU_UUU]
>> [=================>...] reshape = 88.2% (861698048/976759808)
>> finish=30203712.0min speed=0K/sec
>>
>> md2 : active raid5 sdi[0] sdk[3] sdj[1]
>> 1465146368 blocks super 1.2 level 5, 64k chunk, algorithm 2 [3/3]
>> [UUU]
>>
>> md6 : active raid1 sdp6[0] sdo6[1]
>> 821539904 blocks [2/2] [UU]
>>
>> md5 : active raid1 sdp5[0] sdo5[1]
>> 104864192 blocks [2/2] [UU]
>>
>> md4 : active raid1 sdp3[0] sdo3[1]
>> 20980800 blocks [2/2] [UU]
>>
>> md3 : active raid1 sdp2[0] sdo2[1]
>> 8393856 blocks [2/2] [UU]
>>
>> md1 : active raid1 sdp1[0] sdo1[1]
>> 20980736 blocks [2/2] [UU]
>>
>> unused devices: <none>
>>
>> root@srv:/sys/block/md0/md# grep . sync*
>> sync_action:reshape
>> sync_completed:1723396096 / 1953519616
>> sync_force_parallel:0
>> sync_max:976759808
>> sync_min:0
>> sync_speed:0
>> sync_speed_max:200000 (system)
>> sync_speed_min:200000 (local)
>>
>> I stopped the array and restarted it with mdadm 3.2.1, and it
>> continued along its merry way.
>>
>> Not an issue, and I don't much care if it blew something up, but I
>> thought it worthy of a follow-up.
>>
>> If there is anything you need tested while it's in this state, I've
>> got ~1000 minutes of resync time left and I'm happy to damage it if
>> requested.
>
> No thanks - I think I know what happened. The main problem is that
> there is confusion between 'k' and 'sectors', there are random other
> values that sometimes work (like 'max'), and I never remember which is
> which. sysfs in md is a bit of a mess... one day I hope to replace it
> completely (with backwards compatibility, of course).
>
> Thanks for the feedback.
>
> NeilBrown
>
>
>>
>> Regards,
>> Brad
--
Roberto Spadim
Spadim Technology / SPAEmpresarial
Thread overview: 12+ messages
2011-04-03 13:32 What the heck happened to my array? (No apparent data loss) Brad Campbell
2011-04-03 15:47 ` Roberto Spadim
2011-04-04 5:59 ` Brad Campbell
2011-04-04 16:49 ` Roberto Spadim
2011-04-05 0:47 ` What the heck happened to my array? Brad Campbell
2011-04-05 6:10 ` NeilBrown
2011-04-05 9:02 ` Brad Campbell
2011-04-05 11:31 ` NeilBrown
2011-04-05 11:47 ` Brad Campbell
2011-04-08 1:19 ` Brad Campbell
2011-04-08 9:52 ` NeilBrown
2011-04-08 15:27 ` Roberto Spadim