* Recovery on new 2TB disk: finish=7248.4min (raid1)
@ 2017-04-26 21:57 Ron Leach
2017-04-27 14:25 ` John Stoffel
2017-04-27 14:58 ` Mateusz Korniak
0 siblings, 2 replies; 25+ messages in thread
From: Ron Leach @ 2017-04-26 21:57 UTC (permalink / raw)
To: linux-raid
List, good evening,
We run a 2TB fileserver in a raid1 configuration. Today one of the two
disks (/dev/sdb) failed; we've just replaced it and set up exactly
the same partitions as the working, but degraded, raid has on /dev/sda.
Using the commands
# mdadm --manage -a /dev/md0 /dev/sdb1
(and so on for md1 through md7)
is resulting in an unusually slow recovery. mdadm is now
recovering the largest partition, 1.8TB, but expects to spend 5 days
on it. I think I must have done something wrong. May I ask a
couple of questions?
1 Is there a safe command to stop the recovery/add process that is
ongoing? I reread man mdadm but did not see a command I could use for
this.
2 After the failure of /dev/sdb, mdstat listed the sdbX partition in each md
device with an '(F)'. We then also failed ('mdadm --fail') each sdb partition
in each md device, and then powered down the machine to replace sdb.
After powering up and booting back into Debian, we created the
partitions on (the new) sdb to mirror those on /dev/sda. We then
issued these commands one after the other:
# mdadm --manage -a /dev/md0 /dev/sdb1
# mdadm --manage -a /dev/md1 /dev/sdb2
# mdadm --manage -a /dev/md2 /dev/sdb3
# mdadm --manage -a /dev/md3 /dev/sdb5
# mdadm --manage -a /dev/md4 /dev/sdb6
# mdadm --manage -a /dev/md5 /dev/sdb7
# mdadm --manage -a /dev/md6 /dev/sdb8
# mdadm --manage -a /dev/md7 /dev/sdb9
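(On the partition-copying step above: for an msdos-labelled disk the table
is typically cloned from the surviving drive with something like
# sfdisk -d /dev/sda | sfdisk /dev/sdb
- shown here only as an illustration of that step.)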
Have I missed some vital step, and is that causing the recovery process to
take a very long time?
mdstat and lsdrv outputs here (UUIDs abbreviated):
# cat /proc/mdstat
Personalities : [raid1]
md7 : active raid1 sdb9[3] sda9[2]
1894416248 blocks super 1.2 [2/1] [U_]
[>....................] recovery = 0.0% (1493504/1894416248)
finish=7248.4min speed=4352K/sec
md6 : active raid1 sdb8[3] sda8[2]
39060408 blocks super 1.2 [2/1] [U_]
resync=DELAYED
md5 : active raid1 sdb7[3] sda7[2]
975860 blocks super 1.2 [2/1] [U_]
resync=DELAYED
md4 : active raid1 sdb6[3] sda6[2]
975860 blocks super 1.2 [2/1] [U_]
resync=DELAYED
md3 : active raid1 sdb5[3] sda5[2]
4880372 blocks super 1.2 [2/1] [U_]
resync=DELAYED
md2 : active raid1 sdb3[3] sda3[2]
9764792 blocks super 1.2 [2/1] [U_]
resync=DELAYED
md1 : active raid1 sdb2[3] sda2[2]
2928628 blocks super 1.2 [2/2] [UU]
md0 : active raid1 sdb1[3] sda1[2]
498676 blocks super 1.2 [2/2] [UU]
unused devices: <none>
I also meant to ask - why are the /dev/sdb partitions shown in mdstat with a
'[3]'? Previously I think they had a '[1]'.
# ./lsdrv
**Warning** The following utility(ies) failed to execute:
sginfo
pvs
lvs
Some information may be missing.
Controller platform [None]
└platform floppy.0
└fd0 4.00k [2:0] Empty/Unknown
PCI [sata_nv] 00:08.0 IDE interface: nVidia Corporation MCP61 SATA
Controller (rev a2)
├scsi 0:0:0:0 ATA WDC WD20EZRX-00D {WD-WC....R1}
│└sda 1.82t [8:0] Partitioned (dos)
│ ├sda1 487.00m [8:1] MD raid1 (0/2) (w/ sdb1) in_sync 'Server6:0'
{b307....e950}
│ │└md0 486.99m [9:0] MD v1.2 raid1 (2) clean {b307....e950}
│ │ │ ext2 {4ed1....e8b1}
│ │ └Mounted as /dev/md0 @ /boot
│ ├sda2 2.79g [8:2] MD raid1 (0/2) (w/ sdb2) in_sync 'Server6:1'
{77b1....50f2}
│ │└md1 2.79g [9:1] MD v1.2 raid1 (2) clean {77b1....50f2}
│ │ │ jfs {7d08....bae5}
│ │ └Mounted as /dev/disk/by-uuid/7d08....bae5 @ /
│ ├sda3 9.31g [8:3] MD raid1 (0/2) (w/ sdb3) in_sync 'Server6:2'
{afd6....b694}
│ │└md2 9.31g [9:2] MD v1.2 raid1 (2) clean DEGRADED, recover
(0.00k/18.62g) 0.00k/sec {afd6....b694}
│ │ │ jfs {81bb....92f8}
│ │ └Mounted as /dev/md2 @ /usr
│ ├sda4 1.00k [8:4] Partitioned (dos)
│ ├sda5 4.66g [8:5] MD raid1 (0/2) (w/ sdb5) in_sync 'Server6:3'
{d00a....4e99}
│ │└md3 4.65g [9:3] MD v1.2 raid1 (2) active DEGRADED, recover
(0.00k/9.31g) 0.00k/sec {d00a....4e99}
│ │ │ jfs {375b....4fd5}
│ │ └Mounted as /dev/md3 @ /var
│ ├sda6 953.00m [8:6] MD raid1 (0/2) (w/ sdb6) in_sync 'Server6:4'
{25af....d910}
│ │└md4 952.99m [9:4] MD v1.2 raid1 (2) clean DEGRADED, recover
(0.00k/1.86g) 0.00k/sec {25af....d910}
│ │ swap {d92f....2ad7}
│ ├sda7 953.00m [8:7] MD raid1 (0/2) (w/ sdb7) in_sync 'Server6:5'
{0034....971a}
│ │└md5 952.99m [9:5] MD v1.2 raid1 (2) active DEGRADED, recover
(0.00k/1.86g) 0.00k/sec {0034....971a}
│ │ │ jfs {4bf7....0fff}
│ │ └Mounted as /dev/md5 @ /tmp
│ ├sda8 37.25g [8:8] MD raid1 (0/2) (w/ sdb8) in_sync 'Server6:6'
{a5d9....568d}
│ │└md6 37.25g [9:6] MD v1.2 raid1 (2) clean DEGRADED, recover
(0.00k/74.50g) 0.00k/sec {a5d9....568d}
│ │ │ jfs {fdf0....6478}
│ │ └Mounted as /dev/md6 @ /home
│ └sda9 1.76t [8:9] MD raid1 (0/2) (w/ sdb9) in_sync 'Server6:7'
{9bb1....bbb4}
│ └md7 1.76t [9:7] MD v1.2 raid1 (2) clean DEGRADED, recover
(0.00k/3.53t) 3.01m/sec {9bb1....bbb4}
│ │ jfs {60bc....33fc}
│ └Mounted as /dev/md7 @ /srv
└scsi 1:0:0:0 ATA ST2000DL003-9VT1 {5Y....HT}
└sdb 1.82t [8:16] Partitioned (dos)
├sdb1 487.00m [8:17] MD raid1 (1/2) (w/ sda1) in_sync 'Server6:0'
{b307....e950}
│└md0 486.99m [9:0] MD v1.2 raid1 (2) clean {b307....e950}
│ ext2 {4ed1....e8b1}
├sdb2 2.79g [8:18] MD raid1 (1/2) (w/ sda2) in_sync 'Server6:1'
{77b1....50f2}
│└md1 2.79g [9:1] MD v1.2 raid1 (2) clean {77b1....50f2}
│ jfs {7d08....bae5}
├sdb3 9.31g [8:19] MD raid1 (1/2) (w/ sda3) spare 'Server6:2'
{afd6....b694}
│└md2 9.31g [9:2] MD v1.2 raid1 (2) clean DEGRADED, recover
(0.00k/18.62g) 0.00k/sec {afd6....b694}
│ jfs {81bb....92f8}
├sdb4 1.00k [8:20] Partitioned (dos)
├sdb5 4.66g [8:21] MD raid1 (1/2) (w/ sda5) spare 'Server6:3'
{d00a....4e99}
│└md3 4.65g [9:3] MD v1.2 raid1 (2) active DEGRADED, recover
(0.00k/9.31g) 0.00k/sec {d00a....4e99}
│ jfs {375b....4fd5}
├sdb6 953.00m [8:22] MD raid1 (1/2) (w/ sda6) spare 'Server6:4'
{25af....d910}
│└md4 952.99m [9:4] MD v1.2 raid1 (2) clean DEGRADED, recover
(0.00k/1.86g) 0.00k/sec {25af....d910}
│ swap {d92f....2ad7}
├sdb7 953.00m [8:23] MD raid1 (1/2) (w/ sda7) spare 'Server6:5'
{0034....971a}
│└md5 952.99m [9:5] MD v1.2 raid1 (2) active DEGRADED, recover
(0.00k/1.86g) 0.00k/sec {0034....971a}
│ jfs {4bf7....0fff}
├sdb8 37.25g [8:24] MD raid1 (1/2) (w/ sda8) spare 'Server6:6'
{a5d9....568d}
│└md6 37.25g [9:6] MD v1.2 raid1 (2) clean DEGRADED, recover
(0.00k/74.50g) 0.00k/sec {a5d9....568d}
│ jfs {fdf0....6478}
├sdb9 1.76t [8:25] MD raid1 (1/2) (w/ sda9) spare 'Server6:7'
{9bb1....bbb4}
│└md7 1.76t [9:7] MD v1.2 raid1 (2) clean DEGRADED, recover
(0.00k/3.53t) 3.01m/sec {9bb1....bbb4}
│ jfs {60bc....33fc}
└sdb10 1.00m [8:26] Empty/Unknown
PCI [pata_amd] 00:06.0 IDE interface: nVidia Corporation MCP61 IDE
(rev a2)
├scsi 2:0:0:0 AOPEN CD-RW CRW5224
{AOPEN_CD-RW_CRW5224_1.07_20020606_}
│└sr0 1.00g [11:0] Empty/Unknown
└scsi 3:x:x:x [Empty]
Other Block Devices
├loop0 0.00k [7:0] Empty/Unknown
├loop1 0.00k [7:1] Empty/Unknown
├loop2 0.00k [7:2] Empty/Unknown
├loop3 0.00k [7:3] Empty/Unknown
├loop4 0.00k [7:4] Empty/Unknown
├loop5 0.00k [7:5] Empty/Unknown
├loop6 0.00k [7:6] Empty/Unknown
└loop7 0.00k [7:7] Empty/Unknown
The OS is still as originally installed some years ago - Debian 6/Squeeze.
It has been pretty solid; we have replaced disks before, but never with
such a slow recovery.
I'd be very grateful for any thoughts.
regards, Ron
^ permalink raw reply [flat|nested] 25+ messages in thread* Re: Recovery on new 2TB disk: finish=7248.4min (raid1) 2017-04-26 21:57 Recovery on new 2TB disk: finish=7248.4min (raid1) Ron Leach @ 2017-04-27 14:25 ` John Stoffel 2017-04-27 14:43 ` Reindl Harald 2017-04-27 14:54 ` Mateusz Korniak 2017-04-27 14:58 ` Mateusz Korniak 1 sibling, 2 replies; 25+ messages in thread From: John Stoffel @ 2017-04-27 14:25 UTC (permalink / raw) To: Ron Leach; +Cc: linux-raid Ron> We run a 2TB fileserver in a raid1 configuration. Today one of Ron> the 2 disks (/dev/sdb) failed and we've just replaced it and set Ron> up exactly the same partitions as the working, but degraded, raid Ron> has on /dev/sda. First off, why are you bothering to do this? You should just mirror the entire disk with MD, then build LVM volumes on top of that which you can then allocate as you see fit, moving your data around, growing, shrinking volumes as you need. Ron> Using the commands Ron> # mdadm --manage -a /dev/mdo /dev/sdb1 Ron> (and so on for md 1->7) Ron> is resulting in a very-unusually slow recovery. And mdadm is now Ron> recovering the largest partition, 1.8TB, but expects to spend 5 Ron> days over it. I think I must have done something wrong. May I Ron> ask a couple of questions? Did you check that values in /sys/devices/virtual/block/md0/md/sync_speed* settings? I suspect you want to up the sync_speed_max to a higher number on your system. Ron> 1 Is there a safe command to stop the recovery/add process that Ron> is ongoing? I reread man mdadm but did not see a command I could Ron> use for this. Why would you want to do this? Ron> 2 After the failure of /dev/sdb, mdstat listed sdb x in each md Ron> device with an '(F)'. We then also 'FAIL'ed each sdb partition in Ron> each md device, and then powered down the machine to replace sdb. Ron> After powering up and booting back into Debian, we created the Ron> partitions on (the new) sdb to mirror those on /dev/sda. We then Ron> issued these commands one after the other: Ron> # mdadm --manage -a /dev/mdo /dev/sdb1 Ron> # mdadm --manage -a /dev/md1 /dev/sdb2 Ron> # mdadm --manage -a /dev/md2 /dev/sdb3 Ron> # mdadm --manage -a /dev/md3 /dev/sdb5 Ron> # mdadm --manage -a /dev/md4 /dev/sdb6 Ron> # mdadm --manage -a /dev/md5 /dev/sdb7 Ron> # mdadm --manage -a /dev/md6 /dev/sdb8 Ron> # mdadm --manage -a /dev/md7 /dev/sdb9 Ugh! You're setting yourself up for a true seek storm here, and way too much pain down the road, IMHO. Just mirror the entire disk and put LVM volumes on top. Ron> Have I missed some vital step, and so causing the recover process to Ron> take a very long time? Ron> mdstat and lsdrv outputs here (UUIDs abbreviated): Ron> # cat /proc/mdstat Ron> Personalities : [raid1] Ron> md7 : active raid1 sdb9[3] sda9[2] Ron> 1894416248 blocks super 1.2 [2/1] [U_] Ron> [>....................] 
recovery = 0.0% (1493504/1894416248) Ron> finish=7248.4min speed=4352K/sec Ron> md6 : active raid1 sdb8[3] sda8[2] Ron> 39060408 blocks super 1.2 [2/1] [U_] Ron> resync=DELAYED Ron> md5 : active raid1 sdb7[3] sda7[2] Ron> 975860 blocks super 1.2 [2/1] [U_] Ron> resync=DELAYED Ron> md4 : active raid1 sdb6[3] sda6[2] Ron> 975860 blocks super 1.2 [2/1] [U_] Ron> resync=DELAYED Ron> md3 : active raid1 sdb5[3] sda5[2] Ron> 4880372 blocks super 1.2 [2/1] [U_] Ron> resync=DELAYED Ron> md2 : active raid1 sdb3[3] sda3[2] Ron> 9764792 blocks super 1.2 [2/1] [U_] Ron> resync=DELAYED Ron> md1 : active raid1 sdb2[3] sda2[2] Ron> 2928628 blocks super 1.2 [2/2] [UU] Ron> md0 : active raid1 sdb1[3] sda1[2] Ron> 498676 blocks super 1.2 [2/2] [UU] Ron> unused devices: <none> Ron> I meant to also ask - why are the /dev/sdb partitions shown with a Ron> '(3)'? Previously I think they had a '(1)'. Ron> # ./lsdrv Ron> **Warning** The following utility(ies) failed to execute: Ron> sginfo Ron> pvs Ron> lvs Ron> Some information may be missing. Ron> Controller platform [None] Ron> └platform floppy.0 Ron> └fd0 4.00k [2:0] Empty/Unknown Ron> PCI [sata_nv] 00:08.0 IDE interface: nVidia Corporation MCP61 SATA Ron> Controller (rev a2) Ron> ├scsi 0:0:0:0 ATA WDC WD20EZRX-00D {WD-WC....R1} Ron> │└sda 1.82t [8:0] Partitioned (dos) Ron> │ ├sda1 487.00m [8:1] MD raid1 (0/2) (w/ sdb1) in_sync 'Server6:0' Ron> {b307....e950} Ron> │ │└md0 486.99m [9:0] MD v1.2 raid1 (2) clean {b307....e950} Ron> │ │ │ ext2 {4ed1....e8b1} Ron> │ │ └Mounted as /dev/md0 @ /boot Ron> │ ├sda2 2.79g [8:2] MD raid1 (0/2) (w/ sdb2) in_sync 'Server6:1' Ron> {77b1....50f2} Ron> │ │└md1 2.79g [9:1] MD v1.2 raid1 (2) clean {77b1....50f2} Ron> │ │ │ jfs {7d08....bae5} Ron> │ │ └Mounted as /dev/disk/by-uuid/7d08....bae5 @ / Ron> │ ├sda3 9.31g [8:3] MD raid1 (0/2) (w/ sdb3) in_sync 'Server6:2' Ron> {afd6....b694} Ron> │ │└md2 9.31g [9:2] MD v1.2 raid1 (2) clean DEGRADED, recover Ron> (0.00k/18.62g) 0.00k/sec {afd6....b694} Ron> │ │ │ jfs {81bb....92f8} Ron> │ │ └Mounted as /dev/md2 @ /usr Ron> │ ├sda4 1.00k [8:4] Partitioned (dos) Ron> │ ├sda5 4.66g [8:5] MD raid1 (0/2) (w/ sdb5) in_sync 'Server6:3' Ron> {d00a....4e99} Ron> │ │└md3 4.65g [9:3] MD v1.2 raid1 (2) active DEGRADED, recover Ron> (0.00k/9.31g) 0.00k/sec {d00a....4e99} Ron> │ │ │ jfs {375b....4fd5} Ron> │ │ └Mounted as /dev/md3 @ /var Ron> │ ├sda6 953.00m [8:6] MD raid1 (0/2) (w/ sdb6) in_sync 'Server6:4' Ron> {25af....d910} Ron> │ │└md4 952.99m [9:4] MD v1.2 raid1 (2) clean DEGRADED, recover Ron> (0.00k/1.86g) 0.00k/sec {25af....d910} Ron> │ │ swap {d92f....2ad7} Ron> │ ├sda7 953.00m [8:7] MD raid1 (0/2) (w/ sdb7) in_sync 'Server6:5' Ron> {0034....971a} Ron> │ │└md5 952.99m [9:5] MD v1.2 raid1 (2) active DEGRADED, recover Ron> (0.00k/1.86g) 0.00k/sec {0034....971a} Ron> │ │ │ jfs {4bf7....0fff} Ron> │ │ └Mounted as /dev/md5 @ /tmp Ron> │ ├sda8 37.25g [8:8] MD raid1 (0/2) (w/ sdb8) in_sync 'Server6:6' Ron> {a5d9....568d} Ron> │ │└md6 37.25g [9:6] MD v1.2 raid1 (2) clean DEGRADED, recover Ron> (0.00k/74.50g) 0.00k/sec {a5d9....568d} Ron> │ │ │ jfs {fdf0....6478} Ron> │ │ └Mounted as /dev/md6 @ /home Ron> │ └sda9 1.76t [8:9] MD raid1 (0/2) (w/ sdb9) in_sync 'Server6:7' Ron> {9bb1....bbb4} Ron> │ └md7 1.76t [9:7] MD v1.2 raid1 (2) clean DEGRADED, recover Ron> (0.00k/3.53t) 3.01m/sec {9bb1....bbb4} Ron> │ │ jfs {60bc....33fc} Ron> │ └Mounted as /dev/md7 @ /srv Ron> └scsi 1:0:0:0 ATA ST2000DL003-9VT1 {5Y....HT} Ron> └sdb 1.82t [8:16] Partitioned (dos) Ron> ├sdb1 487.00m [8:17] MD raid1 (1/2) (w/ sda1) 
in_sync 'Server6:0' Ron> {b307....e950} Ron> │└md0 486.99m [9:0] MD v1.2 raid1 (2) clean {b307....e950} Ron> │ ext2 {4ed1....e8b1} Ron> ├sdb2 2.79g [8:18] MD raid1 (1/2) (w/ sda2) in_sync 'Server6:1' Ron> {77b1....50f2} Ron> │└md1 2.79g [9:1] MD v1.2 raid1 (2) clean {77b1....50f2} Ron> │ jfs {7d08....bae5} Ron> ├sdb3 9.31g [8:19] MD raid1 (1/2) (w/ sda3) spare 'Server6:2' Ron> {afd6....b694} Ron> │└md2 9.31g [9:2] MD v1.2 raid1 (2) clean DEGRADED, recover Ron> (0.00k/18.62g) 0.00k/sec {afd6....b694} Ron> │ jfs {81bb....92f8} Ron> ├sdb4 1.00k [8:20] Partitioned (dos) Ron> ├sdb5 4.66g [8:21] MD raid1 (1/2) (w/ sda5) spare 'Server6:3' Ron> {d00a....4e99} Ron> │└md3 4.65g [9:3] MD v1.2 raid1 (2) active DEGRADED, recover Ron> (0.00k/9.31g) 0.00k/sec {d00a....4e99} Ron> │ jfs {375b....4fd5} Ron> ├sdb6 953.00m [8:22] MD raid1 (1/2) (w/ sda6) spare 'Server6:4' Ron> {25af....d910} Ron> │└md4 952.99m [9:4] MD v1.2 raid1 (2) clean DEGRADED, recover Ron> (0.00k/1.86g) 0.00k/sec {25af....d910} Ron> │ swap {d92f....2ad7} Ron> ├sdb7 953.00m [8:23] MD raid1 (1/2) (w/ sda7) spare 'Server6:5' Ron> {0034....971a} Ron> │└md5 952.99m [9:5] MD v1.2 raid1 (2) active DEGRADED, recover Ron> (0.00k/1.86g) 0.00k/sec {0034....971a} Ron> │ jfs {4bf7....0fff} Ron> ├sdb8 37.25g [8:24] MD raid1 (1/2) (w/ sda8) spare 'Server6:6' Ron> {a5d9....568d} Ron> │└md6 37.25g [9:6] MD v1.2 raid1 (2) clean DEGRADED, recover Ron> (0.00k/74.50g) 0.00k/sec {a5d9....568d} Ron> │ jfs {fdf0....6478} Ron> ├sdb9 1.76t [8:25] MD raid1 (1/2) (w/ sda9) spare 'Server6:7' Ron> {9bb1....bbb4} Ron> │└md7 1.76t [9:7] MD v1.2 raid1 (2) clean DEGRADED, recover Ron> (0.00k/3.53t) 3.01m/sec {9bb1....bbb4} Ron> │ jfs {60bc....33fc} Ron> └sdb10 1.00m [8:26] Empty/Unknown Ron> PCI [pata_amd] 00:06.0 IDE interface: nVidia Corporation MCP61 IDE Ron> (rev a2) Ron> ├scsi 2:0:0:0 AOPEN CD-RW CRW5224 Ron> {AOPEN_CD-RW_CRW5224_1.07_20020606_} Ron> │└sr0 1.00g [11:0] Empty/Unknown Ron> └scsi 3:x:x:x [Empty] Ron> Other Block Devices Ron> ├loop0 0.00k [7:0] Empty/Unknown Ron> ├loop1 0.00k [7:1] Empty/Unknown Ron> ├loop2 0.00k [7:2] Empty/Unknown Ron> ├loop3 0.00k [7:3] Empty/Unknown Ron> ├loop4 0.00k [7:4] Empty/Unknown Ron> ├loop5 0.00k [7:5] Empty/Unknown Ron> ├loop6 0.00k [7:6] Empty/Unknown Ron> └loop7 0.00k [7:7] Empty/Unknown Ron> OS is still as originally installed some years ago - Debian 6/Squeeze. Ron> The OS has been pretty solid, though we've had to renew disks Ron> previously but without this very slow recovery. Ron> I'd be very grateful for any thoughts. Ron> regards, Ron Ron> -- Ron> To unsubscribe from this list: send the line "unsubscribe linux-raid" in Ron> the body of a message to majordomo@vger.kernel.org Ron> More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 25+ messages in thread
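To make the sync_speed suggestion above concrete - paths and numbers are
illustrative only, and md7 is simply the array currently rebuilding:
# cat /proc/sys/dev/raid/speed_limit_min /proc/sys/dev/raid/speed_limit_max
# echo 50000 > /sys/block/md7/md/sync_speed_min      (raise the per-array floor, in KB/s)
# echo 200000 > /sys/block/md7/md/sync_speed_max     (raise the per-array ceiling, in KB/s)
On reasonably recent kernels the rebuild can also be paused and resumed,
rather than stopped outright:
# echo frozen > /sys/block/md7/md/sync_action
# echo idle > /sys/block/md7/md/sync_action          (resume)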
* Re: Recovery on new 2TB disk: finish=7248.4min (raid1) 2017-04-27 14:25 ` John Stoffel @ 2017-04-27 14:43 ` Reindl Harald 2017-04-28 7:05 ` Ron Leach 2017-04-27 14:54 ` Mateusz Korniak 1 sibling, 1 reply; 25+ messages in thread From: Reindl Harald @ 2017-04-27 14:43 UTC (permalink / raw) To: John Stoffel, Ron Leach; +Cc: linux-raid Am 27.04.2017 um 16:25 schrieb John Stoffel: > > Ron> We run a 2TB fileserver in a raid1 configuration. Today one of > Ron> the 2 disks (/dev/sdb) failed and we've just replaced it and set > Ron> up exactly the same partitions as the working, but degraded, raid > Ron> has on /dev/sda. > > First off, why are you bothering to do this? You should just mirror > the entire disk with MD because he has partitions in use and more than on mdraid as you can see in the /proc/mdstat output? because he just want to replace a disk? becaus ehe likjely has also the operating system on one of that many RAIDs sharing the same disks? > then build LVM volumes on top of that he is replacing a disk in a already existing RAID and has more than on RAID volume > you can then allocate as you see fit, moving your data around, > growing, shrinking volumes as you need he is replacing a disk in a already existing RAID and has more than on RAID volume [root@rh:~]$ df Filesystem Type Size Used Avail Use% Mounted on /dev/md1 ext4 29G 6.8G 22G 24% / /dev/md0 ext4 485M 34M 448M 7% /boot /dev/md2 ext4 3.6T 678G 2.9T 19% /mnt/data [root@rh:~]$ cat /proc/mdstat Personalities : [raid10] [raid1] md0 : active raid1 sda1[0] sdc1[1] sdb1[3] sdd1[2] 511988 blocks super 1.0 [4/4] [UUUU] md1 : active raid10 sda2[0] sdc2[1] sdd2[2] sdb2[3] 30716928 blocks super 1.1 512K chunks 2 near-copies [4/4] [UUUU] md2 : active raid10 sda3[0] sdc3[1] sdd3[2] sdb3[3] 3875222528 blocks super 1.1 512K chunks 2 near-copies [4/4] [UUUU] [========>............] check = 44.4% (1721204032/3875222528) finish=470.9min speed=76232K/sec ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Recovery on new 2TB disk: finish=7248.4min (raid1) 2017-04-27 14:43 ` Reindl Harald @ 2017-04-28 7:05 ` Ron Leach 0 siblings, 0 replies; 25+ messages in thread From: Ron Leach @ 2017-04-28 7:05 UTC (permalink / raw) To: linux-raid On 27/04/2017 15:43, Reindl Harald wrote: > [root@rh:~]$ cat /proc/mdstat > Personalities : [raid10] [raid1] > md0 : active raid1 sda1[0] sdc1[1] sdb1[3] sdd1[2] > 511988 blocks super 1.0 [4/4] [UUUU] > > md1 : active raid10 sda2[0] sdc2[1] sdd2[2] sdb2[3] > 30716928 blocks super 1.1 512K chunks 2 near-copies [4/4] [UUUU] > > md2 : active raid10 sda3[0] sdc3[1] sdd3[2] sdb3[3] > 3875222528 blocks super 1.1 512K chunks 2 near-copies [4/4] [UUUU] > [========>............] check = 44.4% (1721204032/3875222528) > finish=470.9min speed=76232K/sec > Those were the sort of times that I used to see on this machine. I've fixed it now, though. There were some clues in syslog - gdm3 was alerting 2 or 3 times each second, continually. This was because I'd taken this server offline and across to a workbench to change the disk. I'd restarted the machine, partitioned it, etc, and issued those --add commands, without a screen or keyboard, just over ssh. I hadn't realised that gdm3 would panic, causing a couple of acpid messages as well each time. Someone on the Debian list pointed out that gdm3 was a service and could be stopped for this circumstance. Doing that seemed to release mdadm to recovering at its normal rate; all the mds are fully replicated, now. Thanks to folks for contributing their thoughts, some interesting insights came up as well which will be useful in the future. This is quite an old server (still actively used), created before I realised the drawbacks of having so many partitions for each part of the filesystem; I don't do this on more-recent systems, which look more like the setup you show here. Anyway, all seems ok now, and thanks again, regards, Ron ^ permalink raw reply [flat|nested] 25+ messages in thread
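For anyone hitting the same thing on Debian 6, the display manager is an
ordinary sysvinit service, so stopping it for the duration of a rebuild is
just (illustrative):
# /etc/init.d/gdm3 stop
# /etc/init.d/gdm3 start     (once the arrays are back in sync)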
* Re: Recovery on new 2TB disk: finish=7248.4min (raid1) 2017-04-27 14:25 ` John Stoffel 2017-04-27 14:43 ` Reindl Harald @ 2017-04-27 14:54 ` Mateusz Korniak 2017-04-27 19:03 ` John Stoffel 1 sibling, 1 reply; 25+ messages in thread From: Mateusz Korniak @ 2017-04-27 14:54 UTC (permalink / raw) To: John Stoffel; +Cc: Ron Leach, linux-raid On Thursday 27 of April 2017 10:25:35 John Stoffel wrote: > Ron> issued these commands one after the other: > > Ron> # mdadm --manage -a /dev/mdo /dev/sdb1 > Ron> # mdadm --manage -a /dev/md1 /dev/sdb2 > Ron> # mdadm --manage -a /dev/md2 /dev/sdb3 > Ron> # mdadm --manage -a /dev/md3 /dev/sdb5 > Ron> # mdadm --manage -a /dev/md4 /dev/sdb6 > Ron> # mdadm --manage -a /dev/md5 /dev/sdb7 > Ron> # mdadm --manage -a /dev/md6 /dev/sdb8 > Ron> # mdadm --manage -a /dev/md7 /dev/sdb9 > > Ugh! You're setting yourself up for a true seek storm here, and way > too much pain down the road, IMHO. Just mirror the entire disk and > put LVM volumes on top. Why having several md devices leads to "seek storm"? They will be synced one by one, just like one big md device. Regards, -- Mateusz Korniak "(...) mam brata - poważny, domator, liczykrupa, hipokryta, pobożniś, krótko mówiąc - podpora społeczeństwa." Nikos Kazantzakis - "Grek Zorba" ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Recovery on new 2TB disk: finish=7248.4min (raid1) 2017-04-27 14:54 ` Mateusz Korniak @ 2017-04-27 19:03 ` John Stoffel 2017-04-27 19:42 ` Reindl Harald 2017-04-30 12:04 ` Nix 0 siblings, 2 replies; 25+ messages in thread From: John Stoffel @ 2017-04-27 19:03 UTC (permalink / raw) To: Mateusz Korniak; +Cc: John Stoffel, Ron Leach, linux-raid >>>>> "Mateusz" == Mateusz Korniak <mateusz-lists@ant.gliwice.pl> writes: Mateusz> On Thursday 27 of April 2017 10:25:35 John Stoffel wrote: Ron> issued these commands one after the other: >> Ron> # mdadm --manage -a /dev/mdo /dev/sdb1 Ron> # mdadm --manage -a /dev/md1 /dev/sdb2 Ron> # mdadm --manage -a /dev/md2 /dev/sdb3 Ron> # mdadm --manage -a /dev/md3 /dev/sdb5 Ron> # mdadm --manage -a /dev/md4 /dev/sdb6 Ron> # mdadm --manage -a /dev/md5 /dev/sdb7 Ron> # mdadm --manage -a /dev/md6 /dev/sdb8 Ron> # mdadm --manage -a /dev/md7 /dev/sdb9 >> >> Ugh! You're setting yourself up for a true seek storm here, and way >> too much pain down the road, IMHO. Just mirror the entire disk and >> put LVM volumes on top. Mateusz> Why having several md devices leads to "seek storm"? It might if MD isn't being smart enough, which might happen. Mateusz> They will be synced one by one, just like one big md device. No, big MD devices are sync'd in parallel assuming MD thinks they're on seperate devices. Now in this case I admit I might have jumped the gun, but I'm mostly commenting on the use of multiple MD RAID setups on a single pair of disks. It's inefficient. It's a pain to manage. You lose flexibility to resize. Just create a single MD device across the entire disk (or possibly two if you want to boot off one mirrored pair) and then use LVM on top to carve out storage. More flexible. John ^ permalink raw reply [flat|nested] 25+ messages in thread
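A minimal sketch of the layout suggested above - a small mirrored /boot plus
one big mirror carrying LVM; all names and sizes are purely illustrative, and
this is for a fresh build rather than an in-place conversion:
# mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1     (boot mirror)
# mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2     (everything else)
# pvcreate /dev/md1
# vgcreate vg0 /dev/md1
# lvcreate -L 10G -n root vg0
# lvcreate -L 40G -n home vg0
# lvcreate -L 1500G -n srv vg0
# mkfs.ext4 /dev/vg0/root      (and so on for the other LVs)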
* Re: Recovery on new 2TB disk: finish=7248.4min (raid1) 2017-04-27 19:03 ` John Stoffel @ 2017-04-27 19:42 ` Reindl Harald 2017-04-28 7:30 ` Mateusz Korniak 2017-04-30 12:04 ` Nix 1 sibling, 1 reply; 25+ messages in thread From: Reindl Harald @ 2017-04-27 19:42 UTC (permalink / raw) To: John Stoffel, Mateusz Korniak; +Cc: Ron Leach, linux-raid Am 27.04.2017 um 21:03 schrieb John Stoffel: > Mateusz> They will be synced one by one, just like one big md device. > > No, big MD devices are sync'd in parallel assuming MD thinks they're > on seperate devices if they are on sepearte drives it's no problem and no "seek storm" > Now in this case I admit I might have jumped the > gun, but I'm mostly commenting on the use of multiple MD RAID setups > on a single pair of disks. > > It's inefficient. It's a pain to manage. You lose flexibility to > resize. which don't matter if you have LVM on top > Just create a single MD device across the entire disk (or possibly two > if you want to boot off one mirrored pair) and then use LVM on top to > carve out storage. More flexible but you can't boot from a RAID5/RAID6/RAID10 so you have a single point of failure of a single boot disk or need at least two additional disks for a redundant boot device frankly on a proper designed storage machine you have no need for flexibility and resize because for it's entire lifetime you have enough storage at all and in case of LVM it don't matter how many md-devices are underlying the LVM ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Recovery on new 2TB disk: finish=7248.4min (raid1) 2017-04-27 19:42 ` Reindl Harald @ 2017-04-28 7:30 ` Mateusz Korniak 0 siblings, 0 replies; 25+ messages in thread From: Mateusz Korniak @ 2017-04-28 7:30 UTC (permalink / raw) To: Reindl Harald; +Cc: John Stoffel, Ron Leach, linux-raid On Thursday 27 of April 2017 21:42:27 Reindl Harald wrote: > > Just create a single MD device across the entire disk (...) > > but you can't boot from a RAID5/RAID6/RAID10 Since some time you can boot from LVM LV having PV(s) on RAID10 with grub2. -- Mateusz Korniak "(...) mam brata - poważny, domator, liczykrupa, hipokryta, pobożniś, krótko mówiąc - podpora społeczeństwa." Nikos Kazantzakis - "Grek Zorba" ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Recovery on new 2TB disk: finish=7248.4min (raid1) 2017-04-27 19:03 ` John Stoffel 2017-04-27 19:42 ` Reindl Harald @ 2017-04-30 12:04 ` Nix 2017-04-30 13:21 ` Roman Mamedov 1 sibling, 1 reply; 25+ messages in thread From: Nix @ 2017-04-30 12:04 UTC (permalink / raw) To: John Stoffel; +Cc: Mateusz Korniak, Ron Leach, linux-raid On 27 Apr 2017, John Stoffel spake thusly: > No, big MD devices are sync'd in parallel assuming MD thinks they're > on seperate devices. Now in this case I admit I might have jumped the > gun, but I'm mostly commenting on the use of multiple MD RAID setups > on a single pair of disks. > > It's inefficient. It's a pain to manage. You lose flexibility to > resize. Aside: the storage server I've just set up has a different rationale for having multiple mds. There's one in the 'fast part' of the rotating rust, and one in the 'slow part' (for big archival stuff that is rarely written to); the slow one has an LVM PV directly atop it, but the fast one has a bcache and then an LVM PV built atop that. The fast disk also has an md journal on SSD. Both are joined into one LVM VG. (The filesystem journals on the fast part are also on the SSD.) So I have a chunk of 'slow space' for things like ISOs and video files that are rarely written to (so a RAID journal is needless) and never want to be SSD-cached, and another (bigger) chunk of space for everything else, SSD-cached for speed and RAID-journalled for powerfail integrity. You can't do that with one big md array, since you can't have one array which is partially journalled and partially not. (You *can*, with the aid of dm, have one array which is partially bcached and partially not, but frankly messing about with direct dm linears seemed pointlessly painful. It's annoying enough to set up an md->bcache->LVM setup at boot: adding dmsetup to that as well seemed like pain beyond the call of duty.) (... actually it's more complex than that: there is *also* a RAID-0 containing an ext4 sans filesystem journal at the start of the disk for transient stuff like build trees that are easily regenerated, rarely needed more than once, and where journalling the writes or caching the reads on SSD is a total waste of SSD lifespan. If *that* gets corrupted, the boot machinery simply re-mkfses it.) -- NULL && (void) ^ permalink raw reply [flat|nested] 25+ messages in thread
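If it helps to picture that split, one way such a fast/slow layout is
commonly assembled - device and VG names are entirely illustrative, and the
md-journal and filesystem-journal details are left out:
# make-bcache -C /dev/sdc1 -B /dev/md/fast        (SSD cache set plus attached backing array)
# pvcreate /dev/bcache0 /dev/md/slow
# vgcreate data /dev/bcache0 /dev/md/slow
# lvcreate -L 200G -n archive data /dev/md/slow   (pin an LV to the uncached PV)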
* Re: Recovery on new 2TB disk: finish=7248.4min (raid1) 2017-04-30 12:04 ` Nix @ 2017-04-30 13:21 ` Roman Mamedov 2017-04-30 16:10 ` Nix 0 siblings, 1 reply; 25+ messages in thread From: Roman Mamedov @ 2017-04-30 13:21 UTC (permalink / raw) To: Nix; +Cc: John Stoffel, Mateusz Korniak, Ron Leach, linux-raid On Sun, 30 Apr 2017 13:04:36 +0100 Nix <nix@esperi.org.uk> wrote: > Aside: the storage server I've just set up has a different rationale for > having multiple mds. There's one in the 'fast part' of the rotating > rust, and one in the 'slow part' (for big archival stuff that is rarely > written to); the slow one has an LVM PV directly atop it, but the fast > one has a bcache and then an LVM PV built atop that. The fast disk also > has an md journal on SSD. Both are joined into one LVM VG. (The > filesystem journals on the fast part are also on the SSD.) It's not like the difference between the so called "fast" and "slow" parts is 100- or even 10-fold. Just SSD-cache the entire thing (I prefer lvmcache not bcache) and go. > So I have a chunk of 'slow space' for things like ISOs and video files > that are rarely written to (so a RAID journal is needless) and never > want to be SSD-cached, and another (bigger) chunk of space for > everything else, SSD-cached for speed and RAID-journalled for powerfail > integrity. > > (... actually it's more complex than that: there is *also* a RAID-0 > containing an ext4 sans filesystem journal at the start of the disk for > transient stuff like build trees that are easily regenerated, rarely > needed more than once, and where journalling the writes or caching the > reads on SSD is a total waste of SSD lifespan. If *that* gets corrupted, > the boot machinery simply re-mkfses it.) You have too much time on your hands if you have nothing better to do than to babysit all that b/s. -- With respect, Roman ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Recovery on new 2TB disk: finish=7248.4min (raid1) 2017-04-30 13:21 ` Roman Mamedov @ 2017-04-30 16:10 ` Nix 2017-04-30 16:47 ` Roman Mamedov 2017-04-30 17:16 ` Wols Lists 0 siblings, 2 replies; 25+ messages in thread From: Nix @ 2017-04-30 16:10 UTC (permalink / raw) To: Roman Mamedov; +Cc: John Stoffel, Mateusz Korniak, Ron Leach, linux-raid On 30 Apr 2017, Roman Mamedov spake thusly: > On Sun, 30 Apr 2017 13:04:36 +0100 > Nix <nix@esperi.org.uk> wrote: > >> Aside: the storage server I've just set up has a different rationale for >> having multiple mds. There's one in the 'fast part' of the rotating >> rust, and one in the 'slow part' (for big archival stuff that is rarely >> written to); the slow one has an LVM PV directly atop it, but the fast >> one has a bcache and then an LVM PV built atop that. The fast disk also >> has an md journal on SSD. Both are joined into one LVM VG. (The >> filesystem journals on the fast part are also on the SSD.) > > It's not like the difference between the so called "fast" and "slow" parts is > 100- or even 10-fold. Just SSD-cache the entire thing (I prefer lvmcache not > bcache) and go. I'd do that if SSDs had infinite lifespan. They really don't. :) lvmcache doesn't cache everything, only frequently-referenced things, so the problem is not so extreme there -- but the fact that it has to be set up anew for *each LV* is a complete killer for me, since I have encrypted filesystems and things that *have* to be on separate LVs and I really do not want to try to figure out the right balance between distinct caches, thanks (oh and also you have to get the metadata size right, and if you get it wrong and it runs out of space all hell breaks loose, AIUI). bcaching the whole block device avoids all this pointless complexity. bcache just works. >> So I have a chunk of 'slow space' for things like ISOs and video files >> that are rarely written to (so a RAID journal is needless) and never >> want to be SSD-cached, and another (bigger) chunk of space for >> everything else, SSD-cached for speed and RAID-journalled for powerfail >> integrity. >> >> (... actually it's more complex than that: there is *also* a RAID-0 >> containing an ext4 sans filesystem journal at the start of the disk for >> transient stuff like build trees that are easily regenerated, rarely >> needed more than once, and where journalling the writes or caching the >> reads on SSD is a total waste of SSD lifespan. If *that* gets corrupted, >> the boot machinery simply re-mkfses it.) > > You have too much time on your hands if you have nothing better to do than > to babysit all that b/s. This is a one-off with tooling to manage it: from my perspective, I just kick off the autobuilders etc and they'll automatically use transient space for objdirs. (And obviously this is all scripted so it is no harder than making or removing directories would be: typing 'mktransient foo' to automatically create a dir in transient space and set up a bind mount to it -- persisted across boots -- in the directory' foo' is literally a few letters more than typing 'mkdir foo'.) Frankly the annoyance factor of having to replace the SSD years in advance because every test build does several gigabytes of objdir writes that I'm not going to care about in fifteen minutes would be far higher than the annoyance factor of having to, uh, write three scripts about fifteen lines long to manage the transient space. -- NULL && (void) ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Recovery on new 2TB disk: finish=7248.4min (raid1) 2017-04-30 16:10 ` Nix @ 2017-04-30 16:47 ` Roman Mamedov 2017-05-01 21:13 ` Nix 2017-04-30 17:16 ` Wols Lists 1 sibling, 1 reply; 25+ messages in thread From: Roman Mamedov @ 2017-04-30 16:47 UTC (permalink / raw) To: Nix; +Cc: John Stoffel, Mateusz Korniak, Ron Leach, linux-raid On Sun, 30 Apr 2017 17:10:22 +0100 Nix <nix@esperi.org.uk> wrote: > > It's not like the difference between the so called "fast" and "slow" parts is > > 100- or even 10-fold. Just SSD-cache the entire thing (I prefer lvmcache not > > bcache) and go. > > I'd do that if SSDs had infinite lifespan. They really don't. :) > lvmcache doesn't cache everything, only frequently-referenced things, so > the problem is not so extreme there -- but Yes I was concerned the lvmcache will over-use the SSD by mistakenly caching streaming linear writes and the like -- and it absolutely doesn't. (it can during the initial fill-up of the cache, but not afterwards). Get an MLC-based SSD if that gives more peace of mind, but tests show even the less durable TLC-based ones have lifespan measuring in hundreds of TB. http://techreport.com/review/27909/the-ssd-endurance-experiment-theyre-all-dead One SSD that I have currently has 19 TB written to it over its entire 4.5 year lifespan. Over the past few months of being used as lvmcache for a 14 TB bulk data array and a separate /home FS, new writes average at about 16 GB/day. Given a VERY conservative 120 TBW endurance estimate, this SSD should last me all the way into year 2034 at least. > the fact that it has to be set up anew for *each LV* is a complete killer > for me, since I have encrypted filesystems and things that *have* to be on > separate LVs and I really do not want to try to figure out the right balance > between distinct caches, thanks (oh and also you have to get the metadata > size right, and if you get it wrong and it runs out of space all hell breaks > loose, AIUI). bcaching the whole block device avoids all this pointless > complexity. bcache just works. Oh yes I wish they had a VG-level lvmcache. Still, it feels more mature than bcache, the latter barely has any userspace management and monitoring tools (having to fiddle with "echo > /sys/..." and "cat /sys/..." is not the state of something you'd call a finished product). And the killer for me was that there is no way to stop using bcache on a partition, once it's a "bcache backing device" there is no way to migrate back to a raw partition, you're stuck with it. > This is a one-off with tooling to manage it: from my perspective, I just > kick off the autobuilders etc and they'll automatically use transient > space for objdirs. (And obviously this is all scripted so it is no > harder than making or removing directories would be: typing 'mktransient > foo' to automatically create a dir in transient space and set up a bind > mount to it -- persisted across boots -- in the directory' foo' is > literally a few letters more than typing 'mkdir foo'.) Sorry for being rather blunt initially, still IMO the amount if micromanagement required (and complexity introduced) is staggering compared to the benefits reaped -- and it all appears to stem from underestimating the modern SSDs. I'd suggest just get one and try "killing" it with your casual daily usage, you'll find (via TBW numbers you will see in SMART compared even to vendor spec'd ones, not to mention what tech sites' field tests show) that you just can't, not until deep into a dozen of years later into the future. 
-- With respect, Roman ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Recovery on new 2TB disk: finish=7248.4min (raid1) 2017-04-30 16:47 ` Roman Mamedov @ 2017-05-01 21:13 ` Nix 2017-05-01 21:44 ` Anthony Youngman 2017-05-01 21:46 ` Roman Mamedov 0 siblings, 2 replies; 25+ messages in thread From: Nix @ 2017-05-01 21:13 UTC (permalink / raw) To: Roman Mamedov; +Cc: John Stoffel, Mateusz Korniak, Ron Leach, linux-raid [linux-raid list: sorry, this is getting quite off-topic, though I'm finding the argument(?) quite fascinating. I can take it off-list if you like.] On 30 Apr 2017, Roman Mamedov told this: > On Sun, 30 Apr 2017 17:10:22 +0100 > Nix <nix@esperi.org.uk> wrote: > >> > It's not like the difference between the so called "fast" and "slow" parts is >> > 100- or even 10-fold. Just SSD-cache the entire thing (I prefer lvmcache not >> > bcache) and go. >> >> I'd do that if SSDs had infinite lifespan. They really don't. :) >> lvmcache doesn't cache everything, only frequently-referenced things, so >> the problem is not so extreme there -- but > > Yes I was concerned the lvmcache will over-use the SSD by mistakenly caching > streaming linear writes and the like -- and it absolutely doesn't. (it can > during the initial fill-up of the cache, but not afterwards). Yeah, it's hopeless to try to minimize SSD writes during initial cache population. Of course you'll write to the SSD a lot then. That's the point. > Get an MLC-based SSD if that gives more peace of mind, but tests show even the > less durable TLC-based ones have lifespan measuring in hundreds of TB. > http://techreport.com/review/27909/the-ssd-endurance-experiment-theyre-all-dead That was a fascinating and frankly quite reassuring article, thank you! :) > One SSD that I have currently has 19 TB written to it over its entire 4.5 year > lifespan. Over the past few months of being used as lvmcache for a 14 TB > bulk data array and a separate /home FS, new writes average at about 16 GB/day. That's a lot less than I expect, alas. Busy machine, lots of busy source trees and large transient writes -- and without some careful management the SSD capacity would not be larger than the expected working set forever. That's what the fast/slow division is for. > Given a VERY conservative 120 TBW endurance estimate, this SSD should last me > all the way into year 2034 at least. The lifetime estimate on mine says three years to failure, at present usage rates (datacenter-quality SSDs are neat, they give you software that tells you things like this). I'll probably replace it with one rated for higher write loads next time. They're still beyond my price point right now, but in three years they should be much cheaper!) >> the fact that it has to be set up anew for *each LV* is a complete killer >> for me, since I have encrypted filesystems and things that *have* to be on >> separate LVs and I really do not want to try to figure out the right balance >> between distinct caches, thanks (oh and also you have to get the metadata >> size right, and if you get it wrong and it runs out of space all hell breaks >> loose, AIUI). bcaching the whole block device avoids all this pointless >> complexity. bcache just works. > > Oh yes I wish they had a VG-level lvmcache. Still, it feels more mature than > bcache, the latter barely has any userspace management and monitoring tools I was worried about that, but y'know you hardly need them. You set it up and it just works. (Plus, you can do things like temporarily turn the cache *off* during e.g. 
initial population, have it ignore low-priority I/O, streaming reads etc, none of which lvmcache could do last time I looked. And nearly all the /sys knobs are persistently written to the bcache superblock so you only need to tweak them once.) I far prefer that to LVM's horribly complicated tools, which I frankly barely understand by this point. The manpages randomly intermingle ordinary LV, snapshotting, RAID, caching, clustering, and options only useful for other use cases in an almighty tangle, relying on examples at the bottom of the manpage to try to indicate which options are useful where. Frankly they should be totally reorganized to be much more like mdadm's -- divided into nice neat sections or at least with some sort of by-LV-type options chart. As for monitoring, the stats in /sys knock LVM's completely out of the park, with continuously-updated stats on multiple time horizons. To me, LVM feels both overdesigned and seriously undercooked for this use case, definitely not ready for serious use as a cache. > (having to fiddle with "echo > /sys/..." and "cat /sys/..." is not the state > of something you'd call a finished product). You mean, like md? :) I like /sys. It's easy to explore and you can use your standard fs tools on it. The only downside is the inability to comment anything :( but that's what documentation is for. (Oh, also, if you need ordering or binary data, /sys is the wrong tool. But for configuration interfaces that is rarely true.) > And the killer for me was that > there is no way to stop using bcache on a partition, once it's a "bcache > backing device" there is no way to migrate back to a raw partition, you're > stuck with it. That doesn't really matter, since you can turn the cache off completely and persistently with echo none > /sys/block/bcache$num/bcache/cache_mode and as soon as you do, the cache device is no longer required for the bcache to work (though if you had it in use for writeback caching, you'll have some fscking to do), and it imposes no overhead that I can discern. (The inability to use names with bcache devices *is* annoying: LVM and indeed md beats it there.) >> This is a one-off with tooling to manage it: from my perspective, I just >> kick off the autobuilders etc and they'll automatically use transient >> space for objdirs. (And obviously this is all scripted so it is no >> harder than making or removing directories would be: typing 'mktransient >> foo' to automatically create a dir in transient space and set up a bind >> mount to it -- persisted across boots -- in the directory' foo' is >> literally a few letters more than typing 'mkdir foo'.) > > Sorry for being rather blunt initially, still IMO the amount if micromanagement > required (and complexity introduced) is staggering compared to the benefits I was worried about that, but it's almost entirely scripted, so "none to speak of". The only admin overhead I see in my daily usage is a single "sync-vms" command every time I yum update my more write-insane test virtual machines. (I don't like writing 100GiB to the SSD ten times a day, so I run those VMs via CoW onto the RAID-0 transient fs, and write them back to their real filesystems on the cached/journalled array after big yum updates or when I do something else I want long-term preservation for. That happens every few weeks, at most.) 
Everything else is automated: my autobuilders make transient bind-mounts onto the RAID-0 as needed, video transcoding drops stuff in there automatically, and backups run with ionice -c3 so they don't flood the cache either. I probably don't run mktransient by hand more than once a month. I'd be more worried about the complexity required to just figure out the space needed for half a dozen sets of lvmcache metadata and cache filesystems. (How do you know how much cache you'll need for each fs in advance, anyway? That seems like a much harder question to answer than "will I want to cache this at all".) > reaped -- and it all appears to stem from underestimating the modern SSDs. > I'd suggest just get one and try "killing" it with your casual daily usage, When did I say I was a casual daily user? Build-and-test cycles with tens to hundreds of gigs of writes daily are routine, and video transcoding runs with half-terabyte to a terabyte of writes happen quite often. I care about the content of those writes for about ten minutes (one write, one read) and then couldn't care less about them: they're entirely transient. Dumping them to an SSD cache, or indeed to the md journal, is just pointless. I'm dropping some of them onto tmpfs, but some are just too large for that. I didn't say this was a setup useful for everyone! My workload happens to have a lot of large briefly-useful writes in it, and a lot of archival data that I don't want to waste space caching. It's the *other* stuff, that doesn't fit into those categories, that I want to cache and RAID-journal (and, for that matter, run routine backups of, so my own backup policies told me what data fell into which category.) As for modern SSDs... I think my Intel S3510 is a modern SSD, if not a write-workload-focused one (my supplier lied to me and claimed it was write-focused, and the spec sheet that said otherwise did not become apparent until after I bought it, curses). I'll switch to a write-focused 1.2TiB S3710, or the then-modern equivalent, when the S3510 burns out. *That* little monster is rated for 14 petabytes of writes before failure... but it also costs over a thousand pounds right now, and I already have a perfectly good SSD, so why not use it until it dies? I'd agree that when using something like the S3710 I'm going to stop caring about writes, because if you try to write that much to rotating rust it's going to wear out too. But the 480GiB S3510, depending on which spec sheets I read, is either rated for 290TiB or 876TiB of writes before failure, and given the Intel SSD "suicide-pill" self-bricking wearout failure mode described in the article you cited above, I think being a bit cautious is worthwhile. 290TiB is only the equivalent of thirteen complete end-to-end writes to the sum of all my RAID arrays... so no, I'm not treating it like it has infinite write endurance. Its own specs say it doesn't. (This is also why only the fast array is md-journalled.) (However, I do note that the 335 tested in the endurance test above is only rated for 20GiB of daily writes for three years, which comes to only 22TiB total writes, but in the tests it bricked itself after *720TiB*. So it's quite possible my S3510 will last vastly longer than its own diagnostic tools estimate, well into the petabytes. I do hope it does! I'm just not *trusting* that it does. A bit of fiddling and scripting at setup time is quite acceptable for that peace of mind. It wouldn't be worth it if this was a lot of work on an ongoing basis, but it's nearly none.) 
> you'll find (via TBW numbers you will see in SMART compared even to vendor > spec'd ones, not to mention what tech sites' field tests show) that you just > can't, not until deep into a dozen of years later into the future. I'm very impressed by modern SSD write figures, and suspect that in a few years they will be comparable to rotating rust's. They're just not, yet. Not quite, and my workload falls squarely into the 'not quite' gap. Given how easy it was for me to script my way around this problem, I didn't mind much. With a hardware RAID array, it would have been much more difficult! md's unmatched flexibility shines yet again. -- NULL && (void) ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Recovery on new 2TB disk: finish=7248.4min (raid1) 2017-05-01 21:13 ` Nix @ 2017-05-01 21:44 ` Anthony Youngman 2017-05-01 21:46 ` Roman Mamedov 1 sibling, 0 replies; 25+ messages in thread From: Anthony Youngman @ 2017-05-01 21:44 UTC (permalink / raw) To: Nix, Roman Mamedov; +Cc: John Stoffel, Mateusz Korniak, Ron Leach, linux-raid On 01/05/17 22:13, Nix wrote: > [linux-raid list: sorry, this is getting quite off-topic, though I'm > finding the argument(?) quite fascinating. I can take it off-list if > you like.] 1) I'm finding it fascinating, too ... :-) 2) We get a reasonable amount of other stuff which I think is off-topic for raid but is cc'd to the list because it's relevant to linux 3) At some point it would be nice to have this sort of stuff on the wiki, so people have up-to-date performance figures. Snag is, someone's got to run a load of benchmarks, document it, and do a lot of work for little personal benefit ... Cheers, Wol ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Recovery on new 2TB disk: finish=7248.4min (raid1) 2017-05-01 21:13 ` Nix 2017-05-01 21:44 ` Anthony Youngman @ 2017-05-01 21:46 ` Roman Mamedov 2017-05-01 21:53 ` Anthony Youngman 2017-05-01 23:26 ` Nix 1 sibling, 2 replies; 25+ messages in thread From: Roman Mamedov @ 2017-05-01 21:46 UTC (permalink / raw) To: Nix; +Cc: John Stoffel, Mateusz Korniak, Ron Leach, linux-raid On Mon, 01 May 2017 22:13:59 +0100 Nix <nix@esperi.org.uk> wrote: > > (having to fiddle with "echo > /sys/..." and "cat /sys/..." is not the state > > of something you'd call a finished product). > > You mean, like md? :) You must be kidding. On the contrary I was looking to present md as the example of how it should be done, with its all-encompassing and extremely capable 'mdadm' tool -- and a complete lack of a similar tool for bcache. > I'd be more worried about the complexity required to just figure out the > space needed for half a dozen sets of lvmcache metadata and cache > filesystems. Metadata can be implicitly created and auto-managed in recent lvm versions http://fibrevillage.com/storage/460-how-to-create-lvm-cache-logical-volume ("Automatic pool metadata LV"). If not, the rule of thumb suggested everywhere is 1/1000 of the cache volume size; I doubled that just in case, and looks like I didn't have to, as my metadata partitions are only about 9.5% full each. As for half a dozen sets, I'd reconsider the need for those, as well as the entire fast/slow HDD tracks separation, just SSD-cache everything and let it figure out to not cache streaming writes on your video transcodes, or even bulk writes during your compile tests (while still caching the filesystem metadata). However the world of pain begins if you want to have multiple guest VMs each with its disk as a separate LV. One solution (that doesn't sound too clean but perhaps could work), is stacked LVM, i.e. a PV of a different volume group made on top of a cached LV. -- With respect, Roman ^ permalink raw reply [flat|nested] 25+ messages in thread
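For completeness, a minimal lvmcache setup with auto-created metadata on a
recent LVM looks roughly like this (VG, LV and device names are examples only):
# lvcreate -L 100G -n cache0 vg0 /dev/sdc1                (cache-data LV placed on the SSD PV)
# lvconvert --type cache-pool vg0/cache0                  (metadata LV is created automatically)
# lvconvert --type cache --cachepool vg0/cache0 vg0/home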
* Re: Recovery on new 2TB disk: finish=7248.4min (raid1) 2017-05-01 21:46 ` Roman Mamedov @ 2017-05-01 21:53 ` Anthony Youngman 2017-05-01 22:03 ` Roman Mamedov 2017-05-01 23:26 ` Nix 1 sibling, 1 reply; 25+ messages in thread From: Anthony Youngman @ 2017-05-01 21:53 UTC (permalink / raw) To: Roman Mamedov, Nix; +Cc: John Stoffel, Mateusz Korniak, Ron Leach, linux-raid On 01/05/17 22:46, Roman Mamedov wrote: > On Mon, 01 May 2017 22:13:59 +0100 > Nix <nix@esperi.org.uk> wrote: > >>> (having to fiddle with "echo > /sys/..." and "cat /sys/..." is not the state >>> of something you'd call a finished product). >> You mean, like md? :) > You must be kidding. On the contrary I was looking to present md as the > example of how it should be done, with its all-encompassing and extremely > capable 'mdadm' tool -- and a complete lack of a similar tool for bcache. > That's what I understood you to mean, but you are aware that SOME raid management still has to be done with echo > /sys/... ? So mdadm isn't perfect, not by a long chalk, yet :-) Cheers, Wol ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Recovery on new 2TB disk: finish=7248.4min (raid1) 2017-05-01 21:53 ` Anthony Youngman @ 2017-05-01 22:03 ` Roman Mamedov 2017-05-02 6:10 ` Wols Lists 0 siblings, 1 reply; 25+ messages in thread From: Roman Mamedov @ 2017-05-01 22:03 UTC (permalink / raw) To: Anthony Youngman Cc: Nix, John Stoffel, Mateusz Korniak, Ron Leach, linux-raid On Mon, 1 May 2017 22:53:14 +0100 Anthony Youngman <antlists@youngman.org.uk> wrote: > That's what I understood you to mean, but you are aware that SOME raid > management still has to be done with echo > /sys/... ? > > So mdadm isn't perfect, not by a long chalk, yet :-) Well, why not post some examples of what you find yourself doing often via /sys, that's not available in mdadm (maybe as a new thread). One that I remember is the "want_replacement" mechanism, which was initially only available via "echo > /sys/..." but quickly got added to mdadm as "--replace". People (and various outdated wikis) also tend to suggest using "echo check..." or "echo repair...", but those are available in mdadm as well, via "--action=". Lastly, I change the "stripe_cache_size" via /sys/, but that's fine-tuning, which feels OK to do via sysfs parameters, whereas needing to use sysfs for the most basic operations of managing the storage system does not. -- With respect, Roman ^ permalink raw reply [flat|nested] 25+ messages in thread
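For reference, the mdadm and sysfs items mentioned there look like this
(array, device names and values are illustrative):
# mdadm /dev/md2 --replace /dev/sdb3                     (instead of echo want_replacement > .../state)
# echo 8192 > /sys/block/md2/md/stripe_cache_size        (raid5/6 tuning only)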
* Re: Recovery on new 2TB disk: finish=7248.4min (raid1) 2017-05-01 22:03 ` Roman Mamedov @ 2017-05-02 6:10 ` Wols Lists 2017-05-02 10:02 ` Nix 0 siblings, 1 reply; 25+ messages in thread From: Wols Lists @ 2017-05-02 6:10 UTC (permalink / raw) To: Roman Mamedov; +Cc: Nix, John Stoffel, Mateusz Korniak, Ron Leach, linux-raid On 01/05/17 23:03, Roman Mamedov wrote: >> > That's what I understood you to mean, but you are aware that SOME raid >> > management still has to be done with echo > /sys/... ? >> > >> > So mdadm isn't perfect, not by a long chalk, yet :-) > Well, why not post some examples of what you find yourself doing often > via /sys, that's not available in mdadm (maybe as a new thread). I *should* do, rather than I *do* do, but your everyday general maintenance tasks, like scrubbing? I'm not aware of *any* day-to-day tasks that you do using mdadm - oh and if you don't scrub it's not an uncommon cause of arrays crashing... Cheers, Wol ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Recovery on new 2TB disk: finish=7248.4min (raid1) 2017-05-02 6:10 ` Wols Lists @ 2017-05-02 10:02 ` Nix 0 siblings, 0 replies; 25+ messages in thread From: Nix @ 2017-05-02 10:02 UTC (permalink / raw) To: Wols Lists Cc: Roman Mamedov, John Stoffel, Mateusz Korniak, Ron Leach, linux-raid On 2 May 2017, Wols Lists outgrape: > On 01/05/17 23:03, Roman Mamedov wrote: >>> > That's what I understood you to mean, but you are aware that SOME raid >>> > management still has to be done with echo > /sys/... ? >>> > >>> > So mdadm isn't perfect, not by a long chalk, yet :-) >> Well, why not post some examples of what you find yourself doing often >> via /sys, that's not available in mdadm (maybe as a new thread). > > I *should* do, rather than I *do* do, but your everyday general > maintenance tasks, like scrubbing? You can scrub with mdadm now. :) IIRC (I haven't started using it yet) the syntax is something like mdadm --misc --action=check /dev/md/my-array or mdadm --misc --action=repair /dev/md/my-array (though frankly it has never been clear to me which is preferable for a regular scrub. Probably check on a RAID-6, repair on a RAID-5 where such failures are much more potentially catastrophic...) ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Recovery on new 2TB disk: finish=7248.4min (raid1) 2017-05-01 21:46 ` Roman Mamedov 2017-05-01 21:53 ` Anthony Youngman @ 2017-05-01 23:26 ` Nix 1 sibling, 0 replies; 25+ messages in thread From: Nix @ 2017-05-01 23:26 UTC (permalink / raw) To: Roman Mamedov; +Cc: John Stoffel, Mateusz Korniak, Ron Leach, linux-raid On 1 May 2017, Roman Mamedov outgrape: > On Mon, 01 May 2017 22:13:59 +0100 > Nix <nix@esperi.org.uk> wrote: >> I'd be more worried about the complexity required to just figure out the >> space needed for half a dozen sets of lvmcache metadata and cache >> filesystems. > > Metadata can be implicitly created and auto-managed in recent lvm versions > http://fibrevillage.com/storage/460-how-to-create-lvm-cache-logical-volume > ("Automatic pool metadata LV"). If not, the rule of thumb suggested everywhere Useful. > is 1/1000 of the cache volume size; I doubled that just in case, and looks > like I didn't have to, as my metadata partitions are only about 9.5% full each. Still seems like something that shouldn't need to be in a separate LV at all (though .1% is not something to be worried about: it just seems inelegant to me). > As for half a dozen sets, I'd reconsider the need for those, as well as the > entire fast/slow HDD tracks separation, just SSD-cache everything and let it > figure out to not cache streaming writes on your video transcodes, or even bulk > writes during your compile tests (while still caching the filesystem metadata). > > However the world of pain begins if you want to have multiple guest VMs each > with its disk as a separate LV. One solution (that doesn't sound too clean but > perhaps could work), is stacked LVM, i.e. a PV of a different volume group made > on top of a cached LV. Yeah. Or, as I have, multiple LVs for encrypted filesystems with distinct passphrases and mount lifetimes, all of which I want to cache. (And I want the cache to cache the *encrypted* data, obviously, but both LVM and bcache will do that.) I think LVM needs something like bcache's cache sets -- i.e. the ability to have one cache for multiple LVs (this would be even more flexible than doing it at the VG level, since you retain the current ability to cache only some LVs without having to figure out how much cache each needs). -- NULL && (void) ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Recovery on new 2TB disk: finish=7248.4min (raid1)
2017-04-30 16:10 ` Nix
2017-04-30 16:47 ` Roman Mamedov
@ 2017-04-30 17:16 ` Wols Lists
2017-05-01 20:12 ` Nix
1 sibling, 1 reply; 25+ messages in thread
From: Wols Lists @ 2017-04-30 17:16 UTC (permalink / raw)
To: Nix, Roman Mamedov; +Cc: John Stoffel, Mateusz Korniak, Ron Leach, linux-raid

On 30/04/17 17:10, Nix wrote:
> This is a one-off with tooling to manage it: from my perspective, I just
> kick off the autobuilders etc and they'll automatically use transient
> space for objdirs. (And obviously this is all scripted so it is no
> harder than making or removing directories would be: typing 'mktransient
> foo' to automatically create a dir in transient space and set up a bind
> mount to it -- persisted across boots -- in the directory 'foo' is
> literally a few letters more than typing 'mkdir foo'.)

This sounds like me with tmpfs. Okay, mine don't persist across reboots,
but if it's in your build scripts, can't they create a tmpfs and do the
builds in that?

My system maxes out at 16GB RAM, so twice RAM per disk as swap gives me
32GB of swap per disk; with 2 disks that's 64GB of swap for all my
transient stuff. Running Gentoo, I need that space to build gcc, LO
etc. :-)

Cheers,
Wol

^ permalink raw reply	[flat|nested] 25+ messages in thread
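If anyone wants to copy the tmpfs approach described above, a minimal sketch;
the 32G size and the /mnt/build path are arbitrary examples, and a tmpfs only
spills into swap under memory pressure:

# mkdir -p /mnt/build
# mount -t tmpfs -o size=32G,mode=1777 tmpfs /mnt/build    (one-off mount for a large build)

or, to have the mount (not its contents) survive reboots, an /etc/fstab line
along these lines:

tmpfs  /mnt/build  tmpfs  size=32G,mode=1777  0  0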
* Re: Recovery on new 2TB disk: finish=7248.4min (raid1)
2017-04-30 17:16 ` Wols Lists
@ 2017-05-01 20:12 ` Nix
0 siblings, 0 replies; 25+ messages in thread
From: Nix @ 2017-05-01 20:12 UTC (permalink / raw)
To: Wols Lists
Cc: Roman Mamedov, John Stoffel, Mateusz Korniak, Ron Leach, linux-raid

On 30 Apr 2017, Wols Lists verbalised:

> On 30/04/17 17:10, Nix wrote:
>> This is a one-off with tooling to manage it: from my perspective, I just
>> kick off the autobuilders etc and they'll automatically use transient
>> space for objdirs. (And obviously this is all scripted so it is no
>> harder than making or removing directories would be: typing 'mktransient
>> foo' to automatically create a dir in transient space and set up a bind
>> mount to it -- persisted across boots -- in the directory 'foo' is
>> literally a few letters more than typing 'mkdir foo'.)
>
> This sounds like me with tmpfs. Okay, mine don't persist across reboots,
> but if it's in your build scripts, can't they create a tmpfs and do the
> builds in that?

Even though I have 128GiB RAM, I don't want a tmpfs for everything
(though I use it for a lot, oh yes). Some of the transient stuff is
things like QEMU CoW disk images that I don't want to lose on every
reboot (though rebuilding them is not too annoying, it is not totally
ignorable either); other stuff is things like vapoursynth intermediates
which can easily exceed 500GiB, which are only ever read once but
nonetheless just won't fit.

--
NULL && (void)

^ permalink raw reply	[flat|nested] 25+ messages in thread
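For readers unfamiliar with them, the "QEMU CoW disk images" mentioned above
are qcow2 overlays backed by a read-only base image; a minimal sketch with
placeholder file names (the -F backing-format flag needs a reasonably recent
qemu-img):

# qemu-img create -f qcow2 -F qcow2 -b debian-base.qcow2 scratch-vm.qcow2
# qemu-img commit scratch-vm.qcow2               (optionally merge the overlay back into the base later)

Writes land only in scratch-vm.qcow2 while the base stays pristine, which is
why losing the overlay on a reboot is an inconvenience rather than a disaster.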
* Re: Recovery on new 2TB disk: finish=7248.4min (raid1)
2017-04-26 21:57 Recovery on new 2TB disk: finish=7248.4min (raid1) Ron Leach
2017-04-27 14:25 ` John Stoffel
@ 2017-04-27 14:58 ` Mateusz Korniak
2017-04-27 19:01 ` Ron Leach
1 sibling, 1 reply; 25+ messages in thread
From: Mateusz Korniak @ 2017-04-27 14:58 UTC (permalink / raw)
To: Ron Leach; +Cc: linux-raid

On Wednesday 26 of April 2017 22:57:33 Ron Leach wrote:
> is resulting in a very-unusually slow recovery.

What is the output of

iostat -x 30 2 -m

during recovery?

Regards,

--
Mateusz Korniak
"(...) I have a brother - serious, a homebody, a penny-pincher, a
hypocrite, a pious sort; in short - a pillar of society."
Nikos Kazantzakis - "Grek Zorba"

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: Recovery on new 2TB disk: finish=7248.4min (raid1)
2017-04-27 14:58 ` Mateusz Korniak
@ 2017-04-27 19:01 ` Ron Leach
2017-04-28  7:06 ` Mateusz Korniak
0 siblings, 1 reply; 25+ messages in thread
From: Ron Leach @ 2017-04-27 19:01 UTC (permalink / raw)
To: linux-raid

On 27/04/2017 15:58, Mateusz Korniak wrote:
>
> What is the output of
> iostat -x 30 2 -m
> during recovery?
>

The iostat command is not recognised - perhaps it is not installed; man
iostat was not recognised either. What would the command have done? I
may be able to post equivalent information - smartctl is installed and
enabled on sdb, the new disk.

Recovery is still going on - I have not wanted to disrupt it in case I
mess it up. (Though all the volatile user data is multiply backed up,
none of the OS or configuration data is, so I don't want to risk
compromising the array contents.)

regards, Ron

^ permalink raw reply	[flat|nested] 25+ messages in thread
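For what it's worth, iostat ships in the sysstat package on Debian rather than
being installed by default, so (assuming the machine can reach its mirrors) it
can be added without touching the array or the rebuild:

# apt-get install sysstat
# iostat -x 30 2 -m        (the first report is the average since boot; the second covers the 30-second interval)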
* Re: Recovery on new 2TB disk: finish=7248.4min (raid1)
2017-04-27 19:01 ` Ron Leach
@ 2017-04-28  7:06 ` Mateusz Korniak
0 siblings, 0 replies; 25+ messages in thread
From: Mateusz Korniak @ 2017-04-28  7:06 UTC (permalink / raw)
To: Ron Leach; +Cc: linux-raid

On Thursday 27 of April 2017 20:01:44 Ron Leach wrote:
> > iostat -x 30 2 -m
>
> What would the command have done?

It would show the performance and load of the block devices in the
system.

--
Mateusz Korniak
"(...) I have a brother - serious, a homebody, a penny-pincher, a
hypocrite, a pious sort; in short - a pillar of society."
Nikos Kazantzakis - "Grek Zorba"

^ permalink raw reply	[flat|nested] 25+ messages in thread
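As a rough guide to reading that report (column names as printed by sysstat's
extended output; the exact set varies a little between versions):

# iostat -x 30 2 -m
          r/s, w/s      reads/writes per second issued to each device
          rMB/s, wMB/s  throughput in MB/s (the -m flag)
          await         average ms an I/O spends queued plus being serviced
          %util         fraction of time the device was busy; a disk pinned
                        near 100% during the rebuild is the likely bottleneck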
end of thread, other threads:[~2017-05-02 10:02 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-04-26 21:57 Recovery on new 2TB disk: finish=7248.4min (raid1) Ron Leach
2017-04-27 14:25 ` John Stoffel
2017-04-27 14:43 ` Reindl Harald
2017-04-28  7:05 ` Ron Leach
2017-04-27 14:54 ` Mateusz Korniak
2017-04-27 19:03 ` John Stoffel
2017-04-27 19:42 ` Reindl Harald
2017-04-28  7:30 ` Mateusz Korniak
2017-04-30 12:04 ` Nix
2017-04-30 13:21 ` Roman Mamedov
2017-04-30 16:10 ` Nix
2017-04-30 16:47 ` Roman Mamedov
2017-05-01 21:13 ` Nix
2017-05-01 21:44 ` Anthony Youngman
2017-05-01 21:46 ` Roman Mamedov
2017-05-01 21:53 ` Anthony Youngman
2017-05-01 22:03 ` Roman Mamedov
2017-05-02  6:10 ` Wols Lists
2017-05-02 10:02 ` Nix
2017-05-01 23:26 ` Nix
2017-04-30 17:16 ` Wols Lists
2017-05-01 20:12 ` Nix
2017-04-27 14:58 ` Mateusz Korniak
2017-04-27 19:01 ` Ron Leach
2017-04-28  7:06 ` Mateusz Korniak