* Re: Replacing a drive from a RAID 1 array
2015-06-16 16:58 ` Hugo Mills
2015-06-16 21:04 ` Chris Murphy
@ 2015-06-17 0:51 ` Duncan
2015-06-17 11:30 ` Austin S Hemmelgarn
2 siblings, 0 replies; 5+ messages in thread
From: Duncan @ 2015-06-17 0:51 UTC (permalink / raw)
To: linux-btrfs
Hugo Mills posted on Tue, 16 Jun 2015 16:58:32 +0000 as excerpted:
> On Tue, Jun 16, 2015 at 06:43:23PM +0200, Arnaud Kapp wrote:
>>
>> Consider the following situation: I have a RAID 1 array with 4 drives.
>> I want to replace one of the drives with a new one of greater capacity.
>>
>> However, let's say I only have 4 HDD slots, so I cannot plug in the new
>> drive, add it to the array, and then remove the other one.
>> Is there a *safe* way to change drives in this situation? I'd bet that
>> booting with 3 drives, adding the new one, then removing the old,
>> no-longer-connected one would work. However, is there something that
>> could go wrong in this situation?
>
> The main thing that could go wrong with that is a disk failure.
Agreed with Hugo (and Chris), but there are a couple of additional
factors to consider that they didn't mention.
1) Btrfs raid1, unlike for example mdraid raid1, is always exactly two
copies, regardless of the number of devices. More devices mean more
storage capacity, not more copies and thus not more redundancy.
So physical removal of a device from a btrfs raid1 means you have only
one copy left of anything that was on that device, since there are only
two copies and you just removed the device containing one of them.
Which of course is why the device failure Hugo mentioned is so critical,
because that would mean loss of the other copy for anything where the
second copy was on the newly failed device. =:^(
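For reference, the remove-then-boot-degraded route the original post
proposes would look roughly like the sketch below. Device names, the
mount point, and the devid are all assumptions for illustration; the
missing device has to be addressed by devid, since it no longer has a
path:

```shell
# Sketch of the riskier degraded route, NOT a recommendation.
# /dev/sdb, /dev/sde, /mnt, and devid 4 are hypothetical; run as root.

# Boot with the old drive removed and the new one in its slot, then
# mount the now-degraded raid1:
mount -o degraded /dev/sdb /mnt

# Find the devid of the missing device ("missing" in the output):
btrfs filesystem show /mnt

# Replace the missing device (assumed devid 4) with the new drive,
# reading everything from the single remaining copies:
btrfs replace start 4 /dev/sde /mnt
btrfs replace status /mnt
```

Note that on older kernels a degraded raid1 may only mount read-write
once per missing device (new chunks get created as single-profile), so
this route should be treated as one-shot, not something to experiment
with casually.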
2) Btrfs' data integrity feature adds another aspect to btrfs raid1 that
normal raid1 doesn't deal with. The great thing about btrfs raid1 is
that both copies of the data (and metadata) are checksummed, and in
normal operation, should one copy fail its checksum validation, btrfs
can check the second copy and, assuming it's fine, use it while
rewriting the checksum-failed copy from the good one.
Thus, removing one of those two copies has the additional aspect that if
the remaining one is now found to be bad, there's no fallback, and that
file (for data) is simply unavailable. For bad metadata the problem is
of course worse, as that bad metadata very likely covered multiple files
and possibly directories, and you will likely lose access to them all.
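Duncan's checksum point suggests an obvious precaution: scrub the
filesystem while all four devices are still present, so that any bad
copy gets repaired from its good twin *before* a device is removed. A
minimal sketch, with a hypothetical mount point:

```shell
# Verify (and auto-repair) both copies while full redundancy still
# exists.  /mnt is a hypothetical mount point; run as root.
btrfs scrub start -B /mnt   # -B: run in the foreground, print stats
btrfs scrub status /mnt     # check for unrecoverable errors
```

If the scrub completes with zero unrecoverable errors, you at least
know that every block had one good copy at the moment you pulled the
drive.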
The overall effect, then, is to take the device-failure possibility from
the whole device level to the individual file level. While failure of
the whole device may be considered unlikely, on today's multi-terabyte
devices, there's statistically actually a reasonable chance of at least
one failure on each device. If your devices become that statistic, and
the one remaining copy of some block turns out to be bad while the
device holding the other copy is unavailable...
The bottom line is that with a device and its copy removed, there's a
reasonable statistical chance you'll lose access to at least one file
because the remaining copy is found to fail checksum and be bad.
Which of course makes it even MORE important to arrange, if at all
possible, a way to keep the to-be-removed device online, via a
temporary hookup if necessary, while running the replace that will
ultimately move its contents to the new device. Playing the odds is
acceptable when a device has failed and there's no other way (though as
always, the sysadmin's rule applies: if you didn't have a backup, then
by definition and by (lack of) action you didn't care about that data,
despite any claims to the contrary), but if you have a choice, don't
play the odds, play it smart.
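The "play it smart" route, with the outgoing drive still attached (even
via a temporary hookup), is a straight replace plus a resize to claim
the new drive's extra capacity. Device names, mount point, and devid
below are assumptions:

```shell
# Old drive (/dev/sdd, hypothetical) still online: replace it directly.
# btrfs reads from the drive itself where possible, falling back to the
# raid1 mirror on error; run as root.
btrfs replace start /dev/sdd /dev/sde /mnt
btrfs replace status /mnt

# The new device inherits the old one's size; grow it to use the full
# capacity of the larger drive (devid 4 is an assumption; check
# 'btrfs filesystem show /mnt' for the real one):
btrfs filesystem resize 4:max /mnt
```

The resize step is easy to forget: without it, the filesystem keeps
using only as much of the new drive as the old one provided.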
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: Replacing a drive from a RAID 1 array
2015-06-16 16:58 ` Hugo Mills
2015-06-16 21:04 ` Chris Murphy
2015-06-17 0:51 ` Duncan
@ 2015-06-17 11:30 ` Austin S Hemmelgarn
2 siblings, 0 replies; 5+ messages in thread
From: Austin S Hemmelgarn @ 2015-06-17 11:30 UTC (permalink / raw)
To: Hugo Mills, Arnaud Kapp, linux-btrfs
On 2015-06-16 12:58, Hugo Mills wrote:
> On Tue, Jun 16, 2015 at 06:43:23PM +0200, Arnaud Kapp wrote:
>> Hello,
>>
>> Consider the following situation: I have a RAID 1 array with 4 drives.
>> I want to replace one of the drives with a new one of greater capacity.
>>
>> However, let's say I only have 4 HDD slots, so I cannot plug in the new
>> drive, add it to the array, and then remove the other one.
>> Is there a *safe* way to change drives in this situation? I'd bet that
>> booting with 3 drives, adding the new one, then removing the old,
>> no-longer-connected one would work. However, is there something that
>> could go wrong in this situation?
>
> The main thing that could go wrong with that is a disk failure. If
> you have the SATA ports available, I'd consider operating the machine
> with the case open and one of the drives bare and resting on something
> stable and insulating for the time it takes to do a "btrfs replace"
> operation.
This would be my first suggestion also; although, if you only have 4
SATA ports, you might want to invest in a SATA add-in card (if you go
this way, look for one with an ASMedia chipset; those are the best I've
seen as far as reliability for add-on controllers goes).
>
> If that's not an option, then a good-quality external USB case with
> a short cable directly attached to one of the USB ports on the
> motherboard would be a reasonable solution (with the proviso that some
> USB connections are just plain unstable and throw errors, which can
> cause problems with the filesystem code, typically requiring a reboot,
> and a restart of the process).
If you decide to go with this option and are using an Intel system,
avoid the USB 3.0 ports, as a number of Intel's chipsets have known
bugs in their USB 3.0 hardware that are likely to cause serious issues.
If your system has an eSATA port, however, try to use that instead of
USB; it will almost certainly be faster and more reliable.
>
> You might also consider using either NBD or iSCSI to present one of
> the disks (I'd probably use the outgoing one) over the network from
> another machine with more slots in it, but that's going to end up with
> horrible performance during the migration.
The other possibility WRT this is ATAoE, which generally gets better
performance than NBD or iSCSI but has the caveat that both systems have
to be on the same network link (i.e., no gateways between them). If you
do decide to use ATAoE, look into a program called 'vblade' (most
distros have it in a package of the same name).
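If the ATAoE route sounds attractive, the setup with vblade is roughly
as follows. Shelf/slot numbers, the network interface, and the device
name are all assumptions for illustration:

```shell
# On the machine with the spare slot: export the disk over AoE
# (shelf 0, slot 1, via eth0).  vbladed is the daemonizing wrapper
# shipped with vblade; run as root.
vbladed 0 1 eth0 /dev/sdd

# On the machine with the btrfs array: load the AoE initiator; the
# exported disk then appears as /dev/etherd/e0.1.
modprobe aoe
aoe-discover        # from the aoetools package
ls /dev/etherd/
```

From there, /dev/etherd/e0.1 can be used as the source or target of a
'btrfs replace' like any local block device, just more slowly.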