Subject: Re: USB upgrade fun
To: Chris Murphy, Kai Hendry
Cc: Btrfs BTRFS
From: "Austin S. Hemmelgarn"
Message-ID: <0a3e71f7-1a8e-9a4f-7470-4c5cc2fb11a0@gmail.com>
Date: Thu, 12 Oct 2017 13:19:39 -0400

On 2017-10-12 12:57, Chris Murphy wrote:
> On Sun, Oct 8, 2017 at 10:58 AM, Kai Hendry wrote:
>> Hi there,
>>
>> My /mnt/raid1 suddenly became full somewhat expectedly, so I bought 2
>> new USB 4TB hard drives (one WD, one Seagate) to upgrade to.
>>
>> After adding sde and sdd I started to see errors in dmesg [2].
>> [1] https://s.natalian.org/2017-10-07/raid1-newdisks.txt
>> [2] https://s.natalian.org/2017-10-07/btrfs-errors.txt
>
> I'm not sure what the call traces mean exactly, but they seem
> non-fatal. The entire dmesg might be useful to see if there are device
> or bus related errors.
>
> I have a similar model of NUC and I can tell you for sure it does not
> provide enough USB bus power for 2.5" laptop drives. They must be
> externally powered, or you need a really good USB hub with an even
> better power supply that can handle e.g. 4 drives at the same time to
> bus power them. I had lots of problems before I fixed this, but Btrfs
> managed to recover gracefully once I solved the power issue.

Same here on a pair of 3-year-old NUCs. Based on the traces and the
other information, I'd be willing to bet this is the root cause of the
issues.

>> I assumed it had perhaps to do with the USB bus on my NUC5CPYB being
>> maxed out, and to expedite the sync, I tried to remove one of the
>> older 2TB drives, sdc1. However the load went crazy and my system
>> went completely unstable. I shut down the machine and after an hour
>> I hard powered it down since it seemed to hang (it's headless).
>
> I've noticed recent kernels hanging under trivial scrub and balance
> with hard drives. It does complete, but they are really laggy and
> sometimes unresponsive to anything else unless the operation is
> cancelled. I haven't had time to do regression testing. My report
> about this is in the archives, including the versions I think it
> started with.
>
>> Sidenote: I've since learnt that removing a drive actually deletes
>> the contents of the drive? I don't want that. I was hoping to put
>> that drive into cold storage. How do I remove a drive without losing
>> data from a RAID1 configuration?
>
> I'm pretty sure, but not certain, of the following: device
> delete/remove replicates chunk by chunk, CoW style. The entire
> operation is not atomic. The chunk operations themselves are atomic.
> I expect that metadata is updated as each chunk is properly
> replicated, so I don't think what you want is possible.

This is correct. Deleting a device first marks that device as zero
size so nothing tries to allocate data there, and then runs a balance
operation to force chunks onto the other devices (I'm not sure whether
it only moves chunks that are on the device being removed, though).
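For reference, the rough command sequence for that kind of removal
looks something like this (the device and mount point below are just
the ones from Kai's report, so adjust as needed):

    # start the removal; this kicks off the balance-like chunk migration
    btrfs device delete /dev/sdc1 /mnt/raid1

    # in another shell, watch allocation drain off the outgoing device
    watch -n 60 btrfs filesystem usage /mnt/raid1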
The way removal works results in two particularly important
differences from most other RAID systems:

1. The device being removed is functionally wiped (it will appear to
   be empty), but not physically wiped (most of the data is still
   there, you just can't get to it through BTRFS).
2. The process as a whole is not atomic, but as a result of how it
   works, it is generally possible to restart it if it gets stopped
   part way through (and you won't usually lose much progress).

That said, even if it were technically possible to remove the drive
without messing things up, it would be of limited utility. You
couldn't later reconnect it and expect things to just work (you would
have generation mismatches, which would hopefully cause the old disk
to effectively be updated to match the new one, _IF_ the old disk even
registered properly as part of the filesystem), and it would be
non-trivial to get data off of it safely too (you would have to
connect it to a different system and hope that BTRFS doesn't choke on
half a filesystem).

> Again, pretty sure about this too, but not certain: device replace is
> an atomic operation, the whole thing succeeds or fails, and at the
> end merely the Btrfs signature is wiped from the deleted device(s).
> So you could restore that signature and the device would be valid
> again; HOWEVER it's going to have the same volume UUID as the new
> devices. Even though the device UUIDs are unique, and should prevent
> confusion, maybe confusion is possible.

Also correct. This is part of why it's preferred to use the replace
command instead of deleting and then adding a device to replace it
(the other reason being that replace is significantly more efficient,
especially if the filesystem isn't full).

> A better way, which currently doesn't exist, is to make the raid1 a
> seed device, and then add two new devices and remove the seed. That
> way you get the replication you want; the instant the sprout is
> mounted rw, it can be used in production (all changes go to the
> sprout), while the chunks from the seed are replicated. The reason
> this isn't viable right now is the tools aren't mature enough to
> handle multiple devices yet. Otherwise, with a single device seed and
> a single sprout, this works and would be the way to do what you want.

Indeed, although it's worth noting that even with a single seed and a
single sprout, things aren't as well tested as most of the rest of
BTRFS.

> A better way that does exist is to set up an overlay for the two
> original devices. Mount the overlay devices, add the new devices,
> delete the overlays. That way the overlay devices get the writes that
> cause those devices to be invalidated; the original devices aren't
> really touched. There's a way to do this with dmsetup, like how live
> boot media work, and there's another way I haven't ever used before
> that's described here:
>
> https://raid.wiki.kernel.org/index.php/Recovering_a_failed_software_RAID#Making_the_harddisks_read-only_using_an_overlay_file

Using block-level overlays with BTRFS is probably a bad idea for the
same reasons that block-level copies are a bad idea, even with the
dmsetup methods (also, most live boot media do this at the filesystem
level, not the block level; it's safer and more efficient that way).
Your safest bet is probably seed devices, though that of course is not
very well documented.
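For the single-seed, single-sprout case Chris mentions, the rough
workflow would look something like the following (device names and
mount point here are purely illustrative, and as noted above this path
is not heavily tested):

    # flag the old filesystem as a seed (it must be unmounted first)
    btrfstune -S 1 /dev/sdb1

    # the seed now mounts read-only; adding a new device creates a sprout
    mount /dev/sdb1 /mnt/raid1
    btrfs device add /dev/sdd /mnt/raid1

    # remounting rw sends all new writes to the sprout
    mount -o remount,rw /mnt/raid1

    # deleting the seed copies the remaining data onto the sprout and
    # detaches the old device
    btrfs device delete /dev/sdb1 /mnt/raid1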
>> After a reboot it failed, namely because "nofail" wasn't in my fstab
>> and systemd is pedantic by default. After managing to get it booting
>> into my system without /mnt/raid1 I faced these "open ctree failed"
>> issues. After running btrfs check on all the drives and getting
>> nowhere, I decided to unplug the new drives and I discovered that
>> when I take out the new 4TB WD drive, I could mount it with -o
>> degraded.
>>
>> dmesg errors with the WD include "My Passport" Wrong diagnostic page;
>> asked for 1 got 8 "Failed to get diagnostic page 0xffffffea" which
>> raised my suspicions. The model number btw is WDBYFT0040BYI-WESN.
>>
>> Anyway, I'm back up and running with 2x2TB (one of them didn't finish
>> removing, I don't know which) & 1x4TB.
>
> Be aware that you are likely in a very precarious position now.
> Anytime raid1 volumes are mounted rw,degraded, one or more of the
> devices will end up with new empty single chunks (there is a patch to
> prevent this, I'm not sure if it's in 4.13). The consequence of these
> new empty single chunks is that they will prevent any subsequent
> degraded rw mount. You get a one-time degraded,rw mount; any
> subsequent attempt will require ro,degraded to get it to mount. If
> you end up snared in this, there are patches in the archives to
> inhibit the kernel's protection and allow mounting of such volumes.
> Super annoying. You'll have to build a custom kernel.
>
> My opinion is you should update backups before you do anything else,
> just in case.
>
> Next, you have to figure out a way to get all the devices used in
> this volume healthy. Tricky, as you technically have a 4-device raid1
> with a missing device. I propose first to check whether you have
> single chunks with either 'btrfs fi us' or 'btrfs fi df', and if so,
> get rid of them with a filtered balance 'btrfs balance start
> -mconvert=raid1,soft -dconvert=raid1,soft'. Then in theory you should
> be able to do 'btrfs device delete missing' to end up with a valid
> three-device btrfs raid1, which you can use until you get your USB
> power supply issues sorted.

I absolutely concur with Chris here: get your backups updated, and
then worry about repairing the filesystem. Or, alternatively, get your
backups updated and then nuke the filesystem and rebuild it from
scratch (this may be more work, but it's guaranteed to work).
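In concrete terms, Chris's suggested sequence would look roughly like
this once the volume is mounted (mount point taken from Kai's setup;
check the 'btrfs filesystem usage' output before and after each step):

    # look for stray 'single' chunks left over from the degraded rw mount
    btrfs filesystem usage /mnt/raid1

    # convert any such chunks back to raid1; 'soft' skips chunks that
    # already have the target profile, so this is cheap if there is
    # nothing to fix
    btrfs balance start -mconvert=raid1,soft -dconvert=raid1,soft /mnt/raid1

    # then drop the missing device to get back to a clean three-device raid1
    btrfs device delete missing /mnt/raid1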