From: Dmitry Katsubo <dmitry.katsubo@gmail.com>
To: linux-btrfs@vger.kernel.org
Subject: Re: Recover btrfs volume which can only be mounted in read-only mode
Date: Sun, 18 Oct 2015 11:44:08 +0200
Message-ID: <562369E8.60709@gmail.com>
In-Reply-To: <pan$e78b7$efe06fb0$f477bf4e$85f224c0@cox.net>
On 16/10/2015 10:18, Duncan wrote:
> Dmitry Katsubo posted on Thu, 15 Oct 2015 16:10:13 +0200 as excerpted:
>
>> On 15 October 2015 at 02:48, Duncan <1i5t5.duncan@cox.net> wrote:
>>
>>> [snipped]
>>
>> Thanks for this information. As far as I can see, btrfs-tools v4.1.2 is
>> now in the experimental Debian repo (but you anyway suggest at least
>> 4.2.2, which was released in master git just 10 days ago). Kernel image
>> 3.18 is still not there, perhaps because Debian jessie was frozen before
>> it was released (2014-12-07).
>
> For userspace, as long as it's supporting the features you need at
> runtime (where it generally simply has to know how to make the call to
> the kernel, to do the actual work), and you're not running into anything
> really hairy that you're trying to offline-recover, which is where the
> latest userspace code becomes critical...
>
> Running a userspace series behind, or even more (as long as it's not
> /too/ far), isn't all /that/ critical a problem.
>
> It generally becomes a problem in one of three ways: 1) You have a bad
> filesystem and want the best chance at fixing it, in which case you
> really want the latest code, including the absolute latest fixups for the
> most recently discovered possible problems. 2) You want/need a new
> feature that's simply not supported in your old userspace. 3) The
> userspace gets so old that the output from its diagnostics commands no
> longer easily compares with that of current tools, giving people on-list
> difficulties when trying to compare the output in your posts to the
> output they get.
>
> As a very general rule, at least try to keep the userspace version
> comparable to the kernel version you are running. Since the userspace
> version numbering syncs to kernelspace version numbering, and userspace
> of a particular version is normally released shortly after the similarly
> numbered kernel series is released, with a couple minor updates before
> the next kernel-series-synced release, keeping userspace to at least the
> kernel space version, means you're at least running the userspace release
> that was made with that kernel series release in mind.
>
> Then, as long as you don't get too far behind on kernel version, you
> should remain at least /somewhat/ current on userspace as well, since
> you'll be upgrading to near the same userspace (at least), when you
> upgrade the kernel.
>
> Using that loose guideline, since you're aiming for the 3.18 stable
> kernel, you should be running at least a 3.18 btrfs-progs as well.
>
> In that context, btrfs-progs 4.1.2 should be fine, as long as you're not
> trying to fix any problems that a newer version fixed. And, my
> recommendation of the latest 4.2.2 was in the "fixing problems" context,
> in which case, yes, getting your hands on 4.2.2, even if it means
> building from sources to do so, could be critical, depending of course on
> the problem you're trying to fix. But otherwise, 4.1.2, or even back to
> the last 3.18.whatever release since that's the kernel version you're
> targeting, should be fine.
>
> Just be sure that whenever you do upgrade to later, you avoid the known-
> bad-mkfs.btrfs in 4.2.0 and/or 4.2.1 -- be sure if you're doing the btrfs-
> progs-4.2 series, that you get 4.2.2 or later.
>
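If I end up building the progs from source, I assume the steps are
roughly the usual autotools dance (repository path, tag name and build
dependencies are from memory, so treat this only as a sketch):

  git clone git://git.kernel.org/pub/scm/linux/kernel/git/kdave/btrfs-progs.git
  cd btrfs-progs
  git checkout v4.2.2
  ./autogen.sh && ./configure && make
  sudo make install
  btrfs --version    # make sure it is not the broken 4.2.0 / 4.2.1
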
> As for finding a current 3.18 series kernel released for Debian, I'm not
> a Debian user so my knowledge of the ecosystem around it is limited,
> but I've been very much under the impression that there are various
> optional repos available that you can choose to include and update from
> as well, and I'm quite sure based on previous discussions with others
> that there's a well recognized and fairly commonly enabled repo that
> includes debian kernel updates thru current release, or close to it.
>
> Of course you could also simply run a mainstream Linus kernel and build
> it yourself, and it's not too horribly hard to do either, as there's all
> sorts of places with instructions for doing so out there, and back when I
> switched from MS to freedomware Linux in late 2001, I learned the skill,
> at least at a reasonably basic level of mostly taking a working config
> from my distro's kernel and using it as a basis for my mainstream kernel
> config as well, within about two months of switching.
>
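Reusing the distro config indeed does not look too scary; I assume the
sequence would be something like the following (paths are the usual
defaults, not verified on my system):

  cd linux-3.18.y                      # kernel source checkout or tarball
  cp /boot/config-$(uname -r) .config  # start from the running Debian config
  make olddefconfig                    # accept defaults for new options
  make -j"$(nproc)"
  sudo make modules_install install
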
> Tho of course just because you can doesn't mean you want to, and for
> many, finding their distro's experimental/current kernel repos and simply
> installing the packages from it, will be far simpler.
>
> But regardless of the method used, finding or building and keeping
> current with your own copy of at least the latest couple of LTS
> releases, shouldn't be /horribly/ difficult. While I've not used them as
> actual package resources in years, I do still know a couple rpm-based
> package resources from my time back on Mandrake (and do still check them
> in contexts like this for others, or to quickly see what files a package
> I don't have installed on gentoo might include, etc), and would point you
> at them if Debian was an rpm-based distro, but of course it's not, so
> they won't do any good. But I'd guess a google might. =:^)
Thanks, Duncan. The information you give is of the greatest value to
me. Finally I have decided not to tempt fate: I will copy the data off,
re-create the btrfs filesystem and copy the data back. That is anyway a
good exercise.
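For the record, my plan is roughly the following (device names and
mount points below are just placeholders for my setup):

  mount -o ro,degraded /dev/sdb /mnt/old
  rsync -aHAX /mnt/old/ /mnt/backup/
  umount /mnt/old
  mkfs.btrfs -f -L data -d raid1 -m raid1 /dev/sdb /dev/sdc /dev/sdd
  mount /dev/sdb /mnt/new
  rsync -aHAX /mnt/backup/ /mnt/new/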
>> If I may ask:
>>
>> Provided that btrfs allowed a volume to be mounted in read-only mode – does
>> it mean that all data blocks are present (i.e. it has ensured that all
>> files / directories can be read)?
>
> I'm not /absolutely/ sure I understand your question, here. But assuming
> it's what I believe it is... here's an answer in typical Duncan fashion,
> answering the question... and rather more! =:^)
>
> In this particular scenario, yes, everything should still be accessible,
> as at least one copy of every raid1 chunk should exist on a still
> detected and included device. This is because of the balance after the
> loss of the first device, making sure there were two copies of each chunk
> on remaining devices, before loss of the second device. But because
> btrfs device delete missing didn't work, you couldn't remove that first
> device, even tho you now had two copies of each chunk on existing
> devices. So when another device dropped, you had two missing devices,
> but because of the balance between, you still had at least one copy of
> all chunks.
>
> The reason it's not letting you mount read-write is that btrfs sees now
> two devices missing on a raid1, the one that you actually replaced but
> couldn't device delete, and the new missing one that it didn't detect
> this time. To btrfs' rather simple way of thinking about it, that means
> anything with one of the only two raid1 copies on each of the two missing
> devices is now entirely gone, and to avoid making changes that would
> complicate things and prevent return of at least one of those missing
> devices, it won't let you mount writable, even in degraded mode. It
> doesn't understand that there's actually still at least one copy of
> everything available, as it simply sees the two missing devices and gives
> up without actually checking.
>
> And in the situation where btrfs' fears were correct, where chunks
> existed with each of the two copies on one of the now missing devices,
> no, not everything /would/ be accessible, and btrfs forcing read-only
> mounting is its way of not letting you make the problem even worse,
> forcing you to copy the data you can actually get to off to somewhere
> else, while you can still get to it in read-only mode, at least. Also,
> of course, forcing the filesystem read-only when there's two devices
> missing, at least in theory preserves a state where a device might be
> able to return, allowing repair of the filesystem, while allowing
> writable could prevent a returning device allowing the healing of the
> filesystem.
>
> So in this particular scenario, yes, all your data should be there,
> intact. However, a forced read-only mount normally indicates a serious
> issue, and in other scenarios, it could well indicate that some of the
> data is now indeed *NOT* accessible.
>
> Which is where AJ's patch comes in. That teaches btrfs to actually check
> each chunk. Once it sees that there's actually at least one copy of each
> chunk available, it'll allow mounting degraded, writable, again, so you
> can fix the problem.
>
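So if I understand correctly, on a kernel carrying that patch my case
would have been fixable with roughly the following (device name and
mount point are placeholders):

  mount -o degraded /dev/sdb /mnt
  btrfs device delete missing /mnt
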
> (Tho the more direct scenario that the patch addresses is a bit
> different, loss of one device of a two-device raid1, in which case
> mounting degraded writable will force new chunks to be written in single
> mode, because there's not a second device to write to so writing raid1 is
> no longer possible. So far, so good. But then on an unmount and attempt
> to mount again, btrfs sees single mode chunks on a two-device btrfs, and
> knows that single mode normally won't allow a missing device, so forces
> read-only, thus blocking adding a new device and rebalancing all the
> single chunks back to raid1. But in actuality, the only single mode
> chunks there are the ones written when the second device wasn't
> available, so they HAD to be written to the available device, and it's
> not POSSIBLE for any to be on the missing device. Again, the patch
> teaches btrfs to actually look at what's there and see that it can
> actually deal with it, thus allowing writable mounting, instead of
> jumping to conclusions and giving up, as soon as it sees a situation
> that /could/, in a different situation, mean entirely missing chunks with
> no available copies on remaining devices.)
>
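And for that two-device scenario, I guess the recovery on a patched
kernel would look roughly like this (device names are placeholders; as
far as I understand, the "soft" filter converts only the chunks that
are not raid1 yet):

  mount -o degraded /dev/sdb /mnt
  btrfs device add /dev/sdc /mnt
  btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt
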
> Again, these patches are in newer kernel versions, so there (assuming no
> further bugs) they "just work". On older kernels, however, you either
> have to cherry-pick the patches yourself, or manually avoid or work
> around the problem they fix. This is why we typically stress new
> versions so much -- they really /do/ fix active bugs and make problems
> /much/ easier to deal with. =:^)
Thanks for the explanation. You understood the question correctly:
basically I wondered whether btrfs checks that all data can be read
before allowing a read-only mount. In my case I was lucky, and I just
copied the data from the mounted volume to another place and then
copied it back.
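For anyone hitting the same situation: since btrfs does not pre-check
readability, I suppose the only reliable way to verify it is to read
everything back from the read-only mount and watch dmesg for csum
errors, e.g.:

  find /mnt -type f -exec cat {} + > /dev/null
  dmesg | grep -i csum

where /mnt is wherever the degraded volume is mounted.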
>> Do you have any ideas why "btrfs balance" has pulled all data to two
>> drives (and not balanced between three)?
>
> Hugo did much better answering that, than I would have initially done, as
> most of my btrfs are raid1 here, but they're all exactly two-device, with
> the two devices exactly the same size, so I'm not used to thinking in
> terms of different sizes and didn't actually notice the situation, thus
> leaving me clueless, until Hugo pointed it out.
>
> But he's right. Here's my much more detailed way of saying the same
> thing, now that he reminded me of why that would be the deciding factor
> here.
>
> Given that (1) your devices are different sizes, that (2) btrfs raid1
> means exactly two copies, not one per device, and that (3), the btrfs
> chunk-allocator allocates chunks from the device with the most free space
> left, subject to the restriction that both copies of a raid1 chunk can't
> be allocated to the same device...
>
> A rebalance of raid1 chunks would indeed start filling the two biggest
> devices first, until the space available on the smallest of the two
> biggest devices (thus the second largest) was equal to the space
> available on the third largest device, at which point it would continue
> allocating from the largest for one copy (until it too reached equivalent
> space available), while alternating between the others for the second
> copy.
>
> Given that the amount of data you had fit a copy each on the two largest
> devices, before the space available on either one dwindled to that
> available on the third largest device, only the two largest devices
> actually had chunk allocations, leaving the third device, still with less
> space total than the other two each had remaining available, entirely
> empty.
I think the mentioned strategy (fill the device with the most free
space first) is not the most effective. If the data were spread
equally, the read performance would be higher (reading from 3 disks
instead of 2). In my case this is even crucial, because the smallest
drive is an SSD (and it is not loaded at all).
Maybe I just don't see the benefit of the strategy which is currently
implemented (besides that it is robust and well-tested)?
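To illustrate with made-up numbers: with 1000 GiB + 1000 GiB + 250 GiB
devices and 1 GiB raid1 chunks, every chunk's two copies go to the two
devices with the most free space, so the two 1000 GiB devices fill up
in lockstep; the 250 GiB device only starts receiving copies once the
big ones are down to 250 GiB free, i.e. after roughly 750 GiB of data
has been written. With less data than that it stays completely empty,
which matches what I observed after the balance.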
>> Does btrfs have the following optimization for mirrored data: if a drive
>> is non-rotational, then prefer reads from it? Or does it simply schedule
>> the read to the drive that performs faster (irrespective of rotational
>> status)?
>
> Such optimizations have in general not yet been done to btrfs -- not even
> scheduling to the faster drive. In fact, the lack of such optimizations
> is arguably the biggest "objective" proof that btrfs devs themselves
> don't yet consider btrfs truly stable.
>
> As any good dev knows there's a real danger to "premature optimization",
> with that danger appearing in one or both of two forms: (a) We've now
> severely limited the alternative code paths we can take, because
> implementing things differently will force throwing away all that
> optimization work we did as it won't work with what would otherwise be
> the better alternative, and (b) We're now throwing away all that
> optimization work we did, making it a waste, because the previous
> implementation didn't work, and the new one does, but doesn't work with
> the current optimization code, so that work must now be redone as well.
>
> Thus, good devs tend to leave moderate to complex optimization code until
> they know the implementation is stable and won't be changing out from
> under the optimization. To do differently is "premature optimization",
> and devs tend to be well aware of the problem, often because of the
> number of times they did it themselves earlier in their career.
>
> It follows that looking at whether devs (assuming you consider them good
> enough to be aware of the dangers of premature optimization, which if
> they're doing the code that runs your filesystem, you better HOPE they're
> at least that good, or you and your data are in serious trouble!) have
> actually /done/ that sort of optimization, ends up being a pretty good
> indicator of whether they consider the code actually stable enough to
> avoid the dangers of premature optimization, or not.
>
> In this case, definitely not, since these sorts of optimizations in
> general remain to be done.
>
> Meanwhile, the present btrfs raid1 read-scheduler is both pretty simple
> to code up and pretty simple to arrange tests for that run either one
> side or the other, but not both, or that are well balanced to both.
> However, it's pretty poor in terms of ensuring optimized real-world
> deployment read-scheduling.
>
> What it does is simply this. Remember, btrfs raid1 is specifically two
> copies. It chooses which copy of the two will be read very simply, based
> on the PID making the request. Odd PIDs get assigned one copy, even PIDs
> the other. As I said, simple to code, great for ensuring testing of one
> copy or the other or both, but not really optimized at all for real-world
> usage.
>
> If your workload happens to be a bunch of all odd or all even PIDs, well,
> enjoy your testing-grade read-scheduler, bottlenecking everything reading
> one copy, while the other sits entirely idle.
>
> (Of course on fast SSDs with their zero seek-time, which is what I'm
> using for my own btrfs, that's not the issue it'd be on spinning rust.
> I'm still using my former reiserfs standard for spinning rust, which I
> use for backup and media files. But normal operations are on btrfs on
> ssd, and despite btrfs lack of optimization, on ssd, it's fast /enough/
> for my usage, and I particularly like the data integrity features of
> btrfs raid1 mode, so...)
I think the PID-based solution is not the best one. Why not simply
pick a random device? Then at least all drives in the volume are
equally loaded (on average).
From what you said I believe that certain servers will not benefit
from btrfs, e.g. a dedicated server that runs only one "fat" Java
process, or one "huge" MySQL database.
In general I think that btrfs should not check the rotational flag, as
even SATA-III is two times faster than SATA-II. So an ideal scheduler
should assign read requests to the drive that simply copes with reads
faster :) If an SSD can read 10 blocks while a normal HDD reads only
one in the same time, let it do it.
Maybe my case is a corner one, as I am mixing "fast" and "slow" drives
in one volume; moreover, the faster drive is the smallest. If I had
drives of the same performance, the strategy I suggest would not
matter.
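By the way, a simple way to watch how reads are actually spread over
the devices is iostat from the sysstat package, run while some large
read is going on (device names are placeholders again):

  iostat -dxm 2 sdb sdc sdd

That makes it easy to confirm that the SSD stays essentially idle.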
>> No, it was my particular decision to use btrfs, for various reasons.
>> First of all, I am using raid1 on all data. Second, I benefit from
>> transparent compression. Third, I need CRC consistency: some of the
>> drives (like /dev/sdd in my case) seem to fail, and once I had a buggy
>> DIMM, so btrfs helps me not to lose data "silently". Anyway,
>> it is much better than md-raid.
>
> The fact that despite it being available, mdraid couldn't be configured
> to runtime-verify integrity using either parity or redundancy, nor
> checksums (which weren't available) was a very strong disappointment for
> me.
>
> To me, the fact that btrfs /does/ do runtime checksumming on write and
> data integrity checking on read, and in raid1/10 mode, will actually
> fallback to the second copy if the first one fails checksum verification,
> is one of its best features, and why I use btrfs raid1 (or on a couple
> single-device btrfs, mixed-bg mode dup). =:^)
>
> That's also why my personally most hotly anticipated feature is N-way-
> mirroring, with 3-way being my ideal balance, since that will give me a
> fallback to the fallback, if both the first read copy and the first
> fallback copy fail verification. Four-way would be too much, but I just
> don't quite rest as easy as I otherwise could, because I know that if
> both the primary-read copy and the fallback happen to be bad, same
> logical place at the same time, there's no third copy to fall back on!
> It seems as much of a shame not to have that on btrfs with its data
> integrity, as it did to have mdraid with N-way-mirroring but no runtime
> data integrity. But at least btrfs does have N-way-mirroring on the
> roadmap, actually for after raid56, which is now done, so N-way-mirroring
> should be coming up rather soon (even if on btrfs, "soon" is relative),
> while AFAIK, mdraid has no plans to implement runtime data integrity
> checking.
>
>> And dynamic assignment is not a problem since udev was introduced (so
>> one can add extra persistent symlinks):
>>
>> https://wiki.debian.org/Persistent_disk_names
>
> FWIW, I actually use labels as my own form of "human-readable" UUID,
> here. I came up with the scheme back when I was on reiserfs, with 15-
> character label limits, so that's what mine are. Using this scheme, I
> encode the purpose of the filesystem (root/home/media/whatever), the size
> and brand of the media, the sequence number of the media (since I often
> have more than one of the same brand and size), the machine the media is
> targeted at, the date I did the formatting, and the sequence-number of
> the partition (root-working, root-backup1, root-backup2, etc).
>
> hm0238gcnx+35l0
>
> home, on a 238 gig corsair neutron, #x (the filesystem is multidevice,
> across #0 and #1), targeted at + (the workstation), originally
> partitioned in (201)3, on May (5) 21 (l), working copy (0)
>
> I use GPT partitioning, which takes partition labels (aka names) as
> well. The two partitions hosting that filesystem are on identically
> partitioned corsair neutrons, 256 GB = 238 GiB. The gpt labels on those
> two partitions are identical to the above, except one will have a 0
> replacing the x, while the other has a 1, as they are my first and second
> media of that size and brand.
>
> hm0238gcn0+35l0
> hm0238gcn1+35l0
>
> The primary backup of home, on a different pair of partitions on the same
> physical devices, is labeled identically, except the partition number is
> one:
>
> hm0238gcnx+35l1
>
> ... and its partitions:
>
> hm0238gcn0+35l1
> hm0238gcn1+35l1
>
> The secondary backup is on a reiserfs, on spinning rust:
>
> hm0465gsg0+47f0
>
> In that case the partition label and filesystem label are the same, since
> the partition and its filesystem correspond 1:1. It's home on the 465
> GiB (aka 500 GB) seagate #0, targeted at the workstation, first formatted
> in (201)4, on July 15, first (0) copy there. (I could make it #3 instead
> of #0, indicating second backup, but didn't, as I know that 0465gsg0+ is
> the media and backups spinning rust device for the workstation.)
>
> Both my internal and USB attached devices have the same labeling scheme,
> media identified by size, brand, media sequence number and what it's
> targeting, partition/filesystem identified by purpose, original
> partition/format date, and partition sequence number.
>
> As I said, it's effectively human-readable GUID, my own scheme for my own
> devices.
>
> And I use LABEL= in fstab as well, running gdisk -l to get a listing of
> partitions with their gpt-labels when I need to associate actual sdN
> mapping to specific partitions (if I don't already have the mapping from
> mount or whatever).
>
> Which makes it nice when btrfs fi show outputs filesystem label as well.
> =:^)
>
> The actual GUID is simply machine-readable but not necessary for the
> human to deal with "noise", to me, as the label (of either the gpt
> partition or the filesystem it hosts) gives me *FAR* more and more useful
> information, while being entirely unique within my ID system.
>
>> If "btrfs device scan" is user-space, then I think doing some output is
>> better then outputting nothing :) (perhaps with "-v" flag). If it is
>> kernel-space, then I agree that logging to dmesg is not very evident
>> (from perspective that user should remember where to look),
>> but I think has a value.
>
> Well, btrfs is a userspace tool, but in this case, btrfs device scan's
> use is purely to make a particular kernel call, which triggers the btrfs
> module to do a device rescan to update its own records, *not* for human
> consumption. -v to force output could work if it had been designed that
> way, but getting that output is precisely what btrfs filesystem show is
> for, printing for both mounted and unmounted filesystems unless told
> otherwise.
>
> Put it this way. If neither your initr* nor some service started before
> whatever mounts local filesystems does a btrfs device scan, then
> attempting to mount a multi-device btrfs will fail, unless all its
> component devices have been fed in using device= options. Why? Because
> mount takes exactly one device to mount. With traditional filesystems,
> that's enough, since they only consist of a single device. And with
> single-device btrfs, it's enough as well. But with a multi-device btrfs,
> something has to supply the other devices to btrfs, along with the one
> that mount tells it about. It is possible to list all those component
> devices in device= options, but those take /dev/sd* style device nodes,
> and those may change from boot to boot, so that's not very reliable.
> Which is where btrfs device scan comes in. It tells the btrfs module to
> do a general scan and map out internally which devices belong to which
> filesystems, after which a mount supplying just one of them can work,
> since this internal map, the generation or refresh of which is triggered
> by btrfs device scan, supplies the others.
>
> IOW, btrfs device scan needs no output, because all the userspace command
> does is call a kernel function, which triggers the mapping internal to
> the btrfs kernel module, so it can then handle mounts with just one of
> the possibly many devices handed to it from mount.
>
> Outputting that mapping is an entirely different function, with the
> userspace side of that being btrfs filesystem show, which calls a kernel
> function that generates output back to the btrfs userspace app, which
> then further formats it for output back to the user.
I understand that. If btrfs could show the mapping for an *unmounted*
volume (e.g. "btrfs fi show /dev/sdb") that would be great. Also I
think that the btrfs kernel code could be smart enough to perform a
scan itself if a mount is attempted without a prior scan. Then one
would be able to mount (provided that all devices are present) without
any hassle.
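In the meantime, as I understand it, the workaround is either to make
sure something runs the scan before the mount, or to list all members
explicitly (device names are placeholders, and as you note the
/dev/sd* names may move between boots):

  btrfs device scan
  mount /dev/sdb /mnt/data

  # or, without a prior scan:
  mount -o device=/dev/sdb,device=/dev/sdc,device=/dev/sdd \
        /dev/sdb /mnt/data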
>> Thanks. I have carefully read changelog wiki page and found that:
>>
>> btrfs-progs 4.2.2:
>> scrub: report status 'running' until all devices are finished
>
> Thanks. As I said, I had seen the patch on the list, and /thought/ it
> was now in, but had lost track of specifically when it went in, or
> indeed, /whether/ it had gone in.
>
> Now I know it's in 4.2.2, without having to actually go look it up in the
> git log again, myself.
>
>> Idea concerning balance is listed on wiki page "Project ideas":
>>
>> balance: allow to run it in background (fork) and report status
>> periodically
>
> FWIW, it sort of does that today, except that the btrfs bal start doesn't
> actually return to the command prompt. But again, what it actually does
> is call a kernel function to initiate the balance, and then it's simply
> waiting. On my relatively small btrfs on partitioned ssd, the return is
> often within a minute or two anyway, but on multi-TB spinning rust...
>
> In any case, once the kernel function has triggered the balance, ctrl-C
> should I believe terminate the userspace side and get you back to the
> prompt, without terminating the balance as that continues on in kernel
> space.
>
> But it would still be useful to have balance start actually return
> quickly, instead of having to ctrl-C it.
Thanks for expressing your thoughts. I will keep an eye on how these
features develop.
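For now I suppose one can simply background the command from the shell
and poll the status, e.g.:

  btrfs balance start /mnt &
  btrfs balance status /mnt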
--
With best regards,
Dmitry