From: "Austin S. Hemmelgarn" <ahferroin7@gmail.com>
To: Martin Steigerwald <martin@lichtvoll.de>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: Experiences on BTRFS Dual SSD RAID 1 with outage of one SSD
Date: Fri, 17 Aug 2018 08:55:17 -0400
Message-ID: <06e2c72d-8c46-e8a1-b371-6477e2d01d71@gmail.com>
In-Reply-To: <1598008.Z1BGxZKWFL@merkaba>
On 2018-08-17 08:28, Martin Steigerwald wrote:
> Thanks for your detailed answer.
>
> Austin S. Hemmelgarn - 17.08.18, 13:58:
>> On 2018-08-17 05:08, Martin Steigerwald wrote:
> […]
>>> I have seen a discussion about the limitation in point 2. That
>>> allowing to add a device and make it into RAID 1 again might be
>>> dangerous, cause of the system chunk and probably other reasons. I did
>>> not completely read and understand it, though.
>>>
>>> So I still don´t get it, cause:
>>>
>>> Either it is a RAID 1, then, one disk may fail and I still have
>>> *all*
>>> data. Also for the system chunk, which according to btrfs fi df /
>>> btrfs fi sh was indeed RAID 1. If so, then period. Then I don´t see
>>> why it would need to disallow me to make it into an RAID 1 again
>>> after one device has been lost.
>>>
>>> Or it is no RAID 1 and then what is the point to begin with? As I
>>> was
>>> able to copy off all data from the degraded mount, I´d say it was a
>>> RAID 1.
>>>
>>> (I know that BTRFS RAID 1 is not a regular RAID 1 anyway, but just
>>> does two copies regardless of how many drives you use.)
>>
>> So, what's happening here is a bit complicated. The issue is entirely
>> with older kernels that are missing a couple of specific patches, but
>> it appears that not all distributions have their kernels updated to
>> include those patches yet.
>>
>> In short, when you have a volume consisting of _exactly_ two devices
>> using raid1 profiles that is missing one device, and you mount it
>> writable and degraded on such a kernel, newly created chunks will be
>> single-profile chunks instead of raid1 chunks with one half missing.
>> Any write has the potential to trigger allocation of a new chunk, and
>> more importantly any _read_ has the potential to trigger allocation of
>> a new chunk if you don't use the `noatime` mount option (because a
>> read will trigger an atime update, which results in a write).
>>
>> When older kernels then go and try to mount that volume a second time,
>> they see that there are single-profile chunks (which can't tolerate
>> _any_ device failures), and refuse to mount at all (because they
>> can't guarantee that metadata is intact). Newer kernels fix this
>> part by checking per-chunk if a chunk is degraded/complete/missing,
>> which avoids this because all the single chunks are on the remaining
>> device.
>
> How new does the kernel need to be for that to happen?
>
> Do I get this right that it would be the kernel used for recovery, i.e.
> the one on the live distro that needs to be new enough? The one on this
> laptop meanwhile is already 4.18.1.
Yes, the kernel used for recovery is the important one here. I don't
remember for certain when the patches went in, but I'm pretty sure it was
no earlier than 4.14. FWIW, I'm pretty sure SystemRescueCD has a
new enough kernel, but they still (sadly) lack zstd support.
>
> I used the latest GRML stable release 2017.05, which has a 4.9 kernel.
While I don't know exactly when the patches went in, I'm fairly certain
that 4.9 never got them.
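If you do end up in that situation again with a new enough kernel, the
cleanup is roughly this (just a sketch; the mount point is a placeholder,
and the balance should only be run after the missing device has been
replaced or removed from the volume):

  # See which chunk profiles exist; 'single' entries were created while
  # the volume was mounted degraded and writable.
  btrfs filesystem df /mnt

  # Convert everything back to raid1; the 'soft' filter skips chunks
  # that already have the target profile.
  btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt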
>
>> As far as avoiding this in the future:
>
> I hope that with the new Samsung Pro 860 together with the existing
> Crucial m500 I am spared from this for years to come. According to its
> SMART lifetime-used status, that Crucial SSD still has quite some time
> to go.
Yes, hopefully. And the SMART status on that Crucial is probably right;
in my experience they do a very good job of accurately measuring life
expectancy (that, or they're just _really_ good at predicting failures:
I've never had a Crucial SSD that failed to indicate correctly in its
SMART status that it was going to fail in the near future).
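If you want to keep an eye on it yourself, something like this works
(the attribute names vary by vendor, so treat the grep pattern as a
guess rather than anything authoritative):

  # Dump the SMART attribute table and pick out the wear/lifetime counters.
  smartctl -A /dev/sda | grep -iE 'wear|life|percent'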
>
>> * If you're just pulling data off the device, mark the device
>> read-only in the _block layer_, not the filesystem, before you mount
>> it. If you're using LVM, just mark the LV read-only using LVM
>> commands. This will make 100% certain that nothing gets written to
>> the device, and thus makes sure that you won't accidentally cause
>> issues like this.
>
>> * If you're going to convert to a single device,
>> just do it and don't stop it part way through. In particular, make
>> sure that your system will not lose power.
>
>> * Otherwise, don't mount the volume unless you know you're going to
>> repair it.
>
> Thanks for those. Good to keep in mind.
The last one is actually good advice in general, not just for BTRFS. I
can't count how many stories I've heard of people who tried to run half
an array simply to avoid downtime, and ended up making things far worse
than they were as a result.
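For the first point, a minimal sketch (the device and LV names are just
placeholders):

  # Plain block device: flip it read-only before doing anything else.
  blockdev --setro /dev/sdX

  # Or, if the filesystem sits on LVM, mark the LV itself read-only.
  lvchange --permission r vg/lv

  # A read-only, degraded mount then can't allocate any new chunks.
  mount -o ro,degraded /dev/sdX /mnt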
>
>>> For this laptop it was not all that important but I wonder about
>>> BTRFS RAID 1 in enterprise environments, cause restoring from backup
>>> adds significantly more downtime.
>>>
>>> Anyway, creating a new filesystem may have been better here anyway,
> cause it replaced a BTRFS that had aged over several years with a new
>>> one. Due to the increased capacity and due to me thinking that
>>> Samsung 860 Pro compresses itself, I removed LZO compression. This
>>> would also give larger extents on files that are not fragmented or
>>> only slightly fragmented. I think that Intel SSD 320 did not
>>> compress, but Crucial m500 mSATA SSD does. That has been the
>>> secondary SSD that still had all the data after the outage of the
>>> Intel SSD 320.
>>
>> First off, keep in mind that the SSD firmware doing compression only
>> really helps with wear-leveling. Doing it in the filesystem will help
>> not only with that, but will also give you more space to work with.
>
> While also reducing the ability of the SSD to wear-level. The more data
> I fit on the SSD, the less it can wear-level. And the better I compress
> that data, the less it can wear-level.
No, the better you compress the data, the _less_ data you are physically
putting on the SSD, just like compressing a file makes it take up less
space. This actually makes it easier for the firmware to do
wear-leveling. Wear-leveling is entirely about picking where to put
data, and by reducing the total amount of data you are writing to the
SSD, you're making that decision easier for the firmware, and also
reducing the number of blocks of flash memory needed (which also helps
with SSD life expectancy because it translates to fewer erase cycles).
The compression they do internally operates on the same principle; the
only difference is that you have no control over how it's doing it and
no way to see exactly how efficient it is (but it's pretty well known it
needs to be fast, and fast compression usually does not get good
compression ratios).
>
>> Secondarily, keep in mind that most SSD's use compression algorithms
>> that are fast, but don't generally get particularly amazing
>> compression ratios (think LZ4 or Snappy for examples of this). In
>> comparison, BTRFS provides a couple of options that are slower, but
>> get far better ratios most of the time (zlib, and more recently zstd,
>> which is actually pretty fast).
>
> I considered switching to zstd. But it may not be compatible with the
> grml 2017.05 4.9 kernel, though of course I could test a grml snapshot
> with a newer kernel. I always like to be able to recover with some live
> distro :). And GRML is the one of my choice.
>
> However… I am not all that convinced that it would benefit me as long as
> I have enough space. That SSD replacement more than doubled capacity
> from about 680 GB to 1480 GB. I have a ton of free space in the
> filesystems – usage of /home is only 46% for example – and there are 96
> GiB completely unused in LVM on the Crucial SSD and even more than 183
> GiB completely unused on Samsung SSD. The system is doing weekly
> "fstrim" on all filesystems. I think that this is more than is needed
> for the longevity of the SSDs, but well actually I just don´t need the
> space, so…
>
> Of course, in case I manage to fill up all that space, I will consider
> using compression. Until then, I am not all that convinced that I´d
> benefit from it.
>
> Of course it may increase read speeds and, in case of nicely compressible
> data, also write speeds, but I am not sure whether it even matters. Also it
> uses up some CPU cycles on a dual core (+ hyperthreading) Sandybridge
> mobile i5. While I am not sure about it, I bet having larger possible
> extent sizes may also help a bit, and no compression may also help a bit
> with fragmentation.
It generally does actually. Less data physically on the device means
lower chances of fragmentation. In your case, it may not improve speed
much though (your i5 _probably_ can't compress data much faster than it
can access your SSDs, which means you likely won't see much performance
benefit other than reducing fragmentation).
>
> Well putting this to a (non-scientific) test:
>
> […]/.local/share/akonadi/db_data/akonadi> du -sh * | sort -rh | head -5
> 3,1G parttable.ibd
>
> […]/.local/share/akonadi/db_data/akonadi> filefrag parttable.ibd
> parttable.ibd: 11583 extents found
>
> Hmmm, already quite a lot of extents after just about one week with the
> new filesystem. On the old filesystem I had somewhere around 40000-50000
> extents on that file.
Filefrag doesn't properly handle compressed files on BTRFS. It treats
each 128KiB compression block as a separate extent, even though they may
be contiguous as part of one BTRFS extent. That one file by itself
should have reported as about 25396 extents on the old volume (assuming
it was entirely compressed), so your numbers seem to match up
realistically.
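If you want numbers that reflect actual BTRFS extents (and show the real
compression ratio), the compsize tool gives a much better picture than
filefrag, assuming your distribution packages it:

  # Reports btrfs extent counts plus on-disk vs. uncompressed sizes.
  compsize /path/to/parttable.ibd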
>
> Well actually what do I know: I don´t even have an idea whether not
> using compression would be beneficial. Maybe it does not even matter all
> that much.
>
> I bet testing it to the point that I could be sure about it for my
> workload would take a considerable amount of time.
>
One last quick thing about compression in general on BTRFS. Unless you
have a lot of files that are likely to be completely incompressible,
you're generally better off using `compress-force` instead of
`compress`. With regular `compress`, BTRFS will try to compress the
first few blocks of a file, and if that fails, it will mark the file as
incompressible and never automatically try to compress any of it
again. With `compress-force`, BTRFS will just unconditionally compress
everything.
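If you ever do decide to turn it on, it's just a mount option. A sketch
(the UUID and extra options are placeholders; zstd needs kernel 4.14 or
newer, and existing data only gets compressed once it's rewritten or
defragmented):

  # fstab entry:
  UUID=xxxx-xxxx  /home  btrfs  compress-force=zstd,noatime  0  0

  # Or try it live on an already-mounted filesystem:
  mount -o remount,compress-force=zstd /home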