* Migrate to bcache: A few questions
@ 2013-12-29 21:11 Kai Krakow
2013-12-30 1:03 ` Chris Murphy
` (2 more replies)
0 siblings, 3 replies; 14+ messages in thread
From: Kai Krakow @ 2013-12-29 21:11 UTC (permalink / raw)
To: linux-btrfs
Hello list!
I'm planning to buy a small SSD (around 60GB) and use it for bcache in front
of my 3x 1TB HDD btrfs setup (mraid1+draid0) using write-back caching. Btrfs
is my root device, thus the system must be able to boot from bcache using
init ramdisk. My /boot is a separate filesystem outside of btrfs and will be
outside of bcache. I am using Gentoo as my system.
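For context, a minimal sketch of how such a stack is usually assembled with
bcache-tools (device names are placeholders, and this assumes devices that can
be reformatted - not the in-place migration asked about below):

  # SSD partition becomes the cache device; note the cache set UUID it prints
  make-bcache -C /dev/sdd1
  # each HDD becomes a backing device, later exposed as /dev/bcacheN
  make-bcache -B /dev/sda
  make-bcache -B /dev/sdb
  make-bcache -B /dev/sdc
  # attach each bcacheN to the cache set and enable write-back caching
  echo <cset-uuid> > /sys/block/bcache0/bcache/attach
  echo writeback > /sys/block/bcache0/bcache/cache_mode
  # (repeat for bcache1/bcache2,) then create the filesystem on top
  mkfs.btrfs -m raid1 -d raid0 /dev/bcache0 /dev/bcache1 /dev/bcache2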
I have a few questions:
* How stable is it? I've read about some csum errors lately...
* I want to migrate my current storage to bcache without replaying a backup.
Is it possible?
* Did others already use it? What is the perceived performance for desktop
workloads in comparison to not using bcache?
* How well does bcache handle power outages? Btrfs has handled them very
well for many months.
* How well does it play with dracut as initrd? Is it as simple as telling it
the new device nodes or is there something complicated to configure?
* How does bcache handle a failing SSD when it starts to wear out in a few
years?
* Is it worth waiting for hot-relocation support in btrfs to natively use
a SSD as cache?
* Would you recommend going with a bigger/smaller SSD? I'm planning to use
only 75% of it for bcache so wear-leveling can work better, maybe use
another part of it for hibernation (suspend to disk).
Regards,
Kai
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Migrate to bcache: A few questions
2013-12-29 21:11 Migrate to bcache: A few questions Kai Krakow
@ 2013-12-30 1:03 ` Chris Murphy
2013-12-30 1:22 ` Kai Krakow
2013-12-30 6:24 ` Duncan
2013-12-30 16:02 ` Austin S Hemmelgarn
2 siblings, 1 reply; 14+ messages in thread
From: Chris Murphy @ 2013-12-30 1:03 UTC (permalink / raw)
To: Btrfs BTRFS
On Dec 29, 2013, at 2:11 PM, Kai Krakow <hurikhan77+btrfs@gmail.com> wrote:
>
> * How stable is it? I've read about some csum errors lately…
Seems like bcache devs are still looking into the recent btrfs csum issues.
>
> * I want to migrate my current storage to bcache without replaying a backup.
> Is it possible?
>
> * Did others already use it? What is the perceived performance for desktop
> workloads in comparison to not using bcache?
>
> * How well does bcache handle power outages? Btrfs has handled them very
> well for many months.
>
> * How well does it play with dracut as initrd? Is it as simple as telling it
> the new device nodes or is there something complicated to configure?
>
> * How does bcache handle a failing SSD when it starts to wear out in a few
> years?
I think most of these questions are better suited for the bcache list. I think there are still many uncertainties about the behavior of SSDs during power failures when they aren't explicitly designed with power failure protection in mind. At best I'd hope for a rollback involving data loss, but hopefully not a corrupt file system. I'd rather lose the last minute of data supposedly written to the drive, than have to do a full restore from backup.
>
> * Is it worth waiting for hot-relocation support in btrfs to natively use
> a SSD as cache?
I haven't read anything about it. Don't see it listed in project ideas.
>
> * Would you recommend going with a bigger/smaller SSD? I'm planning to use
> only 75% of it for bcache so wear-leveling can work better, maybe use
> another part of it for hibernation (suspend to disk).
I think that depends greatly on workload. If you're writing or reading a lot of disparate files, or a lot of small file random writes (mail server), I'd go bigger. By default sequential IO isn't cached. So I think you can get a big boost in responsiveness with a relatively small bcache size.
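As a rough illustration, the sequential-bypass threshold and current cache
usage can be inspected and tuned through sysfs once a bcache device exists
(paths as documented for bcache; bcache0 and the 4M value are just examples):

  # I/O in streams larger than this bypasses the cache entirely
  cat /sys/block/bcache0/bcache/sequential_cutoff
  echo 4M > /sys/block/bcache0/bcache/sequential_cutoff
  # how much of the cache set is still available, and how much is dirty
  cat /sys/fs/bcache/<cset-uuid>/cache_available_percent
  cat /sys/block/bcache0/bcache/dirty_data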
Chris Murphy
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Migrate to bcache: A few questions
2013-12-30 1:03 ` Chris Murphy
@ 2013-12-30 1:22 ` Kai Krakow
2013-12-30 3:48 ` Chris Murphy
2013-12-30 9:01 ` Marc MERLIN
0 siblings, 2 replies; 14+ messages in thread
From: Kai Krakow @ 2013-12-30 1:22 UTC (permalink / raw)
To: linux-btrfs
Chris Murphy <lists@colorremedies.com> schrieb:
> I think most of these questions are better suited for the bcache list.
Ah yes, you are right. I will repost the non-btrfs related questions to the
bcache list. But actually I am most interested in using bcache together with
btrfs, so getting a general picture of its current state in this combination
would be nice - and so these questions may be partially appropriate here.
> I
> think there are still many uncertainties about the behavior of SSDs during
> power failures when they aren't explicitly designed with power failure
> protection in mind. At best I'd hope for a rollback involving data loss,
> but hopefully not a corrupt file system. I'd rather lose the last minute
> of data supposedly written to the drive, than have to do a full restore
> from backup.
These thoughts are actually quite interesting. So you are saying that data
may not be fully written to SSD although the kernel thinks so? This is
probably very dangerous. The bcache module could then not ensure coherence
between its backing devices and its own contents - data loss would occur
and probably destroy important file system structures.
I understand your words as "data may only be partially written". This, of
course, may happen to HDDs as well. But usually a file system works with
transactions, so the last incomplete transaction can simply be thrown away. I
hope bcache implements the same architecture. But what does that mean for the
stacked write-back architecture?
As I understand it, bcache may use write-through for sequential writes but
write-back for random writes. In this case, part of the data may have hit
the backing device while other data exists only in the bcache. If that last
transaction is not closed due to power loss, and is then thrown away, we have
part of the transaction already written to the backing device which the
filesystem does not know about after it comes back up.
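For what it's worth, the amount of dirty (not yet written back) data is
visible at runtime through sysfs, and the cache mode can be switched there as
well, which at least bounds the uncertainty (paths as documented for bcache;
bcache0 is a placeholder):

  # dirty data still waiting to be written back for this backing device
  cat /sys/block/bcache0/bcache/dirty_data
  # current caching mode (writethrough / writeback / writearound / none)
  cat /sys/block/bcache0/bcache/cache_mode
  # switching to writethrough lets the dirty data drain to the backing device;
  # "state" reports clean once nothing dirty remains
  echo writethrough > /sys/block/bcache0/bcache/cache_mode
  cat /sys/block/bcache0/bcache/state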
I'd appreciate some thoughts about it but this topic is probably also best
moved over to the bcache list.
Thanks,
Kai
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Migrate to bcache: A few questions
2013-12-30 1:22 ` Kai Krakow
@ 2013-12-30 3:48 ` Chris Murphy
2013-12-30 9:01 ` Marc MERLIN
1 sibling, 0 replies; 14+ messages in thread
From: Chris Murphy @ 2013-12-30 3:48 UTC (permalink / raw)
To: Btrfs BTRFS
On Dec 29, 2013, at 6:22 PM, Kai Krakow <hurikhan77+btrfs@gmail.com> wrote:
> So you are saying that data
> may not be fully written to SSD although the kernel thinks so?
Drives shouldn't lie when asked to flush to disk, but they do. An older article about this at LWN is a decent primer on the subject of write barriers.
http://lwn.net/Articles/283161/
> This is
> probably very dangerous. The bcache module could not ensure coherence
> between its backing devices and its own contents - and data loss will occur
> and probably destroy important file system structures.
I don't know the details; there's more on lkml.org and the bcache lists. My impression is that, short of bugs, it should be much safer than you describe. It's not like a linear/concat md or LVM device failure scenario. There's good info in the bcache.h file:
http://lxr.free-electrons.com/source/drivers/md/bcache/bcache.h
If anything, once the kinks are worked out, under heavy random write IO I'd expect bcache to improve the likelihood data isn't lost. Faster speed of SSD means we get a faster commit of the data to stable media. Also bcache assumes the cache is always dirty on startup, no matter whether the shutdown was clean or dirty, so the code is explicitly designed to resolve the state of the cache relative to the backing device. It's actually pretty fascinating work.
It may not be required, but I'd expect we'd want the write cache on the backing device disabled. It should still honor write barriers but it kinda seems unnecessary and riskier to have it enabled (which is the default with consumer drives).
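If one does want to follow that advice, the volatile write cache of a SATA
backing drive is usually toggled with hdparm (drive name is a placeholder):

  # query the current on-drive write-cache setting
  hdparm -W /dev/sda
  # disable the volatile write cache (this only touches write caching,
  # nothing else); repeat for each backing HDD
  hdparm -W 0 /dev/sda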
> As I understand it, bcache may use write-through for sequential writes but
> write-back for random writes. In this case, part of the data may have hit
> the backing device while other data exists only in the bcache. If that last
> transaction is not closed due to power loss, and is then thrown away, we have
> part of the transaction already written to the backing device which the
> filesystem does not know about after it comes back up.
In the write through case we should be no worse off than the bare drive in a power loss. In the write back case the SSD should have committed more data than the HDD could have in the same situation. I don't understand the details of how partially successful writes to the backing media are handled when the system comes back up. Since bcache is also COW, SSD blocks aren't reused until data is committed to the backing device.
Chris Murphy
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Migrate to bcache: A few questions
2013-12-29 21:11 Migrate to bcache: A few questions Kai Krakow
2013-12-30 1:03 ` Chris Murphy
@ 2013-12-30 6:24 ` Duncan
2013-12-31 3:13 ` Kai Krakow
2013-12-30 16:02 ` Austin S Hemmelgarn
2 siblings, 1 reply; 14+ messages in thread
From: Duncan @ 2013-12-30 6:24 UTC (permalink / raw)
To: linux-btrfs
Kai Krakow posted on Sun, 29 Dec 2013 22:11:16 +0100 as excerpted:
> Hello list!
>
> I'm planning to buy a small SSD (around 60GB) and use it for bcache in
> front of my 3x 1TB HDD btrfs setup (mraid1+draid0) using write-back
> caching. Btrfs is my root device, thus the system must be able to boot
> from bcache using init ramdisk. My /boot is a separate filesystem
> outside of btrfs and will be outside of bcache. I am using Gentoo as my
> system.
Gentooer here too. =:^)
> I have a few questions:
>
> * How stable is it? I've read about some csum errors lately...
FWIW, both bcache and btrfs are new and still developing technology.
While I'm using btrfs here, I have tested usable (which for root means
either directly bootable or that you have tested booting to a
recovery image and restoring from there; I do the former, here) backups,
as STRONGLY recommended for btrfs in its current state, but haven't had
to use them.
And I considered bcache previously and might otherwise be using it, but
at least personally, I'm not willing to try BOTH of them at once, since
neither one is mature yet and if there are problems as there very well
might be, I'd have the additional issue of figuring out which one was the
problem, and I'm personally not prepared to deal with that.
Instead, at this point I'd recommend choosing /either/ bcache /or/ btrfs,
and using bcache with a more mature filesystem like ext4 or (what I used
for years previous and still use for spinning rust) reiserfs.
And as I said, keep your backups as current as you're willing to deal
with losing what's not backed up, and tested usable and (for root) either
bootable or restorable from alternate boot, because while at least btrfs
is /reasonably/ stable for /ordinary/ daily use, there remain corner-
cases and you never know when your case is going to BE a corner-case!
> * I want to migrate my current storage to bcache without replaying a
> backup. Is it possible?
Since I've not actually used bcache, I won't try to answer some of these,
but will answer based on what I've seen on the list where I can... I
don't know on this one.
> * Did others already use it? What is the perceived performance for
> desktop workloads in comparison to not using bcache?
Others are indeed already using it. I've seen some btrfs/bcache problems
reported on this list, but as mentioned above, when both are in use that
means figuring out which is the problem, and at least from the btrfs side
I've not seen a lot of resolution in that regard. From here it /looks/
like that's simply being punted at this time, as there's still more
easily traceable problems without the additional bcache variable to work
on first. But it's quite possible the bcache list is actively tackling
btrfs/bcache combination problems, as I'm not subscribed there.
So I can't answer the desktop performance comparison question directly,
but given that I /am/ running btrfs on SSD, I /can/ say I'm quite happy
with that. =:^)
Keep in mind...
We're talking storage cache here. Given the cost of memory and common
system configurations these days, 4-16 gig of memory on a desktop isn't
unusual or cost prohibitive, and a common desktop working set should well
fit.
I suspect my desktop setup, 16 gigs memory backing a 6-core AMD fx6100
(bulldozer-1) @ 3.6 GHz, is probably a bit toward the high side even for
a gentooer, but not inordinately so. Based on my usage...
Typical app memory usage runs 1-2 GiB (that's with KDE 4.12.49.9999 from
the gentoo/kde overlay, but USE=-semantic-desktop, etc). Buffer memory
runs a few MiB but isn't normally significant, so it can fold into that
same 1-2 GiB too.
That leaves a full 14 GiB for cache. But at least with /my/ usage,
normal non-update cache memory usage tends to be below ~6 GiB too, so
total apps/buffer/cache memory usage tends to be below 8 GiB as well.
When I'm doing multi-job builds or working with big media files, I'll
sometimes go above 8 gig usage, and that occasional cache-spill was why I
upgraded to 16 gig. But in practice, 10 gig would take care of that most
of the time, and were it not for the "accident" of powers-of-two meaning
16 gig is the notch above 8 gig, 10 or 12 gig would be plenty. Truth be
told, I so seldom use that last 4 gig that it's almost embarrassing.
* Tho if I ran multi-GiB VMs that'd use up that extra memory real fast!
But while that /is/ becoming more common, I'm not exactly sure I'd
classify 4 gigs plus of VM usage as "desktop" usage just yet.
Workstation, yes, and definitely server, but not really desktop.
All that as background to this...
* Cache works only after first access. If you only access something
occasionally, it may not be worth caching at all.
* Similarly, if access isn't time critical - think of playing a huge video
file, where only a few meg in memory at once is plenty and where storage
access is several times faster than play speed - cache isn't particularly
useful.
* Bcache is designed not to cache sequential access (that large video
file) in any case, since spinning rust tends to be more than fast enough
for that sort of thing already.
Given the stated 3 x 1TB drive btrfs in raid1 metadata, raid0 data, config
you mention, I'm wondering if big media is a/the big use case for you, in
which case bcache isn't going to be a good solution anyway, since that
tends to be sequential access, which bcache deliberately ignores as it
doesn't fit the model it's targeting.
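Whether a given workload is actually being bypassed as sequential can be
checked from bcache's per-device statistics once such a setup exists (sysfs
names as documented for bcache; bcache0 is a placeholder):

  # cumulative hit/miss/bypass counters since the cache was created
  cat /sys/block/bcache0/bcache/stats_total/cache_hits
  cat /sys/block/bcache0/bcache/stats_total/cache_misses
  cat /sys/block/bcache0/bcache/stats_total/bypassed
  # rolling hit ratio over the last five minutes
  cat /sys/block/bcache0/bcache/stats_five_minute/cache_hit_ratio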
(I am a bit worried about that raid0 data, tho. Unless you consider that
data of trivial value that's not a good choice, since raid0 generally
means you lose it all if you lose a physical device. And you're running
three devices, which means you just tripled the chance of a device
failure over that of just putting it all on a single 3 TB drive! And
backups... a 3 TB restore on spinning rust will take some time any way
you look at it, so backups may or may not be particularly viable here.
The most common use case for that much data is probably a DVR scenario,
which is video, and you may well consider it of low enough value that if
you lose it, you lose it, and you're willing to take that risk, but for
normally sequential access video/media, bcache isn't a good match anyway.)
* With memory cost what it is, for repeat access where initial access
time isn't /too/ critical, investing in more memory, to a point (for me,
8-12 gig as explained above), and simply letting the kernel manage cache
and memory as it normally does, may make more sense than bcache to an ssd.
* Of course, what bcache *DOES* effectively do, is extend the per-boot
cache time of memory, making the cache persistent. That effectively
extends the time over which "occasional access" still justifies caching
at all.
* That makes bcache well suited to boot-time and initial-access-speed-
critical scenarios, where more memory for a larger in-memory cache won't
do any good, since it's first-access-since-boot, because for in-memory
cache that's a cold-cache scenario, while with bcache's persistent cache,
it's a hot-cache scenario.
But what I'm actually wondering is if your use case better matches a
split data model, where you put root and perhaps stuff like the portage
tree and/or /home on fast SSD, while keeping all that big and generally
sequential access media on slower but much cheaper big spinning rust.
That's effectively what I've done here, tho I'm looking at rather less
than a TB of slow-access media, etc. See below for the details. The
general idea is as I said to stick all the time-critical stuff on SSD
directly (not using something like bcache), while keeping the slower
spinning rust for the big less-time-critical and sequential-access stuff,
and for non-btrfs backups of the stuff on the btrfs-formatted SSD, since
btrfs /is/ after all still in development, and I /do/ intend to be
prepared if /my/ particular case ends up being one of the corner-cases
btrfs still worst-cases on.
> * How well does bcache handle power outages? Btrfs has handled them very
> well for many months.
Since I don't run bcache I can't really speak to this at all, /except/,
the btrfs/bcache combo trouble reports that have come to the list have I
think all been power outage or kernel-crash scenarios... as could be
predicted of course since that's a filesystem's worst-case scenario, at
least that it has to commonly deal with.
But I know I'd definitely not trust that case, ATM. Like I said, I'd not
trust the combination of the two, and this is exactly where/why. Under
normal operation, the two should work together well. But in a power-loss
situation with both technologies being still relatively new and under
development... not *MY* data!
> * How well does it play with dracut as initrd? Is it as simple as
> telling it the new device nodes or is there something complicated to
> configure?
I can't answer this at all for bcache, but I can say I've been relatively
happy with the dracut initramfs solution for dual-device btrfs raid1
root. =:^) (At least back when I first set it up several kernels ago,
the kernel's commandline parser apparently couldn't handle the multiple
equals of something like rootflags=device=/dev/sda5,device=/dev/sdb5. So
the only way to get a multi-device btrfs rootfs to work was to use an
initr* with userspace btrfs device scan before attempting to mount real-
root, and dracut has worked well for that.)
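As a sketch of that dracut setup (the module and option names are standard
dracut ones; the UUID is a placeholder):

  # /etc/dracut.conf.d/btrfs.conf - make sure the btrfs module (and with it a
  # userspace "btrfs device scan") ends up in the initramfs
  add_dracutmodules+=" btrfs "

  # regenerate the initramfs, then point the kernel at the rootfs by UUID:
  dracut --force
  #   root=UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx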
> * How does bcache handle a failing SSD when it starts to wear out in a
> few years?
Given the newness of the bcache technology, assuming your SSD doesn't
fail early and it is indeed a few years, I'd suggest that question is
premature. Bcache will by that time be much older and more mature than
it is now, and how it'd handle, or fail to handle, such an event /now/
likely hasn't a whole lot to do with how much (presumably) better it'll
handle it /then/.
> * Is it worth waiting for hot-relocation support in btrfs to natively
> use a SSD as cache?
I wouldn't wait for it. It's on the wishlist, but according to the wiki
(project ideas, see the dm_cache or bcache like cache, and the hybrid
storage points), nobody has claimed that project yet, which makes it
effectively status "bluesky", which in turn means "nice idea, we might
get to it... someday."
Given the btrfs project history of everything seeming to take rather
longer than the original (as it turned out, wildly optimistic) projections, in
the absence of a good filesystem dev personally getting that specific
itch to scratch, that means it's likely a good two years out, and may be
5-10. So no, I'd definitely *NOT* wait on it!
> * Would you recommend going with a bigger/smaller SSD? I'm planning to
> use only 75% of it for bcache so wear-leveling can work better, maybe
> use another part of it for hibernation (suspend to disk).
FWIW, for my split data, some on SSD, some on spinning rust, setup, I had
originally planned perhaps a 64 gig or so SSD, figuring I could put the
boot-time-critical rootfs and a few other initial-access-time-critical
things on it, with a reasonable amount of room to spare for wear-
leveling. Maybe 128 gig or so, with a bit more stuff on it.
But when I actually went looking for hardware (some months ago now, but
rather less than a year), I found the availability and price-point knee
at closer to 256 gig. 128 gig or so was at a similar price-point per-
gig, but tends to sell out pretty fast as it's about half the gigs and
thus about half the price. There were some smaller ones available, but
they tended to be either MUCH slower or MUCH higher priced, I'd guess
left over from a previous generation before prices came down, and they
simply hadn't been re-priced to match current price/capacity price-points.
But much below 128 GiB (there were some 120 GB at about the same per-gig,
which "units" says is just under 112 GiB) and the price per gig tends to
go up, while above 256 GB (not GiB) both the price per gig and full price
tend to go up.
In practice, 60 or 80 GB SSDs just didn't seem to be that much cheaper
than 120-ish gig, and 120-ish gig were a good deal, but were popular
enough that availability was a bit of an issue.
So I actually ended up with 256 GB, which works out to ~ 238 GiB. Yeah I
paid a bit more, but that both gave me a bit more flexibility in terms of
what I put on them, AND meant after I set them up I STILL had about 40%
unallocated, giving them *LOTS* of wear-leveling room.
Of course that means if you do actually do bcache, 60-ish gigs should be
good and I'd guess 128 gig would be overkill, as I guess 40-60 gigs is
probably about what my "hot" data is, the stuff bcache would likely catch.
And 60 gig will likely be /some/ cheaper tho not as much as you might
expect, but you'll lose flexibility too, and/or you might actually pay
more for the 60 gig than the 120 gig, or it'll be slower speed-rated.
That was what I found when I actually went out to buy, anyway.
As to layout (all GPT partitions, not legacy MBR):
On the SSD(s, I actually have two setup, mostly in btrfs dual-device data/
metadata raid1 partitions but with some single-device, mixed/dup):
- (boot area)
x 1007 KiB free space (so partitions are 1 MiB aligned)
1 3 MiB BIOS reserved partition
(grub2 puts its core image here, partitions are now 4 MiB aligned)
2 124 MiB EFI reserved partition (for EFI forward compatibility)
(partitions are now 128 MiB aligned)
3 256 MiB /boot (btrfs mixed-block mode, DUP data/metadata)
I have a separate boot partition on each of the SSDs, with grub2
installed to both SSDs separately, pointing at its own /boot, with
the SSD I boot selectable in BIOS. That gives me a working /boot
and a primary /boot backup. I run git kernels and normally
update the working /boot with a new kernel once or twice a week,
while only updating the backup /boot with the release kernel, so
every couple months.
4 640 MiB /var/log (btrfs mixed-mode, raid1 data/metadata)
That gives me plenty of log space as long as logrotate doesn't
break, while still keeping a reasonable cap on the log partition
in case I get a runaway log. As any good sysadmin should know,
some from experience (!!), keeping a separate log partition is a
good idea, since that limits the damage if something /does/ go
runaway logging.
(partitions beyond this are now 1 GiB aligned)
5 8 GiB rootfs (btrfs raid1 data/metadata)
My rootfs includes (almost) all "installable" data, everything
installed by packages except for /var/lib, which is a symlink to
/home/var/lib. The reason for that is that I keep rootfs mounted
read-only by default, only mounting it read-write for updates or
configuration changes, and /var/lib needs to be writable. /home
is mounted writable, thus the /var/lib symlink pointing into it.
I learned the hard way to keep everything installed (but for
/var/lib) on the same filesystem, along with the installed-
package database (/var/db/pkg on gentoo), when I had to deal with
a recovery situation with rootfs, /var, and /usr on separate
partitions, recovering each one from a backup made at a different
time! Now I make **VERY** sure everything stays in sync, so
the installed-package database matches what's actually installed.
(Obviously /var/lib is a limited exception in ordered to keep
rootfs read-only by default. If I have to recover from an out of
sync /home and thus /home/var/lib, I can query for what packages
own /var/lib and reinstall them.)
6 20 GiB /home (btrfs raid1 data/metadata)
20 GiB is plenty big enough for /home, since I keep my big media
files on a dedicated media partition on spinning rust.
7 24 GiB build and packages tree (btrfs raid1 data/metadata)
I mount this at /usr/src, since that seemed logical, but it
contains the traditional /usr/src/linux (a git kernel, here),
plus the gentoo tree and layman-based overlays, plus my binpkg
cache, plus the ccache. Additionally it contains the 32-bit
chroot binpkg cache and ccache, see below.
8 8 GiB 32-bit chroot build-image (btrfs raid1 data/metadata)
I have a 32-bit netbook that runs gentoo also. This is its
build image, more or less a copy of its rootfs, but on my main
machine where I build the packages for it. I keep this
rsynced to the netbook for its updates. That way the slower
netbook with its smaller hard drive doesn't have to build
packages or keep a copy of the gentoo tree or the 32-bit
binpkg cache at all.
9-12 Primary backups of partitions 5-8, rootfs, /home, packages, and
netbook build image. These partitions are the same size and
configuration as their working copies above, recreated
periodically to protect against fat-finger mishaps as well as
still-under-development btrfs corner-cases and ~arch plus
live-branch-kde, etc update mishaps.
(My SSDs, Corsair Neutron series, run a LAMD (Link A
Media Devices) controller. These don't have the compression
or dedup features of something like the sandforce controllers,
but the Neutrons at least (as opposed to the Neutron GTX) are
enterprise targeted, with the resulting predictable performance,
capacity and reliability bullet-point features. What you save to
the SSD is saved as-you-sent-it, regardless of compressibility or
whether it's a dup of something else already on the SSD. Thus,
at least with my SSDs, the redundant working and backup copies
are actually two copies on the SSD as well, not one compressed/
dedupped copy. That's a very nice confidence point when the
whole /point/ of sending two copies is to have a backup! So
for anyone reading this that decides to do something similar,
be sure your SSD firmware isn't doing de-duping in the background,
leaving you with only the one copy regardless of what you thought
you might have saved!)
That STILL leaves me 117.5 GiB of the 238.5 GiB entirely free and
unallocated for wear-leveling and/or future flexibility, and I've a
second copy (primary backup copy) of most of the data as well, which
could be omitted from SSD (kept on spinning rust) if necessary.
Before I actually got the drives and was still figuring on 128-ish gigs,
I was figuring 1 gig x-log, maybe 6 gig rootfs, 16 gig home, 20 gig pkg,
and another 6 gig netbookroot, so about 49 gig of data if I went with
60-80 gig SSDs, with the backups as well 97 gig, if I went with 120-ish
gig SSDs.
But as I said, once I actually was out there shopping for 'em I ended up
getting the 256 GB (238.5 GiB) SSDs as a near-best bargain in terms of
rated performance and reliability vs. size vs. price.
Still on spinning rust, meanwhile, all my filesystems remain the many-
years-stable reiserfs. I keep a working and backup media partition
there, as well as second backup partitions for everything on btrfs on the
ssds, just in case.
Additionally, I have an external USB-connected drive that's both
disconnected and off most of the time, to recover from in case something
takes out both the SSDs and internal spinning rust.
I figure if the external gets taken out too, say by fire if my house
burnt down or by theft if someone broke in and stole it, I'd have much
more important things to worry about for awhile, then what might have
happened to my data! And once I did get back on my feet and ready to
think about computing again, much of the data would be sufficiently
outdated as to be near worthless in any case. At that point I might as
well start from scratch but for the knowledge in my head, and whatever
offsite or the like backups I might have had probably wouldn't be worth
the trouble to recover anyway, so that's beyond cost/time/hassle
effective and I don't bother.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Migrate to bcache: A few questions
2013-12-30 1:22 ` Kai Krakow
2013-12-30 3:48 ` Chris Murphy
@ 2013-12-30 9:01 ` Marc MERLIN
2013-12-31 0:31 ` Kai Krakow
1 sibling, 1 reply; 14+ messages in thread
From: Marc MERLIN @ 2013-12-30 9:01 UTC (permalink / raw)
To: Kai Krakow; +Cc: linux-btrfs
On Mon, Dec 30, 2013 at 02:22:55AM +0100, Kai Krakow wrote:
> These thoughts are actually quite interesting. So you are saying that data
> may not be fully written to SSD although the kernel thinks so? This is
That, and worse.
Incidently, I have just posted on my G+ about this:
https://plus.google.com/106981743284611658289/posts/Us8yjK9SPs6
which is mostly links to
http://lkcl.net/reports/ssd_analysis.html
https://www.usenix.org/conference/fast13/understanding-robustness-ssds-under-power-fault
After you read those, you'll never think twice about SSDs and data loss
anymore :-/
(I kind of found that out myself over time too, but these have much more
data than I got myself empirically on a couple of SSDs)
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Migrate to bcache: A few questions
2013-12-29 21:11 Migrate to bcache: A few questions Kai Krakow
2013-12-30 1:03 ` Chris Murphy
2013-12-30 6:24 ` Duncan
@ 2013-12-30 16:02 ` Austin S Hemmelgarn
2014-01-01 10:06 ` Duncan
2014-01-01 20:12 ` Austin S Hemmelgarn
2 siblings, 2 replies; 14+ messages in thread
From: Austin S Hemmelgarn @ 2013-12-30 16:02 UTC (permalink / raw)
To: Kai Krakow, linux-btrfs
On 12/29/2013 04:11 PM, Kai Krakow wrote:
> Hello list!
>
> I'm planning to buy a small SSD (around 60GB) and use it for bcache in front
> of my 3x 1TB HDD btrfs setup (mraid1+draid0) using write-back caching. Btrfs
> is my root device, thus the system must be able to boot from bcache using
> init ramdisk. My /boot is a separate filesystem outside of btrfs and will be
> outside of bcache. I am using Gentoo as my system.
>
> I have a few questions:
>
> * How stable is it? I've read about some csum errors lately...
>
> * I want to migrate my current storage to bcache without replaying a backup.
> Is it possible?
>
> * Did others already use it? What is the perceived performance for desktop
> workloads in comparison to not using bcache?
>
> * How well does bcache handle power outages? Btrfs has handled them very
> well for many months.
>
> * How well does it play with dracut as initrd? Is it as simple as telling it
> the new device nodes or is there something complicated to configure?
>
> * How does bcache handle a failing SSD when it starts to wear out in a few
> years?
>
> * Is it worth waiting for hot-relocation support in btrfs to natively use
> a SSD as cache?
>
> * Would you recommend going with a bigger/smaller SSD? I'm planning to use
> only 75% of it for bcache so wear-leveling can work better, maybe use
> another part of it for hibernation (suspend to disk).
I've actually tried a similar configuration myself a couple of times
(also using Gentoo, in fact), and I can tell you from experience that
unless things have changed greatly since kernel 3.12.1, it really isn't
worth the headaches. Setting it up on an already installed system is a
serious pain because the backing device has to be reformatted with a
bcache super-block. In addition, every kernel that I have tried that
had bcache compiled in or loaded as a module had issues; I would see a
kernel OOPS on average once a day from the bcache code, usually followed
shortly by a panic from some other unrelated subsystem. I didn't get
any actual data corruption, but I wasn't using btrfs at the time for any
of my filesystems.
As an alternative to using bcache, you might try something similar to
the following:
64G SSD with /boot, /, and /usr
Other HDD with /var, /usr/portage, /usr/src, and /home
tmpfs or ramdisk for /tmp and /var/tmp
This is essentially what I use now, and I have found that it
significantly improves system performance.
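As a hedged sketch, the corresponding /etc/fstab might look roughly like this
(device names, filesystems, sizes and options are illustrative only):

  # SSD: /, /boot and /usr; HDD: the rest; tmpfs for the temp dirs
  /dev/sda2   /             ext4    noatime,discard     0 1
  /dev/sda1   /boot         ext2    noatime             0 2
  /dev/sda3   /usr          ext4    noatime,discard     0 2
  /dev/sdb1   /var          ext4    noatime             0 2
  /dev/sdb2   /usr/portage  ext4    noatime             0 2
  /dev/sdb3   /usr/src      ext4    noatime             0 2
  /dev/sdb4   /home         ext4    noatime             0 2
  tmpfs       /tmp          tmpfs   size=4G,mode=1777   0 0
  tmpfs       /var/tmp      tmpfs   size=4G,mode=1777   0 0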
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Migrate to bcache: A few questions
2013-12-30 9:01 ` Marc MERLIN
@ 2013-12-31 0:31 ` Kai Krakow
0 siblings, 0 replies; 14+ messages in thread
From: Kai Krakow @ 2013-12-31 0:31 UTC (permalink / raw)
To: linux-btrfs
Marc MERLIN <marc@merlins.org> schrieb:
> On Mon, Dec 30, 2013 at 02:22:55AM +0100, Kai Krakow wrote:
>> These thoughts are actually quite interesting. So you are saying that data
>> may not be fully written to SSD although the kernel thinks so? This is
>
> That, and worse.
>
> Incidently, I have just posted on my G+ about this:
> https://plus.google.com/106981743284611658289/posts/Us8yjK9SPs6
>
> which is mostly links to
> http://lkcl.net/reports/ssd_analysis.html
> https://www.usenix.org/conference/fast13/understanding-robustness-ssds-under-power-fault
>
> After you read those, you'll never think twice about SSDs and data loss
> anymore :-/
> (I kind of found that out myself over time too, but these have much more
> data than I got myself empirically on a couple of SSDs)
The bad thing is: even battery-backed RAID controllers won't help you
here. I'm starting to understand why I still don't trust this new technology
entirely.
Thanks,
Kai
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Migrate to bcache: A few questions
2013-12-30 6:24 ` Duncan
@ 2013-12-31 3:13 ` Kai Krakow
0 siblings, 0 replies; 14+ messages in thread
From: Kai Krakow @ 2013-12-31 3:13 UTC (permalink / raw)
To: linux-btrfs
Duncan <1i5t5.duncan@cox.net> schrieb:
[ spoiler: tldr ;-) ]
>> * How stable is it? I've read about some csum errors lately...
>
> FWIW, both bcache and btrfs are new and still developing technology.
> While I'm using btrfs here, I have tested usable (which for root means
> either directly bootable or that you have tested booting to a
> recovery image and restoring from there; I do the former, here) backups,
> as STRONGLY recommended for btrfs in its current state, but haven't had
> to use them.
>
> And I considered bcache previously and might otherwise be using it, but
> at least personally, I'm not willing to try BOTH of them at once, since
> neither one is mature yet and if there are problems as there very well
> might be, I'd have the additional issue of figuring out which one was the
> problem, and I'm personally not prepared to deal with that.
I mostly trust btrfs by now. Don't get me wrong: I still have my
nightly backup job syncing the complete system to an external drive -
nothing beats a good backup. But btrfs has reliably survived multiple
power losses, kernel panics/freezes, unreliable USB connections, ... It
looks very stable from that point of view. Yes, it may have bugs that could
introduce errors fatal to the filesystem structure. But generally, under
usual workloads it has proven stable for me. At least for desktop workloads.
> Instead, at this point I'd recommend choosing /either/ bcache /or/ btrfs,
> and using bcache with a more mature filesystem like ext4 or (what I used
> for years previous and still use for spinning rust) reiserfs.
I used reiserfs for several years a long time ago. But it absolutely does
not scale well for parallel/threaded workloads, which is a show stopper for
server workloads. Still, it always survived even the worst failure scenarios
(like the SCSI bus going offline for some RAID members), and the tools
distributed with it were able to recover all data even if the FS was damaged
beyond anything you would normally try to fix when it no longer mounts.
I was with Ext3 before, and more than once a simple power loss during high
server workload destroyed the filesystem beyond repair, with fsck only
making it worse.
Since reiserfs did not scale well and the ext* FS had annoyed me more than
once, we decided to go with XFS. While it tends to wipe some data after power
loss and leaves you with zero-filled files, it has proven extremely reliable
even under those situations mentioned above, like a dying SCSI bus. Not to the
extent reiserfs did, but still very satisfying. The big plus: it scales
extremely well with parallel workloads and can be optimized for the stripe
configuration of the underlying RAID layer. So I made it my default
filesystem for the desktop, too. With the above-mentioned annoying "feature" of
zeroing out recently touched files when the system crashed. But well, we've
all got proven backups, right? Yep, I also learned that lesson... *sigh*
But btrfs, when first announced and while I already was jealously looking at
ZFS, seemed to be the FS of my choice giving me flexible RAID setups,
snapshots... I'm quite happy with it although it feels slow sometimes. I
simply threw more RAM at it - now it is okay.
> And as I said, keep your backups as current as you're willing to deal
> with losing what's not backed up, and tested usable and (for root) either
> bootable or restorable from alternate boot, because while at least btrfs
> is /reasonably/ stable for /ordinary/ daily use, there remain corner-
> cases and you never know when your case is going to BE a corner-case!
I've got a small rescue system I can boot which has btrfs-tools and a recent
kernel, so I can flexibly repair, restore, or do whatever I want with my backup.
My backup itself is not bootable (although it probably could be, if I changed
some configuration files).
>> * I want to migrate my current storage to bcache without replaying a
>> backup. Is it possible?
>
> Since I've not actually used bcache, I won't try to answer some of these,
> but will answer based on what I've seen on the list where I can... I
> don't know on this one.
I remember someone created some Python scripts to make it possible - wrt
btrfs especially. Can't remember the link; maybe I'm able to dig it up. But
at least I read it as: bcache itself offers no direct in-place migration
path. I had hoped otherwise...
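If the scripts meant here are the "blocks" conversion tool
(https://github.com/g2p/blocks), the idea is roughly the following - shown
only as an unverified illustration of the approach, not a tested recipe:

  # shrink the filesystem slightly and slip a bcache superblock in front of
  # the existing data, so nothing has to be restored from backup
  blocks to-bcache /dev/sda1
  # the volume then shows up as a backing device and can be attached to a
  # cache set as usual
  echo <cset-uuid> > /sys/block/bcache0/bcache/attach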
>> * Did others already use it? What is the perceived performance for
>> desktop workloads in comparison to not using bcache?
>
> Others are indeed already using it. I've seen some btrfs/bcache problems
> reported on this list, but as mentioned above, when both are in use that
> means figuring out which is the problem, and at least from the btrfs side
> I've not seen a lot of resolution in that regard. From here it /looks/
> like that's simply being punted at this time, as there's still more
> easily traceable problems without the additional bcache variable to work
> on first. But it's quite possible the bcache list is actively tackling
> btrfs/bcache combination problems, as I'm not subscribed there.
>
> So I can't answer the desktop performance comparison question directly,
> but given that I /am/ running btrfs on SSD, I /can/ say I'm quite happy
> with that. =:^)
Well, I'm most interested in bcache+btrfs so I put my questions here in this
list - although I have to admit that most questions would've better been
properly placed in the bcache list.
Small sidenote: I'm subscribed through the gmane NNTP proxy to all these
lists, using a native NNTP reader. I can really recommend it. Subscribing to
the list for post access is also very easy. You may want to look into it.
;-)
> Keep in mind...
>
> We're talking storage cache here. Given the cost of memory and common
> system configurations these days, 4-16 gig of memory on a desktop isn't
> unusual or cost prohibitive, and a common desktop working set should well
> fit.
I have 16 gig of memory. I started with 8, but it was insanely cheap
when I upgraded my mainboard, only €30 for 8 gig - so I threw in another
pair of 4 gig modules. Never regretted it...
> I suspect my desktop setup, 16 gigs memory backing a 6-core AMD fx6100
> (bulldozer-1) @ 3.6 GHz, is probably a bit toward the high side even for
> a gentooer, but not inordinately so. Based on my usage...
Mine is 16 gigs, Core i5 Quad @ 3.3 (with this turbo boost thingy) - so,
well, no. I think both your setup and mine are decent but not extraordinary.
;-)
> Typical app memory usage runs 1-2 GiB (that's with KDE 4.12.49.9999 from
> the gentoo/kde overlay, but USE=-semantic-desktop, etc). Buffer memory
> runs a few MiB but isn't normally significant, so it can fold into that
> same 1-2 GiB too.
Similar observation here, though I'm using semantic-desktop: Memory usage
rarely goes above 3-4 GB in KDE during usual workloads, so the rest is
mostly dedicated to cache. Still, btrfs feels very sluggish from time to
time. Thus my idea of throwing an SSD with bcache into the equation. This
sluggishness came quite suddenly with one of the kernel updates, though I
don't remember which, probably between 3.7 and 3.8... I've mitigated it
mostly by ramping up the IO queue depth... a lot... from the default 128 to
8192. My amount of RAM allows it - so what... ;-)
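That queue depth is presumably the per-device nr_requests setting in sysfs
(device name is a placeholder):

  # default is 128 requests per block-device queue
  cat /sys/block/sda/queue/nr_requests
  # raise it; each queued request pins some memory, hence the RAM remark
  echo 8192 > /sys/block/sda/queue/nr_requests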
> When I'm doing multi-job builds or working with big media files, I'll
> sometimes go above 8 gig usage, and that occasional cache-spill was why I
> upgraded to 16 gig. But in practice, 10 gig would take care of that most
> of the time, and were it not for the "accident" of powers-of-two meaning
> 16 gig is the notch above 8 gig, 10 or 12 gig would be plenty. Truth be
> told, I so seldom use that last 4 gig that it's almost embarrassing.
Same observation here: 8 gigs are usually enough for almost any workload.
Just sometimes an extra amount of 2-3 gigs was needed. But well, it was
cheap. Why spend 20 bucks for 4 gigs if I could get 8 for 30 bucks. :-) And
while I never measured it and never looked at how today's systems organize
memory, I still believe in memory interleaving and thus always buy in pairs.
> * Tho if I ran multi-GiB VMs that'd use up that extra memory real fast!
> But while that /is/ becoming more common, I'm not exactly sure I'd
> classify 4 gigs plus of VM usage as "desktop" usage just yet.
> Workstation, yes, and definitely server, but not really desktop.
I want a snappy system without having to bother how to distribute it over
different storage techniques which each have their distinct limitations. So
bcache+btrfs is a solution to have the best of _all_ worlds. Like having
your cake and eating it, too. And while 16 gigs of RAM, using preload, and
btrfs distributed across 3 devices gave me a pretty snappy system, it has
suffered a lot due to the above-mentioned "kernel incident." It never
came back to its old snappiness. So I feel the urge to move forward.
[...]
> Given the stated 3 x 1TB drive btrfs in raid1 metadata, raid0 data, config
> you mention, I'm wondering if big media is a/the big use case for you, in
> which case bcache isn't going to be a good solution anyway, since that
> tends to be sequential access, which bcache deliberately ignores as it
> doesn't fit the model it's targeting.
Well, use cases are as follows:
* The system is also connected to my TV by HDMI
* used for HTPC functions (just playback) with XBMC
* used for Steam (occasionally playing games on the big screen)
* used for development (and I always keep loads of tabs open in the
browser then)
* this involves git, restarting dev servers, compiling
* VMs for these Windows-only things (but this is rare)
* the usual Gentoo compiling, you know it...
So, bcache could probably help those situations where I want snappiness. And
in the long term I'm planning to add another HDD and go with btrfs RAID10
instead.
> (I am a bit worried about that raid0 data, tho. Unless you consider that
> data of trivial value that's not a good choice, since raid0 generally
> means you lose it all if you lose a physical device. And you're running
> three devices, which means you just tripled the chance of a device
> failure over that of just putting it all on a single 3 TB drive! And
> backups... a 3 TB restore on spinning rust will take some time any way
> you look at it, so backups may or may not be particularly viable here.
I have a working backup with backlog. I got the 1TB drives incredibly cheap,
so it was the option of choice. And I feel big drives with high data density
are not as reliable as not so big and technically proven drives
(manufactured when technology had moved forward to bigger platters).
> The most common use case for that much data is probably a DVR scenario,
> which is video, and you may well consider it of low enough value that if
> you lose it, you lose it, and you're willing to take that risk, but for
> normally sequential access video/media, bcache isn't a good match anyway.)
I'm in the process of sorting out all my CDs and DVDs with archived data on
them. Such media is unreliable - more so than my current setup. I was with
LVM and XFS before and it was always a headache to swap storage easily.
With btrfs it is very easy, and RAID striping comes for free. With my
previous LVM setup I used a JBOD setup and I wasn't entirely happy with it.
And then, there's still the long-term goal of migrating to RAID-10.
> * With memory cost what it is, for repeat access where initial access
> time isn't /too/ critical, investing in more memory, to a point (for me,
> 8-12 gig as explained above), and simply letting the kernel manage cache
> and memory as it normally does, may make more sense than bcache to an ssd.
This is why I already have 16 gigs. But I feel bcache would improve cold
starts of applications and the system.
> * Of course, what bcache *DOES* effectively do, is extend the per-boot
> cache time of memory, making the cache persistent. That effectively
> extends the time over which "occasional access" still justifies caching
> at all.
That is the plan. ;-)
> * That makes bcache well suited to boot-time and initial-access-speed-
> critical scenarios, where more memory for a larger in-memory cache won't
> do any good, since it's first-access-since-boot, because for in-memory
> cache that's a cold-cache scenario, while with bcache's persistent cache,
> it's a hot-cache scenario.
Ditto.
> But what I'm actually wondering is if your use case better matches a
> split data model, where you put root and perhaps stuff like the portage
> tree and/or /home on fast SSD, while keeping all that big and generally
> sequential access media on slower but much cheaper big spinning rust.
I hate partitioning. I don't want to micro-optimize my partition setup when
a solution like bcache could provide similar improvements without the
downsides of such partitioning decisions. That's the point.
> That's effectively what I've done here, tho I'm looking at rather less
> than a TB of slow-access media, etc. See below for the details. The
> general idea is as I said to stick all the time-critical stuff on SSD
> directly (not using something like bcache), while keeping the slower
> spinning rust for the big less-time-critical and sequential-access stuff,
> and for non-btrfs backups of the stuff on the btrfs-formatted SSD, since
> btrfs /is/ after all still in development, and I /do/ intend to be
> prepared if /my/ particular case ends up being one of the corner-cases
> btrfs still worst-cases on.
I have no problem with the time a restore from backup takes. I'm not that
dependent on the system. In case of time-critical stuff I would just bind-
mount the home backup to a rescue system, or just sync the home directory
from backup to a (slow) spare system I've got somewhere that usually
just collects dust. That's a tested setup. In the worst case there's a Gentoo
VM in my office with an almost identical software setup which I could just
attach my disk to and mount my home on, and then even work remotely on it. At
least both these spare system setups would work as an emergency replacement
for important work. All the rest is not that important. If I lose the
entertainment part of my system: sigh, annoying, but well: not important.
All those mostly static files are in the backup and I'm not dependent on
them in a time-critical manner. The critical working set can be transplanted
into spare systems.
>> * How well does bcache handle power outages? Btrfs has handled them very
>> well for many months.
>
> Since I don't run bcache I can't really speak to this at all, /except/,
> the btrfs/bcache combo trouble reports that have come to the list have I
> think all been power outage or kernel-crash scenarios... as could be
> predicted of course since that's a filesystem's worst-case scenario, at
> least that it has to commonly deal with.
>
> But I know I'd definitely not trust that case, ATM. Like I said, I'd not
> trust the combination of the two, and this is exactly where/why. Under
> normal operation, the two should work together well. But in a power-loss
> situation with both technologies being still relatively new and under
> development... not *MY* data!
The question is: will it eat my data twice a day or twice a year? I could
live with the latter; I have no time for the former, though. But I'm
interested in helping the community by testing this. The problem isn't
actually bcache+btrfs destroying my system beyond repair and having to
restore from backup. My problem is with the silent data corruption it may
introduce. My backup strategy won't protect me from that although I have
several weeks of backlog. And putting just unimportant stuff I seldom work
with on it would not help the situation: first, I would not really test the
setup; second, I would not really take advantage of it. It would be
useless.
>> * How well does it play with dracut as initrd? Is it as simple as
>> telling it the new device nodes or is there something complicated to
>> configure?
>
> I can't answer this at all for bcache, but I can say I've been relatively
> happy with the dracut initramfs solution for dual-device btrfs raid1
> root. =:^) (At least back when I first set it up several kernels ago,
> the kernel's commandline parser apparently couldn't handle the multiple
> equals of something like rootflags=device=/dev/sda5,device=/dev/sdb5. So
> the only way to get a multi-device btrfs rootfs to work was to use an
> initr* with userspace btrfs device scan before attempting to mount real-
> root, and dracut has worked well for that.)
Worked for me by adding rootdelay=2 to the cmdline. And I had to add a
symlink into the dracut ramfs builder because the scripts expect the btrfs
bins somewhere other than where they install to. I now use root=UUID=xxxx and
it works like a charm.
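To make that persistent with GRUB 2, the extra parameter would typically go
into /etc/default/grub (grub-mkconfig generates the root=UUID=... part itself;
the command may be prefixed grub2- depending on the distribution):

  # /etc/default/grub - extra kernel parameters for the generated entries
  GRUB_CMDLINE_LINUX="rootdelay=2"

  # regenerate the config afterwards
  grub-mkconfig -o /boot/grub/grub.cfg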
>> * How does bcache handle a failing SSD when it starts to wear out in a
>> few years?
>
> Given the newness of the bcache technology, assuming your SSD doesn't
> fail early and it is indeed a few years, I'd suggest that question is
> premature. Bcache will by that time be much older and more mature than
> it is now, and how it'd handle, or fail to handle, such an event /now/
> likely hasn't a whole lot to do with how much (presumably) better it'll
> handle it /then/.
Well, good point. And I've read through some links (thanks to the other
posters here) which show that bcache already has some countermeasures for
this situation. So at least it is designed with such problems in mind. From
that view it looks good to me. My biggest problem is probably that I still
don't really trust SSDs, given my office background where SSDs fail in dumb
ways just due to some workloads applied in Windows systems. Then you update
BIOS and firmware, and tada: problems gone. BUT: this just implies that SSD
technology is far from being mature. And then, I believe some manufacturers
just did not figure out how to do wear-leveling really correctly. While HDDs
usually fail in a soft way (some sectors no longer working, time for
replacement), I usually read that SSDs die from one minute to the next,
unpredictably, failing the hard way: from working flawlessly to everything
lost. Or
they start introducing silent data corruption which is much worse (like
ack'ing writes but after reboot it looks like nothing was ever written).
>> * Is it worth waiting for hot-relocation support in btrfs to natively
>> use a SSD as cache?
>
> I wouldn't wait for it. It's on the wishlist, but according to the wiki
> (project ideas, see the dm_cache or bcache like cache, and the hybrid
> storage points), nobody has claimed that project yet, which makes it
> effectively status "bluesky", which in turn means "nice idea, we might
> get to it... someday."
One guy from this list was working on it - I remember it, though not his name.
And he had patches. I liked the idea. It could probably work better than
bcache due to not being filesystem agnostic.
> Given the btrfs project history of everything seeming to take rather
> longer than the original (as it turned out, wildly optimistic) projections, in
> the absence of a good filesystem dev personally getting that specific
> itch to scratch, that means it's likely a good two years out, and may be
> 5-10. So no, I'd definitely *NOT* wait on it!
The well-known mature filesystems (ext, XFS, ...) are all probably 20 years
old or more. Btrfs is maybe 5 years old now? It should start becoming
feature-complete now, and I think the devs are driven by similar emotions.
Then give it another 5 years to work out all bugs and performance problems.
At least from my dev background I know that the time needed to code a feature-
complete codebase is about the same as the time needed for testing and
optimizing the system. I suppose it will then follow an evolution similar
to the ext family of filesystems, adding new features while maintaining the
on-disk format as well as possible, or at least enabling easy forward
migration, so users have a choice between proven stability and new features.
At least this is what I hope and wish. ;-)
>> * Would you recommend going with a bigger/smaller SSD? I'm planning to
>> use only 75% of it for bcache so wear-leveling can work better, maybe
>> use another part of it for hibernation (suspend to disk).
>
> FWIW, for my split data, some on SSD, some on spinning rust, setup, I had
> originally planned perhaps a 64 gig or so SSD, figuring I could put the
> boot-time-critical rootfs and a few other initial-access-time-critical
> things on it, with a reasonable amount of room to spare for wear-
> leveling. Maybe 128 gig or so, with a bit more stuff on it.
The calculation behind this is that about 7% of the flash memory is already
reserved for wear-leveling. The chips are powers of two (e.g., 128 gigs)
while the drive announces more human-friendly sizes by cutting about 7% away
(here: 120 gigs). But actually, reserving only 7% for wear-leveling is,
according to multiple sources, not enough to provide good performance in many
workloads. The recommendation is to go with 30-50%. So, staying with the
numbers, going with 90 gigs vs fully provisioned 120 gigs (that's 75% of the
announced size), I effectively have a reserve of about 30% for wear-leveling
(90 : 128 ~= 0.7).
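Spelling that arithmetic out, treating both figures as GiB as in the
paragraph above (awk just does the percentage):

  awk 'BEGIN {
      raw  = 128   # GiB of flash actually on the chips
      used =  90   # GiB partitioned for bcache (75% of the marketed capacity)
      printf "effective spare area: %.0f%%\n", (1 - used/raw) * 100   # ~30%
  }'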
> There were some smaller ones available, but
> they tended to be either MUCH slower or MUCH higher priced, I'd guess
> left over from a previous generation before prices came down, and they
> simply hadn't been re-priced to match current price/capacity price-points.
The performance drop is probably explained by the fact that the drives do
striping internally across the flash chips, and smaller drives have fewer
chips to stripe across. You can see this performance drop even with modern
drives if you look at comparison tests. That's not an effect of old
technology only.
I think my system is mostly limited in performance by seeks not by
throughput. So bcache came into my mind as the solution. Even a cheap drive
would still be fast enough to deliver the throughput I usually measure in
the system monitor with my HDDs - but with the bonus of more or less zero
seek time. I don't think I have to optimize for throughput.
This is similar to how throwing more RAM into the system usually gives a
better performance boost than throwing more CPU at it: CPU would improve throughput
- but the best throughput does not help if seeking is the limiting factor
(i.e., having to re-read data from disk due to stressing the cache and RAM).
> But much below 128 GiB (there were some 120 GB at about the same per-gig,
> which "units" says is just under 112 GiB) the price per gig tends to go
> up, while above 256 GB (not GiB) both the price per gig and the full
> price tend to go up.
Yes, I too figured that the 128-gig range is currently the sweet spot. But a
difference of around 60 gigs just isn't important to me yet; if it were 120
vs. 240 gigs (or even more) it would become interesting. So at this size
difference I'd probably prefer the lower price over the higher capacity -
though I haven't finally made up my mind about that.
> Of course that means if you do actually do bcache, 60-ish gigs should be
> good and I'd guess 128 gig would be overkill, as I guess 40-60 gigs
> probably about what my "hot" data is, the stuff bcache would likely catch.
That's the point. In higher capacity ranges it becomes interesting for more
purposes than just bcache. But for the moment, and because I do not yet
really trust this technology for storing all sorts of data with different
access patterns, I just want to try it out and see what effect it has.
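For reference, trying it out from scratch would look roughly like the sketch
below. This is not the in-place migration asked about earlier in the thread -
make-bcache reformats the devices it is given - and all device names are
examples only:

  # Format the SSD partition as a cache set and the HDDs as backing devices.
  make-bcache -C /dev/sdd2
  make-bcache -B /dev/sda /dev/sdb /dev/sdc

  # Attach each resulting bcacheN device to the cache set, enable write-back.
  CSET_UUID=$(bcache-super-show /dev/sdd2 | awk '/cset.uuid/ {print $2}')
  for dev in /sys/block/bcache[0-2]/bcache; do
      echo "$CSET_UUID" > "$dev/attach"
      echo writeback > "$dev/cache_mode"
  done
  # The filesystem then goes on /dev/bcache0..2 instead of the raw disks.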
> And 60 gig will likely be /some/ cheaper tho not as much as you might
> expect, but you'll lose flexibility too, and/or you might actually pay
> more for the 60 gig than the 120 gig, or it'll be slower speed-rated.
> That was what I found when I actually went out to buy, anyway.
It's like 50% of the capacity for 75% of the price - not a very good deal.
But throughput is not my prime target, I guess (unless you'd like to convince
me otherwise regarding the points above), and the excess capacity is currently
useless for me. So I'd probably go with the worse deal.
> I have a separate boot partition on each of the SSDs, with grub2
> installed to both SSDs separately, each pointing at its own /boot,
> with the boot SSD selectable in the BIOS. That gives me a working
> /boot and a primary /boot backup. I run git kernels and normally
> update the working /boot with a new kernel once or twice a week,
> while only updating the backup /boot with the release kernel, so
> only every couple of months.
Similar here: all hard disks use the same partitioning layout and I can use
the spare space for /boot backups, EFI, or a small rescue system.
> 4 640 MiB /var/log (btrfs mixed-mode, raid1 data/metadata)
>
> That gives me plenty of log space as long as logrotate doesn't
> break, while still keeping a reasonable cap on the log partition
> in case I get a runaway log.
I'm using journald as the only logger and it does its housekeeping well.
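For completeness: journald's housekeeping is driven by a handful of limits in
/etc/systemd/journald.conf. The values below are only illustrative choices on
my part, not anything discussed in this thread:

  [Journal]
  # Cap the total size of the persistent journals.
  SystemMaxUse=512M
  # Always leave at least this much free space on /var.
  SystemKeepFree=1G
  # Drop entries older than this regardless of size.
  MaxRetentionSec=1month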
> As any good sysadmin should know,
> some from experience (!!), keeping a separate log partition is a
> good idea, since that limits the damage if something /does/ go
> runaway logging.
Then count me as one of those good system admins: that's why the servers I
administer are partitioned with these thoughts in mind. Since those servers
run in VMs, I have no problem with partitioning there (in contrast to my
"hate" above), because I can grow disk images and add new virtual drives
easily. My partitioning scheme in VMs is therefore very simple: one partition
per drive. So in the end there is no conflict with my hatred of partitioning,
because there is effectively no real partitioning. ;-)
In my opinion, partitioning is a remnant from ancient times that should go
away. Volume pooling as supported by ZFS or btrfs is the way to go. In VMs I
can more or less emulate it by putting thin-provisioned virtual disk images
in the datastore.
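As a concrete example of the thin-provisioning part (shown with qemu/KVM; the
path and size are made up, and other hypervisors have equivalent tools):

  # Create a sparse, thin-provisioned disk image; space is only allocated
  # in the datastore as the guest actually writes data.
  qemu-img create -f qcow2 /var/lib/libvirt/images/data.qcow2 500G
  qemu-img info /var/lib/libvirt/images/data.qcow2   # virtual vs. actual size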
> My rootfs includes (almost) all "installable" data, everything
> installed by packages except for /var/lib, which is a symlink to
> /home/var/lib. The reason for that is that I keep rootfs mounted
> read-only by default, only mounting it read-write for updates or
> configuration changes, and /var/lib needs to be writable. /home
> is mounted writable, thus the /var/lib symlink pointing into it.
I have also come to hate such hacks - I used to use them too, in the past.
Volume pooling is the way to go. I'm still waiting for btrfs to be able to
mount individual subvolumes read-only...
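One workaround I'm aware of until then is a read-only bind mount stacked on
top of the subvolume's mount point (a sketch; the path is an example):

  # Bind the mount point onto itself, then flip that bind mount to read-only.
  mount --bind /mnt/pool/rootfs /mnt/pool/rootfs
  mount -o remount,ro,bind /mnt/pool/rootfs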
> I learned the hard way to keep everything installed (but for
> /var/lib) on the same filesystem, along with the installed-
> package database (/var/db/pkg on gentoo), when I had to deal with
> a recovery situation with rootfs, /var, and /usr on separate
> partitions, recovering each one from a backup made at a different
> time! Now I make **VERY** sure everything stays in sync, so
> the installed-package database matches what's actually installed.
Actually, I guess you might have some background on why I hate partitions.
;-) But that's only one reason. The other is: you will always run out of
space on one of them and have no easy way to redistribute space as needed.
This is also why I pooled my HDDs together into one btrfs instead of putting
different-purpose partitions on them. The draid-0 came for free, and since
metadata is much more critical, it is mraid-1. So far, though, I've never had
the btrfs driver actually need to repair from a good metadata copy. ;-)
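Expressed as commands, that pooled layout is simply the following (device
names and the mount point are examples, not my actual ones):

  # Pool the big data partition of each disk into one btrfs:
  # data striped (raid0), metadata mirrored (raid1).
  mkfs.btrfs -L pool -d raid0 -m raid1 /dev/sda2 /dev/sdb2 /dev/sdc2
  btrfs device scan              # if the kernel hasn't seen the devices yet
  mount /dev/sda2 /mnt/pool
  btrfs filesystem df /mnt/pool  # should report Data: RAID0, Metadata: RAID1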
[most personal stuff snipped, feel free to PM]
> (My SSDs, Corsair Neutron series, run a LAMD (Link A
> Media Devices) controller. These don't have the compression
> or dedup features of something like the sandforce controllers,
> but the Neutrons at least (as opposed to the Neutron GTX) are
> enterprise targeted, with the resulting predictable performance,
> capacity and reliability bullet-point features. What you save to
> the SSD is saved as-you-sent-it, regardless of compressibility or
> whether it's a dup of something else already on the SSD. Thus,
> at least with my SSDs, the redundant working and backup copies
> are actually two copies on the SSD as well, not one compressed/
> dedupped copy. That's a very nice confidence point when the
> whole /point/ of sending two copies is to have a backup! So
> for anyone reading this that decides to do something similar,
> be sure your SSD firmware isn't doing de-duping in the background,
> leaving you with only the one copy regardless of what you thought
> you might have saved!)
This is actually an important point when thinking of putting btrfs with raid
features on such drives, or enabling btrfs compression.
> Still on spinning rust, meanwhile, all my filesystems remain the many-
> years-stable reiserfs. I keep a working and backup media partition
> there, as well as second backup partitions for everything on btrfs on the
> ssds, just in case.
As initially mentioned, reiserfs has proven extremely reliable for me too,
even in disastrous circumstances; no other FS has ever matched that. But its
scaling under multi-process IO, as mostly seen on busy servers, is - to put
it mildly - bad, and XFS was the best candidate to fill that gap. I had lost
my trust in the ext family long before, so that was no option. Yes, I know:
it's mature, it's proven, it's stable. But when it does get corrupted, for
whatever reason, my experience is that the chances of recovery are virtually
non-existent. Let's see how btrfs works out for me over the next years. It's
not there yet and performance is worse in almost every workload... but it has
some compelling features. ;-)
> I figure if the external gets taken out too, say by fire if my house
> burnt down or by theft if someone broke in and stole it, I'd have much
> more important things to worry about for a while than what might have
> happened to my data!
Right. But still, my important working set (read: git repos, intellectual
goods and activity, ...) is always mirrored elsewhere, and btrfs snapshots
help against accidental damage.
> And once I did get back on my feet and ready to
> think about computing again, much of the data would be sufficiently
> outdated as to be near worthless in any case. At that point I might as
> well start from scratch but for the knowledge in my head, and whatever
> offsite or similar backups I might have had probably wouldn't be worth
> the trouble to recover anyway, so that's beyond what is cost/time/hassle
> effective and I don't bother.
I once lost my entire working set - that's when I started to hate FAT. That
was probably 15 years ago but it still bugs me; I only had a few, vastly
outdated backups. Lessons learned. So maybe saying "I don't bother" is not as
easy as you think right now. Just my two cents...
Sorry for the noise, list.
I'm enjoying the discussion with you - I suggest we get in touch by PM if
this drifts any further away from btrfs...
Thanks,
Kai
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Migrate to bcache: A few questions
2013-12-30 16:02 ` Austin S Hemmelgarn
@ 2014-01-01 10:06 ` Duncan
2014-01-01 20:12 ` Austin S Hemmelgarn
1 sibling, 0 replies; 14+ messages in thread
From: Duncan @ 2014-01-01 10:06 UTC (permalink / raw)
To: linux-btrfs
Austin S Hemmelgarn posted on Mon, 30 Dec 2013 11:02:21 -0500 as
excerpted:
> I've actually tried a similar configuration myself a couple of times
> (also using Gentoo, in fact), and I can tell you from experience that
> unless things have changed greatly since kernel 3.12.1, it really isn't
> worth the headaches.
Basically what I posted, but "now with added real experience!" (TM) =:^)
> As an alternative to using bcache, you might try something similar to
> the following:
> 64G SSD with /boot, /, and /usr
> Other HDD with /var, /usr/portage, /usr/src, and /home
> tmpfs or ramdisk for /tmp and /var/tmp
> This is essentially what I use now, and I have found that it
> significantly improves system performance.
Again, very similar to my own recommendation. Nice to see others saying
the same thing. =:^)
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Migrate to bcache: A few questions
2013-12-30 16:02 ` Austin S Hemmelgarn
2014-01-01 10:06 ` Duncan
@ 2014-01-01 20:12 ` Austin S Hemmelgarn
2014-01-02 8:49 ` Duncan
1 sibling, 1 reply; 14+ messages in thread
From: Austin S Hemmelgarn @ 2014-01-01 20:12 UTC (permalink / raw)
To: Kai Krakow, linux-btrfs
On 12/30/2013 11:02 AM, Austin S Hemmelgarn wrote:
>
> As an alternative to using bcache, you might try something similar to
> the following:
> 64G SSD with /boot, /, and /usr
> Other HDD with /var, /usr/portage, /usr/src, and /home
> tmpfs or ramdisk for /tmp and /var/tmp
> This is essentially what I use now, and I have found that it
> significantly improves system performance.
>
On this specific note, I would actually suggest against putting the portage
tree on btrfs; it makes syncing go ridiculously slow, and it also seems to
slow down emerge.
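For concreteness, the split layout quoted above might look roughly like the
/etc/fstab below. All devices, sizes and filesystem types are assumptions for
illustration, not something specified anywhere in this thread:

  # SSD: /boot plus / (with /usr living on the root filesystem)
  /dev/sda1   /boot          ext2    noauto,noatime     0 2
  /dev/sda2   /              ext4    noatime            0 1
  # HDD: the frequently rewritten trees
  /dev/sdb1   /var           ext4    noatime            0 2
  /dev/sdb2   /usr/portage   ext4    noatime            0 2
  /dev/sdb3   /usr/src       ext4    noatime            0 2
  /dev/sdb4   /home          ext4    noatime            0 2
  # RAM-backed temporary directories
  tmpfs       /tmp           tmpfs   size=4G,mode=1777  0 0
  tmpfs       /var/tmp       tmpfs   size=8G,mode=1777  0 0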
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Migrate to bcache: A few questions
2014-01-01 20:12 ` Austin S Hemmelgarn
@ 2014-01-02 8:49 ` Duncan
2014-01-02 12:36 ` Austin S Hemmelgarn
0 siblings, 1 reply; 14+ messages in thread
From: Duncan @ 2014-01-02 8:49 UTC (permalink / raw)
To: linux-btrfs
Austin S Hemmelgarn posted on Wed, 01 Jan 2014 15:12:40 -0500 as
excerpted:
> On 12/30/2013 11:02 AM, Austin S Hemmelgarn wrote:
>>
>> As an alternative to using bcache, you might try something similar to
>> the following:
>> 64G SSD with /boot, /, and /usr
>> Other HDD with /var, /usr/portage, /usr/src, and /home
>> tmpfs or ramdisk for /tmp and /var/tmp
>> This is essentially what I use now, and I have found that it
>> significantly improves system performance.
>>
> On this specific note, I would actually suggest against putting the
> portage tree on btrfs; it makes syncing go ridiculously slow, and it
> also seems to slow down emerge.
Interesting observation.
I had not seen it here (with the gentoo tree and overlays on btrfs), but
that's very likely because all my btrfs filesystems are on SSD; I upgraded to
both at the same time, because my previous default filesystem choice,
reiserfs, isn't well suited to SSD due to the excessive writes caused by its
journaling.
I do know slow syncs and portage dep-calculations were among the reasons I
switched to SSD (and thus btrfs), however. That was getting pretty painful on
spinning rust, at least with reiserfs. And I imagine btrfs on single-device
spinning rust would, if anything, be worse, at least for syncs, due to the
default dup metadata: at least three writes (and three seeks) for each file -
one for the data, two for the metadata.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Migrate to bcache: A few questions
2014-01-02 8:49 ` Duncan
@ 2014-01-02 12:36 ` Austin S Hemmelgarn
2014-01-03 0:09 ` Duncan
0 siblings, 1 reply; 14+ messages in thread
From: Austin S Hemmelgarn @ 2014-01-02 12:36 UTC (permalink / raw)
To: Duncan, linux-btrfs
On 2014-01-02 03:49, Duncan wrote:
> Austin S Hemmelgarn posted on Wed, 01 Jan 2014 15:12:40 -0500 as
> excerpted:
>
>> On 12/30/2013 11:02 AM, Austin S Hemmelgarn wrote:
>>>
>>> As an alternative to using bcache, you might try something similar to
>>> the following:
>>> 64G SSD with /boot, /, and /usr
>>> Other HDD with /var, /usr/portage, /usr/src, and /home
>>> tmpfs or ramdisk for /tmp and /var/tmp
>>> This is essentially what I use now, and I have found that it
>>> significantly improves system performance.
>>>
>> On this specific note, I would actually suggest against putting the
>> portage tree on btrfs; it makes syncing go ridiculously slow, and it
>> also seems to slow down emerge.
>
> Interesting observation.
>
> I had not seen it here (with the gentoo tree and overlays on btrfs), but
> that's very likely because all my btrfs filesystems are on SSD; I upgraded
> to both at the same time, because my previous default filesystem choice,
> reiserfs, isn't well suited to SSD due to the excessive writes caused by
> its journaling.
>
> I do know slow syncs and portage dep-calculations were among the reasons
> I switched to SSD (and thus btrfs), however. That was getting pretty
> painful on spinning rust, at least with reiserfs. And I imagine btrfs on
> single-device spinning rust would, if anything, be worse, at least for
> syncs, due to the default dup metadata: at least three writes (and three
> seeks) for each file - one for the data, two for the metadata.
>
I think the triple seek+write is probably the biggest offender in my case,
although COW and autodefrag probably don't help either. I'm kind of hesitant
to put stuff that gets changed daily on an SSD, so I've ended up putting
portage on ext4 with no journaling (which out-performs every other filesystem
I have tested WRT write performance). As for the dep-calculations, I have 16G
of ram, so I just use a script to read the entire tree into the page cache
after each sync.
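Roughly, the two pieces might look like this (device name and tree path are
examples; this is a sketch rather than my exact setup):

  # ext4 without a journal, for a tree that is cheap to recreate.
  mkfs.ext4 -O ^has_journal /dev/sdb2
  tune2fs -l /dev/sdb2 | grep -i features    # has_journal should be absent

  # Cache-warming pass: read every file once so the page cache is hot
  # before the next dep calculation.
  find /usr/portage -type f -print0 | xargs -0 cat > /dev/null 2>&1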
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Migrate to bcache: A few questions
2014-01-02 12:36 ` Austin S Hemmelgarn
@ 2014-01-03 0:09 ` Duncan
0 siblings, 0 replies; 14+ messages in thread
From: Duncan @ 2014-01-03 0:09 UTC (permalink / raw)
To: linux-btrfs
Austin S Hemmelgarn posted on Thu, 02 Jan 2014 07:36:06 -0500 as
excerpted:
> I think the triple seek+write is probably the biggest offender in my case,
> although COW and autodefrag probably don't help either. I'm kind of
> hesitant to put stuff that gets changed daily on an SSD, so I've ended
> up putting portage on ext4 with no journaling (which out-performs every
> other filesystem I have tested WRT write performance).
I went ahead with the gentoo tree and overlays on SSD, because... well,
they need the fast access that SSD provides, and if I can't use the SSD
for its good points, why did I buy it in the first place?
It's also worth noting that only a few files change on a day to day
basis. Most of the tree remains as it is. Similarly with the git pack
files behind the overlays (and live-git sources) -- once they reach a
certain point they stop changing and all the changes go into a new file.
Further, most reports I've seen suggest that daily changes to some reasonably
small part of an SSD aren't a huge problem... given wear-leveling and an
estimated normal lifetime of, say, three to five years before they're
replaced with new hardware anyway, daily changes simply shouldn't be an
issue. It's worth keeping the limited write cycles in mind and minimizing
writes where possible, but it's not quite the big deal a lot of people make
it out to be.
Additionally, I'm near 100% overprovisioned, giving the SSDs lots of room
to do that wear-leveling, so...
Meanwhile, are you using tail packing (or rather its ext4 equivalent) on that
ext4? The idea of wasting all that space on all those small files has always
been a downer for me and others, and is the reason many of us used reiserfs
for many years. I guess ext4 now has an inline-data feature or some similar
solution, but I do wonder how much it affects performance.
Of course it would also be possible to run reiserfs without tail packing, and
even without journaling. But somehow I always thought: what's the point of
running reiser if those are disabled?
Anyway, I'd find it interesting to benchmark the effect of tail packing (or
whatever ext4 calls it) on a no-journal ext4 holding the gentoo tree. If you
happen to know, or happen to be inspired to run those benchmarks now, I'd be
interested...
> As for the
> dep-calculations, I have 16G of ram, so I just use a script to read the
> entire tree into the page cache after each sync.
With 16 gigs of RAM, won't the sync have pulled everything into the page
cache already? That has always seemed to be the case here. Running an emerge
--deep --update --newuse @world here after a sync shows very little disk
activity and takes far less time than trying the same thing after an
unmount/remount, i.e. with a cold cache.
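If you want to check, a rough way to compare the two cases (run as root; the
--pretend run does only the dependency calculation):

  # Warm cache: dep calculation right after a sync.
  time emerge --pretend --deep --update --newuse @world

  # Cold cache: drop the page cache, then repeat.
  sync && echo 3 > /proc/sys/vm/drop_caches
  time emerge --pretend --deep --update --newuse @world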
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2014-01-03 0:09 UTC | newest]
Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-12-29 21:11 Migrate to bcache: A few questions Kai Krakow
2013-12-30 1:03 ` Chris Murphy
2013-12-30 1:22 ` Kai Krakow
2013-12-30 3:48 ` Chris Murphy
2013-12-30 9:01 ` Marc MERLIN
2013-12-31 0:31 ` Kai Krakow
2013-12-30 6:24 ` Duncan
2013-12-31 3:13 ` Kai Krakow
2013-12-30 16:02 ` Austin S Hemmelgarn
2014-01-01 10:06 ` Duncan
2014-01-01 20:12 ` Austin S Hemmelgarn
2014-01-02 8:49 ` Duncan
2014-01-02 12:36 ` Austin S Hemmelgarn
2014-01-03 0:09 ` Duncan