linux-btrfs.vger.kernel.org archive mirror
* Migrate to bcache: A few questions
@ 2013-12-29 21:11 Kai Krakow
  2013-12-30  1:03 ` Chris Murphy
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Kai Krakow @ 2013-12-29 21:11 UTC (permalink / raw)
  To: linux-btrfs

Hello list!

I'm planning to buy a small SSD (around 60GB) and use it for bcache in front 
of my 3x 1TB HDD btrfs setup (mraid1+draid0) using write-back caching. Btrfs 
is my root device, thus the system must be able to boot from bcache using 
init ramdisk. My /boot is a separate filesystem outside of btrfs and will be 
outside of bcache. I am using Gentoo as my system.

I have a few questions:

* How stable is it? I've read about some csum errors lately...

* I want to migrate my current storage to bcache without replaying a backup.
  Is it possible?

* Did others already use it? What is the perceived performance for desktop
  workloads in comparison to not using bcache?

* How well does bcache handle power outages? Btrfs has handled them very
  well for many months now.

* How well does it play with dracut as initrd? Is it as simple as telling it
  the new device nodes or is there something complicated to configure?

* How does bcache handle a failing SSD when it starts to wear out in a few
  years?

* Is it worth waiting for hot-relocation support in btrfs to natively use
  a SSD as cache?

* Would you recommend going with a bigger/smaller SSD? I'm planning to use
  only 75% of it for bcache so wear-leveling can work better, maybe use
  another part of it for hibernation (suspend to disk).

Regards,
Kai



* Re: Migrate to bcache: A few questions
  2013-12-29 21:11 Migrate to bcache: A few questions Kai Krakow
@ 2013-12-30  1:03 ` Chris Murphy
  2013-12-30  1:22   ` Kai Krakow
  2013-12-30  6:24 ` Duncan
  2013-12-30 16:02 ` Austin S Hemmelgarn
  2 siblings, 1 reply; 14+ messages in thread
From: Chris Murphy @ 2013-12-30  1:03 UTC (permalink / raw)
  To: Btrfs BTRFS


On Dec 29, 2013, at 2:11 PM, Kai Krakow <hurikhan77+btrfs@gmail.com> wrote:

> 
> * How stable is it? I've read about some csum errors lately…

Seems like bcache devs are still looking into the recent btrfs csum issues.

> 
> * I want to migrate my current storage to bcache without replaying a backup.
>  Is it possible?
> 
> * Did others already use it? What is the perceived performance for desktop
>  workloads in comparison to not using bcache?
> 
> * How well does bcache handle power outages? Btrfs has handled them very
>  well for many months now.
> 
> * How well does it play with dracut as initrd? Is it as simple as telling it
>  the new device nodes or is there something complicated to configure?
> 
> * How does bcache handle a failing SSD when it starts to wear out in a few
>  years?

I think most of these questions are better suited for the bcache list. There are still many uncertainties about the behavior of SSDs during power failures when they aren't explicitly designed with power-failure protection in mind. At best I'd hope for a rollback involving data loss, but hopefully not a corrupt file system. I'd rather lose the last minute of data supposedly written to the drive than have to do a full restore from backup.

> 
> * Is it worth waiting for hot-relocation support in btrfs to natively use
>  a SSD as cache?

I haven't read anything about it, and I don't see it listed in the project ideas.

> 
> * Would you recommend going with a bigger/smaller SSD? I'm planning to use
>  only 75% of it for bcache so wear-leveling can work better, maybe use
>  another part of it for hibernation (suspend to disk).

I think that depends greatly on workload. If you're writing or reading a lot of disparate files, or doing a lot of small random writes (as on a mail server), I'd go bigger. By default, sequential IO isn't cached, so I think you can get a big boost in responsiveness with a relatively small bcache size.
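
For what it's worth, that sequential cutoff is tunable at runtime through sysfs once the cached device exists; a minimal sketch, assuming the cached device shows up as bcache0:

    # show the current cutoff; IO streams larger than this bypass the cache
    cat /sys/block/bcache0/bcache/sequential_cutoff
    # lower it to cache more mid-sized streams, or set 0 to cache everything
    echo 1M > /sys/block/bcache0/bcache/sequential_cutoff
    echo 0  > /sys/block/bcache0/bcache/sequential_cutoff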


Chris Murphy


* Re: Migrate to bcache: A few questions
  2013-12-30  1:03 ` Chris Murphy
@ 2013-12-30  1:22   ` Kai Krakow
  2013-12-30  3:48     ` Chris Murphy
  2013-12-30  9:01     ` Marc MERLIN
  0 siblings, 2 replies; 14+ messages in thread
From: Kai Krakow @ 2013-12-30  1:22 UTC (permalink / raw)
  To: linux-btrfs

Chris Murphy <lists@colorremedies.com> schrieb:

> I think most of these questions are better suited for the bcache list.

Ah yes, you are right. I will repost the non-btrfs related questions to the 
bcache list. But actually I am most interested in using bcache together with 
btrfs, so getting a general picture of its current state in this combination 
would be nice - and so these questions may be partially appropriate here.

> I
> think there are still many uncertainties about the behavior of SSDs during
> power failures when they aren't explicitly designed with power failure
> protection in mind. At best I'd hope for a rollback involving data loss,
> but hopefully not a corrupt file system. I'd rather lose the last minute
> of data supposedly written to the drive, than have to do a fuil restore
> from backup.

These thoughts are actually quite interesting. So you are saying that data 
may not be fully written to SSD although the kernel thinks so? This is 
probably very dangerous. The bcache module could not ensure coherence 
between its backing devices and its own contents - and data loss will occur 
and probably destroy important file system structures.

I understand your words as "data may only be partially written". This, of 
course, may happen to HDDs as well. But usually a file system works with 
transactions so the last incomplete transaction can simply be thrown away. I 
hope bcache implements the same architecture. But what does it mean for the 
stacked write-back architecture?

As I understand, bcache may use write-through for sequential writes, but 
write-back for random writes. In this case, part of the data may have hit 
the backing device, while other data only exists in the bcache. If that last 
transaction is not closed due to power-loss, and then thrown away, we have 
part of the transaction already written to the backing device that the 
filesystem does not know of after resume.

I'd appreciate some thoughts about it but this topic is probably also best 
moved over to the bcache list.

Thanks,
Kai 



* Re: Migrate to bcache: A few questions
  2013-12-30  1:22   ` Kai Krakow
@ 2013-12-30  3:48     ` Chris Murphy
  2013-12-30  9:01     ` Marc MERLIN
  1 sibling, 0 replies; 14+ messages in thread
From: Chris Murphy @ 2013-12-30  3:48 UTC (permalink / raw)
  To: Btrfs BTRFS


On Dec 29, 2013, at 6:22 PM, Kai Krakow <hurikhan77+btrfs@gmail.com> wrote:

> So you are saying that data 
> may not be fully written to SSD although the kernel thinks so?

Drives shouldn't lie when asked to flush to disk, but they do. An older article about this at LWN is a decent primer on the subject of write barriers.

http://lwn.net/Articles/283161/

> This is 
> probably very dangerous. The bcache module could not ensure coherence 
> between its backing devices and its own contents - and data loss will occur 
> and probably destroy important file system structures.

I don't know the details; there's more on lkml.org and the bcache lists. My impression is that, short of bugs, it should be much safer than you describe. It's not like a linear/concat md or LVM device-failure scenario. There's good info in the bcache.h file:

http://lxr.free-electrons.com/source/drivers/md/bcache/bcache.h

If anything, once the kinks are worked out, under heavy random write IO I'd expect bcache to improve the likelihood data isn't lost. Faster speed of SSD means we get a faster commit of the data to stable media. Also bcache assumes the cache is always dirty on startup, no matter whether the shutdown was clean or dirty, so the code is explicitly designed to resolve the state of the cache relative to the backing device. It's actually pretty fascinating work.

It may not be required, but I'd expect we'd want the write cache on the backing device disabled. It should still honor write barriers but it kinda seems unnecessary and riskier to have it enabled (which is the default with consumer drives).
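
Something like this should do it, assuming the backing HDD is /dev/sdX (a placeholder) and the drive honors the feature:

    hdparm -W /dev/sdX     # query the on-drive volatile write cache state
    hdparm -W0 /dev/sdX    # disable it; completed writes then imply data on the platters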


> As I understand, bcache may use write-through for sequential writes, but 
> write-back for random writes. In this case, part of the data may have hit 
> the backing device, while other data only exists in the bcache. If that last 
> transaction is not closed due to power-loss, and then thrown away, we have 
> part of the transaction already written to the backing device that the 
> filesystem does not know of after resume.

In the write-through case we should be no worse off than the bare drive in a power loss. In the write-back case the SSD should have committed more data than the HDD could have in the same situation. I don't understand the details of how partially successful writes to the backing media are handled when the system comes back up. Since bcache is also COW, SSD blocks aren't reused until data is committed to the backing device.
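
For completeness, the caching mode itself is a per-device sysfs knob and can be switched at runtime; again assuming the cached device is bcache0:

    cat /sys/block/bcache0/bcache/cache_mode      # bracketed entry is the active mode
    echo writeback > /sys/block/bcache0/bcache/cache_mode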


Chris Murphy


* Re: Migrate to bcache: A few questions
  2013-12-29 21:11 Migrate to bcache: A few questions Kai Krakow
  2013-12-30  1:03 ` Chris Murphy
@ 2013-12-30  6:24 ` Duncan
  2013-12-31  3:13   ` Kai Krakow
  2013-12-30 16:02 ` Austin S Hemmelgarn
  2 siblings, 1 reply; 14+ messages in thread
From: Duncan @ 2013-12-30  6:24 UTC (permalink / raw)
  To: linux-btrfs

Kai Krakow posted on Sun, 29 Dec 2013 22:11:16 +0100 as excerpted:

> Hello list!
> 
> I'm planning to buy a small SSD (around 60GB) and use it for bcache in
> front of my 3x 1TB HDD btrfs setup (mraid1+draid0) using write-back
> caching. Btrfs is my root device, thus the system must be able to boot
> from bcache using init ramdisk. My /boot is a separate filesystem
> outside of btrfs and will be outside of bcache. I am using Gentoo as my
> system.

Gentooer here too. =:^)

> I have a few questions:
> 
> * How stable is it? I've read about some csum errors lately...

FWIW, both bcache and btrfs are new and still developing technology.  
While I'm using btrfs here, I have tested-usable (which for root means 
either directly bootable, or that you have tested booting to a 
recovery image and restoring from there; I do the former here) backups, 
as STRONGLY recommended for btrfs in its current state, but haven't had 
to use them.

And I considered bcache previously and might otherwise be using it, but 
at least personally, I'm not willing to try BOTH of them at once, since 
neither one is mature yet and if there are problems as there very well 
might be, I'd have the additional issue of figuring out which one was the 
problem, and I'm personally not prepared to deal with that.

Instead, at this point I'd recommend choosing /either/ bcache /or/ btrfs, 
and using bcache with a more mature filesystem like ext4 or (what I used 
for years previous and still use for spinning rust) reiserfs.

And as I said, keep your backups as current as you're willing to deal 
with losing what's not backed up, and tested usable and (for root) either 
bootable or restorable from alternate boot, because while at least btrfs 
is /reasonably/ stable for /ordinary/ daily use, there remain corner-
cases and you never know when your case is going to BE a corner-case!

> * I want to migrate my current storage to bcache without replaying a
> backup.  Is it possible?

Since I've not actually used bcache, I won't try to answer some of these, 
but will answer based on what I've seen on the list where I can...  I 
don't know on this one.

> * Did others already use it? What is the perceived performance for
> desktop workloads in comparison to not using bcache?

Others are indeed already using it.  I've seen some btrfs/bcache problems 
reported on this list, but as mentioned above, when both are in use that 
means figuring out which is the problem, and at least from the btrfs side 
I've not seen a lot of resolution in that regard.  From here it /looks/ 
like that's simply being punted at this time, as there's still more 
easily traceable problems without the additional bcache variable to work 
on first.  But it's quite possible the bcache list is actively tackling 
btrfs/bcache combination problems, as I'm not subscribed there.

So I can't answer the desktop performance comparison question directly, 
but given that I /am/ running btrfs on SSD, I /can/ say I'm quite happy 
with that. =:^)

Keep in mind...

We're talking storage cache here.  Given the cost of memory and common 
system configurations these days, 4-16 gig of memory on a desktop isn't 
unusual or cost prohibitive, and a common desktop working set should well 
fit.

I suspect my desktop setup, 16 gigs memory backing a 6-core AMD fx6100 
(bulldozer-1) @ 3.6 GHz, is probably a bit toward the high side even for 
a gentooer, but not inordinately so.  Based on my usage...

Typical app memory usage runs 1-2 GiB (that's with KDE 4.12.49.9999 from 
the gentoo/kde overlay, but USE=-semantic-desktop, etc).  Buffer memory 
runs a few MiB but isn't normally significant, so it can fold into that 
same 1-2 GiB too.

That leaves a full 14 GiB for cache.  But at least with /my/ usage, 
normal non-update cache memory usage tends to be below ~6 GiB too, so 
total apps/buffer/cache memory usage tends to be below 8 GiB as well.

When I'm doing multi-job builds or working with big media files, I'll 
sometimes go above 8 gig usage, and that occasional cache-spill was why I 
upgraded to 16 gig.  But in practice, 10 gig would take care of that most 
of the time, and were it not for the "accident" of powers-of-two meaning 
16 gig is the notch above 8 gig, 10 or 12 gig would be plenty.  Truth be 
told, I so seldom use that last 4 gig that it's almost embarrassing.

* Tho if I ran multi-GiB VMs that'd use up that extra memory real fast!  
But while that /is/ becoming more common, I'm not exactly sure I'd 
classify 4 gigs plus of VM usage as "desktop" usage just yet.  
Workstation, yes, and definitely server, but not really desktop.

All that as background to this...

* Cache works only after first access.  If you only access something 
occasionally, it may not be worth caching at all.

* Similarly, if access isn't time critical, think of playing a huge video 
file where only a few meg in memory at once is plenty, and where storage 
access is several times faster than play-speed, cache isn't particularly 
useful.

* Bcache is designed not to cache sequential access (that large video 
file) in any case, since spinning rust tends to be more than fast enough 
for that sort of thing already.

Given the stated 3 x 1TB drive btrfs in raid1 metadata, raid0 data, config 
you mention, I'm wondering if big media is a/the big use case for you, in 
which case bcache isn't going to be a good solution anyway, since that 
tends to be sequential access, which bcache deliberately ignores as it 
doesn't fit the model it's targeting.

(I am a bit worried about that raid0 data, tho.  Unless you consider that 
data of trivial value that's not a good choice, since raid0 generally 
means you lose it all if you lose a physical device.  And you're running 
three devices, which means you just tripled the chance of a device 
failure over that of just putting it all on a single 3 TB drive!  And 
backups... a 3 TB restore on spinning rust will take some time any way 
you look at it, so backups may or may not be particularly viable here.  
The most common use case for that much data is probably a DVR scenario,  
which is video, and you may well consider it of low enough value that if 
you lose it, you lose it, and you're willing to take that risk, but for 
normally sequential access video/media, bcache isn't a good match anyway.)

* With memory cost what it is, for repeat access where initial access 
time isn't /too/ critical, investing in more memory, to a point (for me, 
8-12 gig as explained above), and simply letting the kernel manage cache 
and memory as it normally does, may make more sense than bcache to an ssd.

* Of course, what bcache *DOES* effectively do, is extend the per-boot 
cache time of memory, making the cache persistent.  That effectively 
extends the time over which "occasional access" still justifies caching 
at all.

* That makes bcache well suited to boot-time and initial-access-speed-
critical scenarios, where more memory for a larger in-memory cache won't 
do any good, since it's first-access-since-boot, because for in-memory 
cache that's a cold-cache scenario, while with bcache's persistent cache, 
it's a hot-cache scenario.


But what I'm actually wondering is if your use case better matches a 
split data model, where you put root and perhaps stuff like the portage 
tree and/or /home on fast SSD, while keeping all that big and generally 
sequential access media on slower but much cheaper big spinning rust.

That's effectively what I've done here, tho I'm looking at rather less 
than a TB of slow-access media, etc.  See below for the details.  The 
general idea is as I said to stick all the time-critical stuff on SSD 
directly (not using something like bcache), while keeping the slower 
spinning rust for the big less-time-critical and sequential-access stuff, 
and for non-btrfs backups of the stuff on the btrfs-formatted SSD, since 
btrfs /is/ after all still in development, and I /do/ intend to be 
prepared if /my/ particular case ends up being one of the corner-cases 
btrfs still worst-cases on.

> * How well does bcache handle power outages? Btrfs has handled them very
>   well for many months now.

Since I don't run bcache I can't really speak to this at all, /except/, 
the btrfs/bcache combo trouble reports that have come to the list have I 
think all been power outage or kernel-crash scenarios... as could be 
predicted of course since that's a filesystem's worst-case scenario, at 
least that it has to commonly deal with.

But I know I'd definitely not trust that case, ATM.  Like I said, I'd not 
trust the combination of the two, and this is exactly where/why.  Under 
normal operation, the two should work together well.  But in a power-loss 
situation with both technologies being still relatively new and under 
development... not *MY* data!

> * How well does it play with dracut as initrd? Is it as simple as
> telling it the new device nodes or is there something complicated to
> configure?

I can't answer this at all for bcache, but I can say I've been relatively 
happy with the dracut initramfs solution for dual-device btrfs raid1 
root. =:^)  (At least back when I first set it up several kernels ago, 
the kernel's commandline parser apparently couldn't handle the multiple 
equals of something like rootflags=device=/dev/sda5,device=/dev/sdb5.  So 
the only way to get a multi-device btrfs rootfs to work was to use an 
initr* with userspace btrfs device scan before attempting to mount real-
root, and dracut has worked well for that.)
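
For the curious, the userspace workaround amounts to something like this 
inside the initr*, before the real root is mounted (device names are just 
examples; as far as I know dracut's btrfs module does essentially this for 
you):

    # register all members of the multi-device filesystem with the kernel
    btrfs device scan
    # then mount either member, optionally listing the devices explicitly
    mount -o device=/dev/sda5,device=/dev/sdb5 /dev/sda5 /sysroot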

> * How does bcache handle a failing SSD when it starts to wear out in a
> few years?

Given the newness of the bcache technology, assuming your SSD doesn't 
fail early and it is indeed a few years, I'd suggest that question is 
premature.  Bcache will by that time be much older and more mature than 
it is now, and how it'd handle, or fail to handle, such an event /now/ 
likely hasn't a whole lot to do with how much (presumably) better it'll 
handle it /then/.

> * Is it worth waiting for hot-relocation support in btrfs to natively
> use a SSD as cache?

I wouldn't wait for it.  It's on the wishlist, but according to the wiki 
(project ideas, see the dm_cache or bcache like cache, and the hybrid 
storage points), nobody has claimed that project yet, which makes it 
effectively status "bluesky", which in turn means "nice idea, we might 
get to it... someday."

Given the btrfs project history of everything seeming to take rather 
longer than the original, as-it-turned-out wildly optimistic, projections, in 
the absence of a good filesystem dev personally getting that specific 
itch to scratch, that means it's likely a good two years out, and may be 
5-10.  So no, I'd definitely *NOT* wait on it!

> * Would you recommend going with a bigger/smaller SSD? I'm planning to
> use only 75% of it for bcache so wear-leveling can work better, maybe
> use another part of it for hibernation (suspend to disk).

FWIW, for my split data, some on SSD, some on spinning rust, setup, I had 
originally planned perhaps a 64 gig or so SSD, figuring I could put the 
boot-time-critical rootfs and a few other initial-access-time-critical 
things on it, with a reasonable amount of room to spare for wear-
leveling.  Maybe 128 gig or so, with a bit more stuff on it.

But when I actually went looking for hardware (some months ago now, but 
rather less than a year), I found the availability and price-point knee 
at closer to 256 gig.  128 gig or so was at a similar price-point per-
gig, but tends to sell out pretty fast as it's about half the gigs and 
thus about half the price.  There were some smaller ones available, but 
they tended to be either MUCH slower or MUCH higher priced, I'd guess 
left over from a previous generation before prices came down, and they 
simply hadn't been re-priced to match current price/capacity price-points.

But much below 128 GiB (there were some 120 GB at about the same per-gig, 
which "units" says is just under 112 GiB) and the price per gig tends to 
go up, while above 256 GB (not GiB) both the price per gig and full price 
tend to go up.

In practice, 60 or 80 GB SSDs just didn't seem to be that much cheaper 
than 120-ish gig, and 120-ish gig were a good deal, but were popular 
enough that availability was a bit of an issue.

So I actually ended up with 256 GB, which works out to ~ 238 GiB.  Yeah I 
paid a bit more, but that both gave me a bit more flexibility in terms of 
what I put on them, AND meant after I set them up I STILL had about 40% 
unallocated, giving them *LOTS* of wear-leveling room.

Of course that means if you do actually do bcache, 60-ish gigs should be 
good and I'd guess 128 gig would be overkill, as I guess 40-60 gigs 
is probably about what my "hot" data is, the stuff bcache would likely catch.

And 60 gig will likely be /some/ cheaper tho not as much as you might 
expect, but you'll lose flexibility too, and/or you might actually pay 
more for the 60 gig than the 120 gig, or it'll be slower speed-rated.  
That was what I found when I actually went out to buy, anyway.

As to layout (all GPT partitions, not legacy MBR):

On the SSD(s, I actually have two setup, mostly in btrfs dual-device data/
metadata raid1 partitions but with some single-device, mixed/dup):

-	(boot area)

x	1007 KiB free space (so partitions are 1 MiB aligned)

1	3 MiB BIOS reserved partition

	(grub2 puts its core image here, partitions are now 4 MiB aligned)

2	124 MiB EFI reserved partition (for EFI forward compatibility)

	(partitions are now 128 MiB aligned)

3	256 MiB /boot (btrfs mixed-block mode, DUP data/metadata)

	I have a separate boot partition on each of the SSDs, with grub2
	installed to both SSDs separately, pointing at its own /boot, with
	the SSD I boot from selectable in the BIOS.  That gives me a working /boot
	and a primary /boot backup.  I run git kernels and normally
	update the working /boot with a new kernel once or twice a week,
	while only updating the backup /boot with the release kernel, so
	every couple months.

4	640 MiB /var/log (btrfs mixed-mode, raid1 data/metadata)

	That gives me plenty of log space as long as logrotate doesn't
	break, while still keeping a reasonable cap on the log partition
	in case I get a runaway log.  As any good sysadmin should know,
	some from experience (!!), keeping a separate log partition is a
	good idea, since that limits the damage if something /does/ go
	runaway logging.

	(partitions beyond this are now 1 GiB aligned)

5	8 GiB rootfs (btrfs raid1 data/metadata)

	My rootfs includes (almost) all "installable" data, everything
	installed by packages except for /var/lib, which is a symlink to
	/home/var/lib.  The reason for that is that I keep rootfs mounted
	read-only by default, only mounting it read-write for updates or
	configuration changes, and /var/lib needs to be writable.  /home
	is mounted writable, thus the /var/lib symlink pointing into it.

	I learned the hard way to keep everything installed (but for
	/var/lib) on the same filesystem, along with the installed-
	package database (/var/db/pkg on gentoo), when I had to deal with
	a recovery situation with rootfs, /var, and /usr on separate
	partitions, recovering each one from a backup made at a different
	time!  Now I make **VERY** sure everything stays in sync, so
	the installed-package database matches what's actually installed.

	(Obviously /var/lib is a limited exception in ordered to keep
	rootfs read-only by default.  If I have to recover from an out of
	sync /home and thus /home/var/lib, I can query for what packages
	own /var/lib and reinstall them.)

6	20 GiB /home (btrfs raid1 data/metadata)

	20 GiB is plenty big enough for /home, since I keep my big media
	files on a dedicated media partition on spinning rust.

7	24 GiB build and packages tree (btrfs raid1 data/metadata)

	I mount this at /usr/src, since that seemed logical, but it
	contains the traditional /usr/src/linux (a git kernel, here),
	plus the gentoo tree and layman-based overlays, plus my binpkg
	cache, plus the ccache.  Additionally it contains the 32-bit
	chroot binpkg cache and ccache, see below.

8	8 GiB 32-bit chroot build-image (btrfs raid1 data/metadata)

	I have a 32-bit netbook that runs gentoo also.  This is its
	build image, more or less a copy of its rootfs, but on my main
	machine where I build the packages for it.  I keep this
	rsynced to the netbook for its updates.  That way the slower
	netbook with its smaller hard drive doesn't have to build
	packages or keep a copy of the gentoo tree or the 32-bit
	binpkg cache at all.

9-12	Primary backups of partitions 5-8, rootfs, /home, packages, and
	netbook build image.  These partitions are the same size and
	configuration as their working copies above, recreated
	periodically to protect against fat-finger mishaps as well as
	still-under-development btrfs corner-cases and ~arch plus
	live-branch-kde, etc update mishaps.

	(My SSDs, Corsair Neutron series, run a LAMD (Link A
	Media Devices) controller.  These don't have the compression
	or dedup features of something like the sandforce controllers,
	but the Neutrons at least (as opposed to the Neutron GTX) are
	enterprise targeted, with the resulting predictable performance,
	capacity and reliability bullet-point features.  What you save to
	the SSD is saved as-you-sent-it, regardless of compressibility or
	whether it's a dup of something else already on the SSD.  Thus,
	at least with my SSDs, the redundant working and backup copies
	are actually two copies on the SSD as well, not one compressed/
	dedupped copy.  That's a very nice confidence point when the
	whole /point/ of sending two copies is to have a backup!  So
	for anyone reading this that decides to do something similar,
	be sure your SSD firmware isn't doing de-duping in the background,
	leaving you with only the one copy regardless of what you thought
	you might have saved!)
	

That STILL leaves me 117.5 GiB of the 238.5 GiB entirely free and 
unallocated for wear-leveling and/or future flexibility, and I've a 
second copy (primary backup copy) of most of the data as well, which 
could be omitted from SSD (kept on spinning rust) if necessary.
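
For completeness, the btrfs pieces of a layout like that boil down to mkfs 
invocations roughly like these (device names are examples, and the 
mixed/DUP options may differ with older btrfs-progs):

    # /boot: single device, mixed block groups, DUP data/metadata
    mkfs.btrfs --mixed -m dup -d dup /dev/sda3
    # rootfs and the other shared partitions: dual-device raid1 data/metadata
    mkfs.btrfs -m raid1 -d raid1 /dev/sda5 /dev/sdb5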

Before I actually got the drives and was still figuring on 128-ish gigs, 
I was figuring 1 gig x-log, maybe 6 gig rootfs, 16 gig home, 20 gig pkg, 
and another 6 gig netbookroot, so about 49 gig of data if I went with 
60-80 gig SSDs, with the backups as well 97 gig, if I went with 120-ish 
gig SSDs.

But as I said, once I actually was out there shopping for 'em I ended up 
getting the 256 GB (238.5 GiB) SSDs as a near-best bargain in terms of 
rated performance and reliability vs. size vs. price.


Still on spinning rust, meanwhile, all my filesystems remain the many-
years-stable reiserfs.  I keep a working and backup media partition 
there, as well as second backup partitions for everything on btrfs on the 
ssds, just in case.

Additionally, I have an external USB-connected drive that's both 
disconnected and off most of the time, to recover from in case something 
takes out both the SSDs and internal spinning rust.

I figure if the external gets taken out too, say by fire if my house 
burnt down or by theft if someone broke in and stole it, I'd have much 
more important things to worry about for a while than what might have 
happened to my data!  And once I did get back on my feet and ready to 
think about computing again, much of the data would be sufficiently 
outdated as to be near worthless in any case.  At that point I might as 
well start from scratch but for the knowledge in my head, and whatever 
offsite or the like backups I might have had probably wouldn't be worth 
the trouble to recover anyway, so that's beyond cost/time/hassle 
effective and I don't bother.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: Migrate to bcache: A few questions
  2013-12-30  1:22   ` Kai Krakow
  2013-12-30  3:48     ` Chris Murphy
@ 2013-12-30  9:01     ` Marc MERLIN
  2013-12-31  0:31       ` Kai Krakow
  1 sibling, 1 reply; 14+ messages in thread
From: Marc MERLIN @ 2013-12-30  9:01 UTC (permalink / raw)
  To: Kai Krakow; +Cc: linux-btrfs

On Mon, Dec 30, 2013 at 02:22:55AM +0100, Kai Krakow wrote:
> These thoughts are actually quite interesting. So you are saying that data 
> may not be fully written to SSD although the kernel thinks so? This is 

That, and worse.

Incidently, I have just posted on my G+ about this:
https://plus.google.com/106981743284611658289/posts/Us8yjK9SPs6

which is mostly links to
http://lkcl.net/reports/ssd_analysis.html
https://www.usenix.org/conference/fast13/understanding-robustness-ssds-under-power-fault

After you read those, you'll never think twice about SSDs and data loss
anymore :-/
(I kind of found that out myself over time too, but these have much more
data than I got myself empirically on a couple of SSDs)

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                         | PGP 1024R/763BE901


* Re: Migrate to bcache: A few questions
  2013-12-29 21:11 Migrate to bcache: A few questions Kai Krakow
  2013-12-30  1:03 ` Chris Murphy
  2013-12-30  6:24 ` Duncan
@ 2013-12-30 16:02 ` Austin S Hemmelgarn
  2014-01-01 10:06   ` Duncan
  2014-01-01 20:12   ` Austin S Hemmelgarn
  2 siblings, 2 replies; 14+ messages in thread
From: Austin S Hemmelgarn @ 2013-12-30 16:02 UTC (permalink / raw)
  To: Kai Krakow, linux-btrfs

On 12/29/2013 04:11 PM, Kai Krakow wrote:
> Hello list!
> 
> I'm planning to buy a small SSD (around 60GB) and use it for bcache in front 
> of my 3x 1TB HDD btrfs setup (mraid1+draid0) using write-back caching. Btrfs 
> is my root device, thus the system must be able to boot from bcache using 
> init ramdisk. My /boot is a separate filesystem outside of btrfs and will be 
> outside of bcache. I am using Gentoo as my system.
> 
> I have a few questions:
> 
> * How stable is it? I've read about some csum errors lately...
> 
> * I want to migrate my current storage to bcache without replaying a backup.
>   Is it possible?
> 
> * Did others already use it? What is the perceived performance for desktop
>   workloads in comparison to not using bcache?
> 
> * How well does bcache handle power outages? Btrfs has handled them very
>   well for many months now.
> 
> * How well does it play with dracut as initrd? Is it as simple as telling it
>   the new device nodes or is there something complicated to configure?
> 
> * How does bcache handle a failing SSD when it starts to wear out in a few
>   years?
> 
> * Is it worth waiting for hot-relocation support in btrfs to natively use
>   a SSD as cache?
> 
> * Would you recommend going with a bigger/smaller SSD? I'm planning to use
>   only 75% of it for bcache so wear-leveling can work better, maybe use
>   another part of it for hibernation (suspend to disk).
I've actually tried a similar configuration myself a couple of times
(also using Gentoo, in fact), and I can tell you from experience that
unless things have changed greatly since kernel 3.12.1, it really isn't
worth the headaches.  Setting it up on an already installed system is a
serious pain because the backing device has to be reformatted with a
bcache super-block.  In addition, every kernel that I have tried that
had bcache compiled in or loaded as a module had issues; I would see a
kernel OOPS on average once a day from the bcache code, usually followed
shortly by a panic from some other unrelated subsystem.  I didn't get
any actual data corruption, but I wasn't using btrfs at the time for any
of my filesystems.
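
For comparison, a from-scratch setup is only a few commands (hypothetical
device names; the first one replaces whatever filesystem is on the backing
partition, which is exactly why in-place migration is painful):

    make-bcache -B /dev/sdb1    # write a bcache super-block to the backing device
    make-bcache -C /dev/sda4    # format the SSD partition as a cache set
    # attach the backing device to the cache set by the cache set's UUID
    echo <cache-set-uuid> > /sys/block/bcache0/bcache/attach
    mkfs.btrfs /dev/bcache0     # the filesystem then lives on /dev/bcache0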

As an alternative to using bcache, you might try something similar to
the following:
    64G SSD with /boot, /, and /usr
    Other HDD with /var, /usr/portage, /usr/src, and /home
    tmpfs or ramdisk for /tmp and /var/tmp
This is essentially what I use now, and I have found that it
significantly improves system performance.
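
A sketch of what that split can look like in /etc/fstab - the device names,
filesystem types and tmpfs sizes here are only illustrative assumptions:

    /dev/sda2   /          btrfs   noatime,ssd         0 0
    /dev/sda3   /usr       btrfs   noatime,ssd         0 0
    /dev/sdb1   /home      ext4    noatime             0 0
    /dev/sdb2   /var       ext4    noatime             0 0
    tmpfs       /tmp       tmpfs   size=4G,mode=1777   0 0
    tmpfs       /var/tmp   tmpfs   size=8G             0 0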


* Re: Migrate to bcache: A few questions
  2013-12-30  9:01     ` Marc MERLIN
@ 2013-12-31  0:31       ` Kai Krakow
  0 siblings, 0 replies; 14+ messages in thread
From: Kai Krakow @ 2013-12-31  0:31 UTC (permalink / raw)
  To: linux-btrfs

Marc MERLIN <marc@merlins.org> schrieb:

> On Mon, Dec 30, 2013 at 02:22:55AM +0100, Kai Krakow wrote:
>> These thoughts are actually quite interesting. So you are saying that data
>> may not be fully written to SSD although the kernel thinks so? This is
> 
> That, and worse.
> 
> Incidently, I have just posted on my G+ about this:
> https://plus.google.com/106981743284611658289/posts/Us8yjK9SPs6
> 
> which is mostly links to
> http://lkcl.net/reports/ssd_analysis.html
> https://www.usenix.org/conference/fast13/understanding-robustness-ssds-under-power-fault
> 
> After you read those, you'll never think twice about SSDs and data loss
> anymore :-/
> (I kind of found that out myself over time too, but these have much more
> data than I got myself empirically on a couple of SSDs)

The bad thing is: even battery-backed RAID controllers won't help you 
here. I'm starting to understand why I still don't trust this new technology 
entirely.

Thanks,
Kai



* Re: Migrate to bcache: A few questions
  2013-12-30  6:24 ` Duncan
@ 2013-12-31  3:13   ` Kai Krakow
  0 siblings, 0 replies; 14+ messages in thread
From: Kai Krakow @ 2013-12-31  3:13 UTC (permalink / raw)
  To: linux-btrfs

Duncan <1i5t5.duncan@cox.net> schrieb:

[ spoiler: tldr ;-) ]

>> * How stable is it? I've read about some csum errors lately...
> 
> FWIW, both bcache and btrfs are new and still developing technology.
> While I'm using btrfs here, I have tested-usable (which for root means
> either directly bootable, or that you have tested booting to a
> recovery image and restoring from there; I do the former here) backups,
> as STRONGLY recommended for btrfs in its current state, but haven't had
> to use them.
> 
> And I considered bcache previously and might otherwise be using it, but
> at least personally, I'm not willing to try BOTH of them at once, since
> neither one is mature yet and if there are problems as there very well
> might be, I'd have the additional issue of figuring out which one was the
> problem, and I'm personally not prepared to deal with that.

I mostly trust btrfs by now. Don't get me wrong: I still have my 
nightly backup job syncing the complete system to an external drive - 
nothing beats a good backup. But btrfs has reliably survived multiple 
power losses, kernel panics/freezes, unreliable USB connections, ... It 
looks very stable from that point of view. Yes, it may have bugs that could 
introduce errors fatal to the filesystem structure. But generally, under 
usual workloads it has proven stable for me. At least for desktop workloads.
 
> Instead, at this point I'd recommend choosing /either/ bcache /or/ btrfs,
> and using bcache with a more mature filesystem like ext4 or (what I used
> for years previous and still use for spinning rust) reiserfs.

I used reiserfs for several years, a long time ago. But it absolutely does 
not scale well for parallel/threaded workloads, which is a show stopper for 
server workloads. Still, it always survived even the worst failure scenarios 
(like the SCSI bus going offline for some RAID members), and the tools 
distributed with it were able to recover all data even if the FS was damaged 
beyond anything you would normally try when it no longer mounts. 
I was with Ext3 before, and more than once a simple power loss during high 
server workload destroyed the filesystem beyond repair, with fsck only 
making it worse.

Since reiserfs did not scale well and the ext* FS had annoyed me more than 
once, we decided to go with XFS. While it tends to wipe some data after 
power loss and leaves you with zero-filled files, it has proven extremely 
reliable even under those situations mentioned above, like a dying SCSI bus. 
Not to the extent reiserfs did, but still very satisfying. The big plus: it 
scales extremely well with parallel workloads and can be optimized for the 
stripe configuration of the underlying RAID layer. So I made it my default 
filesystem for the desktop, too - with the above-mentioned annoying 
"feature" of zeroing out recently touched files when the system crashes. 
But well, we all have proven backups, right? Yep, I also learned that 
lesson... *sigh*

But btrfs, when it was first announced and while I was already jealously 
looking at ZFS, seemed to be the FS of my choice, giving me flexible RAID 
setups, snapshots... I'm quite happy with it although it feels slow 
sometimes. I simply threw more RAM at it - now it is okay.


> And as I said, keep your backups as current as you're willing to deal
> with losing what's not backed up, and tested usable and (for root) either
> bootable or restorable from alternate boot, because while at least btrfs
> is /reasonably/ stable for /ordinary/ daily use, there remain corner-
> cases and you never know when your case is going to BE a corner-case!

I've got a small rescue system I can boot which has btrfs-tools and a recent 
kernel, to flexibly repair, restore, or do whatever else I want with my 
backup. My backup itself is not bootable (although it probably could be, if 
I changed some configuration files).

>> * I want to migrate my current storage to bcache without replaying a
>> backup.  Is it possible?
> 
> Since I've not actually used bcache, I won't try to answer some of these,
> but will answer based on what I've seen on the list where I can...  I
> don't know on this one.

I remember someone created some python scripts to make it possible - wrt 
btrfs especially. Can't remember the link. Maybe I'm able to dig it up. But 
at least I read it as: there's no direct migration path provided by bcache 
itself. I had hoped otherwise...

>> * Did others already use it? What is the perceived performance for
>> desktop workloads in comparison to not using bcache?
> 
> Others are indeed already using it.  I've seen some btrfs/bcache problems
> reported on this list, but as mentioned above, when both are in use that
> means figuring out which is the problem, and at least from the btrfs side
> I've not seen a lot of resolution in that regard.  From here it /looks/
> like that's simply being punted at this time, as there's still more
> easily traceable problems without the additional bcache variable to work
> on first.  But it's quite possible the bcache list is actively tackling
> btrfs/bcache combination problems, as I'm not subscribed there.
> 
> So I can't answer the desktop performance comparison question directly,
> but given that I /am/ running btrfs on SSD, I /can/ say I'm quite happy
> with that. =:^)

Well, I'm most interested in bcache+btrfs, so I put my questions here on 
this list - although I have to admit that most of them would've been better 
placed on the bcache list.

Small sidenote: I'm subscribed through the gmane NNTP proxy to all these 
lists, using a native NNTP reader. I can really recommend it. Subscribing to 
the list for post access is also very easy. You may want to look into it. 
;-)

> Keep in mind...
> 
> We're talking storage cache here.  Given the cost of memory and common
> system configurations these days, 4-16 gig of memory on a desktop isn't
> unusual or cost prohibitive, and a common desktop working set should well
> fit.

I have 16 gigs of memory. I started with 8, but it was insanely cheap 
when I upgraded my mainboard, only €30 for 8 gigs - so I threw in another 
pair of 4 gig modules. Never regretted it...

> I suspect my desktop setup, 16 gigs memory backing a 6-core AMD fx6100
> (bulldozer-1) @ 3.6 GHz, is probably a bit toward the high side even for
> a gentooer, but not inordinately so.  Based on my usage...

Mine is 16 gigs, Core i5 quad @ 3.3 GHz (with this turbo boost thingy) - so, 
well, no. I think both your setup and mine are decent but not extraordinary. 
;-)

> Typical app memory usage runs 1-2 GiB (that's with KDE 4.12.49.9999 from
> the gentoo/kde overlay, but USE=-semantic-desktop, etc).  Buffer memory
> runs a few MiB but isn't normally significant, so it can fold into that
> same 1-2 GiB too.

Similar observation here, though I'm using semantic-desktop: memory usage 
rarely goes above 3-4 GB in KDE during usual workloads, so the rest is 
mostly dedicated to cache. Still, btrfs feels very sluggish from time to 
time. Thus my idea of throwing an SSD with bcache into the equation. This 
sluggishness came quite suddenly with one of the kernel updates, though I 
don't remember which, probably between 3.7 and 3.8... I've mitigated it 
mostly by ramping up the IO queue depth... A lot... From the default of 128 
to 8192. My amount of RAM allows it - so what... ;-)
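
The queue-depth knob in question, in case anyone wants to try it - it's per 
block device (sda is just an example) and resets on reboot unless made 
persistent via udev or an init script:

    cat /sys/block/sda/queue/nr_requests          # default is 128
    echo 8192 > /sys/block/sda/queue/nr_requests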

> When I'm doing multi-job builds or working with big media files, I'll
> sometimes go above 8 gig usage, and that occasional cache-spill was why I
> upgraded to 16 gig.  But in practice, 10 gig would take care of that most
> of the time, and were it not for the "accident" of powers-of-two meaning
> 16 gig is the notch above 8 gig, 10 or 12 gig would be plenty.  Truth be
> told, I so seldom use that last 4 gig that it's almost embarrassing.

Same observation here: 8 gigs are usually enough for almost any workload; 
just sometimes an extra 2-3 gigs was needed. But well, it was 
cheap. Why spend 20 bucks on 4 gigs if I could get 8 for 30 bucks? :-) And 
while I never measured it, nor looked at how today's systems organize 
memory, I still believe in memory interleaving and thus always buy in pairs.

> * Tho if I ran multi-GiB VMs that'd use up that extra memory real fast!
> But while that /is/ becoming more common, I'm not exactly sure I'd
> classify 4 gigs plus of VM usage as "desktop" usage just yet.
> Workstation, yes, and definitely server, but not really desktop.

I want a snappy system without having to bother with how to distribute it 
over different storage techniques, which both have their distinct 
limitations. So bcache+btrfs is a solution to have the best of _all_ worlds - 
like having your cake and eating it, too. And while 16 gigs of RAM, using 
preload, and btrfs distributed across 3 devices gave me a pretty snappy 
system, it has suffered a lot due to the above-mentioned "kernel incident". 
It never came back to its old snappiness. So I feel the urge to move forward.

[...]
> Given the stated 3 x 1TB drive btrfs in raid1 metadata, raid0 data, config
> you mention, I'm wondering if big media is a/the big use case for you, in
> which case bcache isn't going to be a good solution anyway, since that
> tends to be sequential access, which bcache deliberately ignores as it
> doesn't fit the model it's targeting.

Well, use cases are as follows:

  * The system is also connected to my TV by HDMI
  * used for HTPC functions (just playback) with XBMC
  * used for Steam (occasionally playing games on the big screen)
  * used for development (and I always keep loads of tabs open in the
    browser then)
  * this involves git, restarting dev servers, compiling
  * VMs for these Windows-only things (but this is rare)
  * the usual Gentoo compiling, you know it...

So, bcache could probably help those situations where I want snappiness. And 
in the long term I'm planning to add another HDD and go with btrfs RAID10 
instead.

> (I am a bit worried about that raid0 data, tho.  Unless you consider that
> data of trivial value that's not a good choice, since raid0 generally
> means you lose it all if you lose a physical device.  And you're running
> three devices, which means you just tripled the chance of a device
> failure over that of just putting it all on a single 3 TB drive!  And
> backups... a 3 TB restore on spinning rust will take some time any way
> you look at it, so backups may or may not be particularly viable here.

I have a working backup with backlog. I got the 1TB drives incredibly cheap, 
so they were the option of choice. And I feel big drives with high data 
density are not as reliable as smaller, technically proven drives 
(manufactured after technology had already moved on to bigger platters).

> The most common use case for that much data is probably a DVR scenario,
> which is video, and you may well consider it of low enough value that if
> you lose it, you lose it, and you're willing to take that risk, but for
> normally sequential access video/media, bcache isn't a good match anyway.)

I'm in the process of sorting out all my CDs and DVDs with archived data on 
them. Such media is unreliable - more so than my current setup. I've been 
with LVM and XFS before and it was always a headache to swap storage easily. 
With btrfs it is very easy, and RAID striping comes for free. With my 
previous LVM setup I used a JBOD configuration and I wasn't entirely happy 
with it. And then, there's still the long-term goal of migrating to RAID-10.

> * With memory cost what it is, for repeat access where initial access
> time isn't /too/ critical, investing in more memory, to a point (for me,
> 8-12 gig as explained above), and simply letting the kernel manage cache
> and memory as it normally does, may make more sense than bcache to an ssd.

This is why I already have 16 gigs. But I feel bcache would improve cold 
starts of applications and the system.

> * Of course, what bcache *DOES* effectively do, is extend the per-boot
> cache time of memory, making the cache persistent.  That effectively
> extends the time over which "occasional access" still justifies caching
> at all.

That is the plan. ;-)

> * That makes bcache well suited to boot-time and initial-access-speed-
> critical scenarios, where more memory for a larger in-memory cache won't
> do any good, since it's first-access-since-boot, because for in-memory
> cache that's a cold-cache scenario, while with bcache's persistent cache,
> it's a hot-cache scenario.

Ditto.

> But what I'm actually wondering is if your use case better matches a
> split data model, where you put root and perhaps stuff like the portage
> tree and/or /home on fast SSD, while keeping all that big and generally
> sequential access media on slower but much cheaper big spinning rust.

I hate partitioning. I don't want to micro-optimize my partition setup when 
a solution like bcache could provide similar improvements without the 
downsides of such partitioning decisions. That's the point.

> That's effectively what I've done here, tho I'm looking at rather less
> than a TB of slow-access media, etc.  See below for the details.  The
> general idea is as I said to stick all the time-critical stuff on SSD
> directly (not using something like bcache), while keeping the slower
> spinning rust for the big less-time-critical and sequential-access stuff,
> and for non-btrfs backups of the stuff on the btrfs-formatted SSD, since
> btrfs /is/ after all still in development, and I /do/ intend to be
> prepared if /my/ particular case ends up being one of the corner-cases
> btrfs still worst-cases on.

I have no problem with the time a restore from backup takes. I'm not that 
dependent on the system. In case of time-critical stuff I would just bind-
mount the home backup into a rescue system, or just sync the home directory 
from backup to a (slow) spare system I've got somewhere that usually just 
collects dust. That's a tested setup. In the worst case there's a Gentoo VM 
in my office with an almost identical software setup which I could just 
attach my disk to and mount my home on, and then even work remotely on it. 
At least both these spare-system setups would work as an emergency 
replacement for important work. All the rest is not that important. If I 
lose the entertainment part of my system: sigh, annoying, but well: not 
important. All those mostly static files are in the backup and I'm not 
dependent on them in a time-critical manner. The critical working set can 
be implanted into spare systems.

>> * How well does bcache handle power outages? Btrfs has handled them very
>>   well for many months now.
> 
> Since I don't run bcache I can't really speak to this at all, /except/,
> the btrfs/bcache combo trouble reports that have come to the list have I
> think all been power outage or kernel-crash scenarios... as could be
> predicted of course since that's a filesystem's worst-case scenario, at
> least that it has to commonly deal with.
> 
> But I know I'd definitely not trust that case, ATM.  Like I said, I'd not
> trust the combination of the two, and this is exactly where/why.  Under
> normal operation, the two should work together well.  But in a power-loss
> situation with both technologies being still relatively new and under
> development... not *MY* data!

The question is: will it eat my data twice a day or twice a year? I could 
live with the latter; I have no time for the former, though. But I'm 
interested in helping the community by testing this. The problem isn't 
actually with bcache+btrfs destroying my system beyond repair and having to 
restore from backup. My problem is with silent data corruption it may 
introduce. My backup strategy won't protect me from that although I have 
several weeks of backlog. And putting just unimportant stuff I seldom work 
with on it would not help the situation: first, I would not really test the 
setup; second, I would not really take advantage of the setup. It would be 
useless.

>> * How well does it play with dracut as initrd? Is it as simple as
>> telling it the new device nodes or is there something complicated to
>> configure?
> 
> I can't answer this at all for bcache, but I can say I've been relatively
> happy with the dracut initramfs solution for dual-device btrfs raid1
> root. =:^)  (At least back when I first set it up several kernels ago,
> the kernel's commandline parser apparently couldn't handle the multiple
> equals of something like rootflags=device=/dev/sda5,device=/dev/sdb5.  So
> the only way to get a multi-device btrfs rootfs to work was to use an
> initr* with userspace btrfs device scan before attempting to mount real-
> root, and dracut has worked well for that.)

It worked for me by adding rootdelay=2 to the cmdline. And I had to add a 
symlink into the dracut ramfs builder because the scripts expect the btrfs 
binaries somewhere other than where they install to. I now use root=UUID=xxxx 
and it works like a charm.
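
On Gentoo with GRUB2 that boils down to something like this (the UUID is a 
placeholder, and the grub.cfg path and tool name may differ per install):

    # /etc/default/grub
    GRUB_CMDLINE_LINUX="root=UUID=<fs-uuid> rootdelay=2"

    grub2-mkconfig -o /boot/grub/grub.cfg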

>> * How does bcache handle a failing SSD when it starts to wear out in a
>> few years?
> 
> Given the newness of the bcache technology, assuming your SSD doesn't
> fail early and it is indeed a few years, I'd suggest that question is
> premature.  Bcache will by that time be much older and more mature than
> it is now, and how it'd handle, or fail to handle, such an event /now/
> likely hasn't a whole lot to do with how much (presumably) better it'll
> handle it /then/.

Well, good point. And I've read through some links (thanks to the other 
posters here) which show that bcache already has some countermeasures for 
this situation. So at least it is designed with such problems in mind. From 
that view it looks good to me. My biggest problem is probably that I don't 
really trust SSDs yet, given my office background where SSDs fail in dumb 
ways just due to certain workloads applied by Windows systems. Then you 
update BIOS and firmware, and tada: problems gone. BUT: this just implies 
that SSDs are far from mature. And then, I believe some manufacturers just 
did not figure out how to do wear-leveling really correctly. While HDDs 
usually fail in a soft way (some sectors no longer working, time for 
replacement), I usually read of SSDs dying from one minute to the next, 
unpredictably - failing the hard way, from working flawlessly to everything 
lost. Or they start introducing silent data corruption, which is much worse 
(like ack'ing writes but after reboot it looks like nothing was ever 
written).

>> * Is it worth waiting for hot-relocation support in btrfs to natively
>> use a SSD as cache?
> 
> I wouldn't wait for it.  It's on the wishlist, but according to the wiki
> (project ideas, see the dm_cache or bcache like cache, and the hybrid
> storage points), nobody has claimed that project yet, which makes it
> effectively status "bluesky", which in turn means "nice idea, we might
> get to it... someday."

One guy from this list was working on it - I remember it, though not his 
name. And he had patches. I liked the idea. It could probably work better 
than bcache due to not being filesystem-agnostic.

> Given the btrfs project history of everything seeming to take rather
> longer than the original, as-it-turned-out wildly optimistic, projections, in
> the absence of a good filesystem dev personally getting that specific
> itch to scratch, that means it's likely a good two years out, and may be
> 5-10.  So no, I'd definitely *NOT* wait on it!

The well-known mature filesystems (ext, XFS, ...) are all probably 20 years 
old or more. Btrfs is maybe 5 years old now? It should start becoming 
feature-complete now, and I think the devs are driven by similar emotions. 
Then give it another 5 years to work out all the bugs and performance 
problems. At least from my dev background I know that the time needed to 
code a feature-complete codebase is about the same as the time needed for 
testing and optimizing the system. I suppose it will then follow an 
evolution similar to that of the ext family of filesystems, adding new 
features while maintaining the on-disk format as well as possible, or at 
least enabling easy forward-migration, so users have a choice between proven 
stability and new features. At least this is what I hope and wish. ;-)

>> * Would you recommend going with a bigger/smaller SSD? I'm planning to
>> use only 75% of it for bcache so wear-leveling can work better, maybe
>> use another part of it for hibernation (suspend to disk).
> 
> FWIW, for my split data, some on SSD, some on spinning rust, setup, I had
> originally planned perhaps a 64 gig or so SSD, figuring I could put the
> boot-time-critical rootfs and a few other initial-access-time-critical
> things on it, with a reasonable amount of room to spare for wear-
> leveling.  Maybe 128 gig or so, with a bit more stuff on it.

The calculation behind this is that about 7% of the flash memory is already 
reserved for wear-leveling: the chips come in powers of two (e.g., 128 
gigs), while the drive announces a more human-friendly size by cutting 
about 7% away (here: 120 gigs). But according to multiple sources, 
reserving only 7% for wear-leveling does not provide good performance in 
many workloads; the recommendation is to go with 30-50%. So, staying with 
the numbers: by using 90 gigs instead of the fully provisioned 120 gigs 
(that's 75% of the announced size), I effectively have a reserve of about 
30% for wear-leveling (90 : 128 ~= 0.7).
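
Spelled out as a quick illustrative calculation (keeping the same loose 
"gigs" as above and ignoring the GB-vs-GiB distinction), roughly:

    # Rough over-provisioning estimate; numbers are the examples from above.
    raw_flash  = 128   # physical NAND capacity
    advertised = 120   # capacity the drive announces
    used       = 90    # the part I would actually hand to bcache

    factory_reserve   = 1 - advertised / raw_flash   # 1 - 120/128 ~= 0.06
    effective_reserve = 1 - used / raw_flash         # 1 - 90/128  ~= 0.30

    print(f"factory reserve:   {factory_reserve:.0%}")    # ~6-7%
    print(f"effective reserve: {effective_reserve:.0%}")  # ~30%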

> There were some smaller ones available, but
> they tended to be either MUCH slower or MUCH higher priced, I'd guess
> left over from a previous generation before prices came down, and they
> simply hadn't been re-priced to match current price/capacity price-points.

The performance drop is probably explained by the fact that the drives 
stripe internally across their flash chips, and smaller drives have fewer 
chips to stripe across. You can see this performance drop even with modern 
drives if you look at comparison tests. It's not an effect of old 
technology only.

I think my system's performance is mostly limited by seeks, not by 
throughput, so bcache came to mind as the solution. Even a cheap SSD would 
still be fast enough to deliver the throughput I usually see in the system 
monitor with my HDDs - but with the bonus of more or less zero seek time. I 
don't think I have to optimize for throughput.

This is similar to how throwing more RAM into a system usually gives a 
better performance boost than throwing more CPU at it: more CPU improves 
throughput - but the best throughput does not help if seeking is the 
limiting factor (i.e., having to re-read data from disk because the cache 
and RAM are under pressure).
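
To illustrate with made-up but typical numbers why I call my workload 
seek-limited rather than throughput-limited:

    # Assumed, typical figures - not measurements of my drives.
    seek_ms   = 10      # average seek + rotational latency, 7200 rpm HDD
    req_kib   = 4       # typical small random read
    seq_mib_s = 120     # sequential throughput of the same HDD

    random_mib_s = (req_kib / 1024) / (seek_ms / 1000)   # ~0.4 MiB/s
    print(f"random 4 KiB reads: ~{random_mib_s:.1f} MiB/s, "
          f"vs ~{seq_mib_s} MiB/s sequential")

Even a modest SSD closes most of that gap, which is the whole attraction of 
bcache for me.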

> But much below 128 GiB (there were some 120 GB at about the same per-gig,
> which "units" says is just under 112 GiB) and the price per gig tends to
> go up, while above 256 GB (not GiB) both the price per gig and full price
> tend to go up.

Yes, I too figured that currently you'd best go with something in the 128 
gigs range. But a difference of around 60 gigs just isn't important for me 
yet. If it were 120 vs 240 gigs (or even more) it would become interesting. 
In this range I'd probably prefer the lower price over the higher capacity 
- but I haven't finally made up my mind about that.

> Of course that means if you do actually do bcache, 60-ish gigs should be
> good and I'd guess 128 gig would be overkill, as I guess 40-60 gigs
> probably about what my "hot" data is, the stuff bcache would likely catch.

That's the point. In higher capacity ranges it becomes interesting for more 
purposes than just bcache. But for the moment, and because I do not really 
trust this technology for storing all sorts of data with different access 
patterns, I just want to try it out and see what effect it has.

> And 60 gig will likely be /some/ cheaper tho not as much as you might
> expect, but you'll lose flexibility too, and/or you might actually pay
> more for the 60 gig than the 120 gig, or it'll be slower speed-rated.
> That was what I found when I actually went out to buy, anyway.

It's like 50% of the capacity for 75% of the price - not a very good deal. 
But throughput is not my prime target, I guess (unless you'd like to teach 
me otherwise wrt the points above), and excess capacity is currently 
useless to me. So I'd probably go with the worse deal anyway.

> I have a separate boot partition on each of the SSDs, with grub2
> installed to both SSD separately, pointing at its own /boot. with
> the SSD I boot selectable in BIOS.  That gives me a working /boot
> and a primary /boot backup.  I run git kernels and normally
> update the working /boot with a new kernel once or twice a week,
> while only updating the backup /boot with the release kernel, so
> every couple months.

Similar here: all hard disks use the same partitioning layout, and I can 
use the spare space for /boot backups, EFI, or a small rescue system.

> 4	640 MiB /var/log (btrfs mixed-mode, raid1 data/metadata)
> 
> That gives me plenty of log space as long as logrotate doesn't
> break, while still keeping a reasonable cap on the log partition
> in case I get a runaway log.

I'm using journald as my only logger, and it does its housekeeping well.

> As any good sysadmin should know,
> some from experience (!!), keeping a separate log partition is a
> good idea, since that limits the damage if something /does/ go
> runaway logging.

Then take me as a good sysadmin: that's why the servers I administer are 
partitioned with these thoughts in mind. Since those servers run in VMs, I 
have no problem with partitioning there (in contrast to my "hate" above), 
because I can grow disk images and add new virtual drives without problems. 
My partitioning scheme in VMs is thus very simple: one partition per drive. 
So in the end there is no conflict with my hatred of partitioning, because 
there is no real partitioning going on. ;-)

In my opinion, partitioning is a remnant of ancient times and should go 
away. Volume pooling as supported by ZFS or btrfs is the way to go. In VMs 
I can more or less emulate it by putting thin-provisioned virtual disk 
images in the datastore.

> My rootfs includes (almost) all "installable" data, everything
> installed by packages except for /var/lib, which is a symlink to
> /home/var/lib.  The reason for that is that I keep rootfs mounted
> read-only by default, only mounting it read-write for updates or
> configuration changes, and /var/lib needs to be writable.  /home
> is mounted writable, thus the /var/lib symlink pointing into it.

I also started to hate such hacks; I used them too, in the past. Volume 
pooling is the way to go. I'm still waiting for btrfs to be able to mount 
subvolumes read-only...

> I learned the hard way to keep everything installed (but for
> /var/lib) on the same filesystem, along with the installed-
> package database (/var/db/pkg on gentoo), when I had to deal with
> a recovery situation with rootfs, /var, and /usr on separate
> partitions, recovering each one from a backup made at a different
> time!  Now I make **VERY** sure everything stays in sync, so
> the installed-package database matches what's actually installed.

Actually, I guess you now see some of the background to why I hate 
partitions. ;-) But that's only one reason. The other is: you will always 
run out of space on one of them, with no way to redistribute space easily 
as needed. This is also why I pooled my HDDs together into one btrfs 
instead of putting different-purpose partitions on them. The draid-0 came 
for free, and since metadata is much more critical, it is mraid-1. But I 
have never yet had the problem that the btrfs driver needed to repair from 
a good metadata block. ;-)

[most personal stuff snipped, feel free to PM]

> (My SSDs, Corsair Neutron series, run a LAMD (Link A
> Media Devices) controller.  These don't have the compression
> or dedup features of something like the sandforce controllers,
> but the Neutrons at least (as opposed to the Neutron GTX) are
> enterprise targeted, with the resulting predictable performance,
> capacity and reliability bullet-point features.  What you save to
> the SSD is saved as-you-sent-it, regardless of compressibility or
> whether it's a dup of something else already on the SSD.  Thus,
> at least with my SSDs, the redundant working and backup copies
> are actually two copies on the SSD as well, not one compressed/
> dedupped copy.  That's a very nice confidence point when the
> whole /point/ of sending two copies is to have a backup!  So
> for anyone reading this that decides to do something similar,
> be sure your SSD firmware isn't doing de-duping in the background,
> leaving you with only the one copy regardless of what you thought
> you might have saved!)

This is actually an important point when thinking of putting btrfs with raid 
features on such drives, or enabling btrfs compression.

> Still on spinning rust, meanwhile, all my filesystems remain the many-
> years-stable reiserfs.  I keep a working and backup media partition
> there, as well as second backup partitions for everything on btrfs on the
> ssds, just in case.

As initially mentioned, reiserfs has proven extremely reliable for me too, 
even in disastrous circumstances. No other FS has ever kept up with that. 
But its scaling in multi-process IO situations, the kind mostly seen on 
busy servers, is - to put it mildly - bad, and XFS was the best candidate 
to fill that gap. I lost my trust in the ext family of filesystems long 
ago, so that was no option. Yes, I know: it's mature, it's proven, it's 
stable. But when it gets corrupted for whatever reason, my experience is 
that the chances of recovery are virtually non-existent. Let's see how 
btrfs works for me over the next years. It's not there yet, and performance 
is worse in almost any workload... but it has some compelling features. ;-)

> I figure if the external gets taken out too, say by fire if my house
> burnt down or by theft if someone broke in and stole it, I'd have much
> more important things to worry about for awhile, then what might have
> happened to my data!

Right. But still, my important working set (read: git repos, intellectual 
goods and activity, ...) is always mirrored elsewhere. And btrfs snapshots 
help against accidental damage.

> And once I did get back on my feet and ready to
> think about computing again, much of the data would be sufficiently
> outdated as to be near worthless in any case.  At that point I might as
> well start from scratch but for the knowledge in my head, and whatever
> offsite or the like backups I might have had probably wouldn't be worth
> the trouble to recover anyway, so that's beyond cost/time/hassle
> effective and I don't bother.

I once lost my whole working set - that's when I started to hate FAT. That 
was probably 15 years ago but it still bugs me; I only had very few and 
vastly outdated backups. Lessons learned. So maybe it is not as easy to say 
"I don't bother" as you think now. Just my two cents...


Sorry for the noise, list.

I'm enjoying the discussion with you - I suggest we get in touch by PM if 
this drifts any further away from btrfs...

Thanks,
Kai


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Migrate to bcache: A few questions
  2013-12-30 16:02 ` Austin S Hemmelgarn
@ 2014-01-01 10:06   ` Duncan
  2014-01-01 20:12   ` Austin S Hemmelgarn
  1 sibling, 0 replies; 14+ messages in thread
From: Duncan @ 2014-01-01 10:06 UTC (permalink / raw)
  To: linux-btrfs

Austin S Hemmelgarn posted on Mon, 30 Dec 2013 11:02:21 -0500 as
excerpted:

> I've actually tried a similar configuration myself a couple of times
> (also using Gentoo in fact), and I can tell you from experience that
> unless things have changed greatly since kernel 3.12.1, it really isn't
> worth the headaches.

Basically what I posted, but "now with added real experience!" (TM) =:^)

> As an alternative to using bcache, you might try something similar to
> the following:
>     64G SSD with /boot, /, and /usr
>     Other HDD with /var, /usr/portage, /usr/src, and /home
>     tmpfs or ramdisk for /tmp and /var/tmp
> This is essentially what I use now, and I have found that it
> significantly improves system performance.

Again, very similar to my own recommendation.  Nice to see others saying 
the same thing. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Migrate to bcache: A few questions
  2013-12-30 16:02 ` Austin S Hemmelgarn
  2014-01-01 10:06   ` Duncan
@ 2014-01-01 20:12   ` Austin S Hemmelgarn
  2014-01-02  8:49     ` Duncan
  1 sibling, 1 reply; 14+ messages in thread
From: Austin S Hemmelgarn @ 2014-01-01 20:12 UTC (permalink / raw)
  To: Kai Krakow, linux-btrfs

On 12/30/2013 11:02 AM, Austin S Hemmelgarn wrote:
> 
> As an alternative to using bcache, you might try something similar to
> the following:
>     64G SSD with /boot, /, and /usr
>     Other HDD with /var, /usr/portage, /usr/src, and /home
>     tmpfs or ramdisk for /tmp and /var/tmp
> This is essentially what I use now, and I have found that it
> significantly improves system performance.
> 
On this specific note, I would actually suggest against putting the portage 
tree on btrfs; it makes syncing go ridiculously slow, and it also seems to 
slow down emerge as well.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Migrate to bcache: A few questions
  2014-01-01 20:12   ` Austin S Hemmelgarn
@ 2014-01-02  8:49     ` Duncan
  2014-01-02 12:36       ` Austin S Hemmelgarn
  0 siblings, 1 reply; 14+ messages in thread
From: Duncan @ 2014-01-02  8:49 UTC (permalink / raw)
  To: linux-btrfs

Austin S Hemmelgarn posted on Wed, 01 Jan 2014 15:12:40 -0500 as
excerpted:

> On 12/30/2013 11:02 AM, Austin S Hemmelgarn wrote:
>> 
>> As an alternative to using bcache, you might try something similar to
>> the following:
>>     64G SSD with /boot, /, and /usr
>>     Other HDD with /var, /usr/portage, /usr/src, and /home
>>     tmpfs or ramdisk for /tmp and /var/tmp
>> This is essentially what I use now, and I have found that it
>> significantly improves system performance.
>> 
> On this specific note, I would actually suggest against putting the
> portage tree on btrfs, it makes syncing go ridiculously slow,
> and it also seems to slow down emerge as well.

Interesting observation.

I had not seen it here (with the gentoo tree and overlays on btrfs), but 
that's very likely because all my btrfs are on SSD; I upgraded to both at 
the same time, since my previous default filesystem choice, reiserfs, isn't 
well suited to SSD due to the excessive writing caused by its journaling.

I do know slow syncs and portage dep-calculations were among the reasons I 
switched to SSD (and thus btrfs), however.  That was getting pretty painful 
on spinning rust, at least with reiserfs.  And I imagine btrfs on 
single-device spinning rust would, if anything, be worse, at least for 
syncs, due to the default dup metadata: at least three writes (and three 
seeks) for each file, once for the data and twice for the metadata.
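
To put very rough numbers on that (file count and seek times are 
assumptions for illustration, not measurements):

    # Back-of-the-envelope: why a tree sync hurts on single-device spinning
    # rust with dup metadata.  All figures are assumed, not measured.
    files_touched   = 20_000    # hypothetical files (re)written by a sync
    writes_per_file = 1 + 2     # data + two metadata copies (dup profile)
    avg_seek_s      = 0.010     # ~10 ms per seek, typical 7200 rpm disk

    hdd_seek_time = files_touched * writes_per_file * avg_seek_s
    ssd_seek_time = files_touched * writes_per_file * 0.0001  # ~0.1 ms

    print(f"HDD, seek overhead alone: ~{hdd_seek_time / 60:.0f} min")
    print(f"SSD, seek overhead alone: ~{ssd_seek_time:.0f} s")

Real numbers will differ (batching, elevator, delalloc, etc.), but it shows 
why seeks, not raw bandwidth, dominate that kind of workload.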

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Migrate to bcache: A few questions
  2014-01-02  8:49     ` Duncan
@ 2014-01-02 12:36       ` Austin S Hemmelgarn
  2014-01-03  0:09         ` Duncan
  0 siblings, 1 reply; 14+ messages in thread
From: Austin S Hemmelgarn @ 2014-01-02 12:36 UTC (permalink / raw)
  To: Duncan, linux-btrfs

On 2014-01-02 03:49, Duncan wrote:
> Austin S Hemmelgarn posted on Wed, 01 Jan 2014 15:12:40 -0500 as
> excerpted:
> 
>> On 12/30/2013 11:02 AM, Austin S Hemmelgarn wrote:
>>>
>>> As an alternative to using bcache, you might try something similar to
>>> the following:
>>>     64G SSD with /boot, /, and /usr
>>>     Other HDD with /var, /usr/portage, /usr/src, and /home
>>>     tmpfs or ramdisk for /tmp and /var/tmp
>>> This is essentially what I use now, and I have found that it
>>> significantly improves system performance.
>>>
>> On this specific note, I would actually suggest against putting the
>> portage tree on btrfs, it makes syncing go ridiculously slow,
>> and it also seems to slow down emerge as well.
> 
> Interesting observation.
> 
> I had not seen it here (with the gentoo tree and overlays on btrfs), but
> that's very likely because all my btrfs are on SSD, as I upgraded to both 
> at the same time, because my previous default filesystem choice, 
> reiserfs, isn't well suited to SSD due to excessive writing due to the 
> journaling.
> 
> I do know slow syncs and portage dep-calculations were one of the reasons 
> I switched to SSD (and thus btrfs), however.  That was getting pretty 
> painful on spinning rust, at least with reiserfs.  And I imagine btrfs on 
> single-device spinning rust would if anything be worse at least for 
> syncs, due to the default dup metadata, meaning at least three writes 
> (and three seeks) for each file, once for the data, twice for the 
> metadata.
> 
I think the triple seek+write is probably the biggest offender in my
case, although COW and autodefrag probably don't help either.  I'm kind
of hesitant to put stuff that gets changed daily on an SSD, so I've ended
up putting portage on ext4 with no journaling (which outperforms every
other filesystem I have tested WRT write performance).  As for the
dep-calculations, I have 16G of ram, so I just use a script to read the
entire tree into the page cache after each sync.
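
(The script is nothing fancy; something along these lines does the job. 
This is just a sketch of the idea rather than my exact script, and it 
assumes the default /usr/portage location for the tree:)

    #!/usr/bin/env python3
    # Cache-warmer sketch: walk the portage tree and read every file so
    # its contents end up in the page cache.  Adjust TREE to your PORTDIR.
    import os

    TREE = "/usr/portage"

    for root, _dirs, files in os.walk(TREE):
        for name in files:
            path = os.path.join(root, name)
            try:
                with open(path, "rb") as f:
                    while f.read(1 << 20):   # 1 MiB chunks, data discarded
                        pass
            except OSError:
                pass                         # skip vanished/unreadable files

A plain tar of the tree to /dev/null would achieve much the same thing.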

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Migrate to bcache: A few questions
  2014-01-02 12:36       ` Austin S Hemmelgarn
@ 2014-01-03  0:09         ` Duncan
  0 siblings, 0 replies; 14+ messages in thread
From: Duncan @ 2014-01-03  0:09 UTC (permalink / raw)
  To: linux-btrfs

Austin S Hemmelgarn posted on Thu, 02 Jan 2014 07:36:06 -0500 as
excerpted:

> I think the triple seek+write is probably the biggest offender in my
> case, although COW and autodefrag probably don't help either.  I'm kind
> of hesitant to put stuff that gets changed daily on a SSD, so I've ended
> up putting portage on ext4 with no journaling (which out-performs every
> other filesystem I have tested WRT write performance).

I went ahead with the gentoo tree and overlays on SSD, because... well, 
they need the fast access that SSD provides, and if I can't use the SSD 
for its good points, why did I buy it in the first place?

It's also worth noting that only a few files change on a day-to-day basis. 
Most of the tree remains as it is.  Similarly with the git pack files 
behind the overlays (and live-git sources) -- once they reach a certain 
point they stop changing, and all further changes go into a new file.

Further, most reports I've seen suggest that daily changes to some 
reasonably small part of an SSD aren't a huge problem... given 
wear-leveling and an estimated normal lifetime of, say, three to five years 
before the drive is replaced with new hardware anyway, they simply 
shouldn't be an issue.  It's worth keeping the limited write cycles in mind 
and minimizing writes where possible, but it's not quite the big deal a lot 
of people make it out to be.
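
To put a very hand-wavy figure on it (every number here is assumed, not 
taken from any spec sheet):

    # Rough endurance estimate; cycle count, daily write volume and write
    # amplification are all guesses for illustration only.
    capacity_gb = 128
    pe_cycles   = 3000     # assumed consumer MLC program/erase cycles
    daily_gb    = 2        # assumed writes/day hitting the tree + overlays
    write_amp   = 3        # assumed pessimistic write amplification

    endurance_tb = capacity_gb * pe_cycles / 1000
    years = endurance_tb * 1000 / (daily_gb * write_amp * 365)
    print(f"~{endurance_tb:.0f} TB endurance -> ~{years:.0f} years at this rate")

Even if my guesses are off by an order of magnitude, the drive dies of old 
age or obsolescence long before the flash wears out from this workload.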

Additionally, I'm near 100% overprovisioned, giving the SSDs lots of room 
to do that wear-leveling, so...

Meanwhile, are you using tail packing on that ext4?  The idea of wasting 
all that space due to all those small files has always been a downer for 
me and others, and is the reason many of us used reiserfs for many 
years.  I guess ext4 now does have tail packing or some similar solution, 
but I do wonder how much that tail packing affects performance.

Of course it'd also be possible to run reiserfs without tail packing, and 
even without journaling.  But somehow I always wondered what the point of 
running reiser would be, if those were disabled.

Anyway, I'd find it interesting to benchmark what the effect of 
tailpacking (or whatever ext4 calls it) on no-journal ext4 for the gentoo 
tree actually was.  If you happen to know, or happen to be inspired to 
run those benchmarks now, I'd be interested...

> As for the
> dep-calculations, I have 16G of ram, so I just use a script to read the
> entire tree into the page cache after each sync.

With 16 gig RAM, won't the sync have pulled everything into page-cache 
already?  That has always seemed to be the case here.  Running an emerge 
--deep --upgrade --newuse @world here after a sync shows very little disk 
activity and takes far less time than trying the same thing after an 
unmount/remount, thus cold-cache.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2014-01-03  0:09 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-12-29 21:11 Migrate to bcache: A few questions Kai Krakow
2013-12-30  1:03 ` Chris Murphy
2013-12-30  1:22   ` Kai Krakow
2013-12-30  3:48     ` Chris Murphy
2013-12-30  9:01     ` Marc MERLIN
2013-12-31  0:31       ` Kai Krakow
2013-12-30  6:24 ` Duncan
2013-12-31  3:13   ` Kai Krakow
2013-12-30 16:02 ` Austin S Hemmelgarn
2014-01-01 10:06   ` Duncan
2014-01-01 20:12   ` Austin S Hemmelgarn
2014-01-02  8:49     ` Duncan
2014-01-02 12:36       ` Austin S Hemmelgarn
2014-01-03  0:09         ` Duncan
