btrfs autodefrag?


* btrfs autodefrag?
@ 2015-10-17 16:36 Xavier Gnata
  2015-10-18  5:46 ` Duncan
  2015-10-18 14:24 ` Rich Freeman
  0 siblings, 2 replies; 10+ messages in thread
From: Xavier Gnata @ 2015-10-17 16:36 UTC
  To: linux-btrfs@vger.kernel.org

Hi,

On a desktop equipped with an ssd with one 100GB virtual image used 
frequently, what do you recommend?
1) nothing special, it is all fine as long as you have a recent kernel 
(which I do)
2) Disabling copy-on-write for just the VM image directory.
3) autodefrag as a mount option.
4) something else.

I don't think this use case is well documented, which is why I'm asking 
here.

Xavier

* Re: btrfs autodefrag?
  2015-10-17 16:36 btrfs autodefrag? Xavier Gnata
@ 2015-10-18  5:46 ` Duncan
  2015-10-18 12:44   ` Xavier Gnata
  2015-10-19  6:04   ` Paul Harvey
  2015-10-18 14:24 ` Rich Freeman
  1 sibling, 2 replies; 10+ messages in thread
From: Duncan @ 2015-10-18  5:46 UTC
  To: linux-btrfs

Xavier Gnata posted on Sat, 17 Oct 2015 18:36:32 +0200 as excerpted:

> Hi,
> 
> On a desktop equipped with an ssd with one 100GB virtual image used
> frequently, what do you recommend?
> 1) nothing special, it is all fine as long as you have a recent kernel
> (which I do)
> 2) Disabling copy-on-write for just the VM image directory.
> 3) autodefrag as a mount option.
> 4) something else.
> 
> I don't think this usecase is well documented therefore I asked this
> question.

You are correct.  The VM images on ssd use-case /isn't/ particularly well 
documented, I'd guess because people have differing opinions and, indeed, 
differing observed behavior, so recommendations even in the ideal case 
may well depend on the specs and firmware of the ssd.  The documentation 
tends to be aimed at the spinning rust case.

There's one detail of the use-case (besides ssd specs), however, that you 
didn't mention, that could have a big impact on the recommendation.  What 
sort of btrfs snapshotting are you planning to do, and if you're doing 
snapshots, does your use-case really need them to include the VM image 
file?

Snapshots are a big issue for anything that you might set nocow, because 
snapshot functionality assumes and requires cow, and thus conflicts, to 
some extent, with nocow.  A snapshot locks in place the existing extents, 
so they can no longer be modified.  On a normal btrfs cow-based file, 
that's not an issue, since any modifications would be cowed elsewhere 
anyway -- that's how btrfs normally works.  On a nocow file, however, 
there's a problem, because once the snapshot locks in place the existing 
version, the first change to a specific block (normally 4 KiB) *MUST* be 
cowed, despite the nocow attribute, because to rewrite in-place would 
alter the snapshot.  The nocow attribute remains in place, however, and 
further writes to the same block will again be nocow... to the new block 
location established by that first post-snapshot write... until the next 
snapshot comes along and locks that too in-place, of course.  This sort 
of cow-only-once behavior is sometimes called cow1.

If you only do very occasional snapshots, probably manually, this cow1 
behavior isn't /so/ bad, tho the file will still fragment over time as 
more and more bits of it are written and rewritten after the few 
snapshots that are taken.  However, for people doing frequent, generally 
schedule-automated snapshots, the nocow attribute is effectively 
nullified as all those snapshots force cow1s over and over again.
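
A toy model may make the cow1 arithmetic easier to see.  The Python sketch 
below is only an illustration, not how btrfs is implemented: the function 
name, the per-block "pinned" flag and the "snapshot after every N writes" 
schedule are all assumptions made up for the example.  It counts how many 
writes to a nocow file end up being cowed anyway.

```python
def count_cow_writes(num_blocks, writes, snapshot_every):
    """Count forced copy-on-write events for block rewrites to a nocow file.

    writes: sequence of block indexes being rewritten, in order.
    snapshot_every: take a snapshot after every N writes (0 means never).
    """
    pinned = [False] * num_blocks      # True: block is shared with a snapshot
    cow_events = 0
    for i, block in enumerate(writes, start=1):
        if pinned[block]:
            cow_events += 1            # first write after a snapshot must be cowed
            pinned[block] = False      # the new location is private, so nocow resumes
        if snapshot_every and i % snapshot_every == 0:
            pinned = [True] * num_blocks   # a snapshot pins every current block
    return cow_events


# Hypothetical workload: 1000 rewrites cycling over an 8-block file.
workload = [n % 8 for n in range(1000)]
for interval in (0, 500, 10):
    print(interval, count_cow_writes(8, workload, interval))
```

With no snapshots the count stays at zero.  The more often a snapshot 
lands between rewrites, the closer the count gets to one cow per write, 
which is the "effectively nullified" case described above.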

So ssd or spinning rust, there are serious conflicts between nocow and 
snapshotting that really must be taken into consideration if you're 
planning to both snapshot and nocow.

For use-cases that don't require snapshotting of the nocow files, the 
simplest workaround is to put any nocow files on dedicated subvolumes.  
Since snapshots stop at subvolume boundaries, having nocow files on 
dedicated subvolume(s) stops snapshots of the parent from including them, 
thus avoiding the cow1 situation entirely.

If the use-case requires snapshotting of nocow files, the workaround that 
has been reported (mostly on spinning rust, where fragmentation is a far 
worse problem due to non-zero seek-times) to work is first to reduce 
snapshotting to a minimum -- if it was going to be hourly, consider daily 
or every 12 hours, if you can get away with it, if it was going to be 
daily, consider every other day or weekly.  Less snapshotting means fewer 
cow1s and thus directly affects how quickly fragmentation becomes a 
problem.  Again, dedicated subvolumes can help here, allowing you to 
snapshot the nocow files on a different schedule than you do the up-
hierarchy parent subvolume.  Second, schedule periodic manual defrags of 
the nocow files, so the fragmentation that does occur is at least kept 
manageable.  If the snapshotting is daily, consider weekly or monthly 
defrags.  If it's weekly, consider monthly or quarterly defrags.  Again, 
various people who do need to snapshot their nocow files have reported 
that this really does help, keeping fragmentation to at least some sanely 
managed level.

That's the snapshot vs. nocow problem in general.  With luck, however, 
you can avoid snapshotting the files in question entirely, thus factoring 
this issue out of the equation entirely.

Now to the ssd issue.

On ssds in general, there are two very major differences we need to 
consider vs. spinning rust.  One, fragmentation isn't as much of a 
problem as it is on spinning rust.  It's still worth keeping to a 
minimum, because as the number of fragments increases, so does both btrfs 
and device overhead, but it's not the nearly everything-overriding 
consideration that it is on spinning rust.

Two, ssds have a limited write-cycle factor to consider, where with 
spinning rust the write-cycle limit is effectively infinite... at least 
compared to the much lower limit of ssds.

The weighing of these two overriding ssd factors one against the other, 
along with the simple fact that ssds are new enough technology and 
behavior differs enough between them that people simply haven't had time 
to come to agreement yet on best-practices, is why recommendations here 
differ far more than on spinning rust, where fragmentation really is the 
single most important overriding factor compared to very nearly 
everything else.  The fact of the matter is, on ssds, people strongly 
emphasizing the limited write-cycle count will tend not to worry, perhaps 
at all, about fragmentation, since its negative effects are so much 
lower on ssds.  Those (including me) who emphasize the remaining negative 
effects of fragmentation still tend to recommend at least taking it into 
account, and may even consider autodefrag worth enabling, for use-cases 
with small enough internal-rewrite-pattern files, at least.  Those 
remaining effects include scaling issues should fragmentation get too 
bad, as well as the interaction between the larger erase-block size and 
sub-erase-block-size fragmentation.  That interaction is harder to reduce 
to a universal rule, because devices and firmwares do differ in major 
ways here, but the resulting write-amplification may well trigger more 
write cycles than the defrag itself would.

So let's address autodefrag...

It's worth noting that I have autodefrag enabled here, on my ssds, and 
have from the first mount where I put content on them, so it has been 
enabled for every write on every file.  However, it's not ideal in all 
cases, my use-case simply is one where autodefrag works well, so...

Here's the deal with autodefrag.  First of all, if a file isn't 
constantly being rewritten, or if its rewrite pattern is append-only 
(like most log files, but *not* systemd journal files!), it doesn't tend 
to get particularly fragmented in the first place, especially with a 
filesystem that itself isn't highly fragmented, so free-space blocks tend 
to be large enough that a file doesn't tend to be fragmented as initially 
written.  So fragmentation tends to be worst on internal-rewrite-pattern 
files, where a block here and a block there are rewritten, normally 
triggering cow on a cow-based filesystem such as btrfs.

But consider that rewriting the entire file to avoid fragmentation, 
which is what autodefrag does, takes time: the larger the file, the more 
time.  And 
at some point, as filesizes increase, rewrites can be coming in faster 
than the file can be rewritten.  So autodefrag works best on internal-
rewrite-pattern files (as we've already established), but also on smaller 
files.

On spinning rust, autodefrag tends to work best at file sizes under 256 
MiB, a quarter GiB, where they rewrite fast enough that there's generally 
no problems at all.  But on most spinning rust, people will begin to see 
performance issues with autodefrag somewhere between half a GiB and 
3/4 GiB (512-768 MiB), and nearly everyone on spinning rust reports 
performance issues at 1 GiB file sizes and larger.

As it happens, this quarter-GiB or so spinning-rust autodefrag limit is 
close to that of common desktop-only database uses such as the sqlite 
files firefox and thunderbird use, so this is the use-case for which 
autodefrag is really recommended and tuned ATM.  That's really useful, 
since it means most desktop-only users can simply enable autodefrag and 
forget about it, as it'll "just work".

People optimizing larger databases and GiB+ VM image files, however, are 
going to need to do rather more detailed optimization, which sucks, but 
in contrast with normal desktop users, they're generally used to doing 
various optimization things, at least to some extent, already, so at 
least the problem is hitting those generally more technically prepared to 
deal with it.

But that's for spinning rust.  On ssds, particularly fast ssds, write 
speeds tend to be high enough that autodefrag can work effectively with 
much larger files.  The rub, however, is that ssd speeds vary enough, and 
there are few enough reports from people actually testing autodefrag with 
larger internal-rewrite-pattern files on ssds, that we don't have nicely 
wrapped up numbers for our ssd autodefrag filesize limitation 
recommendations, as we do for spinning rust.

I'd suggest based on my own experience and the reports we /do/ have, that 
on most ssds, autodefrag, provided people are inclined to enable it in 
the first place (see above discussion of the two major ssd factors here 
and how emphasis on one or the other tends to put people in one of two 
camps regarding even worrying about fragmentation at all on ssds), should 
work well enough on files up to a gig in size, at least.  I wouldn't be 
surprised to see 2 GiB work fine, particularly on fast ssds, tho I'd 
guess people will begin to see performance issues at the 4 GiB to 8 GiB 
size.

You say your image file, while on ssd, is 100 GiB.  Please do your own 
tests and report as it's possible my EWAG (educated but wild-ass-guess) 
is wrong, but I'm predicting that's well above the good performance limit 
for autodefrag, even on SSD.
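
For quick reference, the size estimates above can be written down as a 
small lookup.  This is just a Python sketch of the rules of thumb in this 
message, not anything btrfs itself enforces: the names are made up, the 
bounds are approximate, and the "uncertain" bands cover ranges for which 
no estimate was given, so treat them as an assumption.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

# (exclusive upper bound in bytes, outlook), checked in order.
# Bounds are approximate readings of the estimates above; the "uncertain"
# bands fill ranges the reports do not cover and are an assumption.
RULES = {
    "hdd": [(256 * MIB, "fine"), (512 * MIB, "uncertain"), (1 * GIB, "issues begin")],
    "ssd": [(1 * GIB, "fine"), (4 * GIB, "uncertain"), (8 * GIB, "issues begin")],
}


def autodefrag_outlook(size_bytes, media):
    """Classify a file by size; anything past the last bound is 'issues expected'."""
    for bound, outlook in RULES[media]:
        if size_bytes < bound:
            return outlook
    return "issues expected"


print(autodefrag_outlook(100 * GIB, "ssd"))
```

The 100 GiB image lands far past the last ssd bound, which is the point of 
the prediction above.  Real tests on the actual device should override 
any of these numbers.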

That said, performance may still be good /enough/ that you can deal with 
it, if it sufficiently simplifies the situation for you regarding /other/ 
files, and your balance of use tilts sufficiently toward those other 
files as opposed to this single very large image file.

Tho at 100 GiB, the repeated rewriting of autodefrag is definitely likely 
to cut into your write-cycle allowance, arguably rather heavily.  So I 
really can't recommend autodefrag, despite how very much I wish it would 
work for your case, since it does dramatically simplify things where it 
works and you can then simply forget about other alternatives and all 
their relative complications.  Maybe someday they'll optimize it to 
handle such large files better, but until then, I really don't think it's 
a good match to your requirements.

So with autodefrag out for that file, and with the previous issues 
discussed, here are some reasonable options to try.

1) The nothing special option.  With a bit of luck, the 0-seek-time of 
ssd will mean that the fragmentation you're likely to see won't 
dramatically affect you, and the "do nothing" option will work acceptably.

The biggest thing I'm worried about here is that fragmentation may well 
get bad enough that it affects btrfs maintenance times, etc, due to 
scaling issues.  Btrfs balance, scrub, and check, could end up taking far 
longer than you might expect on ssd and than they'd take were it not for 
the fragmentation on this single file.

And if you're keeping snapshots around, be aware that simply defragging 
the file isn't likely to solve the btrfs maintenance times issue, because 
while btrfs did have snapshot-aware-defrag for a few kernels, it did not 
scale well *AT* *ALL* and the snapshot awareness was disabled again, 
until the scaling issues could be worked thru (which they're gradually 
doing, but it's an exceedingly complex problem, with many sub-issues that 
must be solved before scaling itself can be considered solved).  So 
defragging a file that's already highly fragmented in various snapshots 
of differing ages, will defrag it in the subvolume/snapshot you run the 
defrag in, but won't affect it in the other snapshots, so isn't likely to 
do much at all for the overall btrfs maintenance scaling issue.  You'd 
have to delete all those snapshots (or not take them in the first place, 
if your use-case doesn't require them) to eliminate the scaling issue, if 
it's due to fragmentation of this file in all those snapshots as well as 
the working copy.

So watch out for the maintenance scaling (maybe run a scrub and/or read-
only check periodically, just to ensure the execution times aren't 
running away on you), but if it works well enough for you, this is by far 
the most uncomplicated option.
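
One way to notice execution times running away is to log how long each 
periodic run takes (for example a foreground scrub such as 
"btrfs scrub start -B <mountpoint>") and compare the newest figure with 
the earlier ones.  The Python sketch below assumes such a log already 
exists as a list of durations in seconds; the function name, the factor 
of 2 and the minimum history length are arbitrary choices for 
illustration, not recommended values.

```python
from statistics import median


def is_running_away(durations_s, factor=2.0, min_history=3):
    """Return True if the latest run took more than `factor` times the
    median of the earlier runs.  With too little history, return False."""
    if len(durations_s) <= min_history:
        return False
    *history, latest = durations_s
    return latest > factor * median(history)


# Hypothetical scrub durations in seconds, oldest first.
print(is_running_away([60, 62, 61, 65]))
print(is_running_away([60, 62, 61, 200]))
```

Comparing against the median rather than the previous run keeps one slow 
outlier in the history from masking a real trend.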

2) If your use-case doesn't involve snapshotting the image file, setting 
nocow on the dir before creation of the file, such that the file inherits 
the nocow, should be a reasonably uncomplicated option.

If you do plan on snapshotting the parent but don't actually need to 
snapshot the nocow subdir and its nocow inheriting images, then use the 
dedicated subvolume trick to keep the image file out of your snapshots 
and avoid the cow1 complications.
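
As a sketch of what option 2 with the dedicated subvolume trick could 
look like, the Python snippet below only builds and prints the commands, 
it does not run them.  The parent path and the "vm-images" name are 
placeholders.  It assumes the standard "btrfs subvolume create", 
"chattr +C" and "lsattr -d" tools, and that the image file is created or 
copied in only after the attribute is set, since nocow is inherited by 
newly created files and does not convert existing data.

```python
import shlex


def nocow_image_dir_commands(parent, name="vm-images"):
    """Build (but do not run) the commands for a dedicated nocow subvolume."""
    target = f"{parent.rstrip('/')}/{name}"
    return [
        ["btrfs", "subvolume", "create", target],  # snapshots of the parent stop here
        ["chattr", "+C", target],                  # files created inside inherit nocow
        ["lsattr", "-d", target],                  # check that the 'C' flag is listed
    ]


# Placeholder parent directory; substitute the real location.
for argv in nocow_image_dir_commands("/mnt/data"):
    print(shlex.join(argv))
```

Printing first and running by hand keeps the sketch harmless, and the 
commands need root on a real system anyway.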

3) As an idea taking the dedicated subvolume idea even further, consider 
an entirely separate dedicated filesystem for this image file.  That 
gives you much more flexibility, because then you can, for instance, 
still set autodefrag on the main filesystem, if it'd be useful there, 
without worrying about how that huge image file and autodefrag interact.

Additionally, that lets you use something other than btrfs for the image 
file's filesystem, if you want, while still using btrfs for the rest of 
the system.  If you're nocowing the file, you're already killing many of 
the features that btrfs generally brings, and provided the additional 
overhead of managing the separate partition and filesystem isn't too 
much, you might /as/ /well/ simply use something other than btrfs for 
that particular file, thus avoiding the whole image file cowing 
complications scenario in the first place.

I'd strongly consider the separate filesystem option here, as I already 
use multiple separate filesystems in order to avoid having my data eggs 
all in the same single filesystem basket (subvolumes don't cut it in 
terms of separation safety, for me).  But some people are far more averse 
to partitioning and similar solutions, for reasons that aren't entirely 
clear to me.  If you'd prefer to avoid the complexity of managing an 
entirely separate filesystem just for your image file, fine, just cross 
this option off your list and don't consider it further.

4) If the "do nothing" option doesn't cut it and your use-case involves 
snapshotting the image file, then things get much more complex.

As mentioned above, the recommendation for this sort of use-case isn't 
going to give you a simple ideal, but others have reported it to work 
acceptably, even surprisingly well, once it's all set up, and if that's 
the situation on spinning rust, it should be even better on ssd, since 
the "controlled amount of fragmentation" should be even further within 
acceptable levels on ssd with its zero-seek-times, than it is on spinning 
rust.

Again, the recommendation for this use-case is to set nocow on the image-
file's dir so it inherits, and aim for the low end of your acceptable 
snapshotting frequency range for the image file, weekly instead of daily, 
or daily instead of hourly.  If necessary, use the separate subvolume 
trick to separate the image file from the rest of the content you're 
snapshotting, so you can use a higher frequency snapshot schedule on the 
other stuff, while keeping it as low frequency as you can manage on the 
image file.

Then do scheduled periodic targeted defrag of the image file, at a 
frequency some fraction of the snapshot frequency, perhaps monthly or 
quarterly for weekly snapshots, etc.

Keep in mind that defrag will only affect the working copy, not existing 
snapshots, but provided you do it at some reasonable fraction of the 
snapshotting interval, you should reset the fragmentation for further 
snapshots often enough that it doesn't get out of hand for them, either.
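
The scheduled targeted defrag could also be made conditional on how 
fragmented the image actually is.  The Python sketch below parses the 
summary line that filefrag prints ("<file>: N extents found") and returns 
a "btrfs filesystem defragment" command only above a threshold.  The 
threshold of 10000 extents and the example path are made-up values, and 
filefrag's extent counts on btrfs are only a rough indicator, so take 
this as a starting point rather than a tested recipe.

```python
import re


def parse_extent_count(filefrag_output):
    """Extract N from a filefrag summary line like 'disk.img: 42 extents found'."""
    match = re.search(r":\s*(\d+) extents? found", filefrag_output)
    if match is None:
        raise ValueError("no extent count found in filefrag output")
    return int(match.group(1))


def defrag_command_if_needed(path, filefrag_output, max_extents=10_000):
    """Return the defrag command as an argv list, or None if below the threshold."""
    if parse_extent_count(filefrag_output) > max_extents:
        return ["btrfs", "filesystem", "defragment", path]
    return None


# Hypothetical filefrag output for a hypothetical image path.
sample = "/mnt/data/vm-images/disk.img: 48213 extents found"
print(defrag_command_if_needed("/mnt/data/vm-images/disk.img", sample))
```

As noted above, running the returned command only touches the working 
copy; extents already held by existing snapshots stay as they are.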

Finally, orthogonal to the original fragmentation question, but 
particularly important if you /are/ doing scheduled snapshots...

For scheduled snapshots in particular, it's very important that you set up 
a reasonable snapshot thinning schedule as well, the object of which 
should be to keep the number of snapshots as low as possible, again, for 
scaling reasons.  At this point anyway, btrfs maintenance operations 
simply do /not/ scale well with snapshot numbers in the tens or hundreds 
of thousands range, as people often find themselves with if they aren't 
doing scheduled snapshot thinning as well.

With reasonable thinning, it's quite possible to keep per-subvolume 
snapshots to 250 or so, reasonably under 300, even if starting with 
incredibly high snapshot frequency such as every half-hour or even every 
minute (tho the latter tends to be impractical because while snapshots 
are fast, very nearly instantaneous, removing them is rather more complex 
and definitely not instantaneous!).  With 250 snapshots per subvolume, 
you keep it to 1000 snapshots per filesystem if you're snapshotting four 
subvolumes, 2000 per filesystem if you're doing eight, etc.  Ideally, 
you'll target 1000 or less, possibly by thinning more drastically on some 
subvolume snapshots than others, but 2000 or even 3000 isn't out of hand, 
tho by 2500 to 3000, you'll probably notice increased maintenance times.  
By 10k snapshots, however, things are starting to go south, and above 
that, things go unreasonable pretty fast.

So do try to keep to "a few thousand, at most" snapshots, or expect 
btrfs balance and other maintenance tasks to take "unreasonable" amounts 
of time, should you need to run them.  And if you can keep to under 1000, 
so much the better; your improved maintenance times will reward you for 
it. =:^)
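
To show how thinning keeps the numbers down, here is a small Python 
sketch of a tiered retention rule.  The tiers (half-hourly for a day, 
six-hourly for a week, daily out to 90 days) are an invented example 
schedule, not a recommendation, and the function only decides which 
snapshot timestamps to keep; actually deleting the rest is left out.

```python
from datetime import datetime, timedelta

# Hypothetical retention tiers: (applies to snapshots younger than this,
# keep at most one snapshot per this interval).  Checked in order.
TIERS = [
    (timedelta(days=1), timedelta(minutes=30)),
    (timedelta(days=7), timedelta(hours=6)),
    (timedelta(days=90), timedelta(days=1)),
]


def thin(snapshots, now, tiers=TIERS):
    """Return the snapshot times to keep; everything else would be deleted."""
    keep = []
    seen = set()
    for taken in sorted(snapshots, reverse=True):   # newest first
        age = now - taken
        for tier_index, (max_age, interval) in enumerate(tiers):
            if age < max_age:
                bucket = (tier_index, age // interval)
                if bucket not in seen:              # keep the newest snapshot per bucket
                    seen.add(bucket)
                    keep.append(taken)
                break                               # older than every tier: dropped
    return keep


# 120 days of half-hourly snapshots of one subvolume.
now = datetime(2015, 10, 18)
half_hourly = [now - timedelta(minutes=30 * k) for k in range(120 * 48)]
kept = thin(half_hourly, now)
print(len(kept), "kept per subvolume;", 4 * len(kept), "across four subvolumes")
```

Even starting from half-hourly snapshots, this example schedule stays 
well under the 250-per-subvolume figure, and so under 1000 per filesystem 
with four subvolumes.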

Also, as you may have already seen, my recommendation for quotas is 
simply to leave them off on btrfs.  They're broken and dramatically increase 
the scaling issues.  You either rely on quotas working or you don't.  If 
you don't, leave them off and avoid the issues.  If you do, use a more 
stable and mature filesystem where they're known to work reliably.  
Unless of course you're specifically working with the devs to test, 
report and trace down quota problems and test possible fixes.  In that 
case, please continue, as it's your tolerance for the present pain that's 
helping to make the feature actually usable for the rest of us, someday 
hopefully soon. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

* Re: btrfs autodefrag?
  2015-10-18  5:46 ` Duncan
@ 2015-10-18 12:44   ` Xavier Gnata
  2015-10-19  6:04   ` Paul Harvey
  1 sibling, 0 replies; 10+ messages in thread
From: Xavier Gnata @ 2015-10-18 12:44 UTC
  To: Duncan, linux-btrfs



On 18/10/2015 07:46, Duncan wrote:
> Xavier Gnata posted on Sat, 17 Oct 2015 18:36:32 +0200 as excerpted:
>
>> Hi,
>>
>> On a desktop equipped with an ssd with one 100GB virtual image used
>> frequently, what do you recommend?
>> 1) nothing special, it is all fine as long as you have a recent kernel
>> (which I do)
>> 2) Disabling copy-on-write for just the VM image directory.
>> 3) autodefrag as a mount option.
>> 4) something else.
>>
>> I don't think this usecase is well documented therefore I asked this
>> question.
>
> You are correct.  The VM images on ssd use-case /isn't/ particularly well
> documented, I'd guess because people have differing opinions, and,
> indeed, actual observed behavior, and thus recommendations even in the
> ideal case, may well be different depending on the specs and firmware of
> the ssd.  The documentation tends to be aimed at the spinning rust case.
>
> There's one detail of the use-case (besides ssd specs), however, that you
> didn't mention, that could have a big impact on the recommendation.  What
> sort of btrfs snapshotting are you planning to do, and if you're doing
> snapshots, does your use-case really need them to include the VM image
> file?
>
> Snapshots are a big issue for anything that you might set nocow, because
> snapshot functionality assumes and requires cow, and thus conflicts, to
> some extent, with nocow.  A snapshot locks in place the existing extents,
> so they can no longer be modified.  On a normal btrfs cow-based file,
> that's not an issue, since any modifications would be cowed elsewhere
> anyway -- that's how btrfs normally works.  On a nocow file, however,
> there's a problem, because once the snapshot locks in place the existing
> version, the first change to a specific block (normally 4 KiB) *MUST* be
> cowed, despite the nocow attribute, because to rewrite in-place would
> alter the snapshot.  The nocow attribute remains in place, however, and
> further writes to the same block will again be nocow... to the new block
> location established by that first post-snapshot write... until the next
> snapshot comes along and locks that too in-place, of course.  This sort
> of cow-only-once behavior is sometimes called cow1.
>
> If you only do very occasional snapshots, probably manually, this cow1
> behavior isn't /so/ bad, tho the file will still fragment over time as
> more and more bits of it are written and rewritten after the few
> snapshots that are taken.  However, for people doing frequent, generally
> schedule-automated snapshots, the nocow attribute is effectively
> nullified as all those snapshots force cow1s over and over again.
>
> So ssd or spinning rust, there's serious conflicts between nocow and
> snapshotting that really must be taken into consideration if you're
> planning to both snapshot and nocow.
>
> For use-cases that don't require snapshotting of the nocow files, the
> simplest workaround is to put any nocow files on dedicated subvolumes.
> Since snapshots stop at subvolume boundaries, having nocow files on
> dedicated subvolume(s) stops snapshots of the parent from including them,
> thus avoiding the cow1 situation entirely.
>
> If the use-case requires snapshotting of nocow files, the workaround that
> has been reported (mostly on spinning rust, where fragmentation is a far
> worse problem due to non-zero seek-times) to work is first to reduce
> snapshotting to a minimum -- if it was going to be hourly, consider daily
> or every 12 hours, if you can get away with it, if it was going to be
> daily, consider every other day or weekly.  Less snapshotting means less
> cow1s and thus directly affects how quickly fragmentation becomes a
> problem.  Again, dedicated subvolumes can help here, allowing you to
> snapshot the nocow files on a different schedule than you do the up-
> hierarchy parent subvolume.  Second, schedule periodic manual defrags of
> the nocow files, so the fragmentation that does occur is at least kept
> manageable.  If the snapshotting is daily, consider weekly or monthly
> defrags.  If it's weekly, consider monthly or quarterly defrags.  Again,
> various people who do need to snapshot their nocow files have reported
> that this really does help, keeping fragmentation to at least some sanely
> managed level.
>
> That's the snapshot vs. nocow problem in general.  With luck, however,
> you can avoid snapshotting the files in question entirely, thus factoring
> this issue out of the equation entirely.
>
> Now to the ssd issue.
>
> On ssds in general, there are two very major differences we need to
> consider vs. spinning rust.  One, fragmentation isn't as much of a
> problem as it is on spinning rust.  It's still worth keeping to a
> minimum, because as the number of fragments increases, so does both btrfs
> and device overhead, but it's not the nearly everything-overriding
> consideration that it is on spinning rust.
>
> Two, ssds have a limited write-cycle factor to consider, where with
> spinning rust the write-cycle limit is effectively infinite... at least
> compared to the much lower limit of ssds.
>
> The weighing of these two overriding ssd factors one against the other,
> along with the simple fact that ssds are new enough technology and
> behavior differs enough between them that people simply haven't had time
> to come to agreement yet on best-practices, is why recommendations here
> differ far more than on spinning rust, where fragmentation really is the
> single most important overriding factor compared to very nearly
> everything else.  The fact of the matter is, on ssds, people strongly
> emphasizing the limited write-cycle count will tend not to worry, perhaps
> at all, about fragmentation, since its negative effects are so much
> lower on ssds, while those (including me) who emphasize the remaining
> negative effects that fragmentation has, including scaling issues should
> it get too bad, as well as the less easy to create a universal rule for
> (because devices and firmwares do differ in major ways here) effect of
> the larger erase block size and how that interacts with sub-erase-block-
> size fragmentation and write-amplification, thus perhaps triggering more
> write cycles due to sub-erase-block-fragmentation than the defrag would
> trigger, still tend to recommend at least taking fragmentation into
> account, and may even consider autodefrag worth enabling, for use-cases
> with small enough internal-rewrite-pattern files, at least.
>
> So let's address autodefrag...
>
> It's worth noting that I have autodefrag enabled here, on my ssds, and
> have from the first mount where I put content on them, so it has been
> enabled for every write on every file.  However, it's not ideal in all
> cases, my use-case simply is one where autodefrag works well, so...
>
> Here's the deal with autodefrag.  First of all, if a file isn't
> constantly being rewritten, or if its rewrite pattern is append-only
> (like most log files, but *not* systemd journal files!), it doesn't tend
> to get particularly fragmented in the first place, especially with a
> filesystem that itself isn't highly fragmented, so free-space blocks tend
> to be large enough that a file doesn't tend to be fragmented as initially
> written.  So fragmentation tends to be worst on internal-rewrite-pattern
> files, where a block here and a block there are rewritten, normally
> triggering cow on a cow-based filesystem such as btrfs.
>
> But, consider that rewriting the entire file to avoid fragmentation,
> which is what autodefrag does, takes time, larger file, more time.   And
> at some point, as filesizes increase, rewrites can be coming in faster
> than the file can be rewritten.  So autodefrag works best on internal-
> rewrite-pattern files (as we've already established), but also on smaller
> files.
>
> On spinning rust, autodefrag tends to work best at file sizes under 256
> MiB, a quarter GiB, where they rewrite fast enough that there's generally
> no problems at all.  But on most spinning rust, people will begin to see
> performance issues with autodefrag, at somewhere between half a GiB and
> 3/4 GiB (512-768 MiB), and nearly everyone on spinning rust reports
> performance issues at 1 GiB file sizes and larger.
>
> As it happens, this quarter-GiB or so spinning-rust autodefrag limit is
> close to that of common desktop-only database uses such as the sqlite
> files firefox and thunderbird use, so this is the use-case for which
> autodefrag is really recommended and tuned ATM.  That's really useful,
> since it means most desktop-only users can simply enable autodefrag and
> forget about it, as it'll "just work".
>
> People optimizing larger databases and GiB+ VM image files, however, are
> going to need to do rather more detailed optimization, which sucks, but
> in contrast with normal desktop users, they're generally used to doing
> various optimization things, at least to some extent, already, so at
> least the problem is hitting those generally more technically prepared to
> deal with it.
>
> But that's for spinning rust.  On ssds, particularly fast ssds, write
> speeds tend to be high enough that autodefrag can work effectively with
> much larger files.  The rub, however, is that ssd speeds vary enough, and
> there's few enough reports from people actually testing autodefrag with
> larger internal-rewrite-pattern files on ssds, that we don't have nicely
> wrapped up numbers for our ssd autodefrag filesize limitation
> recommendations, as we do for spinning rust.
>
> I'd suggest based on my own experience and the reports we /do/ have, that
> on most ssds, autodefrag, provided people are inclined to enable it in
> the first place (see above discussion of the two major ssd factors here
> and how emphasis on one or the other tends to put people in one of two
> camps regarding even worrying about fragmentation at all on ssds), should
> work well enough on files up to a gig in size, at least.  I wouldn't be
> surprised to see 2 GiB work fine, particularly on fast ssds, tho I'd
> guess people will begin to see performance issues at the 4 GiB to 8 GiB
> size.
>
> You say your image file, while on ssd, is 100 GiB.  Please do your own
> tests and report as it's possible my EWAG (educated but wild-ass-guess)
> is wrong, but I'm predicting that's well above the good performance limit
> for autodefrag, even on SSD.
>
> That said, performance may still be good /enough/ that you can deal with
> it, if it sufficiently simplifies the situation for you regarding /other/
> files, and your balance of use tilts sufficiently toward those other
> files as opposed to this single very large image file.
>
> Tho at 100 GiB, the repeated rewriting of autodefrag is definitely likely
> to cut into your write-cycle allowance, arguably rather heavily.  So I
> really can't recommend autodefrag, despite how very much I wish it would
> work for your case, since it does dramatically simplify things where it
> works and you can then simply forget about other alternatives and all
> their relative complications.  Maybe someday they'll optimize it to
> handle such large files better, but until then, I really don't think it's
> a good match to your requirements.
>
> So with autodefrag out for that file, and with the previous issues
> discussed, here's some reasonable options to try.
>
> 1) The nothing special option.  With a bit of luck, the 0-seek-time of
> ssd will mean that the fragmentation you're likely to see won't
> dramatically affect you, and the "do nothing" option will work acceptably.
>
> The biggest thing I'm worried about here is that fragmentation may well
> get bad enough that it affects btrfs maintenance times, etc, due to
> scaling issues.  Btrfs balance, scrub, and check, could end up taking far
> longer than you might expect on ssd and than they'd take were it not for
> the fragmentation on this single file.
>
> And if you're keeping snapshots around, be aware that simply defragging
> the file isn't likely to solve the btrfs maintenance times issue, because
> while btrfs did have snapshot-aware-defrag for a few kernels, it did not
> scale well *AT* *ALL* and the snapshot awareness was disabled again,
> until the scaling issues could be worked thru (which they're gradually
> doing, but it's an exceedingly complex problem, with many sub-issues that
> must be solved before scaling itself can be considered solved).  So
> defragging a file that's already highly fragmented in various snapshots
> of differing ages, will defrag it in the subvolume/snapshot you run the
> defrag in, but won't affect it in the other snapshots, so isn't likely to
> do much at all for the overall btrfs maintenance scaling issue.  You'd
> have to delete all those snapshots (or not take them in the first place,
> if your use-case doesn't require them) to eliminate the scaling issue, if
> it's due to fragmentation of this file in all those snapshots as well as
> the working copy.
>
> So watch out for the maintenance scaling (maybe run a scrub and/or read-
> only check periodically, just to ensure the execution times aren't
> running away on you), but if it works well enough for you, this is by far
> the most uncomplicated option.
>
> 2) If your use-case doesn't involve snapshotting the image file, setting
> nocow on the dir before creation of the file, such that the file inherits
> the nocow, should be a reasonably uncomplicated option.
>
> If you do plan on snapshotting the parent but don't actually need to
> snapshot the nocow subdir and its nocow inheriting images, then use the
> dedicated subvolume trick to keep the image file out of your snapshots
> and avoid the cow1 complications.
>
> 3) As an idea taking the dedicated subvolume idea even further, consider
> an entirely separate dedicated filesystem for this image file.  That
> gives you much more flexibility, because then you can, for instance,
> still set autodefrag on the main filesystem, if it'd be useful there,
> without worrying about how that huge image file and autodefrag interact.
>
> Additionally, that lets you use something other than btrfs for the image
> file's filesystem, if you want, while still using btrfs for the rest of
> the system.  If you're nocowing the file, you're already killing many of
> the features that btrfs generally brings, and provided the additional
> overhead of managing the separate partition and filesystem isn't too
> much, you might /as/ /well/ simply use something other than btrfs for
> that particular file, thus avoiding the whole image file cowing
> complications scenario in the first place.
>
> I'd strongly consider the separate filesystem option here, as I already
> use multiple separate filesystems in order to avoid having my data eggs
> all in the same single filesystem basket (subvolumes don't cut it in
> terms of separation safety, for me).  But some people are far more averse
> to partitioning and similar solutions, for reasons that aren't entirely
> clear to me.  If you'd prefer to avoid the complexity of managing an
> entirely separate filesystem just for your image file, fine, just cross
> this option off your list and don't consider it further.
>
> 4) If the "do nothing" option doesn't cut it and your use-case involves
> snapshotting the image file, then things get much more complex.
>
> As mentioned above, the recommendation for this sort of use-case isn't
> going to give you a simple ideal, but others have reported it to work
> acceptably, even surprisingly, well, once it's all set up, and if that's
> the situation on spinning rust, it should be even better on ssd, since
> the "controlled amount of fragmentation" should be even further within
> acceptable levels on ssd with its zero-seek-times, than it is on spinning
> rust.
>
> Again, the recommendation for this use-case is to set nocow on the image-
> file's dir so it inherits, and aim for the low end of your acceptable
> snapshotting frequency range for the image file, weekly instead of daily,
> or daily instead of hourly.  If necessary, use the separate subvolume
> trick to separate the image file from the rest of the content you're
> snapshotting, so you can use a higher frequency snapshot schedule on the
> other stuff, while keeping it as low frequency as you can manage on the
> image file.
>
> Then do scheduled periodic targeted defrag of the image file, at a
> frequency some fraction of the snapshot frequency, perhaps monthly or
> quarterly for weekly snapshots, etc.
>
> Keep in mind that defrag will only affect the working copy, not existing
> snapshots, but provided you do it at some reasonable fraction of the
> snapshotting interval, you should reset the fragmentation for further
> snapshots often enough that it doesn't get out of hand for them, either.
>
>
> Finally, orthogonal to the original fragmentation question, but
> particularly important if you /are/ doing scheduled snapshots...
>
> For scheduled snapshots in particular, it's very important that you set up
> a reasonable snapshot thinning schedule as well, the object of which
> should be to keep the number of snapshots as low as possible, again, for
> scaling reasons.  At this point anyway, btrfs maintenance operations
> simply do /not/ scale well with snapshot numbers in the tens or hundreds
> of thousands range, as people often find themselves with if they aren't
> doing scheduled snapshot thinning as well.
>
> With reasonable thinning, it's quite possible to keep per-subvolume
> snapshots to 250 or so, reasonably under 300, even if starting with
> incredibly high snapshot frequency such as every half-hour or even every
> minute (tho the latter tends to be impractical because while snapshots
> are fast, very nearly instantaneous, removing them is rather more complex
> and definitely not instantaneous!).  With 250 snapshots per subvolume,
> you keep it to 1000 snapshots per filesystem if you're snapshotting four
> subvolumes, 2000 per filesystem if you're doing eight, etc.  Ideally,
> you'll target 1000 or less, possibly by thinning more drastically on some
> subvolume snapshots than others, but 2000 or even 3000 isn't out of hand,
> tho by 2500 to 3000, you'll probably notice increased maintenance times.
> By 10k snapshots, however, things are starting to go south, and above
> that, things go unreasonable pretty fast.
>
> So do try to keep to "a few thousand, at most" snapshots, or expect
> btrfs balance and other maintenance tasks to take "unreasonable" amounts
> of time, should you need to run them.  And if you can keep to under 1000,
> so much the better; your improved maintenance times will reward you for
> it. =:^)
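
As a rough illustration of the thinning and counting described above, here
is a minimal Python sketch.  The tier boundaries in TIERS and the helper
names thin() and filesystem_total() are assumptions made up for this
example; the thread does not prescribe a particular schedule, and real
tools would act on btrfs snapshots rather than on bare timestamps.

```python
from datetime import datetime, timedelta

# Hypothetical thinning tiers as (maximum age, bucket width), youngest first.
# These numbers are illustrative assumptions, not a recommendation.
TIERS = [
    (timedelta(days=1), timedelta(hours=1)),    # last day: one per hour
    (timedelta(days=7), timedelta(days=1)),     # last week: one per day
    (timedelta(days=90), timedelta(days=7)),    # last quarter: one per week
]


def thin(snapshots, now, tiers):
    """Return the snapshot timestamps to keep: the newest one per bucket.

    Snapshots older than the last tier's maximum age are dropped entirely.
    """
    keep = {}
    for ts in sorted(snapshots, reverse=True):      # newest first
        age = now - ts
        for tier_index, (max_age, bucket) in enumerate(tiers):
            if age <= max_age:
                # Snapshots in the same tier and bucket collapse to one;
                # setdefault keeps the first seen, which is the newest.
                keep.setdefault((tier_index, age // bucket), ts)
                break
    return sorted(keep.values())


def filesystem_total(per_subvolume, subvolumes):
    """Snapshots per filesystem for a given per-subvolume target."""
    return per_subvolume * subvolumes


if __name__ == "__main__":
    now = datetime(2015, 10, 18, 12, 0)
    # Half-hourly snapshots taken over 90 days.
    taken = [now - timedelta(minutes=30 * i) for i in range(90 * 48)]
    kept = thin(taken, now, TIERS)
    print(len(taken), "taken,", len(kept), "kept")
    print(filesystem_total(250, 4), "snapshots with four subvolumes at 250 each")
```

Even with half-hourly snapshots, collapsing older ones into coarser buckets
keeps the per-subvolume count far below the 250 mentioned above; how
aggressive the tiers should be depends on the use-case.
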
>
> Also, as you may have already seen, my recommendation for quotas is
> simply leave them off on btrfs.  They're broken and dramatically increase
> the scaling issues.  You either rely on quotas working or you don't.  If
> you don't, leave them off and avoid the issues.  If you do, use a more
> stable and mature filesystem where they're known to work reliably.
> Unless of course you're specifically working with the devs to test,
> report and trace down quota problems and test possible fixes.  In that
> case, please continue, as it's your tolerance for the present pain that's
> helping to make the feature actually usable for the rest of us, someday
> hopefully soon. =:^)
>

Thanks for the very detailed answer! Your text should find its way to the 
BTRFS wiki/doc.

I never have more than a few snapshots of my home dir.
I don't *need* to snapshot the VM image therefore I intended to use 
nocow. However and thanks to your answer, I'm going to try the "do 
nothing special" option. If things are getting too slow then I will 
report and probably switch back to the nocow option (and a good 
old-fashioned backup of the VM image every night on old-fashioned ext4 on 
spinning rust).

Xavier

* Re: btrfs autodefrag?
  2015-10-17 16:36 btrfs autodefrag? Xavier Gnata
  2015-10-18  5:46 ` Duncan
@ 2015-10-18 14:24 ` Rich Freeman
  2015-10-18 14:40   ` Hugo Mills
  1 sibling, 1 reply; 10+ messages in thread
From: Rich Freeman @ 2015-10-18 14:24 UTC (permalink / raw)
  To: Xavier Gnata; +Cc: linux-btrfs@vger.kernel.org

On Sat, Oct 17, 2015 at 12:36 PM, Xavier Gnata <xavier.gnata@gmail.com> wrote:
> 2) Disabling copy-on-write for just the VM image directory.

Unless this has changed, doing this will also disable checksumming.  I
don't see any reason why it has to, but it does.  So, I avoid using
this at all costs.

--
Rich

* Re: btrfs autodefrag?
  2015-10-18 14:24 ` Rich Freeman
@ 2015-10-18 14:40   ` Hugo Mills
  2015-10-19  6:19     ` Erkki Seppala
  0 siblings, 1 reply; 10+ messages in thread
From: Hugo Mills @ 2015-10-18 14:40 UTC (permalink / raw)
  To: Rich Freeman; +Cc: Xavier Gnata, linux-btrfs@vger.kernel.org

On Sun, Oct 18, 2015 at 10:24:39AM -0400, Rich Freeman wrote:
> On Sat, Oct 17, 2015 at 12:36 PM, Xavier Gnata <xavier.gnata@gmail.com> wrote:
> > 2) Disabling copy-on-write for just the VM image directory.
> 
> Unless this has changed, doing this will also disable checksumming.  I
> don't see any reason why it has to, but it does.  So, I avoid using
> this at all costs.

   It has to be disabled because if you enable it, there's a race
condition: since you're overwriting existing data (rather than CoWing
it), you can't update the checksums atomically. So, in the interests
of consistency, checksums are disabled.
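
The race can be pictured with a small Python model.  This is only a
sketch under stated assumptions: NocowBlock is a made-up class, and
zlib.crc32 stands in for whatever checksum the filesystem really uses.
The point is just that the data and its checksum are two separate writes
when the data is overwritten in place.

```python
import zlib


class NocowBlock:
    """Toy model: one block overwritten in place, checksum stored separately."""

    def __init__(self, data: bytes):
        self.data = bytearray(data)
        self.csum = zlib.crc32(data)

    def overwrite(self, new_data: bytes, crash_before_csum: bool = False):
        self.data[:] = new_data              # step 1: data is overwritten
        if crash_before_csum:
            return                           # power lost between the steps
        self.csum = zlib.crc32(new_data)     # step 2: checksum is updated

    def verify(self) -> bool:
        return zlib.crc32(bytes(self.data)) == self.csum


if __name__ == "__main__":
    blk = NocowBlock(b"OLD!" + bytes(4092))
    blk.overwrite(b"NEW!" + bytes(4092), crash_before_csum=True)
    print("checksum still valid after crash:", blk.verify())
```

Reversing the two steps does not help: a crash between them then leaves
a new checksum describing old data.
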

   Hugo.

-- 
Hugo Mills             | Nothing wrong with being written in Perl... Some of
hugo@... carfax.org.uk | my best friends are written in Perl.
http://carfax.org.uk/  |
PGP: E2AB1DE4          |                                                  dark


* Re: btrfs autodefrag?
  2015-10-18  5:46 ` Duncan
  2015-10-18 12:44   ` Xavier Gnata
@ 2015-10-19  6:04   ` Paul Harvey
  1 sibling, 0 replies; 10+ messages in thread
From: Paul Harvey @ 2015-10-19  6:04 UTC (permalink / raw)
  To: Duncan; +Cc: linux-btrfs

On 18 October 2015 at 16:46, Duncan <1i5t5.duncan@cox.net> wrote:
> Xavier Gnata posted on Sat, 17 Oct 2015 18:36:32 +0200 as excerpted:
>
>> Hi,
>>
>> On a desktop equipped with an ssd with one 100GB virtual image used
>> frequently, what do you recommend?
>> 1) nothing special, it is all fine as long as you have a recent kernel
>> (which I do)
>> 2) Disabling copy-on-write for just the VM image directory.
>> 3) autodefrag as a mount option.
>> 4) something else.
>>
>> I don't think this usecase is well documented therefore I asked this
>> question.

[snip]

> So ssd or spinning rust, there's serious conflicts between nocow and
> snapshotting that really must be taken into consideration if you're
> planning to both snapshot and nocow.

This is all spot on advice, but I just wanted to chime in to mention:
I've been experimenting with -
- Active working copy of VM image files are hosted on non-btrfs filesystems
- Regular scheduled rsync --inplace onto a btrfs subvol copy of the
file that *is* snapshotted and part of regular send/receive runs.

rsync --inplace does what it says on the tin: it just rewrites those
parts of a file which need to be updated. Thus it only gets written to
once prior to each snapshot run, rather than continuously.

So the theory is that I can retain CoW storage efficiency (hold lots
of snapshots cheaply) but still keep decent performance (by running
the active, in-use working copies outside of my normal snapshotted
btrfs filesystems).

The cost is obviously more filesystems than you'd normally have to
run, more complex disaster recovery, not to mention storage sizing has
to accommodate a working copy on a separate fs to the archived copies.
Plus, this rsync approach has noticeably bigger I/O overhead than
btrfs send/receive, although in my environment nobody is noticing.
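
To make the "only rewrite what changed" idea concrete, here is a minimal
Python sketch.  It is not rsync's actual algorithm (rsync uses its own
delta-transfer protocol); update_inplace() is a hypothetical helper that
simply compares fixed-size blocks of two local files and overwrites only
the blocks that differ, so unchanged blocks in the snapshotted copy are
left untouched.

```python
def update_inplace(src_path: str, dst_path: str, block_size: int = 4096) -> int:
    """Rewrite only the blocks of dst that differ from src.

    dst must already exist.  Returns the number of blocks rewritten.
    """
    rewritten = 0
    with open(src_path, "rb") as src, open(dst_path, "r+b") as dst:
        offset = 0
        while True:
            new = src.read(block_size)
            if not new:
                break
            dst.seek(offset)
            old = dst.read(len(new))
            if old != new:
                dst.seek(offset)
                dst.write(new)          # overwrite just this block
                rewritten += 1
            offset += len(new)
        dst.truncate(offset)            # drop any tail if the source shrank
    return rewritten
```

On a CoW filesystem, only the rewritten blocks need new extents, which is
why the snapshots of the copy can keep sharing everything else.
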

* Re: btrfs autodefrag?
  2015-10-18 14:40   ` Hugo Mills
@ 2015-10-19  6:19     ` Erkki Seppala
  2015-10-19 11:56       ` Austin S Hemmelgarn
  0 siblings, 1 reply; 10+ messages in thread
From: Erkki Seppala @ 2015-10-19  6:19 UTC (permalink / raw)
  To: linux-btrfs

Hugo Mills <hugo@carfax.org.uk> writes:
>    It has to be disabled because if you enable it, there's a race
> condition: since you're overwriting existing data (rather than CoWing
> it), you can't update the checksums atomically. So, in the interests
> of consistency, checksums are disabled.

I suppose this has been suggested before, but couldn't it store both the
new and the old checksums and be satisfied if either of them match?

The user is probably not happy that a partial write is going to be
difficult to read from the device due to a checksum error, but there is
no promise of recently-overwritten data state with traditional
filesystems either in case of sudden powerdown, assuming there is no
data journaling..

-- 
  _____________________________________________________________________
     / __// /__ ____  __               http://www.modeemi.fi/~flux/\   \
    / /_ / // // /\ \/ /                                            \  /
   /_/  /_/ \___/ /_/\_\@modeemi.fi                                  \/

* Re: btrfs autodefrag?
  2015-10-19  6:19     ` Erkki Seppala
@ 2015-10-19 11:56       ` Austin S Hemmelgarn
  2015-10-19 16:13         ` Erkki Seppala
  0 siblings, 1 reply; 10+ messages in thread
From: Austin S Hemmelgarn @ 2015-10-19 11:56 UTC (permalink / raw)
  To: Erkki Seppala, linux-btrfs

On 2015-10-19 02:19, Erkki Seppala wrote:
> Hugo Mills <hugo@carfax.org.uk> writes:
>>     It has to be disabled because if you enable it, there's a race
>> condition: since you're overwriting existing data (rather than CoWing
>> it), you can't update the checksums atomically. So, in the interests
>> of consistency, checksums are disabled.
>
> I suppose this has been suggested before, but couldn't it store both the
> new and the old checksums and be satisfied if either of them match?
Actually, I don't think that's been suggested before, read on however 
for an explanation of why we don't do that.
>
> The user is probably not happy that a partial write is going to be
> difficult to read from the device due to a checksum error, but there is
> no promise of recently-overwritten data state with traditional
> filesystems either in case of sudden powerdown, assuming there is no
> data journaling..
And that is exactly the case with how things are now, when something is 
marked NOCOW, it has essentially zero guarantee of data consistency 
after a crash.  As things are now though, there is a guarantee that you 
can still read the file, but using checksums like you suggest would 
result in it being unreadable most of the time, because it's 
statistically unlikely that we wrote the _whole_ block (IOW, we can't 
guarantee without COW that the data was completely written) because:
a. While some disks do atomically write single sectors, most don't, and 
if the power dies during the disk writing a single sector, there is no 
certainty exactly what that sector will read back as.
b. Assuming that item a is not an issue, one block in BTRFS is usually 
multiple sectors on disk, and a majority of disks have volatile write 
caches, thus it is not unlikely that the power will die during the 
process of writing the block.
c. In the event that both items a and b are not an issue (for example, 
you have a storage controller with a non-volatile write cache, have 
write caching turned off on the disks, and it's a smart enough storage 
controller that it only removes writes from the cache after they 
return), then there is still the small but distinct possibility that the 
crash will cause either corruption in the write cache, or some other 
hardware related issue.
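
A small Python sketch of item b may help show why accepting either the
old or the new checksum does not rescue a partially written block.  The
sizes, the function names, and the use of zlib.crc32 are assumptions for
illustration only; it also assumes sectors are written in order, which
real devices do not guarantee.

```python
import zlib

SECTOR = 512     # assumed device sector size
BLOCK = 4096     # assumed filesystem block size (8 sectors here)


def torn_write(old: bytes, new: bytes, sectors_written: int) -> bytes:
    """Block contents if power fails after only some sectors were written."""
    cut = sectors_written * SECTOR
    return new[:cut] + old[cut:]


def readable(block: bytes, old_csum: int, new_csum: int) -> bool:
    """The proposed rule: accept the block if either checksum matches."""
    return zlib.crc32(block) in (old_csum, new_csum)


if __name__ == "__main__":
    old = bytes(BLOCK)
    new = bytearray(BLOCK)
    new[0] = 0xFF                 # change in the first sector
    new[BLOCK - 1] = 0xFF         # change in the last sector
    new = bytes(new)
    old_csum, new_csum = zlib.crc32(old), zlib.crc32(new)
    for n in (0, 3, BLOCK // SECTOR):
        state = torn_write(old, new, n)
        print(n, "sectors written, readable:", readable(state, old_csum, new_csum))
```

A block that is part old and part new is neither of the two states the
checksums describe, so it would be rejected under that rule.
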




* Re: btrfs autodefrag?
  2015-10-19 11:56       ` Austin S Hemmelgarn
@ 2015-10-19 16:13         ` Erkki Seppala
  2015-10-19 19:48           ` Austin S Hemmelgarn
  0 siblings, 1 reply; 10+ messages in thread
From: Erkki Seppala @ 2015-10-19 16:13 UTC (permalink / raw)
  To: linux-btrfs

Austin S Hemmelgarn <ahferroin7@gmail.com> writes:

> And that is exactly the case with how things are now, when something
> is marked NOCOW, it has essentially zero guarantee of data consistency
> after a crash.

Yes. In addition to the zero guarantee of the data validity for the data
being written into, btrfs also doesn't give any guarantees for the rest
of the data, even if it was perfectly quiescent, but was just marked COW
at the time it was written :).

>  As things are now though, there is a guarantee that
> you can still read the file, but using checksums like you suggest
> would result in it being unreadable most of the time, because it's
> statistically unlikely that we wrote the _whole_ block (IOW, we can't
> guarantee without COW that the data was completely written) because:

Well, the amount of data being written at any given time is very small
compared to the whole device. So it's not all the data that is at risk
of having the wrong checksum. Given how small blocks are (4k) I really
doubt that the likelihood of large amounts of data remaining unreadable
would be great.

However, here's a compromise: when detecting an error on a COW file,
instead of refusing to read it, produce a warning to the kernel log. In
addition, when scrubbing it, as a last resort after trying other copies,
the checksum could simply be repaired, paired with an appropriate log
message. Such a log message would not indicate that the data is wrong,
but that the system administrator might be interested in checking it,
for example against backups, or by perhaps running a scrub within the
virtual machine.

If the scrub would say everything is OK, then certainly everything would
be OK.

> a. While some disks do atomically write single sectors, most don't,
> and if the power dies during the disk writing a single sector, there
> is no certainty exactly what that sector will read back as.

So it seems that the majority vote is not to provide a feature to the
minority.. :)

> b. Assuming that item a is not an issue, one block in BTRFS is usually
> multiple sectors on disk, and a majority of disks have volatile write
> caches, thus it is not unlikely that the power will die during the
> process of writing the block.

I'm not at all familiar with the on-disk structure of Btrfs, but it
seems that indeed the block size is 16 kilobytes by default, so the risk
of one of the four device-blocks (on modern 4kB-sector HDDs) being
corrupted or only a set of them having been written is real. But,
there's only so much data in-flight at any given time.

I did read that there are two checksums (on Wikipedia,
Btrfs#Checksum_tree..): one per block, and one per a contiguous run of
allocated blocks. The latter checksum seems more likely to be broken,
but I don't see why in that case the per-block checksums (or one of the
two checksums I proposed) couldn't be referred to. This is of course
because I don't understand much of the Btrfs on-disk format, technical
feasibility be damned :).

I understand that the metadata is always COW, so that level of
corruption cannot occur.

> c. In the event that both items a and b are not an issue (for example,
> you have a storage controller with a non-volatile write cache, have
> write caching turned off on the disks, and it's a smart enough storage
> controller that it only removes writes from the cache after they
> return), then there is still the small but distinct possibility that
> the crash will cause either corruption in the write cache, or some
> other hardware related issue.

However, should this not be the case, for example when my computer is
never brought down abruptly, it could still be valuable information to
see that the data has not changed behind my back.

I understand it is the prime motivation behind btrfs scrubbing in any
case; otherwise there could be a faster 'queue a verify after a write'
that would never scrub the same data twice.

-- 
  _____________________________________________________________________
     / __// /__ ____  __               http://www.modeemi.fi/~flux/\   \
    / /_ / // // /\ \/ /                                            \  /
   /_/  /_/ \___/ /_/\_\@modeemi.fi                                  \/

* Re: btrfs autodefrag?
  2015-10-19 16:13         ` Erkki Seppala
@ 2015-10-19 19:48           ` Austin S Hemmelgarn
  0 siblings, 0 replies; 10+ messages in thread
From: Austin S Hemmelgarn @ 2015-10-19 19:48 UTC (permalink / raw)
  To: Erkki Seppala, linux-btrfs

On 2015-10-19 12:13, Erkki Seppala wrote:
> Austin S Hemmelgarn <ahferroin7@gmail.com> writes:
>
>> And that is exactly the case with how things are now, when something
>> is marked NOCOW, it has essentially zero guarantee of data consistency
>> after a crash.
>
> Yes. In addition to the zero guarantee of the data validity for the data
> being written into, btrfs also doesn't give any guarantees for the rest
> of the data, even if it was perfectly quiescent, but was just marked COW
> at the time it was written :).
Assuming you do actually mean COW and not NOCOW, there is 
a guarantee that the data will either:
1. Match the original data prior to the write.
2. Match the data that was written.
or, if you are using only single copies of the metadata blocks and the 
system crashes exactly during a write to a metadata block:
3. Everything under that metadata block will become inaccessible, and 
require usage of btrfs-progs to recover.

In the case of NOCOW however, there is absolutely no such guarantee 
(just like ext4 for example can not provide such a guarantee), and any 
of the above could be the case, or any arbitrary portion of the new data 
could have been written.
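
For contrast, the COW guarantee (cases 1 and 2 above) can be sketched as
a toy model in Python.  CowFile is a made-up class, the single "pointer"
stands in for the metadata, and the model assumes the pointer switch is
atomic; it deliberately ignores case 3 (damage to the metadata itself).

```python
from typing import Optional


class CowFile:
    """Toy copy-on-write file: data is never overwritten in place."""

    def __init__(self, data: bytes):
        self.blocks = {0: data}    # block address -> contents
        self.pointer = 0           # stand-in for metadata: the current block
        self.next_addr = 1

    def write(self, new_data: bytes, crash_after: Optional[int] = None):
        addr = self.next_addr
        self.next_addr += 1
        if crash_after is not None:
            # Crash while writing the new copy: it is incomplete, but
            # nothing points at it, so the old data is still what is read.
            self.blocks[addr] = new_data[:crash_after]
            return
        self.blocks[addr] = new_data
        self.pointer = addr        # one switch makes the new data current

    def read(self) -> bytes:
        return self.blocks[self.pointer]


if __name__ == "__main__":
    f = CowFile(b"old contents")
    f.write(b"new contents", crash_after=3)
    print(f.read())
    f.write(b"new contents")
    print(f.read())
```

A reader only ever sees the complete old data or the complete new data,
never a mixture, which is what the NOCOW path cannot promise.
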
>>   As things are now though, there is a guarantee that
>> you can still read the file, but using checksums like you suggest
>> would result in it being unreadable most of the time, because it's
>> statistically unlikely that we wrote the _whole_ block (IOW, we can't
>> guarantee without COW that the data was completely written) because:
>
> Well, the amount of data being written at any given time is very small
> compared to the whole device. So it's not all the data that is at risk
> of having the wrong checksum. Given how small blocks are (4k) I really
> doubt that the likelihood of large amounts of data remaining unreadable
> would be great.
That very much depends on how you are using things.  For many of the types 
of things which NOCOW should be used for, directio and AIO are also very 
commonly used, and those can write chunks much bigger than BTRFS's block 
size in one go.
>
> However, here's a compromise: when detecting an error on a COW file,
> instead of refusing to read it, produce a warning to the kernel log. In
> addition, when scrubbing it, as a last resort after trying other copies,
> the checksum could simply be repaired, paired with an appropriate log
> message. Such a log message would not indicate that the data is wrong,
> but that the system administrator might be interested in checking it,
> for example against backups, or by perhaps running a scrub within the
> virtual machine.
In this case I'm assuming you mean NOCOW instead of COW, as the 
corruption can't be detected in a NOCOW file by BTRFS.

In a significant majority of cases, it is actually better to return no 
data than to return known corrupted data (think medical or military 
applications, in those kinds of cases it's quite often worse to act on 
incorrect data than it is to not act at all).  Disk images for virtual 
machines are one of the very few rare cases where this is not true, 
simply because they can usually correct the corruption themselves.
>
> If the scrub would say everything is OK, then certainly everything would
> be OK.
That's a _very_ optimistic point of view to take, and doesn't take into 
account software bugs, or potential hardware problems.
>
>> a. While some disks do atomically write single sectors, most don't,
>> and if the power dies during the disk writing a single sector, there
>> is no certainty exactly what that sector will read back as.
>
> So it seems that the majority vote is not to provide a feature to the
> minority.. :)
For something that provides a false sense of data safety and is 
potentially easy to shoot yourself in the foot with?  Yes we will almost 
certainly not provide it.  If, however, you wish to write a patch to 
provide such a feature (or pay someone to do so for you), there is 
nothing stopping you from doing so, and if it's something that people 
actually want, then it will likely end up included.
>> b. Assuming that item a is not an issue, one block in BTRFS is usually
>> multiple sectors on disk, and a majority of disks have volatile write
>> caches, thus it is not unlikely that the power will die during the
>> process of writing the block.
>
> I'm not at all familiar with the on-disk structure of Btrfs, but it
> seems that indeed the block size is 16 kilobytes by default, so the risk
> of one of the four device-blocks (on modern 4kB-sector HDDs) being
> corrupted or only a set of them having been written is real. But,
> there's only so much data in-flight at any given time.
While the default is usually 16k, there are situations where it may be 
different, for example if the system has a page size greater than 16k 
(some ARM64, PPC, and MIPS systems use 64k pages), or if it's a small 
filesystem (in which case the blocks will be 4k).

It is also worth noting that while most 'modern' HDDs use 4k sectors:
1. They are still vastly outnumbered by older HDDs that use 512 byte 
sectors.
2. A significant percentage of them use 512 byte virtual sectors (that 
is, they expose a 512 byte sector based interface to the OS, but use 4k 
sectors internally, which has potentially dangerous implications if 
their firmware is not well written).
3. SSDs internally use much bigger block sizes (the smallest erase 
block size that I've personally seen in an SSD is 1M, usually it's 2M or 
4M).  The implications of this are pretty scary for cheap SSDs (and OCZ 
SSDs, which are not by any means cheap) that don't include 
super-capacitors to ensure that power-loss in the middle of a write 
won't interrupt the write.
4. I've heard rumors of some exotic ones out there that use 64k sectors 
on disk.
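
The arithmetic behind "one block is multiple sectors" is simple enough to
spell out.  The following Python sketch only does the division for the
block and sector sizes named above; sectors_per_block() is a hypothetical
helper, and it says nothing about which size a given filesystem actually
uses.

```python
def sectors_per_block(block_size: int, sector_size: int) -> int:
    """How many device sectors one filesystem block spans (at least 1)."""
    return max(1, block_size // sector_size)


if __name__ == "__main__":
    for block in (4096, 16384, 65536):
        for sector in (512, 4096):
            count = sectors_per_block(block, sector)
            print(f"{block:>6}-byte block on {sector:>4}-byte sectors: "
                  f"{count} sector writes")
```

The more sector writes a single block needs, the more intermediate states
a power loss can leave behind.
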
>
> I did read that there are two checksums (on Wikipedia,
> Btrfs#Checksum_tree..): one per block, and one per a contiguous run of
> allocated blocks. The latter checksum seems more likely to be broken,
> but I don't see why in that case the per-block checksums (or one of the
> two checksums I proposed) couldn't be referred to. This is of course
> because I don't understand much of the Btrfs on-disk format, technical
> feasibility be damned :).
>
> I understand that the metadata is always COW, so that level of
> corruption cannot occur.
Oh, it can occur in reality, it's just a _statistical_ impossibility.
>> c. In the event that both items a and b are not an issue (for example,
>> you have a storage controller with a non-volatile write cache, have
>> write caching turned off on the disks, and it's a smart enough storage
>> controller that it only removes writes from the cache after they
>> return), then there is still the small but distinct possibility that
>> the crash will cause either corruption in the write cache, or some
>> other hardware related issue.
>
> However, should this not be the case, for example when my computer is
> never brought down abruptly, it could still be valuable information to
> see that the data has not changed behind my back.
Well yes, but if that is the case, then you shouldn't be worrying about 
anything, as un-mounting the filesystem requires that there be no open 
files on it, and it explicitly flushes all the buffered writes in RAM 
out to disk.

On the other hand, if you're worried about your disk or other hardware 
having issues, then you should be seriously considering verifying that 
it works correctly, and replacing it if it doesn't, and just using BTRFS 
on it is not a safe or even remotely reliable way to detect hardware 
failures.
>
> I understand it is the prime motivation behind btrfs scrubbing in any
> case; otherwise there could be a faster 'queue a verify after a write'
> that would never scrub the same data twice.
Actually, having the ability to tell it to verify a block after writing 
it would potentially be a very useful feature for unreliable hardware, 
assuming you're willing to take the performance penalty for the 
additional read on every write.
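
From userspace, the idea looks roughly like the Python sketch below.
write_and_verify() is a hypothetical helper, not an existing BTRFS
feature.  Note the important caveat: without direct I/O the read-back is
normally served from the page cache, so this does not really exercise the
medium; a real implementation would have to live in the filesystem or
bypass the cache.

```python
import os


def write_and_verify(path: str, offset: int, data: bytes) -> bool:
    """Write data at offset, flush it, read it back, and compare.

    Costs one extra read per write, as noted above.
    """
    fd = os.open(path, os.O_RDWR)
    try:
        os.pwrite(fd, data, offset)
        os.fsync(fd)                  # push the write towards the device
        return os.pread(fd, len(data), offset) == data
    finally:
        os.close(fd)
```
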



