From: Duncan <1i5t5.duncan@cox.net>
To: linux-btrfs@vger.kernel.org
Subject: Re: Strange prformance degradation when COW writes happen at fixed offsets
Date: Sat, 25 Feb 2012 03:34:00 +0000 (UTC)
Message-ID: <pan.2012.02.25.03.34.00@cox.net>
In-Reply-To: CAB3q876D=jFQF8D+PypAkZM3Uuxgwmf+tjLB9m9Pb5JZsCHZTw@mail.gmail.com

Nik Markovic posted on Fri, 24 Feb 2012 14:38:57 -0600 as excerpted:

> On Fri, Feb 24, 2012 at 12:38 AM, Duncan <1i5t5.duncan@cox.net> wrote:
>> Nik Markovic posted on Thu, 23 Feb 2012 20:31:02 -0600 as excerpted:
>>
>>> I noticed a few errors in the script that I used. I corrected it and
>>> it seems that degradation is occurring even at fully random writes:
>>
>> I don't have an ssd, but is it possible that you're simply seeing
>> erase-block related degradation due to multi-write-block sized
>> erase-blocks?
>>
>> It seems to me that when originally written to the btrfs-on-ssd, the
>> file will likely be written block-sequentially enough that the file as
>> a whole takes up relatively few erase-blocks.  As you COW-write
>> individual blocks, they'll be written elsewhere, perhaps all the
>> changed blocks to a new erase-block, perhaps each to a different
>> erase-block.
>
> This is a very interesting insight. I wasn't even aware of the
> erase-block issue, so I did some reading up on it...

I take it you looked at TRIM/discard, then, as well?  In theory and for
some SSD firmware, it works well at helping to alleviate the problem by
informing the SSD of data areas that it no longer needs to care about
(empty space), thus allowing more effective management of those
erase-blocks.

Reality is, however, not quite so simple.  It doesn't help a lot with
some SSDs, and there's a potential performance issue when doing the
discard, especially on earlier devices: TRIM is an unqueueable command
in the earlier standards (I've read it's defined as queueable in the
latest standards, however), which forces a flush of all activity in the
queue before the discard, potentially triggering I/O freeze behavior.
Additionally, when run on top of dm-crypt, there's a potential security
issue: examination of the raw undecrypted storage reveals whether
there's data there or not, and possibly the filesystem type used, based
on patterns.  That's a potential deniability issue in that they know
the data is there, tho it doesn't affect the strength of the encryption
itself.

So since on a lot of firmware it doesn't make a lot of difference anyway,
and there's a couple of down sides, the btrfs ssd mount option does NOT
enable discard as well.  However, there *IS* a discard option that you
can experiment with if you like, and it probably WILL help with
erase-block handling on SOME firmware.

See the FAQ, part 3 Features, # 3.4 on TRIM/discard, for a bit more. I've
really covered what it says, above, but there's a link to the encryption
security vs TRIM research, for instance.  And the discard mount-option
for whatever reason isn't listed in mount options, or at least I didn't
see it, only in the FAQ.

http://btrfs.ipv5.de/index.php?title=FAQ#Does_Btrfs_support_TRIM.2Fdiscard.3F

Bottom line, if it is indeed an erase-block issue, the discard mount
option MIGHT help, or it might not, depending on your device firmware.
It's an experiment-and-see thing.
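
Before experimenting, it may be worth checking whether the kernel reports
discard support for the device at all.  A minimal Python sketch, assuming
the Linux sysfs attribute /sys/block/<dev>/queue/discard_max_bytes (where
0 means no discard support); the helper name and its sysfs_root parameter
are made up for illustration, and a nonzero value says nothing about how
well a given firmware actually handles TRIM:

```python
from pathlib import Path


def supports_discard(device, sysfs_root="/sys/block"):
    """Return True if the kernel reports a nonzero discard limit for `device`.

    `device` is a whole-disk name such as "sda" (a hypothetical example).
    A missing or unreadable attribute is treated as "no discard support".
    """
    attr = Path(sysfs_root) / device / "queue" / "discard_max_bytes"
    try:
        return int(attr.read_text().strip()) > 0
    except (OSError, ValueError):
        return False
```

Calling supports_discard("sda") on the test box would then tell you
whether mounting with the discard option can have any effect there.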

>> As you increase the successive COW generation count, the file's
>> file-system/write blocks will be spread thru more and more erase-blocks,
>> basically fragmentation but of the SSD-critical type, into more and
>> more erase blocks, thus affecting modification and removal time but not
>> read time.
>
> OK, so time to write would increase due to fragmentation and writing, it
> now makes sense (though I don't see why small writes would affect this,
> but my concerns are not writes anyway), but why would cp --reflink time
> increase so much. Yes, new extents would be created, but btrfs doesn't
> write into data blocks, does it? I figured its metadata would be kept in
> one place. I figure the only thing BTRFS would do on cp
> --reflink=always:
> 1. Take a collection of extents owned by source.
> 2. Make the new copy use the same collection of extents.
> 3. Write the collection of extents to the "directory".
>
> Now this process seems to be CPU intensive. When I remove or make a
> reflink copy, one core spikes up to 100%, which tells me that there's a
> performance issue there, not an ssd issue. Also, only one CPU thread is
> being used for this. I figured that I can improve this by some setting.
> Maybe thread_pool mount option? Are there any updates in later kernels
> that I should possibly pick up?

FWIW... I am by no means an expert on this.  I /think/ I understand
enough of it to somewhat guide trial and error testing to arrive at a
reasonable if not best-case config for any setup I might deal with, and
well enough to hopefully point you in the right direction for your own
research and testing, but I'm not going to claim to be able to explain
the whys of individual cases, or even necessarily to understand them
myself, just understand enough to know of the issue and to resolve it
by trial and error to a hopefully reasonable situation on any hardware
I might run.

However, I could speculate (enough to guide my own testing were I
troubleshooting here) that it's one of several things or more likely a
combination of them.

One, I'm not sure if the metadata ends up being COW also, or not, but if
it is, then your test case is fragmenting it too, thus explaining the
reflink copy issue.  And keep in mind that by default, btrfs uses DUP for
metadata, so there's TWO copies of it written, thereby DOUBLING the
performance effects of anything affecting metadata!

Two, see the FAQ deduplication question/answer a couple questions below
the TRIM/discard one mentioned above.  I'm rather fuzzy on the filesystem
implications of this myself, but it seems to me that our COW assumptions
might be wrong because they're assuming deduplication effects that simply
aren't the way btrfs works presently, as it hasn't implemented
deduplication.  Admittedly, this is at best a handwavy black-box factor,
but that's the best I can do with it, presently.  I guess that at least
gives you another place to do additional research, if it comes to that.
(In this regard I do wish the COW subsection of the sysadmin guide page
on the wiki was written, it's simply punted ATM, since there's a fair
chance that a good explanation there would cover the filesystem viewpoint
differences between full deduplication and the COW that btrfs does,
perhaps clearing up some misconceptions people including me may have
about it, as well.)

Three, as evident in the discussion on the nodatacow and autodefrag
options I mentioned before, there's known issues with some use cases
involving large files and rewrites of data at random locations within
them.  But I'm not sure if these known issues are simply the ones we've
been discussing, or if there's other factors I'm unaware of in this
regard.  Knowing more about just what those known issues are and the
specific scenarios under which they occur, could go a long way toward
resolving the situation for you.
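
To make the "rewrites at random locations within large files" case a
bit more concrete, here's a toy Python model of how such rewrites
fragment a COW file.  It is NOT btrfs's allocator: it simply assumes
every rewritten block is relocated to the next free block past the end
of the file, and counts runs of physically consecutive blocks as
"extents".  All names are invented for illustration.

```python
import random


def count_extents(block_locations):
    """Count runs of physically consecutive blocks (a stand-in for extents)."""
    extents = 1
    for prev, cur in zip(block_locations, block_locations[1:]):
        if cur != prev + 1:
            extents += 1
    return extents


def simulate_cow_rewrites(file_blocks, rewrites, seed=0):
    """Toy model: each rewritten block is relocated to the next free block.

    This only illustrates how random updates fragment a COW file;
    a real allocator is far more involved.
    """
    rng = random.Random(seed)
    locations = list(range(file_blocks))  # initially one contiguous extent
    next_free = file_blocks               # free space starts right after the file
    for _ in range(rewrites):
        logical = rng.randrange(file_blocks)
        locations[logical] = next_free    # COW: the new copy lands elsewhere
        next_free += 1
    return count_extents(locations)
```

Running simulate_cow_rewrites() with a growing rewrite count shows the
extent count climbing with each generation, which is the sort of growth
that would make per-file metadata operations more expensive over time.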

But I'm only a recent list regular, joining a few weeks ago as part of my
own research into btrfs (FWIW my use case involves N-way mirroring, with
N=3-4; since only no-mirroring and N=2 is available today and 3-way/n-way
is planned to layer on top of raid5/6, which is planned for kernel
3.4/3.5, I'm now waiting for that... while continuing to stay current on
the list), so whatever research or test cases led to the remarks on the
wiki regarding large files with random data rewrites predate my
involvement, likely by quite some time.

Four, there's additional block alignment issues having to do with the
alignment of the partition on the physical storage, as it relates to
read-, write- and erase-block sizes and alignment.  On SSDs, erase-block
sizes are the biggest, so the optimum alignment would be to erase-block
size.  Getting it wrong can result in multiple block writes and/or erases
where proper alignment would require only one.  This phenomenon is called
write-amplification (and less commonly, erase-amplification).  However,
depending on what you used to create the partition on which the
filesystem resides (and loopback files do tend toward worst-case), it's
quite possible you don't have block-alignment level control at all.
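
The arithmetic behind that is simple enough to sketch in Python.  The
helper below counts how many erase-blocks a single write overlaps, given
where the partition starts on the device.  The 512 KiB erase-block size
in the usage note is purely an assumed figure (real sizes vary by device
and are often not published), and the function name is made up.

```python
def erase_blocks_touched(partition_start, fs_offset, length, erase_block_size):
    """Number of erase-blocks overlapped by one write.

    All arguments are in bytes. `partition_start` is the partition's offset
    on the device, `fs_offset` the write's offset within the partition.
    """
    if length <= 0:
        return 0
    physical = partition_start + fs_offset
    first = physical // erase_block_size
    last = (physical + length - 1) // erase_block_size
    return last - first + 1
```

With an assumed 512 KiB erase-block, a 512 KiB write at the start of a
partition beginning at the old sector-63 offset (32256 bytes) touches 2
erase-blocks, while the same write on a partition starting at 1 MiB
touches only 1.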

FWIW, that's one use case for the mkbtrfs/mkfs.btrfs --alloc-start/-A
option, since that allows you to align the allocation within the
partition as necessary for alignment, regardless of the partition
alignment.
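
As a rough sketch of how one might pick such a value: the Python helper
below computes how many bytes to skip from the start of a misaligned
partition to reach the next erase-block boundary.  Whether that number
is exactly what --alloc-start expects is an assumption on my part, so
check the mkfs.btrfs manpage before relying on it; the function name and
the erase-block size used in the note are likewise only illustrative.

```python
def alloc_start_padding(partition_start, erase_block_size):
    """Bytes to skip so that allocation begins on an erase-block boundary.

    Both arguments are in bytes; returns 0 for an already-aligned partition.
    """
    return (-partition_start) % erase_block_size
```

For a partition starting at byte 32256 (sector 63) and an assumed
512 KiB erase-block, this gives 492032 bytes of padding; for a partition
starting at 1 MiB it gives 0.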

FWIW2, gptfdisk (a gpt partitioner as opposed to the old mbr style) has
reasonable alignment defaults of 1 MiB on disks without an existing
partition layout, and attempts 8-sector (4 KiB) alignment even on
existing layouts, for disks >=300 GB at least.  That's what I've been
using for the last few years, having converted to gpt-based partitioning
for everything, even USB-thumb-drives, if partitioned.  (GPT was designed
for EFI, but can be used on BIOS based systems as well, which is what I'm
doing.  Grub2 understands gpt well and puts to good use any reserved BIOS
partition it finds, and there's options in the kernel for it that need
to be enabled as well.)

Block alignment is DEFINITELY something you can play with, in terms of
testing whether it makes a difference on your drives, SSD or "spinning
rust".

There's probably other factors involved of which I'm unaware, as well.

>> IIRC I saw a note about this on the wiki, in regard to the nodatacow
>> mount-option.

>> In addition to nodatacow, see the note on the autodefrag option.

> Unless I am wrong, this would disable COW completely and reflink copy.
> Reflinks are a crucial component and the sole reason I picked BTRFS for
> the system that I am writing for my company.
> The autodefrag option addresses multiple writes. Writing is not the
> problem, but cp --reflink should be near-instant. That was the reason we
> chose BTRFS over ZFS, which seemed to be the only feasible alternative.
> ZFS snapshots complicate the design and deduplication copy time is the
> same as (or not much better than) raw copy.

> As I mentioned above, the COW is the crucial component of our system,
> XFS won't do. Our system does not do random writes. In fact it is mainly
> heavy on read operation. The system does occasional "rotation of rust"
> on large files in a way that version control system would (large files
> are modified and then used as a new baseline)

Pardon me, I think I might have been too vague with that "rotating rust"
allusion and lost you.  Either that or you're taking the allusion out
even further and potentially lost me! =;^0

I meant spinning magnetic media with that "rotating rust" reference, the
"rotating rust" bit being a double entendre allusion both to the iron
oxide (rust) used as the data storage layer, and to the fact that many
view rotating magnetic media as a legacy technology (rusting out)
compared to SSDs. =:^)  As it happens, I saw that double-meaning
word-play used elsewhere recently with the same two allusions attached,
and liked it enough to use it myself, when I got the chance.  Only I'm
not sure you got the reference, because...

You used it quite differently, referring to file rotation.  So either you
saw my reference and upped the ante, so to speak, leaving me to pick up
the pieces, or I lost you with the original reference, one of the two.

But I guess we should be on the same page knowing each other's meaning,
now.  Meanwhile...

[I do see your followup mentioning that it doesn't actually disable /all/
COW, and that you tested it, without significant change in the results...]

FWIW, I wasn't so much SUGGESTING those options, as noting the
INFORMATION contained in their description, the bit about random writes
to large db files and their effect on btrfs.  But testing (which you did)
is a good idea, just to see what difference it makes, little in your
case, so either the nodatacow option isn't disabling it in your case
(specific use of cp --reflink), or the cow isn't the problem at all.


While you're at testing, tho, the question occurred to me of whether
simply using btrfs' snapshotting would make a difference.  (I did say I
don't claim a full understanding, and that trial and error testing would
be my method here, that I really only understand enough to hopefully
guide me a bit in what to test...)  Snapshotting by definition uses the
COW capabilities, but it occurs to me that since it's doing it on a
filesystem-wide basis instead of a single-file basis, that might allow
more efficiency in metadata handling.

Note that I don't necessarily expect that snapshotting would be a
workable final solution for you, but if in testing you discover that the
speed stays reasonable with the snapshot method (still only changing the
single file between snapshots), while it degrades (as you've found) with
the single cp --reflink method, then that's important data for the test
case, and given btrfs' state of development, it could well lead to major
optimizations of the single-file cp --reflink case as well, which you
presumably COULD use in final deployment.
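
A comparison like that could be scripted along these lines.  This is
only a Python sketch of a timing harness: it assumes GNU cp with
--reflink support and the btrfs-progs "btrfs subvolume snapshot"
command are installed, and the helper names (and the runner parameter,
there so the timing logic can be exercised without touching a real
filesystem) are invented here.

```python
import subprocess
import time


def reflink_cmd(src, dst):
    """Command line for a single-file reflink copy (GNU coreutils cp)."""
    return ["cp", "--reflink=always", src, dst]


def snapshot_cmd(subvol, dst):
    """Command line for a btrfs subvolume snapshot (btrfs-progs)."""
    return ["btrfs", "subvolume", "snapshot", subvol, dst]


def timed_run(cmd, runner=subprocess.run):
    """Run `cmd` via `runner` and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    runner(cmd, check=True)
    return time.perf_counter() - start
```

The idea would be to modify the test file, then record
timed_run(reflink_cmd(...)) and timed_run(snapshot_cmd(...)) for each
generation, and see whether one curve stays flat while the other climbs.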


> Thanks for all your help on this issue. I hope that someone can point
> out some more tweaks or added features/fixes after 3.2 RC5 that I may
> do.

Talking about which... since you mentioned 3.2-rc5, you do seem aware
that btrfs is still very much experimental and in active development,
and of the need for staying current on the kernel.

However, unless your testing is for a system with actual deployment
scheduled for say a year or more out, I'd question btrfs as a reasonable
solution in any case.  One of the things that a lot of people don't seem
to realize is just how much active btrfs development is still going on,
and that it's NOT just corner-case use cases such as the multi-mirror
raid1 that I'm waiting on ATM, but that there's still data corruption
issues being traced and fixed, etc.

IOW, btrfs isn't something I'd recommend on either a production system or
even a general user's system, for the time being.  If the intent is to
test btrfs, filling it with data that you are not only prepared to see
destroyed but expect to be, so that you either have backups or simply
don't value the data enough to back it up, and you're not counting on
the btrfs copy as anything but experimental "garbage" data expected to
be lost in testing, then that's FINE.  Such testing, and hopefully bug
reporting, and patching where possible, is what btrfs is out there for,
ATM.

But if the intent is to actually put production data on the filesystem,
or use it as the primary copy of data that you don't want to lose, btrfs
isn't an appropriate choice at this point, and I'd say probably won't be
until say Q4, or even next year, so if your production deployment is
scheduled for before that, really, you shouldn't be looking at btrfs for
it, as it's not fit for that purpose ATM and isn't likely to be, for
another year or so (and even then, it'll be suitable for only the early
adopters, the cautious folk will wait another year or more after that,
just as many of the cautious folk are only now warming to ext4 as opposed
to ext3).

I just don't want to see you back here as one of those folks asking
questions about recovering data on a screwed filesystem, because they had
no backups or the backups weren't kept current, because they were using
btrfs for real-life use beyond testing purposes, and that's simply not
the sort of use btrfs is designed for or can properly deliver at this
point!

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
