* Re: Strange performance degradation when COW writes happen at fixed offsets
From: Nik Markovic @ 2012-02-24 21:33 UTC
To: Duncan; +Cc: linux-btrfs
To add... I also tried the nodatasum (only) and nodatacow options. I found
somewhere that nodatacow doesn't really mean that COW is disabled.
Test data is still the same - CPU spikes and times are the same.
On Fri, Feb 24, 2012 at 2:38 PM, Nik Markovic <nmarkovi.navteq@gmail.com> wrote:
> On Fri, Feb 24, 2012 at 12:38 AM, Duncan <1i5t5.duncan@cox.net> wrote:
>> Nik Markovic posted on Thu, 23 Feb 2012 20:31:02 -0600 as excerpted:
>>
>>> I noticed a few errors in the script that I used. I corrected it and it
>>> seems that degradation is occurring even at fully random writes:
>>
>> I don't have an ssd, but is it possible that you're simply seeing
>> erase-block related degradation due to multi-write-block sized
>> erase-blocks?
>>
>> It seems to me that when originally written to the btrfs-on-ssd, the file
>> will likely be written block-sequentially enough that the file as a whole
>> takes up relatively few erase-blocks. As you COW-write individual
>> blocks, they'll be written elsewhere, perhaps all the changed blocks to a
>> new erase-block, perhaps each to a different erase block.
>
> This is a very interesting insight. I wasn't even aware of the
> erase-block issue, so I did some reading up on it...
>
>>
>> As you increase the successive COW generation count, the file's
>> file-system/write blocks will be spread thru more and more erase-blocks,
>> basically fragmentation but of the SSD-critical type, into more and more
>> erase blocks, thus affecting modification and removal time but not read
>> time.
>
> OK, so time to write would increase due to fragmentation and writing,
> it now makes sense (though I don't see why small writes would affect
> this, but my concerns are not writes anyway), but why would cp
> --reflink time increase so much? Yes, new extents would be created,
> but btrfs doesn't write into data blocks, does it? I figured its
> metadata would be kept in one place. I figure the only thing BTRFS
> would do on cp --reflink=always:
> 1. Take a collection of extents owned by source.
> 2. Make the new copy use the same collection of extents.
> 3. Write the collection of extents to the "directory".
>
> Now this process seems to be CPU intensive. When I remove or make a
> reflink copy, one core spikes up to 100%, which tells me that there's a
> performance issue there, not an ssd issue. Also, only one CPU thread
> is being used for this. I figured that I can improve this by some
> setting. Maybe thread_pool mount option? Are there any updates in
> later kernels that I should possibly pick up?
>
>>
>> IIRC I saw a note about this on the wiki, in regard to the nodatacow
>> mount-option. Let's see if I can find it again. Hmm... yes...
>>
>> http://btrfs.ipv5.de/index.php?title=Getting_started#Mount_Options
>>
>> In particular this (for nodatacow, read the rest as there's additional
>> implications):
>>
>>>>>>>
>> Performance gain is usually < 5% unless the workload is random writes to
>> large database files, where the difference can become very large.
>> <<<<<
>>
>
> Unless I am wrong, this would disable COW completely and reflink copy.
> Reflinks are a crucial component and the sole
> reason I picked BTRFS for the system that I am writing for my company.
> The autodefrag option addresses multiple writes. Writing is not the
> problem, but cp --reflink should be near-instant. That was the reason
> we chose BTRFS over ZFS, which seemed to be the only feasible
> alternative. ZFS snapshots complicate the design and deduplication copy
> time is the same as (or not much better than) raw copy.
>
>> In addition to nodatacow, see the note on the autodefrag option.
>>
>> IOW, with the repeated generations of random-writes to cow-copies, you're
>> apparently triggering a cow-worst-case fragmentation situation. It
>> shouldn't affect read-time much on SSD, but it certainly will affect copy
>> and erase time, as the data and metadata (which as you'll recall is 2X by
>> default on btrfs) gets written to more and more blocks that need updated
>> at copy/erase time,
>>
>>
>> That /might/ be the problem triggering the freezes you noted that set off
>> the original investigation as well, if the SSD firmware is running out of
>> erase blocks and having to pause access while it rearranges data to allow
>> operations to continue. Since your original issue on "rotating rust"
>> drives was fragmentation, rewriting would seem to be something you do
>> quite a lot of, triggering different but similar-cause issues on SSDs as
>> well.
>>
>> FWIW, with that sort of database-style workload, large files constantly
>> random-change rewritten, something like xfs might be more appropriate
>> than btrfs. See the recent xfs presentations (were they at ScaleX or
>> LinuxConf.au? both happened about the same time and were covered in the
>> same LWN weekly edition) as covered a couple weeks ago on LWN for more.
>>
>
> As I mentioned above, the COW is the crucial component of our system,
> XFS won't do. Our system does not do random writes. In fact it is
> mainly heavy on read operation. The system does occasional "rotation
> of rust" on large files in a way that version control system would
> (large files are modified and then used as a new baseline)
>
>> --
>> Duncan - List replies preferred. No HTML msgs.
>> "Every nonfree program has a lord, a master --
>> and if you use the program, he is your master." Richard Stallman
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
> Thanks for all your help on this issue. I hope that someone can point
> out some more tweaks or added features/fixes after 3.2 RC5 that I may
> do.
* Re: Strange performance degradation when COW writes happen at fixed offsets
From: Duncan @ 2012-02-25 3:34 UTC
To: linux-btrfs
Nik Markovic posted on Fri, 24 Feb 2012 14:38:57 -0600 as excerpted:
> On Fri, Feb 24, 2012 at 12:38 AM, Duncan <1i5t5.duncan@cox.net> wrote:
>> Nik Markovic posted on Thu, 23 Feb 2012 20:31:02 -0600 as excerpted:
>>
>>> I noticed a few errors in the script that I used. I corrected it and
>>> it seems that degradation is occurring even at fully random writes:
>>
>> I don't have an ssd, but is it possible that you're simply seeing
>> erase-block related degradation due to multi-write-block sized
>> erase-blocks?
>>
>> It seems to me that when originally written to the btrfs-on-ssd, the
>> file will likely be written block-sequentially enough that the file as
>> a whole takes up relatively few erase-blocks. As you COW-write
>> individual blocks, they'll be written elsewhere, perhaps all the
>> changed blocks to a new erase-block, perhaps each to a different erase
>> block.
>
> This is a very interesting insight. I wasn't even aware of the
> erase-block issue, so I did some reading up on it...
I take it you looked at TRIM/discard, then, as well? In theory and for
some SSD firmware, it works well at helping to alleviate the problem by
informing the SSD of data areas that it no longer needs to care about
(empty space), thus allowing more effective management of those
erase-blocks.
Reality is however not quite so simple, and it doesn't help a lot with
some SSDs, plus there's a potential performance issue when doing the
discard on especially earlier devices, since TRIM is an unqueueable
command in the earlier standards (I've read it's defined as queueable in
the latest standards, however), thus forcing a flush of all activity in
the queue before the discard, potentially triggering I/O freeze
behavior. Additionally, when run on top of dm-crypt, there's a potential
security issue (examination of the raw undecrypted storage reveals
whether there's data there or not, and possibly the filesystem type used
based on patterns, a potential deniability issue in that they know the
data is there, tho it doesn't affect the strength of the encryption
itself).
So since on a lot of firmware it doesn't make a lot of difference anyway,
and there's a couple of down sides, the btrfs ssd mount option does NOT
enable discard as well. However, there *IS* a discard option that you
can experiment with if you like, and it probably WILL help with
erase-block handling on SOME firmware.
See the FAQ, part 3 Features, # 3.4 on TRIM/discard, for a bit more. I've
really covered what it says, above, but there's a link to the encryption
security vs TRIM research, for instance. And the discard mount-option
for whatever reason isn't listed in mount options, or at least I didn't
see it, only in the FAQ.
http://btrfs.ipv5.de/index.php?title=FAQ#Does_Btrfs_support_TRIM.2Fdiscard.3F
Bottom line, if it is indeed an erase-block issue, the discard mount
option MIGHT help, or it might not, depending on your device firmware.
It's an experiment-and-see thing.
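One way to start that experiment is to check whether the kernel reports
discard support for the device at all. A minimal sketch, assuming a Linux
system that exposes discard_max_bytes and discard_granularity under
/sys/block/<dev>/queue/; the function name and the sysfs_root parameter are
illustrative, not part of any btrfs tool. A nonzero discard_max_bytes only
means the device accepts discards, not that its firmware makes good use of
them.

```python
from pathlib import Path


def discard_info(device: str, sysfs_root: str = "/sys/block") -> dict:
    """Return the discard limits the kernel reports for a block device.

    `device` is a bare name such as "sda" (an assumption; adjust for your
    system). Missing or unreadable sysfs files are treated as 0, which is
    reported as "no discard support".
    """
    queue = Path(sysfs_root) / device / "queue"

    def read_int(name: str) -> int:
        try:
            return int((queue / name).read_text().strip())
        except (OSError, ValueError):
            return 0

    max_bytes = read_int("discard_max_bytes")
    return {
        "supported": max_bytes > 0,
        "max_bytes": max_bytes,
        "granularity": read_int("discard_granularity"),
    }
```

Calling discard_info("sda") on the test machine before and after mounting
with the discard option will not change, but it tells you whether the
option can have any effect in the first place.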
>> As you increase the successive COW generation count, the file's
>> file-system/write blocks will be spread thru more and more erase-blocks,
>> basically fragmentation but of the SSD-critical type, into more and
>> more erase blocks, thus affecting modification and removal time but not
>> read time.
>
> OK, so time to write would increase due to fragmentation and writing, it
> now makes sense (though I don't see why small writes would affect this,
> but my concerns are not writes anyway), but why would cp --reflink time
> increase so much? Yes, new extents would be created, but btrfs doesn't
> write into data blocks, does it? I figured its metadata would be kept in
> one place. I figure the only thing BTRFS would do on cp
> --reflink=always:
> 1. Take a collection of extents owned by source.
> 2. Make the new copy use the same collection of extents.
> 3. Write the collection of extents to the "directory".
>
> Now this process seems to be CPU intensive. When I remove or make a
> reflink copy, one core spikes up to 100%, which tells me that there's a
> performance issue there, not an ssd issue. Also, only one CPU thread is
> being used for this. I figured that I can improve this by some setting.
> Maybe thread_pool mount option? Are there any updates in later kernels
> that I should possibly pick up?
FWIW... I am by no means an expert on this. I /think/ I understand
enough of it to somewhat guide trial and error testing to arrive at a
reasonable if not best-case config for any setup I might deal with, and
well enough to hopefully point you in the right direction for your own
research and testing, but I'm not going to claim to be able to explain
the whys of individual cases, or even necessarily to understand them
myself, just understand enough to know of the issue and to trial and
error resolve to a hopefully reasonable situation on any hardware I might
run.
However, I could speculate (enough to guide my own testing were I
troubleshooting here) that it's one of several things or more likely a
combination of them.
One, I'm not sure if the metadata ends up being COW also, or not, but if
it is, then your test case is fragmenting it too, thus explaining the
reflink copy issue. And keep in mind that by default, btrfs uses DUP for
metadata, so there's TWO copies of it written, thereby DOUBLING the
performance effects of anything affecting metadata!
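To make that speculation a bit more concrete, here is a toy model. It is
NOT btrfs's on-disk format or algorithm. It assumes a file is a list of
(start, length) extents, that a COW overwrite splits any extent it
overlaps, and that a reflink copy or a delete has to touch one reference
per extent. All names and sizes are made up for illustration. Under those
assumptions the per-copy work grows with the extent count, and random COW
writes keep pushing that count up.

```python
import random


def cow_write(extents, offset, length):
    """Overwrite [offset, offset + length) copy-on-write.

    `extents` is a sorted list of (start, length) tuples covering the file.
    The rewritten range becomes its own new extent; an old extent that only
    partially overlaps it is trimmed, so one write can add up to two extents.
    """
    end = offset + length
    result = []
    for start, size in extents:
        stop = start + size
        if stop <= offset or start >= end:
            result.append((start, size))            # untouched extent
            continue
        if start < offset:
            result.append((start, offset - start))  # surviving head
        if stop > end:
            result.append((end, stop - end))        # surviving tail
    result.append((offset, length))                 # newly written extent
    result.sort()
    return result


def simulate(file_blocks=10_000, generations=5, writes_per_gen=100, seed=0):
    """Return the extent count after each generation of random 1-block writes."""
    rng = random.Random(seed)
    extents = [(0, file_blocks)]
    counts = []
    for _ in range(generations):
        for _ in range(writes_per_gen):
            extents = cow_write(extents, rng.randrange(file_blocks), 1)
        counts.append(len(extents))
    return counts
```

If the real extent count behaves anything like simulate() suggests, then
comparing the number of extents per generation against the measured cp
--reflink time would be a cheap way to confirm or rule out this
explanation.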
Two, see the FAQ deduplication question/answer a couple questions below
the TRIM/discard one mentioned above. I'm rather fuzzy on the filesystem
implications of this myself, but it seems to me that our COW assumptions
might be wrong because they're assuming deduplication effects that simply
aren't the way btrfs works presently, as it hasn't implemented
deduplication. Admittedly, this is at best a handwavy black-box factor,
but that's the best I can do with it, presently. I guess that at least
gives you another place to do additional research, if it comes to that.
(In this regard I do wish the COW subsection of the sysadmin guide page
on the wiki was written, it's simply punted ATM, since there's a fair
chance that a good explanation there would cover the filesystem viewpoint
differences between full deduplication and the COW that btrfs does,
perhaps clearing up some misconceptions people including me may have
about it, as well.)
Three, as evident in the discussion on the nodatacow and autodefrag
options I mentioned before, there's known issues with some use cases
involving large files and rewrites of data at random locations within
them. But I'm not sure if these known issues are simply the ones we've
been discussing, or if there's other factors I'm unaware of in this
regard. Knowing more about just what those known issues are and the
specific scenarios under which they occur, could go a long way toward
resolving the situation for you.
But I'm only a recent list regular, joining a few weeks ago as part of my
own research into btrfs (FWIW my use case involves N-way mirroring, with
N=3-4; since only no-mirroring and N=2 is available today and 3-way/n-way
is planned to layer on top of raid5/6, which is planned for kernel
3.4/3.5, I'm now waiting for that... while continuing to stay current on
the list), so whatever research or test cases led to the remarks on the
wiki regarding large files with random data rewrites, predates my
involvement likely by quite some time.
Four, there's additional block alignment issues having to do with the
alignment of the partition on the physical storage, as it relates to
read-, write- and erase-block sizes and alignment. On SSDs, erase-block
sizes are the biggest, so the optimum alignment would be to erase-block
size. Getting it wrong can result in multiple block writes and/or erases
where proper alignment would require only one. This phenomenon is called
write-amplification (and less commonly, erase-amplification). However,
depending on what you used to create the partition on which the
filesystem resides (and loopback files do tend toward worst-case), it's
quite possible you don't have block-alignment level control at all.
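The alignment effect can be put in numbers. A small sketch, assuming a
hypothetical 512 KiB erase block (real sizes vary by device and are often
not published) and a write given as a byte offset from the start of the
device; erase_blocks_touched is an illustrative helper, not an existing
tool.

```python
def erase_blocks_touched(offset: int, length: int, erase_block: int) -> int:
    """Number of erase blocks overlapped by `length` bytes written at `offset`."""
    if length <= 0:
        return 0
    first = offset // erase_block
    last = (offset + length - 1) // erase_block
    return last - first + 1


ERASE_BLOCK = 512 * 1024   # assumed erase-block size, device dependent
WRITE = 512 * 1024         # one erase block's worth of data

# Write starting exactly on an erase-block boundary.
aligned = erase_blocks_touched(0, WRITE, ERASE_BLOCK)
# Same write shifted by a partition that starts at sector 63 (63 * 512 bytes),
# the traditional MBR-era starting sector.
misaligned = erase_blocks_touched(63 * 512, WRITE, ERASE_BLOCK)
print(aligned, misaligned)
```

With these assumed numbers the aligned write stays within one erase block,
while the same write shifted by the 63-sector offset straddles two.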
FWIW, that's one use case for the mkbtrfs/mkfs.btrfs --alloc-start/-A
option, since that allows you to align the allocation within the
partition as necessary for alignment, regardless of the partition
alignment.
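The value to pass is just the distance from the partition start to the
next erase-block boundary. A sketch of that arithmetic, assuming you know
the partition's starting byte offset and the device's erase-block size
(the 128 KiB below is only an example); the helper name is hypothetical,
and it's worth checking the mkfs.btrfs man page for your version before
relying on the option itself.

```python
def padding_to_alignment(partition_start: int, erase_block: int) -> int:
    """Bytes to skip inside the partition so that allocation begins on an
    erase-block boundary of the underlying device."""
    return (-partition_start) % erase_block


# Example: partition begins at sector 63 (512-byte sectors), with an
# assumed 128 KiB erase block.
start = 63 * 512
pad = padding_to_alignment(start, 128 * 1024)
print(pad, (start + pad) % (128 * 1024))
```

A partition that already starts on a boundary, such as one created at a
1 MiB offset, gets a padding of zero.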
FWIW2, gptfdisk (a gpt partitioner as opposed to the old mbr style) has
reasonable alignment defaults of 1 MiB on disks without an existing
partition layout, and attempts 8-sector (4 KiB) alignment even on
existing layouts, for disks >=300 GB at least. That's what I've been
using for the last few years, having converted to gpt-based partitioning
for everything, even USB-thumb-drives, if partitioned. (GPT was designed
for EFI, but can be used on BIOS based systems as well, which is what I'm
doing. Grub2 understands gpt well and puts to good use any reserved BIOS
partition it finds, and there's options in the kernel for it that need
enabled as well.)
Block alignment is DEFINITELY something you can play with, in terms of
testing whether it makes a difference on your drives, SSD or "spinning
rust".
There's probably other factors involved of which I'm unaware, as well.
>> IIRC I saw a note about this on the wiki, in regard to the nodatacow
>> mount-option.
>> In addition to nodatacow, see the note on the autodefrag option.
> Unless I am wrong, this would disable COW completely and reflink copy.
> Reflinks are a crucial component and the sole reason I picked BTRFS for
> the system that I am writing for my company.
> The autodefrag option addresses multiple writes. Writing is not the
> problem, but cp --reflink should be near-instant. That was the reason we
> chose BTRFS over ZFS, which seemed to be the only feasible alternative.
> ZFS snapshots complicate the design and deduplication copy time is the
> same as (or not much better than) raw copy.
> As I mentioned above, the COW is the crucial component of our system,
> XFS won't do. Our system does not do random writes. In fact it is mainly
> heavy on read operation. The system does occasional "rotation of rust"
> on large files in a way that version control system would (large files
> are modified and then used as a new baseline)
Pardon me, I think I might have been too vague with that "rotating rust"
allusion and lost you. Either that or you're taking the allusion out
even further and potentially lost me! =;^0
I meant spinning magnetic media with that "rotating rust" reference, the
"rotating rust" bit being a double entendre allusion both to the iron
oxide (rust) used as the data storage layer, and to the fact that many
view rotating magnetic media as a legacy technology (rusting out)
compared to SSDs. =:^) As it happens, I saw that double-meaning
word-play used elsewhere recently with the same two allusions attached,
and liked it enough to use it myself, when I got the chance. Only I'm not
sure you got the reference, because...
You used it quite differently, referring to file rotation. So either you
saw my reference and upped the ante, so to speak, leaving me to pick up
the pieces, or I lost you with the original reference, one of the two.
But I guess we should be on the same page knowing each other's meaning,
now. Meanwhile...
[I do see your followup mentioning that it doesn't actually disable /all/
COW, and that you tested it, without significant change in the results...]
FWIW, I wasn't so much SUGGESTING those options, as noting the
INFORMATION contained in their description, the random writes to large db
files and its effect on btrfs bit. But testing (which you did) is a good
idea, just to see what difference it makes, little in your case, so
either the nocow option isn't disabling it in your case (specific use of
cp --reflink), or the cow isn't the problem at all.
While you're at testing, tho, the question occurred to me of whether
simply using btrfs' snapshotting would make a difference. (I did say I
don't claim a full understanding, and that trial and error testing would
be my method here, that I really only understand enough to hopefully
guide me a bit in what to test...) Snapshotting by definition uses the
COW capacities, but it occurs to me that since it's doing it on a
filesystem-wide basis instead of a single-file basis, that might allow
more efficiency in metadata handling.
Note that I don't necessarily expect that snapshotting would be a
workable final solution for you, but if in testing you discover that the
speed stays reasonable with the snapshot method (still only changing the
single file between snapshots), while it degrades (as you've found) with
the single cp --reflink method, then that's important data for the test
case, and given btrfs' state of development, it could well lead to major
optimizations of the single-file cp --reflink case as well, which you
presumably COULD use in final deployment.
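A rough harness for that comparison could look like the following. It is a
sketch under assumptions: the paths are placeholders, the `prepare` hook
stands for whatever your test script does to the file between generations,
and the two command lines (cp --reflink=always and btrfs subvolume
snapshot) are the usual invocations but have not been run against your
setup.

```python
import subprocess
import time


def time_generations(step, generations, prepare=None):
    """Run step(i) once per generation and return the seconds each call took.

    `prepare(i)`, if given, runs before each step and is NOT timed; use it
    to modify the file between copies.
    """
    timings = []
    for i in range(generations):
        if prepare is not None:
            prepare(i)
        begin = time.perf_counter()
        step(i)
        timings.append(time.perf_counter() - begin)
    return timings


def reflink_step(src, dst_pattern):
    """Build a step that makes one reflink copy of `src` per generation."""
    def step(i):
        subprocess.run(
            ["cp", "--reflink=always", src, dst_pattern.format(i)], check=True
        )
    return step


def snapshot_step(subvol, dst_pattern):
    """Build a step that snapshots a whole subvolume per generation."""
    def step(i):
        subprocess.run(
            ["btrfs", "subvolume", "snapshot", subvol, dst_pattern.format(i)],
            check=True,
        )
    return step
```

For example, time_generations(reflink_step("/mnt/test/big.dat",
"/mnt/test/big.{}.dat"), 20, prepare=...) against
time_generations(snapshot_step("/mnt/test/vol", "/mnt/test/snap-{}"), 20,
prepare=...), with the same prepare function in both runs. If only the
first list of timings grows with the generation number, that points at the
single-file reflink path.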
> Thanks for all your help on this issue. I hope that someone can point
> out some more tweaks or added features/fixes after 3.2 RC5 that I may
> do.
Talking about which... since you mentioned 3.2-rc5, you do seem aware of
the fact that btrfs is still very much experimental status, in active
development, and the need for staying current on the kernel.
However, unless your testing is for a system with actual deployment
scheduled for say a year or more out, I'd question btrfs as a reasonable
solution in any case. One of the things that a lot of people don't seem
to realize is just how much active btrfs development is still going on,
and that it's NOT just corner-case use cases such as the multi-mirror
raid1 that I'm waiting on ATM, but that there's still data corruption
issues being traced and fixed, etc.
IOW, btrfs isn't something I'd recommend on either a production system or
even a general user's system, for the time being. If the intent is to
test btrfs, filling it with data that you are not only prepared for it to
be destroyed, but expect it to happen, so you not only have backups or
simply don't value the data enough to be worth backups, you're not
counting on the btrfs copy as anything but experimental "garbage" data,
expected to be lost in testing, as well, then that's FINE. Such testing,
and hopefully bug reporting, and patching where possible, is what btrfs
is out there for, ATM.
But if the intent is to actually put production data on the filesystem,
or use it as the primary copy of data that you don't want to lose, btrfs
isn't an appropriate choice at this point, and I'd say probably won't be
until say Q4, or even next year, so if your production deployment is
scheduled for before that, really, you shouldn't be looking at btrfs for
it, as it's not fit for that purpose ATM and isn't likely to be, for
another year or so (and even then, it'll be suitable for only the early
adopters, the cautious folk will wait another year or more after that,
just as many of the cautious folk are only now warming to ext4 as opposed
to ext3).
I just don't want to see you back here as one of those folks asking
questions about recovering data on a screwed filesystem, because they had
no backups or the backups weren't kept current, because they were using
btrfs for real-life use beyond testing purposes, and that's simply not
the sort of use btrfs is designed to or can properly deliver at this
point!
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman