From: Duncan <1i5t5.duncan@cox.net>
Subject: Re: Strange performance degradation when COW writes happen at fixed offsets
Date: Sat, 25 Feb 2012 03:34:00 +0000 (UTC)
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
To: linux-btrfs@vger.kernel.org

Nik Markovic posted on Fri, 24 Feb 2012 14:38:57 -0600 as excerpted:

> On Fri, Feb 24, 2012 at 12:38 AM, Duncan <1i5t5.duncan@cox.net> wrote:
>> Nik Markovic posted on Thu, 23 Feb 2012 20:31:02 -0600 as excerpted:
>>
>>> I noticed a few errors in the script that I used. I corrected it and
>>> it seems that degradation is occurring even at fully random writes:
>>
>> I don't have an ssd, but is it possible that you're simply seeing
>> erase-block related degradation due to multi-write-block sized
>> erase-blocks?
>>
>> It seems to me that when originally written to the btrfs-on-ssd, the
>> file will likely be written block-sequentially enough that the file as
>> a whole takes up relatively few erase-blocks. As you COW-write
>> individual blocks, they'll be written elsewhere, perhaps all the
>> changed blocks to a new erase-block, perhaps each to a different erase
>> block.
>
> This is a very interesting insight. I wasn't even aware of the
> erase-block issue, so I did some reading up on it...

I take it you looked at TRIM/discard, then, as well? In theory and for
some SSD firmware, it works well at helping to alleviate the problem by
informing the SSD of data areas that it no longer needs to care about
(empty space), thus allowing more effective management of those
erase-blocks.

Reality is however not quite so simple, and it doesn't help a lot with
some SSDs. Plus there's a potential performance issue when doing the
discard, especially on earlier devices, since TRIM is an unqueueable
command in the earlier standards (I've read it's defined as queueable in
the latest standards, however), thus forcing a flush of all activity in
the queue before the discard, potentially triggering I/O freeze
behavior. Additionally, when run on top of dm-crypt, there's a potential
security issue (examination of the raw undecrypted storage reveals
whether there's data there or not, and possibly the filesystem type used
based on patterns, a potential deniability issue in that they know the
data is there, tho it doesn't affect the strength of the encryption
itself).

So since on a lot of firmware it doesn't make a lot of difference
anyway, and there's a couple of downsides, the btrfs ssd mount option
does NOT enable discard as well. However, there *IS* a discard option
that you can experiment with if you like, and it probably WILL help with
erase-block handling on SOME firmware.

See the FAQ, part 3 Features, # 3.4 on TRIM/discard, for a bit more.
I've really covered what it says, above, but there's a link to the
encryption security vs TRIM research, for instance. And the discard
mount-option for whatever reason isn't listed in mount options, or at
least I didn't see it, only in the FAQ.

http://btrfs.ipv5.de/index.php?title=FAQ#Does_Btrfs_support_TRIM.2Fdiscard.3F

Bottom line, if it is indeed an erase-block issue, the discard mount
option MIGHT help, or it might not, depending on your device firmware.
It's an experiment-and-see thing.
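
As a minimal sketch of the experiment-and-see approach (Python, and not
something taken from the FAQ): before comparing timings, it helps to
confirm which btrfs mounts actually have discard active. The helper name
btrfs_discard_status and the sample device names and mount points are
made up for illustration, and the sketch assumes the option shows up in
the options field of /proc/mounts.

```python
def btrfs_discard_status(mounts_text):
    """Map each btrfs mount point to True if 'discard' is among its options.

    mounts_text is expected in /proc/mounts format:
    device, mount point, fs type, options, dump, pass.
    """
    status = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] != "btrfs":
            continue  # skip malformed lines and non-btrfs filesystems
        # Compare option names only, so a "discard=<value>" form also counts.
        option_names = [opt.split("=", 1)[0] for opt in fields[3].split(",")]
        status[fields[1]] = "discard" in option_names
    return status


# Hypothetical sample; these devices and mount points are invented.
SAMPLE = """\
/dev/sda2 /data btrfs rw,relatime,ssd,discard 0 0
/dev/sdb1 /scratch btrfs rw,relatime,ssd 0 0
/dev/sda1 /boot ext4 rw,relatime 0 0
"""

if __name__ == "__main__":
    for mount_point, has_discard in sorted(btrfs_discard_status(SAMPLE).items()):
        print(mount_point, "discard" if has_discard else "no discard")
```

To check a live system rather than the sample, pass in the contents of
/proc/mounts, for instance btrfs_discard_status(open("/proc/mounts").read()).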

>> As you increase the successive COW generation count, the file's
>> file-system/write blocks will be spread thru more and more
>> erase-blocks, basically fragmentation but of the SSD-critical type,
>> into more and more erase blocks, thus affecting modification and
>> removal time but not read time.
>
> OK, so time to write would increase due to fragmentation and writing, it
> now makes sense (though I don't see why small writes would affect this,
> but my concerns are not writes anyway), but why would cp --reflink time
> increase so much. Yes, new extents would be created, but btrfs doesn't
> write into data blocks, does it? I figured its metadata would be kept in
> one place. I figure the only thing BTRFS would do on cp
> --reflink=always:
> 1. Take a collection of extents owned by source.
> 2. Make the new copy use the same collection of extents.
> 3. Write the collection of extents to the "directory".
>
> Now this process seems to be CPU intensive. When I remove or make a
> reflink copy, one core spikes up to 100%, which tells me that there's a
> performance issue there, not an ssd issue. Also, only one CPU thread is
> being used for this. I figured that I can improve this by some setting.
> Maybe thread_pool mount option? Are there any updates in later kernels
> that I should possibly pick up?

FWIW... I am by no means an expert on this. I /think/ I understand
enough of it to somewhat guide trial and error testing to arrive at a
reasonable if not best-case config for any setup I might deal with, and
well enough to hopefully point you in the right direction for your own
research and testing, but I'm not going to claim to be able to explain
the whys of individual cases, or even necessarily to understand them
myself, just understand enough to know of the issue and to trial and
error resolve to a hopefully reasonable situation on any hardware I
might run.

However, I could speculate (enough to guide my own testing were I
troubleshooting here) that it's one of several things or more likely a
combination of them.

One, I'm not sure if the metadata ends up being COW also, or not, but if
it is, then your test case is fragmenting it too, thus explaining the
reflink copy issue. And keep in mind that by default, btrfs uses DUP for
metadata, so there's TWO copies of it written, thereby DOUBLING the
performance effects of anything affecting metadata!

Two, see the FAQ deduplication question/answer a couple questions below
the TRIM/discard one mentioned above. I'm rather fuzzy on the filesystem
implications of this myself, but it seems to me that our COW assumptions
might be wrong because they're assuming deduplication effects that
simply aren't the way btrfs works presently, as it hasn't implemented
deduplication. Admittedly, this is at best a handwavy black-box factor,
but that's the best I can do with it, presently. I guess that at least
gives you another place to do additional research, if it comes to that.
(In this regard I do wish the COW subsection of the sysadmin guide page
on the wiki was written, it's simply punted ATM, since there's a fair
chance that a good explanation there would cover the filesystem
viewpoint differences between full deduplication and the COW that btrfs
does, perhaps clearing up some misconceptions people including me may
have about it, as well.)

Three, as evident in the discussion on the nodatacow and autodefrag
options I mentioned before, there's known issues with some use cases
involving large files and rewrites of data at random locations within
them. But I'm not sure if these known issues are simply the ones we've
been discussing, or if there's other factors I'm unaware of in this
regard.
Knowing more about just what those known issues are, and the specific
scenarios under which they occur, could go a long way toward resolving
the situation for you.

But I'm only a recent list regular, joining a few weeks ago as part of
my own research into btrfs (FWIW my use case involves N-way mirroring,
with N=3-4; since only no-mirroring and N=2 is available today and
3-way/n-way is planned to layer on top of raid5/6, which is planned for
kernel 3.4/3.5, I'm now waiting for that... while continuing to stay
current on the list), so whatever research or test cases led to the
remarks on the wiki regarding large files with random data rewrites
predate my involvement, likely by quite some time.

Four, there's additional block alignment issues having to do with the
alignment of the partition on the physical storage, as it relates to
read-, write- and erase-block sizes and alignment. On SSDs, erase-block
sizes are the biggest, so the optimum alignment would be to erase-block
size. Getting it wrong can result in multiple block writes and/or erases
where proper alignment would require only one. This phenomenon is called
write-amplification (and less commonly, erase-amplification). However,
depending on what you used to create the partition on which the
filesystem resides (and loopback files do tend toward worst-case), it's
quite possible you don't have block-alignment level control at all.

FWIW, that's one use case for the mkbtrfs/mkfs.btrfs --alloc-start/-A
option, since that allows you to align the allocation within the
partition as necessary for alignment, regardless of the partition
alignment.

FWIW2, gptfdisk (a gpt partitioner as opposed to the old mbr style) has
reasonable alignment defaults of 1 MiB on disks without an existing
partition layout, and attempts 8-sector (4 KiB) alignment even on
existing layouts, for disks >=300 GB at least.
That's what I've been using for the last few years, having converted to
gpt-based partitioning for everything, even USB-thumb-drives, if
partitioned. (GPT was designed for EFI, but can be used on BIOS based
systems as well, which is what I'm doing. Grub2 understands gpt well and
puts to good use any reserved BIOS partition it finds, and there's
options in the kernel for it that need to be enabled as well.)

Block alignment is DEFINITELY something you can play with, in terms of
testing whether it makes a difference on your drives, SSD or "spinning
rust".

There's probably other factors involved of which I'm unaware, as well.

>> IIRC I saw a note about this on the wiki, in regard to the nodatacow
>> mount-option.
>> In addition to nodatacow, see the note on the autodefrag option.
> Unless I am wrong, this would disable COW completely and reflink copy.
> Reflinks are a crucial component and the sole reason I picked BTRFS for
> the system that I am writing for my company.
> The autodefrag option addresses multiple writes. Writing is not the
> problem, but cp --reflink should be near-instant. That was the reason we
> chose BTRFS over ZFS, which seemed to be the only feasible alternative.
> ZFS snapshot complicate the design and deduplication copy time is the
> same as (or not much better than) raw copy.
> As I mentioned above, the COW is the crucial component of our system,
> XFS won't do. Our system does not do random writes. In fact it is mainly
> heavy on read operation. The system does occasional "rotation of rust"
> on large files in a way that version control system would (large files
> are modified and then used as a new baseline)

Pardon me, I think I might have been too vague with that "rotating rust"
allusion and lost you. Either that or you're taking the allusion out
even further and potentially lost me!
=;^0

I meant spinning magnetic media with that "rotating rust" reference, the
"rotating rust" bit being a double entendre allusion both to the iron
oxide (rust) used as the data storage layer, and to the fact that many
view rotating magnetic media as a legacy technology (rusting out)
compared to SSDs. =:^) As it happens, I saw that double-meaning
word-play used elsewhere recently with the same two allusions attached,
and liked it enough to use it myself, when I got the chance. Only I'm
not sure you got the reference, because...

You used it quite differently, referring to file rotation. So either you
saw my reference and upped the ante, so to speak, leaving me to pick up
the pieces, or I lost you with the original reference, one of the two.

But I guess we should be on the same page knowing each other's meaning,
now.

Meanwhile...

[I do see your followup mentioning that it doesn't actually disable
/all/ COW, and that you tested it, without significant change in the
results...]

FWIW, I wasn't so much SUGGESTING those options, as noting the
INFORMATION contained in their description, the random writes to large
db files and its effect on btrfs bit. But testing (which you did) is a
good idea, just to see what difference it makes, little in your case, so
either the nocow option isn't disabling it in your case (specific use of
cp --reflink), or the cow isn't the problem at all.

While you're at testing, tho, the question occurred to me of whether
simply using btrfs' snapshotting would make a difference. (I did say I
don't claim a full understanding, and that trial and error testing would
be my method here, that I really only understand enough to hopefully
guide me a bit in what to test...)
Snapshotting by definition uses the COW capabilities, but it occurs to
me that since it's doing it on a filesystem-wide basis instead of a
single-file basis, that might allow more efficiency in metadata
handling.

Note that I don't necessarily expect that snapshotting would be a
workable final solution for you, but if in testing you discover that the
speed stays reasonable with the snapshot method (still only changing the
single file between snapshots), while it degrades (as you've found) with
the single cp --reflink method, then that's important data for the test
case, and given btrfs' state of development, it could well lead to major
optimizations of the single-file cp --reflink case as well, which you
presumably COULD use in final deployment.

> Thanks for all your help on this issue. I hope that someone can point
> out some more tweaks or added features/fixes after 3.2 RC5 that I may
> do.

Speaking of which... since you mentioned 3.2-rc5, you do seem aware of
the fact that btrfs is still very much experimental, in active
development, and of the need for staying current on the kernel.

However, unless your testing is for a system with actual deployment
scheduled for say a year or more out, I'd question btrfs as a reasonable
solution in any case. One of the things that a lot of people don't seem
to realize is just how much active btrfs development is still going on,
and that it's NOT just corner-case use cases such as the multi-mirror
raid1 that I'm waiting on ATM, but that there's still data corruption
issues being traced and fixed, etc.

IOW, btrfs isn't something I'd recommend on either a production system
or even a general user's system, for the time being.
If the intent is to test btrfs, filling it with data that you're not
only prepared to see destroyed but expect to be, so that you either have
backups or simply don't value the data enough to be worth backing up,
and you're not counting on the btrfs copy as anything but experimental
"garbage" data, expected to be lost in testing, then that's FINE. Such
testing, and hopefully bug reporting, and patching where possible, is
what btrfs is out there for, ATM.

But if the intent is to actually put production data on the filesystem,
or use it as the primary copy of data that you don't want to lose, btrfs
isn't an appropriate choice at this point, and I'd say probably won't be
until say Q4, or even next year. So if your production deployment is
scheduled for before that, really, you shouldn't be looking at btrfs for
it, as it's not fit for that purpose ATM and isn't likely to be, for
another year or so (and even then, it'll be suitable for only the early
adopters; the cautious folk will wait another year or more after that,
just as many of the cautious folk are only now warming to ext4 as
opposed to ext3).

I just don't want to see you back here as one of those folks asking
questions about recovering data on a screwed filesystem, because they
had no backups or the backups weren't kept current, because they were
using btrfs for real-life use beyond testing purposes, and that's simply
not the sort of use btrfs is designed for or can properly deliver at
this point!

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman