* btrfs fi defrag interfering (maybe) with Ceph OSD operation
@ 2015-09-27 15:34 Lionel Bouton
  2015-09-28 0:18 ` Duncan
  2015-09-29 14:49 ` Lionel Bouton
  0 siblings, 2 replies; 7+ messages in thread

From: Lionel Bouton @ 2015-09-27 15:34 UTC (permalink / raw)
  To: linux-btrfs

Hi,

we use BTRFS for Ceph filestores (after much tuning and testing over more
than a year). One of the problems we've had to face was the slow decrease
in performance caused by fragmentation. Here's a small recap of the
history for context.

Initially we used internal journals on the few OSDs where we tested BTRFS,
which meant constantly overwriting 10GB files (which is obviously bad for
CoW). Before using NoCoW and eventually moving the journals to raw SSD
partitions, we understood autodefrag was not being effective: the initial
performance on a fresh, recently populated OSD was great and slowly
degraded over time without access patterns and filesystem sizes changing
significantly. My idea was that autodefrag might focus its efforts on
files not useful to defragment in the long term. The obvious one was the
journal (constant writes, but only read again when restarting an OSD), but
I couldn't find any description of the algorithms/heuristics used by
autodefrag, so I decided to disable it and develop our own defragmentation
scheduler. It is based on both a slow walk through the filesystem (which
acts as a safety net over a one-week period) and a fatrace pipe (used to
detect recent fragmentation). Fragmentation is computed from filefrag
detailed outputs, and the scheduler learns how much it can defragment
files with calls to filefrag after defragmentation (we learned compressed
files and uncompressed files don't behave the same way in the process, so
we ended up treating them separately).

Simply excluding the journal from defragmentation and using some basic
heuristics (don't defragment recently written files but keep them in a
pool and queue them later, and don't defragment files below a given
fragmentation "cost" where defragmentation becomes ineffective) gave us
usable performance in the long run. Then we successively moved the journal
to NoCoW files and SSDs and disabled Ceph's use of BTRFS snapshots, which
were too costly (removing snapshots generated 120MB of writes to the
disks, and this was done every 30s on our configuration).

In the end we had a very successful experience: we migrated everything to
BTRFS filestores that were noticeably faster than XFS (according to Ceph
metrics), detected silent corruption and compressed data. Everything
worked well until this morning.

I woke up to a text message signalling VM freezes all over our platform.
2 Ceph OSDs died at the same time on two of our servers (20s apart), which
for durability reasons freezes writes on the data chunks shared by these
two OSDs. The errors we got in the OSD logs seem to point to an IO error
(at least IIRC we got a similar crash on an OSD where we had invalid csum
errors logged by the kernel), but we couldn't find any kernel error and
btrfs scrubs finished on the filesystems without finding any corruption.
I've yet to get an answer for the possible contexts and exact IO errors.

If people familiar with Ceph read this, here's the error on Ceph 0.80.9
(more logs available on demand):

2015-09-27 06:30:57.373841 7f05d92cf700 -1 os/FileStore.cc: In function
'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t, size_t,
ceph::bufferlist&, bool)' thread 7f05d92cf700 time 2015-09-27 06:30:57.260978
os/FileStore.cc: 2641: FAILED assert(allow_eio || !m_filestore_fail_eio || got != -5)

Given that the defragmentation scheduler treats file accesses the same on
all replicas to decide when to trigger a call to "btrfs fi defrag <file>",
I suspect this manual call to defragment could have happened on the 2 OSDs
affected for the same file at nearly the same time and caused the
near-simultaneous crashes.

It's not clear to me that "btrfs fi defrag <file>" can't interfere with
another process trying to use the file. I assume basic reading and writing
is OK, but there might be restrictions on unlinking/locking/using other
ioctls... Are there any I should be aware of and should look for in Ceph
OSDs? This is on a 3.8.19 kernel (with Gentoo patches which don't touch
BTRFS sources) with btrfs-progs 4.0.1. We have 5 servers on our storage
network: 2 are running a 4.0.5 kernel and 3 are running 3.8.19. The 3.8.19
servers are waiting for an opportunity to reboot on 4.0.5 (or better, if
we have the time to test a more recent kernel before rebooting: 4.1.8 and
4.2.1 are our candidates for testing right now).

Best regards,

Lionel Bouton

^ permalink raw reply [flat|nested] 7+ messages in thread
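The scheduler itself is not posted in the thread. As a rough illustration of the deferral heuristics described above (exclude the journal, let recently written files cool down before defragmenting, skip files below a cost threshold), here is a minimal sketch. The cool-down delay, the journal-matching pattern, the cost threshold and the fatrace line handling are all assumptions chosen for illustration, not Lionel's actual code; the fragmentation cost is left as a stub (a filefrag-based estimate is sketched later in the thread).

#!/usr/bin/env python3
# Minimal sketch of a fatrace-driven defragmentation scheduler of the kind
# described above. All constants and the journal naming pattern are assumed.
import subprocess, time, heapq, re

COOL_DOWN = 300          # seconds without writes before defragmenting (assumed)
COST_THRESHOLD = 1.5     # skip files below this estimated cost ratio (assumed)
JOURNAL_RE = re.compile(r'/journal($|/)')   # exclude journals (assumed naming)

def fragmentation_cost(path):
    """Stub: a filefrag -v based estimate is sketched later in the thread."""
    return 2.0  # pretend everything is worth defragmenting in this sketch

def defragment(path):
    # -t 32M is an illustrative extent size target, not a recommendation.
    subprocess.run(['btrfs', 'filesystem', 'defragment', '-t', '32M', path],
                   check=False)

def main():
    pending = {}   # path -> timestamp of the most recent write seen
    queue = []     # (due_time, path), processed once the cool-down expires
    # fatrace prints one line per event, e.g. "cp(1234): W /path/to/file".
    # It needs root; a real scheduler would also run the slow filesystem walk
    # and process the queue on a timer, not only when fatrace emits a line.
    proc = subprocess.Popen(['fatrace'], stdout=subprocess.PIPE, text=True)
    for line in proc.stdout:
        parts = line.rsplit(' ', 1)
        if len(parts) != 2 or 'W' not in parts[0].split(':')[-1]:
            continue                      # only care about write events
        path = parts[1].strip()
        if JOURNAL_RE.search(path):
            continue                      # never defragment the journal
        now = time.time()
        if path not in pending:
            heapq.heappush(queue, (now + COOL_DOWN, path))
        pending[path] = now
        # Defragment files whose cool-down expired and that stayed quiet.
        while queue and queue[0][0] <= now:
            _, candidate = heapq.heappop(queue)
            if now - pending.get(candidate, 0) < COOL_DOWN:
                heapq.heappush(queue, (now + COOL_DOWN, candidate))
                continue
            pending.pop(candidate, None)
            if fragmentation_cost(candidate) >= COST_THRESHOLD:
                defragment(candidate)

if __name__ == '__main__':
    main()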
* Re: btrfs fi defrag interfering (maybe) with Ceph OSD operation
  2015-09-27 15:34 btrfs fi defrag interfering (maybe) with Ceph OSD operation Lionel Bouton
@ 2015-09-28 0:18 ` Duncan
  2015-09-28 9:55 ` Lionel Bouton
  2015-09-29 14:49 ` Lionel Bouton
  1 sibling, 1 reply; 7+ messages in thread

From: Duncan @ 2015-09-28 0:18 UTC (permalink / raw)
  To: linux-btrfs

Lionel Bouton posted on Sun, 27 Sep 2015 17:34:50 +0200 as excerpted:

> Hi,
>
> we use BTRFS for Ceph filestores (after much tuning and testing over
> more than a year). One of the problems we've had to face was the slow
> decrease in performance caused by fragmentation.

While I'm a regular user/admin (not dev) on the btrfs lists, my ceph
knowledge is essentially zero, so this is intended to address the btrfs
side ONLY.

> Here's a small recap of the history for context.
> Initially we used internal journals on the few OSDs where we tested
> BTRFS, which meant constantly overwriting 10GB files (which is obviously
> bad for CoW). Before using NoCoW and eventually moving the journals to
> raw SSD partitions, we understood autodefrag was not being effective:
> the initial performance on a fresh, recently populated OSD was great and
> slowly degraded over time without access patterns and filesystem sizes
> changing significantly.

Yes. Autodefrag works most effectively on (relatively) small files,
generally for performance reasons, as it detects fragmentation and queues
up a defragmenting rewrite by a separate defragmentation worker thread.
As file sizes increase, that defragmenting rewrite will take longer, until
at some point, particularly on actively rewritten files, change-writes
will be coming in faster than file rewrite speeds...

Generally speaking, therefore, it's great for small database files up to
a quarter gig or so, think firefox sqlite database files on the desktop,
with people starting to see issues somewhere between a quarter gig and a
gig on spinning rust, depending on disk speed as well as active rewrite
load on the file in question.

So constantly rewritten 10-gig journal files... Entirely inappropriate
for autodefrag. =:^(

There has been discussion and a general plan for some sort of larger-file
autodefrag optimization, but btrfs continues to be rather "idea and
opportunity rich" and "implementation coder poor", so realistically we're
looking at years to implementation. Meanwhile, other measures should be
taken for multigig files, as you're already doing. =:^)

> I couldn't find any description of the algorithms/heuristics used by
> autodefrag [...]

This is in general documented on the wiki, though not with the level of
explanation I included above.

https://btrfs.wiki.kernel.org

> I decided to disable it and develop our own
> defragmentation scheduler. It is based on both a slow walk through the
> filesystem (which acts as a safety net over a one-week period) and a
> fatrace pipe (used to detect recent fragmentation). Fragmentation is
> computed from filefrag detailed outputs and it learns how much it can
> defragment files with calls to filefrag after defragmentation (we
> learned compressed files and uncompressed files don't behave the same
> way in the process so we ended up treating them separately).

Note that unless this has very recently changed, filefrag doesn't know
how to calculate btrfs-compressed file fragmentation correctly. Btrfs
uses (IIRC) 128 KiB compression blocks, which filefrag will see (I'm not
actually sure if it's 100% consistent or if it's conditional on something
else) as separate extents.

Bottom line, there's no easily accessible, reliable way to get the
fragmentation level of a btrfs-compressed file. =:^( (Presumably
btrfs-debug-tree with the -e option to print extents info, with the
output fed to some parsing script, could do it, but that's not what I'd
call easily accessible, at least at a non-programmer admin level.)

Again, there has been some discussion around teaching filefrag about
btrfs compression, and it may well eventually happen, but I'm not aware
of an e2fsprogs release doing it yet, nor of whether there are even
actual patches for it yet, let alone merge status.

> Simply excluding the journal from defragmentation and using some basic
> heuristics (don't defragment recently written files but keep them in a
> pool then queue them and don't defragment files below a given
> fragmentation "cost" where defragmentation becomes ineffective) gave us
> usable performance in the long run. Then we successively moved the
> journal to NoCoW files and SSDs and disabled Ceph's use of BTRFS
> snapshots which were too costly (removing snapshots generated 120MB of
> writes to the disks and this was done every 30s on our configuration).

It can be noted that there's a negative interaction between btrfs
snapshots and nocow, sometimes called cow1. The btrfs snapshot feature
is predicated on cow, with a snapshot locking in place existing file
extents, normally no big deal as ordinary cow files will have rewrites
cowed elsewhere in any case. Obviously, then, snapshots must by
definition play havoc with nocow. What actually happens is that with
existing extents locked in place, the first post-snapshot change to a
block must then be cowed into a new extent. The nocow attribute remains
on the file, however, and further writes to that block... until the next
snapshot anyway... will be written in-place, to the (first-post-snapshot-
cowed) current extent. When one list poster referred to that as cow1, I
found the term so nicely descriptive that I adopted it for myself,
although for obvious reasons I have to explain it first in many posts.

It should now be obvious why 30-second snapshots weren't working well on
your nocow files, and why they seemed to become fragmented anyway: the
30-second snapshots were effectively disabling nocow!

In general, for nocow files, snapshotting should be disabled (as you
ultimately did), or as low frequency as is practically possible. Some
list posters have, however, reported a good experience with a combination
of lower-frequency snapshotting (say daily, or maybe every six hours, but
DEFINITELY not more frequent than half-hour), and periodic defrag, on the
order of the weekly period you implied in a bit I snipped, to perhaps
monthly.

> In the end we had a very successful experience, migrated everything to
> BTRFS filestores that were noticeably faster than XFS (according to Ceph
> metrics), detected silent corruption and compressed data. Everything
> worked well [...]

=:^)

> [...] until this morning.

=:^(

> I woke up to a text message signalling VM freezes all over our platform.
> 2 Ceph OSDs died at the same time on two of our servers (20s apart),
> which for durability reasons freezes writes on the data chunks shared by
> these two OSDs.
> The errors we got in the OSD logs seem to point to an IO error (at least
> IIRC we got a similar crash on an OSD where we had invalid csum errors
> logged by the kernel) but we couldn't find any kernel error and btrfs
> scrubs finished on the filesystems without finding any corruption.

Snipping some of the ceph stuff since as I said I've essentially zero
knowledge there, but...

> Given that the defragmentation scheduler treats file accesses the same
> on all replicas to decide when to trigger a call to "btrfs fi defrag
> <file>", I suspect this manual call to defragment could have happened on
> the 2 OSDs affected for the same file at nearly the same time and caused
> the near-simultaneous crashes.

... While what I /do/ know of ceph suggests that it should be protected
against this sort of thing, perhaps there's a bug, because...

I know for sure that btrfs itself is not intended for distributed access,
from more than one system/kernel at a time. Which, assuming my ceph
illiteracy isn't negatively affecting my reading of the above, seems to
be more or less what you're suggesting happened, and I do know that *if*
it *did* happen, it could indeed trigger all sorts of havoc!

> It's not clear to me that "btrfs fi defrag <file>" can't interfere with
> another process trying to use the file. I assume basic reading and
> writing is OK but there might be restrictions on unlinking/locking/using
> other ioctls... Are there any I should be aware of and should look for
> in Ceph OSDs? This is on a 3.8.19 kernel (with Gentoo patches which
> don't touch BTRFS sources) with btrfs-progs 4.0.1. We have 5 servers on
> our storage network: 2 are running a 4.0.5 kernel and 3 are running
> 3.8.19. The 3.8.19 servers are waiting for an opportunity to reboot on
> 4.0.5 (or better if we have the time to test a more recent kernel before
> rebooting: 4.1.8 and 4.2.1 are our candidates for testing right now).

It's worth keeping in mind that the explicit warnings about btrfs being
experimental weren't removed until 3.12, and while current status is no
longer experimental or entirely unstable, it remains, as I characterize
it, "maturing and stabilizing, not yet entirely stable and mature."

So 3.8 is very much still in btrfs-experimental land! And so many bugs
have been fixed since then that... well, just get off of it ASAP, which
it seems you're already doing.

While it's no longer absolutely necessary to stay current to the latest
non-long-term-support kernel (unless you're running, say, raid56 mode,
which is still new enough not to be as stable as the rest of btrfs and
where running the latest kernel continues to be critical; and while I'm
discussing exceptions, btrfs quota code continues to be a problem even
with the newest kernels, so I recommend it remain off unless you're
specifically working with the devs to debug and test it), list consensus
seems to be that where stability is a prime consideration, sticking to
long-term-support kernel series -- no later than one LTS series behind
the latest, and upgrading to the latest LTS series some reasonable time
after the LTS announcement, after deployment-specific testing as
appropriate of course -- is recommended best practice.

With the kernel 4.1 series now blessed as the latest long-term-stable,
and 3.18 the latest before that, the above suggests targeting them, and
indeed, list reports for the 3.18 series as it has matured have been very
good, with 4.1 still new enough that the stability-cautious are still
testing or have just deployed, so there aren't many reports on it yet.

Meanwhile, while the latest (or second-latest until the latest is
site-tested) LTS kernel is recommended for stable deployment, when
encountering specific bugs, be prepared to upgrade to latest stable at
least for testing, possibly with cherry-picked not-yet-mainlined patches
if appropriate for individual bugs.

But definitely get off of anything pre-3.12, as that really is when the
experimental label came off, and you don't want to be running kernel
btrfs of that age in production. Again, 3.18 is well tested and rated, so
targeting it for ASAP deployment is good, with 4.1 targeted for testing
and deployment "soon" also recommended.

And once again, that's purely from the btrfs side. I know absolutely
nothing about ceph stability in any of these kernels, though obviously
for you that's going to be a consideration as well.

Tying up a couple loose ends...

Regarding nocow...

Given that you had apparently missed much of the general list and wiki
wisdom above (while at the same time eventually coming to many of the
same conclusions on your own), it's worth mentioning the following
additional nocow caveat and recommended procedure, in case you missed it
as well:

On btrfs, setting nocow on an existing file with existing content leaves
undefined when exactly the nocow attribute will take effect. (FWIW, this
is mentioned in the chattr (1) manpage as well.)

Recommended procedure is therefore to set the nocow attribute on the
directory, such that newly created files (and subdirs) will inherit it.
(There's no effect on the directory itself, just this inheritance.)
Then, for existing files, copy them into the new location, preferably
from a different filesystem in order to guarantee that the file is
actually newly created and thus gets nocow applied appropriately. (cp
behavior currently copies the file in unless the reflink option is set
anyway, but there has been discussion of changing that to reflink by
default for speed and space-usage reasons, and that would play havoc with
nocow on file creation; but btrfs doesn't support cross-filesystem
reflinks, so copying in from a different filesystem should always force
creation of a new file, with nocow inherited from its directory as
intended.)

What about btrfs-progs versions? In general, in normal online operation
the btrfs command simply tells the kernel what to do and the kernel takes
care of the details, so it's the kernel code that's critical. However,
various recovery operations -- btrfs check, btrfs restore, btrfs rescue,
etc. (I'm not actually sure about mkfs.btrfs, whether that's primarily
userspace code or calls into the kernel, though I suspect the former) --
operate on an unmounted btrfs using primarily userspace code, and it's
here where the latest userspace code, updated to deal with the latest
known problems, becomes critical.

So in general, it's kernel code age and stability that's critical for a
deployed and operational filesystem, but userspace code that's critical
if you run into problems. For that reason, unless you have backups and
intend to simply blow away filesystems with problems and recreate them
fresh, restoring from backups, a reasonably current btrfs userspace is
critical as well, even if it's not critical in normal operation. And of
course you need current userspace as well as kernelspace to best support
the newest features, but that's a given. =:^)

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master -- and if you use the program, he is your master."
Richard Stallman ^ permalink raw reply [flat|nested] 7+ messages in thread
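A minimal sketch of the directory-level nocow setup and copy-in procedure described in the message above. The paths are purely illustrative, and it simply shells out to the standard chattr, lsattr and cp tools rather than using the underlying inode-flag ioctls.

#!/usr/bin/env python3
# Sketch: set +C on the directory first, then copy files in so they are
# created fresh and inherit the attribute. Paths below are assumptions.
import subprocess
from pathlib import Path

def setup_nocow_dir(target_dir: Path) -> None:
    target_dir.mkdir(parents=True, exist_ok=True)
    # Mark the directory NOCOW so newly created files inherit the attribute.
    subprocess.run(['chattr', '+C', str(target_dir)], check=True)

def import_file(src: Path, target_dir: Path) -> Path:
    """Copy an existing file into the nocow directory.

    Copying from a different filesystem guarantees a fresh file is created
    (no reflink possible), so the new copy inherits +C from its directory.
    """
    dest = target_dir / src.name
    # --reflink=never keeps this a real copy even if cp's default changes.
    subprocess.run(['cp', '--reflink=never', str(src), str(dest)], check=True)
    return dest

def has_nocow(path: Path) -> bool:
    out = subprocess.run(['lsattr', '-d', str(path)],
                         capture_output=True, text=True, check=True).stdout
    return 'C' in out.split()[0]   # attribute field is the first column

if __name__ == '__main__':
    journals = Path('/var/lib/ceph/osd/osd-journals')       # illustrative path
    setup_nocow_dir(journals)
    new = import_file(Path('/mnt/other-fs/journal'), journals)  # illustrative
    print(new, 'nocow:', has_nocow(new))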
* Re: btrfs fi defrag interfering (maybe) with Ceph OSD operation
  2015-09-28 0:18 ` Duncan
@ 2015-09-28 9:55 ` Lionel Bouton
  2015-09-28 20:52 ` Duncan
  0 siblings, 1 reply; 7+ messages in thread

From: Lionel Bouton @ 2015-09-28 9:55 UTC (permalink / raw)
  To: Duncan, linux-btrfs

Hi Duncan,

thanks for your answer, here is additional information.

On 28/09/2015 02:18, Duncan wrote:
> [...]
>> I decided to disable it and develop our own
>> defragmentation scheduler. It is based on both a slow walk through the
>> filesystem (which acts as a safety net over a one-week period) and a
>> fatrace pipe (used to detect recent fragmentation). Fragmentation is
>> computed from filefrag detailed outputs and it learns how much it can
>> defragment files with calls to filefrag after defragmentation (we
>> learned compressed files and uncompressed files don't behave the same
>> way in the process so we ended up treating them separately).
> Note that unless this has very recently changed, filefrag doesn't know
> how to calculate btrfs-compressed file fragmentation correctly. Btrfs
> uses (IIRC) 128 KiB compression blocks, which filefrag will see (I'm not
> actually sure if it's 100% consistent or if it's conditional on something
> else) as separate extents.
>
> Bottom line, there's no easily accessible, reliable way to get the
> fragmentation level of a btrfs-compressed file. =:^( (Presumably
> btrfs-debug-tree with the -e option to print extents info, with the
> output fed to some parsing script, could do it, but that's not what I'd
> call easily accessible, at least at a non-programmer admin level.)
>
> Again, there has been some discussion around teaching filefrag about
> btrfs compression, and it may well eventually happen, but I'm not aware
> of an e2fsprogs release doing it yet, nor of whether there are even
> actual patches for it yet, let alone merge status.

From what I understood, filefrag doesn't know the length of each extent
on disk but should have its position. This is enough to have a rough
estimation of how badly fragmented the file is: it doesn't change the
result much when computing what a rotating disk must do (especially how
many head movements) to access the whole file.

>> Simply excluding the journal from defragmentation and using some basic
>> heuristics (don't defragment recently written files but keep them in a
>> pool then queue them and don't defragment files below a given
>> fragmentation "cost" where defragmentation becomes ineffective) gave us
>> usable performance in the long run. Then we successively moved the
>> journal to NoCoW files and SSDs and disabled Ceph's use of BTRFS
>> snapshots which were too costly (removing snapshots generated 120MB of
>> writes to the disks and this was done every 30s on our configuration).
> It can be noted that there's a negative interaction between btrfs
> snapshots and nocow, sometimes called cow1. The btrfs snapshot feature
> is predicated on cow, with a snapshot locking in place existing file
> extents, normally no big deal as ordinary cow files will have rewrites
> cowed elsewhere in any case. Obviously, then, snapshots must by
> definition play havoc with nocow. What actually happens is that with
> existing extents locked in place, the first post-snapshot change to a
> block must then be cowed into a new extent. The nocow attribute remains
> on the file, however, and further writes to that block... until the next
> snapshot anyway... will be written in-place, to the (first-post-snapshot-
> cowed) current extent. When one list poster referred to that as cow1, I
> found the term so nicely descriptive that I adopted it for myself,
> although for obvious reasons I have to explain it first in many posts.
>
> It should now be obvious why 30-second snapshots weren't working well on
> your nocow files, and why they seemed to become fragmented anyway: the
> 30-second snapshots were effectively disabling nocow!
>
> In general, for nocow files, snapshotting should be disabled (as you
> ultimately did), or as low frequency as is practically possible. Some
> list posters have, however, reported a good experience with a combination
> of lower-frequency snapshotting (say daily, or maybe every six hours, but
> DEFINITELY not more frequent than half-hour), and periodic defrag, on the
> order of the weekly period you implied in a bit I snipped, to perhaps
> monthly.

In the case of Ceph OSDs, this isn't what causes the performance problem:
the journal is on the main subvolume and snapshots are done on another
subvolume.

> [...]
>> Given that the defragmentation scheduler treats file accesses the same
>> on all replicas to decide when to trigger a call to "btrfs fi defrag
>> <file>", I suspect this manual call to defragment could have happened on
>> the 2 OSDs affected for the same file at nearly the same time and caused
>> the near-simultaneous crashes.
> ... While what I /do/ know of ceph suggests that it should be protected
> against this sort of thing, perhaps there's a bug, because...
>
> I know for sure that btrfs itself is not intended for distributed access,
> from more than one system/kernel at a time. Which, assuming my ceph
> illiteracy isn't negatively affecting my reading of the above, seems to
> be more or less what you're suggesting happened, and I do know that *if*
> it *did* happen, it could indeed trigger all sorts of havoc!

No: Ceph OSDs are normal local processes using a filesystem for storage
(and optionally a dedicated journal outside the filesystem), as are the
btrfs fi defrag commands run on the same host. What I'm interested in is
how the btrfs fi defrag <file> command could interfere with any other
process accessing <file> simultaneously. The answer could very well be
"it never will" (for example because it doesn't use any operation that
can interfere before calling the defrag ioctl, which is itself guaranteed
not to interfere with other file operations). I just need to know if
there's a possibility, so I can decide whether these defragmentations are
an operational risk in my context and whether I've found the cause of my
slightly frightening morning.

>> It's not clear to me that "btrfs fi defrag <file>" can't interfere with
>> another process trying to use the file. I assume basic reading and
>> writing is OK but there might be restrictions on unlinking/locking/using
>> other ioctls... Are there any I should be aware of and should look for
>> in Ceph OSDs? This is on a 3.8.19 kernel (with Gentoo patches which
>> don't touch BTRFS sources) with btrfs-progs 4.0.1. We have 5 servers on
>> our storage network: 2 are running a 4.0.5 kernel and 3 are running
>> 3.8.19. The 3.8.19 servers are waiting for an opportunity to reboot on
>> 4.0.5 (or better if we have the time to test a more recent kernel before
>> rebooting: 4.1.8 and 4.2.1 are our candidates for testing right now).
> It's worth keeping in mind that the explicit warnings about btrfs being
> experimental weren't removed until 3.12, and while current status is no
> longer experimental or entirely unstable, it remains, as I characterize
> it, "maturing and stabilizing, not yet entirely stable and mature."
>
> So 3.8 is very much still in btrfs-experimental land! And so many bugs
> have been fixed since then that... well, just get off of it ASAP, which
> it seems you're already doing.

Oops, that was a typo: I meant 3.18.9, sorry :-(

> [...]
>
> Tying up a couple loose ends...
>
> Regarding nocow...
>
> Given that you had apparently missed much of the general list and wiki
> wisdom above (while at the same time eventually coming to many of the
> same conclusions on your own),

In fact I was initially aware of (no)CoW/defragmentation/snapshots
performance gotchas (I already used BTRFS for hosting PostgreSQL slaves,
for example...). But Ceph is filesystem aware: its OSDs detect if they
are running on XFS/BTRFS and automatically activate some filesystem
features. So even though I was aware of the problems that can happen on
a CoW filesystem, I preferred to do actual testing with the default Ceph
settings and filesystem mount options before tuning.

Best regards,

Lionel Bouton

^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: btrfs fi defrag interfering (maybe) with Ceph OSD operation
  2015-09-28 9:55 ` Lionel Bouton
@ 2015-09-28 20:52 ` Duncan
  2015-09-28 21:55 ` Lionel Bouton
  0 siblings, 1 reply; 7+ messages in thread

From: Duncan @ 2015-09-28 20:52 UTC (permalink / raw)
  To: linux-btrfs

Lionel Bouton posted on Mon, 28 Sep 2015 11:55:15 +0200 as excerpted:

> From what I understood, filefrag doesn't know the length of each extent
> on disk but should have its position. This is enough to have a rough
> estimation of how badly fragmented the file is: it doesn't change the
> result much when computing what a rotating disk must do (especially how
> many head movements) to access the whole file.

AFAIK, it's the number of extents reported that's the problem with
filefrag and btrfs compression. Multiple 128 KiB compression blocks can
be right next to each other, forming one longer extent on-device, but due
to the compression, filefrag sees and reports them as one extent per
compression block, making the file look like it has perhaps thousands or
tens of thousands of extents when in actuality it's only a handful,
single or double digits.

In that regard, neither length nor position matters: filefrag will simply
report a number of extents orders of magnitude higher than what's
actually there, on-device.

But I'm not a coder so could be entirely wrong; that's simply how I
understand it based on what I've seen on-list from the devs themselves.

> In the case of Ceph OSDs, this isn't what causes the performance problem:
> the journal is on the main subvolume and snapshots are done on another
> subvolume.

Understood... now. I was actually composing a reply saying I didn't get
it, when suddenly I did. The snapshots were being taken of different
subvolumes entirely, thus excluding the files in question here.
Thanks. =:^)

>>> This is on a 3.8.19 kernel [...]
>> [Btrfs was still experimental] until 3.12 [so] 3.8
>> is very much still in btrfs-experimental land! [...]
>
> Oops, that was a typo: I meant 3.18.9, sorry :-(

That makes a /world/ of difference! LOL! I'm very much relieved! =:^)

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master -- and if you use the program, he is your master."
Richard Stallman

^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: btrfs fi defrag interfering (maybe) with Ceph OSD operation
  2015-09-28 20:52 ` Duncan
@ 2015-09-28 21:55 ` Lionel Bouton
  0 siblings, 0 replies; 7+ messages in thread

From: Lionel Bouton @ 2015-09-28 21:55 UTC (permalink / raw)
  To: Duncan, linux-btrfs

On 28/09/2015 22:52, Duncan wrote:
> Lionel Bouton posted on Mon, 28 Sep 2015 11:55:15 +0200 as excerpted:
>
>> From what I understood, filefrag doesn't know the length of each extent
>> on disk but should have its position. This is enough to have a rough
>> estimation of how badly fragmented the file is: it doesn't change the
>> result much when computing what a rotating disk must do (especially how
>> many head movements) to access the whole file.
> AFAIK, it's the number of extents reported that's the problem with
> filefrag and btrfs compression. Multiple 128 KiB compression blocks can
> be right next to each other, forming one longer extent on-device, but due
> to the compression, filefrag sees and reports them as one extent per
> compression block, making the file look like it has perhaps thousands or
> tens of thousands of extents when in actuality it's only a handful,
> single or double digits.

Yes, but that's not a problem for our defragmentation scheduler: we
compute the time needed to read the file based on a model of the disk
where reading consecutive compressed blocks has no seek cost, only the
same revolution cost as reading the larger block they form. The cost of
fragmentation is defined as the ratio between this time and the time
computed with our model if the blocks were purely sequential.

> In that regard, neither length nor position matters: filefrag will
> simply report a number of extents orders of magnitude higher than what's
> actually there, on-device.

Yes, but filefrag -v reports the length and position, and we can then
find out, based purely on the positions, whether extents are sequential
or random.

If people are interested in the details I can discuss them in a separate
thread (or a subthread with a different title).

One thing in particular surprised me and could be an interesting separate
discussion: according to the extent positions reported by filefrag -v,
defragmentation can leave extents in several sequences at different
positions on the disk, leading to an average fragmentation cost for
compressed files of 2.7x to 3x compared to the ideal case (note that this
is an approximation: we consider files compressed if more than half of
their extents are compressed, by checking for "encoded" in the extent
flags). This is completely different for uncompressed files: here
defragmentation is completely effective and we get a single extent most
of the time. So there are at least 3 possibilities: an error in the
positions reported by filefrag (and the file is really defragmented), a
good reason to leave these files fragmented, or an opportunity for
optimization.

But let's remember our real problem: I'm still not sure if calling btrfs
fi defrag <file> can interfere with any concurrent operation on <file>,
leading to an I/O error. As this has the potential to bring our platform
down in our current setup, I really hope this will catch the attention of
someone familiar with the technical details of btrfs fi defrag.

Best regards,

Lionel Bouton

^ permalink raw reply [flat|nested] 7+ messages in thread
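The cost model Lionel describes above could be approximated along the following lines. This is only an illustrative sketch, not the scheduler's actual code: the disk parameters, the filefrag -v parsing and the "adjacent physical blocks cost nothing extra" rule are assumptions, while the "encoded"-flag heuristic mirrors the one mentioned in the message.

#!/usr/bin/env python3
# Sketch of a filefrag -v based fragmentation cost: physically contiguous
# extents (e.g. adjacent 128 KiB compressed blocks) add no seek cost, every
# discontinuity is charged one head movement. Constants are assumed values.
import re
import subprocess
import sys

SEEK_TIME = 0.012          # assumed average seek + rotational latency (s)
SEQ_THROUGHPUT = 120e6     # assumed sequential throughput (bytes/s)

# Matches lines like: "  0:    0..  63:  221471..  221534:  64:  221470: encoded"
EXTENT_RE = re.compile(
    r'^\s*\d+:\s+(\d+)\.\.\s*(\d+):\s+(\d+)\.\.\s*(\d+):\s+(\d+):'
    r'(?:\s+(\d+):)?\s*(\S*)')

def parse_extents(path):
    out = subprocess.run(['filefrag', '-v', path],
                         capture_output=True, text=True, check=True).stdout
    blocksize = 4096
    m = re.search(r'\((\d+) blocks? of (\d+) bytes\)', out)
    if m:
        blocksize = int(m.group(2))
    extents = []
    for line in out.splitlines():
        m = EXTENT_RE.match(line)
        if m:
            phys_start, phys_end = int(m.group(3)), int(m.group(4))
            length, flags = int(m.group(5)), m.group(7) or ''
            extents.append((phys_start, phys_end, length, flags))
    return blocksize, extents

def looks_compressed(extents):
    # Heuristic from the message: compressed if more than half of the
    # extents carry the "encoded" flag.
    return sum(1 for *_, flags in extents if 'encoded' in flags) > len(extents) / 2

def fragmentation_cost(blocksize, extents):
    if not extents:
        return 1.0
    total_bytes = sum(length for _, _, length, _ in extents) * blocksize
    read_time = total_bytes / SEQ_THROUGHPUT
    seeks = 1                      # initial positioning
    prev_end = None
    for phys_start, phys_end, _, _ in extents:
        if prev_end is not None and phys_start != prev_end + 1:
            seeks += 1             # discontiguous: charge one head movement
        prev_end = phys_end
    ideal = read_time + SEEK_TIME              # one seek, then fully sequential
    modeled = read_time + seeks * SEEK_TIME
    return modeled / ideal

if __name__ == '__main__':
    for f in sys.argv[1:]:
        blocksize, extents = parse_extents(f)
        kind = 'compressed' if looks_compressed(extents) else 'uncompressed'
        print(f, kind, 'cost:', round(fragmentation_cost(blocksize, extents), 2))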
* Re: btrfs fi defrag interfering (maybe) with Ceph OSD operation
  2015-09-27 15:34 btrfs fi defrag interfering (maybe) with Ceph OSD operation Lionel Bouton
  2015-09-28 0:18 ` Duncan
@ 2015-09-29 14:49 ` Lionel Bouton
  2015-09-29 17:14 ` Lionel Bouton
  1 sibling, 1 reply; 7+ messages in thread

From: Lionel Bouton @ 2015-09-29 14:49 UTC (permalink / raw)
  To: linux-btrfs

On 27/09/2015 17:34, Lionel Bouton wrote:
> [...]
> It's not clear to me that "btrfs fi defrag <file>" can't interfere with
> another process trying to use the file. I assume basic reading and
> writing is OK but there might be restrictions on unlinking/locking/using
> other ioctls... Are there any I should be aware of and should look for
> in Ceph OSDs? This is on a 3.8.19 kernel (with Gentoo patches which
> don't touch BTRFS sources) with btrfs-progs 4.0.1. We have 5 servers on
> our storage network: 2 are running a 4.0.5 kernel and 3 are running
> 3.8.19. The 3.8.19 servers are waiting for an opportunity to reboot on
> 4.0.5 (or better if we have the time to test a more recent kernel before
> rebooting: 4.1.8 and 4.2.1 are our candidates for testing right now).

Apparently this isn't the problem: we just had another similar Ceph OSD
crash without any concurrent defragmentation going on.

Best regards,

Lionel Bouton

^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: btrfs fi defrag interfering (maybe) with Ceph OSD operation
  2015-09-29 14:49 ` Lionel Bouton
@ 2015-09-29 17:14 ` Lionel Bouton
  0 siblings, 0 replies; 7+ messages in thread

From: Lionel Bouton @ 2015-09-29 17:14 UTC (permalink / raw)
  To: linux-btrfs

On 29/09/2015 16:49, Lionel Bouton wrote:
> On 27/09/2015 17:34, Lionel Bouton wrote:
>> [...]
>> It's not clear to me that "btrfs fi defrag <file>" can't interfere with
>> another process trying to use the file. I assume basic reading and
>> writing is OK but there might be restrictions on unlinking/locking/using
>> other ioctls... Are there any I should be aware of and should look for
>> in Ceph OSDs? This is on a 3.8.19 kernel (with Gentoo patches which
>> don't touch BTRFS sources) with btrfs-progs 4.0.1. We have 5 servers on
>> our storage network: 2 are running a 4.0.5 kernel and 3 are running
>> 3.8.19. The 3.8.19 servers are waiting for an opportunity to reboot on
>> 4.0.5 (or better if we have the time to test a more recent kernel before
>> rebooting: 4.1.8 and 4.2.1 are our candidates for testing right now).
> Apparently this isn't the problem: we just had another similar Ceph OSD
> crash without any concurrent defragmentation going on.

However, the Ceph developers confirmed that BTRFS returned an EIO while
reading data from disk. Is there a known bug in kernel 3.18.9 (sorry for
the initial typo) that could lead to that? I couldn't find any on the
wiki.

The last crash was on a filesystem mounted with these options:
rw,noatime,nodiratime,compress=lzo,space_cache,recovery,autodefrag

Some of the extents have been recompressed to zlib (though at the time of
the crash there was no such activity, as I disabled it 2 days before to
simplify diagnostics).

Best regards,

Lionel Bouton

^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2015-09-29 17:14 UTC | newest]

Thread overview: 7+ messages:
2015-09-27 15:34 btrfs fi defrag interfering (maybe) with Ceph OSD operation Lionel Bouton
2015-09-28 0:18 ` Duncan
2015-09-28 9:55 ` Lionel Bouton
2015-09-28 20:52 ` Duncan
2015-09-28 21:55 ` Lionel Bouton
2015-09-29 14:49 ` Lionel Bouton
2015-09-29 17:14 ` Lionel Bouton