* btrfs fi defrag interfering (maybe) with Ceph OSD operation
@ 2015-09-27 15:34 Lionel Bouton
2015-09-28 0:18 ` Duncan
2015-09-29 14:49 ` Lionel Bouton
0 siblings, 2 replies; 7+ messages in thread
From: Lionel Bouton @ 2015-09-27 15:34 UTC (permalink / raw)
To: linux-btrfs
Hi,
we use BTRFS for Ceph filestores (after much tuning and testing over
more than a year). One of the problems we've had to face was the slow
decrease in performance caused by fragmentation.
Here's a small recap of the history for context.
Initially we used internal journals on the few OSDs where we tested
BTRFS, which meant constantly overwriting 10GB files (which is obviously
bad for CoW). Before using NoCoW and eventually moving the journals to
raw SSD partitions, we understood autodefrag was not being effective :
the initial performance on a fresh, recently populated OSD was great and
slowly degraded over time without access patterns and filesystem sizes
changing significantly.
My idea was that autodefrag might focus its efforts on files not useful
to defragment in the long term. The obvious one was the journal
(constant writes but only read again when restarting an OSD) but I
couldn't find any description of the algorithms/heuristics used by
autodefrag so I decided to disable it and develop our own
defragmentation scheduler. It is based on both a slow walk through the
filesystem (which acts as a safety net over a one-week period) and a
fatrace pipe (used to detect recent fragmentation). Fragmentation is
computed from filefrag detailed outputs and it learns how much it can
defragment files with calls to filefrag after defragmentation (we
learned compressed files and uncompressed files don't behave the same
way in the process so we ended up treating them separately).
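To give an idea of what the scheduler works from, here is a minimal Python
sketch of the filefrag -v parsing side; the regular expression assumes the
usual e2fsprogs column layout, and the helper names are illustrative rather
than our actual code:

    import re
    import subprocess

    # A filefrag -v data row looks like (ext, logical_offset, physical_offset, length, flags):
    #    0:        0..    2047:    1249280..   1251327:   2048:             encoded
    ROW = re.compile(r'^\s*\d+:\s+\d+\.\.\s*\d+:\s+(\d+)\.\.\s*\d+:\s+(\d+):')

    def physical_extents(path):
        """Return (physical_start_block, length_in_blocks) pairs, in file order."""
        out = subprocess.run(['filefrag', '-v', path],
                             capture_output=True, text=True, check=True).stdout
        return [tuple(int(g) for g in m.groups())
                for m in (ROW.match(line) for line in out.splitlines()) if m]

    def discontiguous_runs(extents):
        """Count runs of physically contiguous extents: adjacent compressed extents
        that filefrag reports separately are merged into a single run here."""
        runs, expected = 0, None
        for start, length in extents:
            if start != expected:
                runs += 1
            expected = start + length
        return runs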
Simply excluding the journal from defragmentation and using some basic
heuristics (don't defragment recently written files but keep them in a
pool then queue them and don't defragment files below a given
fragmentation "cost" were defragmentation becomes ineffective) gave us
usable performance in the long run. Then we successively moved the
journal to NoCoW files and SSDs and disabled Ceph's use of BTRFS
snapshots which were too costly (removing snapshots generated 120MB of
writes to the disks and this was done every 30s on our configuration).
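To make those heuristics concrete, here is a stripped-down sketch of the
scheduling logic; the window, threshold and queue below are illustrative
placeholders, not our production values:

    import heapq
    import subprocess
    import time

    RECENT_WINDOW = 3600   # leave files alone for an hour after their last write (illustrative)
    MIN_COST = 1.3         # below this estimated fragmentation cost, defragmenting gains too little

    last_write = {}        # path -> timestamp of last write, fed by the fatrace pipe
    candidates = []        # max-heap of (-cost, path)

    def consider(path, cost):
        """Called for files seen by the slow filesystem walk or the fatrace pipe."""
        if cost < MIN_COST:
            return                          # not worth a defragmentation pass
        if time.time() - last_write.get(path, 0) < RECENT_WINDOW:
            return                          # recently written: keep it in the pool for later
        heapq.heappush(candidates, (-cost, path))

    def defrag_one():
        """Defragment the most fragmented queued file, if any."""
        if candidates:
            _neg_cost, path = heapq.heappop(candidates)
            subprocess.run(['btrfs', 'fi', 'defrag', path])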
In the end we had a very successful experience, migrated everything to
BTRFS filestores that were noticeably faster than XFS (according to Ceph
metrics), detected silent corruption and compressed data. Everything
worked well until this morning.
I woke up to a text message signalling VM freezes all over our platform.
2 Ceph OSDs died at the same time on two of our servers (20s apart),
which for durability reasons freezes writes on the data chunks shared by
these two OSDs.
The errors we got in the OSD logs seem to point to an IO error (at least
IIRC we got a similar crash on an OSD where we had invalid csum errors
logged by the kernel) but we couldn't find any kernel error and btrfs
scrubs finished on the filesystems without finding any corruption. I've
yet to get an answer for the possible contexts and exact IO errors. If
people familiar with Ceph read this, here's the error on Ceph 0.80.9
(more logs available on demand) :
2015-09-27 06:30:57.373841 7f05d92cf700 -1 os/FileStore.cc: In function
'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t,
size_t, ceph::bufferlist&, bool)' thread 7f05d92cf700 time 2015-09-27
06:30:57.260978
os/FileStore.cc: 2641: FAILED assert(allow_eio || !m_filestore_fail_eio
|| got != -5)
Given that the defragmentation scheduler treats file accesses the same
on all replicas to decide when to trigger a call to "btrfs fi defrag
<file>", I suspect this manual call to defragment could have happened on
the 2 OSDs affected for the same file at nearly the same time and caused
the near simultaneous crashes.
It's not clear to me that "btrfs fi defrag <file>" can't interfere with
another process trying to use the file. I assume basic reading and
writing is OK but there might be restrictions on unlinking/locking/using
other ioctls... Are there any I should be aware of and should look for
in Ceph OSDs? This is on a 3.8.19 kernel (with Gentoo patches which
don't touch BTRFS sources) with btrfs-progs 4.0.1. We have 5 servers on
our storage network : 2 are running a 4.0.5 kernel and 3 are running
3.8.19. The 3.8.19 servers are waiting for an opportunity to reboot on
4.0.5 (or better if we have the time to test a more recent kernel before
rebooting : 4.1.8 and 4.2.1 are our candidates for testing right now).
Best regards,
Lionel Bouton
* Re: btrfs fi defrag interfering (maybe) with Ceph OSD operation
2015-09-27 15:34 btrfs fi defrag interfering (maybe) with Ceph OSD operation Lionel Bouton
@ 2015-09-28 0:18 ` Duncan
2015-09-28 9:55 ` Lionel Bouton
2015-09-29 14:49 ` Lionel Bouton
1 sibling, 1 reply; 7+ messages in thread
From: Duncan @ 2015-09-28 0:18 UTC (permalink / raw)
To: linux-btrfs
Lionel Bouton posted on Sun, 27 Sep 2015 17:34:50 +0200 as excerpted:
> Hi,
>
> we use BTRFS for Ceph filestores (after much tuning and testing over
> more than a year). One of the problems we've had to face was the slow
> decrease in performance caused by fragmentation.
While I'm a regular user/admin (not dev) on the btrfs lists, my ceph
knowledge is essentially zero, so this is intended to address the btrfs
side ONLY.
> Here's a small recap of the history for context.
> Initially we used internal journals on the few OSDs where we tested
> BTRFS, which meant constantly overwriting 10GB files (which is obviously
> bad for CoW). Before using NoCoW and eventually moving the journals to
> raw SSD partitions, we understood autodefrag was not being effective :
> the initial performance on a fresh, recently populated OSD was great and
> slowly degraded over time without access patterns and filesystem sizes
> changing significantly.
Yes. Autodefrag works most effectively on (relatively) small files,
generally for performance reasons, as it detects fragmentation and queues
up a defragmenting rewrite by a separate defragmentation worker
thread. As file sizes increase, that defragmenting rewrite will take
longer, until at some point, particularly on actively rewritten files,
change-writes will be coming in faster than file rewrite speeds...
Generally speaking, therefore, it's great for small database files up to a
quarter gig or so, think firefox sqlite database files on the desktop,
with people starting to see issues somewhere between a quarter gig and a
gig on spinning rust, depending on disk speed as well as active rewrite
load on the file in question.
So constantly rewritten 10-gig journal files... Entirely inappropriate
for autodefrag. =:^(
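As a rough back-of-the-envelope illustration (the 100 MB/s sequential
rewrite figure is purely an assumption, just to show the scaling):

    # Time for autodefrag's worker to rewrite a whole file once, at an assumed 100 MB/s:
    def rewrite_seconds(size_mib, mb_per_s=100):
        return size_mib / mb_per_s

    print(rewrite_seconds(256))        # ~2.6 s : a small database file, easy to keep up with
    print(rewrite_seconds(10 * 1024))  # ~102 s : a 10 GiB journal gets new writes far faster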
There has been discussion and a general plan for some sort of larger-file
autodefrag optimization, but btrfs continues to be rather "idea and
opportunity rich" and "implementation coder poor", so realistically we're
looking at years to implementation.
Meanwhile, other measures should be taken for multigig files, as you're
already doing. =:^)
> I couldn't find any description of the algorithms/heuristics used by
> autodefrag [...]
This is in general documented on the wiki, tho not with the level of
explanation I included above.
https://btrfs.wiki.kernel.org
> I decided to disable it and develop our own
> defragmentation scheduler. It is based on both a slow walk through the
> filesystem (which acts as a safety net over a one-week period) and a
> fatrace pipe (used to detect recent fragmentation). Fragmentation is
> computed from filefrag detailed outputs and it learns how much it can
> defragment files with calls to filefrag after defragmentation (we
> learned compressed files and uncompressed files don't behave the same
> way in the process so we ended up treating them separately).
Note that unless this has very recently changed, filefrag doesn't know
how to calculate btrfs-compressed file fragmentation correctly. Btrfs
uses (IIRC) 128 KiB compression blocks, which filefrag will see (I'm not
actually sure if it's 100% consistent or if it's conditional on something
else) as separate extents.
Bottom line, there's no easily accessible reliable way to get the
fragmentation level of a btrfs-compressed file. =:^( (Presumably
btrfs-debug-tree with the -e option to print extents info, with the
output fed to some parsing script, could do it, but that's not what I'd
call easily accessible, at least at a non-programmer admin level.)
Again, there has been some discussion around teaching filefrag about
btrfs compression, and it may well eventually happen, but I'm not aware
of an e2fsprogs release doing it yet, nor of whether there's even actual
patches for it yet, let alone merge status.
> Simply excluding the journal from defragmentation and using some basic
> heuristics (don't defragment recently written files but keep them in a
> pool then queue them and don't defragment files below a given
> fragmentation "cost" where defragmentation becomes ineffective) gave us
> usable performance in the long run. Then we successively moved the
> journal to NoCoW files and SSDs and disabled Ceph's use of BTRFS
> snapshots which were too costly (removing snapshots generated 120MB of
> writes to the disks and this was done every 30s on our configuration).
It can be noted that there's a negative interaction between btrfs
snapshots and nocow, sometimes called cow1. The btrfs snapshot feature
is predicated on cow, with a snapshot locking in place existing file
extents, normally no big deal as ordinary cow files will have rewrites
cowed elsewhere in any case. Obviously, then, snapshots must by
definition play havoc with nocow. What actually happens is that with
existing extents locked in place, the first post-snapshot change to a
block must then be cowed into a new extent. The nocow attribute remains
on the file, however, and further writes to that block... until the next
snapshot anyway... will be written in-place, to the (first-post-snapshot-
cowed) current extent. When one list poster referred to that as cow1, I
found the term so nicely descriptive that I adopted it for myself, altho
for obvious reasons I have to explain it first in many posts.
It should now be obvious why 30-second snapshots weren't working well on
your nocow files, and why they seemed to become fragmented anyway: the 30-
second snapshots were effectively disabling nocow!
In general, for nocow files, snapshotting should be disabled (as you
ultimately did), or as low frequency as is practically possible. Some
list posters have, however, reported a good experience with a combination
of lower frequency snapshotting (say daily, or maybe every six hours, but
DEFINITELY not more frequent than half-hour), and periodic defrag, on the
order of the weekly period you implied in a bit I snipped, to perhaps
monthly.
> In the end we had a very successful experience, migrated everything to
> BTRFS filestores that were noticeably faster than XFS (according to Ceph
> metrics), detected silent corruption and compressed data. Everything
> worked well [...]
=:^)
> [...] until this morning.
=:^(
> I woke up to a text message signalling VM freezes all over our platform.
> 2 Ceph OSDs died at the same time on two of our servers (20s apart),
> which for durability reasons freezes writes on the data chunks shared by
> these two OSDs.
> The errors we got in the OSD logs seem to point to an IO error (at least
> IIRC we got a similar crash on an OSD where we had invalid csum errors
> logged by the kernel) but we couldn't find any kernel error and btrfs
> scrubs finished on the filesystems without finding any corruption.
Snipping some of the ceph stuff since as I said I've essentially zero
knowledge there, but...
> Given that the defragmentation scheduler treats file accesses the same
> on all replicas to decide when to trigger a call to "btrfs fi defrag
> <file>", I suspect this manual call to defragment could have happened on
> the 2 OSDs affected for the same file at nearly the same time and caused
> the near simultaneous crashes.
... While what I /do/ know of ceph suggests that it should be protected
against this sort of thing, perhaps there's a bug, because...
I know for sure that btrfs itself is not intended for distributed access,
from more than one system/kernel at a time. Which assuming my ceph
illiteracy isn't negatively affecting my reading of the above, seems to
be more or less what you're suggesting happened, and I do know that *if*
it *did* happen, it could indeed trigger all sorts of havoc!
> It's not clear to me that "btrfs fi defrag <file>" can't interfere with
> another process trying to use the file. I assume basic reading and
> writing is OK but there might be restrictions on unlinking/locking/using
> other ioctls... Are there any I should be aware of and should look for
> in Ceph OSDs? This is on a 3.8.19 kernel (with Gentoo patches which
> don't touch BTRFS sources) with btrfs-progs 4.0.1. We have 5 servers on
> our storage network : 2 are running a 4.0.5 kernel and 3 are running
> 3.8.19. The 3.8.19 servers are waiting for an opportunity to reboot on
> 4.0.5 (or better if we have the time to test a more recent kernel before
> rebooting : 4.1.8 and 4.2.1 are our candidates for testing right now).
It's worth keeping in mind that the explicit warnings about btrfs being
experimental weren't removed until 3.12, and while current status is no
longer experimental or entirely unstable, it remains, as I characterize
it, as "maturing and stabilizing, not yet entirely stable and mature."
So 3.8 is very much still in btrfs-experimental land! And so many bugs
have been fixed since then that... well, just get off of it ASAP, which
it seems you're already doing.
While it's no longer absolutely necessary to stay current to the latest
non-long-term-support kernel (unless you're running say raid56 mode,
which is still new enough not to be as stable as the rest of btrfs and
where running the latest kernel continues to be critical, and while I'm
discussing exceptions, btrfs quota code continues to be a problem even
with the newest kernels, so I recommend it remain off unless you're
specifically working with the devs to debug and test it), list consensus
seems to be that where stability is a prime consideration, sticking to
long-term-support kernel series, no later than one LTS series behind the
latest and upgrading to the latest LTS series some reasonable time after
the LTS announcement, after deployment-specific testing as appropriate of
course, is recommended best-practice.
With kernel 4.1 series now blessed as the latest long-term-stable, and
3.18 the latest before that, the above suggests targeting them, and
indeed, list reports for the 3.18 series as it has matured have been very
good, with 4.1 still new enough that the stability-cautious are still
testing or just deployed, so there's not many reports on it yet.
Meanwhile, while latest (or second-latest until latest is site-tested) LTS
kernel is recommended for stable deployment, when encountering specific
bugs, be prepared to upgrade to latest stable at least for testing,
possibly with cherry-picked not-yet-mainlined patches if appropriate for
individual bugs.
But definitely, anything pre-3.12, get off of, as that really is when the
experimental label came off, and you don't want to be running kernel
btrfs of that age in production. Again, 3.18 is well tested and rated so
targeting it for ASAP deployment is good, with 4.1 targeted for testing
and deployment "soon" also recommended.
And once again, that's purely from the btrfs side. I know absolutely
nothing about ceph stability in any of these kernels, tho obviously for
you that's going to be a consideration as well.
Tying up a couple loose ends...
Regarding nocow...
Given that you had apparently missed much of the general list and wiki
wisdom above (while at the same time eventually coming to many of the
same conclusions on your own), it's worth mentioning the following
additional nocow caveat and recommended procedure, in case you missed it
as well:
On btrfs, setting nocow on an existing file with existing content leaves
undefined when exactly the nocow attribute will take effect. (FWIW, this
is mentioned in the chattr (1) manpage as well.) Recommended procedure
is therefore to set the nocow attribute on the directory, such that newly
created files (and subdirs) will inherit it. (There's no effect on the
directory itself, just this inheritance.) Then, for existing files, copy
them into the new location, preferably from a different filesystem in
order to guarantee that the file is actually newly created and thus
gets nocow applied appropriately.
(cp behavior currently copies the file in unless the reflink option is
set anyway, but there has been discussion of changing that to reflink by
default for speed and space usage reasons, and that would play havoc with
nocow on file creation, but btrfs doesn't support cross-filesystem
reflinks so copying in from a different filesystem should always force
creation of a new file, with nocow inherited from its directory as
intended.)
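A minimal sketch of that procedure (the paths are hypothetical; the point is
chattr +C on the directory first, then a copy in from a different filesystem):

    import shutil
    import subprocess

    nocow_dir = '/srv/btrfs/osd-journal'     # hypothetical directory on the btrfs filesystem
    old_file = '/mnt/other-fs/journal'       # existing file living on a *different* filesystem

    # New files created under this directory inherit the nocow attribute.
    subprocess.run(['chattr', '+C', nocow_dir], check=True)

    # Copying in from another filesystem guarantees a genuinely new file,
    # so nocow applies to all of its data from the very first write.
    shutil.copy2(old_file, nocow_dir + '/journal')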
What about btrfs-progs versions?
In general, in normal online operation the btrfs command simply tells the
kernel what to do and the kernel takes care of the details, so it's the
kernel code that's critical. However, various recovery operations, btrfs
check, btrfs restore, btrfs rescue, etc (I'm not actually sure about
mkfs.btrfs, whether that's primarily userspace code or calls into the
kernel, tho I suspect the former), operate on an unmounted btrfs using
primarily userspace code, and it's here where the latest userspace code,
updated to deal with the latest known problems, becomes critical.
So in general, it's kernel code age and stability that's critical for a
deployed and operational filesystem, but userspace code that's critical if
you run into problems. For that reason, unless you have backups and
intend to simply blow away filesystems with problems and recreate them
fresh, restoring from backups, a reasonably current btrfs userspace is
critical as well, even if it's not critical in normal operation.
And of course you need current userspace as well as kernelspace to best
support the newest features, but that's a given. =:^)
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: btrfs fi defrag interfering (maybe) with Ceph OSD operation
2015-09-28 0:18 ` Duncan
@ 2015-09-28 9:55 ` Lionel Bouton
2015-09-28 20:52 ` Duncan
0 siblings, 1 reply; 7+ messages in thread
From: Lionel Bouton @ 2015-09-28 9:55 UTC (permalink / raw)
To: Duncan, linux-btrfs
Hi Duncan,
thanks for your answer, here is additional information.
Le 28/09/2015 02:18, Duncan a écrit :
> [...]
>> I decided to disable it and develop our own
>> defragmentation scheduler. It is based on both a slow walk through the
>> filesystem (which acts as a safety net over a one-week period) and a
>> fatrace pipe (used to detect recent fragmentation). Fragmentation is
>> computed from filefrag detailed outputs and it learns how much it can
>> defragment files with calls to filefrag after defragmentation (we
>> learned compressed files and uncompressed files don't behave the same
>> way in the process so we ended up treating them separately).
> Note that unless this has very recently changed, filefrag doesn't know
> how to calculate btrfs-compressed file fragmentation correctly. Btrfs
> uses (IIRC) 128 KiB compression blocks, which filefrag will see (I'm not
> actually sure if it's 100% consistent or if it's conditional on something
> else) as separate extents.
>
> Bottom line, there's no easily accessible reliable way to get the
> fragmentation level of a btrfs-compressed file. =:^( (Presumably
> btrfs-debug-tree with the -e option to print extents info, with the
> output fed to some parsing script, could do it, but that's not what I'd
> call easily accessible, at least at a non-programmer admin level.)
>
> Again, there has been some discussion around teaching filefrag about
> btrfs compression, and it may well eventually happen, but I'm not aware
> of an e2fsprogs release doing it yet, nor of whether there's even actual
> patches for it yet, let alone merge status.
From what I understood, filefrag doesn't know the length of each extent
on disk but should have its position. This is enough to have a rough
estimation of how badly fragmented the file is : it doesn't change the
result much when computing what a rotating disk must do (especially how
many head movements) to access the whole file.
>
>> Simply excluding the journal from defragmentation and using some basic
>> heuristics (don't defragment recently written files but keep them in a
>> pool then queue them and don't defragment files below a given
>> fragmentation "cost" where defragmentation becomes ineffective) gave us
>> usable performance in the long run. Then we successively moved the
>> journal to NoCoW files and SSDs and disabled Ceph's use of BTRFS
>> snapshots which were too costly (removing snapshots generated 120MB of
>> writes to the disks and this was done every 30s on our configuration).
> It can be noted that there's a negative interaction between btrfs
> snapshots and nocow, sometimes called cow1. The btrfs snapshot feature
> is predicated on cow, with a snapshot locking in place existing file
> extents, normally no big deal as ordinary cow files will have rewrites
> cowed elsewhere in any case. Obviously, then, snapshots must by
> definition play havoc with nocow. What actually happens is that with
> existing extents locked in place, the first post-snapshot change to a
> block must then be cowed into a new extent. The nocow attribute remains
> on the file, however, and further writes to that block... until the next
> snapshot anyway... will be written in-place, to the (first-post-snapshot-
> cowed) current extent. When one list poster referred to that as cow1, I
> found the term so nicely descriptive that I adopted it for myself, altho
> for obvious reasons I have to explain it first in many posts.
>
> It should now be obvious why 30-second snapshots weren't working well on
> your nocow files, and why they seemed to become fragmented anyway: the 30-
> second snapshots were effectively disabling nocow!
>
> In general, for nocow files, snapshotting should be disabled (as you
> ultimately did), or as low frequency as is practically possible. Some
> list posters have, however, reported a good experience with a combination
> of lower frequency snapshotting (say daily, or maybe every six hours, but
> DEFINITELY not more frequent than half-hour), and periodic defrag, on the
> order of the weekly period you implied in a bit I snipped, to perhaps
> monthly.
In the case of Ceph OSD, this isn't what causes the performance problem:
the journal is on the main subvolume and snapshots are done on another
subvolume.
> [...]
>> Given that the defragmentation scheduler treats file accesses the same
>> on all replicas to decide when to trigger a call to "btrfs fi defrag
>> <file>", I suspect this manual call to defragment could have happened on
>> the 2 OSDs affected for the same file at nearly the same time and caused
>> the near simultaneous crashes.
> ... While what I /do/ know of ceph suggests that it should be protected
> against this sort of thing, perhaps there's a bug, because...
>
> I know for sure that btrfs itself is not intended for distributed access,
> from more than one system/kernel at a time. Which assuming my ceph
> illiteracy isn't negatively affecting my reading of the above, seems to
> be more or less what you're suggesting happened, and I do know that *if*
> it *did* happen, it could indeed trigger all sorts of havoc!
No: Ceph OSDs are normal local processes using a filesystem for storage
(and optionally a dedicated journal outside the filesystem), as are the
btrfs fi defrag commands run on the same host. What I'm interested in is
how the btrfs fi defrag <file> command could interfere with any other
process accessing <file> simultaneously. The answer could very well be
"it never will" (for example because it doesn't use any operation that
can before calling the defrag ioctl which is guaranteed to not interfere
with other file operations too). I just need to know if there's a
possibility so I can decide if these defragmentations are an operational
risk or not in my context and if I found the cause for my slightly
frightening morning.
>> It's not clear to me that "btrfs fi defrag <file>" can't interfere with
>> another process trying to use the file. I assume basic reading and
>> writing is OK but there might be restrictions on unlinking/locking/using
>> other ioctls... Are there any I should be aware of and should look for
>> in Ceph OSDs? This is on a 3.8.19 kernel (with Gentoo patches which
>> don't touch BTRFS sources) with btrfs-progs 4.0.1. We have 5 servers on
>> our storage network : 2 are running a 4.0.5 kernel and 3 are running
>> 3.8.19. The 3.8.19 servers are waiting for an opportunity to reboot on
>> 4.0.5 (or better if we have the time to test a more recent kernel before
>> rebooting : 4.1.8 and 4.2.1 are our candidates for testing right now).
> It's worth keeping in mind that the explicit warnings about btrfs being
> experimental weren't removed until 3.12, and while current status is no
> longer experimental or entirely unstable, it remains, as I characterize
> it, as "maturing and stabilizing, not yet entirely stable and mature."
>
> So 3.8 is very much still in btrfs-experimental land! And so many bugs
> have been fixed since then that... well, just get off of it ASAP, which
> it seems you're already doing.
Oops, that was a typo : I meant 3.18.9, sorry :-(
> [...]
>
>
> Tying up a couple loose ends...
>
> Regarding nocow...
>
> Given that you had apparently missed much of the general list and wiki
> wisdom above (while at the same time eventually coming to many of the
> same conclusions on your own),
In fact I was initially aware of (no)CoW/defragmentation/snapshots
performance gotchas (I had already used BTRFS to host PostgreSQL slaves,
for example...).
But Ceph is filesystem aware: its OSDs detect if they are running on
XFS/BTRFS and automatically activate some filesystem features. So even
though I was aware of the problems that can happen on a CoW filesystem,
I preferred to do actual testing with the default Ceph settings and
filesystem mount options before tuning.
Best regards,
Lionel Bouton
* Re: btrfs fi defrag interfering (maybe) with Ceph OSD operation
2015-09-28 9:55 ` Lionel Bouton
@ 2015-09-28 20:52 ` Duncan
2015-09-28 21:55 ` Lionel Bouton
0 siblings, 1 reply; 7+ messages in thread
From: Duncan @ 2015-09-28 20:52 UTC (permalink / raw)
To: linux-btrfs
Lionel Bouton posted on Mon, 28 Sep 2015 11:55:15 +0200 as excerpted:
> From what I understood, filefrag doesn't know the length of each extent
> on disk but should have its position. This is enough to have a rough
> estimation of how badly fragmented the file is : it doesn't change the
> result much when computing what a rotating disk must do (especially how
> many head movements) to access the whole file.
AFAIK, it's the number of extents reported that's the problem with
filefrag and btrfs compression. Multiple 128 KiB compression blocks can
be right next to each other, forming one longer extent on-device, but due
to the compression, filefrag sees and reports them as one extent per
compression block, making the file look like it has perhaps thousands or
tens of thousands of extents when in actuality it's only a handful,
single or double digits.
In that regard, neither length nor position matters; filefrag will
simply report a number of extents orders of magnitude higher than what's
actually there, on-device.
But I'm not a coder so could be entirely wrong; that's simply how I
understand it based on what I've seen on-list from the devs themselves.
> In the case of Ceph OSD, this isn't what causes the performance problem:
> the journal is on the main subvolume and snapshots are done on another
> subvolume.
Understood... now. I was actually composing a reply saying I didn't get
it, when suddenly I did. The snapshots were being taken of different
subvolumes entirely, thus excluding the files here in question.
Thanks. =:^)
>>> This is on a 3.8.19 kernel [...]
>> [Btrfs was still experimental] until 3.12 [so] 3.8
>> is very much still in btrfs-experimental land! [...]
>
> Oops, that was a typo : I meant 3.18.9, sorry :-(
That makes a /world/ of difference! LOL! I'm very much relieved! =:^)
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: btrfs fi defrag interfering (maybe) with Ceph OSD operation
2015-09-28 20:52 ` Duncan
@ 2015-09-28 21:55 ` Lionel Bouton
0 siblings, 0 replies; 7+ messages in thread
From: Lionel Bouton @ 2015-09-28 21:55 UTC (permalink / raw)
To: Duncan, linux-btrfs
Le 28/09/2015 22:52, Duncan a écrit :
> Lionel Bouton posted on Mon, 28 Sep 2015 11:55:15 +0200 as excerpted:
>
>> From what I understood, filefrag doesn't know the length of each extent
>> on disk but should have its position. This is enough to have a rough
>> estimation of how badly fragmented the file is : it doesn't change the
>> result much when computing what a rotating disk must do (especially how
>> many head movements) to access the whole file.
> AFAIK, it's the number of extents reported that's the problem with
> filefrag and btrfs compression. Multiple 128 KiB compression blocks can
> be right next to each other, forming one longer extent on-device, but due
> to the compression, filefrag sees and reports them as one extent per
> compression block, making the file look like it has perhaps thousands or
> tens of thousands of extents when in actuality it's only a handful,
> single or double digits.
Yes, but that's not a problem for our defragmentation scheduler: we
compute the time needed to read the file based on a model of the disk
where reading consecutive compressed blocks has no seek cost, only the
same revolution cost as reading the larger block they form. The cost of
fragmentation is defined as the ratio between this time and the time
computed with our model if the blocks were purely sequential.
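A stripped-down version of that model looks roughly like this (the seek and
transfer figures are placeholders, not our calibrated values):

    SEEK_MS = 8.0            # assumed average head movement
    XFER_MS_PER_MIB = 6.0    # assumed sequential transfer time
    BLOCK_KIB = 4

    def read_time_ms(extents):
        """extents: (physical_start_block, length_in_blocks) pairs from filefrag -v."""
        total, expected = 0.0, None
        for start, length in extents:
            if expected is not None and start != expected:
                total += SEEK_MS                 # discontiguous extent: pay one head movement
            total += length * BLOCK_KIB / 1024 * XFER_MS_PER_MIB
            expected = start + length            # consecutive blocks cost no extra seek
        return total

    def fragmentation_cost(extents):
        """Ratio of the modelled read time to the ideal, fully sequential read time."""
        total_blocks = sum(length for _start, length in extents)
        return read_time_ms(extents) / read_time_ms([(0, total_blocks)])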
>
> In that regard, neither length nor position matters; filefrag will
> simply report a number of extents orders of magnitude higher than what's
> actually there, on-device.
Yes, but filefrag -v reports the length and position, and we can then find
out based purely on the positions if extents are sequential or random.
If people are interested in the details I can discuss them in a separate
thread (or a subthread with a different title). One thing in particular
surprised me and could be an interesting separate discussion: according
to the extent positions reported by filefrag -v, defragmentation can
leave extents in several sequences at different positions on the disk
leading to an average fragmentation cost for compressed files of 2.7x to
3x compared to the ideal case (note that this is an approximation: we
consider files compressed if more than half of their extents are
compressed by checking for "encoded" in the extent flags). This is
completely different for uncompressed files: here defragmentation is
completely effective and we get a single extent most of the time. So
there are at least 3 possibilities : an error in the positions reported by
filefrag (and the file is really defragmented), a good reason to leave
these files fragmented or an opportunity for optimization.
But let's remember our real problem: I'm still not sure if calling btrfs
fi defrag <file> can interfere with any concurrent operation on <file>
leading to an I/O error. As this has the potential to bring our platform
down in our current setup, I really hope this will catch the attention of
someone familiar with the technical details of btrfs fi defrag.
Best regards,
Lionel Bouton
* Re: btrfs fi defrag interfering (maybe) with Ceph OSD operation
2015-09-27 15:34 btrfs fi defrag interfering (maybe) with Ceph OSD operation Lionel Bouton
2015-09-28 0:18 ` Duncan
@ 2015-09-29 14:49 ` Lionel Bouton
2015-09-29 17:14 ` Lionel Bouton
1 sibling, 1 reply; 7+ messages in thread
From: Lionel Bouton @ 2015-09-29 14:49 UTC (permalink / raw)
To: linux-btrfs
Le 27/09/2015 17:34, Lionel Bouton a écrit :
> [...]
> It's not clear to me that "btrfs fi defrag <file>" can't interfere with
> another process trying to use the file. I assume basic reading and
> writing is OK but there might be restrictions on unlinking/locking/using
> other ioctls... Are there any I should be aware of and should look for
> in Ceph OSDs? This is on a 3.8.19 kernel (with Gentoo patches which
> don't touch BTRFS sources) with btrfs-progs 4.0.1. We have 5 servers on
> our storage network : 2 are running a 4.0.5 kernel and 3 are running
> 3.8.19. The 3.8.19 servers are waiting for an opportunity to reboot on
> 4.0.5 (or better if we have the time to test a more recent kernel before
> rebooting : 4.1.8 and 4.2.1 are our candidates for testing right now).
Apparently this isn't the problem : we just had another similar Ceph OSD
crash without any concurrent defragmentation going on.
Best regards,
Lionel Bouton
* Re: btrfs fi defrag interfering (maybe) with Ceph OSD operation
2015-09-29 14:49 ` Lionel Bouton
@ 2015-09-29 17:14 ` Lionel Bouton
0 siblings, 0 replies; 7+ messages in thread
From: Lionel Bouton @ 2015-09-29 17:14 UTC (permalink / raw)
To: linux-btrfs
Le 29/09/2015 16:49, Lionel Bouton a écrit :
> Le 27/09/2015 17:34, Lionel Bouton a écrit :
>> [...]
>> It's not clear to me that "btrfs fi defrag <file>" can't interfere with
>> another process trying to use the file. I assume basic reading and
>> writing is OK but there might be restrictions on unlinking/locking/using
>> other ioctls... Are there any I should be aware of and should look for
>> in Ceph OSDs? This is on a 3.8.19 kernel (with Gentoo patches which
>> don't touch BTRFS sources) with btrfs-progs 4.0.1. We have 5 servers on
>> our storage network : 2 are running a 4.0.5 kernel and 3 are running
>> 3.8.19. The 3.8.19 servers are waiting for an opportunity to reboot on
>> 4.0.5 (or better if we have the time to test a more recent kernel before
>> rebooting : 4.1.8 and 4.2.1 are our candidates for testing right now).
> Apparently this isn't the problem : we just had another similar Ceph OSD
> crash without any concurrent defragmentation going on.
However, the Ceph developers confirmed that BTRFS returned an EIO while
reading data from disk. Is there a known bug in kernel 3.18.9 (sorry
for the initial typo) that could lead to that? I couldn't find any on
the wiki.
The last crash was on a filesystem mounted with these options:
rw,noatime,nodiratime,compress=lzo,space_cache,recovery,autodefrag
Some of the extents have been recompressed to zlib (though at the time
of the crash there was no such activity, as I had disabled it 2 days before
to simplify diagnostics).
Best regards,
Lionel Bouton