linux-btrfs.vger.kernel.org archive mirror
* Possible application issue ...
@ 2014-04-07  5:25 George Mitchell
  2014-04-07 12:42 ` Duncan
  0 siblings, 1 reply; 3+ messages in thread
From: George Mitchell @ 2014-04-07  5:25 UTC (permalink / raw)
  To: Btrfs BTRFS

I seem to be having an issue with a specific application.  I just 
installed "Recoll", a really nice desktop search tool.  The following 
day, whenever my backup program attempted to run, my computer simply 
stopped dead in its tracks and I was forced to do a hard reboot to get 
it back.  So tonight I have been trying to suss out the problem, and 
the problem goes like this: whenever I try to defrag the Recoll data 
files, I get a string of weird messages pouring out from the btrfs 
defrag program itself, plus messages flashing on the screen about some 
sort of CPU failure on both CPUs.  As soon as I removed the ".recoll" 
data directory from the path, everything was fine again.  Does anyone 
know what might be going on here, or should I run the thing again, try 
to capture the output, and post it and/or send a copy of the data files 
in question?


* Re: Possible application issue ...
  2014-04-07  5:25 Possible application issue George Mitchell
@ 2014-04-07 12:42 ` Duncan
  2014-04-07 14:14   ` George Mitchell
  0 siblings, 1 reply; 3+ messages in thread
From: Duncan @ 2014-04-07 12:42 UTC (permalink / raw)
  To: linux-btrfs

George Mitchell posted on Sun, 06 Apr 2014 22:25:03 -0700 as excerpted:

> I seem to be having an issue with a specific application.  I just
> installed "Recoll", a really nice desktop search tool.  The following
> day, whenever my backup program attempted to run, my computer simply
> stopped dead in its tracks and I was forced to do a hard reboot to get
> it back.  So tonight I have been trying to suss out the problem, and
> the problem goes like this: whenever I try to defrag the Recoll data
> files, I get a string of weird messages pouring out from the btrfs
> defrag program itself, plus messages flashing on the screen about some
> sort of CPU failure on both CPUs.  As soon as I removed the ".recoll"
> data directory from the path, everything was fine again.

> Does anyone know what might be going on here, or should I run the thing
> again, try to capture the output, and post it and/or send a copy of the
> data files in question?

Just a btrfs user and list regular here, not a dev, but...

You'll probably need to post the output for a bug fix... unless it's 
simply the "task blocked for more than NNN seconds" warnings (usually 
30/60/90/120/etc), in which case the general problem is known, but then 
you'll want to...

echo w > /proc/sysrq-trigger

...  and post the output from that.  That's the info usually requested 
in that case, anyway.  And if this is the case, the apparent lockup 
should go away on its own after some time, but it might be a few minutes 
if the files are very heavily fragmented, as is likely.
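
Something like this should capture it (a rough sketch; assumes the 
magic-sysrq interface is enabled and you're running it as root):

# enable all sysrq functions, if they aren't already
echo 1 > /proc/sys/kernel/sysrq
# dump stack traces of blocked tasks into the kernel log
echo w > /proc/sysrq-trigger
# save the traces from the kernel ring buffer
dmesg > sysrq-w.txt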


Meanwhile, database files are part of a general category of frequently 
internally updated (as opposed to append-only) files that all 
copy-on-write filesystems, including btrfs, have problems with: they 
tend to fragment very fast and hard under COW, because every rewrite 
goes to a new location.

How large are the files in question?  Are you using the btrfs autodefrag 
mount option?  Do you use snapper or otherwise take lots of (likely 
scripted) snapshots on that subvolume or filesystem?
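
To check, something like this might help (a sketch; ~/.recoll/xapiandb 
is just my guess at Recoll's default index location, so adjust the 
paths):

# total size of the index data
du -sh ~/.recoll
# extent counts per file; big numbers mean heavy fragmentation
filefrag ~/.recoll/xapiandb/*
# current mount options, to see whether autodefrag is on
grep btrfs /proc/mounts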

Generally speaking, if the files aren't too large (perhaps a couple 
hundred MiB or smaller), btrfs' autodefrag option can usually deal with 
the fragmentation as it occurs.  This works quite well for firefox sqlite 
databases, for instance.
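
If you want to try it, autodefrag is just a mount option (a sketch; 
substitute your own mountpoint, and add it to that filesystem's fstab 
options to make it permanent):

mount -o remount,autodefrag /home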

Once the files in question get over perhaps half a gigabyte in size, 
however, that doesn't work so well, particularly if the file is being 
updated at a reasonable speed in real time, as autodefrag queues the 
entire file for rewrite in order to defrag it, and at some point the 
rewriting can't keep up with the updates coming in.

For large internal-rewrite-pattern files, there's the NOCOW file 
attribute, which tells btrfs to rewrite the files in place.  It also 
disables the usual checksumming etc., which can likewise take time and 
complicate things on database files, where the database generally 
already has some file integrity management of its own that can "fight" 
with the management btrfs does.

But to be effective, setting nocow (chattr +C /path/to/file/or/dir) needs 
to be done while the file is still zero size, before it has any content.  
The easiest way to do that is to set it on the directory, before the 
files in the directory are created, so they inherit the nocow attribute 
from the directory they're created in.
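
For example (a sketch, again assuming the index lives under ~/.recoll):

# set NOCOW on the still-empty directory...
chattr +C ~/.recoll
# ...and verify that the 'C' attribute shows up
lsattr -d ~/.recoll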

The easiest solution at this point might be to delete the current 
fragmented files instead of trying to defrag them, set nocow on the 
directory that will contain them, and then trigger a reindex.
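
Put together, perhaps something like this (a sketch; recollindex -z 
should be Recoll's erase-and-rebuild switch, but double-check that and 
the paths before running anything):

# throw away the fragmented index files
rm -rf ~/.recoll/xapiandb
# recreate the directory and mark it NOCOW while it's still empty
mkdir ~/.recoll/xapiandb
chattr +C ~/.recoll/xapiandb
# rebuild the index from scratch
recollindex -z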


However, there's one additional caveat involving snapshots.  By 
definition, the first change to a file block after a snapshot will be 
copy-on-write despite the nocow attribute.  This is because the snapshot 
froze the existing file data in place as it was, so a change to it must 
be written to a new location even if the file is set nocow.  This 
shouldn't be too big of a problem if you're just taking a snapshot 
manually every week or so, but if you're using snapper or a similar 
automated script to take hourly or even per-minute snapshots, the effect 
is likely to be nearly as bad as if the file wasn't set nocow in the 
first place!

If this is the case, creating a dedicated subvolume for the directory 
containing these files is the best idea, since snapshots stop at 
subvolume boundaries.  As long as you're not snapshotting that 
subvolume, you can set nocow on directories and files within it and not 
have to worry about snapshot-based cow undermining your efforts.
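
A sketch of that, once more assuming the ~/.recoll location:

# move the old data aside and replace it with a subvolume
mv ~/.recoll ~/.recoll.old
btrfs subvolume create ~/.recoll
chattr +C ~/.recoll
# then reindex into it, and leave it out of any snapshot scripts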

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: Possible application issue ...
  2014-04-07 12:42 ` Duncan
@ 2014-04-07 14:14   ` George Mitchell
  0 siblings, 0 replies; 3+ messages in thread
From: George Mitchell @ 2014-04-07 14:14 UTC (permalink / raw)
  To: Duncan, linux-btrfs

On 04/07/2014 05:42 AM, Duncan wrote:
> George Mitchell posted on Sun, 06 Apr 2014 22:25:03 -0700 as excerpted:
>
>> I seem to be having an issue with a specific application.  I just
>> installed "Recoll", a really nice desktop search tool.  The following
>> day, whenever my backup program attempted to run, my computer simply
>> stopped dead in its tracks and I was forced to do a hard reboot to get
>> it back.  So tonight I have been trying to suss out the problem, and
>> the problem goes like this: whenever I try to defrag the Recoll data
>> files, I get a string of weird messages pouring out from the btrfs
>> defrag program itself, plus messages flashing on the screen about some
>> sort of CPU failure on both CPUs.  As soon as I removed the ".recoll"
>> data directory from the path, everything was fine again.
>> Does anyone know what might be going on here, or should I run the thing
>> again, try to capture the output, and post it and/or send a copy of the
>> data files in question?
> Just a btrfs user and list regular here, not a dev, but...
>
> You'll probably need to post the output for a bug fix... unless it's
> simply the "task blocked for more than NNN seconds" warnings (usually
> 30/60/90/120/etc), in which case the general problem is known, but then
> you'll want to...
>
> echo w > /proc/sysrq-trigger
>
> ...  and post the output from that.  That's the info usually requested
> in that case, anyway.  And if this is the case, the apparent lockup
> should go away on its own after some time, but it might be a few minutes
> if the files are very heavily fragmented, as is likely.
>
>
> Meanwhile, database files are part of a general category of frequently
> internally updated (as opposed to append-only) files that all
> copy-on-write filesystems, including btrfs, have problems with: they
> tend to fragment very fast and hard under COW, because every rewrite
> goes to a new location.
>
> How large are the files in question?  Are you using the btrfs autodefrag
> mount option?  Do you use snapper or otherwise take lots of (likely
> scripted) snapshots on that subvolume or filesystem?
>
> Generally speaking, if the files aren't too large (perhaps a couple
> hundred MiB or smaller), btrfs' autodefrag option can usually deal with
> the fragmentation as it occurs.  This works quite well for firefox sqlite
> databases, for instance.
>
> Once the files in question get over perhaps half a gigabyte in size,
> however, that doesn't work so well, particularly if the file is being
> updated at a reasonable speed in real time, as autodefrag queues the
> entire file for rewrite in order to defrag it, and at some point the
> rewriting can't keep up with the updates coming in.
>
> For large internal-rewrite-pattern files, there's the NOCOW file
> attribute, which tells btrfs to rewrite the files in place.  It also
> disables the usual checksumming etc., which can likewise take time and
> complicate things on database files, where the database generally
> already has some file integrity management of its own that can "fight"
> with the management btrfs does.
>
> But to be effective, setting nocow (chattr +C /path/to/file/or/dir) needs
> to be done while the file is still zero size, before it has any content.
> The easiest way to do that is to set it on the directory, before the
> files in the directory are created, so they inherit the nocow attribute
> from the directory they're created in.
>
> The easiest solution at this point might be to delete the current
> fragmented files instead of trying to defrag them, set nocow on the
> directory that will contain them, and then trigger a reindex.
>
>
> However, there's one additional caveat involving snapshots.  By
> definition, the first change to a file block after a snapshot will be
> copy-on-write despite the nocow attribute.  This is because the snapshot
> froze the existing file data in place as it was, so a change to it must
> be written to a new location even if the file is set nocow.  This
> shouldn't be too big of a problem if you're just taking a snapshot
> manually every week or so, but if you're using snapper or a similar
> automated script to take hourly or even per-minute snapshots, the effect
> is likely to be nearly as bad as if the file wasn't set nocow in the
> first place!
>
> If this is the case, creating a dedicated subvolume for the directory
> containing these files is the best idea, since snapshots stop at
> subvolume boundaries.  As long as you're not snapshotting that
> subvolume, you can set nocow on directories and files within it and not
> have to worry about snapshot-based cow undermining your efforts.
>
I think you nailed it in terms of this being comparable to stuff like 
virtual machine images and bittorrent.  These are indeed a collection of 
multiple large databases, one over 6GB in size, so it becomes obvious 
why defrag is choking on it.  It was late last night when I posted this, 
but thinking it over through the night, I realized this might be what 
was going on.  So at this point I am just going to continue filtering 
these files out of the defrag.  I don't typically use databases, so this 
kind of blindsided me.  But thanks for confirming what I was already 
beginning to suspect.  This desktop search program IS active continually, 
and I strongly suspect the two programs are colliding in midair as they 
try to manipulate the database content on the drive.  It really does 
produce a train wreck system-wide.  Thanks again for the pointers and 
reminders on this.

