git.vger.kernel.org archive mirror
* Strange "beagle" interaction..
@ 2007-11-13 20:56 Linus Torvalds
  2007-11-13 21:03 ` J. Bruce Fields
  2007-11-13 21:55 ` Mike Hommey
  0 siblings, 2 replies; 9+ messages in thread
From: Linus Torvalds @ 2007-11-13 20:56 UTC (permalink / raw)
  To: Junio C Hamano, Git Mailing List; +Cc: Johannes Schindelin


Ok, I've made a bugzilla entry for this for the Fedora people, but I 
thought I'd mention something I noticed yesterday but only tracked down 
today: it seems like the beagle file indexing code is able to screw up git 
in subtle ways.

I do not know exactly what happens, but the symptoms are random (and 
quite hard-to-trigger) dirty index contents where git believes that some 
set of files are not clean in the index.

I *suspect* that beagle is playing games with the file access times, 
causing the ctime on disk to not match the ce_ctime in the index file. But 
that's just a guess.

I'm posting here in case somebody on the list knows what beagle does, or 
somebody has been bitten by strange behaviour and realizes that he has 
beagle running and prefers to fix the problem by just disabling beagle 
(which will also be a great boon for performance - beagle seems to be very 
good at flushing your file caches, but I guess that's not a bug, but a 
"feature").

The easiest way I have found so far to trigger this is to run

	while ./t7003-filter-branch.sh -i; do echo ok; done

in the git t/ directory, while at the same time telling beagle to index 
just that git/t/ directory. That seems to trigger a failure on subtest 17 
fairly reliably (not the first time through the loop, but *eventually* - 
it takes a few minutes). I think it's because "git filter-branch" requires 
the index to be clean.

(But I've also seen it fail on subtest 4).

I opened bugzilla

	https://bugzilla.redhat.com/show_bug.cgi?id=380791

for this, since I consider it a beagle bug (indexing shouldn't change 
directory state, and if beagle wants to avoid changing access times, it 
should use O_NOATIME). But I don't actually know exactly what it is that 
causes problems, so if somebody is interested and tries to figure this 
out, that would probably be good.
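
As a rough illustration of the O_NOATIME suggestion, here is a minimal Python sketch of reading a file while asking the kernel not to touch its access time. The helper name `read_without_atime_update` is made up for this example, and the fallback for platforms or files where O_NOATIME is not allowed is an assumption about what an indexer might reasonably do, not a description of beagle's code.

```python
import os
import tempfile


def read_without_atime_update(path):
    """Read a file, asking the kernel not to update its access time."""
    # O_NOATIME is Linux-specific; fall back to a plain read elsewhere.
    flags = os.O_RDONLY | getattr(os, "O_NOATIME", 0)
    try:
        fd = os.open(path, flags)
    except PermissionError:
        # The kernel refuses O_NOATIME unless the caller owns the file
        # (or is privileged), so retry as an ordinary read.
        fd = os.open(path, os.O_RDONLY)
    with os.fdopen(fd, "rb") as f:
        return f.read()


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"hello\n")
    before = os.stat(tmp.name).st_atime_ns
    data = read_without_atime_update(tmp.name)
    after = os.stat(tmp.name).st_atime_ns
    print(data, "atime changed:", before != after)
    os.unlink(tmp.name)
```

Note that this only covers access times. It does nothing about other metadata an indexer might write back to the file.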

			Linus


* Re: Strange "beagle" interaction..
  2007-11-13 20:56 Strange "beagle" interaction Linus Torvalds
@ 2007-11-13 21:03 ` J. Bruce Fields
  2007-11-13 21:21   ` J. Bruce Fields
  2007-11-13 21:30   ` Linus Torvalds
  2007-11-13 21:55 ` Mike Hommey
  1 sibling, 2 replies; 9+ messages in thread
From: J. Bruce Fields @ 2007-11-13 21:03 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List, Johannes Schindelin

On Tue, Nov 13, 2007 at 12:56:19PM -0800, Linus Torvalds wrote:
> 
> Ok, I've made a bugzilla entry for this for the Fedora people, but I 
> thought I'd mention something I noticed yesterday but only tracked down 
> today: it seems like the beagle file indexing code is able to screw up git 
> in subtle ways.
> 
> I do not know exactly what happens, but the symptoms are random (and 
> quite hard-to-trigger) dirty index contents where git believes that some 
> set of files are not clean in the index.
> 
> I *suspect* that beagle is playing games with the file access times, 
> causing the ctime on disk to not match the ce_ctime in the index file. But 
> that's just a guess.
> 
> I'm posting here in case somebody on the list knows what beagle does, or 
> somebody has been bitten by strange behaviour and realizes that he has 
> beagle running and prefers to fix the problem by just disabling beagle 
> (which will also be a great boon for performance - beagle seems to be very 
> good at flushing your file caches, but I guess that's not a bug, but a 
> "feature").

Last I ran across this, I believe I found it was adding extended
attributes to the file.  I think it's something like

	getfattr -d

to show all the extended attributes set on the file.  Does that show
anything?

Yeah, I just turned off beagle.  It looked to me like it was doing
something wrongheaded.

--b.

> 
> The easiest way I have found so far to trigger this is to run
> 
> 	while ./t7003-filter-branch.sh -i; do echo ok; done
> 
> in the git t/ directory, while at the same time telling beagle to index 
> just that git/t/ directory. That seems to trigger a failure on subtest 17 
> fairly reliably (not the first time through the loop, but *eventually* - 
> it takes a few minutes). I think it's because "git filter-branch" requires 
> the index to be clean.
> 
> (But I've also seen it fail on subtest 4).
> 
> I opened bugzilla
> 
> 	https://bugzilla.redhat.com/show_bug.cgi?id=380791
> 
> for this, since I consider it a beagle bug (indexing shouldn't change 
> directory state, and if beagle wants to avoid changing access times, it 
> should use O_NOATIME). But I don't actually know exactly what it is that 
> causes problems, so if somebody is interested and tries to figure this 
> out, that would probably be good.
> 
> 			Linus


* Re: Strange "beagle" interaction..
  2007-11-13 21:03 ` J. Bruce Fields
@ 2007-11-13 21:21   ` J. Bruce Fields
  2007-11-13 21:30   ` Linus Torvalds
  1 sibling, 0 replies; 9+ messages in thread
From: J. Bruce Fields @ 2007-11-13 21:21 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List, Johannes Schindelin

On Tue, Nov 13, 2007 at 04:03:54PM -0500, J. Bruce Fields wrote:
> On Tue, Nov 13, 2007 at 12:56:19PM -0800, Linus Torvalds wrote:
> > 
> > Ok, I've made a bugzilla entry for this for the Fedora people, but I 
> > thought I'd mention something I noticed yesterday but only tracked down 
> > today: it seems like the beagle file indexing code is able to screw up git 
> > in subtle ways.
> > 
> > I do not know exactly what happens, but the symptoms are random (and 
> > quite hard-to-trigger) dirty index contents where git believes that some 
> > set of files are not clean in the index.
> > 
> > I *suspect* that beagle is playing games with the file access times, 
> > causing the ctime on disk to not match the ce_ctime in the index file. But 
> > that's just a guess.
> > 
> > I'm posting here in case somebody on the list knows what beagle does, or 
> > somebody has been bitten by strange behaviour and realizes that he has 
> > beagle running and prefers to fix the problem by just disabling beagle 
> > (which will also be a great boon for performance - beagle seems to be very 
> > good at flushing your file caches, but I guess that's not a bug, but a 
> > "feature").
> 
> Last I ran across this, I believe I found it was adding extended
> attributes to the file.  I think it's something like
> 
> 	getfattr -d
> 
> to show all the extended attributes set on the file.  Does that show
> anything?

By the way, on Debian or Ubuntu, at least, that requires an "apt-get
install attr" first.

> 
> Yeah, I just turned off beagle.  It looked to me like it was doing
> something wrongheaded.

Just looking at the attribute names and taking a wild guess, it looked
to me like beagle was computing a checksum of each file's data and
comparing it to a checksum previously stored in an xattr, and using that
to decide whether to reindex the file data.

With the result that to check whether anything's changed when it starts
up again it has to read through the entire filesystem's data.
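
To make that guess concrete, here is a small Python sketch of a "compare against a stored checksum" check. It is only a guess at the general shape, not beagle's actual logic: the function names are invented, SHA-1 is an arbitrary choice, and a plain dict stands in for wherever the checksum is kept (an xattr, if the guess above is right).

```python
import hashlib


def file_checksum(path):
    """Hash the full contents of a file; this is the expensive part."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def needs_reindex(path, stored):
    """Return True if the file's checksum differs from the stored one.

    `stored` is a plain dict standing in for the indexer's per-file
    checksum storage (hypothetical; the thread suggests an xattr).
    """
    current = file_checksum(path)  # reads every byte of the file
    if stored.get(path) == current:
        return False
    stored[path] = current
    return True
```

The cost is visible in `file_checksum`: deciding that nothing changed still means reading the whole file.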

Maybe I'm wrong--I hope so.  I'd love to know.

--b.


* Re: Strange "beagle" interaction..
  2007-11-13 21:03 ` J. Bruce Fields
  2007-11-13 21:21   ` J. Bruce Fields
@ 2007-11-13 21:30   ` Linus Torvalds
  2007-11-13 21:44     ` Jon Smirl
  1 sibling, 1 reply; 9+ messages in thread
From: Linus Torvalds @ 2007-11-13 21:30 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Junio C Hamano, Git Mailing List, Johannes Schindelin



On Tue, 13 Nov 2007, J. Bruce Fields wrote:
> 
> Last I ran across this, I believe I found it was adding extended
> attributes to the file.

Yeah, I just straced it and found the same thing. It's saving fingerprints 
and mtimes to files in the extended attributes.

> Yeah, I just turned off beagle.  It looked to me like it was doing
> something wrongheaded.

Gaah. The problem is, setting xattrs does actually change ctime. Which 
means that if we want to make git play nice with beagle, I guess we have 
to just remove the comparison of ctime.
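
For anyone who wants to see the effect for themselves, here is a minimal Python sketch that applies a metadata-only change to a file and reports which stat timestamps moved. The helper names and the `user.example` attribute name are made up for illustration. `os.setxattr` is Linux-only and can fail on filesystems without user xattr support, so the demo catches that case.

```python
import os
import tempfile
import time


def timestamp_effect(path, mutate, pause=0.05):
    """Apply mutate(path) and report which stat timestamps moved."""
    before = os.stat(path)
    time.sleep(pause)  # let the clock advance past the timestamp granularity
    mutate(path)
    after = os.stat(path)
    return {
        "ctime_changed": after.st_ctime_ns != before.st_ctime_ns,
        "mtime_changed": after.st_mtime_ns != before.st_mtime_ns,
    }


def set_user_xattr(path):
    # "user.example" is a made-up attribute name for illustration only.
    os.setxattr(path, "user.example", b"1")


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"data\n")
    try:
        print(timestamp_effect(tmp.name, set_user_xattr))
    except (OSError, AttributeError) as err:
        print("xattrs not usable here:", err)
    finally:
        os.unlink(tmp.name)
```

Any other metadata-only change can be passed as `mutate`, for example `lambda p: os.chmod(p, 0o644)`. The contents are never touched, so the mtime should stay put while the ctime moves.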

Oh, well. Git doesn't *require* it, but I like the notion of checking the 
inode really really carefully. But it looks like it may not be an option, 
because of file indexers hiding stuff behind our backs.

Or we could just tell people not to run beagle on their git trees, but I 
suspect some people will actually *want* to. Even if it flushes their disk 
caches.

		Linus


* Re: Strange "beagle" interaction..
  2007-11-13 21:30   ` Linus Torvalds
@ 2007-11-13 21:44     ` Jon Smirl
  2007-11-13 21:49       ` David Brown
  2007-11-13 21:50       ` J. Bruce Fields
  0 siblings, 2 replies; 9+ messages in thread
From: Jon Smirl @ 2007-11-13 21:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: J. Bruce Fields, Junio C Hamano, Git Mailing List,
	Johannes Schindelin

On 11/13/07, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>
> On Tue, 13 Nov 2007, J. Bruce Fields wrote:
> >
> > Last I ran across this, I believe I found it was adding extended
> > attributes to the file.
>
> Yeah, I just straced it and found the same thing. It's saving fingerprints
> and mtimes to files in the extended attributes.

Things like Beagle need a guaranteed log of global inotify events.
That would let them efficiently find changes made since the last time
they updated their index.

Right now every time Beagle starts it hasn't got a clue what has
changed in the file system since it was last run. This forces Beagle
to rescan the entire filesystem every time it is started. The xattrs
are used as cache to reduce this load somewhat.

A better solution would be for the kernel to log inotify events to
disk in a manner that survives reboots. When Beagle starts it would
locate its last checkpoint and then process the logged inotify events
from that time forward. This inotify logging needs to be bullet proof
or it will mess up your Beagle index.

Logged file systems already contain the logged inotify data (in their
own internal form). There's just no universal API for retrieving it in
a file system independent manner.
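
To sketch what "checkpoint plus replay" could look like from the indexer's side, here is a small user-space illustration in Python. Nothing like this exists as a kernel API as far as this thread is concerned: the `ChangeLog` class, its JSON-lines file format, and the sequence numbers are all invented for the example.

```python
import json
import os


class ChangeLog:
    """Append-only log of (sequence number, path) change records."""

    def __init__(self, log_path):
        self.log_path = log_path

    def record(self, seq, path):
        """Append one change record and force it to disk."""
        with open(self.log_path, "a") as f:
            f.write(json.dumps({"seq": seq, "path": path}) + "\n")
            f.flush()
            os.fsync(f.fileno())  # the record should survive a reboot

    def changed_since(self, checkpoint):
        """Return paths changed after `checkpoint`, without a full scan."""
        changed = []
        if not os.path.exists(self.log_path):
            return changed
        with open(self.log_path) as f:
            for line in f:
                entry = json.loads(line)
                if entry["seq"] > checkpoint and entry["path"] not in changed:
                    changed.append(entry["path"])
        return changed
```

An indexer would remember the last sequence number it processed and call `changed_since()` with it at startup, touching only the listed paths instead of walking the whole tree.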

>
> > Yeah, I just turned off beagle.  It looked to me like it was doing
> > something wrongheaded.
>
> Gaah. The problem is, setting xattrs does actually change ctime. Which
> means that if we want to make git play nice with beagle, I guess we have
> to just remove the comparison of ctime.
>
> Oh, well. Git doesn't *require* it, but I like the notion of checking the
> inode really really carefully. But it looks like it may not be an option,
> because of file indexers hiding stuff behind our backs.
>
> Or we could just tell people not to run beagle on their git trees, but I
> suspect some people will actually *want* to. Even if it flushes their disk
> caches.
>
>                 Linus


-- 
Jon Smirl
jonsmirl@gmail.com


* Re: Strange "beagle" interaction..
  2007-11-13 21:44     ` Jon Smirl
@ 2007-11-13 21:49       ` David Brown
  2007-11-13 21:50       ` J. Bruce Fields
  1 sibling, 0 replies; 9+ messages in thread
From: David Brown @ 2007-11-13 21:49 UTC (permalink / raw)
  To: Jon Smirl
  Cc: Linus Torvalds, J. Bruce Fields, Junio C Hamano, Git Mailing List,
	Johannes Schindelin

On Tue, Nov 13, 2007 at 04:44:33PM -0500, Jon Smirl wrote:

>A better solution would be for the kernel to log inotify events to
>disk in a manner that survives reboots. When Beagle starts it would
>locate its last checkpoint and then process the logged inotify events
>from that time forward. This inotify logging needs to be bullet proof
>or it will mess up your Beagle index.

Perhaps something similar to FsEvents on OSX which is a daemon that
interfaces with the OS to record this very information.

It only works across clean reboots, but it does work there.  Do a bad
shutdown, and your next backup or index takes a long time to go scan
everything.

It would be wonderful to have this for backups as well.

Dave


* Re: Strange "beagle" interaction..
  2007-11-13 21:44     ` Jon Smirl
  2007-11-13 21:49       ` David Brown
@ 2007-11-13 21:50       ` J. Bruce Fields
  2007-11-13 21:58         ` Jon Smirl
  1 sibling, 1 reply; 9+ messages in thread
From: J. Bruce Fields @ 2007-11-13 21:50 UTC (permalink / raw)
  To: Jon Smirl
  Cc: Linus Torvalds, Junio C Hamano, Git Mailing List,
	Johannes Schindelin

On Tue, Nov 13, 2007 at 04:44:33PM -0500, Jon Smirl wrote:
> On 11/13/07, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> >
> >
> > On Tue, 13 Nov 2007, J. Bruce Fields wrote:
> > >
> > > Last I ran across this, I believe I found it was adding extended
> > > attributes to the file.
> >
> > Yeah, I just straced it and found the same thing. It's saving fingerprints
> > and mtimes to files in the extended attributes.
> 
> Things like Beagle need a guaranteed log of global inotify events.
> That would let them efficiently find changes made since the last time
> they updated their index.

Wouldn't a simple change-attribute get you most of the way there?  All
you need is a number that's guaranteed to increase any time a file is
updated.

Lacking that, git's current approach (snapshot all the stat data, then
look closer at any files that appear to have been touched within a
second of the stat) seems pretty sensible.
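
Roughly, that approach can be sketched in Python as below. This is a simplified illustration and not git's actual code: the `StatInfo` tuple, the three result labels, the one-second window, and the `trust_ctime` switch (which models dropping the ctime comparison discussed elsewhere in this thread) are all assumptions made for the example.

```python
from collections import namedtuple

StatInfo = namedtuple("StatInfo", "mtime_ns ctime_ns size ino")

ONE_SECOND_NS = 1_000_000_000


def snapshot(st):
    """Build a StatInfo from an os.stat_result."""
    return StatInfo(st.st_mtime_ns, st.st_ctime_ns, st.st_size, st.st_ino)


def classify(cached, current, snapshot_time_ns, trust_ctime=True):
    """Compare cached stat data with the file's current stat data.

    Returns "dirty" if the stat data differs, "recheck" if it matches
    but the file was modified within a second of the snapshot (so the
    contents need a closer look), and "clean" otherwise.
    """
    fields = ["mtime_ns", "size", "ino"]
    if trust_ctime:
        fields.append("ctime_ns")
    if any(getattr(cached, f) != getattr(current, f) for f in fields):
        return "dirty"
    # The stat data matches, but a write landing in the same second as
    # the snapshot may not have moved the timestamp, so look closer.
    if snapshot_time_ns - cached.mtime_ns < ONE_SECOND_NS:
        return "recheck"
    return "clean"
```

With `trust_ctime=True`, a file whose only difference is a bumped ctime (say, from an xattr write) is reported as "dirty" even though its contents are untouched.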

--b.

> Right now every time Beagle starts it hasn't got a clue what has
> changed in the file system since it was last run. This forces Beagle
> to rescan the entire filesystem every time it is started. The xattrs
> are used as cache to reduce this load somewhat.
> 
> A better solution would be for the kernel to log inotify events to
> disk in a manner that survives reboots. When Beagle starts it would
> locate its last checkpoint and then process the logged inotify events
> from that time forward. This inotify logging needs to be bullet proof
> or it will mess up your Beagle index.
> 
> Logged file systems already contain the logged inotify data (in their
> own internal form). There's just no universal API for retrieving it in
> a file system independent manner.


* Re: Strange "beagle" interaction..
  2007-11-13 20:56 Strange "beagle" interaction Linus Torvalds
  2007-11-13 21:03 ` J. Bruce Fields
@ 2007-11-13 21:55 ` Mike Hommey
  1 sibling, 0 replies; 9+ messages in thread
From: Mike Hommey @ 2007-11-13 21:55 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List, Johannes Schindelin

On Tue, Nov 13, 2007 at 12:56:19PM -0800, Linus Torvalds wrote:
> 
> Ok, I've made a bugzilla entry for this for the Fedora people, but I 
> thought I'd mention something I noticed yesterday but only tracked down 
> today: it seems like the beagle file indexing code is able to screw up git 
> in subtle ways.
> 
> I do not know exactly what happens, but the symptoms are random (and 
> quite hard-to-trigger) dirty index contents where git believes that some 
> set of files are not clean in the index.
> 
> I *suspect* that beagle is playing games with the file access times, 
> causing the ctime on disk to not match the ce_ctime in the index file. But 
> that's just a guess.
(...)

IIRC, beagle stores a bunch of useful information for itself in extended
attributes on indexed files. That is likely what is tampering with the
file stats.

Mike


* Re: Strange "beagle" interaction..
  2007-11-13 21:50       ` J. Bruce Fields
@ 2007-11-13 21:58         ` Jon Smirl
  0 siblings, 0 replies; 9+ messages in thread
From: Jon Smirl @ 2007-11-13 21:58 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Linus Torvalds, Junio C Hamano, Git Mailing List,
	Johannes Schindelin

On 11/13/07, J. Bruce Fields <bfields@fieldses.org> wrote:
> On Tue, Nov 13, 2007 at 04:44:33PM -0500, Jon Smirl wrote:
> > On 11/13/07, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > >
> > >
> > > On Tue, 13 Nov 2007, J. Bruce Fields wrote:
> > > >
> > > > Last I ran across this, I believe I found it was adding extended
> > > > attributes to the file.
> > >
> > > Yeah, I just straced it and found the same thing. It's saving fingerprints
> > > and mtimes to files in the extended attributes.
> >
> > Things like Beagle need a guaranteed log of global inotify events.
> > That would let them efficiently find changes made since the last time
> > they updated their index.
>
> Wouldn't a simple change-attribute get you most of the way there?  All
> you need is a number that's guaranteed to increase any time a file is
> updated.

You still need to look at every file in the file system. People can
have many millions of files in their file systems (I have two million
in mine and that's small). The inotify log is the most efficient
solution.

I've turned Beagle off simply because it beats on my disk for an hour
after I reboot.

>
> Lacking that, git's current approach (snapshot all the stat data, then
> look closer at any files that appear to have been touched within a
> second of the stat) seems pretty sensible.
>
> --b.
>
> > Right now every time Beagle starts it hasn't got a clue what has
> > changed in the file system since it was last run. This forces Beagle
> > to rescan the entire filesystem every time it is started. The xattrs
> > are used as cache to reduce this load somewhat.
> >
> > A better solution would be for the kernel to log inotify events to
> > disk in a manner that survives reboots. When Beagle starts it would
> > locate its last checkpoint and then process the logged inotify events
> > from that time forward. This inotify logging needs to be bullet proof
> > or it will mess up your Beagle index.
> >
> > Logged file systems already contain the logged inotify data (in their
> > own internal form). There's just no universal API for retrieving it in
> > a file system independent manner.
>


-- 
Jon Smirl
jonsmirl@gmail.com
