* Keeping unreachable objects in a separate pack instead of loose? @ 2012-06-10 12:31 Theodore Ts'o 2012-06-10 23:24 ` Hallvard B Furuseth 2012-06-11 15:40 ` Junio C Hamano 0 siblings, 2 replies; 48+ messages in thread From: Theodore Ts'o @ 2012-06-10 12:31 UTC (permalink / raw) To: git I recently noticed that after a git gc, I had a huge number of loose objects that were unreachable. In fact, about 4.5 megabytes worth of objects. When I packed them, via:

    cd .git/objects ; find [0-9a-f][0-9a-f] -type f | tr -d / | git pack-objects pack

the resulting pack file was 244k. Which got me thinking.... the whole point of leaving the objects loose is to make it easier to expire them, right? But given how expensive it is to have loose objects lying around, why not:

a) Have git-pack-objects have an option which writes the unreachable objects into a separate pack file, instead of kicking them loose?

b) Have git-prune delete a pack only if *all* of the objects in the pack meet the expiry deadline?

What would be the downsides of pursuing such a strategy? Is it worth trying to implement as a proof-of-concept? - Ted ^ permalink raw reply [flat|nested] 48+ messages in thread
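The effect Ted describes is easy to observe with stock git commands. A minimal sketch, assuming a repository where branches were recently rewound or deleted; "count" and "size" in the output are the number and total size (in KiB) of loose objects:

    # Before: mostly packed.
    git count-objects -v
    # gc repacks the reachable objects, but explodes unreachable,
    # not-yet-expired objects out of the old packs as loose files.
    git gc
    # After: "count" and "size" can balloon with unreachable loose objects.
    git count-objects -v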
* Re: Keeping unreachable objects in a separate pack instead of loose? 2012-06-10 12:31 Keeping unreachable objects in a separate pack instead of loose? Theodore Ts'o @ 2012-06-10 23:24 ` Hallvard B Furuseth 2012-06-11 14:44 ` Thomas Rast 2012-06-11 15:40 ` Junio C Hamano 1 sibling, 1 reply; 48+ messages in thread From: Hallvard B Furuseth @ 2012-06-10 23:24 UTC (permalink / raw) To: Theodore Ts'o; +Cc: git On Sun, June 10, 2012 14:31, Theodore Ts'o wrote: > I recently noticed that after a git gc, I had a huge number of loose > objects that were unreachable. In fact, about 4.5 megabytes worth of > objects. I got gigabytes once, and a full disk. See thread "git gc == git garbage-create from removed branch", May 3 2012. > When I packed them, via: > > cd .git/objects ; find [0-9a-f][0-9a-f] -type f | tr -d / | git pack-objects pack > > the resulting pack file was 244k. > > Which got me thinking.... the whole point of leaving the objects loose > is to make it easier to expire them, right? But given how expensive it > is to have loose objects lying around, why not: > > a) Have git-pack-objects have an option which writes the unreachable > objects into a separate pack file, instead of kicking them loose? I think this should be the default. It's very unintuitive that gc can eat up lots of disk space instead of saving space. Until this is fixed, this behavior needs to be documented - along with how to avoid it. > b) Have git-prune delete a pack only if *all* of the objects in the > pack meet the expiry deadline? > > What would be the downsides of pursuing such a strategy? Is it worth > trying to implement as a proof-of-concept? Hallvard ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Keeping unreachable objects in a separate pack instead of loose? 2012-06-10 23:24 ` Hallvard B Furuseth @ 2012-06-11 14:44 ` Thomas Rast 2012-06-11 15:31 ` Ted Ts'o 0 siblings, 1 reply; 48+ messages in thread From: Thomas Rast @ 2012-06-11 14:44 UTC (permalink / raw) To: Hallvard B Furuseth; +Cc: Theodore Ts'o, git, Jeff King, Nicolas Pitre "Hallvard B Furuseth" <h.b.furuseth@usit.uio.no> writes: > On Sun, June 10, 2012 14:31, Theodore Ts'o wrote: >> I recently noticed that after a git gc, I had a huge number of loose >> objects that were unreachable. In fact, about 4.5 megabytes worth of >> objects. > > I got gigabytes once, and a full disk. See thread > "git gc == git garbage-create from removed branch", May 3 2012. > >> When I packed them, via: >> >> cd .git/objects ; find [0-9a-f][0-9a-f] -type f | tr -d / | git pack-objects pack >> >> the resulting pack file was 244k. >> >> Which got me thinking.... the whole point of leaving the objects loose >> is to make it easier to expire them, right? But given how expensive it >> is to have loose objects lying around, why not: >> >> a) Have git-pack-objects have an option which writes the unreachable >> objects into a separate pack file, instead of kicking them loose? > > I think this should be the default. It's very unintuitive that > gc can eat up lots of disk space instead of saving space. > > Until this is fixed, this behavior needs to be documented - > along with how to avoid it. Starting with v1.7.10.2, and in the v1.7.11-rc versions, there's a change by Peff: 7e52f56 (gc: do not explode objects which will be immediately pruned, 2012-04-07). Does it solve your problem? -- Thomas Rast trast@{inf,student}.ethz.ch ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Keeping unreachable objects in a separate pack instead of loose? 2012-06-11 14:44 ` Thomas Rast @ 2012-06-11 15:31 ` Ted Ts'o 2012-06-11 16:08 ` Jeff King 0 siblings, 1 reply; 48+ messages in thread From: Ted Ts'o @ 2012-06-11 15:31 UTC (permalink / raw) To: Thomas Rast; +Cc: Hallvard B Furuseth, git, Jeff King, Nicolas Pitre On Mon, Jun 11, 2012 at 04:44:29PM +0200, Thomas Rast wrote: > > Starting with v1.7.10.2, and in the v1.7.11-rc versions, there's a > change by Peff: 7e52f56 (gc: do not explode objects which will be > immediately pruned, 2012-04-07). Does it solve your problem? I'm currently using 1.7.10.2.552.gaa3bb87, and a "git gc" still kicked loose a little over 4.5 megabytes of loose objects that were not pruned via "git prune" (since they hadn't yet expired). These loose objects could be stored in a 244k pack file. So while I'm sure that change has helped, if you happen to use a workflow that uses git rebase and/or guilt and/or throwaway test integration branches a lot, there will still be a large number of unexpired commits which get kicked loose, and won't get pruned for a week or two. What I think would make sense is for git pack-objects to have a new option which outputs a list of object IDs which would have been kicked out as loose objects if it had been given the (undocumented) --unpack-unreachable option. Then the git-repack shell script (if given the -A option) would use that new option instead of --unpack-unreachable, and then using the list created by this new option, create another pack which contains all of these unreachable-but-not-yet-expired objects. Regards, - Ted ^ permalink raw reply [flat|nested] 48+ messages in thread
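A sketch of the flow being proposed here. The --list-unreachable option is hypothetical (it is the new output mode Ted is asking for, named later in this thread), not an existing git flag, and the file names are invented:

    # Hypothetical: have pack-objects emit the IDs of objects it would
    # otherwise have exploded loose (--list-unreachable does not exist).
    git pack-objects --all --reflog --list-unreachable=unreachable.txt \
        .git/objects/pack/pack </dev/null
    # Then pack the unreachable-but-unexpired objects separately,
    # feeding the saved list back in on stdin:
    git pack-objects .git/objects/pack/pack <unreachable.txt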
* Re: Keeping unreachable objects in a separate pack instead of loose? 2012-06-11 15:31 ` Ted Ts'o @ 2012-06-11 16:08 ` Jeff King 2012-06-11 17:04 ` Nicolas Pitre 2012-06-11 17:27 ` Ted Ts'o 1 sibling, 2 replies; 48+ messages in thread From: Jeff King @ 2012-06-11 16:08 UTC (permalink / raw) To: Ted Ts'o; +Cc: Thomas Rast, Hallvard B Furuseth, git, Nicolas Pitre On Mon, Jun 11, 2012 at 11:31:03AM -0400, Ted Ts'o wrote: > I'm currently using 1.7.10.2.552.gaa3bb87, and a "git gc" still kicked > loose a little over 4.5 megabytes of loose objects that were not pruned via > "git prune" (since they hadn't yet expired). These loose objects > could be stored in a 244k pack file. Out of curiosity, what is the size of the whole repo? If it's a 500M kernel repository, then 4.5M is not all _that_ worrisome. Not that it could not be better, or that it's not worth addressing (since there are corner cases that behave way worse). But it gives a sense of the urgency of the problem, if that is the scope of the issue for average use. > What I think would make sense is for git pack-objects to have a new > option which outputs a list of object IDs which would have been > kicked out as loose objects if it had been given the (undocumented) > --unpack-unreachable option. Then the git-repack shell script (if > given the -A option) would use that new option instead of > --unpack-unreachable, and then using the list created by this new > option, create another pack which contains all of these > unreachable-but-not-yet-expired objects. I don't think that will work, because we will keep repacking the unreachable bits into new packs. And the 2-week expiration is based on the pack timestamp. So if your "repack -Ad" ends in two packs (the one you actually want, and the pack of expired crap), then you would get into this cycle:

1. You run "git repack -Ad". It makes A.pack, with stuff you want, and B.pack, with unreachable junk. They both get a timestamp of "now".

2. A day passes. You run "git repack -Ad" again. It makes C.pack, the new stuff you want, and repacks all of B.pack along with the new expired cruft from A.pack, making D.pack. B.pack can go away. D.pack gets a timestamp of "now".

And so on, as long as you repack within the two week window, the objects from the cruft pack will never get ejected. So you might suggest that the problem is that in step 2, we repack the items from B. But if you don't, then you will accumulate a bunch of cruft packs (2 weeks worth), and those objects won't be delta'd against each other. It's probably better than making them all loose, of course (you get chunks of delta'd objects from each repack, instead of none at all), but it's far from a full solution to the issue. I think solving it for good would involve a separate list of per-object expiration dates. Obviously we get that easily with loose objects (since it is one object per file). As a workaround, it might be worth lowering the default pruneExpire from 2 weeks to 1 day or something. It is really about creating safety for operations in progress (e.g., you write the object, and then are _about_ to add it to the index or update a ref when it gets pruned). I think the 2 weeks number was pulled out of a hat as "absurdly long for an operation to take", and was never revisited because nobody cared or complained. -Peff ^ permalink raw reply [flat|nested] 48+ messages in thread
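The workaround suggested here is a one-line configuration change; gc.pruneExpire is a real knob and accepts an approxidate:

    # Shorten the prune grace period from the 2-week default:
    git config gc.pruneExpire "1.day.ago"
    # Or do it for a single collection only:
    git gc --prune="1.hour.ago"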
* Re: Keeping unreachable objects in a separate pack instead of loose? 2012-06-11 16:08 ` Jeff King @ 2012-06-11 17:04 ` Nicolas Pitre 2012-06-11 17:45 ` Ted Ts'o 2012-06-11 17:46 ` Jeff King 1 sibling, 2 replies; 48+ messages in thread From: Nicolas Pitre @ 2012-06-11 17:04 UTC (permalink / raw) To: Jeff King; +Cc: Ted Ts'o, Thomas Rast, Hallvard B Furuseth, git On Mon, 11 Jun 2012, Jeff King wrote: > As a workaround, it might be worth lowering the default pruneExpire from > 2 weeks to 1 day or something. It is really about creating safety for > operations in progress (e.g., you write the object, and then are _about_ > to add it to the index or update a ref when it gets pruned). I think the > 2 weeks number was pulled out of a hat as "absurdly long for an > operation to take", and was never revisited because nobody cared or > complained. Absolutely. I think lowering this value should even be considered a "fix", not a "workaround". IIRC, the 2 weeks number was instituted when there wasn't any reflog on HEAD and the only way to salvage lost commits was to use 'git fsck --lost-found'. These days, this is used only as a safety measure because there is always a tiny window during which objects are dangling before they're finally all referenced as you say. But someone would have to work hard to hit that race even if the delay was only 30 seconds. So realistically this could even be set to 1 hour. Nicolas ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Keeping unreachable objects in a separate pack instead of loose? 2012-06-11 17:04 ` Nicolas Pitre @ 2012-06-11 17:45 ` Ted Ts'o 2012-06-11 17:54 ` Jeff King 0 siblings, 1 reply; 48+ messages in thread From: Ted Ts'o @ 2012-06-11 17:45 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Jeff King, Thomas Rast, Hallvard B Furuseth, git On Mon, Jun 11, 2012 at 01:04:07PM -0400, Nicolas Pitre wrote: > > IIRC, the 2 weeks number was instituted when there wasn't any reflog on > HEAD and the only way to salvage lost commits was to use 'git fsck > --lost-found'. These days, this is used only as a safety measure > because there is always a tiny window during which objects are dangling > before they're finally all referenced as you say. But someone would > have to work hard to hit that race even if the delay was only 30 > seconds. So realistically this could even be set to 1 hour. There's another useful part of the two week window: it acts as a partial workaround when using .git/objects/info/alternates with one or more rewinding branches. My /usr/projects/e2fsprogs/base repo is a bare repo that contains all of my public branches, including a rewinding "pu" branch. My /usr/projects/e2fsprogs/e2fsprogs uses an alternates file to minimize disk usage, and it points at the base repo. The problem comes when I need to gc the base repo, every 3 months or so. When I do that, objects that belonged to older incarnations of the rewinding pu branch disappear. The two week window gives you a partial saving throw until the development repo breaks due to objects that it depends upon disappearing. It would be nice if a gc of the devel repo knew that some of the objects it was depending on were "expired cruft", and copied them to its local objects directory. But of course things don't work that way. Here's what I do today (please don't barf; I know it's ugly):

1) cd to the base repository; run "git repack -Adfl --window=300 --depth=100"

2) cd to the objects directory, and create my base "cruft" pack:

    find [0-9a-f][0-9a-f] -type f | tr -d / | git pack-objects pack

3) hard link it into my devel repo's pack directory:

    ln pack-* /usr/projects/e2fsprogs/e2fsprogs/.git/objects/pack

4) to save space in my base repo, move it to the pack directory and run git prune-packed:

    mv pack-* pack
    git prune-packed

5) run "git repack -Adfl --window=300 --depth=100" in my devel repo

6) create a cruft pack in my devel repo (to save disk space):

    cd /usr/projects/e2fsprogs/e2fsprogs/.git/objects
    find [0-9a-f][0-9a-f] -type f | tr -d / | git pack-objects pack
    mv pack-* pack
    git prune-packed

This probably falls in the "don't use --shared unless you know what it does" admonition in the git-clone man page. :-) Don't worry, I don't recommend that anyone *else* do this. But it works for me (although it would be nice if I made this workflow a bit more optimized; at the very least I should make a shell script that does all this for me automatically, instead of typing all of the shell commands by hand.) - Ted ^ permalink raw reply [flat|nested] 48+ messages in thread
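The shell script Ted says he should write might look something like this. It is a sketch only, with his paths and window/depth settings hard-coded, and it assumes the loose objects left after each repack are exactly the unreachable-but-unexpired ones:

    #!/bin/sh
    # Sketch of the manual steps above; minimal error handling.
    set -e
    base=/usr/projects/e2fsprogs/base
    devel=/usr/projects/e2fsprogs/e2fsprogs

    # Steps 1-2: repack the base repo, sweep loose objects into a pack.
    cd "$base"
    git repack -Adfl --window=300 --depth=100
    cd objects
    find [0-9a-f][0-9a-f] -type f | tr -d / | git pack-objects pack >/dev/null

    # Step 3: hard-link the cruft pack into the devel repo.
    ln pack-* "$devel/.git/objects/pack"

    # Step 4: move it into place here and drop the loose copies.
    mv pack-* pack
    git prune-packed

    # Steps 5-6: the same dance in the devel repo.
    cd "$devel"
    git repack -Adfl --window=300 --depth=100
    cd .git/objects
    find [0-9a-f][0-9a-f] -type f | tr -d / | git pack-objects pack >/dev/null
    mv pack-* pack
    git prune-packed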
* Re: Keeping unreachable objects in a separate pack instead of loose? 2012-06-11 17:45 ` Ted Ts'o @ 2012-06-11 17:54 ` Jeff King 2012-06-11 18:20 ` Ted Ts'o 0 siblings, 1 reply; 48+ messages in thread From: Jeff King @ 2012-06-11 17:54 UTC (permalink / raw) To: Ted Ts'o; +Cc: Nicolas Pitre, Thomas Rast, Hallvard B Furuseth, git On Mon, Jun 11, 2012 at 01:45:07PM -0400, Ted Ts'o wrote: > There's another useful part of the two week window: it acts as a > partial workaround when using .git/objects/info/alternates with one or more > rewinding branches. > > My /usr/projects/e2fsprogs/base repo is a bare repo that contains all > of my public branches, including a rewinding "pu" branch. > > My /usr/projects/e2fsprogs/e2fsprogs uses an alternates file to > minimize disk usage, and it points at the base repo. > > The problem comes when I need to gc the base repo, every 3 months or > so. When I do that, objects that belonged to older incarnations of > the rewinding pu branch disappear. The two week window gives you a > partial saving throw until the development repo breaks due to objects that > it depends upon disappearing. You're doing it wrong (but you can hardly be blamed, because there isn't good tool support for doing it right). You should never prune or repack in the base repo without taking into account all of the refs of its children. We have a similar setup at github (every time you "fork" a repo, it is creating a new repo that links back to a project-wide "network" repo for its object store). We maintain a refs/remotes/XXX directory for each child repo, which stores the complete refs/ hierarchy of that child. It's all done by a tangled mass of shell scripts. I've considered trying to clean up and open source them, but I really doubt it would help people generally. There's so much stuff specific to our setup. -Peff ^ permalink raw reply [flat|nested] 48+ messages in thread
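The core of the scheme Peff describes needs no special tooling: mirror every child's refs into the shared repo with a broad refspec before repacking it. A sketch; the paths and the fork list are invented:

    # Run inside the shared "network" repo before repacking it.
    for fork in /repos/alice.git /repos/bob.git; do
        name=$(basename "$fork" .git)
        # Mirror the child's entire refs/ hierarchy into a private namespace.
        git fetch "$fork" "+refs/*:refs/remotes/$name/*"
    done
    # Now every child's objects are reachable here, so repacking is safe.
    git repack -Adl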
* Re: Keeping unreachable objects in a separate pack instead of loose? 2012-06-11 17:54 ` Jeff King @ 2012-06-11 18:20 ` Ted Ts'o 2012-06-11 18:43 ` Jeff King 0 siblings, 1 reply; 48+ messages in thread From: Ted Ts'o @ 2012-06-11 18:20 UTC (permalink / raw) To: Jeff King; +Cc: Nicolas Pitre, Thomas Rast, Hallvard B Furuseth, git On Mon, Jun 11, 2012 at 01:54:19PM -0400, Jeff King wrote: > > You're doing it wrong (but you can hardly be blamed, because there isn't > good tool support for doing it right). You should never prune or repack > in the base repo without taking into account all of the refs of its > children. Well, I don't do a simple gc. See the complicated set of steps I use to make sure I don't lose loose commits at the end of my last e-mail message on this thread. It gets worse when I have multiple devel repos, but I simplified things for the purposes of discussion. > We have a similar setup at github (every time you "fork" a repo, it is > creating a new repo that links back to a project-wide "network" repo for > its object store). We maintain a refs/remotes/XXX directory for each > child repo, which stores the complete refs/ hierarchy of that child. So you basically are copying the refs around and making sure the parent repo has an up-to-date copy of all of the child repos' refs, such that when you do the repack, *all* of the commits end up in the parent repo, correct? The system that I'm using means that objects which are local to a child repo stay in the child repo, and if an object is about to be dropped from the parent repo as a result of a gc, the child repo has an opportunity to claim a copy of that object for itself in its object database. You can do things either way. I like knowing that objects only used by a child repo are in the child repo's .git directory, but that's arguably more of a question of taste than anything else. - Ted ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Keeping unreachable objects in a separate pack instead of loose? 2012-06-11 18:20 ` Ted Ts'o @ 2012-06-11 18:43 ` Jeff King 0 siblings, 0 replies; 48+ messages in thread From: Jeff King @ 2012-06-11 18:43 UTC (permalink / raw) To: Ted Ts'o; +Cc: Nicolas Pitre, Thomas Rast, Hallvard B Furuseth, git On Mon, Jun 11, 2012 at 02:20:12PM -0400, Ted Ts'o wrote: > On Mon, Jun 11, 2012 at 01:54:19PM -0400, Jeff King wrote: > > > > You're doing it wrong (but you can hardly be blamed, because there isn't > > good tool support for doing it right). You should never prune or repack > > in the base repo without taking into account all of the refs of its > > children. > > Well, I don't do a simple gc. See the complicated set of steps I use > to make sure I don't lose loose commits at the end of my last e-mail > message on this thread. It gets worse when I have multiple devel > repos, but I simplified things for the purposes of discussion. Ah, right. I was thinking that your first step, which is "git repack -Adfl", would throw out old objects rather than unpack them in recent versions of git. But due to the way I implemented it (namely that you must pass --unpack-unreachable yourself, so this feature only kicks in automatically for "git gc"), that is not the case. I don't recall if that was an accident, or if I was very clever in maintaining backwards compatibility for your case. Let's just assume the latter. :) > > We have a similar setup at github (every time you "fork" a repo, it is > > creating a new repo that links back to a project-wide "network" repo for > > its object store). We maintain a refs/remotes/XXX directory for each > > child repo, which stores the complete refs/ hierarchy of that child. > > So you basically are copying the refs around and making sure the > parent repo has an up-to-date copy of all of the child repos' refs, such > that when you do the repack, *all* of the commits end up in the parent > repo, correct? Yes. The child repositories generally have no objects in them at all (they occasionally do for a period between runs of the migration script). > The system that I'm using means that objects which are local to a > child repo stay in the child repo, and if an object is about to be > dropped from the parent repo as a result of a gc, the child repo has > an opportunity to claim a copy of that object for itself in its object > database. That implies the concept of "local to a child repo", which implies that you have some set of "common" refs. I suspect in your case your base repo represents the master branch, or something similar. We actually treat our network repo as a pure parent; every repo, including the original one that everybody forks from, is a child. That makes it easier to treat the original repo as just another repo (e.g., the original owner is free to delete it, and the forks won't care). > You can do things either way. I like knowing that objects only used > by a child repo are in the child repo's .git directory, but that's > arguably more of a question of taste than anything else. Yeah, I don't think there is any real benefit to it. -Peff ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Keeping unreachable objects in a separate pack instead of loose? 2012-06-11 17:04 ` Nicolas Pitre 2012-06-11 17:45 ` Ted Ts'o @ 2012-06-11 17:46 ` Jeff King 1 sibling, 0 replies; 48+ messages in thread From: Jeff King @ 2012-06-11 17:46 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Ted Ts'o, Thomas Rast, Hallvard B Furuseth, git On Mon, Jun 11, 2012 at 01:04:07PM -0400, Nicolas Pitre wrote: > IIRC, the 2 weeks number was instituted when there wasn't any reflog on > HEAD and the only way to salvage lost commits was to use 'git fsck > --lost-found'. These days, this is used only as a safety measure > because there is always a tiny window during which objects are dangling > before they're finally all referenced as you say. But someone would > have to work hard to hit that race even if the delay was only 30 > seconds. So realistically this could even be set to 1 hour. Nope, the HEAD reflog dates back to bd104db (enable separate reflog for HEAD, 2007-01-26), whereas the discussion leading to gc.pruneExpire was in 2008. See this sub-thread for the discussion on the time limit: http://thread.gmane.org/gmane.comp.version-control.git/76899/focus=76943 The interesting points from that thread are:

1. You might care about object prune times if your session is interrupted. So imagine you are working on something, do not commit, go home for the weekend, and then come back. I think we are OK with most in-progress operations, because any blobs added to the index will of course be considered reachable and not pruned. What would hurt most is that you could do:

    $ hack hack hack
    $ git add foo
    $ hack hack hack
    $ git add foo
    $ hack hack hack
    [oops! I realized I really wanted the initial version of foo!]
    $ git fsck --lost-found

If your session is interrupted during the third "hack hack hack" bit, a short prune might lose the initial version. Of course, that's _if_ a gc is run in the middle. So I find it possible, but somewhat unlikely.

2. If a branch is deleted, all of its reflogs go away immediately. A short expiration time would mean that the objects will probably go away at the next "gc". In many cases, you will be saved by the HEAD reflog, which stays around until the real expiration is reached. But not always (e.g., server-side reflogs). I'd much rather address this in the general case by saving deleted-branch reflogs, though I know you and I have disagreed on that in the past.

One issue not brought up in that thread is that of bare repositories, which do not have reflogs enabled by default. In that case, the 2-week prune window is really doing something. I really wonder if there is a good reason not to turn on reflogs for all repositories, bare included. We have them on for all of the server-side repositories at github, and they have been invaluable for resolving many support tickets from people who forced a push without realizing what they were doing. -Peff ^ permalink raw reply [flat|nested] 48+ messages in thread
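Turning reflogs on for a bare repository is a single configuration bit; this is stock git, not github-specific:

    # Reflogs are off by default in bare repos; enable them:
    git config core.logAllRefUpdates true
    # Refs written after this get reflog entries, so commits that were
    # force-pushed over stay findable until the reflog itself expires.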
* Re: Keeping unreachable objects in a separate pack instead of loose? 2012-06-11 16:08 ` Jeff King 2012-06-11 17:04 ` Nicolas Pitre @ 2012-06-11 17:27 ` Ted Ts'o 2012-06-11 18:34 ` Jeff King 1 sibling, 1 reply; 48+ messages in thread From: Ted Ts'o @ 2012-06-11 17:27 UTC (permalink / raw) To: Jeff King; +Cc: Thomas Rast, Hallvard B Furuseth, git, Nicolas Pitre On Mon, Jun 11, 2012 at 12:08:24PM -0400, Jeff King wrote: > On Mon, Jun 11, 2012 at 11:31:03AM -0400, Ted Ts'o wrote: > > > I'm currently using 1.7.10.2.552.gaa3bb87, and a "git gc" still kicked > > loose a little over 4.5 megabytes of loose objects that were not pruned via > > "git prune" (since they hadn't yet expired). These loose objects > > could be stored in a 244k pack file. > > Out of curiosity, what is the size of the whole repo? If it's a 500M > kernel repository, then 4.5M is not all _that_ worrisome. Not that it > could not be better, or that it's not worth addressing (since there are > corner cases that behave way worse). But it gives a sense of the urgency > of the problem, if that is the scope of the issue for average use. It's my e2fsprogs development repo. I have my "base" repo, which is what has been pushed out to the public (including a rewinding pu branch). The total size of that repo is a little over 15 megs:

<tytso@tytso-glaptop.cam.corp.google.com> {/usr/projects/e2fsprogs/e2fsprogs} [maint]
899% ls ../base/objects/pack/
total 16156
  908 pack-6964a1516433f16e43dcdf4fcec1996052099f31.idx
15248 pack-6964a1516433f16e43dcdf4fcec1996052099f31.pack

I then have my development repo, which uses a .git/objects/info/alternates pointing at the bare "base" repo, so the only things in this repo are my private development branches and other things that haven't been pushed for public consumption.

<tytso@tytso-glaptop.cam.corp.google.com> {/usr/projects/e2fsprogs/e2fsprogs} [maint]
900% ls .git/objects/pack/
total 1048
 28 5a486e6c2156109f7dfc725b36a201c10652803d.idx
 28 pack-7b2a9cccab669338f61a681e34c39362976fb5de.idx
224 5a486e6c2156109f7dfc725b36a201c10652803d.pack
768 pack-7b2a9cccab669338f61a681e34c39362976fb5de.pack

The 4.5 megabytes of loose objects packed down to a 224k "cruft" pack, and 768k worth of private development objects. So depending on how you would want to do the comparison, probably the fairest thing to say is that my "good" packs totaled about 16 megs, and the loose cruft objects were an additional 4.5 megabytes. > I don't think that will work, because we will keep repacking the > unreachable bits into new packs. And the 2-week expiration is based on > the pack timestamp. So if your "repack -Ad" ends in two packs (the one > you actually want, and the pack of expired crap), then you would get > into this cycle: > > 1. You run "git repack -Ad". It makes A.pack, with stuff you want, and > B.pack, with unreachable junk. They both get a timestamp of "now". > > 2. A day passes. You run "git repack -Ad" again. It makes C.pack, the > new stuff you want, and repacks all of B.pack along with the > new expired cruft from A.pack, making D.pack. B.pack can go away. > D.pack gets a timestamp of "now". Hmm, yes. What we'd really want to do is to make D.pack contain those items that are newly unreachable, not including the objects in B.pack, and keep B.pack around until the expiry window goes by. But that's a much more complicated thing, and the proof-of-concept algorithm I had outlined wouldn't do that. > I think solving it for good would involve a separate list of per-object expiration dates.
> Obviously we get that easily with loose objects (since it is one object per file). Well, either that or we need to teach git-repack the difference between packs that are expected to contain good stuff, and packs that contain cruft, and to not copy "old cruft" to new packs, so the old pack can finally get nuked 2 weeks (or whatever the expire window might happen to be) later. - Ted ^ permalink raw reply [flat|nested] 48+ messages in thread
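For readers unfamiliar with the alternates mechanism Ted is using: the devel repo borrows the base repo's objects through a one-line plain-text file. A sketch with his paths:

    # Create a space-saving clone whose object store borrows from base:
    git clone --shared /usr/projects/e2fsprogs/base e2fsprogs
    # The link itself is just a file naming the borrowed object directory:
    cat e2fsprogs/.git/objects/info/alternates
    # -> /usr/projects/e2fsprogs/base/objects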
* Re: Keeping unreachable objects in a separate pack instead of loose? 2012-06-11 17:27 ` Ted Ts'o @ 2012-06-11 18:34 ` Jeff King 2012-06-11 20:44 ` Hallvard Breien Furuseth 2012-06-11 21:14 ` Ted Ts'o 0 siblings, 2 replies; 48+ messages in thread From: Jeff King @ 2012-06-11 18:34 UTC (permalink / raw) To: Ted Ts'o; +Cc: Thomas Rast, Hallvard B Furuseth, git, Nicolas Pitre On Mon, Jun 11, 2012 at 01:27:32PM -0400, Ted Ts'o wrote: > The 4.5 megabytes of loose objects packed down to a 224k "cruft" pack, > and 768k worth of private development objects. > > So depending on how you would want to do the comparison, probably the > fairest thing to say is that my "good" packs totaled about > 16 megs, and the loose cruft objects were an additional 4.5 megabytes. OK, so that 4.5 is at least a respectable percentage of the total repo size. I suspect it may be worse for small repos in that sense, because the 4.5 megabytes is not "how big is this repo" but probably "how much work did you do in this repo in the last 2 weeks". Which should be a constant with respect to the total size of the repo. However, for a very busy repo (e.g., one used for automated integration testing or similar), the "how much work" number could be quite high. We ran into this at github due to our "merge this" button, which does test-merges to see if a pull request can be merged cleanly. Whenever the upstream branch is updated, all of the outstanding pull requests get re-tested, and the old test-merge objects become unreferenced. We ended up dropping pruneExpire to 1 day to keep the cruft to a minimum. > > 1. You run "git repack -Ad". It makes A.pack, with stuff you want, > > and B.pack, with unreachable junk. They both get a timestamp of > > "now". > > > > 2. A day passes. You run "git repack -Ad" again. It makes C.pack, > > the new stuff you want, and repacks all of B.pack along with the > > new expired cruft from A.pack, making D.pack. B.pack can go away. > > D.pack gets a timestamp of "now". > > Hmm, yes. What we'd really want to do is to make D.pack contain those > items that are newly unreachable, not including the objects in > B.pack, and keep B.pack around until the expiry window goes by. But > that's a much more complicated thing, and the proof-of-concept > algorithm I had outlined wouldn't do that. Right. When dumping the list of unreachable objects from pack-objects, you'd want to tell it to ignore objects from any pack that contained only unreachable objects in the first place. Except that is perhaps not quite right, either. Because if a pack has 100 objects, but you happen to re-reference 1 of them, you'd probably want to leave it (even though that re-referenced one is now duplicated, the savings aren't worth the trouble of repacking the cruft objects). Of course, what is the N that makes it worth the trouble? Now you're getting into heuristics. You _could_ make a separate cruft pack for each pack that you repack. So if I have A.pack and B.pack, I'd pack all of the reachable objects into C.pack, and then make D.pack containing the unreachable objects from A.pack, and E.pack with the unreachable objects from B.pack. And then set the mtime of the cruft packs to that of their parent packs. And then the next time you pack, repacking D and E would probably be a no-op that preserves mtime, but might create a new pack that ejects some now-reachable object.
To implement that, I think your --list-unreachable would just have to print a list of "<pack-mtime> <sha1>" pairs, and then you would pack each set with an identical mtime (or even a "close enough" mtime within some slop). But yes, this is all getting complicated. :) > > I think solving it for good would involve a separate list of > > per-object expiration dates. Obviously we get that easily with loose > > objects (since it is one object per file). > > Well, either that or we need to teach git-repack the difference > between packs that are expected to contain good stuff, and packs that > contain cruft, and to not copy "old cruft" to new packs, so the old > pack can finally get nuked 2 weeks (or whatever the expire window > might happen to be) later. That is harder because those objects may become re-reachable during that window. So I think you don't want to deal with "expected to contain..." but rather "what does it contain now?". The latter is easy to figure out by doing a reachability analysis (which we do as part of the repack anyway). -Peff ^ permalink raw reply [flat|nested] 48+ messages in thread
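A sketch of the mtime bookkeeping in that last idea, using Peff's A.pack/D.pack names; touch -r does the work:

    # D.pack now holds the unreachable objects ejected from A.pack.
    # Back-date it so the expiry clock keeps counting from A.pack's age
    # instead of restarting at "now":
    touch -r A.pack D.pack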
* Re: Keeping unreachable objects in a separate pack instead of loose? 2012-06-11 18:34 ` Jeff King @ 2012-06-11 20:44 ` Hallvard Breien Furuseth 2012-06-11 21:14 ` Jeff King 1 sibling, 1 reply; 48+ messages in thread From: Hallvard Breien Furuseth @ 2012-06-11 20:44 UTC (permalink / raw) To: Jeff King; +Cc: Ted Ts'o, Thomas Rast, git, Nicolas Pitre On Mon, 11 Jun 2012 14:34:14 -0400, Jeff King <peff@peff.net> wrote: > On Mon, Jun 11, 2012 at 01:27:32PM -0400, Ted Ts'o wrote: >> So depending on how you would want to do the comparison, probably the >> fairest thing to say is that my "good" packs totaled about 16 megs, >> and the loose cruft objects were an additional 4.5 megabytes. > > OK, so that 4.5 is at least a respectable percentage of the total repo > size. I suspect it may be worse for small repos in that sense, (...) 'git gc' gave a 3100% increase with my example:

    $ git clone --bare --branch linux-overhaul-20010122 \
          git://git.savannah.gnu.org/config.git
    $ cd config.git/
    $ git tag -d `git tag`; git branch -D master
    $ du -s objects
    624     objects
    $ git gc
    $ du -s objects
    19840   objects

Basically: Clone/fetch a repo, keep a small part of it, drop the rest, and gc. Gc explodes all the objects you no longer want. This hits you hard if your small project tracks a big one, and later ceases doing so. Maybe your project is to convert a small part of the remote into a library, or you're just tracking a few files from the remote - e.g. from the GNU Config repo. Not something one does every day, but it does happen. Tweaks to expiration time etc. do not help here. Hallvard ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Keeping unreachable objects in a separate pack instead of loose? 2012-06-11 20:44 ` Hallvard Breien Furuseth @ 2012-06-11 21:14 ` Jeff King 2012-06-11 21:41 ` Hallvard Breien Furuseth 0 siblings, 1 reply; 48+ messages in thread From: Jeff King @ 2012-06-11 21:14 UTC (permalink / raw) To: Hallvard Breien Furuseth Cc: Ted Ts'o, Thomas Rast, git, Nicolas Pitre, Nguyễn Thái Ngọc Duy On Mon, Jun 11, 2012 at 10:44:39PM +0200, Hallvard Breien Furuseth wrote: > >OK, so that 4.5 is at least a respectable percentage of the total repo > >size. I suspect it may be worse for small repos in that sense, (...) > > 'git gc' gave a 3100% increase with my example:
>
> $ git clone --bare --branch linux-overhaul-20010122 \
>       git://git.savannah.gnu.org/config.git
> $ cd config.git/
> $ git tag -d `git tag`; git branch -D master
> $ du -s objects
> 624     objects
> $ git gc
> $ du -s objects
> 19840   objects
>
> Basically: Clone/fetch a repo, keep a small part of it, drop the rest, and gc. Gc explodes all the objects you no longer want. I would argue that this is not a very interesting case in the first place, since the right thing to do is use "clone --single-branch"[1] to avoid transferring all of those objects in the first place. But there are plenty of variant cases, where you are not just deleting all of the refs, but rather doing some manipulation of the branches, diffing them to make sure it is safe to drop some bits, running filter-branch, etc. And it would be nice to make those cases more efficient. -Peff [1] It looks like "--single-branch" does not actually work, and still fetches master. I think this is a bug in the implementation of single-branch (it looks like it fetches HEAD unconditionally). +cc Duy. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Keeping unreachable objects in a separate pack instead of loose? 2012-06-11 21:14 ` Jeff King @ 2012-06-11 21:41 ` Hallvard Breien Furuseth 0 siblings, 0 replies; 48+ messages in thread From: Hallvard Breien Furuseth @ 2012-06-11 21:41 UTC (permalink / raw) To: Jeff King Cc: Ted Ts'o, Thomas Rast, git, Nicolas Pitre, Nguyễn Thái Ngọc Duy On Mon, 11 Jun 2012 17:14:14 -0400, Jeff King <peff@peff.net> wrote: > On Mon, Jun 11, 2012 at 10:44:39PM +0200, Hallvard Breien Furuseth wrote: > >> >OK, so that 4.5 is at least a respectable percentage of the total repo >> >size. I suspect it may be worse for small repos in that sense, (...) >> >> 'git gc' gave a 3100% increase with my example:
>>
>> $ git clone --bare --branch linux-overhaul-20010122 \
>>       git://git.savannah.gnu.org/config.git
>> $ cd config.git/
>> $ git tag -d `git tag`; git branch -D master
>> $ du -s objects
>> 624     objects
>> $ git gc
>> $ du -s objects
>> 19840   objects
>>
>> Basically: Clone/fetch a repo, keep a small part of it, drop the rest, and gc. Gc explodes all the objects you no longer want. > > I would argue that this is not a very interesting case in the first > place, since the right thing to do is use "clone --single-branch"[1] > to avoid transferring all of those objects in the first place. Yeah, I just wanted a simple way to fetch a lot and drop most of it, since that's what triggers the problem. A simple use case would be where you did a lot of work between cloning and pruning most refs. I described some actual use cases below the command. > But there are plenty of variant cases, where you are not just deleting > all of the refs, but rather doing some manipulation of the branches, > diffing them to make sure it is safe to drop some bits, running > filter-branch, etc. And it would be nice to make those cases more > efficient. -- Hallvard ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Keeping unreachable objects in a separate pack instead of loose? 2012-06-11 18:34 ` Jeff King 2012-06-11 20:44 ` Hallvard Breien Furuseth @ 2012-06-11 21:14 ` Ted Ts'o 2012-06-11 21:39 ` Jeff King 1 sibling, 1 reply; 48+ messages in thread From: Ted Ts'o @ 2012-06-11 21:14 UTC (permalink / raw) To: Jeff King; +Cc: Thomas Rast, Hallvard B Furuseth, git, Nicolas Pitre On Mon, Jun 11, 2012 at 02:34:14PM -0400, Jeff King wrote: > You _could_ make a separate cruft pack for each pack that you repack. So > if I have A.pack and B.pack, I'd pack all of the reachable objects into > C.pack, and then make D.pack containing the unreachable objects from > A.pack, and E.pack with the unreachable objects from B.pack. And then > set the mtime of the cruft packs to that of their parent packs. > > And then the next time you pack, repacking D and E would probably be a > no-op that preserves mtime, but might create a new pack that ejects some > now-reachable object. > > To implement that, I think your --list-unreachable would just have to > print a list of "<pack-mtime> <sha1>" pairs, and then you would pack > each set with an identical mtime (or even a "close enough" mtime within > some slop).... How about this instead? We distinguish between cruft packs and "real" packs by the filename. So we have "cruft-<SHA1>.{idx,pack}" and "pack-<SHA1>.{idx,pack}". Normally, git will look at any pack in the pack directory that has an .idx and .pack extension, but during repack operation, it will only look in the pack-* packs first. If it can't find an object there, it will then fall back to trying to fetch the object from the cruft-* packs, and if it finds the object, it copies it into the new pack which it is creating, thus "rescuing" an object which reappears during the expiry window. This should be a relatively rare event, and if it happens, the object will be in two packs, a pack-* pack and a cruft-* pack, but that's OK. So since git pack-objects isn't even looking in the cruft-* packs except when it needs to rescue an object, the objects in the cruft-* packs won't get copied, and we won't need to have per-object mtimes. It also means it will go faster since it's not copying the cruft-* packs at all, and possibly not even looking at them. Now all we need to do is delete any cruft-* packs which are older than the expiry window. We don't even need to look at their contents. It does imply that we may accumulate a new cruft-<SHA1> pack each time we run git gc, but users shouldn't be running git gc all that often anyway. And even if they do run it all the time, it will still be more efficient than keeping the unreachable objects as loose objects. - Ted ^ permalink raw reply [flat|nested] 48+ messages in thread
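Under this naming scheme the expiry step reduces to an mtime sweep. A sketch, assuming GNU find; the cruft-* naming is the proposal above, not something git actually produces:

    # Delete cruft packs older than the 2-week expiry window.
    # Each pack is an .idx/.pack pair, and 'cruft-*' matches both.
    find .git/objects/pack -name 'cruft-*' -type f -mtime +14 -delete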
* Re: Keeping unreachable objects in a separate pack instead of loose? 2012-06-11 21:14 ` Ted Ts'o @ 2012-06-11 21:39 ` Jeff King 2012-06-11 22:14 ` Ted Ts'o 0 siblings, 1 reply; 48+ messages in thread From: Jeff King @ 2012-06-11 21:39 UTC (permalink / raw) To: Ted Ts'o; +Cc: Thomas Rast, Hallvard B Furuseth, git, Nicolas Pitre On Mon, Jun 11, 2012 at 05:14:01PM -0400, Ted Ts'o wrote: > On Mon, Jun 11, 2012 at 02:34:14PM -0400, Jeff King wrote: > > You _could_ make a separate cruft pack for each pack that you repack. So > > if I have A.pack and B.pack, I'd pack all of the reachable objects into > > C.pack, and then make D.pack containing the unreachable objects from > > A.pack, and E.pack with the unreachable objects from B.pack. And then > > set the mtime of the cruft packs to that of their parent packs. > > > > And then the next time you pack, repacking D and E would probably be a > > no-op that preserves mtime, but might create a new pack that ejects some > > now-reachable object. > > > > To implement that, I think your --list-unreachable would just have to > > print a list of "<pack-mtime> <sha1>" pairs, and then you would pack > > each set with an identical mtime (or even a "close enough" mtime within > > some slop).... > > How about this instead? We distinguish between cruft packs and "real" > packs by the filename. So we have "cruft-<SHA1>.{idx,pack}" and > "pack-<SHA1>.{idx,pack}". > > Normally, git will look at any pack in the pack directory that has an > .idx and .pack extension, but during repack operation, it will only > look in the pack-* packs first. If it can't find an object there, it > will then fall back to trying to fetch the object from the cruft-* > packs, and if it finds the object, it copies it into the new pack > which it is creating, thus "rescuing" an object which reappears during > the expiry window. This should be a relatively rare event, and if it > happens, the object will be in two packs, a pack-* pack and a cruft-* > pack, but that's OK. You don't need to do anything magical for the object lookup process of pack-objects. By definition, the unreachable objects will not be included in the new pack you are creating, because it is only packing reachable things. So the cruft pack does not have to be a fallback at all; it is a regular pack from the object lookup perspective. The important differences from the current behavior would be:

1. In pack-objects, do not explode (or if just listing, do not list) objects from cruft packs.

2. In repack, make a new pack from the list of unreachable objects.

3. During "repack -Ad", prune cruft packs only if they are older than an expiration date.

And then you'd end up with a new cruft pack each time. You could just mark it with a ".cruft" file (similar to the existing ".keep" files), and you don't have to worry about changing the pack-* name. > So since git pack-objects isn't even looking in the cruft-* packs > except when it needs to rescue an object, the objects in the cruft-* > packs won't get copied, and we won't need to have per-object mtimes. > It also means it will go faster since it's not copying the cruft-* > packs at all, and possibly not even looking at them. Yeah. It doesn't eliminate duplicates, but that may not be worth caring about. I find the "cruft" marking a little hacky, because it is only "objects in here _may_ be cruft", but as long as that is understood, it is OK (and it is understood in the sequence above; "repack -Ad" is safe because it knows that it would have repacked any non-cruft).
You would have to be careful with "repack -d" (without the "-a"). It would not necessarily be safe to remove cruft packs, because you might not have rescued the objects. However, AFAICT "repack -d" does not currently delete packs at all, so this would be no different. > It does imply that we may accumulate a new cruft-<SHA1> pack each time > we run git gc, but users shouldn't be running git gc all that often > anyway. And even if they do run it all the time, it will still be > more efficient than keeping the unreachable objects as loose objects. Yeah, it would be nice to keep it all in a single pack, but that means doing the I/O on rewriting the cruft packs each time. And figuring out some way of handling the mtime in such a way that we don't keep refreshing the age during each gc. Speaking of which, what is the mtime of the newly created cruft pack? Is it the current mtime? Then those unreachable objects will stick for another 2 weeks, instead of being back-dated to their pack's date. You could back-date to the mtime of the most recent deleted pack, but that would still prolong the life of objects from the older packs. It may be acceptable to just ignore the issue, though; they will expire eventually. -Peff ^ permalink raw reply [flat|nested] 48+ messages in thread
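The ".cruft" marker would parallel the existing pack-<SHA1>.keep convention: a flag file next to the pack, cheap to create and cheap to test for. A sketch; the name and semantics are Peff's suggestion, not an implemented feature, and the pack name is a placeholder:

    # Hypothetical: flag a pack as containing (possibly) cruft, the way
    # pack-1234abcd.keep flags a pack that must not be deleted.
    touch .git/objects/pack/pack-1234abcd.cruft
    # "repack -Ad" would then delete marked packs past the expiry window,
    # after rescuing any objects that became reachable again.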
* Re: Keeping unreachable objects in a separate pack instead of loose? 2012-06-11 21:39 ` Jeff King @ 2012-06-11 22:14 ` Ted Ts'o 2012-06-11 22:23 ` Jeff King 0 siblings, 1 reply; 48+ messages in thread From: Ted Ts'o @ 2012-06-11 22:14 UTC (permalink / raw) To: Jeff King; +Cc: Thomas Rast, Hallvard B Furuseth, git, Nicolas Pitre On Mon, Jun 11, 2012 at 05:39:48PM -0400, Jeff King wrote: > > Yeah. It doesn't eliminate duplicates, but that may not be worth caring > about. I find the "cruft" marking a little hacky, because it is only > "objects in here _may_ be cruft", but as long as that is understood, it > is OK (and it is understood in the sequence above; "repack -Ad" is safe > because it knows that it would have repacked any non-cruft). Well, all the objects in the file *were* cruft at the time that it was created. And the reason why we are keeping them around is in case we were wrong about their being cruft, so I guess I don't have that much trouble with the name. Something like "KillShelter" (as in the opposite of No-Kill Animal Shelters) would be more descriptive, but I think it's a bit lacking in taste.... > > It does imply that we may accumulate a new cruft-<SHA1> pack each time > > we run git gc, but users shouldn't be running git gc all that often > > anyway. And even if they do run it all the time, it will still be > > more efficient than keeping the unreachable objects as loose objects. > > Yeah, it would be nice to keep it all in a single pack, but that means > doing the I/O on rewriting the cruft packs each time. And figuring out > some way of handling the mtime in such a way that we don't keep > refreshing the age during each gc. Well, I'd like to avoid doing the I/O because I want to minimize wear on SSD drives; and given that it's unlikely that the cruft packs will be referenced, the fact that we have a bunch of cruft packs shouldn't be a big deal, especially if we teach git to search the cruft packs last. > Speaking of which, what is the mtime of the newly created cruft pack? Is > it the current mtime? Then those unreachable objects will stick for > another 2 weeks, instead of being back-dated to their pack's date. You > could back-date to the mtime of the most recent deleted pack, but that > would still prolong the life of objects from the older packs. It may be > acceptable to just ignore the issue, though; they will expire > eventually. Well, we have that problem today when "git pack-objects --unpack-unreachable" explodes unreferenced objects --- they are written with the current mtime. I assume you're worried about pre-existing loose objects that get collected up into a new cruft pack, since they would get the extra two weeks of life. Given how much more efficient storing the cruft objects in a pack is, I think ignoring the issue is what makes the most amount of sense, since it's a one-time extension, and the extra objects really won't do any harm. One last thought: if a sysadmin is really hard up for space, (and if the cruft objects include some really big sound or video files) one advantage of labelling the cruft packs explicitly is that someone who really needs the space could potentially find the oldest cruft files and delete them, since they would be tagged for easy findability. - Ted ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Keeping unreachable objects in a separate pack instead of loose? 2012-06-11 22:14 ` Ted Ts'o @ 2012-06-11 22:23 ` Jeff King 2012-06-11 22:28 ` Ted Ts'o 2012-06-12 0:41 ` Nicolas Pitre 0 siblings, 2 replies; 48+ messages in thread From: Jeff King @ 2012-06-11 22:23 UTC (permalink / raw) To: Ted Ts'o; +Cc: Thomas Rast, Hallvard B Furuseth, git, Nicolas Pitre On Mon, Jun 11, 2012 at 06:14:39PM -0400, Ted Ts'o wrote: > > Speaking of which, what is the mtime of the newly created cruft pack? Is > > it the current mtime? Then those unreachable objects will stick for > > another 2 weeks, instead of being back-dated to their pack's date. You > > could back-date to the mtime of the most recent deleted pack, but that > > would still prolong the life of objects from the older packs. It may be > > acceptable to just ignore the issue, though; they will expire > > eventually. > > Well, we have that problem today when "git pack-objects > --unpack-unreachable" explodes unreferenced objects --- they are > written with the current mtime. No, we don't; they get the mtime of the pack they are coming from (and if the pack is older than pruneExpire, they are not exploded at all, since they would just be pruned immediately anyway). So an exploded object might have only a day or an hour to live after the explosion, but with your strategy they always get two weeks. > I assume you're worried about pre-existing loose objects that get > collected up into a new cruft pack, since they would get the extra two > weeks of life. Given how much more efficient storing the cruft > objects in a pack, I think ignoring the issue is what makes the most > amount of sense, since it's a one-time extension, and the extra > objects really won't do any harm. I'm more specifically worried about large objects which are no better in packs than they are in loose form (e.g., video files). This strategy is a regression, since we are not saving space by putting them in a pack, but we are keeping them around much longer. It also makes it harder to just run "git prune" to get rid of large objects (since prune will never kill off a pack), or to manually delete files from the object database. You have to run "git gc --prune=now" instead, so it can make a new pack and throw away the old bits (or run "git repack -ad"). > One last thought: if a sysadmin is really hard up for space, (and if > the cruft objects include some really big sound or video files) one > advantage of labelling the cruft packs explicitly is that someone who > really needs the space could potentially find the oldest cruft files > and delete them, since they would be tagged for easy findability. No! That's exactly what I was worried about with the name. It is _not_ safe to do so. It's only safe after you have done a full repack to rescue any non-cruft objects. -Peff ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Keeping unreachable objects in a separate pack instead of loose? 2012-06-11 22:23 ` Jeff King @ 2012-06-11 22:28 ` Ted Ts'o 2012-06-11 22:35 ` Jeff King 0 siblings, 1 reply; 48+ messages in thread From: Ted Ts'o @ 2012-06-11 22:28 UTC (permalink / raw) To: Jeff King; +Cc: Thomas Rast, Hallvard B Furuseth, git, Nicolas Pitre On Mon, Jun 11, 2012 at 06:23:08PM -0400, Jeff King wrote: > > I'm more specifically worried about large objects which are no better in > packs than they are in loose form (e.g., video files). This strategy is > a regression, since we are not saving space by putting them in a pack, > but we are keeping them around much longer. It also makes it harder to > just run "git prune" to get rid of large objects (since prune will never > kill off a pack), or to manually delete files from the object database. > You have to run "git gc --prune=now" instead, so it can make a new pack > and throw away the old bits (or run "git repack -ad"). If we're really worried about this, we could set a threshold and only pack small objects in the cruft packs. > > One last thought: if a sysadmin is really hard up for space, (and if > > the cruft objects include some really big sound or video files) one > > advantage of labelling the cruft packs explicitly is that someone who > > really needs the space could potentially find the oldest cruft files > > and delete them, since they would be tagged for easy findability. > > No! That's exactly what I was worried about with the name. It is _not_ > safe to do so. It's only safe after you have done a full repack to > rescue any non-cruft objects. Well, yes. I was thinking it would be a safe thing to do after a "git gc" didn't result in enough space savings. This would require that a git repack always rescue objects from cruft packs even if the -a/-A options are not specified, but since we're already doing a full reachability scan, that shouldn't slow down git gc much, right? - Ted ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Keeping unreachable objects in a separate pack instead of loose? 2012-06-11 22:28 ` Ted Ts'o @ 2012-06-11 22:35 ` Jeff King 0 siblings, 0 replies; 48+ messages in thread From: Jeff King @ 2012-06-11 22:35 UTC (permalink / raw) To: Ted Ts'o; +Cc: Thomas Rast, Hallvard B Furuseth, git, Nicolas Pitre On Mon, Jun 11, 2012 at 06:28:43PM -0400, Ted Ts'o wrote: > On Mon, Jun 11, 2012 at 06:23:08PM -0400, Jeff King wrote: > > > > I'm more specifically worried about large objects which are no better in > > packs than they are in loose form (e.g., video files). This strategy is > > a regression, since we are not saving space by putting them in a pack, > > but we are keeping them around much longer. It also makes it harder to > > just run "git prune" to get rid of large objects (since prune will never > > kill off a pack), or to manually delete files from the object database. > > You have to run "git gc --prune=now" instead, so it can make a new pack > > and throw away the old bits (or run "git repack -ad"). > > If we're really worried about this, we could set a threshold and only > pack small objects in the cruft packs. I think I'd be more inclined to just ignore it. It is only prolonging the lifetime of the files by a finite amount (and we are discussing dropping that finite amount anyway). And as a bonus, this strategy could potentially allow an optimization that would make large files better in this case: if we notice that a pack has _only_ unreachable objects, we can simply mark it as ".cruft" without actually repacking it. Coupled with the recent-ish code to stream large blobs directly to packs, that means a large blob which becomes unreachable would not ever be rewritten. > > No! That's exactly what I was worried about with the name. It is _not_ > > safe to do so. It's only safe after you have done a full repack to > > rescue any non-cruft objects. > > Well, yes. I was thinking it would be a safe thing to do after a "git > gc" didn't result in enough space savings. This would require that a > git repack always rescue objects from cruft packs even if the -a/-A > options are not specified, but since we're already doing a full reachability > scan, that shouldn't slow down git gc much, right? Doing "git gc" will always repack everything, IIRC. It is "git gc --auto" which will make small incremental packs. I think we do a full reachability analysis so we can prune there, but that is something I think we should stop doing. It is typically orders of magnitude slower than the incremental repack. -Peff ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Keeping unreachable objects in a separate pack instead of loose? 2012-06-11 22:23 ` Jeff King 2012-06-11 22:28 ` Ted Ts'o @ 2012-06-12 0:41 ` Nicolas Pitre 2012-06-12 17:10 ` Jeff King 1 sibling, 1 reply; 48+ messages in thread From: Nicolas Pitre @ 2012-06-12 0:41 UTC (permalink / raw) To: Jeff King; +Cc: Ted Ts'o, Thomas Rast, Hallvard B Furuseth, git On Mon, 11 Jun 2012, Jeff King wrote: > On Mon, Jun 11, 2012 at 06:14:39PM -0400, Ted Ts'o wrote: > > > One last thought: if a sysadmin is really hard up for space, (and if > > the cruft objects include some really big sound or video files) one > > advantage of labelling the cruft packs explicitly is that someone who > > really needs the space could potentially find the oldest cruft files > > and delete them, since they would be tagged for easy findability. > > No! That's exactly what I was worried about with the name. It is _not_ > safe to do so. It's only safe after you have done a full repack to > rescue any non-cruft objects. To make it "safe", the cruft packs would have to be searchable for object retrieval, but not during object creation. That nuance would affect the core code in subtle ways and I'm not sure if that would be worth it ... just for the safe handling of cruft. Nicolas ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Keeping unreachable objects in a separate pack instead of loose? 2012-06-12 0:41 ` Nicolas Pitre @ 2012-06-12 17:10 ` Jeff King 2012-06-12 17:30 ` Nicolas Pitre 0 siblings, 1 reply; 48+ messages in thread From: Jeff King @ 2012-06-12 17:10 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Ted Ts'o, Thomas Rast, Hallvard B Furuseth, git On Mon, Jun 11, 2012 at 08:41:03PM -0400, Nicolas Pitre wrote: > > On Mon, Jun 11, 2012 at 06:14:39PM -0400, Ted Ts'o wrote: > > > > > One last thought: if a sysadmin is really hard up for space, (and if > > > the cruft objects include some really big sound or video files) one > > > advantage of labelling the cruft packs explicitly is that someone who > > > really needs the space could potentially find the oldest cruft files > > > and delete them, since they would be tagged for easy findability. > > > > No! That's exactly what I was worried about with the name. It is _not_ > > safe to do so. It's only safe after you have done a full repack to > > rescue any non-cruft objects. > > To make it "safe", the cruft packs would have to be searchable for > object retrieval, but not during object creation. That nuance would > affect the core code in subtle ways and I'm not sure if that would be > worth it ... just for the safe handling of cruft. Why is that? If you do a "repack -Ad", then any referenced objects will have been retrieved and put into the new all-in-one pack. At that point, by deleting the cruft pack, you are guaranteed to be deleting only objects that are either unreferenced, or are duplicated in another pack. -Peff ^ permalink raw reply [flat|nested] 48+ messages in thread
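In command terms, the ordering Peff describes might look like the sketch below. It reuses the hypothetical ".cruft" marker from the earlier sketch and assumes such marked packs are exempt from repack's usual -d deletion pass:

    git repack -a -d    # rescue: every referenced object, even one currently
                        # sitting in a cruft pack, lands in the new all-in-one pack
    for cruft in .git/objects/pack/pack-*.cruft; do
        [ -e "$cruft" ] || continue
        base=${cruft%.cruft}
        rm -f "$cruft" "$base.pack" "$base.idx"   # removes only unreachable objects
    done                                          # or duplicates of rescued ones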
* Re: Keeping unreachable objects in a separate pack instead of loose? 2012-06-12 17:10 ` Jeff King @ 2012-06-12 17:30 ` Nicolas Pitre 2012-06-12 17:32 ` Jeff King 0 siblings, 1 reply; 48+ messages in thread From: Nicolas Pitre @ 2012-06-12 17:30 UTC (permalink / raw) To: Jeff King; +Cc: Ted Ts'o, Thomas Rast, Hallvard B Furuseth, git On Tue, 12 Jun 2012, Jeff King wrote: > On Mon, Jun 11, 2012 at 08:41:03PM -0400, Nicolas Pitre wrote: > > > > On Mon, Jun 11, 2012 at 06:14:39PM -0400, Ted Ts'o wrote: > > > > > > > One last thought: if a sysadmin is really hard up for space, (and if > > > > the cruft objects include some really big sound or video files) one > > > > advantage of labelling the cruft packs explicitly is that someone who > > > > really needs the space could potentially find the oldest cruft files > > > > and delete them, since they would be tagged for easy findability. > > > > > > No! That's exactly what I was worried about with the name. It is _not_ > > > safe to do so. It's only safe after you have done a full repack to > > > rescue any non-cruft objects. > > > > To make it "safe", the cruft packs would have to be searchable for > > object retrieval, but not during object creation. That nuance would > > affect the core code in subtle ways and I'm not sure if that would be > > worth it ... just for the safe handling of cruft. > > Why is that? If you do a "repack -Ad", then any referenced objects will > have been retrieved and put into the new all-in-one pack. At that point, > by deleting the cruft pack, you are guaranteed to be deleting only > objects that are either unreferenced, or are duplicated in another pack. Now what if you fetch and a bunch of objects are already found in your cruft pack? Right now, we search for the existence of any object before creating them, and if the cruft packs are searchable then such objects won't get uncruftified. Nicolas ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Keeping unreachable objects in a separate pack instead of loose? 2012-06-12 17:30 ` Nicolas Pitre @ 2012-06-12 17:32 ` Jeff King 2012-06-12 17:45 ` Shawn Pearce 2012-06-12 17:49 ` Nicolas Pitre 0 siblings, 2 replies; 48+ messages in thread From: Jeff King @ 2012-06-12 17:32 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Ted Ts'o, Thomas Rast, Hallvard B Furuseth, git On Tue, Jun 12, 2012 at 01:30:07PM -0400, Nicolas Pitre wrote: > > > To make it "safe", the cruft packs would have to be searchable for > > > object retrieval, but not during object creation. That nuance would > > > affect the core code in subtle ways and I'm not sure if that would be > > > worth it ... just for the safe handling of cruft. > > > > Why is that? If you do a "repack -Ad", then any referenced objects will > > have been retrieved and put into the new all-in-one pack. At that point, > > by deleting the cruft pack, you are guaranteed to be deleting only > > objects that are either unreferenced, or are duplicated in another pack. > > Now what if you fetch and a bunch of objects are already found in your > cruft pack? Right now, we search for the existence of any object before > creating them, and if the cruft packs are searchable then such objects > won't get uncruftified. Then those objects will remain in the cruft pack. Which is why, as I said, it is not generally safe to just delete a cruft pack. However, when you do a full repack, those objects will be copied into the new pack (because they are referenced). Which is why I am claiming that it is safe to remove cruft packs at that point. -Peff ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Keeping unreachable objects in a separate pack instead of loose? 2012-06-12 17:32 ` Jeff King @ 2012-06-12 17:45 ` Shawn Pearce 2012-06-12 17:50 ` Jeff King 2012-06-12 17:55 ` Nicolas Pitre 2012-06-12 17:49 ` Nicolas Pitre 1 sibling, 2 replies; 48+ messages in thread From: Shawn Pearce @ 2012-06-12 17:45 UTC (permalink / raw) To: Jeff King Cc: Nicolas Pitre, Ted Ts'o, Thomas Rast, Hallvard B Furuseth, git On Tue, Jun 12, 2012 at 10:32 AM, Jeff King <peff@peff.net> wrote: > On Tue, Jun 12, 2012 at 01:30:07PM -0400, Nicolas Pitre wrote: > >> > > To make it "safe", the cruft packs would have to be searchable for >> > > object retrieval, but not during object creation. That nuance would >> > > affect the core code in subtle ways and I'm not sure if that would be >> > > worth it ... just for the safe handling of cruft. >> > >> > Why is that? If you do a "repack -Ad", then any referenced objects will >> > have been retrieved and put into the new all-in-one pack. At that point, >> > by deleting the cruft pack, you are guaranteed to be deleting only >> > objects that are either unreferenced, or are duplicated in another pack. >> >> Now what if you fetch and a bunch of objects are already found in your >> cruft pack? Right now, we search for the existence of any object before >> creating them, and if the cruft packs are searchable then such objects >> won't get uncruftified. > > Then those objects will remain in the cruft pack. Which is why, as I > said, it is not generally safe to just delete a cruft pack. However, > when you do a full repack, those objects will be copied into the new > pack (because they are referenced). Which is why I am claiming that it > is safe to remove cruft packs at that point. But there is a race condition with a concurrent fetch and a concurrent repack. If that fetch needs those cruft objects, and sees them in the cruft pack, and the repack sees the references before the fetch, the repacker might delete things the fetch is about to reference and that will leave you with a corrupt repository. I think we already have this race condition with loose unreachable objects whose mtimes are older than 2 weeks; they are removed by prune but may have just become reachable by a concurrent fetch that doesn't overwrite them because they already exist, and doesn't update the mtime because they aren't writable. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Keeping unreachable objects in a separate pack instead of loose? 2012-06-12 17:45 ` Shawn Pearce @ 2012-06-12 17:50 ` Jeff King 2012-06-12 17:57 ` Nicolas Pitre 2012-06-12 18:43 ` Andreas Schwab 2012-06-12 17:55 ` Nicolas Pitre 1 sibling, 2 replies; 48+ messages in thread From: Jeff King @ 2012-06-12 17:50 UTC (permalink / raw) To: Shawn Pearce Cc: Nicolas Pitre, Ted Ts'o, Thomas Rast, Hallvard B Furuseth, git On Tue, Jun 12, 2012 at 10:45:22AM -0700, Shawn O. Pearce wrote: > > Then those objects will remain in the cruft pack. Which is why, as I > > said, it is not generally safe to just delete a cruft pack. However, > > when you do a full repack, those objects will be copied into the new > > pack (because they are referenced). Which is why I am claiming that it > > is safe to remove cruft packs at that point. > > But there is a race condition with a concurrent fetch and a concurrent > repack. If that fetch needs those cruft objects, and sees them in the > cruft pack, and the repack sees the references before the fetch, the > repacker might delete things the fetch is about to reference and that > will leave you with a corrupt repository. > > I think we already have this race condition with loose unreachable > objects whose mtimes are older than 2 weeks; they are removed by prune > but may have just become reachable by a concurrent fetch that doesn't > overwrite them because they already exist, and doesn't update the > mtime because they aren't writable. Correct. There is a race condition, but it is there already. I have discussed this with other GitHub folks, because we prune fairly aggressively (in our case it would be a push, not a fetch, of course). So far we have not had any record of it actually happening in practice. We could close it in both cases by tweaking the mtime of the file containing the object when we decide not to write because the object already exists. -Peff ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Keeping unreachable objects in a separate pack instead of loose? 2012-06-12 17:50 ` Jeff King @ 2012-06-12 17:57 ` Nicolas Pitre 2012-06-12 18:43 ` Andreas Schwab 1 sibling, 0 replies; 48+ messages in thread From: Nicolas Pitre @ 2012-06-12 17:57 UTC (permalink / raw) To: Jeff King Cc: Shawn Pearce, Ted Ts'o, Thomas Rast, Hallvard B Furuseth, git On Tue, 12 Jun 2012, Jeff King wrote: > On Tue, Jun 12, 2012 at 10:45:22AM -0700, Shawn O. Pearce wrote: > > > > Then those objects will remain in the cruft pack. Which is why, as I > > > said, it is not generally safe to just delete a cruft pack. However, > > > when you do a full repack, those objects will be copied into the new > > > pack (because they are referenced). Which is why I am claiming that it > > > is safe to remove cruft packs at that point. > > > > But there is a race condition with a concurrent fetch and a concurrent > > repack. If that fetch needs those cruft objects, and sees them in the > > cruft pack, and the repack sees the references before the fetch, the > > repacker might delete things the fetch is about to reference and that > > will leave you with a corrupt repository. > > > > I think we already have this race condition with loose unreachable > > objects whose mtimes are older than 2 weeks; they are removed by prune > > but may have just become reachable by a concurrent fetch that doesn't > > overwrite them because they already exist, and doesn't update the > > mtime because they aren't writable. > > Correct. There is a race condition, but it is there already. I have > discussed this with other GitHub folks, because we prune fairly > aggressively (in our case it would be a push, not a fetch, of course). > So far we have not had any record of it actually happening in practice. > > We could close it in both cases by tweaking the mtime of the file > containing the object when we decide not to write because the object > already exists. Yes, that is a worthwhile thing to do. Nicolas ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Keeping unreachable objects in a separate pack instead of loose? 2012-06-12 17:50 ` Jeff King 2012-06-12 17:57 ` Nicolas Pitre @ 2012-06-12 18:43 ` Andreas Schwab 2012-06-12 19:07 ` Jeff King 2012-06-12 19:09 ` Nicolas Pitre 1 sibling, 2 replies; 48+ messages in thread From: Andreas Schwab @ 2012-06-12 18:43 UTC (permalink / raw) To: Jeff King Cc: Shawn Pearce, Nicolas Pitre, Ted Ts'o, Thomas Rast, Hallvard B Furuseth, git Jeff King <peff@peff.net> writes: > We could close it in both cases by tweaking the mtime of the file > containing the object when we decide not to write because the object > already exists. Though there is always the window between the existence check and the mtime update where pruning can hit you. Andreas. -- Andreas Schwab, schwab@linux-m68k.org GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different." ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Keeping unreachable objects in a separate pack instead of loose? 2012-06-12 18:43 ` Andreas Schwab @ 2012-06-12 19:07 ` Jeff King 2012-06-12 19:09 ` Nicolas Pitre 1 sibling, 0 replies; 48+ messages in thread From: Jeff King @ 2012-06-12 19:07 UTC (permalink / raw) To: Andreas Schwab Cc: Shawn Pearce, Nicolas Pitre, Ted Ts'o, Thomas Rast, Hallvard B Furuseth, git On Tue, Jun 12, 2012 at 08:43:41PM +0200, Andreas Schwab wrote: > Jeff King <peff@peff.net> writes: > > > We could close it in both cases by tweaking the mtime of the file > > containing the object when we decide not to write because the object > > already exists. > > Though there is always the window between the existence check and the > mtime update where pruning can hit you. For the loose object case, you could do them both atomically by calling utime() on the object, and considering the object to exist only if it succeeds. Doing it safely for packs would be harder, though; I think you'd have to bump the mtime forward, do the search, and then bump it back. You might err by causing a pack not to be pruned, but that is better than the opposite. Unfortunately it gets trickier with network transfers. If somebody is pushing to your repository, you might tell them you have some set of objects, then they prepare a pack based on that assumption (which might take minutes or hours to transfer), and then finally at the end you find that you actually need the objects in question. Of course, that race is even harder to trigger, because we do not advertise unreachable objects. So you would have to have a sequence where the objects are reachable, the client connects and receives your ref advertisement, then the objects become unreachable (e.g., due to a simultaneous non-ff push or deletion), and you do a prune in that interval which removes the objects. Unlikely, but still possible. -Peff ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Keeping unreachable objects in a separate pack instead of loose? 2012-06-12 18:43 ` Andreas Schwab 2012-06-12 19:07 ` Jeff King @ 2012-06-12 19:09 ` Nicolas Pitre 2012-06-12 19:23 ` Jeff King 1 sibling, 1 reply; 48+ messages in thread From: Nicolas Pitre @ 2012-06-12 19:09 UTC (permalink / raw) To: Andreas Schwab Cc: Jeff King, Shawn Pearce, Ted Ts'o, Thomas Rast, Hallvard B Furuseth, git On Tue, 12 Jun 2012, Andreas Schwab wrote: > Jeff King <peff@peff.net> writes: > > > We could close it in both cases by tweaking the mtime of the file > > containing the object when we decide not to write because the object > > already exists. > > Though there is always the window between the existence check and the > mtime update where pruning can hit you. This is a tiny window compared to 2 weeks. This could be avoided entirely with: 1. check presence of object X 2. update its mtime 3. check presence of object X again 4. create if doesn't exist Nicolas ^ permalink raw reply [flat|nested] 48+ messages in thread
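Nicolas's four steps, sketched in shell terms; the loose-object path uses bash substring syntax, and write_loose_object is a hypothetical stand-in for git's actual C code path:

    path=".git/objects/${sha:0:2}/${sha:2}"
    if [ -e "$path" ]; then           # 1. check presence of object X
        touch "$path" 2>/dev/null     # 2. update its mtime
        [ -e "$path" ] && exit 0      # 3. check presence again; a prune that won
    fi                                #    the race would have removed it by now
    write_loose_object "$sha"         # 4. create if it doesn't exist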
* Re: Keeping unreachable objects in a separate pack instead of loose? 2012-06-12 19:09 ` Nicolas Pitre @ 2012-06-12 19:23 ` Jeff King 2012-06-12 19:39 ` Nicolas Pitre 0 siblings, 1 reply; 48+ messages in thread From: Jeff King @ 2012-06-12 19:23 UTC (permalink / raw) To: Nicolas Pitre Cc: Andreas Schwab, Shawn Pearce, Ted Ts'o, Thomas Rast, Hallvard B Furuseth, git On Tue, Jun 12, 2012 at 03:09:25PM -0400, Nicolas Pitre wrote: > > Jeff King <peff@peff.net> writes: > > > > > We could close it in both cases by tweaking the mtime of the file > > > containing the object when we decide not to write because the object > > > already exists. > > > > Though there is always the window between the existence check and the > > mtime update where pruning can hit you. > > This is a tiny window compared to 2 weeks. I don't think the race window is actually 2 weeks long. If you have this sequence: 1. object X becomes unreferenced 2. 1 week later, you create a new ref that mentions X 3. >2 weeks later, you run "git prune --expire=2.weeks.ago" we will not consider the object for pruning in step 3, because it is reachable. The race is more like: 1. object X becomes unreferenced 2. >2 weeks later, you run "git prune --expire=2.weeks.ago" 3. git-prune reads the list of refs 4. simultaneous to the git-prune, you reference X 5. git-prune removes X 6. your reference is now broken So the race window depends on the time it takes "git prune" to run. I wonder if git-prune could do a double-check of the refs. Something like: 1. calculate reachability on all refs 2. read list of objects to prune, and make a list of unreachable ones 3. calculate reachability again (which should be very cheap, because you can stop when you get to an object you have already seen) 4. Drop any objects found in (3) from the list in (2), and delete items from your list But I think that still has a race where objects are created before step 2, but are not actually referenced until after step 3. I think doing it safely may actually require a repo-wide prune lock. -Peff ^ permalink raw reply [flat|nested] 48+ messages in thread
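A plumbing-level sketch of the double-read idea; as noted above, it is still racy without a repo-wide lock, since a new reference can appear between the two reads:

    git prune --dry-run --expire=2.weeks.ago | cut -d' ' -f1 | sort >candidates
    git rev-list --objects --all | cut -d' ' -f1 | sort -u >reachable
    comm -23 candidates reachable   # prune candidates minus anything that became
                                    # reachable between the scan and the re-read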
* Re: Keeping unreachable objects in a separate pack instead of loose? 2012-06-12 19:23 ` Jeff King @ 2012-06-12 19:39 ` Nicolas Pitre 2012-06-12 19:41 ` Jeff King 0 siblings, 1 reply; 48+ messages in thread From: Nicolas Pitre @ 2012-06-12 19:39 UTC (permalink / raw) To: Jeff King Cc: Andreas Schwab, Shawn Pearce, Ted Ts'o, Thomas Rast, Hallvard B Furuseth, git On Tue, 12 Jun 2012, Jeff King wrote: > So the race window depends on the time it takes "git prune" to run. > > I wonder if git-prune could do a double-check of the refs. Something > like: > > 1. calculate reachability on all refs > > 2. read list of objects to prune, and make a list of unreachable ones > > 3. calculate reachability again (which should be very cheap, because > you can stop when you get to an object you have already seen) > > 4. Drop any objects found in (3) from the list in (2), and delete > items from your list > > But I think that still has a race where objects are created before > step 2, but are not actually referenced until after step 3. I think > doing it safely may actually require a repo-wide prune lock. Yeah... that's what I was thinking too. Maybe we're making our life overly miserable by trying to avoid any locking here. Nicolas ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Keeping unreachable objects in a separate pack instead of loose? 2012-06-12 19:39 ` Nicolas Pitre @ 2012-06-12 19:41 ` Jeff King 0 siblings, 0 replies; 48+ messages in thread From: Jeff King @ 2012-06-12 19:41 UTC (permalink / raw) To: Nicolas Pitre Cc: Andreas Schwab, Shawn Pearce, Ted Ts'o, Thomas Rast, Hallvard B Furuseth, git On Tue, Jun 12, 2012 at 03:39:05PM -0400, Nicolas Pitre wrote: > > So the race window depends on the time it takes "git prune" to run. > > > > I wonder if git-prune could do a double-check of the refs. Something > > like: > > > > 1. calculate reachability on all refs > > > > 2. read list of objects to prune, and make a list of unreachable ones > > > > 3. calculate reachability again (which should be very cheap, because > > you can stop when you get to an object you have already seen) > > > > 4. Drop any objects found in (3) from the list in (2), and delete > > items from your list > > > > But I think that still has a race where objects are created before > > step 2, but are not actually referenced until after step 3. I think > > doing it safely may actually require a repo-wide prune lock. > > Yeah... that's what I was thinking too. Maybe we're making our life > overly miserable by trying to avoid any locking here. I think I would be OK with "prune" locking, as long as everything else was able to happen simultaneously. Especially if we can keep prune's lock as short as possible through double-reads or similar tricks (like we do for ref updates). -Peff ^ permalink raw reply [flat|nested] 48+ messages in thread
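A repo-wide prune lock of the sort discussed could be as simple as this flock(1) sketch (the lock-file name is made up, and git's own locking is implemented in C rather than with flock):

    (
        flock -x 9 || exit 1               # take the exclusive prune lock...
        git prune --expire=2.weeks.ago     # ...and hold it only while pruning
    ) 9>.git/objects/prune.lock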
* Re: Keeping unreachable objects in a separate pack instead of loose? 2012-06-12 17:45 ` Shawn Pearce 2012-06-12 17:50 ` Jeff King @ 2012-06-12 17:55 ` Nicolas Pitre 1 sibling, 0 replies; 48+ messages in thread From: Nicolas Pitre @ 2012-06-12 17:55 UTC (permalink / raw) To: Shawn Pearce Cc: Jeff King, Ted Ts'o, Thomas Rast, Hallvard B Furuseth, git On Tue, 12 Jun 2012, Shawn Pearce wrote: > On Tue, Jun 12, 2012 at 10:32 AM, Jeff King <peff@peff.net> wrote: > > On Tue, Jun 12, 2012 at 01:30:07PM -0400, Nicolas Pitre wrote: > > > >> > > To make it "safe", the cruft packs would have to be searchable for > >> > > object retrieval, but not during object creation. That nuance would > >> > > affect the core code in subtle ways and I'm not sure if that would be > >> > > worth it ... just for the safe handling of cruft. > >> > > >> > Why is that? If you do a "repack -Ad", then any referenced objects will > >> > have been retrieved and put into the new all-in-one pack. At that point, > >> > by deleting the cruft pack, you are guaranteed to be deleting only > >> > objects that are either unreferenced, or are duplicated in another pack. > >> > >> Now what if you fetch and a bunch of objects are already found in your > >> cruft pack? Right now, we search for the existence of any object before > >> creating them, and if the cruft packs are searchable then such objects > >> won't get uncruftified. > > > > Then those objects will remain in the cruft pack. Which is why, as I > > said, it is not generally safe to just delete a cruft pack. However, > > when you do a full repack, those objects will be copied into the new > > pack (because they are referenced). Which is why I am claiming that it > > is safe to remove cruft packs at that point. > > But there is a race condition with a concurrent fetch and a concurrent > repack. If that fetch needs those cruft objects, and sees them in the > cruft pack, and the repack sees the references before the fetch, the > repacker might delete things the fetch is about to reference and that > will leave you with a corrupt repository. > > I think we already have this race condition with loose unreachable > objects whose mtimes are older than 2 weeks; they are removed by prune > but may have just become reachable by a concurrent fetch that doesn't > overwrite them because they already exist, and doesn't update the > mtime because they aren't writable. Splat! Nicolas ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Keeping unreachable objects in a separate pack instead of loose? 2012-06-12 17:32 ` Jeff King 2012-06-12 17:45 ` Shawn Pearce @ 2012-06-12 17:49 ` Nicolas Pitre 2012-06-12 17:54 ` Jeff King 1 sibling, 1 reply; 48+ messages in thread From: Nicolas Pitre @ 2012-06-12 17:49 UTC (permalink / raw) To: Jeff King; +Cc: Ted Ts'o, Thomas Rast, Hallvard B Furuseth, git On Tue, 12 Jun 2012, Jeff King wrote: > On Tue, Jun 12, 2012 at 01:30:07PM -0400, Nicolas Pitre wrote: > > > > > To make it "safe", the cruft packs would have to be searchable for > > > > object retrieval, but not during object creation. That nuance would > > > > affect the core code in subtle ways and I'm not sure if that would be > > > > worth it ... just for the safe handling of cruft. > > > > > > Why is that? If you do a "repack -Ad", then any referenced objects will > > > have been retrieved and put into the new all-in-one pack. At that point, > > > by deleting the cruft pack, you are guaranteed to be deleting only > > > objects that are either unreferenced, or are duplicated in another pack. > > > > Now what if you fetch and a bunch of objects are already found in your > > cruft pack? Right now, we search for the existence of any object before > > creating them, and if the cruft packs are searchable then such objects > > won't get uncruftified. > > Then those objects will remain in the cruft pack. Which is why, as I > said, it is not generally safe to just delete a cruft pack. ... and my reply was about the needed changes to still make cruft packs always crufty even if some of its content suddenly becomes useful again. > However, when you do a full repack, those objects will be copied into > the new pack (because they are referenced). Which is why I am claiming > that it is safe to remove cruft packs at that point. Yes, but then there is no point marking such packs as cruft if at any moment they can become useful again. Nicolas ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Keeping unreachable objects in a separate pack instead of loose? 2012-06-12 17:49 ` Nicolas Pitre @ 2012-06-12 17:54 ` Jeff King 2012-06-12 18:25 ` Nicolas Pitre 0 siblings, 1 reply; 48+ messages in thread From: Jeff King @ 2012-06-12 17:54 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Ted Ts'o, Thomas Rast, Hallvard B Furuseth, git On Tue, Jun 12, 2012 at 01:49:26PM -0400, Nicolas Pitre wrote: > > Then those objects will remain in the cruft pack. Which is why, as I > > said, it is not generally safe to just delete a cruft pack. > > ... and my reply was about the needed changes to still make cruft packs > always crufty even if some of its content suddenly becomes useful again. I think we are somehow missing each other's point, then. My point is that you do not _need_ to make the cruft packs 100% cruft. You can tolerate the duplicated objects until they are pruned. Earlier in the thread, I outlined another scheme by which you could repack and avoid the duplicates. It does not require changes to git's object lookup process, because it would involve manually feeding the list of cruft objects to pack-objects (which will pack what you ask it, regardless of whether the objects are in other packs). > > However, when you do a full repack, those objects will be copied into > > the new pack (because they are referenced). Which is why I am claiming > > that it is safe to remove cruft packs at that point. > > Yes, but then there is no point marking such packs as cruft if at any > moment they can become useful again. How do you know to keep the packs around and expire them after 2 weeks if they are not marked in some way? Otherwise you would delete them as part of a "git gc", pushing the reachable objects into the new pack and the unreachable objects into a new cruft pack. IOW, you need some way of keeping the expiration date on the unreachable objects, or they will keep getting "refreshed" by each gc. -Peff ^ permalink raw reply [flat|nested] 48+ messages in thread
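The "expiration date" here can simply be the cruft pack's own mtime, which is exactly why each gc has to leave existing cruft packs alone instead of rewriting them. A sketch, again using the hypothetical ".cruft" marker whose mtime is set when the pack is first marked:

    # list cruft packs whose objects have been expendable for over two weeks;
    # -mtime +14 only works if gc never refreshes these marker files
    find .git/objects/pack -name 'pack-*.cruft' -mtime +14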
* Re: Keeping unreachable objects in a separate pack instead of loose? 2012-06-12 17:54 ` Jeff King @ 2012-06-12 18:25 ` Nicolas Pitre 2012-06-12 18:37 ` Ted Ts'o 2012-06-12 19:15 ` Jeff King 0 siblings, 2 replies; 48+ messages in thread From: Nicolas Pitre @ 2012-06-12 18:25 UTC (permalink / raw) To: Jeff King; +Cc: Ted Ts'o, Thomas Rast, Hallvard B Furuseth, git On Tue, 12 Jun 2012, Jeff King wrote: > On Tue, Jun 12, 2012 at 01:49:26PM -0400, Nicolas Pitre wrote: > > > > Then those objects will remain in the cruft pack. Which is why, as I > > > said, it is not generally safe to just delete a cruft pack. > > > > ... and my reply was about the needed changes to still make cruft packs > > always crufty even if some of its content suddenly becomes useful again. > > I think we are somehow missing each other's point, then. My point is > that you do not _need_ to make the cruft packs 100% cruft. You can > tolerate the duplicated objects until they are pruned. Absolutely. Duplicated objects are fine, and I was in fact suggesting to actively duplicate any needed object when it is to be found in a cruft pack only. > Earlier in the thread, I outlined another scheme by which you could > repack and avoid the duplicates. It does not require changes to git's > object lookup process, because it would involve manually feeding the > list of cruft objects to pack-objects (which will pack what you ask it, > regardless of whether the objects are in other packs). That might be hard to achieve good delta compression though, as the main key to sort those objects is their path name, and with unreferenced objects you might not necessarily have that information. The ability to reuse pack data might mitigate this though. > > > However, when you do a full repack, those objects will be copied into > > > the new pack (because they are referenced). Which is why I am claiming > > > that it is safe to remove cruft packs at that point. > > > > Yes, but then there is no point marking such packs as cruft if at any > > moment they can become useful again. > > How do you know to keep the packs around and expire them after 2 weeks > if they are not marked in some way? Otherwise you would delete them as > part of a "git gc", pushing the reachable objects into the new pack and > the unreachable objects into a new cruft pack. IOW, you need some way of > keeping the expiration date on the unreachable objects, or they will > keep getting "refreshed" by each gc. My feeling is that we should make a step backward and consider if this is actually the right problem to solve. I don't remember why I might have been opposed to a reflog for deleted branches as you say I did, but that is certainly a feature that could prove to be useful. Then having a repository that can be used as an alternate for other repositories without knowing about it is also a problem that needs fixing and not only because of this object expiry issue. This is not easy to fix though. Then, the creation of unreferenced objects from successive 'git add' shouldn't create that many objects in the first place. They currently never get the chance to be packed to start with. So the problem is really about 'git gc' creating more data on disk which is counter productive for a garbage collecting task. Maybe the trick is simply not to delete any of the old pack which content was repacked into a single new pack and let them age before deleting them, rather than exploding a bunch of loose objects. But then we're back to the same issue I wanted to get away from i.e. identifying real cruft packs and making them safely deletable. Oh well... Nicolas ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Keeping unreachable objects in a separate pack instead of loose? 2012-06-12 18:25 ` Nicolas Pitre @ 2012-06-12 18:37 ` Ted Ts'o 2012-06-12 19:15 ` Nicolas Pitre 0 siblings, 1 reply; 48+ messages in thread From: Ted Ts'o @ 2012-06-12 18:37 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Jeff King, Thomas Rast, Hallvard B Furuseth, git On Tue, Jun 12, 2012 at 02:25:47PM -0400, Nicolas Pitre wrote: > > Earlier in the thread, I outlined another scheme by which you could > > repack and avoid the duplicates. It does not require changes to git's > > object lookup process, because it would involve manually feeding the > > list of cruft objects to pack-objects (which will pack what you ask it, > > regardless of whether the objects are in other packs). > > That might be hard to achieve good delta compression though, as the main > key to sort those objects is their path name, and with unreferenced > objects you might not necessarily have that information. The ability to > reuse pack data might mitigate this though. Compared to loose objects, even not-so-great delta compression is manna from heaven. Remember what originally got me to start this thread. There were 4.5 megabytes worth of loose objects; when I created the object id list and fed the result to git pack-objects, the resulting pack was 244k. OK, maybe the delta compression wasn't optimal. Compared to the 4.5 megabytes of loose objects --- I'll happily settle for that! :-) > So the problem is really about 'git gc' creating more data on disk which > is counter productive for a garbage collecting task. Maybe the trick is > simply not to delete any of the old pack which content was repacked into > a single new pack and let them age before deleting them, rather than > exploding a bunch of loose objects. But then we're back to the same > issue I wanted to get away from i.e. identifying real cruft packs and > making them safely deletable. But the old packs are huge; in my case, a full set of packs was around 16 megabytes. Right now, git gc *increased* my disk usage by 4.5 megabytes. If we don't delete the old packs, then git gc would increase disk usage by 16 megabytes --- which is far, far worse. Writing a 244k cruft pack is soooooo much preferable. - Ted ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Keeping unreachable objects in a separate pack instead of loose? 2012-06-12 18:37 ` Ted Ts'o @ 2012-06-12 19:15 ` Nicolas Pitre 2012-06-12 19:19 ` Ted Ts'o 0 siblings, 1 reply; 48+ messages in thread From: Nicolas Pitre @ 2012-06-12 19:15 UTC (permalink / raw) To: Ted Ts'o; +Cc: Jeff King, Thomas Rast, Hallvard B Furuseth, git On Tue, 12 Jun 2012, Ted Ts'o wrote: > On Tue, Jun 12, 2012 at 02:25:47PM -0400, Nicolas Pitre wrote: > > > Earlier in the thread, I outlined another scheme by which you could > > > repack and avoid the duplicates. It does not require changes to git's > > > object lookup process, because it would involve manually feeding the > > > list of cruft objects to pack-objects (which will pack what you ask it, > > > regardless of whether the objects are in other packs). > > > > That might be hard to achieve good delta compression though, as the main > > key to sort those objects is their path name, and with unreferenced > > objects you might not necessarily have that information. The ability to > > reuse pack data might mitigate this though. > > Compared to loose objects, even not-so-great delta compression is > manna from heaven. Remember what originally got me to start this > thread. There were 4.5 megabytes worth of loose objects; when I > created the object id list and fed the result to git pack-objects, the > resulting pack was 244k. > > OK, maybe the delta compression wasn't optimal. Compared to the 4.5 > megabytes of loose objects --- I'll happily settle for that! :-) Sure. However, I would be even happier if we could delete those unneeded objects outright. The official reason why they're there for two weeks should be to avoid some race conditions, and in this case two weeks is way over the top, as in "normal" conditions the actual window for a race is on the order of a few seconds. Any other use case should be considered abusive. > > So the problem is really about 'git gc' creating more data on disk which > > is counter productive for a garbage collecting task. Maybe the trick is > > simply not to delete any of the old pack which content was repacked into > > a single new pack and let them age before deleting them, rather than > > exploding a bunch of loose objects. But then we're back to the same > > issue I wanted to get away from i.e. identifying real cruft packs and > > making them safely deletable. > > But the old packs are huge; in my case, a full set of packs was around > 16 megabytes. Right now, git gc *increased* my disk usage by 4.5 > megabytes. If we don't delete the old packs, then git gc would > increase disk usage by 16 megabytes --- which is far, far worse. > > Writing a 244k cruft pack is soooooo much preferable. But as you might have noticed, there are a bunch of semantic problems with that as well. Nicolas ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Keeping unreachable objects in a separate pack instead of loose? 2012-06-12 19:15 ` Nicolas Pitre @ 2012-06-12 19:19 ` Ted Ts'o 2012-06-12 19:35 ` Nicolas Pitre 0 siblings, 1 reply; 48+ messages in thread From: Ted Ts'o @ 2012-06-12 19:19 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Jeff King, Thomas Rast, Hallvard B Furuseth, git On Tue, Jun 12, 2012 at 03:15:46PM -0400, Nicolas Pitre wrote: > > But the old packs are huge; in my case, a full set of packs was around > > 16 megabytes. Right now, git gc *increased* my disk usage by 4.5 > > megabytes. If we don't delete the old packs, then git gc would > > increase disk usage by 16 megabytes --- which is far, far worse. > > > > Writing a 244k cruft pack is soooooo much preferable. > > But as you might have noticed, there are a bunch of semantic problems > with that as well. I've proposed something (explicitly labelled cruft packs) which is no worse than before. The one potential problem is that objects in the cruft pack might have their lifespan extended by two weeks (or whatever the expire timeout might be), but Peff has agreed that it's simple enough to ignore that, since the benefits far outweigh the potential that some objects in cruft packs will get to live a bit longer. The race condition you've pointed out exists today, with the git prune racing against the git fetch. - Ted ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Keeping unreachable objects in a separate pack instead of loose? 2012-06-12 19:19 ` Ted Ts'o @ 2012-06-12 19:35 ` Nicolas Pitre 2012-06-12 19:43 ` Ted Ts'o 0 siblings, 1 reply; 48+ messages in thread From: Nicolas Pitre @ 2012-06-12 19:35 UTC (permalink / raw) To: Ted Ts'o; +Cc: Jeff King, Thomas Rast, Hallvard B Furuseth, git On Tue, 12 Jun 2012, Ted Ts'o wrote: > On Tue, Jun 12, 2012 at 03:15:46PM -0400, Nicolas Pitre wrote: > > > But the old packs are huge; in my case, a full set of packs was around > > > 16 megabytes. Right now, git gc *increased* my disk usage by 4.5 > > > megabytes. If we don't delete the old packs, then git gc would > > > increase disk usage by 16 megabytes --- which is far, far worse. > > > > > > Writing a 244k cruft pack is soooooo much preferable. > > > > But as you might have noticed, there are a bunch of semantic problems > > with that as well. > > I've proposed something (explicitly labelled cruft packs) which is no > worse than before. The one potential problem is that objects in the > cruft pack might have their lifespan extended by two weeks (or > whatever the expire timeout might be), but Peff has agreed that it's > simple enough to ignore that, since the benefits far outweigh the > potential that some objects in cruft packs will get to live a bit > longer. > > The race condition you've pointed out exists today, with the git prune > racing against the git fetch. Yes, however that race is trivial to fix when loose objects are used. Nicolas ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Keeping unreachable objects in a separate pack instead of loose? 2012-06-12 19:35 ` Nicolas Pitre @ 2012-06-12 19:43 ` Ted Ts'o 0 siblings, 0 replies; 48+ messages in thread From: Ted Ts'o @ 2012-06-12 19:43 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Jeff King, Thomas Rast, Hallvard B Furuseth, git On Tue, Jun 12, 2012 at 03:35:17PM -0400, Nicolas Pitre wrote: > > > > The race condition you've pointed out exists today, with the git prune > > racing against the git fetch. > > Yes, however that race is trivial to fix when loose objects are used. We can use the same trivial fix with the cruft pack, by touching the mtime of the pack. Yes, it will extend the lifetime of the cruft pack by another 2 weeks, but again, let me remind you: 244k versus 4.5 megabytes. I can live with an extra 244k hanging around a wee bit longer. :-) There are other fixes we could do involving flock() and removing the cruft label once we add a reference to a cruft pack, which I don't think would be that complicated. - Ted ^ permalink raw reply [flat|nested] 48+ messages in thread
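Both of Ted's suggestions are cheap to express; the pack name and the ".cruft" marker below are hypothetical:

    touch .git/objects/pack/pack-1234.pack    # extend the whole pack's grace period
    rm -f .git/objects/pack/pack-1234.cruft   # "uncruftify" the pack once one of
                                              # its objects is referenced again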
* Re: Keeping unreachable objects in a separate pack instead of loose? 2012-06-12 18:25 ` Nicolas Pitre 2012-06-12 18:37 ` Ted Ts'o @ 2012-06-12 19:15 ` Jeff King 2012-06-13 18:17 ` Martin Fick 1 sibling, 1 reply; 48+ messages in thread From: Jeff King @ 2012-06-12 19:15 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Ted Ts'o, Thomas Rast, Hallvard B Furuseth, git On Tue, Jun 12, 2012 at 02:25:47PM -0400, Nicolas Pitre wrote: > My feeling is that we should make a step backward and consider if this > is actually the right problem to solve. I don't remember why I might > have been opposed to a reflog for deleted branches as you say I did, but > that is certainly a feature that could prove to be useful. I think your argument was along the lines of "this information can be reconstructed from the HEAD reflog, anyway, so it is not worth the effort". My counter to that is that the HEAD reflog is useless on bare repositories (I have considered adding each pushed ref to a HEAD-like reflog with everything in it, but doing it without lock contention between pushes to different refs is tricky). But keep in mind that a deletion reflog does not make this problem go away. It might make it less likely, but there are still cases where the gc can create a much larger object db. > Then having a repository that can be used as an alternate for other > repositories without knowing about it is also a problem that needs > fixing and not only because of this object expiry issue. This is not > easy to fix though. Yeah, I think that is an open problem, because you do not necessarily have any write access at all to the alternates repository (however, that does not need to stop us from making it safer in the case that you _do_ have write access to the alternates repository). > Then, the creation of unreferenced objects from successive 'git add' > shouldn't create that many objects in the first place. They currently > never get the chance to be packed to start with. I don't think these objects are necessarily from successive "git add"s. That is one source, but they may also come from reflogs expiring. I guess in that case that they would typically be in an older pack, though. > So the problem is really about 'git gc' creating more data on disk which > is counter productive for a garbage collecting task. Maybe the trick is > simply not to delete any of the old pack which content was repacked into > a single new pack and let them age before deleting them, rather than > exploding a bunch of loose objects. But then we're back to the same > issue I wanted to get away from i.e. identifying real cruft packs and > making them safely deletable. That is satisfyingly simple, but the storage requirement is quite bad. The unreachable objects are very much in the minority, and an occasional duplication there is not a big deal; duplicating all of the reachable objects would double the object directory's size. -Peff ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Keeping unreachable objects in a separate pack instead of loose? 2012-06-12 19:15 ` Jeff King @ 2012-06-13 18:17 ` Martin Fick 2012-06-13 21:27 ` Johan Herland 0 siblings, 1 reply; 48+ messages in thread From: Martin Fick @ 2012-06-13 18:17 UTC (permalink / raw) To: git Jeff King <peff@peff.net> writes: > > Then, the creation of unreferenced objects from successive 'git add' > > shouldn't create that many objects in the first place. They currently > > never get the chance to be packed to start with. > > I don't think these objects are necessarily from successive "git add"s. > That is one source, but they may also come from reflogs expiring. I > guess in that case that they would typically be in an older pack, > though. ... > That is satisfyingly simple, but the storage requirement is quite bad. > The unreachable objects are very much in the minority, and an > occasional duplication there is not a big deal; duplicating all of the > reachable objects would double the object directory's size. ... (I don't think this is a valid generalization for servers) I am sorry to be coming a bit late into this discussion, but I think there is an even worse use case, one which does not seem to have been mentioned yet and which can cause much worse loose object explosions: "the server upload rejected case". For example, think of a client pushing a change from the wrong repository to a server. Since there will be no history in common, the client will push the entire repository, and if for some reason this gets rejected by the server (perhaps a pre-receive hook, or a gerrit server which says: "way too many new changes..."), then the pack file may stay abandoned on the server. When gc runs: boom, the entire history of that other project will explode but not get pruned, since the pack file may be fairly new! I believe that this has happened to us several times fairly recently. We have a tiny project which some people keep confusing with the kernel and they push a change destined for the kernel to it. Gerrit rejects it and their massive packfile (larger than the entire project) stays around. If gc runs, it almost becomes a DOS for us: the sheer number of loose object files makes the system crawl when accessing that repo, even on an SSD. We have been talking about moving to NFS soon (with packfiles git should still perform fairly well on NFS), but this explosion really scares me. It seems like the current design is a DOS just waiting to happen for servers. While I would love to eliminate the races discussed in this thread, I think I agree with Ted in that the first fix should just focus on never expanding objects into loose form for pruning (if certain objects simply don't do well in pack files and the local gc policy says they should be loose, go ahead: expand them, but that should be unrelated to pruning). People can DOS a server with unused packfiles too, but that will rarely have the same impact that loose objects would have. -Martin -- Employee of Qualcomm Innovation Center, Inc. which is a member of Code Aurora Forum ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Keeping unreachable objects in a separate pack instead of loose? 2012-06-13 18:17 ` Martin Fick @ 2012-06-13 21:27 ` Johan Herland 0 siblings, 0 replies; 48+ messages in thread From: Johan Herland @ 2012-06-13 21:27 UTC (permalink / raw) To: Martin Fick; +Cc: git On Wed, Jun 13, 2012 at 8:17 PM, Martin Fick <mfick@codeaurora.org> wrote: > Jeff King <peff@peff.net> writes: >> > Then, the creation of unreferenced objects from successive 'git add' >> > shouldn't create that many objects in the first place. They currently >> > never get the chance to be packed to start with. >> >> I don't think these objects are necessarily from successive "git add"s. >> That is one source, but they may also come from reflogs expiring. I >> guess in that case that they would typically be in an older pack, >> though. > ... >> That is satisfyingly simple, but the storage requirement is quite bad. >> The unreachable objects are very much in the minority, and an >> occasional duplication there is not a big deal; duplicating all of the >> reachable objects would double the object directory's size. > ... > (I don't think this is a valid generalization for servers) > > I am sorry to be coming a bit late into this discussion, but I think there > is an even worse use case, one which does not seem to have been mentioned > yet and which can cause much worse loose object explosions: "the > server upload rejected case". For example, think of a client pushing a > change from the wrong repository to a server. Since there will be no > history in common, the client will push the entire repository, and if for > some reason this gets rejected by the server (perhaps a pre-receive > hook, or a gerrit server which says: "way too many new changes..."), > then the pack file may stay abandoned on the server. When gc runs: > boom, the entire history of that other project will explode but not get > pruned, since the pack file may be fairly new! [...] Just a +1 from me. We had the same problem at my former $dayjob, and worked around it by running a "git gc --prune=now" in the server repo (which is a command you'd rather not run in a server repo) to remove exploded loose objects. ...Johan -- Johan Herland, <johan@herland.net> www.herland.net ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Keeping unreachable objects in a separate pack instead of loose? 2012-06-10 12:31 Keeping unreachable objects in a separate pack instead of loose? Theodore Ts'o 2012-06-10 23:24 ` Hallvard B Furuseth @ 2012-06-11 15:40 ` Junio C Hamano 1 sibling, 0 replies; 48+ messages in thread From: Junio C Hamano @ 2012-06-11 15:40 UTC (permalink / raw) To: Theodore Ts'o; +Cc: git "Theodore Ts'o" <tytso@mit.edu> writes: > Which got me thinking.... the whole point of leaving the objects loose > is to make it easier to expire them, right? But given how expensive it > is to have loose objects lying around, why not: > > a) Have git-pack-objects have an option which writes the unreachable > objects into a separate pack file, instead of kicking them loose? > > b) Have git-prune delete a pack only if *all* of the objects in the > pack meet the expiry deadline? > > What would be the downsides of pursueing such a strategy? Is it worth > trying to implement as proof-of-concept? I do not offhand see a downside; as a matter of fact, there has already been the first-step change in that direction in v1.7.10.2 and newer that avoids exploding the unreachable ones into loose objects if we know they are going to be immediately pruned. ^ permalink raw reply [flat|nested] 48+ messages in thread