Re: [RFC] Design for http-pull on repo with packs


From: Dan Holmsand <holmsand@gmail.com>
To: Daniel Barkalow <barkalow@iabervon.org>
Cc: git@vger.kernel.org
Subject: Re: [RFC] Design for http-pull on repo with packs
Date: Sun, 10 Jul 2005 23:39:11 +0200
Message-ID: <42D1957F.1050609@gmail.com>
In-Reply-To: <Pine.LNX.4.21.0507101557510.30848-100000@iabervon.org>

Daniel Barkalow wrote:
> On Sun, 10 Jul 2005, Dan Holmsand wrote:
>>Daniel Barkalow wrote:
>>> If an individual file is not available, figure out what packs are
>>>  available:
>>>
>>>   Get the list of pack files the repository has
>>>    (currently, I just use "e3117bbaf6a59cb53c3f6f0d9b17b9433f0e4135")
>>>   For any packs we don't have, get the index files.
>>
>>This part might be slightly expensive, for large repositories. If one 
>>assumes that packs are named as by git-repack-script, however, one might 
>>cache indexes we've already seen (again, see below). Or, if you go for 
>>the mandatory "pack-index-file", require that it has a reliable order, 
>>so that you can get the last added index first.
> 
> 
> Nothing bad happens if you have index files for pack files you don't have,
> as it turns out; the library ignores them. So we can keep the index files
> around so we can quickly check if they have the objects we want. That way,
> we don't have to worry about skipping something now (because it's not
> needed) and then ignoring it when the branch gets merged in.
> 
> So what I actually do is make a list of the pack files that aren't already
> downloaded that are available from the server, and download the index
> files for any where the index file isn't downloaded, either.

Aah. In other words, you do the caching thing as well. It seems a little
ugly, though, to store the index-only index files with the rest of the
packs. It might be preferable to introduce something like
$GIT_DIR/index-cache, so that it can be easily cleaned (and doesn't
follow us around forever when
cloning-by-hardlinking-the-entire-object-directory).

You might end up with quite a large number of index files after a while,
though, if you pull from several repositories that are regularly repacked.
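
A minimal sketch of the separate-cache idea, for illustration only. The
directory name "index-cache" comes from the suggestion above; the function
names and the "pack-*.idx" naming are assumptions, not anything git
provides.

```python
from pathlib import Path


def indexes_to_fetch(git_dir, remote_packs):
    """Return the remote pack names whose .idx we have not seen yet.

    An index counts as seen if it sits next to a downloaded pack in
    objects/pack, or in the (hypothetical) index-cache directory.
    """
    git_dir = Path(git_dir)
    pack_dir = git_dir / "objects" / "pack"
    cache_dir = git_dir / "index-cache"  # hypothetical location
    wanted = []
    for name in remote_packs:
        idx = name + ".idx"
        if (pack_dir / idx).exists() or (cache_dir / idx).exists():
            continue
        wanted.append(name)
    return wanted


def clean_index_cache(git_dir):
    """Delete every cached index; return how many files were removed."""
    cache_dir = Path(git_dir) / "index-cache"
    removed = 0
    if cache_dir.is_dir():
        for idx in cache_dir.glob("*.idx"):
            idx.unlink()
            removed += 1
    return removed
```

Because the cache lives outside objects/, hardlinking the object directory
would not carry it along, and cleaning it up is a single directory sweep.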

>>>   Keep a list of the struct packed_gits for the packs the server has
>>>    (these are not used as places to look for objects)
>>>
>>> Each time we need an object, check the list for it. If it is in there,
>>>  download the corresponding pack and report success.
>>
>>Here you will need some strategy to deal with packs that overlap with 
>>what we've already got. Basically, small and overlapping packs should be 
>>unpacked, big and non-overlapping ones saved as is (since 
>>git-unpack-objects is painfully slow and memory-hungry...).
> 
> 
> I don't think there's an issue to having overlapping packs, either with
> each other or with separate objects. If the user wants, stuff can be
> repacked outside of the pull operation (note, though, that the index files
> should be truncated rather than removed, so that the program doesn't fetch
> them again next time some object can't be found easily).

Well, the only issue is obviously wasted space. That might add up to
quite a lot, though, if you fetch many branches from independently
packed repos.
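
To make the trade-off concrete, here is a rough sketch of the "unpack small
and overlapping packs, keep the rest" policy mentioned in the quoted text.
The thresholds (1 MiB, 50% overlap) and both function names are
illustrative assumptions; nothing in this thread fixes them, and how to
treat big-but-overlapping packs is left open.

```python
def overlap_fraction(pack_objects, have):
    """Fraction of the pack's objects that are already present locally."""
    pack_objects = set(pack_objects)
    if not pack_objects:
        return 0.0
    return len(pack_objects & set(have)) / len(pack_objects)


def should_unpack(pack_objects, have, pack_size,
                  small_bytes=1 << 20, overlap_threshold=0.5):
    """Suggest unpacking only when the pack is both small and overlapping.

    pack_objects: object names listed in the pack's index.
    have: object names we already have.
    pack_size: size of the pack file in bytes.
    The default thresholds are arbitrary placeholders.
    """
    return (pack_size <= small_bytes
            and overlap_fraction(pack_objects, have) >= overlap_threshold)
```

Everything needed for the decision is available from the index file and
the pack's size, so it can be made before the pack itself is downloaded.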

About truncating index files: this seems a bit ugly. You get a file that
doesn't contain what it says it contains, which may cause trouble if,
for example, git-prune-script is used.

You might be better off with a simple list of the index files whose
objects we know we have in full (and make sure that git-prune-script
deletes this list, since pruning may break that guarantee).
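
A small sketch of such a list, under stated assumptions: the file name
"complete-packs" and all three function names are made up for
illustration, and the prune hook is only a stand-in for whatever
git-prune-script would actually do.

```python
from pathlib import Path

LIST_NAME = "complete-packs"  # hypothetical file under $GIT_DIR


def load_complete(git_dir):
    """Return the set of pack names whose objects we fully have."""
    path = Path(git_dir) / LIST_NAME
    if not path.exists():
        return set()
    return set(path.read_text().split())


def mark_complete(git_dir, pack_name):
    """Record that every object of pack_name is present locally."""
    names = load_complete(git_dir)
    names.add(pack_name)
    text = "".join(name + "\n" for name in sorted(names))
    (Path(git_dir) / LIST_NAME).write_text(text)


def invalidate_complete(git_dir):
    """Drop the list, e.g. after pruning may have removed objects."""
    path = Path(git_dir) / LIST_NAME
    if path.exists():
        path.unlink()
```

Compared with truncating index files, every file on disk still contains
what its name says, and forgetting the list only costs a re-fetch of some
index files.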

>>One could also optimize the pack-download bit, by figuring out the last
>>object in the pack that we need (easy enough to do from the index file),
>>and just get the part of the pack file leading up to that object. That
>>could be a huge win for independently packed repositories (I don't do
>>that in my code below, though).
> 
> 
> That's only possible if you can figure out what you want to have before
> you get it. My code is walking the reachability graph on the client; it
> can only figure out what other objects it needs after it's mapped the pack
> file.

No, but we can find out which objects we *don't* want (i.e. the ones we
have). And that may be a lot, e.g. if a repository is fully repacked, or
if we track branches on several similar but independently packed
repositories. And as far as I understand git-pack-objects, it tries to
put recent objects at the front.

I don't have any numbers to back this up with, though. Some testing may 
be needed, but since the population of packed public repositories is 1, 
this is tricky...
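
The partial-download idea can be sketched as follows. This is not taken
from any existing code: it assumes the index has already been parsed into
a mapping from object name to byte offset in the pack, and the function
names are hypothetical. It also ignores two real complications: a delta
may need a base stored later in the pack, and a truncated pack lacks its
trailing checksum.

```python
def needed_prefix_length(index, have, pack_size):
    """Return how many leading bytes of the pack cover all missing objects.

    index: dict mapping object name -> byte offset within the pack.
    have: set of object names already present locally.
    pack_size: total size of the pack file in bytes.
    Returns 0 when nothing in the pack is missing.
    """
    by_offset = sorted(index.items(), key=lambda item: item[1])
    end = 0
    for i, (name, _offset) in enumerate(by_offset):
        if name in have:
            continue
        # An object ends where the next one starts, or at the end of the
        # pack file if it is the last one.
        if i + 1 < len(by_offset):
            end = by_offset[i + 1][1]
        else:
            end = pack_size
    return end


def range_header(length):
    """Build an HTTP Range header requesting the first `length` bytes."""
    return {"Range": "bytes=0-%d" % (length - 1)}
```

For example, with objects at offsets 12, 100 and 250 in a 400-byte pack,
already having the object at offset 250 means only the first 250 bytes
need to be requested.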

> I might use that method for listing the available packs, although I'd sort
> of like to encourage a clean solution first.

Encouraging cleanliness is obviously a good thing :-)

/dan

