From mboxrd@z Thu Jan 1 00:00:00 1970 From: Dan Holmsand Subject: Re: [RFC] Design for http-pull on repo with packs Date: Sun, 10 Jul 2005 23:39:11 +0200 Message-ID: <42D1957F.1050609@gmail.com> References: Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: git@vger.kernel.org X-From: git-owner@vger.kernel.org Sun Jul 10 23:45:11 2005 Return-path: Received: from vger.kernel.org ([12.107.209.244]) by ciao.gmane.org with esmtp (Exim 4.43) id 1DrjbW-0000v2-Bc for gcvg-git@gmane.org; Sun, 10 Jul 2005 23:44:50 +0200 Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S262083AbVGJVmA (ORCPT ); Sun, 10 Jul 2005 17:42:00 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S262119AbVGJVjc (ORCPT ); Sun, 10 Jul 2005 17:39:32 -0400 Received: from wproxy.gmail.com ([64.233.184.192]:9912 "EHLO wproxy.gmail.com") by vger.kernel.org with ESMTP id S261792AbVGJVjS (ORCPT ); Sun, 10 Jul 2005 17:39:18 -0400 Received: by wproxy.gmail.com with SMTP id 67so790047wri for ; Sun, 10 Jul 2005 14:39:14 -0700 (PDT) DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=beta; d=gmail.com; h=received:message-id:date:from:user-agent:x-accept-language:mime-version:to:cc:subject:references:in-reply-to:content-type:content-transfer-encoding; b=nfhUKktFtSbIFIjomDl/n99aP6KhTAhLDmkHs2KtolXcK/FH5YFIkWGdYSOKEkAtN7kSmS8gfl2H87piQNhYbaoeV2v9fpfDY95j5DYZix5pFT/13TyFC7RyKO9aOSgaXhR8cnAFOufHh42SQMxgssUI31xQNHgMO9wHoh07DBs= Received: by 10.54.44.7 with SMTP id r7mr3297158wrr; Sun, 10 Jul 2005 14:39:14 -0700 (PDT) Received: from ?192.168.0.5? ([80.217.52.214]) by mx.gmail.com with ESMTP id 43sm5064777wri.2005.07.10.14.39.13; Sun, 10 Jul 2005 14:39:14 -0700 (PDT) User-Agent: Mozilla Thunderbird 1.0.2 (X11/20050404) X-Accept-Language: en-us, en To: Daniel Barkalow In-Reply-To: Sender: git-owner@vger.kernel.org Precedence: bulk X-Mailing-List: git@vger.kernel.org Daniel Barkalow wrote: > On Sun, 10 Jul 2005, Dan Holmsand wrote: >>Daniel Barkalow wrote: >>> If an individual file is not available, figure out what packs are >>> available: >>> >>> Get the list of pack files the repository has >>> (currently, I just use "e3117bbaf6a59cb53c3f6f0d9b17b9433f0e4135") >>> For any packs we don't have, get the index files. >> >>This part might be slightly expensive, for large repositories. If one >>assumes that packs are named as by git-repack-script, however, one might >>cache indexes we've already seen (again, see below). Or, if you go for >>the mandatory "pack-index-file", require that it has a reliable order, >>so that you can get the last added index first. > > > Nothing bad happens if you have index files for pack files you don't have, > as it turns out; the library ignores them. So we can keep the index files > around so we can quickly check if they have the objects we want. That way, > we don't have to worry about skipping something now (because it's not > needed) and then ignoring it when the branch gets merged in. > > So what I actually do is make a list of the pack files that aren't already > downloaded that are available from the server, and download the index > files for any where the index file isn't downloaded, either. Aah. In other words, you do the caching thing as well. It seems a little ugly, though, to store the index-only index files with the rest of the pack. It might be preferable to introduce something like $GIT_DIR/index-cache or something, so than it can be easily cleaned (and don't follow us around forever when cloning-by-hardlinking-the-entire-object-directory). You might end up with quite a large number of index files, after a while though, if you pull from several repositories that are regularly repacked. >>> Keep a list of the struct packed_gits for the packs the server has >>> (these are not used as places to look for objects) >>> >>> Each time we need an object, check the list for it. If it is in there, >>> download the corresponding pack and report success. >> >>Here you will need some strategy to deal with packs that overlap with >>what we've already got. Basically, small and overlapping packs should be >>unpacked, big and non-overlapping ones saved as is (since >>git-unpack-objects is painfully slow and memory-hungry...). > > > I don't think there's an issue to having overlapping packs, either with > each other or with separate objects. If the user wants, stuff can be > repacked outside of the pull operation (note, though, that the index files > should be truncated rather than removed, so that the program doesn't fetch > them again next time some object can't be found easily). Well, the only issue is obviously waste of space. If you fetch a lot of branches from independently packed repos, it might mean a lot of waste, though. About truncating index files: this seems a bit ugly. You get a file that doesn't contain what it says it contains, which may cause trouble if for example the git prune thing is used. You might be better off with a simple list of index files we know we have all the objects of (and make sure that git-prune-script deletes this file, since it possibly breaks the contract). >>One could also optimize the pack-download bit, by figuring out the last >>object in the pack that we need (easy enough to do from the index file), >> and just get the part of the pack file leading up to that object. That >>could be a huge win for independently packed repositories (I don't do >>that in my code below, though). > > > That's only possible if you can figure out what you want to have before > you get it. My code is walking the reachability graph on the client; it > can only figure out what other objects it needs after it's mapped the pack > file. No, but we can find out which objects we *don't* want (i.e. the ones we have). And that may be a lot, e.g. if a repository is fully repacked, or if we track branches on several similar but independently packed repositories. And as far as I understand git-pack-objects, it tries to put recent objects in the front. I don't have any numbers to back this up with, though. Some testing may be needed, but since the population of packed public repositories is 1, this is tricky... > I might use that method for listing the available packs, although I'd sort > of like to encourage a clean solution first. Encouraging cleanliness is obviously a good thing :-) /dan