Re: Partial clone design (with connectivity check for locally-created objects)

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Ben Peart <peartben@gmail.com>
To: Jonathan Tan <jonathantanmy@google.com>,
	Junio C Hamano <gitster@pobox.com>
Cc: git@vger.kernel.org, Jonathan Nieder <jrnieder@gmail.com>,
	christian.couder@gmail.com
Subject: Re: Partial clone design (with connectivity check for locally-created objects)
Date: Mon, 7 Aug 2017 15:12:11 -0400	[thread overview]
Message-ID: <0633771f-ce19-6211-fabe-3f7f676e53ab@gmail.com> (raw)
In-Reply-To: <20170804172137.42f27653@twelve2.svl.corp.google.com>

On 8/4/2017 8:21 PM, Jonathan Tan wrote:
> On Fri, 04 Aug 2017 15:51:08 -0700
> Junio C Hamano <gitster@pobox.com> wrote:
> 
>> Jonathan Tan <jonathantanmy@google.com> writes:
>>
>>> "Imported" objects must be in a packfile that has a "<pack name>.remote"
>>> file with arbitrary text (similar to the ".keep" file). They come from
>>> clones, fetches, and the object loader (see below).
>>> ...
>>> A "homegrown" object is valid if each object it references:
>>>   1. is a "homegrown" object,
>>>   2. is an "imported" object, or
>>>   3. is referenced by an "imported" object.
>>
>> Overall it captures what was discussed, and I think it is a good
>> start.

I missed the offline discussion and so am trying to piece together what 
this latest design is trying to do.  Please let me know if I'm not 
understanding something correctly.

 From what I can tell, objects are going to be segmented into two 
"types" - those that were fetched from a remote source that allows 
partial clones/fetches (lazyobject/imported) and those that come from 
"regular" remote sources (homegrown) that requires all objects to exist 
locally.

FWIW, the names here are not making things clearer for me. If I'm 
correct perhaps "partial" and "normal" would be better to indicate the 
type of the source? Anyway...

Once the objects are segmented into the 2 types, the fsck connectivity 
check code is updated to ignore missing objects from "partial" remotes 
but still expect/validate them from "normal" remotes.

This compromise seems reasonable - don't generate errors for missing 
objects for remotes that returned a partial clone but do generate errors 
for missing objects from normal clones as a missing object is always an 
error in this case.

This segmentation is what is driving the need for the object loader to 
build a new local pack file for every command that has to fetch a 
missing object.  For example, we can't just write a tree object from a 
"partial" clone into the loose object store as we have no way for fsck 
to treat them differently and ignore any missing objects referenced by 
that tree object.

My concern with this proposal is the combination of 1) writing a new 
pack file for every git command that ends up bringing down a missing 
object and 2) gc not compressing those pack files into a single pack file.

We all know that git doesn't scale well with a lot of pack files as it 
has to do a linear search through all the pack files when attempting to 
find an object.  I can see that very quickly, there would be a lot of 
pack files generated and with gc ignoring "partial" pack files, this 
would never get corrected.

In our usage scenarios, _all_ of the objects come from "partial" clones 
so all of our objects would end up in a series of "partial" pack files 
and would have pretty poor performance as a result.

I wondered if it is possible to flag a specific remote as "partial" and 
have fsck be able to track any given object back to the remote and then 
properly handle the fact that it was missing based on that. I couldn't 
think of a good way to do that without some additional data structure 
that would have to be build/maintained (ie promises).

That thinking did lead me back to wondering again if we could live with 
a repo specific flag.  If any clone/fetch was "partial" the flag is set 
and fsck ignore missing objects whether they came from a "partial" 
remote or not.

I'll admit it isn't as robust if someone is mixing and matching remotes 
from different servers some of which are partial and some of which are 
not.  I'm not sure how often that would actually happen but I _am_ 
certain a single repo specific flag is a _much_ simpler model than 
anything else we've come up with so far.

>>
>> I doubt you want to treat all fetches/clones the same way as the
>> "lazy object" loading, though.  You may be critically rely on the
>> corporate central server that will give the objects it "promised"
>> when you cloned from it lazily (i.e. it may have given you a commit,
>> but not its parents or objects contained in its tree--you still know
>> that the parents and the tree and its contents will later be
>> available and rely on that fact).  You trust that and build on top,
>> so the packfile you obtained when you cloned from such a server
>> should count as "imported".  But if you exchanged wip changes with
>> your colleages by fetching or pushing peer-to-peer, without the
>> corporate central server knowing, you would want to treat objects in
>> packs (or loose objects) you obtained that way as "not imported".
> 
> That's true. I discussed this with a teammate and we might need to make
> extensions.lazyObject be the name of the "corporate central server"
> remote instead, and have a "loader" setting within that remote, so that
> we can distinguish that objects from this server are "imported" but
> objects from other servers are not.
> 
> The connectivity check shouldn't be slow in this case because fetches
> are usually onto tips that we have (so we don't hit case 3).
> 
>> Also I think "imported" vs "homegrown" may be a bit misnomer; the
>> idea to split objects into two camps sounds like a good idea, and
>> "imported" probably is an OK name to use for the category that is a
>> group of objects to which you know/trust are backed by your lazy
>> loader.  But the other one does not have to be "home"-grown.
>>
>> Well, the names are not that important, but I think the line between
>> the two classes should not be "everything that came from clone and
>> fetch is imported", which is a more important point I am trying to
>> make.
>>
>> Thanks.
> 
> Maybe "imported" vs "non-imported" would be better. I agree that the
> objects in the non-"imported" group could still be obtained from
> elsewhere.
> 
> Thanks for your comments.
>

next prev parent reply	other threads:[~2017-08-07 19:12 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-08-04 21:51 Partial clone design (with connectivity check for locally-created objects) Jonathan Tan
2017-08-04 22:51 ` Junio C Hamano
2017-08-05  0:21   ` Jonathan Tan
2017-08-07 19:12     ` Ben Peart [this message]
2017-08-07 19:21       ` Jonathan Nieder
2017-08-08 14:18         ` Ben Peart
2017-08-07 19:41       ` Junio C Hamano
2017-08-08 16:45         ` Ben Peart
2017-08-08 17:03           ` Jonathan Nieder
2017-08-07 23:10       ` Jonathan Tan
2017-08-16  0:32 ` [RFC PATCH] Updated "imported object" design Jonathan Tan
2017-08-16 20:32   ` Junio C Hamano
2017-08-16 21:35     ` Jonathan Tan
2017-08-17 20:50       ` Ben Peart
2017-08-17 21:39         ` Jonathan Tan
2017-08-18 14:18           ` Ben Peart
2017-08-18 23:33             ` Jonathan Tan
2017-08-17 20:07   ` Ben Peart

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=0633771f-ce19-6211-fabe-3f7f676e53ab@gmail.com \
    --to=peartben@gmail.com \
    --cc=christian.couder@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=jonathantanmy@google.com \
    --cc=jrnieder@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).