From: Ben Peart <peartben@gmail.com>
To: Jonathan Tan <jonathantanmy@google.com>,
Junio C Hamano <gitster@pobox.com>
Cc: git@vger.kernel.org, Jonathan Nieder <jrnieder@gmail.com>,
christian.couder@gmail.com
Subject: Re: Partial clone design (with connectivity check for locally-created objects)
Date: Mon, 7 Aug 2017 15:12:11 -0400 [thread overview]
Message-ID: <0633771f-ce19-6211-fabe-3f7f676e53ab@gmail.com> (raw)
In-Reply-To: <20170804172137.42f27653@twelve2.svl.corp.google.com>
On 8/4/2017 8:21 PM, Jonathan Tan wrote:
> On Fri, 04 Aug 2017 15:51:08 -0700
> Junio C Hamano <gitster@pobox.com> wrote:
>
>> Jonathan Tan <jonathantanmy@google.com> writes:
>>
>>> "Imported" objects must be in a packfile that has a "<pack name>.remote"
>>> file with arbitrary text (similar to the ".keep" file). They come from
>>> clones, fetches, and the object loader (see below).
>>> ...
>>> A "homegrown" object is valid if each object it references:
>>> 1. is a "homegrown" object,
>>> 2. is an "imported" object, or
>>> 3. is referenced by an "imported" object.
>>
>> Overall it captures what was discussed, and I think it is a good
>> start.
I missed the offline discussion and so am trying to piece together what
this latest design is trying to do. Please let me know if I'm not
understanding something correctly.
From what I can tell, objects are going to be segmented into two
"types" - those that were fetched from a remote source that allows
partial clones/fetches (lazyobject/imported) and those that come from
"regular" remote sources (homegrown) that requires all objects to exist
locally.
FWIW, the names here are not making things clearer for me. If I'm
correct perhaps "partial" and "normal" would be better to indicate the
type of the source? Anyway...
Once the objects are segmented into the 2 types, the fsck connectivity
check code is updated to ignore missing objects from "partial" remotes
but still expect/validate them from "normal" remotes.
This compromise seems reasonable - don't generate errors for missing
objects for remotes that returned a partial clone but do generate errors
for missing objects from normal clones as a missing object is always an
error in this case.
This segmentation is what is driving the need for the object loader to
build a new local pack file for every command that has to fetch a
missing object. For example, we can't just write a tree object from a
"partial" clone into the loose object store as we have no way for fsck
to treat them differently and ignore any missing objects referenced by
that tree object.
My concern with this proposal is the combination of 1) writing a new
pack file for every git command that ends up bringing down a missing
object and 2) gc not compressing those pack files into a single pack file.
We all know that git doesn't scale well with a lot of pack files as it
has to do a linear search through all the pack files when attempting to
find an object. I can see that very quickly, there would be a lot of
pack files generated and with gc ignoring "partial" pack files, this
would never get corrected.
In our usage scenarios, _all_ of the objects come from "partial" clones
so all of our objects would end up in a series of "partial" pack files
and would have pretty poor performance as a result.
I wondered if it is possible to flag a specific remote as "partial" and
have fsck be able to track any given object back to the remote and then
properly handle the fact that it was missing based on that. I couldn't
think of a good way to do that without some additional data structure
that would have to be build/maintained (ie promises).
That thinking did lead me back to wondering again if we could live with
a repo specific flag. If any clone/fetch was "partial" the flag is set
and fsck ignore missing objects whether they came from a "partial"
remote or not.
I'll admit it isn't as robust if someone is mixing and matching remotes
from different servers some of which are partial and some of which are
not. I'm not sure how often that would actually happen but I _am_
certain a single repo specific flag is a _much_ simpler model than
anything else we've come up with so far.
>>
>> I doubt you want to treat all fetches/clones the same way as the
>> "lazy object" loading, though. You may be critically rely on the
>> corporate central server that will give the objects it "promised"
>> when you cloned from it lazily (i.e. it may have given you a commit,
>> but not its parents or objects contained in its tree--you still know
>> that the parents and the tree and its contents will later be
>> available and rely on that fact). You trust that and build on top,
>> so the packfile you obtained when you cloned from such a server
>> should count as "imported". But if you exchanged wip changes with
>> your colleages by fetching or pushing peer-to-peer, without the
>> corporate central server knowing, you would want to treat objects in
>> packs (or loose objects) you obtained that way as "not imported".
>
> That's true. I discussed this with a teammate and we might need to make
> extensions.lazyObject be the name of the "corporate central server"
> remote instead, and have a "loader" setting within that remote, so that
> we can distinguish that objects from this server are "imported" but
> objects from other servers are not.
>
> The connectivity check shouldn't be slow in this case because fetches
> are usually onto tips that we have (so we don't hit case 3).
>
>> Also I think "imported" vs "homegrown" may be a bit misnomer; the
>> idea to split objects into two camps sounds like a good idea, and
>> "imported" probably is an OK name to use for the category that is a
>> group of objects to which you know/trust are backed by your lazy
>> loader. But the other one does not have to be "home"-grown.
>>
>> Well, the names are not that important, but I think the line between
>> the two classes should not be "everything that came from clone and
>> fetch is imported", which is a more important point I am trying to
>> make.
>>
>> Thanks.
>
> Maybe "imported" vs "non-imported" would be better. I agree that the
> objects in the non-"imported" group could still be obtained from
> elsewhere.
>
> Thanks for your comments.
>
next prev parent reply other threads:[~2017-08-07 19:12 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-08-04 21:51 Partial clone design (with connectivity check for locally-created objects) Jonathan Tan
2017-08-04 22:51 ` Junio C Hamano
2017-08-05 0:21 ` Jonathan Tan
2017-08-07 19:12 ` Ben Peart [this message]
2017-08-07 19:21 ` Jonathan Nieder
2017-08-08 14:18 ` Ben Peart
2017-08-07 19:41 ` Junio C Hamano
2017-08-08 16:45 ` Ben Peart
2017-08-08 17:03 ` Jonathan Nieder
2017-08-07 23:10 ` Jonathan Tan
2017-08-16 0:32 ` [RFC PATCH] Updated "imported object" design Jonathan Tan
2017-08-16 20:32 ` Junio C Hamano
2017-08-16 21:35 ` Jonathan Tan
2017-08-17 20:50 ` Ben Peart
2017-08-17 21:39 ` Jonathan Tan
2017-08-18 14:18 ` Ben Peart
2017-08-18 23:33 ` Jonathan Tan
2017-08-17 20:07 ` Ben Peart
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=0633771f-ce19-6211-fabe-3f7f676e53ab@gmail.com \
--to=peartben@gmail.com \
--cc=christian.couder@gmail.com \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=jonathantanmy@google.com \
--cc=jrnieder@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).