git.vger.kernel.org archive mirror
From: Ben Peart <peartben@gmail.com>
To: Jonathan Tan <jonathantanmy@google.com>
Cc: Junio C Hamano <gitster@pobox.com>,
	git@vger.kernel.org, jrnieder@gmail.com
Subject: Re: [RFC PATCH] Updated "imported object" design
Date: Fri, 18 Aug 2017 10:18:37 -0400
Message-ID: <78139f5c-a044-9c00-11ff-eb91a70b6ab9@gmail.com>
In-Reply-To: <20170817143905.2ef872e6@twelve2.svl.corp.google.com>



On 8/17/2017 5:39 PM, Jonathan Tan wrote:
> Thanks for your comments. I'll reply to both your e-mails in this one
> e-mail.
> 
>> This illustrates another place we need to resolve the
>> naming/vocabulary.  We should at least be consistent to make it easier
>> to discuss/explain.  We obviously went with "virtual" when building
>> GVFS but I'm OK with "lazy" as long as we're consistent.  Some
>> examples of how the naming can clarify or confuse:
>>
>> 'Promise-enable your repo by setting the "extensions.lazyObject" flag'
>>
>> 'Enable your repo to lazily fetch objects by setting the
>> "extensions.lazyObject"'
>>
>> 'Virtualize your repo by setting the "extensions.virtualize" flag'
>>
>> We may want to carry the same name into the filename we use to mark
>> the (virtualized/lazy/promised/imported) objects.
>>
>> (This reminds me that there are only 2 hard problems in computer
>> science...) ;)
> 
> Good point about the name. Maybe the 2nd one is the best? (Mainly
> because I would expect a "virtualized" repo to have virtual refs too.)
> 
> But if there was a good way to refer to the "anti-projection" in a
> virtualized system (that is, the "real" thing or "object" behind the
> "virtual" thing or "image"), then maybe the "virtualized" language is
> the best. (And I would gladly change - I'm having a hard time coming up
> with a name for the "anti-projection" in the "lazy" language.)
> 

The most common "anti-virtual" language I'm familiar with is "physical":
virtual machine <-> physical machine, virtual world <-> physical world,
virtual repo, commit, tree, blob <-> physical repo, commit, tree, blob.
I'm not thrilled with it, but I think it works...

> Also, I should probably standardize on "lazily fetch" instead of "lazily
> load". I didn't want to overlap with the existing fetching, but after
> some thought, it's probably better to do that. The explanation would
> thus be that you can either use the built-in Git fetcher (to be built,
> although I have an old version here [1]) or supply a custom fetcher.
> 
> [1] https://github.com/jonathantanmy/git/commits/partialclone
> 
>> I think this all works and would meet the requirements we've been
>> discussing.  The big trade off here vs what we first discussed with
>> promises is that we are generating the list of promises on the fly
>> when they are needed rather than downloading and maintaining a list
>> locally.
>>
>> My biggest concern with this model is the cost of opening and parsing
>> every imported object (loose and pack for local and alternates) to
>> build the oidset of promises.
>>
>> In fsck this probably won't be an issue as it already focuses on
>> correctness at the expense of speed.  I'm more worried about when we
>> add the same/similar logic into check_connected.  That impacts fetch,
>> clone, and receive_pack.
>>
>> I guess the only way we can know for sure is to do a perf test and
>> measure the impact.
> 
> As for fetching from the main repo, the connectivity check does not need
> to be performed at all because all objects are "imported", so the
> performance of the connectivity check does not matter. Same for cloning.
> 

Very good point! I got stuck on the connectivity check in general,
forgetting that we really only need to prevent sharing a corrupt repo.

> This is not true if you're fetching from another repo 

This isn't a case we've explicitly dealt with (multiple remotes into a
virtualized repo).  Our behavior today is that once you set the
"virtual repo" flag on the repo (this happens at clone for us), all
remotes are treated as virtual as well (i.e., we don't differentiate
behavior based on which remote was used).  When asked to fetch missing
objects, our "custom fetcher" always uses "origin" plus some custom
cache-server settings saved in the .git/config file.

This is probably a good model to stick with, at least initially.  Trying
to solve multiple possible "virtual" remotes, mingled virtualized and
non-virtualized remotes, and all the mixed cases that can come up makes
my head hurt.  We should probably address that in a different thread. :)

> or if you're using
> receive-pack, but (1) I think these are not used as much in such a
> situation, and (2) if you do use them, the slowness only "kicks in" if
> you do not have the objects referred to (whether non-"imported" or
> "imported") and thus have to check the references in all "imported"
> objects.
> 

Is there any case where receive-pack is used on the client side?  I'm 
only aware of it being used on the server side to receive packs pushed 
from the client.  If it is not used in a virtualized client, then we 
would not need to do anything different for receive-pack.

>> I think this topic should continue to move forward so that we can
>> provide reasonable connectivity tests for fsck and check_connected in
>> the face of partial clones.  I'm not sure the prototype implementation
>> of reading/parsing all imported objects to build the promised oidset is
>> the most performant model but we can continue to investigate the best
>> options.
> 
> Agreed - I think the most important thing here is settling on the API
> (name of extension and the nature of the object mark).
> 
>> Given all we need is an existence check for a given oid,
> 
> This is true...
> 
>> I wonder if it
>> would be faster overall to do a binary search through the list of
>> imported idx files + an existence test for an imported loose object.
> 
> ...but what we're checking is the existence of a reference, not the
> existence of an object. For a concrete example, consider what happens if
> we both have an "imported" tree and a non-"imported" tree that
> references a blob that we do not have. When checking the non-"imported"
> tree for connectivity, we have to iterate through all "imported" trees
> to see if any can vouch for the existence of such a blob. We cannot
> merely binary-search the .idx file.
> 

That is another good point.  Given the discussion above about not
needing to do the connectivity test for fetch/clone, the potential perf
hit of loading/parsing all the various objects to build up the oidset is
much less of an issue.
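
To make the object-vs-reference distinction concrete, here is a minimal
Python sketch (not Git code). It models a tiny object store in memory and
shows that a binary search for the missing blob fails, while the set of
oids referenced by "imported" objects can still vouch for it. The oids,
the tuple layout, and the function names are all assumptions made up for
illustration. The sketch also assumes that imported objects are trusted
and not traversed further.

```python
from bisect import bisect_left

# Hypothetical store: oid -> (type, referenced oids, imported?).
# This layout is illustrative only, not Git's on-disk format.
OBJECTS = {
    "tree-imported": ("tree", ["blob-missing", "blob-present"], True),
    "tree-local": ("tree", ["blob-missing"], False),
    "blob-present": ("blob", [], True),
}

def object_exists(sorted_oids, oid):
    """Binary search over a sorted oid list, similar in spirit to an .idx lookup."""
    i = bisect_left(sorted_oids, oid)
    return i < len(sorted_oids) and sorted_oids[i] == oid

def build_promised_set(objects):
    """Collect every oid referenced by an imported object (the 'promises')."""
    promised = set()
    for _type, refs, imported in objects.values():
        if imported:
            promised.update(refs)
    return promised

def is_connected(objects, start, promised):
    """Walk from a local object; each reference must be present or promised."""
    stack, seen = [start], set()
    while stack:
        oid = stack.pop()
        if oid in seen:
            continue
        seen.add(oid)
        if oid not in objects:
            if oid in promised:
                continue  # an imported object vouches for this oid
            return False  # neither present nor promised
        _type, refs, imported = objects[oid]
        if not imported:  # assumption: imported objects are not traversed
            stack.extend(refs)
    return True

sorted_oids = sorted(OBJECTS)
promised = build_promised_set(OBJECTS)

print(object_exists(sorted_oids, "blob-missing"))    # existence check alone
print(is_connected(OBJECTS, "tree-local", promised)) # reference check
```

Building the promised set requires reading every imported object once,
which is the cost discussed above; the lookup afterwards is a plain set
membership test.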


