From: Ben Peart <peartben@gmail.com>
To: Duy Nguyen <pclouds@gmail.com>
Cc: Git Mailing List <git@vger.kernel.org>,
	Junio C Hamano <gitster@pobox.com>,
	Ben Peart <benpeart@microsoft.com>
Subject: Re: [RFC v1] Add virtual file system settings and hook proc
Date: Wed, 31 Oct 2018 16:53:41 -0400
Message-ID: <1f7efd07-4881-daa7-cd1d-145bbf3ffcc8@gmail.com>
In-Reply-To: <CACsJy8DbiVZYmY11Nt4c_+egSi5tz0iVq7rNv2BiVdyJ4htgvw@mail.gmail.com>



On 10/31/2018 3:11 PM, Duy Nguyen wrote:
> not really a review, just a couple of quick notes..
> 

Perfect!  Since this is an RFC, I'm looking more for high-level
thoughts/notes than for a style/syntax code review.

> On Tue, Oct 30, 2018 at 9:40 PM Ben Peart <peartben@gmail.com> wrote:
>>
>> From: Ben Peart <benpeart@microsoft.com>
>>
>> On index load, clear/set the skip worktree bits based on the virtual
>> file system data. Use virtual file system data to update skip-worktree
>> bit in unpack-trees. Use virtual file system data to exclude files and
>> folders not explicitly requested.
>>
>> Signed-off-by: Ben Peart <benpeart@microsoft.com>
>> ---
>>
>> We have taken several steps to make git perform well on very large repos.
>> Some of those steps include: improving underlying algorithms, utilizing
>> multi-threading where possible, and simplifying the behavior of some commands.
>> These changes typically benefit all git repos to varying degrees.  While
>> these optimizations all help, they are insufficient to provide adequate
>> performance on the very large repos we often work with.
>>
>> To make git perform well on the very largest repos, we had to make more
>> significant changes.  The biggest performance win by far is the work we have
>> done to make git operations O(modified) instead of O(size of repo).  This
>> takes advantage of the fact that the number of files a developer has modified
>> is a tiny fraction of the overall repo size.
>>
>> We accomplished this by utilizing the existing internal logic for the skip
>> worktree bit and excludes to tell git to ignore all files and folders other
>> than those that have been modified.  This logic is driven by an external
>> process that monitors writes to the repo and communicates the list of files
>> and folders with changes to git via the virtual file system hook in this patch.
>>
>> The external process maintains a list of files and folders that have been
>> modified.  When git runs, it requests the list of files and folders that
>> have been modified via the virtual file system hook.  Git then sets/clears
>> the skip-worktree bit on the cache entries and builds a hashmap of the
>> modified files/folders that is used by the excludes logic to avoid scanning
>> the entire repo looking for changes and untracked files.
>>
>> With this system, we have been able to make local git command performance on
>> extremely large repos (millions of files, 1/2 million folders) entirely
>> manageable (30 second checkout, 3.5 seconds status, 4 second add, 7 second
>> commit, etc).
>>
>> Our desire is to eliminate all custom patches in our fork of git.  To that
>> end, I'm submitting this as an RFC to see how much interest there is and how
>> much willingness to take this type of change into git.
> 
> Most of these paragraphs (perhaps except the last one) should be part
> of the commit message. You describe briefly what the patch does but
> it's even more important to say why you want to do it.
> 
>> +core.virtualFilesystem::
>> +       If set, the value of this variable is used as a command which
>> +       will identify all files and directories that are present in
>> +       the working directory.  Git will only track and update files
>> +       listed in the virtual file system.  Using the virtual file system
>> +       will supersede the sparse-checkout settings which will be ignored.
>> +       See the "virtual file system" section of linkgit:githooks[6].
> 
> It sounds like "virtual file system" is just one of the use cases for
> this feature, which is more about a dynamic source of sparse-checkout
> bits. Perhaps name the config key with something along sparse checkout
> instead of naming it after a use case.

It's more than a dynamic sparse-checkout because the same list is also 
used to exclude any file/folder not listed.  That means any file not 
listed won't ever be updated by git (like in 'checkout' for example) so 
'stale' files could be left in the working directory.  It also means git 
won't find new/untracked files unless they are specifically added to the 
list.
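
To make the dual use of the list concrete, here is a rough sketch in
Python (not git's C, and not code from the patch).  It assumes the hook
output has already been parsed into a set of paths and that a folder is
listed with a trailing '/'.  All function names are hypothetical.

```python
from typing import Dict, Iterable, Iterator, Set


def parent_dirs(path: str) -> Iterator[str]:
    """Yield each ancestor folder of a slash-separated path, with a trailing '/'."""
    parts = path.rstrip("/").split("/")[:-1]
    for i in range(1, len(parts) + 1):
        yield "/".join(parts[:i]) + "/"


def is_listed(path: str, virtual_list: Set[str]) -> bool:
    """True if the path is listed itself or lies under a listed folder."""
    if path in virtual_list:
        return True
    return any(d in virtual_list for d in parent_dirs(path))


def skip_worktree_bits(index_paths: Iterable[str],
                       virtual_list: Set[str]) -> Dict[str, bool]:
    """First use: tracked entries NOT in the list get skip-worktree set."""
    return {p: not is_listed(p, virtual_list) for p in index_paths}


def is_excluded(worktree_path: str, virtual_list: Set[str]) -> bool:
    """Second use: anything not in the list is excluded from the scan,
    so an unlisted new file is never reported as untracked."""
    return not is_listed(worktree_path, virtual_list)


if __name__ == "__main__":
    virtual = {"src/main.c", "docs/"}
    print(skip_worktree_bits(["src/main.c", "src/util.c", "docs/a.txt"], virtual))
    print(is_excluded("tmp/new.txt", virtual))
```

The second function is what a plain dynamic sparse-checkout would lack:
the same list also decides which untracked paths git looks at.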

> 
> This is a hook. I notice we start to avoid adding real hooks and just
> add config keys instead. Eventually we should have config-based hooks,
> but if we're going to add more like this, I think these should be in a
> separate section, hook.virtualFileSystem or something.
> 

That is a great idea.  I don't personally like specifying the hook as
the 'flag' for whether a feature should be used.  I'd rather have it be
a bool (enable the feature? true/false) and either 1) have the hook name
hard-coded (like most existing hooks) or 2), as you suggest, add a
consistent way to have config-based hooks.  Config-based hooks could
also help provide a consistent way to configure them using GIT_TEST_*
environment variables for testing.

> I don't think the superseding makes sense. There's no reason this
> could not be used in combination with $GIT_DIR/info/sparse-checkout.
> If you don't want both, disable the other.
> 
> One last note. Since this is related to the filesystem, shouldn't it be
> part of fsmonitor (the protocol, not the implementation)? Then
> watchman users could use it too.
> 

Getting this to work properly takes a lot more logic than exists in
fsmonitor/watchman.  The challenge is that fsmonitor/watchman 1) is
focused on "what has changed since <time>" and 2) doesn't currently
impact the excludes logic.

If you attempted to use this with watchman, there would be a
chicken-and-egg problem.  The initial git checkout wouldn't write out
_any_ files to the working directory, as none have been modified.  There
would be no way to get them populated so that they could be modified and
added to the list.  Not very useful. :-)
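
A toy illustration of that problem, as a Python sketch with made-up
names (this is not how git or watchman are implemented): if the list
comes from a "changed since <time>" source, it is empty right after a
clone, so every entry is skip-worktree and checkout writes nothing.

```python
from typing import Iterable, List, Set


def paths_checkout_would_write(index_paths: Iterable[str],
                               listed: Set[str]) -> List[str]:
    """Entries not in the list are skip-worktree, so checkout leaves them alone."""
    return [p for p in index_paths if p in listed]


index = ["Makefile", "src/main.c", "src/util.c"]

# Hypothetical "changed since <time>" answer right after a fresh clone:
# nothing has been modified yet, so the list is empty.
changed_since_clone: Set[str] = set()

if __name__ == "__main__":
    # No file is ever written, so no file can ever become "modified".
    print(paths_checkout_would_write(index, changed_since_clone))
```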

This works with VFS for Git because it provides a virtual projection and
will dynamically write out the contents of files in the working
directory as they are read.  It makes it appear that the files are there
and fetches the actual contents on demand, transparently.  If the user
ends up modifying a file, it gets added to the virtual projection
list so that git will start to pay attention to and update that file.

If the files are only read (and not written) by the user, the version on
disk must be maintained by the VFS for Git daemon because git is
completely unaware of them.  That means the daemon must detect when the
git commit changes, remove the contents of all the files that were
read but not written, and start projecting the files from the new commit.
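
The bookkeeping described in the last two paragraphs can be sketched
roughly as follows.  This is an illustrative Python model only; the
class and method names are invented here and say nothing about how the
VFS for Git daemon is actually written.

```python
from typing import List, Set


class ProjectionState:
    """Toy model of the daemon's view of the working directory."""

    def __init__(self) -> None:
        self.hydrated: Set[str] = set()   # read by the user, contents on disk
        self.modified: Set[str] = set()   # written by the user, owned by git

    def on_read(self, path: str) -> None:
        """A read makes the daemon write the file's contents to disk."""
        if path not in self.modified:
            self.hydrated.add(path)

    def on_write(self, path: str) -> None:
        """A write hands the file over to git via the list."""
        self.hydrated.discard(path)
        self.modified.add(path)

    def paths_reported_to_git(self) -> List[str]:
        """What the hook would return: only the modified paths."""
        return sorted(self.modified)

    def on_head_change(self) -> List[str]:
        """Git is unaware of read-only files, so the daemon must drop their
        stale contents itself and project them from the new commit."""
        stale = sorted(self.hydrated)
        self.hydrated.clear()
        return stale


if __name__ == "__main__":
    state = ProjectionState()
    state.on_read("a.c")
    state.on_read("b.c")
    state.on_write("b.c")
    print(state.paths_reported_to_git())
    print(state.on_head_change())
```

Note that a commit change leaves the modified set alone: those files are
in the list, so git itself updates them.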

In short, this is only one small piece of what is necessary to get a 
fully virtualized git repo.  It's an important piece but only one of the 
many pieces.  My reason for submitting this RFC is to start the 
discussion about how interested the community is in enabling repo 
virtualization in the mainline version of git.
