git.vger.kernel.org archive mirror
* read-only working copies using links
@ 2009-01-24  9:17 Chad Dombrova
  2009-01-24 11:02 ` Sverre Rabbelier
  0 siblings, 1 reply; 6+ messages in thread
From: Chad Dombrova @ 2009-01-24  9:17 UTC (permalink / raw)
  To: git

hi all,

there's a major feature for working with large binaries that has not  
yet been addressed by git:  the ability to check out a file as a  
symbolic/hard link to a blob in the repository, instead of duplicating  
the file into the working copy.

imagine a scenario where one user is putting large binary files into a  
git repo on a networked server.  100 other users on the server need  
read-only access to this repo.  they clone the repo using --shared or  
--local, which saves disk space for the object files, but each of  
these 100 working copies also creates copies of all the binary files  
at the HEAD revision. it would be 100x as efficient in both disk space  
and checkout speeds if, in place of these files, symbolic or hard  
links were made to the blob files in .git/objects.

the crux of the issue is that the blob objects would have to be stored  
as exact copies of the original files.  it would seem there are two  
things that currently prevent this from happening.  1) blobs are  
stored with compression and 2) they include a small header.   
compression can be disabled by setting core.loosecompression to 0, so  
that seems like less of an issue.  as for the header, wouldn't it be  
possible to store it separately?  in other words, store two files per  
blob directory, a small stub file with the header info and the  
unaltered file data.
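
For illustration, the two obstacles can be seen in a minimal Python sketch
of how a loose blob is laid out: a "blob <size>\0" header is prepended, the
object name is the SHA-1 of header plus content, and the result is passed
through zlib. The function name encode_loose_blob is hypothetical, and its
level argument only stands in for core.loosecompression; this is a sketch
of the format, not git's own code.

```python
import hashlib
import zlib


def encode_loose_blob(content: bytes, level: int = 0):
    """Return (object id, on-disk bytes) for a blob in loose-object form.

    `level` is an illustrative stand-in for core.loosecompression.
    """
    # Obstacle 2: a small header is stored in front of the file data.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    store = header + content
    # The object is named after the hash of header + content.
    oid = hashlib.sha1(store).hexdigest()
    # Obstacle 1: the whole thing goes through zlib before hitting disk.
    return oid, zlib.compress(store, level)


content = b"pretend this is a large binary\n"
oid, on_disk = encode_loose_blob(content, level=0)
print(oid)
print(on_disk == content)   # is the stored file a byte-for-byte copy?
print(content in on_disk)   # is the payload at least embedded unmodified?
```

Even at level 0 the zlib stream carries its own framing bytes, so the file
on disk is still not an exact copy of the original; the payload is only
embedded inside it, after the header.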

what are the caveats to a system like this?  has anyone looked into  
this before?

-chad

p.s.
i tried submitting a post through nabble a few days ago and it said that  
it was still pending, so i thought i'd try submitting directly to the  
mailing list.  sorry if i end up double-posting.


* Re: read-only working copies using links
  2009-01-24  9:17 read-only working copies using links Chad Dombrova
@ 2009-01-24 11:02 ` Sverre Rabbelier
  2009-01-24 18:39   ` Chad Dombrova
  0 siblings, 1 reply; 6+ messages in thread
From: Sverre Rabbelier @ 2009-01-24 11:02 UTC (permalink / raw)
  To: Chad Dombrova, Tim 'Mithro' Ansell; +Cc: git

Heya,

On Sat, Jan 24, 2009 at 10:17, Chad Dombrova <chadrik@gmail.com> wrote:
> the crux of the issue is that the blob objects would have to be stored as
> exact copies of the original files.  it would seem there are two things that
> currently prevent this from happening.  1) blobs are stored with compression
> and 2) they include a small header.  compression can be disabled by setting
> core.loosecompression to 0, so that seems like less of an issue.  as for the
> header, wouldn't it be possible to store it separately?  in other words,
> store two files per blob directory, a small stub file with the header info
> and the unaltered file data.

I think Tim Ansell (cced) was talking about this at the gittogether
(storing the metadata separately), as it would benefit sparse/narrow
checkout, another advantage supporting his case?

-- 
Cheers,

Sverre Rabbelier


* Re: read-only working copies using links
  2009-01-24 11:02 ` Sverre Rabbelier
@ 2009-01-24 18:39   ` Chad Dombrova
  2009-01-24 18:43     ` Sverre Rabbelier
  2009-01-24 19:34     ` Jeff King
  0 siblings, 2 replies; 6+ messages in thread
From: Chad Dombrova @ 2009-01-24 18:39 UTC (permalink / raw)
  To: Sverre Rabbelier; +Cc: Tim 'Mithro' Ansell, git

>
> I think Tim Ansell (cced) was talking about this at the gittogether
> (storing the metadata separately), as it would benefit sparse/narrow
> checkout, another advantage supporting his case?
>

what's the case against it, other than the obvious, that it will take  
more work?


-chad


* Re: read-only working copies using links
  2009-01-24 18:39   ` Chad Dombrova
@ 2009-01-24 18:43     ` Sverre Rabbelier
  2009-01-24 19:35       ` Jeff King
  2009-01-24 19:34     ` Jeff King
  1 sibling, 1 reply; 6+ messages in thread
From: Sverre Rabbelier @ 2009-01-24 18:43 UTC (permalink / raw)
  To: Chad Dombrova; +Cc: Tim 'Mithro' Ansell, git

On Sat, Jan 24, 2009 at 19:39, Chad Dombrova <chadrik@gmail.com> wrote:
> what's the case against it, other than the obvious, that it will take more
> work?

Good question, I think it was mostly that, someone has to implement it
(possibly as part of packv4). Backwards compatibility is of course
always a concern, but I'm not too familiar with the subject, perhaps
other people on the list (or even those who were at the gittogether) can
comment?

-- 
Cheers,

Sverre Rabbelier


* Re: read-only working copies using links
  2009-01-24 18:39   ` Chad Dombrova
  2009-01-24 18:43     ` Sverre Rabbelier
@ 2009-01-24 19:34     ` Jeff King
  1 sibling, 0 replies; 6+ messages in thread
From: Jeff King @ 2009-01-24 19:34 UTC (permalink / raw)
  To: Chad Dombrova; +Cc: Sverre Rabbelier, Tim 'Mithro' Ansell, git

On Sat, Jan 24, 2009 at 10:39:46AM -0800, Chad Dombrova wrote:

>> I think Tim Ansell (cced) was talking about this at the gittogether
>> (storing the metadata separately), as it would benefit sparse/narrow
>> checkout, another advantage supporting his case?
>
> what's the case against it, other than the obvious, that it will take  
> more work?

I'm not sure this is actually the same as Tim's proposal. Tim wanted to
store the commit and tree information separately from the blob
information (since his use case was that blobs are enormous, but the
rest is reasonable).

AIUI, Chad's proposal is about storing the actual blob data itself
separate from the blob object's metadata (i.e., its object type and
length headers). Which means that the normal loose object format is not
acceptable, and you would end up with something like (for example):

  .git/objects/pack/pack-full-of-your-regular-stuff.{pack,idx}
  .git/objects/[0-9a-f]{2}/[0-9a-f]{38}/header
  .git/objects/[0-9a-f]{2}/[0-9a-f]{38}/data

or something similar. Then you could hardlink directly to the 'data'
portion. So you would need:

  - to teach everything that ever looks for loose objects how to read
    this new format. In theory, it's all nicely encapsulated in
    sha1_file.c

  - to teach checkout routines to hardlink such a case instead of
    copying the file
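
The layout and the two steps above can be sketched roughly as follows, in
Python. This is only an illustration under stated assumptions: the helper
names store_split_blob and checkout_by_link are hypothetical, the object
name is assumed to remain the SHA-1 of header plus content, and the object
store and working tree are assumed to sit on one filesystem that supports
hard links. None of this exists in git.

```python
import hashlib
import os
import tempfile


def store_split_blob(objects_dir: str, content: bytes) -> str:
    """Write a blob as <xx>/<38 hex>/header plus <xx>/<38 hex>/data."""
    header = b"blob %d\0" % len(content)
    # Assumption: the name is still the hash of header + content.
    oid = hashlib.sha1(header + content).hexdigest()
    obj_dir = os.path.join(objects_dir, oid[:2], oid[2:])
    data_path = os.path.join(obj_dir, "data")
    if not os.path.exists(data_path):
        os.makedirs(obj_dir, exist_ok=True)
        with open(os.path.join(obj_dir, "header"), "wb") as f:
            f.write(header)
        with open(data_path, "wb") as f:
            f.write(content)  # unaltered file data, no header, no zlib
        os.chmod(data_path, 0o444)  # discourage edits through links
    return oid


def checkout_by_link(objects_dir: str, oid: str, dest: str) -> None:
    """'Check out' a blob by hard-linking its data file into the tree."""
    data_path = os.path.join(objects_dir, oid[:2], oid[2:], "data")
    os.link(data_path, dest)


with tempfile.TemporaryDirectory() as tmp:
    objects = os.path.join(tmp, "objects")
    oid = store_split_blob(objects, b"big binary payload")
    for n in range(3):  # three working copies, one copy of the data
        checkout_by_link(objects, oid, os.path.join(tmp, "copy%d.bin" % n))
    data = os.path.join(objects, oid[:2], oid[2:], "data")
    print(os.stat(data).st_nlink)  # link count on the shared data file
```

Making the data file read-only is a courtesy rather than a guarantee; it
does not stop a tool that changes the mode or runs with enough privileges.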

The obvious downsides that I can think of are:

  - it has the potential to make object reading, which is a core part of
    git (read: very performance- and correctness-sensitive) a lot more
    complex. But maybe the implementation would not be that painful;
    somebody would have to look very closely to see.

  - it interacts badly with smudge/clean filters and crlf conversion.
    In those cases you can't hardlink. If you treat this like an
    optimization, though, it's not so bad: we only do the optimization
    when we _can_, and fall back to regular checkout if those other
    options are in effect.

  - it's somewhat dangerous to your repository's health. Git's model is
    that object files are immutable (since they are, after all, named
    after their contents). But now you are linking them into your
    working tree, which makes them susceptible to some third party tool
    munging them. So yes, most tools will probably behave, but any tool
    that misbehaves will actually corrupt your repository.
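
The last point can be made concrete with a small Python sketch. It assumes
a hard-linked "data" file as in the layout above and uses hypothetical
helper names (blob_id, object_is_intact, run_demo); it contrasts a tool
that replaces the file by rename with one that writes in place.

```python
import hashlib
import os
import tempfile


def blob_id(content: bytes) -> str:
    """Object name of a blob: SHA-1 over 'blob <size>\\0' + content."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def object_is_intact(data_path: str, expected_id: str) -> bool:
    """Re-hash the stored data and compare it with the object's name."""
    with open(data_path, "rb") as f:
        return blob_id(f.read()) == expected_id


def run_demo() -> dict:
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        content = b"original bytes\n"
        oid = blob_id(content)
        data = os.path.join(tmp, "data")  # stands in for the object file
        with open(data, "wb") as f:
            f.write(content)

        # Well-behaved tool: writes a new file and renames it into place.
        # The working-tree name now points at a new inode; the object
        # file is untouched.
        polite = os.path.join(tmp, "polite.bin")
        os.link(data, polite)
        scratch = polite + ".tmp"
        with open(scratch, "wb") as f:
            f.write(b"edited\n")
        os.replace(scratch, polite)
        results["after_rename"] = object_is_intact(data, oid)

        # Misbehaving tool: opens the linked file and writes in place.
        # Both names share one inode, so the object itself changes.
        rude = os.path.join(tmp, "rude.bin")
        os.link(data, rude)
        with open(rude, "ab") as f:
            f.write(b"munged\n")
        results["after_in_place_write"] = object_is_intact(data, oid)
    return results


print(run_demo())
```

The first check should still pass and the second should fail: after the
in-place write, the object's contents no longer hash to its name.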

-Peff


* Re: read-only working copies using links
  2009-01-24 18:43     ` Sverre Rabbelier
@ 2009-01-24 19:35       ` Jeff King
  0 siblings, 0 replies; 6+ messages in thread
From: Jeff King @ 2009-01-24 19:35 UTC (permalink / raw)
  To: Sverre Rabbelier; +Cc: Chad Dombrova, Tim 'Mithro' Ansell, git

On Sat, Jan 24, 2009 at 07:43:20PM +0100, Sverre Rabbelier wrote:

> On Sat, Jan 24, 2009 at 19:39, Chad Dombrova <chadrik@gmail.com> wrote:
> > what's the case against it, other than the obvious, that it will take more
> > work?
> 
> Good question, I think it was mostly that, someone has to implement it
> (possibly as part of packv4). Backwards compatibility is of course
> always a concern, but I'm not too familiar with the subject, perhaps
> other people on the list (or even those who were at the gittogether) can
> comment?

If I understand his proposal correctly, such objects must _not_ be part
of a pack. The whole idea is splitting them _more_, not less.

-Peff


