[silly] loose, pack, and another thing?

* [silly] loose, pack, and another thing?
@ 2023-09-27  4:15 Junio C Hamano
  2023-09-27 13:46 ` Christian Couder
  2023-09-28 21:40 ` Jonathan Tan
  0 siblings, 2 replies; 7+ messages in thread
From: Junio C Hamano @ 2023-09-27  4:15 UTC (permalink / raw)
  To: git

Just wondering if it would help to have the third kind of object
representation in the object database, sitting next to loose objects
and packed objects, say .git/objects/verbatim/<hex-object-name> for
the contents and .git/objects/verbatim/<hex-object-name>.type that
records "blob", "tree", "commit", or "tag" (in practice, I would
expect huge "blob" objects would be the only ones that use this
mechanism).

The contents will be stored verbatim without compression and without
any object header (i.e., the usual "<type> <length>\0") and the file
could be "ln"ed (or "cow"ed if the underlying filesystem allows it)
to materialize it in the working tree if needed.

"fsck" needs to be told about how to verify them.  Create the object
header in-core and hash that, followed by the contents of that file,
and make sure the result matches the <hex-object-name> part of the
filename, or something like that.
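
A minimal sketch of the check described above, assuming SHA-1 object names and the proposed (not existing) layout of a headerless contents file plus a sibling ".type" file. The function names are hypothetical and are not part of Git. The length for the synthesized header comes from the filesystem, and the contents are streamed so that a huge blob never has to fit in memory.

```python
import hashlib
import os


def verbatim_object_name(path, obj_type, chunk_size=1 << 20):
    """Hash a headerless file as Git would: "<type> <length>\\0" + contents.

    SHA-1 is an assumption here; a SHA-256 repository would need sha256.
    """
    size = os.path.getsize(path)  # the filesystem supplies the length
    hasher = hashlib.sha1()
    # Synthesize the usual object header in memory and hash it first.
    hasher.update(b"%s %d\0" % (obj_type.encode("ascii"), size))
    # Then hash the file contents, one chunk at a time.
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def verify_verbatim_object(path):
    """Return True if the file name equals the hash of header + contents."""
    with open(path + ".type", "r", encoding="ascii") as f:
        obj_type = f.read().strip()
    return verbatim_object_name(path, obj_type) == os.path.basename(path)
```

Feeding the header and the contents to the hash in separate update calls gives the same result as hashing their concatenation, so no combined buffer is needed.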

* Re: [silly] loose, pack, and another thing?
  2023-09-27  4:15 [silly] loose, pack, and another thing? Junio C Hamano
@ 2023-09-27 13:46 ` Christian Couder
  2023-09-28 21:47   ` Jonathan Tan
  2023-09-28 21:40 ` Jonathan Tan
  1 sibling, 1 reply; 7+ messages in thread
From: Christian Couder @ 2023-09-27 13:46 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Wed, Sep 27, 2023 at 11:47 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Just wondering if it would help to have the third kind of object
> representation in the object database, sitting next to loose objects
> and packed objects, say .git/objects/verbatim/<hex-object-name> for
> the contents and .git/objects/verbatim/<hex-object-name>.type that
> records "blob", "tree", "commit", or "tag" (in practice, I would
> expect huge "blob" objects would be the only ones that use this
> mechanism).

Yeah, I think it could help handle large blobs. I guess it would rely
on the underlying filesystem to store the object size.

> The contents will be stored verbatim without compression and without
> any object header (i.e., the usual "<type> <length>\0") and the file
> could be "ln"ed (or "cow"ed if the underlying filesystem allows it)
> to materialize it in the working tree if needed.
>
> "fsck" needs to be told about how to verify them.  Create the object
> header in-core and hash that, followed by the contents of that file,
> and make sure the result matches the <hex-object-name> part of the
> filename, or something like that.

What happens when they are transferred? Should the remote unpack them
into the same kind of verbatim object?

* Re: [silly] loose, pack, and another thing?
  2023-09-27  4:15 [silly] loose, pack, and another thing? Junio C Hamano
  2023-09-27 13:46 ` Christian Couder
@ 2023-09-28 21:40 ` Jonathan Tan
  2023-10-03 19:09   ` Jeff King
  1 sibling, 1 reply; 7+ messages in thread
From: Jonathan Tan @ 2023-09-28 21:40 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jonathan Tan, git

Junio C Hamano <gitster@pobox.com> writes:
> Just wondering if it would help to have the third kind of object
> representation in the object database, sitting next to loose objects
> and packed objects, say .git/objects/verbatim/<hex-object-name> for
> the contents and .git/objects/verbatim/<hex-object-name>.type that
> records "blob", "tree", "commit", or "tag" (in practice, I would
> expect huge "blob" objects would be the only ones that use this
> mechanism).
> 
> The contents will be stored verbatim without compression and without
> any object header (i.e., the usual "<type> <length>\0") and the file
> could be "ln"ed (or "cow"ed if the underlying filesystem allows it)
> to materialize it in the working tree if needed.

This sounds like a useful feature. We probably would want to use the
"ln" or "cow" every time we use streaming (stream_blob_to_fd() in
streaming.h) currently, so hopefully we won't need to increase the
number of ways in which we can write an object to the worktree (just
change the streaming to write to a filename instead of an fd).
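
As a rough illustration of writing to a filename rather than an fd, the sketch below tries a hard link first and falls back to a plain copy when linking fails (for example, across filesystems). The function name and return values are hypothetical, and a real implementation might attempt a copy-on-write clone before copying; that step is omitted because there is no portable call for it.

```python
import os
import shutil


def materialize(odb_path, worktree_path):
    """Place an object's contents at worktree_path; return the method used."""
    # Remove any existing entry so the link or copy starts from a clean path.
    if os.path.lexists(worktree_path):
        os.unlink(worktree_path)
    try:
        # "ln": both paths now name the same inode, so no data is copied.
        os.link(odb_path, worktree_path)
        return "link"
    except OSError:
        # Linking is unsupported or crosses a filesystem boundary: copy instead.
        shutil.copyfile(odb_path, worktree_path)
        return "copy"
```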

> "fsck" needs to be told about how to verify them.  Create the object
> header in-core and hash that, followed by the contents of that file,
> and make sure the result matches the <hex-object-name> part of the
> filename, or something like that.

Yeah, this sounds like what index-pack is doing - the hash algo can take
the contents of one buffer (a header that we synthesize ourselves), and
then take the contents of another buffer (the file contents).

* Re: [silly] loose, pack, and another thing?
  2023-09-27 13:46 ` Christian Couder
@ 2023-09-28 21:47   ` Jonathan Tan
  0 siblings, 0 replies; 7+ messages in thread
From: Jonathan Tan @ 2023-09-28 21:47 UTC (permalink / raw)
  To: Christian Couder; +Cc: Jonathan Tan, Junio C Hamano, git

Christian Couder <christian.couder@gmail.com> writes:
> > The contents will be stored verbatim without compression and without
> > any object header (i.e., the usual "<type> <length>\0") and the file
> > could be "ln"ed (or "cow"ed if the underlying filesystem allows it)
> > to materialize it in the working tree if needed.
> >
> > "fsck" needs to be told about how to verify them.  Create the object
> > header in-core and hash that, followed by the contents of that file,
> > and make sure the result matches the <hex-object-name> part of the
> > filename, or something like that.
> 
> What happens when they are transferred? Should the remote unpack them
> into the same kind of verbatim object?

I think that the design space is vast and needs to be discussed, perhaps
independently of the local repo case (in which for a start, we could
just detect large blobs being added to the index and put them in our
new object store instead of loose/packed storage, and make sure that we
never repack them). Some concerns during fetch:

- Servers would probably want to serve the large blobs via CDN, so we
probably need something similar to packfile-uris. Would servers also
want to inline these blobs? (If not, we don't need to design this part.)

- Would servers be willing to zlib-compress large blobs (into packfile
format) if the client doesn't support verbatim objects?

And during push:

- Clients probably want to be able to inline large blobs when pushing.
Should it also be possible to specify the large blob via URI, and if
yes, how does the server tell the client what URIs are acceptable?

* Re: [silly] loose, pack, and another thing?
  2023-09-28 21:40 ` Jonathan Tan
@ 2023-10-03 19:09   ` Jeff King
  2023-10-03 21:26     ` Junio C Hamano
  0 siblings, 1 reply; 7+ messages in thread
From: Jeff King @ 2023-10-03 19:09 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: Junio C Hamano, git

On Thu, Sep 28, 2023 at 02:40:10PM -0700, Jonathan Tan wrote:

> Junio C Hamano <gitster@pobox.com> writes:
> > Just wondering if it would help to have the third kind of object
> > representation in the object database, sitting next to loose objects
> > and packed objects, say .git/objects/verbatim/<hex-object-name> for
> > the contents and .git/objects/verbatim/<hex-object-name>.type that
> > records "blob", "tree", "commit", or "tag" (in practice, I would
> > expect huge "blob" objects would be the only ones that use this
> > mechanism).
> > 
> > The contents will be stored verbatim without compression and without
> > any object header (i.e., the usual "<type> <length>\0") and the file
> > could be "ln"ed (or "cow"ed if the underlying filesystem allows it)
> > to materialize it in the working tree if needed.
> 
> This sounds like a useful feature. We probably would want to use the
> "ln" or "cow" every time we use streaming (stream_blob_to_fd() in
> streaming.h) currently, so hopefully we won't need to increase the
> number of ways in which we can write an object to the worktree (just
> change the streaming to write to a filename instead of an fd).

One thing that scares me about a regular "ln" between the worktree and
odb is that you are very susceptible to corrupting the repository by
modifying the worktree file with regular tools. If they do a complete
rewrite and atomic rename (or link) to put the new file in place, that
is OK. But opening the file for appending, or general writing, is bad.

You can get some safety with the immutable attribute (which applies to
the inode itself, and thus any path that hardlinks to it). But setting
that usually requires being root. And it creates other irritations for
normal use (you have to unset it before even removing the hardlink).

It would be nice if there was some portable copy-on-write abstraction we
could rely on, but I don't think there is one.

-Peff
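
The difference between the two ways of modifying a hard-linked worktree file can be shown with a small self-contained sketch. The file names are made up for the demonstration, and it assumes a filesystem that supports hard links.

```python
import os
import tempfile


def demo_hardlink_hazard():
    """Return the "odb" contents after a safe edit and after an unsafe one."""
    with tempfile.TemporaryDirectory() as root:
        odb = os.path.join(root, "odb-object")
        work = os.path.join(root, "worktree-file")
        with open(odb, "wb") as f:
            f.write(b"original\n")
        os.link(odb, work)

        # Safe: write a complete new file, then rename it into place.
        # The worktree path now names a new inode; the odb copy is untouched.
        tmp = work + ".tmp"
        with open(tmp, "wb") as f:
            f.write(b"edited\n")
        os.replace(tmp, work)
        with open(odb, "rb") as f:
            after_rename = f.read()

        # Unsafe: link again, then append in place through the worktree path.
        # Both names share one inode, so the odb copy changes as well.
        os.unlink(work)
        os.link(odb, work)
        with open(work, "ab") as f:
            f.write(b"appended\n")
        with open(odb, "rb") as f:
            after_append = f.read()
    return after_rename, after_append
```

After the in-place append, the stored contents no longer hash to the object name, which is the corruption described above.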

* Re: [silly] loose, pack, and another thing?
  2023-10-03 19:09   ` Jeff King
@ 2023-10-03 21:26     ` Junio C Hamano
  2023-10-04 13:11       ` Jeff King
  0 siblings, 1 reply; 7+ messages in thread
From: Junio C Hamano @ 2023-10-03 21:26 UTC (permalink / raw)
  To: Jeff King; +Cc: Jonathan Tan, git

Jeff King <peff@peff.net> writes:

> One thing that scares me about a regular "ln" between the worktree and
> odb is that you are very susceptible to corrupting the repository by
> modifying the worktree file with regular tools. If they do a complete
> rewrite and atomic rename (or link) to put the new file in place, that
> is OK. But opening the file for appending, or general writing, is bad.

Very true.

> You can get some safety with the immutable attribute (which applies to
> the inode itself, and thus any path that hardlinks to it). But setting
> that usually requires being root. And it creates other irritations for
> normal use (you have to unset it before even removing the hardlink).

As a regular user, "chmod a-w" has the same characteristics (works
at the inode level) but without "cannot remove it" downside.  It
used to be sufficient in RCS and CVS days, though, as a signal that
you are only to look at it without touching it, to "chmod a-w" a
path that is checked out but not for modifying.  Some editors even
offer to do chmod u+w for you when saving, so if we want absolute
safety, it may not be enough.
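
A small sketch of the inode-level behaviour, assuming a POSIX filesystem with hard links; the file names are invented for the demonstration. Clearing the write bits through one name is visible through the other, and the link can still be removed because removal is governed by the directory's permissions rather than the file's.

```python
import os
import stat
import tempfile


def demo_readonly_link():
    """Return (worktree link still writable?, odb file survives unlink?)."""
    with tempfile.TemporaryDirectory() as root:
        odb = os.path.join(root, "odb-object")
        work = os.path.join(root, "worktree-file")
        with open(odb, "wb") as f:
            f.write(b"contents\n")
        os.link(odb, work)

        # Equivalent of "chmod a-w", applied through the odb name only.
        mode = stat.S_IMODE(os.stat(odb).st_mode)
        os.chmod(odb, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))

        # The mode bits live on the shared inode, so the other name sees them.
        work_mode = stat.S_IMODE(os.stat(work).st_mode)
        worktree_writable = bool(work_mode & 0o222)

        # Removing the link needs write access to the directory, not the file.
        os.unlink(work)
        odb_survives = os.path.exists(odb)
    return worktree_writable, odb_survives
```

This only checks the permission bits; as noted above, a program that owns the file can simply restore the write bit before saving.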

> It would be nice if there was some portable copy-on-write abstraction we
> could rely on, but I don't think there is one.

;-)

* Re: [silly] loose, pack, and another thing?
  2023-10-03 21:26     ` Junio C Hamano
@ 2023-10-04 13:11       ` Jeff King
  0 siblings, 0 replies; 7+ messages in thread
From: Jeff King @ 2023-10-04 13:11 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jonathan Tan, git

On Tue, Oct 03, 2023 at 02:26:41PM -0700, Junio C Hamano wrote:

> > You can get some safety with the immutable attribute (which applies to
> > the inode itself, and thus any path that hardlinks to it). But setting
> > that usually requires being root. And it creates other irritations for
> > normal use (you have to unset it before even removing the hardlink).
> 
> As a regular user, "chmod a-w" has the same characteristics (works
> at the inode level) but without "cannot remove it" downside.  It
> used to be sufficient in RCS and CVS days, though, as a signal that
> you are only to look at it without touching it, to "chmod a-w" a
> path that is checked out but not for modifying.  Some editors even
> offer to do chmod u+w for you when saving, so if we want absolute
> safety, it may not be enough.

Ah, right. For some reason I was thinking that only affected the link
entry, but of course the mode bits are on the linked inode itself.  So
that does easily give some protection, though I agree that many programs
are happy to circumvent it for you.

It has been a long time since I've used it, but I think there may be
some prior art in git-annex:

  https://git-annex.branchable.com/

IIRC it can work in a "copy" mode where contents are copied into the
working tree. But since the point is to deal with large data sets, it
also has a linking mode (maybe even symlinks?) that points directly from
the working tree into the annex storage. If we are considering a similar
feature, we might be able to learn from their experience.

-Peff
