git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Felipe Contreras <felipe.contreras@gmail.com>
To: Linus Torvalds <torvalds@linux-foundation.org>,
	ZheNing Hu <adlternative@gmail.com>
Cc: Jeff King <peff@peff.net>, Taylor Blau <me@ttaylorr.com>,
	Junio C Hamano <gitster@pobox.com>,
	Git List <git@vger.kernel.org>,
	johncai86@gmail.com
Subject: Re: [Question] Can git cat-file have a type filtering option?
Date: Sun, 16 Apr 2023 06:06:02 -0600	[thread overview]
Message-ID: <643be4aa2bdf9_f43d294d9@chronos.notmuch> (raw)
In-Reply-To: <CAHk-=wjr-CMLX2Jo2++rwcv0VNr+HmZqXEVXNsJGiPRUwNxzBQ@mail.gmail.com>

Linus Torvalds wrote:
> On Fri, Apr 14, 2023 at 5:17 AM ZheNing Hu <adlternative@gmail.com> wrote:
> >
> > Jeff King <peff@peff.net> 于2023年4月14日周五 15:30写道:
> > >
> > > On Wed, Apr 12, 2023 at 05:57:02PM +0800, ZheNing Hu wrote:
> > > >
> > > > I'm still puzzled why git calculated the object id based on {type, size, data}
> > > >  together instead of just {data}?
> > >
> > > You'd have to ask Linus for the original reasoning. ;)
> 
> I originally thought of the git object store as "tagged pointers".
> 
> That actually caused confusion initially when I tried to explain this
> to SCM people, because "tag" means something very different in an SCM
> environment than it means in computer architecture.
> 
> And the implication of a tagged pointer is that you have two parts of
> it - the "tag" and the "address". Both are relevant at all points.
> 
> This isn't quite as obvious in everyday moden git usage, because a lot
> of uses end up _only_ using the "address" (aka SHA1), but it's very
> much part of the object store design. Internally, the object layout
> never uses just the SHA1, it's all "type:SHA1", even if sometimes the
> types are implied (ie the tree object doesn't spell out "blob", but
> it's still explicit in the mode bits).
> 
> This is very very obvious in "git cat-file", which was one of the
> original scripts in the first commit (but even there the tag/type has
> changed meaning over time: the very first version didn't use it as
> input at all, then it started verifying it, and then later it got the
> more subtle context of "peel the tags until you find this type").
> 
> You can also see this in the original README (again, go look at that
> first git commit): the README talks about the "tag of their type".
> 
> Of course, in practice git then walked away from having to specify the
> type all the time. It started even in that original release, in that
> the HEAD file never contained the type - because it was implicit (a
> HEAD is always a commit).
> 
> So we ended up having a lot of situations like that where the actual
> tag part was implicit from context, and these days people basically
> never refer to the "full" object name with tag, but only the SHA1
> address.
> 
> So now we have situations where the type really has to be looked up
> dynamically, because it's not explicitly encoded anywhere. While HEAD
> is supposed to always be a commit, other refs can be pretty much
> anything, and can point to a tag object, a commit, a tree or a blob.
> So then you actually have to look up the type based on the address.
> 
> End result: these days people don't even think of git objects as
> "tagged pointers".  Even internally in git, lots of code just passes
> the "object name" along without any tag/type, just the raw SHA1 / OID.
> 
> So that originally "everything is a tagged pointer" is much less true
> than it used to be, and now, instead of having tagged pointers, you
> mostly end up with just "bare pointers" and look up the type
> dynamically from there.
> 
> And that "look up the type in the object" is possible because even
> originally, I did *not* want any kind of "object type aliasing".
> 
> So even when looking up the object with the full "tag:pointer", the
> encoding of the object itself then also contains that object type, so
> that you can cross-check that you used the right tag.
> 
> That said, you *can* see some of the effects of this "tagged pointers"
> in how the internals do things like
> 
>     struct commit *commit = lookup_commit(repo, &oid);
> 
> which conceptually very much is about tagged pointers. And the fact
> that two objects cannot alias is actually somewhat encoded in that: a
> "struct commit" contains a "struct object" as a member. But so does
> "struct blob" - and the two "struct object" cases are never the same
> "object".
> 
> So there's never any worry about "could blob.object be the same object
> as commit.object"?
> 
> That is actually inherent in the code, in how "lookup_commit()"
> actually does lookup_object() and then does object_as_type(OBJ_COMMIT)
> on the result.

This explains rather well why the object type is used in the calculation, and
it makes sense.

But I don't see anything about the object size. Isn't that unnecessary?

-- 
Felipe Contreras

  reply	other threads:[~2023-04-16 12:06 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-04-07 14:24 [Question] Can git cat-file have a type filtering option? ZheNing Hu
2023-04-07 16:30 ` Junio C Hamano
2023-04-08  6:27   ` ZheNing Hu
2023-04-09  1:28     ` Taylor Blau
2023-04-09  2:19       ` Taylor Blau
2023-04-09  2:26         ` Taylor Blau
2023-04-09  6:51           ` ZheNing Hu
2023-04-10 20:01             ` Jeff King
2023-04-10 23:20               ` Taylor Blau
2023-04-09  6:47       ` ZheNing Hu
2023-04-10 20:14         ` Jeff King
2023-04-11 14:09           ` ZheNing Hu
2023-04-12  7:43             ` Jeff King
2023-04-12  9:57               ` ZheNing Hu
2023-04-14  7:30                 ` Jeff King
2023-04-14 12:17                   ` ZheNing Hu
2023-04-14 15:58                     ` Junio C Hamano
2023-04-16 11:15                       ` ZheNing Hu
2023-04-14 17:04                     ` Linus Torvalds
2023-04-16 12:06                       ` Felipe Contreras [this message]
2023-04-16 12:43                       ` ZheNing Hu
2023-04-09  1:26   ` Taylor Blau
2023-04-09  1:23 ` Taylor Blau

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=643be4aa2bdf9_f43d294d9@chronos.notmuch \
    --to=felipe.contreras@gmail.com \
    --cc=adlternative@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=johncai86@gmail.com \
    --cc=me@ttaylorr.com \
    --cc=peff@peff.net \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).