From: Patrick Steinhardt <ps@pks.im>
To: Julia Evans <julia@jvns.ca>
Cc: Julia Evans <gitgitgadget@gmail.com>, git@vger.kernel.org
Subject: Re: [PATCH] doc: add a explanation of Git's data model
Date: Wed, 8 Oct 2025 06:18:11 +0200
Message-ID: <aOXmA5L5LsUuXWEh@pks.im>
In-Reply-To: <dbf0727d-66bf-4698-aa21-d69da86027c3@app.fastmail.com>
On Tue, Oct 07, 2025 at 02:55:37PM -0400, Julia Evans wrote:
> On Tue, Oct 7, 2025, at 10:32 AM, Patrick Steinhardt wrote:
> > On Fri, Oct 03, 2025 at 05:34:36PM +0000, Julia Evans via GitGitGadget wrote:
> >> diff --git a/Documentation/gitdatamodel.adoc b/Documentation/gitdatamodel.adoc
> >> new file mode 100644
> >> index 0000000000..4b2cb167dc
> >> --- /dev/null
> >> +++ b/Documentation/gitdatamodel.adoc
[snip]
> >> +[[tree]]
> >> +trees::
> >> + A tree is how Git represents a directory. It lists, for each item in
> >> + the tree:
> >> ++
> >> +1. The *permissions*, for example `100644`
> >
> > I think we should rather call these "mode bits". These bits are
> > permissions indeed when you have a blob, but for subtrees, symlinks and
> > submodules they aren't.
>
> I think it's a bit strange to call them mode bits since I thought they were stored
> as ASCII strings and it's basically an enum of 5 options, but I see your point.
> I think "file mode" will work and that's used elsewhere.
>
> I wonder if it would make sense to list all of the possible file modes if
> this isn't documented anywhere else, my impression is that it's a short
> list and that it's unlikely to change much in the future.

Agreed, that seems reasonable to me.

> And listing them all might make it more clear that Git's file modes don't
> have much in common with Unix file modes.
> I looked for where this is documented and it looks like the only place is
> in `man git-fast-import` . That man page says that there are just 5 options
> (040000, 160000, 100644, 100755, 120000)
>
> >> +2. The *type*: either <<blob,`blob`>> (a file), `tree` (a directory),
> >> + or <<commit,`commit`>> (a Git submodule)
> >
> > There's also symlinks.
>
> I created a test symlink and it looks like symlinks are stored as type "blob".
> I might say which type corresponds to which file mode,
> though I'm not sure what type corresponds to the "gitlink" mode (commit?).

Yeah, gitlinks are used for submodules. They point to an object ID that
refers to a commit in the submodule itself.

> I think these are the 5 modes and what they mean / what type they
> should have. Not sure about the gitlink mode though.
>
> - `100644`: regular file (with type `blob`)
> - `100755`: executable file (with type `blob`)
> - `120000`: symbolic link (with type `blob`)
> - `040000`: directory (with type `tree`)
> - `160000`: gitlink, for use with submodules (with type `commit`)

This list looks good to me. Gitlinks are somewhat special given that
they refer to a commit stored in the submodule repository, not in the
repository that has the gitlink. But the expectation is indeed that the
object name always resolves to a commit.

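To make the mode/type pairing above concrete, here is a rough Python sketch that parses the uncompressed body of a tree object and labels each entry. It assumes SHA-1 object IDs (20 raw bytes per entry); the names `TREE_ENTRY_MODES` and `parse_tree` are made up for this illustration and are not part of Git.

```python
# Mode -> (meaning, object type), copied from the list in this thread.
TREE_ENTRY_MODES = {
    "100644": ("regular file", "blob"),
    "100755": ("executable file", "blob"),
    "120000": ("symbolic link", "blob"),
    "040000": ("directory", "tree"),
    "160000": ("gitlink (submodule)", "commit"),
}


def parse_tree(body: bytes, hash_len: int = 20):
    """Parse a tree object's body (the part after the "tree <size>" header).

    Each entry is laid out as: ASCII mode, space, name, NUL byte, raw object ID.
    hash_len=20 assumes a SHA-1 repository.
    """
    entries = []
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)
        nul = body.index(b"\x00", space)
        # Raw trees store the directory mode without its leading zero
        # ("40000"), so pad back to six digits before the lookup.
        mode = body[pos:space].decode("ascii").zfill(6)
        name = body[space + 1:nul].decode("utf-8", "surrogateescape")
        oid = body[nul + 1:nul + 1 + hash_len].hex()
        _meaning, obj_type = TREE_ENTRY_MODES[mode]
        entries.append((mode, obj_type, oid, name))
        pos = nul + 1 + hash_len
    return entries


if __name__ == "__main__":
    # Two hand-built entries with dummy object IDs.
    demo = (b"100644 README\x00" + bytes(20)
            + b"40000 src\x00" + b"\x01" * 20)
    for entry in parse_tree(demo):
        print(*entry)
```

The tuple order (mode, type, object ID, name) mirrors the columns that `git ls-tree` prints.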
[snip]
> >> +[[blob]]
> >> +blobs::
> >> + A blob is how Git represents a file. A blob object contains the
> >> + file's contents.
> >> ++
> >> +Storing a new blob for every new version of a file can get big, so
> >> +`git gc` periodically compresses objects for efficiency in `.git/objects/pack`.
> >
> > I would claim that it's not necessary to mention object compression.
> > This should be a low-level detail that users don't ever have to worry
> > about. Furthermore, packing objects isn't only relevant in the context
> > of blobs: trees for example also tend to compress very well as there
> > typically is only small incremental updates to trees.
>
> I discussed why I think this important in another reply,
> https://lore.kernel.org/all/51e0a55c-1f1d-4cae-9459-8c2b9220e52d@app.fastmail.com/,
> will paste what I said here. I'll think about this more though.
>
> paste follows:
>
> That's true! The reason I think this is important to mention is that I find
> that people often "reject" information that they find implausible, even
> if it comes from a credible source. ("that can't be true! I must be
> not understanding correctly. Oh well, I'll just ignore that!")
>
> I sometimes hear from users that "commits can't be snapshots", because
> it would take up too much disk space to store every version of
> every commit. So I find that sometimes explaining a little bit about the
> implementation can make the information more memorable.
>
> Certainly I'm not able to remember details that don't make sense
> with my mental model of how computers work and I don't expect other
> people to either, so I think it's important to give an explanation that
> handles the biggest "objections".

Hm, fair I guess. In any case, if we want to mention this I'd leave out
the details of how exactly Git achieves this. E.g. we could say something
like:

    Storing a new blob for every new version of a file can result in a
    lot of duplication. Git regularly runs repository maintenance to
    counteract this. Part of the maintenance involves compression of
    objects, where incremental changes to the same object are stored
    as deltas only.

We skip over the details, but this should give enough pointers to an
interested reader to go dig deeper. We could also generalize this to
all objects, not only blobs.

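For readers who want some intuition for "stored as deltas", here is a toy Python sketch. It is not Git's packfile delta encoding: it only borrows the general idea of describing a new version as "copy this range from the base" and "insert these literal bytes" instructions, and it uses `difflib` to find the common ranges. `make_delta` and `apply_delta` are hypothetical names.

```python
from difflib import SequenceMatcher


def make_delta(base: bytes, target: bytes):
    """Describe `target` as copy/insert instructions relative to `base`."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Bytes shared with the base: record only offset and length.
            ops.append(("copy", i1, i2 - i1))
        elif tag in ("replace", "insert"):
            # Bytes that exist only in the target: store them literally.
            ops.append(("insert", target[j1:j2]))
        # "delete" needs no instruction: those base bytes are never copied.
    return ops


def apply_delta(base: bytes, ops) -> bytes:
    """Rebuild the target from the base and the instruction list."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


if __name__ == "__main__":
    v1 = b"".join(b"line %d\n" % i for i in range(50))
    v2 = v1.replace(b"line 25\n", b"line twenty-five\n")
    delta = make_delta(v1, v2)
    literal = sum(len(op[1]) for op in delta if op[0] == "insert")
    print(len(v2), "bytes in the new version,", literal, "stored literally")
    assert apply_delta(v1, delta) == v2
```

When only one line changes, almost all of the new version is expressed as copies from the old one, which is why many similar versions of an object need not cost much space.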
[snip]
> >> +[[HEAD]]
> >> +HEAD: `.git/HEAD`::
> >> + `HEAD` is where Git stores your current <<branch,branch>>.
> >> + `HEAD` is normally a symbolic reference to your current branch, for
> >> + example `ref: refs/heads/main` if your current branch is `main`.
> >> + `HEAD` can also be a direct reference to a commit ID,
> >> + that's called "detached HEAD state".
> >> +
> >> +[[remote-tracking-branch]]
> >> +remote tracking branches: `.git/refs/remotes/<remote>/<branch>`::
> >> + A remote-tracking branch is a name for a commit ID.
> >> + It's how Git stores the last-known state of a branch in a remote
> >> + repository. `git fetch` updates remote-tracking branches. When
> >> + `git status` says "you're up to date with origin/main", it's looking at
> >> + this.
> >
> > This misses "refs/remotes/<remote>/HEAD". This reference is a symbolic
> > reference that indicates the default branch on the remote side.
>
> Is "refs/remotes/<remote>/HEAD" a remote-tracking branch?
> I've never thought about that reference and I'm not sure what to call it.

No, it's not. I think the term we use is "remote reference".

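As a small illustration of the symbolic versus direct distinction that applies to both `HEAD` and `refs/remotes/<remote>/HEAD`, here is a Python sketch that classifies the text of a loose ref file. It assumes the "files" reference backend, where a symbolic ref is a file whose content starts with `ref:`; `parse_ref_file` is a hypothetical helper, and the remote and branch names are examples only.

```python
def parse_ref_file(content: str):
    """Classify the content of a loose ref file such as .git/HEAD.

    Returns ("symbolic", <target ref name>) or ("direct", <object ID>).
    """
    text = content.strip()
    if text.startswith("ref:"):
        return ("symbolic", text[len("ref:"):].strip())
    return ("direct", text)


if __name__ == "__main__":
    # HEAD while on branch "main".
    print(parse_ref_file("ref: refs/heads/main\n"))
    # A remote's HEAD pointing at that remote's default branch.
    print(parse_ref_file("ref: refs/remotes/origin/main\n"))
    # Detached HEAD: the file holds an object ID directly.
    print(parse_ref_file("e69de29bb2d1d6434b8b29ae775ad8c2e48c5391\n"))
```

In practice `git symbolic-ref` and `git rev-parse` are the supported ways to ask these questions; reading the files directly does not work with other ref storage formats.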
> >> +[[other-refs]]
> >> +Other references::
> >> + Git tools may create references in any subdirectory of `.git/refs`.
> >> + For example, linkgit:git-stash[1], linkgit:git-bisect[1],
> >> + and linkgit:git-notes[1] all create their own references
> >> + in `.git/refs/stash`, `.git/refs/bisect`, etc.
> >> + Third-party Git tools may also create their own references.
> >> ++
> >> +Git may also create references in the base `.git` directory
> >> +other than `HEAD`, like `ORIG_HEAD`.
> >
> > Let's mention that such references are typically spelt all-uppercase
> > with underscores between. You shouldn't ever create a reference that is
> > for example called ".git/foo".
> >
> > We enforce this restriction inconsistently, only, but I don't think that
> > should keep us from spelling out the common rule.
>
> That makes sense. I'm also not sure whether third-party
> Git tools are "supposed" to create references outside of "refs/",
> or whether that's common.

They really shouldn't, and to the best of my knowledge they don't. There
is only a rather limited number of root references with very specific
use cases. And nowadays we have also tightened the meaning of pseudo
refs, of which there are only two ("FETCH_HEAD" and "MERGE_HEAD").

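The naming convention described above (all-uppercase words joined by underscores, plus the two pseudo refs) can be sketched in Python as follows. This only mirrors the rule of thumb from this thread; it is not the check Git itself performs, and `classify_root_name` is a made-up helper.

```python
import re

# Convention from this thread: uppercase words separated by underscores.
ROOT_REF_STYLE = re.compile(r"[A-Z]+(_[A-Z]+)*")

# The two pseudo refs named in this thread.
PSEUDO_REFS = {"FETCH_HEAD", "MERGE_HEAD"}


def classify_root_name(name: str) -> str:
    """Classify a name found directly in the top-level .git directory."""
    if name in PSEUDO_REFS:
        return "pseudo ref"
    if ROOT_REF_STYLE.fullmatch(name):
        return "root-ref style"
    return "not root-ref style"


if __name__ == "__main__":
    for name in ("HEAD", "ORIG_HEAD", "FETCH_HEAD", "foo"):
        print(name, "->", classify_root_name(name))
```

A name such as `foo` fails the check, matching the advice that nothing should create a reference called `.git/foo`.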
[snip]
> >> +[[reflogs]]
> >> +REFLOGS
> >> +-------
> >> +
> >> +Git stores the history of branch, tag, and HEAD refs in a reflog
> >> +(you should read "reflog" as "ref log"). Not every ref is logged by
> >> +default, but any ref can be logged.
> >
> > If we mention this here, do we maybe want to mention how the user can
> > decide which references are logged?
>
> Do you mean by using the setting `core.logAllRefUpdates`?

Yeah. Otherwise the reader won't have any pointers to figure out _how_
they can change this. I don't think we have a man page that provides a
better overview than this configuration.

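For reference, here is a Python sketch of how `core.logAllRefUpdates` decides whether a missing reflog is created automatically. It paraphrases the description of that setting in git-config(1) as I understand it and is not taken from Git's source; `reflog_autocreated` is a hypothetical name.

```python
def reflog_autocreated(refname: str, log_all_ref_updates: str) -> bool:
    """Would a missing reflog be created automatically for this ref?

    `log_all_ref_updates` is the value of core.logAllRefUpdates:
    "true", "always", or "false".
    """
    if log_all_ref_updates == "always":
        # Any ref under refs/ (and HEAD) gets a reflog.
        return refname == "HEAD" or refname.startswith("refs/")
    if log_all_ref_updates == "true":
        # Only branches, remote refs, note refs, and HEAD.
        logged = ("refs/heads/", "refs/remotes/", "refs/notes/")
        return refname == "HEAD" or refname.startswith(logged)
    return False


if __name__ == "__main__":
    for ref in ("HEAD", "refs/heads/main", "refs/tags/v1.0"):
        print(ref, reflog_autocreated(ref, "true"),
              reflog_autocreated(ref, "always"))
```

Under this reading, a tag gets no reflog with the value `true`, and switching to `always` (for example with `git config core.logAllRefUpdates always`) is what turns logging on for every ref.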
Thanks!
Patrick