Re: [PATCH v3] doc: add a explanation of Git's data model


From: Junio C Hamano <gitster@pobox.com>
To: "Julia Evans via GitGitGadget" <gitgitgadget@gmail.com>
Cc: git@vger.kernel.org,
	 Kristoffer Haugsbakk <kristofferhaugsbakk@fastmail.com>,
	 "D. Ben Knoble" <ben.knoble@gmail.com>,
	 Patrick Steinhardt <ps@pks.im>,  Julia Evans <julia@jvns.ca>
Subject: Re: [PATCH v3] doc: add a explanation of Git's data model
Date: Wed, 15 Oct 2025 12:58:58 -0700
Message-ID: <xmqqv7kgszr1.fsf@gitster.g>
In-Reply-To: <pull.1981.v3.git.1760476346040.gitgitgadget@gmail.com> (Julia Evans via GitGitGadget's message of "Tue, 14 Oct 2025 21:12:26 +0000")

"Julia Evans via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +[[commit]]
> +commits::
> +    A commit contains these required fields
> +    (though there are other optional fields):
> ++
> +1. All the *files* in the commit, stored as the *<<tree,tree>>* ID of
> +   the commit's base directory.

"all the files' exact contents at the time of the commit" is what we
mean here, and once readers know what a tree is, the above sentence
would be understood as such, but "All the files" felt somewhat
fuzzy.  I wonder if presenting objects in bottom-up fashion makes it
easier to see?  Learn that a blob records exact content of a file,
then learn that a tree records the set of paths with exact contents
stored at these paths, and after that, learn that a commit records a
tree, hence a snapshot of the whole set of contents.  I dunno...
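
To illustrate the bottom-up order, here is a rough Python sketch (not
Git's own code) that derives a blob ID, then a tree ID, then a commit
ID.  It assumes a SHA-1 repository; the file name "greeting.txt" and
the identity "Maya" are made up for illustration, and the sorting of
tree entries is simplified.

```python
import hashlib


def object_id(kind: str, body: bytes) -> str:
    """Hash "<kind> <size>\\0<body>", which is how loose objects are named (SHA-1 assumed)."""
    header = f"{kind} {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()


def tree_body(entries):
    """Serialize (mode, name, hex_id) entries as "<mode> <name>\\0<raw id>"."""
    out = b""
    # Simplification: plain byte-wise sort by name.  Git additionally
    # sorts a directory as if its name ended with "/".
    for mode, name, hex_id in sorted(entries, key=lambda e: e[1].encode()):
        out += f"{mode} {name}".encode() + b"\0" + bytes.fromhex(hex_id)
    return out


# 1. A blob records the exact content, and nothing else (no name).
blob_id = object_id("blob", b"hello\n")

# 2. A tree records paths together with the IDs of what is stored there.
tree_id = object_id("tree", tree_body([("100644", "greeting.txt", blob_id)]))

# 3. A commit records a tree, hence a snapshot of the whole set of contents.
commit_body = (
    f"tree {tree_id}\n"
    "author Maya <maya@example.com> 1759173408 -0400\n"
    "committer Maya <maya@example.com> 1759173408 -0400\n"
    "\n"
    "Initial commit\n"
).encode()
commit_id = object_id("commit", commit_body)

print(blob_id)
print(tree_id)
print(commit_id)
```

Each level only mentions the ID of the level below it, which is why
the explanation may read more naturally in this order.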

> +2. Its *parent commit ID(s)*. The first commit in a repository has 0 parents,
> +  regular commits have 1 parent, merge commits have 2 or more parents
> +3. An *author* and the time the commit was authored
> +4. A *committer* and the time the commit was committed.
> +   If you cherry-pick (linkgit:git-cherry-pick[1]) someone else's commit,
> +   then they will be the author and you'll be the committer.

It felt a bit odd to single out cherry-pick here.

I think the important thing to become aware of for the readers at
this point is that the author and committer can be different people,
and it does not matter how one commits somebody else's patch at the
mechanical level.

Perhaps replace "If you cherry-pick..." with something like "note: a
change authored by a person at some point in time can be committed
by another person at a different time, and these fields are to
record both persons' contributions separately", perhaps, if we
really want to say more.

> +Git does not store the diff for a commit: when you ask Git for a
> +diff it calculates it on the fly.

I think this is an attempt to demystify the "are we really storing a
snapshot for each commit?" thing, but then "when you ask Git to show
the commit, it calculates the diff from its parent on the fly" might
achieve that better, perhaps?
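
As a toy model of "calculated on the fly" (this is not how Git
implements its diff machinery), the Python sketch below keeps only two
full snapshots and derives the diff when asked.  The snapshot
dictionaries and the helper name diff_snapshots are hypothetical.

```python
import difflib

# Two full snapshots: path -> content.  No diff is stored anywhere.
parent_snapshot = {"README": "Hello\n"}
commit_snapshot = {"README": "Hello\nWorld\n", "LICENSE": "MIT\n"}


def diff_snapshots(old, new):
    """Compute a unified diff between two snapshots at the time of asking."""
    lines = []
    for path in sorted(set(old) | set(new)):
        before = old.get(path, "").splitlines(keepends=True)
        after = new.get(path, "").splitlines(keepends=True)
        if before != after:
            lines.extend(
                difflib.unified_diff(before, after, f"a/{path}", f"b/{path}")
            )
    return "".join(lines)


print(diff_snapshots(parent_snapshot, commit_snapshot), end="")
```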

> +[[tree]]
> +trees::
> +    A tree is how Git represents a directory. It lists, for each item in
> +    the tree:
> ++
> +[[file-mode]]
> +1. The *file mode*, for example `100644`. The format is inspired by Unix
> +   permissions, but Git's modes are much more limited. Git only supports these file modes:
> ++
> +  - `100644`: regular file (with type `blob`)
> +  - `100755`: executable file (with type `blob`)
> +  - `120000`: symbolic link (with type `blob`)
> +  - `040000`: directory (with type `tree`)
> +  - `160000`: gitlink, for use with submodules (with type `commit`)

It is not really "supporting" file modes.  Rather, Git only records
5 kinds of entities associated with each path in a tree object, and
uses numbers that remotely resemble POSIX file modes to represent
these 5 kinds.

Perhaps "supports" -> "uses"?

> +2. The *type*: either <<blob,`blob`>> (a file), `tree` (a directory),
> +  or <<commit,`commit`>> (a Git submodule, which is a
> +  commit from a different Git repository)
> +3. The <<object-id,*object ID*>>
> +4. The *filename*

Here it may be worth noting that this "filename" is a single
pathname component (roughly, what you would see in non-recursive
"ls").  In other words, it may be a directory name.

I wonder if we need to say "<blob> (a file, or a symbolic link)"?
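
For reference, a rough Python sketch of what a raw tree object body
looks like when parsed, assuming SHA-1 (20-byte) object IDs.  The
function name parse_tree and the sample entries are hypothetical.  It
shows that each entry carries only a mode, a single path component and
an object ID, and that the "type" can be derived from the mode.

```python
# The five kinds of entries, keyed by the mode string as stored in the
# raw tree.  Directories are stored as "40000"; tools that pretty-print
# a tree pad this to "040000".
KINDS = {
    "100644": ("blob", "regular file"),
    "100755": ("blob", "executable file"),
    "120000": ("blob", "symbolic link"),
    "40000": ("tree", "directory"),
    "160000": ("commit", "gitlink (submodule)"),
}


def parse_tree(body: bytes, id_len: int = 20):
    """Split a raw tree body into (mode, type, object ID, name, kind) tuples."""
    entries = []
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)
        nul = body.index(b"\0", space)
        mode = body[pos:space].decode()
        # A single pathname component; it may name a directory.
        name = body[space + 1:nul].decode(errors="surrogateescape")
        oid = body[nul + 1:nul + 1 + id_len].hex()
        obj_type, kind = KINDS[mode]
        entries.append((mode, obj_type, oid, name, kind))
        pos = nul + 1 + id_len
    return entries


# Sample data: the well-known IDs of the empty blob and the empty tree.
EMPTY_BLOB = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"
EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"
raw = (
    b"100644 README\0" + bytes.fromhex(EMPTY_BLOB)
    + b"40000 src\0" + bytes.fromhex(EMPTY_TREE)
)
for entry in parse_tree(raw):
    print(*entry)
```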

> +[[blob]]
> +blobs::
> +    A blob is how Git represents a file. A blob object contains the
> +    file's contents.

"represents a file" hints as if the thing may know its name, but
that is not the case (its name is given only by the surrounding tree).

"A blob is how Git represents an uninterpreted series of bytes, and is
most commonly used to store a file's contents." or something, perhaps?

> +When you make a new commit, Git only needs to store new versions of
> +files which were changed in that commit. This means that commits
> +can use relatively little disk space even in a very large repository.

That invites the "aren't we storing a delta after all, then?"
confusion.

"Git only needs to store new versions of the files and directories
that the commit modified.  Files and directories that were not
modified by the commit are shared with its parent commit".
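
A toy Python model of that sharing, assuming SHA-1 blob IDs; the
in-memory "store" dictionary and the snapshot() helper are hypothetical
stand-ins for the object database, and trees are left out to keep it
short.

```python
import hashlib

store = {}  # object ID -> content; stands in for the object database


def blob_id(content: bytes) -> str:
    """Name a blob by hashing "blob <size>\\0<content>" (SHA-1 assumed)."""
    return hashlib.sha1(b"blob %d\0" % len(content) + content).hexdigest()


def snapshot(files):
    """Store every file's content and return {path: blob ID}."""
    listing = {}
    for path, content in files.items():
        oid = blob_id(content)
        store.setdefault(oid, content)  # already-known content costs nothing
        listing[path] = oid
    return listing


first = snapshot({
    "README": b"Hello\n",
    "main.c": b"int main(void) { return 0; }\n",
})
objects_before = len(store)

# The second snapshot is still a full snapshot, but only README changed.
second = snapshot({
    "README": b"Hello, world\n",
    "main.c": b"int main(void) { return 0; }\n",
})
new_objects = len(store) - objects_before

print(new_objects)
print(first["main.c"] == second["main.c"])
```

Both snapshots list every path, yet the unchanged file is stored once
and referred to by the same ID from both.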

> +NOTE: All of the examples in this section were generated with
> +`git cat-file -p <object-id>`, which shows the contents of a Git object.

Was it necessary to say this?  Blobs, Commits, and Tags are
textual, so "-p" does very little, but Trees are binary
garbage, so "-p" output is a heavily massaged version of the contents.

> +[[branch]]
> +branches: `refs/heads/<name>`::
> +    A branch is a name for a commit ID.

Well, a commit ID is an alternative way to refer to a commit object
*name*, so it is a bit strange to say "a name for a commit ID".

Perhaps "A branch ref stores a commit ID." is better?

> +[[tag]]
> +tags: `refs/tags/<name>`::
> +    A tag is a name for a commit ID, tag object ID, or other object ID.

Likewise.  "A tag ref can store any kind of object ID, but most
commonly it is the ID of a commit object or a tag object"

> +    Tags that reference a tag object ID are called "annotated tags",
> +    because the tag object contains a tag message.
> +    Tags that reference a commit, blob, or tree ID are
> +    called "lightweight tags".
> ++
> +Even though branches and tags are both "a name for a commit ID", Git
> +treats them very differently.
> +Branches are expected to change over time: when you make a commit, Git
> +will update your <<HEAD,current branch>> to reference the new changes.

This sentence talks about branch moving because it advances with
more commits.  Did we want to say "HEAD" here before we explain what
it is?  "HEAD" can move for another reason (i.e. branch switching)
and using "HEAD" in the context of talking about growing history
might invite confusion.  I dunno.

> +Tags are usually not changed after they're created.

> +[[HEAD]]
> +HEAD: `HEAD`::
> +    `HEAD` is where Git stores your current <<branch,branch>>.

Hmm...

> +    `HEAD` can either be:
> +    1. A symbolic reference to your current branch, for example `ref:
> +       refs/heads/main` if your current branch is `main`.
> +    2. A direct reference to a commit ID. This is called "detached HEAD
> +	   state", see the DETACHED HEAD section of linkgit:git-checkout[1] for more.

These two are very reasonable.  But "your current <<branch>>" refers
only to #1.

    `HEAD` refers to the commit your current work is based on, and
    it is the commit that will become the first parent of the commit
    you create once your current work is concluded.  It can either be ...

perhaps.
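
The two cases can be told apart mechanically.  A minimal Python
sketch, assuming the files backend where the content of the HEAD file
is readable as text; the function name describe_head is hypothetical.

```python
import re


def describe_head(content: str):
    """Classify HEAD content as a symbolic ref or a direct commit ID."""
    content = content.strip()
    if content.startswith("ref: "):
        # Case 1: symbolic reference to the current branch.
        return ("symbolic", content[len("ref: "):].strip())
    if re.fullmatch(r"[0-9a-f]{40}|[0-9a-f]{64}", content):
        # Case 2: detached HEAD, a full SHA-1 or SHA-256 object ID.
        return ("detached", content)
    raise ValueError(f"unrecognized HEAD content: {content!r}")


print(describe_head("ref: refs/heads/main\n"))
print(describe_head("4ccb6d7b8869a86aae2e84c56523f8705b50c647\n"))
```

In both cases HEAD ends up naming a commit; only in the first case
does it also name a branch.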

> +[[remote-tracking-branch]]
> +remote tracking branches: `refs/remotes/<remote>/<branch>`::

Please always write "remote-tracking" with a hyphen (see glossary).

> +    A remote-tracking branch is a name for a commit ID.

Either "A remote-tracking branch stores a commit object name" or "A
remote-tracking branch points at a commit object", followed by "in
order to keep track of the last-known state of ..." in a single
sentence.

> +[[index]]
> +THE INDEX
> +---------
> +
> +The index, also known as the "staging area", contains a list of every
> +file in the repository and its contents. When you commit, the files in
> +the index are used as the files in the next commit.

It is hard to define what "every file in the repository" really is.
Files that you removed last week do not count.  Files added in your
wip branch elsewhere are obviously not yet in the index when you are
working on your primary branch.

> +You can add files to the index or update the version in the index with
> +linkgit:git-add[1]. Adding a file to the index or updating its version
> +is called "staging" the file for commit.

It may be worth clarifying, by saying "staging the contents of the
file" (you can edit the file further after you "git add"), that you
are taking a snapshot at the time you ran "git add", instead of
giving Git a general instruction to "keep an eye on this file"
(if it were, then the next "git commit" would behave more like "git
add -u && git commit").
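
A toy Python model of the distinction (not the real index format):
the hypothetical add() below records the ID of the content at the time
it is called, so a later edit to the working tree file leaves the
staged entry untouched.  SHA-1 blob IDs are assumed.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Name a blob by hashing "blob <size>\\0<content>" (SHA-1 assumed)."""
    return hashlib.sha1(b"blob %d\0" % len(content) + content).hexdigest()


working_tree = {"notes.txt": b"first draft\n"}
index = {}  # path -> blob ID of the staged contents


def add(path: str) -> None:
    """Stage the contents the file has right now, not the file itself."""
    index[path] = blob_id(working_tree[path])


add("notes.txt")
staged = index["notes.txt"]

# Editing the file after staging does not change what is staged.
working_tree["notes.txt"] = b"second draft\n"
still_first_draft = index["notes.txt"] == staged == blob_id(b"first draft\n")
print(still_first_draft)
```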

> +[[reflogs]]
> +REFLOGS
> +-------
> +
> +Git stores a history called a "reflog" for every branch, remote-tracking
> +branch, and HEAD. This means that if you make a mistake and "lose" a
> +commit, you can generally recover the commit ID by running
> +`git reflog <reference>`.
> +
> +Each reflog entry has:
> +
> +1. Before/after *commit IDs*
> +2. *User* who made the change, for example `Maya <maya@example.com>`
> +3. *Timestamp* when the change was made
> +4. *Log message*, for example `pull: Fast-forward`
> +
> +Reflogs only log changes made in your local repository.
> +They are not shared with remotes.

Technically it is correct that before/after are recorded, but there
is no way for the end-user to interact with them.  "git reflog"
walking these entries will only give you a single commit object.
The username is also recorded, but I cannot think of a way to view
that information, let alone use it for querying.

Especially when the reftable backend is in use, you cannot even read
the raw representation like you can with the files backend (where
something like "cat .git/logs/HEAD" would let you peek into the
details).  I am not sure if we want to go into this detail.

Perhaps drop everything after "Each reflog entry has:"?
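
For the curious, a rough Python sketch of picking apart one raw line
as stored by the files backend.  The function name parse_reflog_line
is hypothetical, and the layout assumed here (old ID, new ID,
identity, timestamp, timezone, then a tab and the message) applies to
files such as .git/logs/HEAD only, not to reftable.

```python
import re

REFLOG_LINE = re.compile(
    r"(?P<old>[0-9a-f]+) (?P<new>[0-9a-f]+) "
    r"(?P<name>.*) <(?P<email>[^>]*)> "
    r"(?P<timestamp>\d+) (?P<tz>[+-]\d{4})"
    r"(?:\t(?P<message>.*))?"
)


def parse_reflog_line(line: str) -> dict:
    """Split one raw reflog line (files backend) into its fields."""
    match = REFLOG_LINE.fullmatch(line.rstrip("\n"))
    if match is None:
        raise ValueError(f"not a reflog line: {line!r}")
    fields = match.groupdict()
    fields["timestamp"] = int(fields["timestamp"])
    fields["message"] = fields["message"] or ""
    return fields


line = (
    "0000000000000000000000000000000000000000 "
    "4ccb6d7b8869a86aae2e84c56523f8705b50c647 "
    "Maya <maya@example.com> 1759173408 -0400"
    "\tcommit (initial): Initial commit\n"
)
entry = parse_reflog_line(line)
print(entry["new"], entry["message"])
```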

> +For example, here's how the reflog for `HEAD` in a repository with 2
> +commits is stored:
> +
> +----
> +0000000000000000000000000000000000000000 4ccb6d7b8869a86aae2e84c56523f8705b50c647 Maya <maya@example.com> 1759173408 -0400      commit (initial): Initial commit
> +4ccb6d7b8869a86aae2e84c56523f8705b50c647 750b4ead9c87ceb3ddb7a390e6c7074521797fb3 Maya <maya@example.com> 1759173425 -0400      commit: Add README
> +----
> +
> +GIT
> +---
> +Part of the linkgit:git[1] suite
> diff --git a/Documentation/glossary-content.adoc b/Documentation/glossary-content.adoc
> index e423e4765b..20ba121314 100644
> --- a/Documentation/glossary-content.adoc
> +++ b/Documentation/glossary-content.adoc
> @@ -297,8 +297,8 @@ This commit is referred to as a "merge commit", or sometimes just a
>  	identified by its <<def_object_name,object name>>. The objects usually
>  	live in `$GIT_DIR/objects/`.
>  
> -[[def_object_identifier]]object identifier (oid)::
> -	Synonym for <<def_object_name,object name>>.
> +[[def_object_identifier]]object identifier, object ID, oid::
> +	Synonyms for <<def_object_name,object name>>.
>  
>  [[def_object_name]]object name::
>  	The unique identifier of an <<def_object,object>>.  The
> diff --git a/Documentation/meson.build b/Documentation/meson.build
> index e34965c5b0..ace0573e82 100644
> --- a/Documentation/meson.build
> +++ b/Documentation/meson.build
> @@ -192,6 +192,7 @@ manpages = {
>    'gitcore-tutorial.adoc' : 7,
>    'gitcredentials.adoc' : 7,
>    'gitcvs-migration.adoc' : 7,
> +  'gitdatamodel.adoc' : 7,
>    'gitdiffcore.adoc' : 7,
>    'giteveryday.adoc' : 7,
>    'gitfaq.adoc' : 7,
>
> base-commit: bb69721404348ea2db0a081c41ab6ebfe75bdec8

