From: Junio C Hamano <gitster@pobox.com>
To: "Julia Evans via GitGitGadget" <gitgitgadget@gmail.com>
Cc: git@vger.kernel.org,
	 Kristoffer Haugsbakk <kristofferhaugsbakk@fastmail.com>,
	 "D. Ben Knoble" <ben.knoble@gmail.com>,
	 Patrick Steinhardt <ps@pks.im>,  Julia Evans <julia@jvns.ca>
Subject: Re: [PATCH v4] doc: add an explanation of Git's data model
Date: Mon, 27 Oct 2025 14:54:15 -0700
Message-ID: <xmqqikg0f1tk.fsf@gitster.g>
In-Reply-To: <pull.1981.v4.git.1761593537924.gitgitgadget@gmail.com> (Julia Evans via GitGitGadget's message of "Mon, 27 Oct 2025 19:32:17 +0000")

"Julia Evans via GitGitGadget" <gitgitgadget@gmail.com> writes:

> diff --git a/Documentation/gitdatamodel.adoc b/Documentation/gitdatamodel.adoc
> new file mode 100644
> index 0000000000..e36e833f66
> --- /dev/null
> +++ b/Documentation/gitdatamodel.adoc
> @@ -0,0 +1,286 @@
> +gitdatamodel(7)
> +===============
> +
> +NAME
> +----
> +gitdatamodel - Git's core data model
> +
> +SYNOPSIS
> +--------
> +gitdatamodel
> +
> +DESCRIPTION
> +-----------
> +
> +It's not necessary to understand Git's data model to use Git, but it's
> +very helpful when reading Git's documentation so that you know what it
> +means when the documentation says "object", "reference" or "index".

"While it is not necessary ..., it is helpful ..." may flow better
than "It is not necessary ..., but it is very helpful".

> +This means that if you have an object's ID, you can always recover its
> +exact contents as long as the object hasn't been deleted.

Somewhere in a distant footnote, we may want to mention that objects
that are in use are never deleted, and explain when unused objects
do get removed (i.e., by garbage collection).  As part of the data
model, "everything is retained by default, until we can prove it is
no longer reachable" probably belongs somewhere.
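
The "retained until provably unreachable" rule can be pictured as a
graph walk.  The following is only a sketch under stated assumptions:
the object names, the `edges` mapping and the `reachable` helper are
hypothetical, and real garbage collection in Git takes more into
account (for example reflogs and the index) than this toy model does.

```python
from collections import deque

def reachable(roots, edges):
    """Return the set of object names reachable from any of the roots."""
    seen = set()
    queue = deque(roots)
    while queue:
        oid = queue.popleft()
        if oid in seen:
            continue
        seen.add(oid)
        # Follow every object this one refers to (parents, trees, blobs).
        queue.extend(edges.get(oid, ()))
    return seen

# Hypothetical object graph: each object maps to the objects it references.
edges = {
    "commit2": ["commit1", "tree2"],
    "commit1": ["tree1"],
    "tree2": ["blob-a", "blob-b"],
    "tree1": ["blob-a"],
    "orphan-commit": ["tree1"],  # e.g. a commit left behind by an amend
}
refs = {"refs/heads/main": "commit2"}

all_objects = set(edges) | {o for targets in edges.values() for o in targets}
keep = reachable(refs.values(), edges)
prunable = all_objects - keep  # nothing refers to these any more
print(sorted(prunable))
```

Note that `tree1` and `blob-a` stay in `keep` even though the orphaned
commit also points at them, because they are still reachable from the
branch.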

> +Here's how each type of object is structured:
> +
> +[[commit]]
> +commit::
> +    A commit contains the full directory structure of every file
> +    in that version of the repository and each file's contents.

What you are describing here is more of the property of a tree; a
commit is a bit richer.

    A commit records a snapshot of every file in the project at
    one point in time, records who contributed to create such a
    snapshot and why, and how that particular snapshot relates to
    other snapshots in the history.

> +    It has these these required fields

"these these".

> +Like all other objects, commits can never be changed after they're created.
> +For example, "amending" a commit with `git commit --amend` creates a new
> +commit with the same parent.

"same parent." -> "same parent, without modifying the original
commit object at all"?  Maybe redundant?  I dunno.

> +[[tree]]
> +tree::
> +    A tree is how Git represents a directory.

"a directory" -> "contents in a directory"?  I dunno.

> +    It can contain files or other trees (which are subdirectories).
> +    It lists, for each item in the tree:
> ++
> +1. The *filename*, for example `hello.py`
> +2. The *type*: either <<blob,`blob`>> (a file), `tree` (a directory),
> +  or <<commit,`commit`>> (a Git submodule, which is a
> +  commit from a different Git repository)

This is a bit of a white lie.  A tree object entry never stores the
type of the object.  It records <mode, object name, path component>.

The second field you see in git ls-tree output is computed from the
object name (when the object is available) or inferred from the mode
bits.
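
To make the <mode, object name, path component> layout concrete, here
is a minimal sketch that parses the raw body of a tree object and then
infers a type from the mode bits alone.  It assumes SHA-1 (20-byte)
object names; `parse_tree`, `type_from_mode`, `entry` and the sample
entries are hypothetical names and data made up for illustration, not
Git's own code.

```python
import hashlib

def parse_tree(body, oid_len=20):
    """Yield (mode, object name in hex, path component) from a raw tree body.

    Each entry is laid out as: ASCII octal mode, a space, the path
    component, a NUL byte, then the raw (binary) object name.
    """
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)
        nul = body.index(b"\0", space)
        mode = body[pos:space].decode("ascii")
        # Paths are really bytes; decoding as UTF-8 is a simplification.
        name = body[space + 1:nul].decode("utf-8")
        oid = body[nul + 1:nul + 1 + oid_len].hex()
        yield mode, oid, name
        pos = nul + 1 + oid_len

def type_from_mode(mode):
    """Infer the object type from the mode, with no type field stored."""
    if mode == "40000":
        return "tree"
    if mode == "160000":
        return "commit"  # a submodule
    return "blob"        # regular files and symbolic links

def entry(mode, name, oid_bytes):
    """Build one raw tree entry (helper for the made-up sample below)."""
    return mode.encode("ascii") + b" " + name.encode("utf-8") + b"\0" + oid_bytes

# Made-up object names, only so that the sample has the right shape.
body = (
    entry("100644", "hello.py", hashlib.sha1(b"a").digest())
    + entry("40000", "src", hashlib.sha1(b"b").digest())
    + entry("160000", "vendor", hashlib.sha1(b"c").digest())
)

rows = [(mode, type_from_mode(mode), oid, name) for mode, oid, name in parse_tree(body)]
for mode, kind, oid, name in rows:
    print(f"{int(mode, 8):06o} {kind} {oid}\t{name}")
```

The printed second column is derived entirely from the first; nothing
in `body` spells out "blob", "tree" or "commit".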

> +3. The *file mode*. Git has these file modes. which are only
> +   spiritually related to Unix permissions:

In the cover letter part of the message I am responding to, I saw
repeated mentions that "permissions" should be "file mode"; let's be
consistent.

"Git has these file modes, which are ..." -> 

    Git uses the following file modes to represent what each tree
    entry is (because an object of the same type, e.g. "blob", is
    used to represent more than one kind of thing).  The file modes
    are assigned to resemble Unix file modes.

    Note that Git does not _store_ permissions, and there are only
    two kinds of regular files: non-executable (100644) or
    executable (100755).  To Git, there are no files that are
    "readable only by the owner" etc., so file mode bits like
    100600, 100400, etc., are never used.
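
The point that only a handful of modes ever appear can be sketched as
a normalization step.  This is an illustration under assumptions:
`git_mode` is a hypothetical helper, and treating the owner-executable
bit as the only permission bit that matters is a simplification of
what Git does, not a copy of its code.

```python
import stat

def git_mode(st_mode):
    """Map a Unix st_mode onto one of the few modes a tree entry uses."""
    if stat.S_ISLNK(st_mode):
        return 0o120000
    if stat.S_ISDIR(st_mode):
        return 0o040000
    # Regular file: all permission bits are dropped except "executable".
    return 0o100755 if st_mode & stat.S_IXUSR else 0o100644

# Owner-only permissions do not survive the mapping.
for unix_mode in (0o100600, 0o100400, 0o100700, 0o100664):
    print(f"{unix_mode:06o} -> {git_mode(unix_mode):06o}")
```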

> +[[tag-object]]
> +tag object::
> +    Tag objects contain these required fields
> +    (though there are other optional fields):
> ++
> +1. The *ID* and *type* of the object (often a commit) that they reference

Not wrong per-se, but it is a bit curious to lump these two into a
single enumerated item here, unlike "author" and "committer", which
were enumerated separately for commit objects.  If you are going to
show "cat-file -p" output for illustration, it may help readers
understand them if you listed them separately here.

> +2. The *tagger* and tag date
> +3. A *tag message*, similar to a commit message
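
For reference, the payload of a tag object puts the ID and the type on
separate header lines, which is why listing them separately lines up
with what "cat-file -p" prints.  A minimal sketch, assuming a
well-formed tag without a signature; `parse_tag` is a hypothetical
helper and every value in `sample` is made up.

```python
def parse_tag(text):
    """Split a tag object's payload into header fields and message."""
    header, _, message = text.partition("\n\n")
    fields = dict(line.split(" ", 1) for line in header.splitlines())
    return fields, message

# Made-up tag payload; the object name and tagger are placeholders.
sample = (
    "object " + "1" * 40 + "\n"
    "type commit\n"
    "tag v1.0\n"
    "tagger A U Thor <author@example.com> 1761600000 +0000\n"
    "\n"
    "Release 1.0\n"
)

fields, message = parse_tag(sample)
print(list(fields))
```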

> +[[index]]
> +THE INDEX
> +---------
> +The index, also known as the "staging area", is a list of files and
> +the contents of each file, stored as a <<blob,blob>>.
> +You can add files to the index or update the contents of a file in the
> +index with linkgit:git-add[1]. This is called "staging" the file for commit.
> +
> +Unlike a <<tree,tree>>, the index is a flat list of files.

This is a bit of a white lie, as modern versions of Git can collapse
uninteresting parts of the directory structure into a single tree in
an index entry (this is called "sparse index"), and can expand such
a collapsed "tree" in the index on demand into its constituent files
and directories.  But I do not mind presenting the traditional world
model for conceptual simplicity.
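
In that traditional model, the relationship between the flat list and
nested trees can be pictured with a small sketch.  It only groups
paths into nested dictionaries and does not create real tree objects;
`build_tree` is a hypothetical helper, and the two entries are the
ones shown in the patch's `git ls-files --stage` example.

```python
def build_tree(flat_entries):
    """Turn {path: (mode, object name)} into nested dicts, one per directory."""
    root = {}
    for path, leaf in sorted(flat_entries.items()):
        *dirs, name = path.split("/")
        node = root
        for d in dirs:
            # Create the directory level on first use, then descend.
            node = node.setdefault(d, {})
        node[name] = leaf
    return root

# The flat list: full paths as keys, with no directory entries of their own.
index = {
    "README.md": ("100644", "8728a858d9d21a8c78488c8b4e70e531b659141f"),
    "src/hello.py": ("100644", "665c637a360874ce43bf74018768a96d2d4d219a"),
}

tree = build_tree(index)
print(sorted(tree))         # top-level names
print(sorted(tree["src"]))  # names inside src/
```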

> +When you commit, Git converts the list of files in the index to a
> +directory <<tree,tree>> and uses that tree in the new <<commit,commit>>.
> +
> +Each index entry has 4 fields:
> +
> +1. The *<<tree,file mode>>*
> +2. The *<<blob,blob>> ID* of the file

If you were to collapse descriptions like you did for tag objects
where ID and TYPE were treated as a unit, here is the place to do
so.  With the mode bits and object ID, we can represent regular
files that are non-executable, regular files that are executable,  
symbolic links, and submodules (if a sparse-index is in use, an
index entry could be a subdirectory, but I suggested above that we
can ignore them for simplicity).

But <<blob,blob>> is highly misleading.  Even if we ignore
sparse-index, we may see a commit object there.

    Each index entry records

    1. The object that occupies the path, as a (file mode, object
       name) tuple.  Most often, the object is a blob that holds the
       contents of a non-executable file (100644), an executable
       file (100755), or a symbolic link (120000), but it can be a
       commit in another repository if the entry represents a
       submodule.

    2. The stage number, which is normally 0, but entries with
       higher stages for the same path are used during a conflicted
       merge.

    3. The path name for the index entry.

> +3. The *file path*, for example `src/hello.py`
> +4. The *stage number*, either 0, 1, 2, or 3. This is normally 0, but if
> +   there's a merge conflict there can be multiple versions of the same
> +   filename in the index.

If you are going by "ls-files -s" output, it may be better to swap 3
and 4 above for ease of understanding.
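
Going by "ls-files -s" output, the order is mode, object name, stage,
path.  Here is a minimal sketch that parses lines of that shape and
picks out conflicted paths.  `IndexEntry` and `parse_ls_files_stage`
are hypothetical names, and the three conflicted entries in `sample`
use made-up object names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IndexEntry:
    mode: str
    oid: str
    stage: int
    path: str

def parse_ls_files_stage(text):
    """Parse '<mode> <object> <stage><whitespace><path>' lines."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        mode, oid, rest = line.split(" ", 2)
        # Split once more so that spaces inside the path are preserved.
        stage, path = rest.split(None, 1)
        entries.append(IndexEntry(mode, oid, int(stage), path))
    return entries

sample = (
    "100644 8728a858d9d21a8c78488c8b4e70e531b659141f 0\tREADME.md\n"
    # Made-up object names for a path in a conflicted merge:
    "100644 " + "1" * 40 + " 1\tsrc/hello.py\n"
    "100644 " + "2" * 40 + " 2\tsrc/hello.py\n"
    "100644 " + "3" * 40 + " 3\tsrc/hello.py\n"
)

entries = parse_ls_files_stage(sample)
conflicted = sorted({e.path for e in entries if e.stage != 0})
print(conflicted)
```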

> +It's extremely uncommon to look at the index directly: normally you'd
> +run `git status` to see a list of changes between the index and <<HEAD,HEAD>>.
> +But you can use `git ls-files --stage` to see the index.
> +Here's the output of `git ls-files --stage` in a repository with 2 files:
> +
> +----
> +100644 8728a858d9d21a8c78488c8b4e70e531b659141f 0 README.md
> +100644 665c637a360874ce43bf74018768a96d2d4d219a 0 src/hello.py
> +----
> +
> +[[reflogs]]
> +REFLOGS
> +-------
> +
> +Every time a branch, remote-tracking branch, or HEAD is updated, Git
> +updates a log called a "reflog" for that <<references,reference>>.

If we want to avoid using the word X while explaining X, then we can
rephrase it as "Git updates a record in the reflog for that
reference".
